John Thickstun | publications

2025

ISMIR

Aligning Text-to-Music Evaluation with Human Preferences

Huang, Yichen, Novack, Zachary, Saito, Koichi, Shi, Jiatong, Watanabe, Shinji, Mitsufuji, Yuki, Thickstun, John, and Donahue, Chris

In International Society for Music Information Retrieval 2025

Abs arXiv Code

Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
arXiv

Esoteric Language Models

Sahoo, Subham Sekhar, Yang, Zhihan, Akhauri, Yash, Liu, Johnna, Singh, Deepansha, Cheng, Zhoujun, Liu, Zhengzhong, Xing, Eric, Thickstun, John, and Vahdat, Arash

2025

Abs arXiv Code Website

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches.

2024

ISMIR-LBD

Hookpad Aria: A Copilot for Songwriters

Donahue, Chris, Wu, Shih-Lun, Kim, Yewon, Carlton, Dave, Miyakawa, Ryan, and Thickstun, John

In International Society for Music Information Retrieval Late Breaking Demos 2024

Abs arXiv Website

arXiv

Constrained Diffusion Implicit Models

Jayaram, Vivek, Kemelmacher-Shlizerman, Ira, Seitz, Steven M, and Thickstun, John

2024

Abs arXiv Code Website

This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose constrained diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconstrained DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.
GenAICHI

Designing Live Human-AI Collaboration for Musical Improvisation

Becker, Nic, Louie, Ryan, Thickstun, John, and Liang, Percy

In CHI Workshop on Generative AI and HCI 2024

Abs PDF

Recent progress in generative models for music has highlighted the need for interactive, controllable systems that serve the goals of musicians. In this paper, we introduce a system that integrates a compositional assistant and live improvisational partner directly into the modern music producer’s toolkit: the digital audio workstation. Our system design is guided by 1) integration with modern music production software, 2) non-linear songwriting workflows, and 3) enabling generative AI to support live improvisation tasks. We find that anticipatory transformer models are well suited for these goals, and present a method for adapting an anticipatory model for live improvisation. We call on future work to further explore human-AI co-performance by designing systems to be accessible and integrated into the workflows of domain experts.
TMLR

Robust distortion-free watermarks for language models

Kuditipudi, Rohith, Thickstun, John, Hashimoto, Tatsunori, and Liang, Percy

Transactions on Machine Learning Research 2024

Abs arXiv Code Website Talk

We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers – which we compute using a randomized watermark key – to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models – OPT-1.3B, LLaMA-7B and Alpaca-7B – to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text (p≤0.01) from 35 tokens even after corrupting between 40-50% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around 25% of the responses – whose median length is around 100 tokens – are detectable with p≤0.01, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
TMLR

Anticipatory music transformer

Thickstun, John, Hall, David, Donahue, Chris, and Liang, Percy

Transactions on Machine Learning Research 2024

Abs arXiv Code Website Media Talk

We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by problems arising in the control of symbolic music generation. We focus on infilling control tasks, whereby the controls are a subset of the events themselves, and conditional generation completes a sequence of events given the fixed control events. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted music generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that an anticipatory model produces accompaniments with similar musicality to even music composed by humans over a 20-second clip.

2023

JMLR

MAUVE Scores for Generative Models: Theory and Practice

Pillutla, Krishna, Liu, Lang, Thickstun, John, Welleck, Sean, Swayamdipta, Swabha, Zellers, Rowan, Oh, Sewoong, Choi, Yejin, and Harchaoui, Zaid

Journal of Machine Learning Research 2023

Abs arXiv Code

Generative AI has matured to a point where large-scale models can generate text that seems indistinguishable from human-written text and remarkably photorealistic images. Automatically measuring how close the distribution of generated data is to the target real data distribution is a key step in diagnosing existing models and developing better models. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore four approaches to statistically estimate these scores: vector quantization, non-parametric estimation, classifier-based estimation, and parametric Gaussian approximations. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of f-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We conclude the paper by demonstrating its applications to other AI domains and discussing practical recommendations.
ACL Outstanding Paper

Backpack language models

Hewitt, John, Thickstun, John, Manning, Christopher D., and Liang, Percy

In Proceedings of the Association for Computational Linguistics 2023

Abs arXiv Code Website

We present Backpacks: a new neural architecture that marries strong modeling performance with an interface for interpretability and control. Backpacks learn multiple non-contextual sense vectors for each word in a vocabulary, and represent a word in a sequence as a context-dependent, non-negative linear combination of sense vectors in this sequence. We find that, after training, sense vectors specialize, each encoding a different aspect of a word. We can interpret a sense vector by inspecting its (non-contextual, linear) projection onto the output space, and intervene on these interpretable hooks to change the model’s behavior in predictable ways. We train a 170M-parameter Backpack language model on OpenWebText, matching the loss of a GPT-2 small (124M parameter) Transformer. On lexical similarity evaluations, we find that Backpack sense vectors outperform even a 6B-parameter Transformer LM’s word embeddings. Finally, we present simple algorithms that intervene on sense vectors to perform controllable text generation and debiasing. For example, we can edit the sense vocabulary to tend more towards a topic, or localize a source of gender bias to a sense vector and globally suppress that sense.
TMLR

Evaluating Human-Language Model Interaction

Lee, Mina, Srivastava, Megha, Hardy, Amelia, Thickstun, John, Durmus, Esin, Paranjape, Ashwin, Gerard-Ursin, Ines, Li, Xiang Lisa, Ladhak, Faisal, Rong, Frieda, Wang, Rose E., Kwon, Minae, Park, Joon Sung, Cao, Hancheng, Lee, Tony, Bommasani, Rishi, Bernstein, Michael, and Liang, Percy

Transactions on Machine Learning Research 2023

Abs arXiv Code

Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI’s GPT-3 and AI21’s J1-Jumbo), we find that better non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.

2022

ISMIR

Melody transcription via generative pre-training

Donahue, Chris, Thickstun, John, and Liang, Percy

In International Society for Music Information Retrieval 2022

Abs arXiv Code Website

Despite the central role that melody plays in music perception, it remains an open challenge in music information retrieval to reliably detect the notes of the melody present in an arbitrary music recording. A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles; existing strategies work well for some melody instruments or styles but not all. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio, thereby improving performance on melody transcription by 20% relative to conventional spectrogram features. Another obstacle in melody transcription is a lack of training data; we derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music. The combination of generative pre-training and a new dataset for this task results in 77% stronger performance on melody transcription relative to the strongest available baseline. By pairing our new melody transcription approach with solutions for beat detection, key estimation, and chord recognition, we build Sheet Sage, a system capable of transcribing human-readable lead sheets directly from music audio.
bioRxiv

Reconstruction of visual images from mouse retinal ganglion cell spiking activity using convolutional neural networks

Benster, Tyler, Babino, Darwin, Thickstun, John, Hunt, Matthew, Liu, Xiyang, Harchaoui, Zaid, Oh, Sewoong, and Van Gelder, Russell N

2022

Abs PDF Code

All visual information in mammals is encoded in the aggregate pattern of retinal ganglion cell (RGC) firing. How this information is decoded to yield percepts remains incompletely understood. We have trained convolutional neural networks with multielectrode array-recorded murine RGC responses to projected images. The trained model accurately reconstructed novel facial images solely from RGC firing data. In this model, subpopulations of cells with faster firing rates are largely sufficient for accurate reconstruction, and ON- and OFF-cells contribute complementary and overlapping information to image reconstruction. Information content for reconstruction correlates with overall firing rate, and locality of information contributing to reconstruction varies substantially across the image and retina. This model demonstrates that artificial neural networks are capable of learning multicellular sensory neural encoding, and provides a viable model for understanding visual information encoding.
NeurIPS Oral Presentation

Diffusion-LM improves controllable text generation

Li, Xiang Lisa, Thickstun, John, Gulrajani, Ishaan, Liang, Percy, and Hashimoto, Tatsunori B.

In Advances in Neural Information Processing Systems 2022

Abs arXiv Code Slides

Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.

2021

Dissertation

Leveraging generative models for music and signal processing

Thickstun, John

University of Washington 2021

PDF
NeurIPS Outstanding Paper

MAUVE: measuring the gap between neural text and human text using divergence frontiers

Pillutla, Krishna, Swayamdipta, Swabha, Zellers, Rowan, Thickstun, John, Welleck, Sean, Choi, Yejin, and Harchaoui, Zaid

In Advances in Neural Information Processing Systems 2021

Abs arXiv Code Slides

As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution from a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space. Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
ICML

Parallel and flexible sampling from autoregressive models via Langevin dynamics

Jayaram, Vivek, and Thickstun, John

In International Conference on Machine Learning 2021

Abs arXiv Code Poster Website

This paper introduces an alternative approach to sampling from autoregressive models. Autoregressive models are typically sampled sequentially, according to the transition dynamics defined by the model. Instead, we propose a sampling procedure that initializes a sequence with white noise and follows a Markov chain defined by Langevin dynamics on the global log-likelihood of the sequence. This approach parallelizes the sampling process and generalizes to conditional sampling. Using an autoregressive model as a Bayesian prior, we can steer the output of a generative model using a conditional likelihood or constraints. We apply these techniques to autoregressive models in the visual and audio domains, with competitive results for audio source separation, super-resolution, and inpainting.
L4DC

Faster policy learning with continuous-time gradients

Ainsworth, Samuel K., Lowrey, Kendall, Thickstun, John, Harchaoui, Zaid, and Srinivasa, Siddhartha S.

In Learning for Dynamics and Control 2021

Abs arXiv Code

We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous time, we show that it is possible to construct a more efficient and accurate gradient estimator. The standard back-propagation through time estimator (BPTT) computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate continuous-time gradients in the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with more efficient CTPG estimates results in faster and more robust learning in a variety of control tasks and simulators.

2020

arXiv

Rethinking evaluation methodology for audio-to-score alignment

Thickstun, John, Brennan, Jennifer, and Verma, Harsh

arXiv preprint arXiv:2009.14374 2020

Abs arXiv Code

This paper offers a precise, formal definition of an audio-to-score alignment. While the concept of an alignment is intuitively grasped, this precision affords us new insight into the evaluation of audio-to-score alignment algorithms. Motivated by these insights, we introduce new evaluation metrics for audio-to-score alignment. Using an alignment evaluation dataset derived from pairs of KernScores and MAESTRO performances, we study the behavior of our new metrics and the standard metrics on several classical alignment algorithms.
EMNLP

An information bottleneck approach for controlling conciseness in rationale extraction

Paranjape, Bhargavi, Joshi, Mandar, Thickstun, John, Hajishirzi, Hannaneh, and Zettlemoyer, Luke

In Conference on Empirical Methods in Natural Language Processing 2020

Abs arXiv Code Slides

Decisions of complex language understanding models can be rationalized by limiting their inputs to a relevant subsequence of the original text. A rationale should be as concise as possible without significantly degrading task performance, but this balance can be difficult to achieve in practice. In this paper, we show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective. Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale. Using IB, we derive a learning objective that allows direct control of mask sparsity levels through a tunable sparse prior. Experiments on ERASER benchmark tasks demonstrate significant gains over norm-minimization techniques for both task performance and agreement with human rationales. Furthermore, we find that in the semi-supervised setting, a modest amount of gold rationales (25% of training examples) closes the gap with a model that uses the full input.
ICML

Source separation with deep generative priors

Jayaram, Vivek, and Thickstun, John

In International Conference on Machine Learning 2020

Abs arXiv Code Poster

Despite substantial progress in signal source separation, results for richly structured data continue to contain perceptible artifacts. In contrast, recent deep generative models can produce authentic samples in a variety of domains that are indistinguishable from samples of the data distribution. This paper introduces a Bayesian approach to source separation that uses generative models as priors over the components of a mixture of sources, and noise-annealed Langevin dynamics to sample from the posterior distribution of sources given a mixture. This decouples the source separation problem from generative modeling, enabling us to directly use cutting-edge generative models as priors. The method achieves state-of-the-art performance for MNIST digit separation. We introduce new methodology for evaluating separation quality on richer datasets, providing quantitative evaluation of separation results on CIFAR-10. We also provide qualitative results on LSUN.

2019

ISMIR

Convolutional composer classification

Verma, Harsh, and Thickstun, John

In International Society for Music Information Retrieval 2019

Abs arXiv Code Poster

This paper investigates end-to-end learnable models for attributing composers to musical scores. We introduce several pooled, convolutional architectures for this task and draw connections between our approach and classical learning approaches based on global and n-gram features. We evaluate models on a corpus of 2,500 scores from the KernScores collection, authored by a variety of composers spanning the Renaissance era to the early 20th century. This corpus has substantial overlap with the corpora used in several previous, smaller studies; we compare our results on subsets of the corpus to these previous works.
ISMIR

Coupled recurrent models for polyphonic music composition

Thickstun, John, Harchaoui, Zaid, Foster, Dean P, and Kakade, Sham M

In International Society for Music Information Retrieval 2019

Abs arXiv Code Poster

This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music. We propose an efficient new conditional probabilistic factorization of musical scores, viewing a score as a collection of concurrent, coupled sequences: i.e. voices. To model the conditional distributions, we borrow ideas from both convolutional and recurrent neural models; we argue that these ideas are natural for capturing music’s pitch invariances, temporal structure, and polyphony. We train models for single-voice and multi-voice composition on 2,300 scores from the KernScores dataset.

2018

ICASSP Oral Presentation

Invariances and data augmentation for supervised music transcription

Thickstun, John, Harchaoui, Zaid, Foster, Dean P, and Kakade, Sham M

In International Conference on Acoustics, Speech and Signal Processing 2018

Abs arXiv Code

This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation. This class of models shares parameters in the log-frequency domain, which exploits the frequency invariance of music to reduce the number of model parameters and avoid overfitting to the training data. All models in this paper were trained with supervision by labeled data from the MusicNet dataset, augmented by random label-preserving pitch-shift transformations.

2017

MIREX

Frequency domain convolutions for multiple F0 estimation

Thickstun, John, Harchaoui, Zaid, Foster, Dean P, and Kakade, Sham M

2017

Abs PDF

This document describes the THK1 submission in the 2017 MIREX Multi-F0 competition. The model is a convolutional neural network, trained using the MusicNet labels. Its input is a bank of logarithmically-spaced frequency filters. These filters exhibit translation invariance along the log-frequency axis, which is captured in this model by one-dimensional convolutions along the frequency axis. The model fully connects across the time axis to capture temporal dependencies. The training data is augmented by pitch-shifting the original data by up to 5 semitones and applying small (up to 1/10 semitone) continuous pitch jitter to the input.
ICLR

Learning features of music from scratch

Thickstun, John, Harchaoui, Zaid, and Kakade, Sham M

In International Conference on Learning Representations 2017

Abs arXiv Code Poster Website

This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.