Evaluating Human-Language Model Interaction
Lee, Mina,
Srivastava, Megha,
Hardy, Amelia,
Thickstun, John,
Durmus, Esin,
Paranjape, Ashwin,
Gerard-Ursin, Ines,
Li, Xiang Lisa,
Ladhak, Faisal,
Rong, Frieda,
Wang, Rose E.,
Kwon, Minae,
Park, Joon Sung,
Cao, Hancheng,
Lee, Tony,
Bommasani, Rishi,
Bernstein, Michael,
and Liang, Percy
Transactions on Machine Learning Research
2023
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive: a system produces output without human involvement. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (HALIE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks, ranging from goal-oriented to open-ended, to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI’s GPT-3 and AI21’s J1-Jumbo), we find that better non-interactive performance does not always translate to better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.