Evaluating Human-Language Model Interaction
Lee, Mina,
Srivastava, Megha,
Hardy, Amelia,
Thickstun, John,
Durmus, Esin,
Paranjape, Ashwin,
Gerard-Ursin, Ines,
Li, Xiang Lisa,
Ladhak, Faisal,
Rong, Frieda,
Wang, Rose E.,
Kwon, Minae,
Park, Joon Sung,
Cao, Hancheng,
Lee, Tony,
Bommasani, Rishi,
Bernstein, Michael,
and Liang, Percy
2022
Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive: a system produces output without human involvement. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (HALIE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks, ranging from goal-oriented to open-ended, to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI’s GPT-3 and AI21’s J1-Jumbo), we find that better non-interactive performance does not always translate into better human-LM interaction, and that first-person and third-party metrics can diverge, underscoring the importance of examining the nuances of human-LM interaction.
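To make the three dimensions concrete, the following is a minimal, hypothetical Python sketch of how interaction-level data might be recorded and scored; the class names, fields, and metric are illustrative assumptions, not the paper's actual schema or code.

```python
# Hypothetical sketch: logging one human-LM session along the three
# dimensions named in the abstract. Field names are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One exchange in the interactive process (dimension i)."""
    user_input: str
    lm_output: str
    accepted: bool       # did the user keep the LM's suggestion?
    edit_distance: int   # how much the user edited it afterward


@dataclass
class Session:
    """A full interaction, with subjective and third-party judgments."""
    turns: List[Turn] = field(default_factory=list)
    # Dimension (ii): first-person experience, e.g. 1-5 Likert ratings.
    first_person: Dict[str, int] = field(default_factory=dict)
    # Dimension (iii): third-party assessment of the final output.
    third_party: Dict[str, int] = field(default_factory=dict)


def acceptance_rate(session: Session) -> float:
    """Simple process-level metric: fraction of LM suggestions kept."""
    if not session.turns:
        return 0.0
    return sum(t.accepted for t in session.turns) / len(session.turns)


if __name__ == "__main__":
    s = Session(
        turns=[Turn("Draft an opening line.", "Once upon a time...", True, 2)],
        first_person={"enjoyment": 5, "helpfulness": 4},
        third_party={"quality": 3},
    )
    print(acceptance_rate(s))  # 1.0
```

Contrasting a process-level metric like acceptance rate with the first-person and third-party ratings is one way such a framework can surface the divergences the abstract describes.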