The AI Teacher Test
May 16, 2022
.. image:: https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg
   :target: https://creativecommons.org/licenses/by-nc-sa/4.0/
   :alt: License: CC BY-NC-SA 4.0

.. image:: https://img.shields.io/badge/version-1.0.0-blue
   :target: https://github.com/anaistack/ai-teacher-test/tree/main
   :alt: package version

This repository contains the code and data for the paper:
:Title: `The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues <https://anaistack.github.io/papers/tack_ai_2022/>`_
:Authors: Anaïs Tack & Chris Piech
:Abstract: How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports on a first attempt at an AI teacher test. We built a solution around the insight that you can run conversational agents in parallel to human teachers in real-world dialogues, simulate how different agents would respond to a student, and compare these counterpart responses in terms of three abilities: speak like a teacher, understand a student, help a student. Our method builds on the reliability of comparative judgments in education and uses a probabilistic model and Bayesian sampling to infer estimates of pedagogical ability. We find that, even though conversational agents (Blender in particular) perform well on conversational uptake, they are quantifiably worse than real teachers on several pedagogical dimensions, especially with regard to helpfulness (Blender: ∆ ability = −0.75; GPT-3: ∆ ability = −0.93).
Dependencies
Code
The code in this repository depends on the `ParlAI <https://parl.ai>`_ framework, the `OpenAI API <https://openai.com/api/>`_, the `Hugging Face <https://huggingface.co>`_ transformers library, and the `Stan <https://mc-stan.org/users/interfaces/pystan.html>`_ library.
.. code:: bash

   pip install -r src/requirements.txt

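After installation, a quick import check can confirm that the four dependencies are available. This is a minimal sketch and not part of the repository: the import names below (``parlai``, ``openai``, ``transformers``, and ``stan`` or ``pystan``) are assumptions about how the packages listed in ``src/requirements.txt`` are exposed, and may need adjusting.

.. code:: python

   from importlib.util import find_spec

   # Assumed import names (not taken from the repository).
   # PyStan 3 is imported as "stan", PyStan 2 as "pystan".
   CANDIDATES = {
       "ParlAI": ["parlai"],
       "OpenAI API": ["openai"],
       "Hugging Face transformers": ["transformers"],
       "Stan": ["stan", "pystan"],
   }

   def missing_dependencies(candidates=CANDIDATES):
       """Return the labels for which none of the candidate modules can be found."""
       return [
           label
           for label, modules in candidates.items()
           if not any(find_spec(name) is not None for name in modules)
       ]

   if __name__ == "__main__":
       missing = missing_dependencies()
       print("Missing:", ", ".join(missing) if missing else "none")
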
Data
The data in this repository depends on student-teacher utterances from two datasets.
For copyright reasons, these texts were removed from the repository and replaced by the tag ``{COPYRIGHTED-TEXT}``.
In order to repopulate the data, you must:

- Download the `Teacher-Student Chatroom Corpus <https://aclanthology.org/2020.nlp4call-1.2.pdf>`_. Put the ``*.tsv`` files into `data/0_datasets/tscc/ <data/0_datasets/tscc>`_.
- Download the `Educational Uptake Dataset <https://github.com/ddemszky/conversational-uptake>`_. Put ``uptake_data.csv`` into `data/0_datasets/uptake/ <data/0_datasets/uptake>`_.
- Run the following commands to repopulate the data with missing utterances and prompts.

.. code:: bash

   python -m src.utils.repopulate -t TSCC -d data/0_datasets/tscc
   python -m src.utils.repopulate -t EduUptake -d data/0_datasets/uptake

.. note::

   Please cite both datasets when using the data in your research. See `data/0_datasets/tscc/ <data/0_datasets/tscc>`_ and `data/0_datasets/uptake/ <data/0_datasets/uptake>`_.

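To verify that repopulation succeeded, one could scan the data files for leftover ``{COPYRIGHTED-TEXT}`` tags. The sketch below is illustrative and not part of the repository: the helper name ``find_placeholders`` is hypothetical, and it assumes the data are stored as UTF-8 text files under a directory such as ``data/``.

.. code:: python

   from pathlib import Path

   PLACEHOLDER = "{COPYRIGHTED-TEXT}"  # tag named in this README

   def find_placeholders(root):
       """Map each file under `root` that still contains the tag to its tag count."""
       leftovers = {}
       for path in sorted(Path(root).rglob("*")):
           if not path.is_file():
               continue
           try:
               text = path.read_text(encoding="utf-8")
           except UnicodeDecodeError:
               continue  # skip files that are not UTF-8 text
           count = text.count(PLACEHOLDER)
           if count:
               leftovers[str(path)] = count
       return leftovers

An empty result from ``find_placeholders("data")`` would indicate that no tags remain in the scanned files.
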
Method
Simulating Agent Responses
1. Download the pre-trained models into ``downloads/models/``.
.. code:: bash

   python -m src.parlai.scripts.download_models downloads/ blender/blender_90M blender/blender_400Mdistill blender/blender_3B blender/blender_9B

2. Run a Blender model on the data. For example:
.. code:: bash

   python -m src.parlai.scripts.run -t TSCC -d data/0_datasets/tscc/ -M downloads/models -m blender/blender_9B -O results/
   python -m src.parlai.scripts.run -t EduUptake -d data/0_datasets/uptake/ -M downloads/models -m blender/blender_9B -O results/

3. Run a GPT-3 model on the data. For example:
.. code:: bash

   python -m src.parlai.scripts.run -m src.parlai.models.gpt3:GPT3Davinci -o src/parlai/opts/gpt3.json -t TSCC -d data/0_datasets/tscc/ -O results/
   python -m src.parlai.scripts.run -m src.parlai.models.gpt3:GPT3Davinci -o src/parlai/opts/gpt3.json -t EduUptake -d data/0_datasets/uptake/ -O results/

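The Blender runs above can be repeated for every downloaded model size and both datasets. The following sketch only builds the command lines from the flags shown in this section; the helper name ``blender_commands`` is hypothetical, and nothing is executed until a list is passed to something like ``subprocess.run``.

.. code:: python

   import sys

   # Model names and data directories as listed in this README.
   MODELS = (
       "blender/blender_90M",
       "blender/blender_400Mdistill",
       "blender/blender_3B",
       "blender/blender_9B",
   )
   TASKS = {
       "TSCC": "data/0_datasets/tscc/",
       "EduUptake": "data/0_datasets/uptake/",
   }

   def blender_commands(models=MODELS, tasks=TASKS,
                        model_dir="downloads/models", out_dir="results/"):
       """Build one argument list per (model, task) pair without running anything."""
       commands = []
       for model in models:
           for task, data_dir in tasks.items():
               commands.append([
                   sys.executable, "-m", "src.parlai.scripts.run",
                   "-t", task, "-d", data_dir,
                   "-M", model_dir, "-m", model, "-O", out_dir,
               ])
       return commands

   if __name__ == "__main__":
       for command in blender_commands():
           print(" ".join(command))

Printing the commands first makes it easy to inspect them before launching the larger models, which can be slow to run.
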
Measuring Pedagogical Ability
- Detect outliers among human raters.

.. code:: bash

   python -m src.stan.bradley_terry data/2_comparisons/items.jsonl --per-rater

- Estimate pedagogical abilities after outlier removal.

.. code:: bash

   python -m src.stan.bradley_terry data/2_comparisons/items.jsonl --outliers data/2_comparisons/outliers.yaml

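The module name ``bradley_terry`` suggests that abilities are inferred with a Bradley-Terry model of pairwise comparisons, in which the probability that one response is preferred over another depends on the difference in ability. As a rough illustration of that idea only (this is not the repository's Stan model, which relies on Bayesian sampling), the sketch below fits abilities by plain gradient ascent on a lightly penalized log-likelihood. The function names, hyperparameters, and toy judgments are all hypothetical.

.. code:: python

   import math

   def win_probability(ability_a, ability_b):
       """Bradley-Terry probability that response A is preferred over response B."""
       return 1.0 / (1.0 + math.exp(ability_b - ability_a))

   def fit_abilities(comparisons, n_steps=2000, lr=1.0, l2=0.01):
       """Estimate abilities from (winner, loser) pairs by gradient ascent.

       The small L2 penalty keeps estimates finite; it is a stand-in for a
       prior, not the prior used in the paper.
       """
       agents = sorted({name for pair in comparisons for name in pair})
       ability = {name: 0.0 for name in agents}
       n = len(comparisons)
       for _ in range(n_steps):
           grad = {name: -l2 * ability[name] for name in agents}
           for winner, loser in comparisons:
               # Gradient of log sigmoid(a_winner - a_loser) is (1 - p).
               surprise = 1.0 - win_probability(ability[winner], ability[loser])
               grad[winner] += surprise
               grad[loser] -= surprise
           for name in agents:
               ability[name] += lr * grad[name] / n
       return ability

   # Made-up toy judgments for illustration; these are not results from the paper.
   toy = (
       [("teacher", "blender")] * 7 + [("blender", "teacher")] * 3
       + [("teacher", "gpt3")] * 8 + [("gpt3", "teacher")] * 2
       + [("blender", "gpt3")] * 6 + [("gpt3", "blender")] * 4
   )
   abilities = fit_abilities(toy)
   for name, value in sorted(abilities.items(), key=lambda kv: -kv[1]):
       print(f"{name}: {value:+.2f}")

Only differences between abilities are meaningful in this model, which is why results are naturally reported as a gap relative to a reference such as the human teacher. A point estimate like this one carries no uncertainty; Bayesian sampling yields a full posterior over each ability instead.
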
Citation
More information can be found in `this paper <https://anaistack.github.io/assets/pdf/tack_ai_2022.pdf>`_.
When using the data or code in your research or publication, please cite this paper as well.
.. code:: bibtex

   @inproceedings{tack_ai_2022,
     title = {The {{AI Teacher Test}}: {{Measuring}} the {{Pedagogical Ability}} of {{Blender}} and {{GPT-3}} in {{Educational Dialogues}}},
     booktitle = {The 15th {{International Conference}} on {{Educational Data Mining}}},
     author = {Tack, Ana{\"i}s and Piech, Chris},
     year = {2022},
     pages = {accepted},
     copyright = {All rights reserved}
   }

Acknowledgments
This research was funded by a fellowship of the `BAEF (Belgian American Educational Foundation) <https://www.baef.be>`_ and by a grant from `Stanford HAI <https://hai.stanford.edu>`_.
Changelog
All notable changes to this project will be documented in this file.
The format is based on `Keep a Changelog <https://keepachangelog.com/en/1.0.0/>`_,
and this project adheres to `Semantic Versioning <https://semver.org/spec/v2.0.0.html>`_.
[1.0.0] - 2022-05-10
Added
- Publication of data and code for the EDM 2022 conference
.. |copy| unicode:: U+000A9 .. COPYRIGHT SIGN