๐ About
August 6, 2024 ยท View on GitHub
This is the source code for building a structured assessment guided by pedagogy theories to evaluate model ability for each test objective across multiple cognitive levels and critical concepts, rather than relying on the correctness of a single test instance. Specifically, StructEval consists of two modules which deepen and broaden current evaluation respectively. Given a seed instance, the first module identifies its underlying test objective, and then generates multiple test instances around this test objective which are aligned with the six cognitive levels outlined in Bloomโs Taxonomy. Meanwhile, the second module extracts the key concepts that must be understood to answer the seed question, and then develop a series of instances revolving around these concepts.

๐ฅ Quick Start
๐ป Environment Preparation
- unpack the environment file
mkdir ~/anaconda3/envs/structeval
tar -xzvf asset/structeval.tar.gz -C ~/anaconda3/envs/structeval
- check installation
conda info -e
conda activate structeval
- Edit the OPENAI API keys in common_utils/constant.py
๐ Data Preparation
Convert your data to the following format and name it as 0_{split}_with_idx.json, then place it in the processed_data/{benchmark_name} folder.
If the benchmark consists of multiple sub-tasks, place the json file in processed_data/{benchmark_name}/{task_name}
{"question": "Which of the following is true regarding reflexes?", "subject": "clinical_knowledge", "choices": ["A positive babinski reflex is the same as a normal flexor response in the assessment of the plantar reflex", "An extensor plantar response indicates a lower motor neurone lesion", "The root value of the ankle reflex is S1", "The root value of the knee reflex is L1, L2"], "answer": 2, "idx": 0}
We also prepared an example in processed_data/demo to facilitate your usage. A complete running example can be found in processed_data/example.
If you want to run the code completely, you also need to build the ElasticSearch index for Wikipedia. Please refer to
Notesfor details.
๐จ Structured Benchmark Generation
1. Bloom-based Generation
The following script would first identify the underlying test objective of each test instance and then generates multiple relevant instances covering six cognitive levels of Bloomโs Taxonomy.
bash scripts/run_bloom_generate.bash {benchmark_name} {split}
For instance,
bash scripts/run_bloom_generate.bash demo test
2. Concept-based Generation
The following script would extract the essential concepts that must be understood and develops a series of extended questions around these concepts.
bash scripts/run_concept_generation.bash demo test
3. Data Combine
The following script would combine the generated test instances from the above two modules.
bash scripts/run_data_combine.bash demo test
If everything goes smoothly, you can find the generated structural benchmark under the struct_data/demo folder, and you can see all the intermediate process data under the processed_data/demo folder.
We also provided a complete running example in processed_data/example, so that you can easily check the results in each step.
๐ Notes
-
Due to our discovery that the information in Wikipedia is richer compared to Wikidata, it can lead to generating higher quality questions. Therefore, in this version of the code, concept-related instances are generated based on Wikipedia, which differs from what is mentioned in the paper.
-
Due to the need to retrieve the text of Wikipedia for generation, we have built a regularly updated ElasticSearch Index for retrieval. Because its scale, we are unable to share the original ES index in this repository, so please set up the corresponding ES index in your environment and configure it as follows, then edit
config/es_config.yamlaccordingly. If you encounter any problems during this process, feel free to communicate with us.
{
"wikipedia-monthly-enwiki" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "text",
},
"text" : {
"type" : "text",
},
"title" : {
"type" : "text",
},
"url" : {
"type" : "text",
}}}}
}