ASearcher Evaluation

August 7, 2025

A framework for testing and evaluating the performance of LLM agents. It supports multiple agent architectures, search engine integrations, and evaluation methods.

Preparation

Step 1: Prepare the runtime environment.

Follow the runtime environment instructions at https://inclusionai.github.io/AReaL/tutorial/installation.html#runtime-environment.

Step 2: Download the test data from ASearcher-test-data (https://huggingface.co/datasets/inclusionAI/ASearcher-test-data).
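
The download can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper name `download_test_data` and the default target directory are illustrative and not part of the framework.

```python
from pathlib import Path


def download_test_data(local_dir: str = "./ASearcher-test-data") -> Path:
    """Download the test set from the Hugging Face Hub and return its local path.

    Hypothetical helper: the function name and default directory are assumptions.
    """
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="inclusionAI/ASearcher-test-data",
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
    )
    return Path(path)


# Usage (requires network access):
#   data_dir = download_test_data("/path/to/test_set")
```

The returned directory can then be passed to the evaluation script as `--data_dir`.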

Evaluate a Search Agent

Different agent workflows can be evaluated by setting the --agent-type and --search_client_type flags:

python3 -m evaluation.search_eval_async \
    ...
    --prompt_type ${PROMPT_TYPE} \
    --agent-type ${AGENT_TYPE} \
    --search_client_type ${SEARCH_CLIENT_TYPE} \
    ...
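
Note that the flags mix hyphens (`--agent-type`) and underscores (`--search_client_type`), so they must be typed exactly as shown. The sketch below illustrates how Python's `argparse` would treat such flags; the parser is a hypothetical stand-in that declares only the three flags above and is not the script's actual argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the three flags shown above are declared here.
    parser = argparse.ArgumentParser(description="Illustrative subset of evaluation flags")
    parser.add_argument("--prompt_type", type=str, required=True)
    parser.add_argument("--agent-type", type=str, required=True)
    parser.add_argument("--search_client_type", type=str, required=True)
    return parser


args = build_parser().parse_args([
    "--prompt_type", "asearcher",
    "--agent-type", "asearcher",
    "--search_client_type", "async-web-search-access",
])
# argparse converts hyphens in long option names to underscores in attribute names.
print(args.agent_type, args.search_client_type)
```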

Several examples follow.

A. Evaluate asearcher-reasoning with Web Search

cd evaluation/

MODEL_PATH=/path/to/models
DATA_DIR=/path/to/test_set  # Can be downloaded from https://huggingface.co/datasets/inclusionAI/ASearcher-test-data

DATA_NAMES=GAIA,xbench-deepsearch,Frames
AGENT_TYPE=asearcher-reasoning
PROMPT_TYPE=asearcher-reasoning
SEARCH_CLIENT_TYPE=async-web-search-access

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" &> /dev/null && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"

PYTHONPATH="${PROJECT_ROOT}:$PYTHONPATH" \
SERPER_API_KEY=${your_serper_api_key} \
JINA_API_KEY=${your_jina_api_key} \
TOKENIZERS_PARALLELISM=false \
python3 search_eval_async.py \
    --data_names ${DATA_NAMES} \
    --model_name_or_path ${MODEL_PATH}  \
    --output_dir ${MODEL_PATH} \
    --data_dir ${DATA_DIR} \
    --prompt_type $PROMPT_TYPE \
    --agent-type ${AGENT_TYPE} \
    --search_client_type ${SEARCH_CLIENT_TYPE} \
    --tensor_parallel_size 4 \
    --temperature 0.6 \
    --parallel-mode seed \
    --seed 1 \
    --use-jina \
    --llm_as_judge \
    --pass-at-k 1
# For more stable results, increase --pass-at-k.

B. Evaluate asearcher with Web Search

cd evaluation/

MODEL_PATH=/path/to/models
DATA_DIR=/path/to/test_set  # Can be downloaded from https://huggingface.co/datasets/inclusionAI/ASearcher-test-data

DATA_NAMES=GAIA,xbench-deepsearch,Frames
AGENT_TYPE=asearcher
PROMPT_TYPE=asearcher
SEARCH_CLIENT_TYPE=async-web-search-access

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" &> /dev/null && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"

PYTHONPATH="${PROJECT_ROOT}:$PYTHONPATH" \
SERPER_API_KEY=${your_serper_api_key} \
JINA_API_KEY=${your_jina_api_key} \
TOKENIZERS_PARALLELISM=false \
python3 search_eval_async.py \
    --data_names ${DATA_NAMES} \
    --model_name_or_path ${MODEL_PATH}  \
    --output_dir ${MODEL_PATH} \
    --data_dir ${DATA_DIR} \
    --prompt_type $PROMPT_TYPE \
    --agent-type ${AGENT_TYPE} \
    --search_client_type ${SEARCH_CLIENT_TYPE} \
    --tensor_parallel_size 4 \
    --temperature 0.6 \
    --parallel-mode seed \
    --seed 1 \
    --use-jina \
    --llm_as_judge \
    --pass-at-k 1

C. Evaluate Search-R1 with Local Knowledge Base

Step 0. Build the image and start the container following evaluation/Dockerfile

Step 1. Set up environment variables

export RAG_SERVER_ADDR_DIR=PATH_TO_DUMP_LOCAL_SERVER_ADDRESS
export PORT=8000

Here RAG_SERVER_ADDR_DIR is the directory where the launched local RAG server writes its address; the address is read back from this directory when the evaluation runs.

Step 2. Set up and launch the local RAG server

bash scripts/launch_local_server.sh $PORT $RAG_SERVER_ADDR_DIR

Step 3. Run the evaluation

cd evaluation/

MODEL_PATH=/path/to/models
DATA_DIR=/path/to/test_set  # Can be downloaded from https://huggingface.co/datasets/inclusionAI/ASearcher-test-data

DATA_NAMES=Bamboogle,NQ
AGENT_TYPE=search-r1
PROMPT_TYPE=search-r1
SEARCH_CLIENT_TYPE=async-search-access

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" &> /dev/null && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"

PYTHONPATH="${PROJECT_ROOT}:$PYTHONPATH" \
SERPER_API_KEY=${your_serper_api_key} \
JINA_API_KEY=${your_jina_api_key} \
TOKENIZERS_PARALLELISM=false \
python3 search_eval_async.py \
    --data_names ${DATA_NAMES} \
    --model_name_or_path ${MODEL_PATH}  \
    --output_dir ${MODEL_PATH} \
    --data_dir ${DATA_DIR} \
    --prompt_type $PROMPT_TYPE \
    --agent-type ${AGENT_TYPE} \
    --search_client_type ${SEARCH_CLIENT_TYPE} \
    --tensor_parallel_size 4 \
    --temperature 0.6 \
    --parallel-mode seed \
    --seed 1 \
    --use-jina \
    --llm_as_judge \
    --pass-at-k 1

⚙️ Configuration Parameters

Core Parameters

| Parameter | Description | Options |
| --- | --- | --- |
| --data_names | Test dataset names (comma-separated) | GAIA, HotpotQA, NQ, TriviaQA, etc. |
| --agent-type | Agent type | search-r1, asearcher-reasoning, asearcher |
| --search_client_type | Search client type | async-search-access, async-web-search-access |
| --model_name_or_path | LLM model path | Local model path or HuggingFace model name |
| --pass-at-k | Number of evaluation runs | Set above 1 to repeat the test; runs execute serially |
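
To illustrate why repeating the evaluation gives more stable numbers, the sketch below aggregates several runs per question into a mean accuracy and a pass@k score (at least one correct run). The function name, the input layout, and the metric labels are assumptions made for illustration; the framework's own aggregation may differ.

```python
from statistics import mean


def aggregate_runs(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-run correctness; `results` maps question id -> one bool per run."""
    # Mean accuracy over runs, averaged over questions.
    avg_at_k = mean(mean(1.0 if ok else 0.0 for ok in runs) for runs in results.values())
    # A question passes if any of its runs is correct.
    pass_at_k = mean(1.0 if any(runs) else 0.0 for runs in results.values())
    return {"avg@k": avg_at_k, "pass@k": pass_at_k}


# Hypothetical outcomes of 4 runs on 3 questions.
runs = {
    "q1": [True, False, True, True],
    "q2": [False, False, False, False],
    "q3": [False, True, False, False],
}
scores = aggregate_runs(runs)
print(scores)
```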

Model Generation Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --temperature | Generation temperature | 0 |
| --top_p | Top-p sampling | 1 |
| --top_k | Top-k sampling | -1 |
| --max-tokens-per-call | Maximum tokens to generate | 4096 |
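
The sketch below shows how these three sampling parameters are conventionally interpreted: a temperature of 0 means greedy decoding, a top-k of -1 disables top-k filtering, and a top-p of 1 keeps the whole distribution. It is a plain-Python illustration of the general technique, assuming the convention used by common inference engines, and is not the code the framework runs.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_p=1.0, top_k=-1, rng=None):
    """Pick a token index from raw logits using common sampling conventions."""
    if temperature == 0.0:
        # Temperature 0 is conventionally treated as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    if top_k > 0:
        # Keep the k most likely tokens and renormalize; -1 disables this step.
        probs = probs[:top_k]
        mass = sum(p for p, _ in probs)
        probs = [(p / mass, i) for p, i in probs]

    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break

    rng = rng or random.Random()
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
greedy = sample_token(logits)  # defaults from the table: greedy decoding
sampled = sample_token(logits, temperature=0.6, top_p=0.9, top_k=2, rng=random.Random(1))
print(greedy, sampled)
```

With the defaults from the table the output is deterministic, which is why the examples above set `--temperature 0.6` together with a seed when sampling is wanted.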

Evaluation Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --num_test_sample | Number of test samples | -1 (all) |
| --concurrent | Number of concurrent requests | 128 |
| --llm_as_judge | Enable LLM-as-Judge evaluation | False |
| --judge-prompt | LLM Judge prompt type | default |
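
A cap on concurrent requests, as `--concurrent` describes, is typically enforced with a semaphore. The following is a minimal sketch of that pattern using `asyncio`; `run_with_limit` and `fake_request` are hypothetical names, and the framework's actual scheduling may be implemented differently.

```python
import asyncio


async def run_with_limit(items, worker, concurrent=128):
    """Run worker(item) for every item with at most `concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_request(question):
    # Hypothetical stand-in for an LLM or search request.
    await asyncio.sleep(0.01)
    return question.upper()


answers = asyncio.run(run_with_limit(["a", "b", "c"], fake_request, concurrent=2))
print(answers)
```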

🔧 Evaluation Metrics

The framework supports the following evaluation metrics:

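  • EM (Exact Match): the prediction matches a reference answer exactly
  • F1 Score: token-level overlap (harmonic mean of precision and recall) between the prediction and a reference answer
  • CEM (Cover Exact Match): a reference answer appears somewhere within the prediction
  • LLM-as-Judge: an LLM decides whether the prediction is correct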
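
The three string-based metrics can be sketched as follows. This is a minimal illustration that assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the function names are illustrative, and the framework's own normalization and scoring code may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def cover_exact_match(prediction: str, references: list[str]) -> bool:
    """True if any normalized reference is a substring of the normalized prediction."""
    pred = normalize(prediction)
    return any(normalize(ref) in pred for ref in references)


def f1_score(prediction: str, references: list[str]) -> float:
    """Best token-level F1 between the prediction and any reference."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


pred = "The capital is Paris."
refs = ["Paris"]
print(exact_match(pred, refs), cover_exact_match(pred, refs), f1_score(pred, refs))
```

In this example the verbose prediction fails EM but passes CEM, which is the main practical difference between the two.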

🌐 Search Integration

  • Serper API: For web search
  • Jina API: For web content extraction and processing
  • Custom Search Clients: the framework can be extended with clients for other search engines
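
As a rough idea of what a custom search client could look like, the sketch below defines an asynchronous client over an in-memory corpus. The `SearchClient` base class, the `search` signature, and the result format are all hypothetical; consult the existing clients in the repository for the interface the framework actually expects.

```python
import asyncio
from abc import ABC, abstractmethod


class SearchClient(ABC):
    """Hypothetical base class; the framework's real interface may differ."""

    @abstractmethod
    async def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Return up to top_k results, each with 'title' and 'snippet' keys."""


class InMemorySearchClient(SearchClient):
    """Toy client that ranks a fixed corpus by keyword overlap with the query."""

    def __init__(self, corpus: list[dict]):
        self.corpus = corpus

    async def search(self, query: str, top_k: int = 5) -> list[dict]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc["snippet"].lower().split())), doc)
            for doc in self.corpus
        ]
        # Sort by overlap count only, so documents themselves are never compared.
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked if score > 0][:top_k]


corpus = [
    {"title": "France", "snippet": "Paris is the capital of France"},
    {"title": "Japan", "snippet": "Tokyo is the capital of Japan"},
]
client = InMemorySearchClient(corpus)
hits = asyncio.run(client.search("capital of France", top_k=1))
print(hits[0]["title"])
```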