TAG_AD: Text-Attributed Graph Anomaly Detection
November 16, 2025
Overview
TAG_AD is an integrated framework for generating, detecting, and analyzing anomalies in text-attributed graphs. It supports the creation of challenging benchmark datasets, baseline evaluations, and advanced LLM-powered detection pipelines for both contextual and structural anomaly detection tasks.
Installation
Prerequisites
conda create -n ad_env python=3.11
conda activate ad_env
pip install -r requirements.txt
API Configuration
export OPENAI_API_KEY="OPENAI_API_KEY"
export DEEPINFRA_API_KEY="DEEPINFRA_API_KEY"
export DEEPSEEK_API_KEY="DEEPSEEK_API_KEY"
export TEMPERATURE=0
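Before launching long-running jobs, it can help to confirm the keys are actually visible to Python. The helper below is our own convenience sketch, not part of the repository:

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPINFRA_API_KEY", "DEEPSEEK_API_KEY")

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names of any required API keys that are unset or empty."""
    return [k for k in required if not env.get(k)]

if __name__ == "__main__":
    print("Missing keys:", missing_keys(os.environ) or "none")
```

Depending on which backend you use, only one of the three keys may actually be needed.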
Datasets
We use the following four datasets from the LLMGNN repository.
Dataset Statistics
| Dataset Name | #Nodes | #Edges | Task Description | Classes |
|---|---|---|---|---|
| CORA | 2,708 | 5,429 | Given the title and abstract, predict the category of this paper | Rule Learning, Neural Networks, Case-Based, Genetic Algorithms, Theory, Reinforcement Learning, Probabilistic Methods |
| CITESEER | 3,186 | 4,277 | Given the title and abstract, predict the category of this paper | Agents, Machine Learning, Information Retrieval, Database, Human Computer Interaction, Artificial Intelligence |
| PUBMED | 19,717 | 44,335 | Given the title and abstract, predict the category of this paper | Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, Diabetes Mellitus Type 2 |
| WIKI-CS | 11,701 | 431,726 | Given the title and abstract, predict the category of this paper | Computational linguistics, Databases, Operating systems, Computer architecture, Computer security, Internet protocols, Computer file systems, Distributed computing architecture, Web technology, Programming language topics |
Dataset Setup
To set up the datasets:
- Download: Get the archive `datasets.tar.gz` from Google Drive.
- Extract: Unpack the archive: `tar -xzvf datasets.tar.gz`
- Move files: Place all extracted files and folders into the `data/raw` directory (create it if it doesn't exist).
Your folder structure should look like:
TAG_AD/
└── data/
└── raw/
├── <dataset files>
└── ...
Usage
1. Generate Synthetic Anomalies
Use make_anomaly.py to create datasets with injected anomalies. Specify the anomaly type, count, and other settings as needed:
python make_anomaly.py \
--dataset_name <dataset_name> \
--anomaly_type <1|2|3|4|5> \
--anomaly_num <number_of_anomalies> \
--data_dir <data/raw> \
--output_dir <data/generated> \
--is_map_label <True|False>
Parameters:
- `<dataset_name>`: Name of the dataset (e.g., `pubmed_fixed_sbert_5_290`)
- `<anomaly_type>`:
  - 1: Dummy anomaly
  - 2: LLM-generated contextual anomaly
  - 3: Traditional contextual anomaly
  - 4: Global anomaly
  - 5: Structural anomaly
- `<anomaly_num>`: Number of anomalous nodes to generate
- `<data_dir>`: (Optional; default: `data/raw`) Source data directory
- `<output_dir>`: (Optional; default: `data/generated`) Output directory for generated data
- `<is_map_label>`: Use `True` when generating anomalies for a fresh dataset, otherwise `False` (for additional runs on the same dataset)
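`make_anomaly.py` implements its own injection routines; as background, the structural case (type 5) typically follows the classic clique-injection recipe used by graph anomaly benchmarks: a random set of nodes is fully interconnected so that it forms an unusually dense substructure. A minimal NumPy sketch of that idea (function name and signature are illustrative, not the repo's API):

```python
import numpy as np

def inject_clique(edge_index, n_nodes, clique_size=15, seed=42):
    """Classic structural-anomaly injection: pick `clique_size` random nodes
    and fully connect them so they form an unusually dense substructure."""
    rng = np.random.default_rng(seed)
    members = rng.choice(n_nodes, size=clique_size, replace=False)
    # add both directions so the edge list stays symmetric
    new_edges = np.array([(u, v) for u in members for v in members if u != v]).T
    return np.concatenate([edge_index, new_edges], axis=1), members

# toy example: start from an empty 100-node graph and inject a 5-clique
edge_index = np.zeros((2, 0), dtype=int)
ei, members = inject_clique(edge_index, n_nodes=100, clique_size=5)
```

A 5-clique contributes 5 × 4 = 20 directed edges; the injected `members` become the structural-anomaly ground truth.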
Example:
Generate 100 LLM-generated contextual anomalies on the PubMed dataset:
python make_anomaly.py \
--dataset_name pubmed_fixed_sbert \
--anomaly_type 2 \
--anomaly_num 100 \
--is_map_label True
Tips:
- When generating multiple anomaly types on the same dataset, set `--is_map_label` to `True` for the first type, then `False` for subsequent types.
- Output files will be saved in the directory specified by `--output_dir`.
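For the traditional contextual case (type 3), graph anomaly benchmarks usually perturb a node's attributes so they no longer match its context: sample k candidate nodes and overwrite the target's features with those of the most distant candidate. The sketch below illustrates that standard recipe in NumPy; it is not the repo's exact implementation:

```python
import numpy as np

def inject_contextual_anomalies(x, num_anomalies, k=50, seed=42):
    """Classic contextual-anomaly injection: for each chosen node, sample k
    candidate nodes and overwrite the node's feature vector with that of the
    most distant candidate, making it inconsistent with its own context."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    targets = rng.choice(x.shape[0], size=num_anomalies, replace=False)
    for i in targets:
        candidates = rng.choice(x.shape[0], size=k, replace=False)
        dists = np.linalg.norm(x[candidates] - x[i], axis=1)
        x[i] = x[candidates[np.argmax(dists)]]
    return x, targets

# toy example: 100 nodes with 16-dim features, 5 injected anomalies
feats = np.random.default_rng(0).normal(size=(100, 16))
new_feats, targets = inject_contextual_anomalies(feats, num_anomalies=5, k=20)
```

The LLM-generated variant (type 2) replaces this feature swap with text rewritten by an LLM, which is what makes the benchmark harder for embedding-based detectors.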
2. Evaluate Synthetic Anomalies using PyGOD
Use pygod_baseline.py to evaluate anomaly detection baselines on your generated datasets.
python pygod_baseline.py \
--data_dir <data/generated> \
--dataset_name <dataset_name> \
--random_seed <seed> \
--experiment_num <repeat_times> \
--output_file <output_json> \
--k <top_k> \
--is_structural <False|True>
Parameters:
- `<data_dir>`: (Optional; default: `data/generated`) Directory where generated datasets are stored
- `<dataset_name>`: Name of the dataset to evaluate (e.g., `pubmed_fixed_sbert_2_100`)
- `<random_seed>`: (Optional; default: 42) Random seed for reproducibility
- `<experiment_num>`: (Optional; default: 1) Number of times to repeat the experiment for statistics
- `<output_file>`: (Optional; default: `results.json`) Where to save evaluation results
- `<k>`: (Optional; default: 20) Number of anomalies used for precision@k/recall@k
- `<is_structural>`: Set to `True` when evaluating structural anomaly datasets, otherwise `False`
Example:
Evaluate a generated PubMed anomaly dataset, running 3 times for statistical robustness:
python pygod_baseline.py \
--data_dir data/generated \
--dataset_name pubmed_fixed_sbert_2_100 \
--experiment_num 3 \
--output_file result_pubmed_2_100.json \
--k 20
After running this command, the evaluation results will be saved to the specified output JSON file, including metrics such as AUC, Average Precision, F1 score, Precision@k, and Recall@k for a variety of baselines.
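Precision@k and recall@k are computed from the ranked anomaly scores. A minimal sketch of these two metrics (not the repo's exact implementation):

```python
import numpy as np

def precision_recall_at_k(scores, labels, k=20):
    """Rank nodes by descending anomaly score; precision@k is the fraction of
    the top-k that are true anomalies, recall@k the fraction of all true
    anomalies that appear in the top-k."""
    top_k = np.argsort(scores)[::-1][:k]
    hits = int(np.sum(labels[top_k]))
    return hits / k, hits / int(np.sum(labels))

# toy example: 5 nodes, 2 true anomalies (nodes 0 and 4)
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
labels = np.array([1, 0, 0, 0, 1])
p_at_k, r_at_k = precision_recall_at_k(scores, labels, k=2)
# top-2 by score are nodes 0 and 2; only node 0 is a true anomaly
```

Setting `--k` to the number of injected anomalies makes precision@k and recall@k coincide.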
3. Create an Analysis Framework for LLM-based Anomaly Detection using RAG
To generate an analysis framework leveraging Retrieval-Augmented Generation (RAG) for text-attributed graph anomaly detection:
- Add Papers: Place relevant text-attributed graph anomaly detection papers (in PDF form) into the `papers/` directory. These papers will serve as the knowledge base for the RAG pipeline.
- Run the Framework Generation Script: You can use the provided shell script (`rag_generate.sh`) or run the Python command directly. This will generate an analysis framework based on your collection of papers using an LLM, with controllable temperature and other parameters.

  Recommended (using `rag_generate.sh`):

  bash rag_generate.sh

  Ensure your OpenAI API key is properly set in your environment. You can edit `rag_generate.sh` to customize arguments or the key.

  Or direct usage:

  python analysis_framework_generator.py \
    --type_of_anomaly "Contextual Anomaly" \
    --temperature 0.7 \
    --paper_directory "papers" \
    --output_file "analysis_framework_contextual.txt" \
    --api_key <OPENAI_API_KEY>
Notes:
- The RAG process will use the contents of all papers found in the `papers/` directory.
- Adjust the anomaly type (e.g., "Structural Anomaly", "Contextual Anomaly", "Mixed Anomaly") and temperature as needed via the `--type_of_anomaly` and `--temperature` options.
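`analysis_framework_generator.py` handles PDF parsing, embedding, and retrieval internally. To illustrate the underlying RAG idea, here is a toy bag-of-words retriever that selects the excerpts most relevant to a query and splices them into a prompt; all names and texts below are illustrative, not the repo's pipeline:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(chunks, query, top_k=2):
    """Return the top_k text chunks most similar to the query."""
    q = Counter(query.lower().split())
    return sorted(chunks, key=lambda c: cosine(Counter(c.lower().split()), q), reverse=True)[:top_k]

# stand-ins for chunks extracted from the PDFs in papers/
chunks = [
    "Contextual anomalies have attributes inconsistent with their neighbors.",
    "Structural anomalies form unusually dense substructures such as cliques.",
    "Graph neural networks aggregate messages from adjacent nodes.",
]
context = retrieve(chunks, "contextual anomaly attributes of neighbors", top_k=1)
prompt = "Using these excerpts, draft an analysis framework:\n" + "\n".join(context)
```

A production pipeline would use embedding-based retrieval over PDF text rather than word overlap, but the control flow (retrieve, then generate) is the same.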
4. Run LLM-Powered Anomaly Detection Pipeline
Once you have prepared your dataset and generated an analysis framework (see sections above), you can apply large language models (LLMs) to the anomaly detection task on text-attributed graph data.
This pipeline supports the following LLM backends:
- DeepInfra API: `deepseek-ai/DeepSeek-V3-0324`, `Qwen/Qwen3-14B`, `google/gemma-3-27b-it`
- DeepSeek API: `deepseek-chat`
- OpenAI API: `gpt-4o-mini`
Specify the desired model name using the `--model_name` argument when running LLM_ad_detection.py. Make sure the corresponding API key is set as an environment variable (`OPENAI_API_KEY`, `DEEPINFRA_API_KEY`, or `DEEPSEEK_API_KEY`, as required).
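All three providers expose OpenAI-compatible chat endpoints, differing only in base URL and key. The actual routing lives inside `LLM_ad_detection.py`; the mapping below is our own sketch of how a `--model_name` could resolve to a key and endpoint (base URLs are the providers' documented OpenAI-compatible endpoints):

```python
import os

# Illustrative routing table, not the repo's code.
BACKENDS = {
    "deepseek-ai/DeepSeek-V3-0324": ("DEEPINFRA_API_KEY", "https://api.deepinfra.com/v1/openai"),
    "Qwen/Qwen3-14B": ("DEEPINFRA_API_KEY", "https://api.deepinfra.com/v1/openai"),
    "google/gemma-3-27b-it": ("DEEPINFRA_API_KEY", "https://api.deepinfra.com/v1/openai"),
    "deepseek-chat": ("DEEPSEEK_API_KEY", "https://api.deepseek.com"),
    "gpt-4o-mini": ("OPENAI_API_KEY", None),  # None -> default OpenAI endpoint
}

def resolve_backend(model_name):
    """Return (api_key, base_url) for a --model_name, raising early if the
    provider's key environment variable is not set."""
    if model_name not in BACKENDS:
        raise ValueError(f"Unsupported model: {model_name}")
    env_var, base_url = BACKENDS[model_name]
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before running the pipeline")
    return key, base_url
```

Failing fast on a missing key is cheaper than discovering it mid-run after hundreds of nodes have been processed.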
Step-by-step Usage
- Prepare Required Files and Environment
  - Ensure your dataset (such as `pubmed_fixed_sbert_2_100.pt`) is available.
  - Have an analysis framework file generated (e.g., `analysis_framework_contextual.txt`).
  - Confirm your API keys for any required LLM provider (OpenAI, DeepInfra, DeepSeek, etc.) are set as environment variables, e.g.:

    export OPENAI_API_KEY="your-openai-api-key"

- Run the Detection Script
  - The main detection pipeline is implemented in `LLM_ad_detection.py`. Run directly:

    python LLM_ad_detection.py \
      --anomaly_type "Contextual Anomaly" \
      --analysis_framework_path "analysis_framework_contextual.txt" \
      --dataset_file "pubmed_fixed_sbert_2_100.pt" \
      --output_dir "LLM_results" \
      --output_file "pubmed_fixed_sbert_2_100_gemma.json" \
      --model_name "google/gemma-3-27b-it" \
      --max_nodes 1000

  Key Arguments:
  - `--anomaly_type`: One of "Contextual Anomaly", "Structural Anomaly", or "Mixed Anomaly".
  - `--analysis_framework_path`: Path to the framework file generated in the previous step.
  - `--dataset_file`: Processed dataset file (.pt file as produced by the data preparation step).
  - `--output_dir`: Directory to save results.
  - `--output_file`: Name for the output results JSON.
  - `--model_name`: The model to use, e.g., "google/gemma-3-27b-it", "deepseek-ai/DeepSeek-V3-0324", etc.
  - `--max_nodes`: Optional; maximum number of nodes (samples) to process.
  - `--use_human_designed_analysis_framework`: Use the provided expert-designed framework rather than an LLM-generated one.
  - `--use_dummy`: When this flag is set, the script ignores any loaded analysis framework file and instead uses a minimal, placeholder (dummy) analysis framework prompt for each node. The dummy framework content is in `detector_prompts.py` under `ANALYSIS_FRAMEWORK_DUMMY`. This is designed for ablation studies without domain-specific logic.
- Results
  - The script will save detailed per-node anomaly scores and predictions (in JSON format) to the specified output file in the provided output directory.
  - You can use this output for further result analysis or benchmarking.
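Internally, pipelines like this score each node by prompting the LLM with the analysis framework plus the node's text and neighborhood, then parsing a structured reply. The sketch below is our illustration of that pattern, not the repo's actual prompt templates (those live in `detector_prompts.py`):

```python
import json

def build_prompt(framework, node_text, neighbor_texts):
    """Compose a per-node detection prompt from an analysis framework, the
    target node's text, and its neighbors' texts (names are illustrative)."""
    neighbors = "\n".join(f"- {t}" for t in neighbor_texts)
    return (
        f"{framework}\n\n"
        f"Target node:\n{node_text}\n\n"
        f"Neighbors:\n{neighbors}\n\n"
        'Reply as JSON: {"anomaly_score": <float in [0, 1]>, "reason": "..."}'
    )

def parse_reply(reply):
    """Extract the anomaly score from the model's JSON reply."""
    return float(json.loads(reply)["anomaly_score"])

example = build_prompt(
    "Check whether the node's topic matches its neighborhood.",
    "Title: A study of insulin response ...",
    ["Title: Diabetes mellitus type 1 ...", "Title: Glucose metabolism ..."],
)
```

Asking for a JSON reply keeps the per-node scores machine-readable, which is what makes the aggregated output file directly usable for benchmarking.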
Example: Using a Human-Designed Analysis Framework
This mode uses the expert-designed, interpretable analysis framework provided in detector_prompts.py (see ANALYSIS_FRAMEWORK_CONTEXTUAL_HUMAN_DESIGNED or ANALYSIS_FRAMEWORK_STRUCTURAL_HUMAN_DESIGNED for reference), which is injected at runtime. Be sure to specify the appropriate anomaly type and framework path.
python LLM_ad_detection.py \
--anomaly_type "Contextual Anomaly" \
--analysis_framework_path "analysis_framework_contextual.txt" \
--dataset_file "pubmed_fixed_sbert_2_100.pt" \
--output_dir "LLM_results" \
--output_file "pubmed_fixed_sbert_2_100_gemma_human_designed.json" \
--model_name "google/gemma-3-27b-it" \
--max_nodes 1000 \
--use_human_designed_analysis_framework
- When `--use_human_designed_analysis_framework` is specified, the model will use the corresponding human-designed analysis framework for the selected anomaly type; the actual content can be found in `detector_prompts.py`. The `--analysis_framework_path` file is still needed as input, but it will be overridden by the hard-coded human-designed framework (unless `--use_dummy` is also used).
Example: Dummy Framework Mode (Ablation)
python LLM_ad_detection.py \
--anomaly_type "Contextual Anomaly" \
--analysis_framework_path "analysis_framework_contextual.txt" \
--dataset_file "pubmed_fixed_sbert_2_100.pt" \
--output_dir "LLM_results" \
--output_file "pubmed_fixed_sbert_2_100_gemma_human_designed_dummy.json" \
--model_name "google/gemma-3-27b-it" \
--max_nodes 1000 \
--use_human_designed_analysis_framework \
--use_dummy
For further customization and an overview of all available arguments, review the LLM_ad_detection.py file and refer to the comments in experiment.sh.
Note:
- Make sure to use the correct API key(s) for your chosen LLM provider.
- The analysis framework file and data file paths must correspond to your setup.
For more advanced multi-run experiments or evaluation of different models, modify or extend experiment.sh as needed.