Zochi: The World's First Artificial Scientist

August 23, 2025

Project Lead: Andy Zhou

Core Contributors: Ron Arel, Soren Dunn, Nikhil Khandekar

1. Introduction

Updated May 27, 2025: The technical report and code cover an earlier version of Zochi. Zochi’s capabilities have greatly expanded, culminating in acceptance to ACL 2025!

Zochi is an artificial scientist system capable of end-to-end scientific discovery, from hypothesis generation through experimentation to peer-reviewed publication. Unlike previous systems that automate isolated aspects of scientific research, Zochi demonstrates comprehensive capabilities across the complete research lifecycle.

We present empirical validation through multiple peer-reviewed publications accepted at ICLR 2025 workshops and ACL 2025, each containing novel methodological contributions and state-of-the-art experimental results. These include Compositional Subspace Representation Fine-tuning (CS-ReFT), which achieved a 93.94% win rate on the AlpacaEval benchmark with Llama-2-7B while using only 0.0098% of model parameters, and the Tempest (formerly Siege) framework, a state-of-the-art jailbreak method that identified critical vulnerabilities in language model safety measures through multi-turn adversarial testing.


Figure 1: Comparative analysis of automated reviewer ratings across AI research systems. Zochi achieves an average score of 7.67 on the NeurIPS guidelines scale, significantly exceeding the human acceptance threshold of 6, while previous systems fall below this threshold.

2. Research Publications

Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search (ACL 2025)

Note: This is an updated version of Siege, which was accepted to the ICLR 2025 BuildingTrust workshop.

Tempest represents a significant advancement in safety testing methodology by formalizing how minor policy breaches can accumulate over successive conversation turns and by employing beam search to explore multiple attack strategies in parallel. The framework treats each conversation state as a node in a search tree, with the central innovation being a sophisticated partial compliance tracking mechanism that identifies and exploits incremental policy leaks.
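The search procedure described above can be illustrated with a toy sketch. This is not the Tempest implementation: `Node`, `beam_search`, `expand`, and `score` are hypothetical names, the scorer is a fixed lookup table rather than a model-based judge, and no language model is called. It only shows how per-turn partial-compliance scores could accumulate along each branch while a beam keeps the highest-scoring branches.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """One conversation state in the search tree (hypothetical structure)."""
    turns: Tuple[str, ...] = ()  # opaque labels standing in for conversation turns
    partial: float = 0.0         # partial-compliance score accumulated so far


def beam_search(
    root: Node,
    expand: Callable[[Node], Iterable[str]],
    score: Callable[[Node, str], float],
    beam_width: int,
    max_turns: int,
    threshold: float,
) -> Optional[Node]:
    """Return the first node whose accumulated score reaches `threshold`, else None."""
    beam: List[Node] = [root]
    for _ in range(max_turns):
        candidates: List[Node] = []
        for node in beam:
            for turn in expand(node):
                child = Node(node.turns + (turn,), node.partial + score(node, turn))
                if child.partial >= threshold:
                    return child
                candidates.append(child)
        if not candidates:
            return None
        # Keep only the branches with the most accumulated partial compliance.
        candidates.sort(key=lambda n: n.partial, reverse=True)
        beam = candidates[:beam_width]
    return None


# Toy stand-ins: three abstract branch labels with fixed incremental scores.
TOY_SCORES = {"a": 0.125, "b": 0.25, "c": 0.0}


def toy_expand(node: Node) -> Iterable[str]:
    return TOY_SCORES.keys()


def toy_score(node: Node, turn: str) -> float:
    return TOY_SCORES[turn]


result = beam_search(Node(), toy_expand, toy_score,
                     beam_width=2, max_turns=5, threshold=1.0)
```

In this toy setting no single turn reaches the threshold on its own; the threshold is crossed only because small per-turn scores add up along one branch, which is the accumulation effect the paragraph describes.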

Results

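| Model | Method | Attempts | Success (%) | Queries |
|---|---|---|---|---|
| GPT-3.5 | Crescendo | 1 | 40.0 | 6 |
| GPT-4 | Crescendo | 1 | 31.7 | 6 |
| Llama-3.1 | Crescendo | 1 | 28.0 | 6 |
| GPT-3.5 | Crescendo | 10 | 80.4 | 60 |
| GPT-4 | Crescendo | 10 | 70.9 | 60 |
| Llama-3.1 | Crescendo | 10 | 77.0 | 60 |
| GPT-3.5 | GOAT | 1 | 55.7 | 6 |
| GPT-4 | GOAT | 1 | 46.6 | 6 |
| Llama-3.1 | GOAT | 1 | 55.0 | 6 |
| GPT-3.5 | GOAT | 10 | 91.6 | 60 |
| GPT-4 | GOAT | 10 | 87.9 | 60 |
| Llama-3.1 | GOAT | 10 | 91.0 | 60 |
| GPT-3.5 | Tempest | 1 | 100.0 | 44.4 |
| GPT-4 | Tempest | 1 | 97.0 | 84.2 |
| Llama-3.1 | Tempest | 1 | 97.0 | 51.8 |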

CS-ReFT: Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models (ICLR 2025 SCOPE)

CS-ReFT takes a fundamentally different approach from existing methods. While methods like LoRA implement orthogonality constraints at the weight level, CS-ReFT applies these constraints directly to hidden-state representations. Each task gets its own dedicated subspace transformation, which eliminates interference between tasks while still enabling composition through a lightweight router mechanism. Part of the codebase is adapted from ReFT.
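A minimal sketch of the idea, under stated assumptions: each task edits the hidden state inside its own low-rank subspace, and a router weights the per-task edits. The edit form `R^T (W h + b - R h)` is borrowed from LoReFT, the sigmoid router is a guess at what "lightweight" could mean, and every function name below is hypothetical. It is written in plain Python for readability and is not the code in this repository.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def orthonormal_rows(r, d, rng):
    """Build r orthonormal rows of dimension d via Gram-Schmidt."""
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0, 1) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


def subspace_edit(h, R, W, b):
    """LoReFT-style edit (assumed form): R^T (W h + b - R h)."""
    projected = matvec(R, h)
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]
    diff = [t - p for t, p in zip(target, projected)]
    return matvec(transpose(R), diff)


def router_gates(h, router_rows):
    """Hypothetical router: one sigmoid gate per task from a linear map of h."""
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(router_rows, h)]


def cs_reft_forward(h, tasks, gates):
    """Add each task's subspace edit to h, weighted by that task's gate."""
    out = list(h)
    for (R, W, b), g in zip(tasks, gates):
        delta = subspace_edit(h, R, W, b)
        out = [o + g * dl for o, dl in zip(out, delta)]
    return out
```

With a gate of 1 for a single task and 0 for the rest, projecting the output with that task's `R` gives exactly `W h + b`, while directions outside the task's subspace are left unchanged. With all gates at 0 the hidden state passes through untouched.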

Results

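| Model | Win Rate (%) | PE (% of model parameters) |
|---|---|---|
| Reference Models | | |
| GPT-3.5 Turbo | 86.30 | --- |
| Llama-2 13B | 81.10 | --- |
| Llama-2 7B | 71.40 | --- |
| Parameter-Efficient (Llama-2 7B) | | |
| Full Fine-tuning | 80.93 | 100.00 |
| LoRA | 81.48 | 0.1245 |
| RED | 81.69 | 0.0039 |
| DiReFT | 84.85 | 0.0039 |
| LoReFT | 85.60 | 0.0039 |
| CS-ReFT (Ours) | 93.94 | 0.0098 |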
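To give a sense of scale, the PE column can be turned into rough parameter counts. The sketch below assumes PE means trained parameters as a percentage of the base model's parameters, and assumes a total of about 6.74 billion parameters for Llama-2 7B; neither figure is stated in this README, so the counts are approximate.

```python
# Assumption: approximate parameter count of Llama-2 7B (not from this README).
LLAMA2_7B_PARAMS = 6.74e9

# PE values (%) copied from the results table.
PE_PERCENT = {
    "LoRA": 0.1245,
    "RED": 0.0039,
    "DiReFT": 0.0039,
    "LoReFT": 0.0039,
    "CS-ReFT": 0.0098,
}


def trained_params(pe_percent: float, total: float = LLAMA2_7B_PARAMS) -> int:
    """Convert a PE percentage into an approximate trained-parameter count."""
    return round(total * pe_percent / 100)


counts = {name: trained_params(pe) for name, pe in PE_PERCENT.items()}
lora_vs_csreft = PE_PERCENT["LoRA"] / PE_PERCENT["CS-ReFT"]

for name, n in counts.items():
    print(f"{name:8s} ~{n:,} trained parameters")
print(f"LoRA trains about {lora_vs_csreft:.1f}x as many parameters as CS-ReFT")
```

Under these assumptions CS-ReFT trains well under a million parameters, more than the other ReFT variants in the table but roughly an order of magnitude fewer than LoRA.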

3. Automated Review Scores

Our evaluation framework is built on an automated reviewer system from the AI Scientist that processes research papers based on the NeurIPS conference review guidelines, assigning numerical scores for soundness, presentation, contribution, and overall quality. The scoring scale ranges from 1 to 10, with 6 representing the acceptance threshold at top machine learning conferences.

| System | Domain | Paper Title | Score |
|---|---|---|---|
| Zochi | AI Safety | Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search | 8 |
| Zochi | PEFT | Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models | 8 |
| Zochi | Bioinformatics | Protein-Nucleic Acid Binding Site Prediction with Modular Feature Fusion and E(3)-Equivariant GNNs | 7 |
| AI Scientist v2 | Neural Networks | Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization | 4 |
| AI Scientist v2 | Agriculture | Real-World Challenges in Pest Detection Using Deep Learning: An Investigation into Failures and Solutions | 3 |
| AI Scientist v2 | Deep Learning | Unveiling the Impact of Label Noise on Model Calibration in Deep Learning | 4 |
| Agent Laboratory | Computer Vision | Research Report: Robustness and Accuracy of Image Matching Under Noise Interference | 4 |
| Carl | AI Safety | When to Refuse: Early Indicators of Refusal in LLMs | 3 |
| Carl | Robotics | Towards Deviation-Resilient Multi-Agent Alignment for Robot Coordination | 4 |
| AI Scientist | 2D Diffusion | DualScale Diffusion: Adaptive Feature Balancing for Low-Dimensional Generative Models | 5 |
| AI Scientist | 2D Diffusion | Multi-scale Grid Noise Adaptation: Enhancing Diffusion Models For Low-dimensional Data | 4 |
| AI Scientist | 2D Diffusion | GAN-Enhanced Diffusion: Boosting Sample Quality and Diversity | 3 |
| AI Scientist | 2D Diffusion | DualDiff: Enhancing Mode Capture in Low-dimensional Diffusion Models via Dual-expert Denoising | 5 |
| AI Scientist | NanoGPT | StyleFusion: Adaptive Multi-style Generation in Character-Level Language Models | 5 |
| AI Scientist | NanoGPT | Adaptive Learning Rates for Transformers via Q-Learning | 3 |
| AI Scientist | Grokking | Unlocking Grokking: A Comparative Study of Weight Initialization Strategies in Transformer Models | 5 |
| AI Scientist | Grokking | Grokking Accelerated: Layer-wise Learning Rates for Transformer Generalization | 4 |
| AI Scientist | Grokking | Grokking Through Compression: Unveiling Sudden Generalization via Minimal Description Length | 3 |
| AI Scientist | Grokking | Accelerating Mathematical Insight: Boosting Grokking Through Strategic Data Augmentation | 5 |
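The per-system averages behind Figure 1 can be recomputed from the scores in the table. The snippet below is a small check written for this README, not part of the evaluation framework; the variable names are illustrative.

```python
from statistics import mean

ACCEPT_THRESHOLD = 6  # acceptance threshold on the 1-10 scale described above

# Scores copied from the table, grouped by system.
scores = {
    "Zochi": [8, 8, 7],
    "AI Scientist v2": [4, 3, 4],
    "Agent Laboratory": [4],
    "Carl": [3, 4],
    "AI Scientist": [5, 4, 3, 5, 5, 3, 5, 4, 3, 5],
}

averages = {system: round(mean(vals), 2) for system, vals in scores.items()}
above_threshold = [s for s, avg in averages.items() if avg >= ACCEPT_THRESHOLD]

for system, avg in averages.items():
    print(f"{system:17s} {avg:.2f}")
```

Zochi's three scores average to 7.67, matching the figure caption, and it is the only system whose average reaches the threshold of 6.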

4. Reproducing Zochi's Results

Zochi generated the code and main results for both papers, starting from the publicly available repositories of the baseline methods. Some baseline results were taken from existing papers that use the same experimental setting. The final codebase was cleaned up and modified to remove traces of Zochi's intermediate research process and to make the results easier to reproduce.

CS-ReFT

# Clone the repository
git clone https://github.com/zochi-ai/zochi.git
cd zochi/csreft

# Install dependencies
pip install -r requirements.txt

# Train CS-ReFT on Llama-2-7B and get outputs for AlpacaEval
python csrf_train_instruct.py --output_dir <output_dir> --run_eval

Tempest

# Move back to the repository root
cd ..

# Install Tempest dependencies
pip install -r tempest/requirements.txt

# Enter the Tempest directory
cd tempest

# Run Tempest on a target model
python tempest_pipeline.py --target_model <target_model> --pipeline_model <pipeline_model> --results_json <results_json>

# Evaluate results
python get_metrics.py <results_json>

Using a Local Ollama Model

If you run a model locally with Ollama, prefix the model name with local/ and ensure the Ollama server is running. For example:

# Use a locally hosted DeepSeek model
python tempest_pipeline.py --target_model local/deepseek-llm-r1-8b --pipeline_model local/deepseek-llm-r1-8b --results_json results.json

Set the OLLAMA_BASE_URL environment variable if your server is not available at http://localhost:11434.
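Before launching a run against a local model, it can help to confirm which base URL will be used and that the server answers. The helper below is a hypothetical sketch, not code from this repository: the function names are made up for illustration, and it assumes the standard Ollama `/api/tags` endpoint, which lists locally available models.

```python
import os
import urllib.error
import urllib.request

DEFAULT_OLLAMA_URL = "http://localhost:11434"
LOCAL_PREFIX = "local/"


def resolve_ollama_base_url(env=None) -> str:
    """Return OLLAMA_BASE_URL if set, else the default, without a trailing slash."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL).rstrip("/")


def split_model_name(name: str):
    """Return (is_local, bare_name) for a model name such as 'local/<model>'."""
    if name.startswith(LOCAL_PREFIX):
        return True, name[len(LOCAL_PREFIX):]
    return False, name


def ollama_is_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the Ollama server responds to GET /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    base_url = resolve_ollama_base_url()
    status = "reachable" if ollama_is_reachable(base_url) else "not reachable"
    print(f"Ollama at {base_url}: {status}")
```

How `tempest_pipeline.py` itself parses the `local/` prefix or reads the environment variable is not documented here, so treat this only as a pre-flight check.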

6. Citation

@article{zochi2025,
  title={Zochi Technical Report},
  author={Intology},
  journal={arXiv},
  year={2025}
}

7. License

This repository is released under the MIT License. See the LICENSE file for more details.