Datasets
September 10, 2025 · View on GitHub
Latxa-Instruct:
Basque Instruct Models and Evaluation Arena
Latxa-Instruct is an open-source project for reproducible training, evaluation, and benchmarking of Basque instruction-tuned language models. It provides:
- Open-source Basque LLMs: Instruction-tuned models based on Llama 3.1 (8B and 70B parameters)
- Synthetic instruction datasets: Large-scale generated instruction-response pairs in both Basque and English for reproducible training
- Human preference dataset: The first preference dataset in Basque with real user prompts, model responses, and 12,890 preference annotations from 1,285 participants
- Complete training pipeline: End-to-end scripts for data preprocessing, model training, and instruction-tuning on the CINECA Leonardo supercomputer
- Benchmark evaluation framework: Benchmarking scripts and results across 29 datasets in Basque, English, and Spanish
- Human evaluation framework: Full frontend and backend implementation for community-driven human evaluation of language models
All models, datasets, and evaluation tools are released under open licenses.
- 📒 Blog Post: TBA
- 📖 Paper: Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
- 🤗 Models and Data: HiTZ/Latxa-Instruct
Datasets
🚧 Note: The codebase is currently work in progress. Meanwhile, all datasets are available on Hugging Face.
- Corpus: Basque corpus of 4.3M documents
- Instructions: Magpie-generated Basque/English instructions
- Preferences: Human preferences on model outputs from 21 different LLMs, in response to real Basque prompts
Model training
🚧 Note: The codebase is currently work in progress and will require adaptation to your specific environment and setup.
Training is performed on the CINECA Leonardo cluster using Axolotl (a HuggingFace-based framework).
See model_training/README.md for full instructions, environment setup, and troubleshooting.
Quicksteps
- Prepare the data
Preprocess instruction datasets and Basque corpora using scripts in model_training/. For example:
This will tokenize and cache datasets as specified in the YAML configs.sbatch model_training/train_scripts/prepare_data.sh - Train the models
Launch training with the provided SLURM scripts:sbatch model_training/train_scripts/Latxa-Llama-3.1-70B-Instruct-exp_2_101.sh - Merge and transfer checkpoints
After training, merge distributed checkpoints and transfer to the target server usingmerge_weights.shandrsync_weights.sh.
Training Configs
- All configs are in model_training/train_configs/.
- Datasets are in JSONL format, with user/assistant conversations.
- Example config: exp_1_010_fixed.yaml
Evaluation
Our evaluation framework combines two complementary approaches to provide comprehensive model assessment: benchmark-based evaluation and human preferences.
See evaluation/README.md for full instructions, environment setup, and troubleshooting.
Static Benchmarks
- 29 benchmarks across Basque, English, and Spanish
- 6 categories: reading comprehension, common sense, linguistic proficiency, knowledge, math & reasoning, and bias
- Automated evaluation using our fork of LM Evaluation Harness
- Full configuration files and results included
Human Evaluation: Ebaluatoia arena
- Community-driven evaluation with 1,285 Basque speakers
- 12,890 preference annotations collected over 14 days
- Multi-dimensional assessment: content quality, linguistic quality, and overall preference
- 21 models evaluated including our variants, GPT-4o, and Claude 3.5 Sonnet
- Complete arena platform implementation provided
A note on model variants and their nomenclature
We use different naming conventions in the paper versus the codebase for clarity and simplicity respectively.
Paper Nomenclature
In the paper, we use descriptive names like:
- = Llama 3.1 Instruct + Basque Corpus + English Instructions
- = Latxa Base + Basque Corpus + English & Basque Instructions
Codebase Nomenclature
In the actual code, models, and evaluation results, we use a simplified format
exp_{BACKBONE}_{CORPUS_EU}{INST_EU}{INST_EN}
where:
-
BACKBONE:
0= (Llama 3.1 8B)1= (Latxa 3.1 8B)2= (Llama 3.1 8B Instruct)
-
CORPUS_EU / INST_EU / INST_EN:
0= not used1= used
Examples:
exp_1_101= + Basque Corpus + English Instructionsexp_2_111= + Basque Corpus + Basque Instructions + English Instructions
Model Repositories
All 8B model variants are available as branches in the HuggingFace repository:
- Main model: https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct
- Example variant: https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct/tree/exp_0_010
Citation
If you use Latxa-Instruct, please cite:
@misc{sainz2025instructing,
title = {Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque},
author = {Oscar Sainz and Naiara Perez and Julen Etxaniz and Joseba Fernandez de Landa and Itziar Aldabe and Iker García-Ferrero and Aimar Zabala and Ekhi Azurmendi and German Rigau and Eneko Agirre and Mikel Artetxe and Aitor Soroa},
year = {2025},
eprint = {TBP},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
Acknowledgements
This work has been partially supported by the Basque Government (IKER-GAITU project), the Ministerio para la Transformación Digital y de la Función Pública (EU – NextGenerationEU, 2022/TL22/00215335), and trained on the Leonardo supercomputer at CINECA under EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.