PROJECT_CONTEXT.md
May 21, 2026
This file gives contributors and AI assistants the shortest accurate map of the current SigerLM repository.
Summary
SigerLM is a custom Python LLM framework built around a State Space Model/Mamba-like architecture. The project is moving toward a general LM workflow while using Bahasa Lampung as a domain adapter and training testbed.
The system currently supports:
- custom SSM language model
- modality-agnostic backbone entrypoint through projected embeddings
- hybrid tokenizer selection
- base training pipeline
- config-driven LoRA fine-tuning
- one-command LoRA curriculum runner
- unified instruction corpus builder
- Lampung dataset extraction and instruction generation
- lookup-first Lampung inference
- general/domain routing in CLI
- config-driven engineering evaluation harness
- evaluation and optimization scaffolding
Architecture Snapshot
data sources
-> domain builders or dataset registry
-> unified instruction corpus
-> tokenizer
-> base model / LoRA training / LoRA curriculum
-> merged checkpoint
-> inference router
-> general chat or Lampung tools
Core Principle
The model architecture is general. Lampung is implemented as a domain layer:
- data: data/lampung/, tools/build_lampung_dataset.py
- instruction tasks: tools/build_instruction_dataset.py
- lookup/rules: retrieval/
- runtime pipeline: inference/lampung_pipeline.py
- auto-routing: inference/router.py
Do not put domain-specific Lampung behavior into model/.
Do not put modality-specific preprocessing, decoder heads, or task losses into the SSM core. Non-text inputs should be projected by modalities/ adapters and passed into SigerLM.forward_hidden(inputs_embeds=...). The roadmap is in docs/MODALITY_AGNOSTIC_BACKBONE.md.
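The boundary described above can be illustrated with a small, dependency-free sketch. It is not the repository's code: `AudioFrameAdapter`, `StubBackbone`, and the toy `D_MODEL` width are hypothetical names and values chosen for illustration, and a real adapter would presumably emit tensors rather than Python lists. Only the `forward_hidden(inputs_embeds=...)` call shape is taken from the text.

```python
from typing import List, Sequence

Vector = List[float]

D_MODEL = 4  # toy width for illustration only; real profiles are much wider


class AudioFrameAdapter:
    """Hypothetical modality adapter: projects raw feature frames to d_model."""

    def __init__(self, weights: Sequence[Sequence[float]]):
        # weights has shape (d_model, feature_dim)
        self.weights = [list(row) for row in weights]

    def project(self, frames: Sequence[Sequence[float]]) -> List[Vector]:
        # One linear projection per frame: embed[i] = sum_j W[i][j] * frame[j]
        return [
            [sum(w * x for w, x in zip(row, frame)) for row in self.weights]
            for frame in frames
        ]


class StubBackbone:
    """Stand-in for the SSM core; it only ever sees d_model-wide embeddings."""

    def forward_hidden(self, inputs_embeds: List[Vector]) -> List[Vector]:
        for vec in inputs_embeds:
            if len(vec) != D_MODEL:
                raise ValueError(f"expected width {D_MODEL}, got {len(vec)}")
        # Placeholder for real sequence mixing: a running mean over time steps.
        hidden, running = [], [0.0] * D_MODEL
        for t, vec in enumerate(inputs_embeds, start=1):
            running = [r + (v - r) / t for r, v in zip(running, vec)]
            hidden.append(list(running))
        return hidden


# Two frames of 2-dimensional features, projected to D_MODEL = 4.
adapter = AudioFrameAdapter([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])
embeds = adapter.project([[2.0, 4.0], [0.0, 2.0]])
hidden = StubBackbone().forward_hidden(inputs_embeds=embeds)
```

The point of the split is that everything modality-specific (feature extraction, projection weights) lives in the adapter, while the backbone only validates and consumes embeddings of the model width.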
Important Modules
config/model_config.py
model/siger_model.py
model/ssm_block.py
model/ssm_core.py
modalities/base.py
modalities/registry.py
tokenizer/hybrid_tokenizer.py
training/dataset.py
training/dataset_registry.py
tools/build_instruction_corpus.py
lora/config.py
lora/dataset.py
lora/run_lora.py
inference/generator.py
inference/chat.py
inference/router.py
inference/lampung_pipeline.py
retrieval/instruction_lookup.py
retrieval/compositional_translator.py
evaluation/run_harness.py
evaluation/harness/runner.py
Config Files
configs/datasets/lampung_instruction.json
configs/datasets/general_instruction.json
configs/datasets/curriculum_stage1_foundation.json
configs/datasets/curriculum_stage2_general.json
configs/datasets/curriculum_stage3_advanced.json
configs/datasets/curriculum_stage4_full.json
configs/training/lampung_lora.json
configs/training/general_lora.json
configs/training/lora_curriculum.json
configs/training/curriculum_stage*_lora.json
configs/evaluation/harness_smoke.json
configs/evaluation/harness_dataset_only.json
Current Dataset State
Latest verified build:
data/lampung/processed/percakapan_1000_pairs.jsonl: 3100 rows
data/lampung/processed/compositional_pairs.jsonl: 1968 rows
data/lampung/final/train.jsonl: 4325 rows
data/lampung/final/valid.jsonl: 541 rows
data/lampung/final/test.jsonl: 541 rows
data/lampung/final/train_augmented_instruction.jsonl: 32059 rows
data/corpus/lampung_instruction_train.jsonl: 30701 rows
data/corpus/general_instruction_train.jsonl: 30704 rows
data/corpus/kaggle_local_inputs_train.jsonl: 51969 rows
data/corpus/curriculum_stage1_foundation_train.jsonl: 84605 rows
data/corpus/curriculum_stage2_general_train.jsonl: 186672 rows
data/corpus/curriculum_stage3_advanced_train.jsonl: 218596 rows
data/corpus/curriculum_stage4_full_train.jsonl: 218596 rows
general_instruction_train.jsonl is still dominated by Lampung data because the local general text files are small. To make SigerLM more general, add larger instruction, chat, and text sources to configs/datasets/general_instruction.json.
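The row counts listed above can be re-checked after a rebuild with a few lines of standard-library Python. This is a minimal sketch, not a tool that exists in the repository: `count_jsonl_rows` and `split_fractions` are hypothetical helpers, and the sketch assumes each row is one JSON object per line.

```python
import json
from pathlib import Path
from typing import Dict


def count_jsonl_rows(path: Path) -> int:
    """Count non-empty lines, checking that each one parses as JSON."""
    rows = 0
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
            rows += 1
    return rows


def split_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each split as a fraction of the total, rounded to 3 places."""
    total = sum(counts.values())
    return {name: round(n / total, 3) for name, n in counts.items()}


# Figures copied from the data/lampung/final/ entries in the list above.
fractions = split_fractions({"train": 4325, "valid": 541, "test": 541})
```

With the listed figures, the final Lampung splits come out at roughly 80/10/10 of 5407 rows. A call such as `count_jsonl_rows(Path("data/lampung/final/train.jsonl"))` can then be compared against the number recorded here.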
Main Commands
Build Lampung data:
python tools\extract_percakapan_pdf.py
python tools\build_compositional_lampung_dataset.py
python tools\build_lampung_dataset.py
python tools\build_instruction_dataset.py
Build unified corpora:
python tools\build_instruction_corpus.py --registry configs\datasets\lampung_instruction.json
python tools\build_instruction_corpus.py --registry configs\datasets\general_instruction.json
Train LoRA:
python lora\run_lora.py --config configs\training\lampung_lora.json
python lora\run_lora.py --config configs\training\general_lora.json
python train_pipeline.py --mode lora-curriculum
Automatic Dense -> MoE -> LoRA pipeline:
python train_pipeline.py --mode auto --dry-run
python train_pipeline.py --mode auto --dense-profile moe_dense_base --moe-profile small_moe
The default auto pipeline uses moe_dense_base (d_model=384, n_layers=10) before small_moe so Dense -> MoE warm-start has compatible tensor shapes. Do not swap in siger_medium for the dense stage unless the MoE profile is also changed to the same base shape.
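The compatibility rule above amounts to a shape check between the two profiles. The sketch below is illustrative only: `Profile` and `check_warm_start` are hypothetical names, `small_moe` is assumed to share the `moe_dense_base` shape because the text calls them compatible, and the `siger_medium` numbers are placeholders, since its real shape is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    d_model: int
    n_layers: int


def check_warm_start(dense: Profile, moe: Profile) -> None:
    """Raise if the dense checkpoint cannot seed the MoE base tensors."""
    mismatches = [
        f"{field}: {getattr(dense, field)} != {getattr(moe, field)}"
        for field in ("d_model", "n_layers")
        if getattr(dense, field) != getattr(moe, field)
    ]
    if mismatches:
        raise ValueError(
            f"{dense.name} -> {moe.name} is not shape-compatible: "
            + "; ".join(mismatches)
        )


# Shape taken from the text above.
moe_dense_base = Profile("moe_dense_base", d_model=384, n_layers=10)
# Assumed to match the dense base, as the default pipeline implies.
small_moe = Profile("small_moe", d_model=384, n_layers=10)
# Placeholder values only; the real siger_medium shape is not given here.
siger_medium = Profile("siger_medium", d_model=512, n_layers=12)

check_warm_start(moe_dense_base, small_moe)  # compatible, returns None
```

Calling `check_warm_start(siger_medium, small_moe)` with these placeholder values raises a `ValueError`, which is the failure the warning above is meant to prevent before any training time is spent.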
Preview automatic LoRA curriculum without training:
python train_pipeline.py --mode lora-curriculum --dry-run
Run CLI:
python chat_cli.py
Run engineering harness:
python evaluation\run_harness.py --config configs\evaluation\harness_smoke.json --only dataset_fixture_audit
python evaluation\run_harness.py --config configs\evaluation\harness_smoke.json --checkpoint checkpoints\lora\model_general_merged.pt
CLI modes:
0 auto/general router
1 Lampung O -> Indonesia
2 Indonesia -> Lampung O
3 Lampung O -> English
4 Lampung reasoning
5 general chat
6 Lampung word order
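For reference, the mode numbers above map naturally onto an enumeration. This is a sketch under assumptions, not the code in chat_cli.py: the `CliMode` member names and the `parse_mode` helper are hypothetical, and treating empty input as mode 0 is an illustrative choice.

```python
from enum import IntEnum


class CliMode(IntEnum):
    AUTO = 0                # auto/general router
    LAMPUNG_TO_ID = 1       # Lampung O -> Indonesia
    ID_TO_LAMPUNG = 2       # Indonesia -> Lampung O
    LAMPUNG_TO_EN = 3       # Lampung O -> English
    LAMPUNG_REASONING = 4
    GENERAL_CHAT = 5
    LAMPUNG_WORD_ORDER = 6


def parse_mode(raw: str) -> CliMode:
    """Turn user input such as "2" into a CliMode; empty input means AUTO."""
    text = raw.strip()
    if not text:
        return CliMode.AUTO
    try:
        # int() rejects non-numeric text, CliMode() rejects unknown numbers;
        # both raise ValueError.
        return CliMode(int(text))
    except ValueError:
        valid = ", ".join(str(m.value) for m in CliMode)
        raise ValueError(
            f"unknown mode {raw!r}; expected one of {valid}"
        ) from None
```

Keeping the numbers in one place like this makes it harder for the CLI prompt and the router to drift apart when a mode is added.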
Latest Smoke Result
Mode: 0
Input: Nyak haga mengan manuk di warung paghek jalan
Assistant: aku mau makan ayam di warung dekat jalan
Route: lampung_to_id
Source: exact instruction lookup
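The smoke result above shows the lookup-first behavior: an exact match in the instruction data answers the query before the model is consulted. A minimal sketch of that idea follows. It is an assumption-laden illustration rather than the logic in retrieval/instruction_lookup.py: `normalize`, `lookup_first`, the `"model generation"` label, and the one-entry table are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def normalize(text: str) -> str:
    # Case-fold and collapse whitespace so trivially different inputs match.
    return " ".join(text.casefold().split())


def lookup_first(
    query: str,
    table: Dict[str, str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (answer, source); the fallback runs only on a lookup miss."""
    key = normalize(query)
    if key in table:
        return table[key], "exact instruction lookup"
    return fallback(query), "model generation"


# Single entry taken from the smoke result above; a real table would be
# loaded from the instruction data.
TABLE = {
    normalize("Nyak haga mengan manuk di warung paghek jalan"):
        "aku mau makan ayam di warung dekat jalan",
}

answer, source = lookup_first(
    "  nyak haga mengan manuk di warung paghek jalan ",
    TABLE,
    fallback=lambda q: "<generated>",  # stand-in for the model call
)
```

Returning the source alongside the answer mirrors the `Source:` line in the smoke output and makes it easy to see whether a reply came from the table or from generation.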
Current Priorities
- Expand general instruction/chat corpora.
- Keep Lampung domain as an adapter, not the whole architecture.
- Train and evaluate the easy-to-hard LoRA curriculum once the base checkpoint and tokenizer are aligned.
- Add small automated tests for corpus builder, router, and lookup.
- Improve evaluation coverage for Lampung ID/EN, general chat, and code generation through the harness.
- Keep CPU/VPS memory use under control.