PROJECT_CONTEXT.md

May 21, 2026

This file gives contributors and AI assistants the shortest accurate map of the current SigerLM repository.

Summary

SigerLM is a custom Python LLM framework built around a State Space Model/Mamba-like architecture. The project is moving toward a general LM workflow while using Bahasa Lampung as a domain adapter and training testbed.

The system currently supports:

  • custom SSM language model
  • modality-agnostic backbone entrypoint through projected embeddings
  • hybrid tokenizer selection
  • base training pipeline
  • config-driven LoRA fine-tuning
  • one-command LoRA curriculum runner
  • unified instruction corpus builder
  • Lampung dataset extraction and instruction generation
  • lookup-first Lampung inference
  • general/domain routing in CLI
  • config-driven engineering evaluation harness
  • evaluation and optimization scaffolding

Architecture Snapshot

data sources
  -> domain builders or dataset registry
  -> unified instruction corpus
  -> tokenizer
  -> base model / LoRA training / LoRA curriculum
  -> merged checkpoint
  -> inference router
  -> general chat or Lampung tools

Core Principle

The model architecture is general. Lampung is implemented as a domain layer:

  • data: data/lampung/, tools/build_lampung_dataset.py
  • instruction tasks: tools/build_instruction_dataset.py
  • lookup/rules: retrieval/
  • runtime pipeline: inference/lampung_pipeline.py
  • auto-routing: inference/router.py

Do not put domain-specific Lampung behavior into model/.

Do not put modality-specific preprocessing, decoder heads, or task losses into the SSM core. Non-text inputs should be projected by modalities/ adapters and passed into SigerLM.forward_hidden(inputs_embeds=...). The roadmap is in docs/MODALITY_AGNOSTIC_BACKBONE.md.
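The adapter boundary described above can be sketched without any dependencies. This is only an illustration: ModalityAdapter, LinearAdapter, and project are hypothetical names, not the actual contents of modalities/base.py, and a real adapter would presumably be a tensor module rather than nested lists. The point is the contract: whatever the modality, the adapter emits one d_model-sized vector per timestep, and that sequence is what would be handed to SigerLM.forward_hidden(inputs_embeds=...).

```python
import random
from typing import List, Protocol

Vector = List[float]


class ModalityAdapter(Protocol):
    """Hypothetical interface: raw modality features in, d_model-sized embeddings out."""

    def project(self, features: List[Vector]) -> List[Vector]: ...


class LinearAdapter:
    """Toy adapter: a single bias-free linear map from feature_dim to d_model."""

    def __init__(self, feature_dim: int, d_model: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.feature_dim = feature_dim
        self.d_model = d_model
        # weight[i][j] maps input feature j to output channel i.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(feature_dim)]
            for _ in range(d_model)
        ]

    def project(self, features: List[Vector]) -> List[Vector]:
        # Reject inputs whose width does not match the adapter.
        for step in features:
            if len(step) != self.feature_dim:
                raise ValueError(
                    f"expected {self.feature_dim} features, got {len(step)}"
                )
        # One output vector per timestep, each of length d_model.
        return [
            [sum(w * x for w, x in zip(row, step)) for row in self.weight]
            for step in features
        ]


# Three timesteps of 4-dimensional features projected to a toy d_model of 8.
adapter = LinearAdapter(feature_dim=4, d_model=8)
inputs_embeds = adapter.project([[0.1, 0.2, 0.3, 0.4]] * 3)
print(len(inputs_embeds), len(inputs_embeds[0]))  # 3 8
```

Keeping preprocessing and projection inside the adapter means the SSM core only ever sees a sequence of embeddings, which is the separation this section asks for.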

Important Modules

config/model_config.py
model/siger_model.py
model/ssm_block.py
model/ssm_core.py
modalities/base.py
modalities/registry.py
tokenizer/hybrid_tokenizer.py
training/dataset.py
training/dataset_registry.py
tools/build_instruction_corpus.py
lora/config.py
lora/dataset.py
lora/run_lora.py
inference/generator.py
inference/chat.py
inference/router.py
inference/lampung_pipeline.py
retrieval/instruction_lookup.py
retrieval/compositional_translator.py
evaluation/run_harness.py
evaluation/harness/runner.py

Config Files

configs/datasets/lampung_instruction.json
configs/datasets/general_instruction.json
configs/datasets/curriculum_stage1_foundation.json
configs/datasets/curriculum_stage2_general.json
configs/datasets/curriculum_stage3_advanced.json
configs/datasets/curriculum_stage4_full.json
configs/training/lampung_lora.json
configs/training/general_lora.json
configs/training/lora_curriculum.json
configs/training/curriculum_stage*_lora.json
configs/evaluation/harness_smoke.json
configs/evaluation/harness_dataset_only.json

Current Dataset State

Latest verified build:

data/lampung/processed/percakapan_1000_pairs.jsonl: 3100 rows
data/lampung/processed/compositional_pairs.jsonl: 1968 rows
data/lampung/final/train.jsonl: 4325 rows
data/lampung/final/valid.jsonl: 541 rows
data/lampung/final/test.jsonl: 541 rows
data/lampung/final/train_augmented_instruction.jsonl: 32059 rows
data/corpus/lampung_instruction_train.jsonl: 30701 rows
data/corpus/general_instruction_train.jsonl: 30704 rows
data/corpus/kaggle_local_inputs_train.jsonl: 51969 rows
data/corpus/curriculum_stage1_foundation_train.jsonl: 84605 rows
data/corpus/curriculum_stage2_general_train.jsonl: 186672 rows
data/corpus/curriculum_stage3_advanced_train.jsonl: 218596 rows
data/corpus/curriculum_stage4_full_train.jsonl: 218596 rows
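The counts above can be re-checked with a small helper. The sketch below is not a repository tool; count_jsonl_rows and split_fractions are hypothetical names, and it assumes a "row" means one non-blank line holding valid JSON. It also shows that the listed train/valid/test sizes correspond to roughly an 80/10/10 split.

```python
import json
from pathlib import Path
from typing import Dict


def count_jsonl_rows(path: Path) -> int:
    """Count non-blank lines, checking that each one parses as JSON."""
    rows = 0
    with path.open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are not rows
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
            rows += 1
    return rows


def split_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {name: round(n / total, 3) for name, n in counts.items()}


# Row counts copied from the list above (data/lampung/final/).
final_split = {"train": 4325, "valid": 541, "test": 541}
print(split_fractions(final_split))
# {'train': 0.8, 'valid': 0.1, 'test': 0.1}
```

To verify a file on disk, call count_jsonl_rows(Path("data/lampung/final/train.jsonl")) from the repository root and compare the result with the number listed here.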

general_instruction_train.jsonl is still dominated by Lampung data because the local general text files are small: at 30704 rows it is barely larger than lampung_instruction_train.jsonl (30701 rows). To make SigerLM more general, add larger instruction, chat, and text sources to configs/datasets/general_instruction.json.

Main Commands

Build Lampung data:

python tools\extract_percakapan_pdf.py
python tools\build_compositional_lampung_dataset.py
python tools\build_lampung_dataset.py
python tools\build_instruction_dataset.py

Build unified corpora:

python tools\build_instruction_corpus.py --registry configs\datasets\lampung_instruction.json
python tools\build_instruction_corpus.py --registry configs\datasets\general_instruction.json

Train LoRA:

python lora\run_lora.py --config configs\training\lampung_lora.json
python lora\run_lora.py --config configs\training\general_lora.json
python train_pipeline.py --mode lora-curriculum

Automatic Dense -> MoE -> LoRA pipeline:

python train_pipeline.py --mode auto --dry-run
python train_pipeline.py --mode auto --dense-profile moe_dense_base --moe-profile small_moe

The default auto pipeline runs moe_dense_base (d_model=384, n_layers=10) before small_moe so that the Dense -> MoE warm-start has compatible tensor shapes. Do not swap in siger_medium for the dense stage unless the MoE profile is also changed to the same base shape.
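The shape constraint can be made explicit with a pre-flight check. The following is a sketch under stated assumptions, not code from train_pipeline.py: Profile and warm_start_mismatches are hypothetical names, small_moe is assumed to share the moe_dense_base shape (the text implies this but does not list its values), and the siger_medium numbers are placeholders chosen only to demonstrate a mismatch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Profile:
    """Only the fields that matter for warm-start shape compatibility."""

    name: str
    d_model: int
    n_layers: int


def warm_start_mismatches(dense: Profile, moe: Profile) -> List[str]:
    """List every base-shape field on which the two profiles disagree."""
    problems = []
    if dense.d_model != moe.d_model:
        problems.append(
            f"d_model: {dense.name}={dense.d_model}, {moe.name}={moe.d_model}"
        )
    if dense.n_layers != moe.n_layers:
        problems.append(
            f"n_layers: {dense.name}={dense.n_layers}, {moe.name}={moe.n_layers}"
        )
    return problems


# Values stated in this document.
moe_dense_base = Profile("moe_dense_base", d_model=384, n_layers=10)
# ASSUMPTION: small_moe uses the same base shape as moe_dense_base.
small_moe = Profile("small_moe", d_model=384, n_layers=10)
# PLACEHOLDER numbers, not taken from the repository.
siger_medium = Profile("siger_medium", d_model=512, n_layers=12)

print(warm_start_mismatches(moe_dense_base, small_moe))  # []
for problem in warm_start_mismatches(siger_medium, small_moe):
    print("mismatch ->", problem)
```

Running a check like this before the dense stage starts would surface an incompatible profile pair up front rather than when the MoE stage tries to load the dense weights.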

Preview automatic LoRA curriculum without training:

python train_pipeline.py --mode lora-curriculum --dry-run

Run CLI:

python chat_cli.py

Run engineering harness:

python evaluation\run_harness.py --config configs\evaluation\harness_smoke.json --only dataset_fixture_audit
python evaluation\run_harness.py --config configs\evaluation\harness_smoke.json --checkpoint checkpoints\lora\model_general_merged.pt

CLI modes:

0 auto/general router
1 Lampung O -> Indonesia
2 Indonesia -> Lampung O
3 Lampung O -> English
4 Lampung reasoning
5 general chat
6 Lampung word order

Latest Smoke Result

Mode: 0
Input: Nyak haga mengan manuk di warung paghek jalan
Assistant: aku mau makan ayam di warung dekat jalan
Route: lampung_to_id
Source: exact instruction lookup
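The smoke result reflects the lookup-first behaviour: an exact match in the instruction table is returned before any model generation is attempted. A minimal sketch of that idea follows. It is not the code in retrieval/instruction_lookup.py or inference/router.py; normalize, lookup_first, fake_model, and the "model generation" source label are hypothetical, and the single table entry is the pair from the smoke result above.

```python
from typing import Callable, Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs match."""
    return " ".join(text.lower().split())


def lookup_first(
    text: str,
    table: Dict[str, str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (answer, source): the table entry if present, else the fallback."""
    key = normalize(text)
    if key in table:
        return table[key], "exact instruction lookup"
    return fallback(text), "model generation"


# One entry, taken from the smoke result above.
table = {
    normalize("Nyak haga mengan manuk di warung paghek jalan"):
        "aku mau makan ayam di warung dekat jalan",
}


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real generator call.
    return f"<generated for: {prompt}>"


# Extra whitespace in the input is absorbed by normalize().
answer, source = lookup_first(
    "Nyak haga mengan manuk di  warung paghek jalan", table, fake_model
)
print(answer)  # aku mau makan ayam di warung dekat jalan
print(source)  # exact instruction lookup
```

Returning the source alongside the answer mirrors the "Source:" line in the smoke output and makes it easy to see when the model was bypassed.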

Current Priorities

  1. Expand general instruction/chat corpora.
  2. Keep Lampung domain as an adapter, not the whole architecture.
  3. Train and evaluate the easy-to-hard LoRA curriculum once the base checkpoint and tokenizer are aligned.
  4. Add small automated tests for corpus builder, router, and lookup.
  5. Improve evaluation coverage for Lampung ID/EN, general chat, and code generation through the harness.
  6. Keep CPU/VPS memory use under control.