PocketGuide
Offline-capable travel assistant language model built with evaluation-first discipline, synthetic data generation, and production-ready architecture. Currently in active development with core infrastructure complete and data pipeline operational.
TL;DR
- Offline-first LLM for travel guidance (visa, customs, budgeting, itineraries) with no external dependencies
- Evaluation-driven development with 72-example benchmark suite and objective metrics
- Synthetic data pipeline with OpenRouter teacher models, rate limiting, and cost-controlled fallback chain
- Structured output contracts with schema validation (v0 envelope + v1 payloads)
- Production engineering with 169 passing tests, Makefile commands, and reproducible provenance tracking
Status: Foundation and evaluation infrastructure complete. Currently generating synthetic training data for model adaptation.
What I Built
Project Foundation
Established clean repository structure with stable Makefile commands, configuration management, and end-to-end pipeline placeholders. The architecture supports deterministic workflows with environment-based configuration via .env files and typed error handling throughout.
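A minimal sketch of that configuration pattern, with illustrative names rather than the repository's actual API:

import os
from dataclasses import dataclass

class ConfigError(ValueError):
    """Raised when required configuration is missing (typed error handling)."""

@dataclass(frozen=True)
class Settings:
    openrouter_api_key: str
    requests_per_minute: int = 15

def load_settings() -> Settings:
    # Values come from the process environment, populated via .env.
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise ConfigError("OPENROUTER_API_KEY is not set")
    # POCKETGUIDE_RPM is a hypothetical override; default matches the 15 RPM limit.
    return Settings(
        openrouter_api_key=key,
        requests_per_minute=int(os.environ.get("POCKETGUIDE_RPM", "15")),
    )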
Evaluation Framework
Built comprehensive evaluation system with 72-example benchmark suite covering 7 travel categories (visa, customs, health, budget, itinerary, safety, local culture) across 3 difficulty levels. The framework includes schema-first validation, strict/lenient parsing modes, and automated report generation with metrics summary, failure analysis, and curated examples. Baseline evaluation of the pre-adaptation model established reference metrics for all future improvements.
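As an illustration of how such metrics can be summarized, a sketch that computes per-category pass rates from a JSONL results file (the field names are assumptions, not the suite's actual schema):

import json
from collections import Counter
from pathlib import Path

def pass_rates_by_category(results_path: Path) -> dict[str, float]:
    # One evaluated example per line, e.g. {"category": "visa", "passed": true, ...}
    passed: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for line in results_path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        total[record["category"]] += 1
        passed[record["category"]] += int(record["passed"])
    return {category: passed[category] / total[category] for category in total}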
Behavioral Contracts
Implemented standard envelope schema (v0) for consistent response structure and specialized payload schemas (v1) for different travel query types. The validation engine enforces objective contract compliance with deterministic pass/fail criteria, enabling measurable improvements during model adaptation.
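In outline, the envelope check might look like the following sketch; the key names and payload types are placeholders, since the real v0/v1 schemas live in the repo:

REQUIRED_ENVELOPE_KEYS = {"schema_version", "type", "payload"}
PAYLOAD_TYPES = {"destination_brief", "budget", "itinerary", "packing"}  # assumed names

def validate_envelope(obj: dict) -> list[str]:
    # Deterministic pass/fail: an empty error list means the contract holds.
    errors: list[str] = []
    missing = REQUIRED_ENVELOPE_KEYS - obj.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if obj.get("schema_version") != "v0":
        errors.append("schema_version must be 'v0'")
    if obj.get("type") not in PAYLOAD_TYPES:
        errors.append(f"unknown payload type: {obj.get('type')!r}")
    return errors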
Synthetic Data Engine
Designed and implemented teacher-driven data generation pipeline with versioned prompt templates covering 4 payload types. The system uses OpenRouter as the backend with a cost-controlled fallback chain (2 free models → 1 paid model) and 15 RPM rate limiting with exponential backoff. The dataset spec targets 120 examples across categories and difficulty levels, planned by a deterministic prompt-planner CLI.
Features include a dry-run mode for testing, resume support for interrupted runs, append-only JSONL output with full provenance tracking (config snapshots, run metadata, reproducible hashing), and pass-rate statistics by type, category, difficulty, and error. The teacher provider includes typed error handling, full observability (tokens, latency, fallback tracking), and an optional fallback_to_paid flag for cost control.
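To show what deterministic planning means here, a sketch that round-robins over category × difficulty cells so repeated runs yield an identical 120-item plan (the policy and names are assumptions):

from itertools import product

CATEGORIES = ["visa", "customs", "health", "budget", "itinerary", "safety", "local_culture"]
DIFFICULTIES = ["easy", "medium", "hard"]
TARGET_EXAMPLES = 120  # from the dataset spec

def plan_prompts() -> list[dict]:
    # 7 categories x 3 difficulties = 21 cells, cycled without randomness.
    cells = list(product(CATEGORIES, DIFFICULTIES))
    return [
        {"index": i, "category": cells[i % len(cells)][0], "difficulty": cells[i % len(cells)][1]}
        for i in range(TARGET_EXAMPLES)
    ]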
Tech Stack
- Python 3.11+: Core language with type hints and Google-style docstrings
- OpenRouter: Teacher model backend for synthetic data generation with multi-model fallback
- Ruff: Linting and formatting for code quality
- pytest: Testing framework with 169 tests covering data generation, validation, and evaluation
- JSONL: Append-only data format with provenance metadata and schema versioning
Current Status
✅ Completed (Milestones 0-3)
- Foundation: Clean repository, stable commands, configuration files, end-to-end pipeline
- Baseline Evaluation: Benchmark suite, evaluation framework, base model metrics, automated reporting
- Behavioral Contracts: Envelope schema, payload schemas, validation engine, compliance measurement
- Synthetic Data Engine: Prompt templates, dataset spec, teacher provider with rate limiting and fallback, draft generation CLI with provenance tracking and pass-rate statistics
🔄 In Progress (Milestone 4)
- Data Quality & Splits: Implementing deduplication, balancing, rejection filters, and leakage prevention. Creating a clean held-out benchmark split separate from the training data.
📋 Planned (Milestones 5-9)
- Model Adaptation (M5): LoRA/QLoRA fine-tuning on cleaned synthetic dataset with experiment tracking
- Rigorous Evaluation (M6): Base vs adapted model comparison with objective metrics and qualitative examples
- Evidence-Driven Iteration (M7): Targeted fixes based on failure analysis, retrain, re-evaluate
- Deployment Realism (M8): Model quantization, packaging for local/offline inference, resource constraints documentation
- Portfolio Finalization (M9): Polish documentation, demo, results summary, limitations, safety considerations
Engineering Decisions
Evaluation-First Development
The project follows rigorous evaluation discipline before any model adaptation. Baseline metrics from the open-source student model establish a reference point. Synthetic data generation is validated against schemas before training. All improvements will be measured objectively against held-out benchmarks with reproducible metrics and provenance tracking.
Cost-Controlled Data Generation
The teacher provider implements a fallback chain from free models to a paid model, with an explicit fallback_to_paid flag for cost control. Rate limiting (15 RPM) and exponential backoff prevent API throttling. A dry-run mode enables testing without API costs. All generation runs include token usage tracking and fallback statistics for budget visibility.
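The pattern, in outline (model IDs, exception types, and call_model are placeholders for the real provider code):

import time

class RateLimitError(Exception): ...
class ModelError(Exception): ...

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # stands in for the actual OpenRouter request

FREE_MODELS = ["free-model-a", "free-model-b"]  # placeholder IDs
PAID_MODEL = "paid-model"                       # placeholder ID
MIN_INTERVAL = 60.0 / 15                        # 15 requests per minute

def generate(prompt: str, fallback_to_paid: bool = False) -> str:
    chain = FREE_MODELS + ([PAID_MODEL] if fallback_to_paid else [])
    for model in chain:
        for attempt in range(4):
            time.sleep(MIN_INTERVAL)         # naive rate limiting
            try:
                return call_model(model, prompt)
            except RateLimitError:
                time.sleep(2 ** attempt)     # exponential backoff, then retry
            except ModelError:
                break                        # skip to the next model in the chain
    raise RuntimeError("all models in the fallback chain failed")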
Reproducibility & Provenance
Every generated example includes full provenance: config snapshot, prompt hash, teacher model ID, generation timestamp, token counts. Append-only JSONL format prevents data loss. Deterministic prompt planning ensures consistent category/difficulty distribution. All evaluation runs saved with metadata for exact reproduction.
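An illustrative record and append-only write (key names are modeled on the description above, not the exact schema):

import hashlib
import json
import time

def provenance_record(prompt: str, config: dict, model_id: str, usage: dict) -> dict:
    return {
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "config_snapshot": config,  # frozen copy of the run configuration
        "teacher_model": model_id,
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "usage": usage,             # e.g. prompt/completion token counts
    }

def append_jsonl(path: str, record: dict) -> None:
    # Append-only: one JSON object per line; existing lines are never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")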
Clean Architecture & Testing
Modular separation: inference, evaluation, data generation, teacher providers. 169 tests cover validation logic, data pipeline, and evaluation framework. Makefile provides stable commands (make test, make eval, make data) for consistent developer experience. Type hints and docstrings throughout.
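For flavor, the kind of contract test the suite contains, written here against the validate_envelope sketch above (hypothetical, not copied from the repo):

def test_envelope_rejects_missing_payload():
    bad = {"schema_version": "v0", "type": "budget"}  # no "payload" key
    errors = validate_envelope(bad)
    assert any("missing keys" in e for e in errors)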
Development Workflow
Setup
git clone <repository-url>
cd pocket-guide
make env # Create virtual environment
source .venv/bin/activate
make test # Run test suite
cp .env.example .env # Configure API keys
Synthetic Data Generation
# Dry-run (no API calls)
python -m pocketguide.teachers.smoke --dry-run
# Real generation with free models only
python -m pocketguide.data_generation.draft
# Allow paid fallback if free models fail
python -m pocketguide.data_generation.draft --fallback_to_paid
Evaluation
make eval # Run full benchmark suite
python -m pocketguide.eval.report_base_v0 --run_dir runs/eval/<timestamp>
Code Quality
make lint # Check linting
make format # Auto-format
Results (Baseline)
Baseline evaluation was completed on the open-source student model before adaptation. The metrics provide a reference point for measuring improvements after fine-tuning.
Detailed results and failure analysis available in evaluation reports under runs/eval/. Full comparison with adapted model will be documented once Milestone 5 (Model Adaptation) is complete.
Safety & Limitations
Current Limitations
This is a demonstration system in active development. The model has not been adapted yet and operates with baseline open-source capabilities. Output quality and reliability are expected to improve after synthetic-data fine-tuning in Milestone 5.
Intended Use
PocketGuide provides general travel guidance and should not replace official government sources for visa requirements, health advisories, or safety warnings. Users must verify critical information independently.
Future Hardening
Planned safety improvements include evidence attribution, uncertainty quantification, and explicit abstention when information reliability is low. Production deployment (Milestone 8) will include additional safeguards and resource constraint documentation.