PocketGuide
Offline-capable travel assistant language model built with evaluation-first discipline, synthetic data generation, and production-ready architecture. Currently in active development with core infrastructure complete and data pipeline operational.
TL;DR
- Offline-first LLM for travel guidance (visa, customs, budgeting, itineraries) with no external dependencies
- Evaluation-driven development with 72-example benchmark suite and objective metrics
- Synthetic data pipeline with OpenRouter teacher models, rate limiting, and cost-controlled fallback chain
- Structured output contracts with schema validation (v0 envelope + v1 payloads)
- Production engineering with 169 passing tests, Makefile commands, and reproducible provenance tracking
Status: Foundation and evaluation infrastructure complete. Currently generating synthetic training data for model adaptation.
What I Built
Project Foundation
Established clean repository structure with stable Makefile commands, configuration management, and end-to-end pipeline placeholders. The architecture supports deterministic workflows with environment-based configuration via .env files and typed error handling throughout.
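A minimal sketch of that configuration pattern, with illustrative names rather than the repository's actual API:

import os
from dataclasses import dataclass

class ConfigError(ValueError):
    """Raised when required configuration is missing (typed error handling)."""

@dataclass(frozen=True)
class Settings:
    openrouter_api_key: str
    requests_per_minute: int = 15

def load_settings() -> Settings:
    # Values come from the process environment, populated via .env.
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise ConfigError("OPENROUTER_API_KEY is not set")
    # POCKETGUIDE_RPM is a hypothetical override; default matches the 15 RPM limit.
    return Settings(
        openrouter_api_key=key,
        requests_per_minute=int(os.environ.get("POCKETGUIDE_RPM", "15")),
    )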
Evaluation Framework
Built comprehensive evaluation system with 72-example benchmark suite covering 7 travel categories (visa, customs, health, budget, itinerary, safety, local culture) across 3 difficulty levels. The framework includes schema-first validation, strict/lenient parsing modes, and automated report generation with metrics summary, failure analysis, and curated examples. Baseline evaluation of the pre-adaptation model established reference metrics for all future improvements.
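As an illustration of how such metrics can be summarized, a sketch that computes per-category pass rates from a JSONL results file (the field names are assumptions, not the suite's actual schema):

import json
from collections import Counter
from pathlib import Path

def pass_rates_by_category(results_path: Path) -> dict[str, float]:
    # One evaluated example per line, e.g. {"category": "visa", "passed": true, ...}
    passed: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for line in results_path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        total[record["category"]] += 1
        passed[record["category"]] += int(record["passed"])
    return {category: passed[category] / total[category] for category in total}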
Behavioral Contracts
Implemented standard envelope schema (v0) for consistent response structure and specialized payload schemas (v1) for different travel query types. The validation engine enforces objective contract compliance with deterministic pass/fail criteria, enabling measurable improvements during model adaptation.
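In outline, the envelope check might look like the following sketch; the key names and payload types are placeholders, since the real v0/v1 schemas live in the repo:

REQUIRED_ENVELOPE_KEYS = {"schema_version", "type", "payload"}
PAYLOAD_TYPES = {"destination_brief", "budget", "itinerary", "packing"}  # assumed names

def validate_envelope(obj: dict) -> list[str]:
    # Deterministic pass/fail: an empty error list means the contract holds.
    errors: list[str] = []
    missing = REQUIRED_ENVELOPE_KEYS - obj.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if obj.get("schema_version") != "v0":
        errors.append("schema_version must be 'v0'")
    if obj.get("type") not in PAYLOAD_TYPES:
        errors.append(f"unknown payload type: {obj.get('type')!r}")
    return errors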
Synthetic Data Engine
Designed and implemented teacher-driven data generation pipeline with versioned prompt templates covering 4 payload types. The system uses OpenRouter as the backend with a cost-controlled fallback chain (2 free models → 1 paid model) and 15 RPM rate limiting with exponential backoff. The dataset spec targets 120 examples across categories and difficulty levels, planned by a deterministic prompt-planner CLI.
Features include a dry-run mode for testing, resume support for interrupted runs, append-only JSONL output with full provenance tracking (config snapshots, run metadata, reproducible hashing), and pass-rate statistics by type, category, difficulty, and error. The teacher provider includes typed error handling, full observability (tokens, latency, fallback tracking), and an optional fallback_to_paid flag for cost control.
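To show what deterministic planning means here, a sketch that round-robins over category × difficulty cells so repeated runs yield an identical 120-item plan (the policy and names are assumptions):

from itertools import product

CATEGORIES = ["visa", "customs", "health", "budget", "itinerary", "safety", "local_culture"]
DIFFICULTIES = ["easy", "medium", "hard"]
TARGET_EXAMPLES = 120  # from the dataset spec

def plan_prompts() -> list[dict]:
    # 7 categories x 3 difficulties = 21 cells, cycled without randomness.
    cells = list(product(CATEGORIES, DIFFICULTIES))
    return [
        {"index": i, "category": cells[i % len(cells)][0], "difficulty": cells[i % len(cells)][1]}
        for i in range(TARGET_EXAMPLES)
    ]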
Tech Stack
- Python 3.11+: Core language with type hints and Google-style docstrings
- OpenRouter: Teacher model backend for synthetic data generation with multi-model fallback
- Ruff: Linting and formatting for code quality
- pytest: Testing framework with 169 tests covering data generation, validation, and evaluation
- JSONL: Append-only data format with provenance metadata and schema versioning
Current Status
✅ Completed (Milestones 0-3)
- Foundation: Clean repository, stable commands, configuration files, end-to-end pipeline
- Baseline Evaluation: Benchmark suite, evaluation framework, base model metrics, automated reporting
- Behavioral Contracts: Envelope schema, payload schemas, validation engine, compliance measurement
- Synthetic Data Engine: Prompt templates, dataset spec, teacher provider with rate limiting and fallback, draft generation CLI with provenance tracking and pass-rate statistics
🔄 In Progress (Milestone 4)
- Data Quality & Splits: Implementing deduplication, balancing, rejection filters, and leakage prevention. Creating a clean held-out benchmark split separate from the training data.
📋 Planned (Milestones 5-9)
- Model Adaptation (M5): LoRA/QLoRA fine-tuning on cleaned synthetic dataset with experiment tracking
- Rigorous Evaluation (M6): Base vs adapted model comparison with objective metrics and qualitative examples
- Evidence-Driven Iteration (M7): Targeted fixes based on failure analysis, retrain, re-evaluate
- Deployment Realism (M8): Model quantization, packaging for local/offline inference, resource constraints documentation
- Portfolio Finalization (M9): Polish documentation, demo, results summary, limitations, safety considerations
Engineering Decisions
Evaluation-First Development
The project follows rigorous evaluation discipline before any model adaptation. Baseline metrics from the open-source student model establish a reference point. Synthetic data generation is validated against schemas before training. All improvements will be measured objectively against held-out benchmarks with reproducible metrics and provenance tracking.
Cost-Controlled Data Generation
The teacher provider implements a fallback chain from free models to a paid model, with an explicit fallback_to_paid flag for cost control. Rate limiting (15 RPM) and exponential backoff prevent API throttling. A dry-run mode enables testing without API costs. All generation runs include token usage tracking and fallback statistics for budget visibility.
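The pattern, in outline (model IDs, exception types, and call_model are placeholders for the real provider code):

import time

class RateLimitError(Exception): ...
class ModelError(Exception): ...

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # stands in for the actual OpenRouter request

FREE_MODELS = ["free-model-a", "free-model-b"]  # placeholder IDs
PAID_MODEL = "paid-model"                       # placeholder ID
MIN_INTERVAL = 60.0 / 15                        # 15 requests per minute

def generate(prompt: str, fallback_to_paid: bool = False) -> str:
    chain = FREE_MODELS + ([PAID_MODEL] if fallback_to_paid else [])
    for model in chain:
        for attempt in range(4):
            time.sleep(MIN_INTERVAL)         # naive rate limiting
            try:
                return call_model(model, prompt)
            except RateLimitError:
                time.sleep(2 ** attempt)     # exponential backoff, then retry
            except ModelError:
                break                        # skip to the next model in the chain
    raise RuntimeError("all models in the fallback chain failed")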
Reproducibility & Provenance
Every generated example includes full provenance: config snapshot, prompt hash, teacher model ID, generation timestamp, token counts. Append-only JSONL format prevents data loss. Deterministic prompt planning ensures consistent category/difficulty distribution. All evaluation runs saved with metadata for exact reproduction.
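An illustrative record and append-only write (key names are modeled on the description above, not the exact schema):

import hashlib
import json
import time

def provenance_record(prompt: str, config: dict, model_id: str, usage: dict) -> dict:
    return {
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "config_snapshot": config,  # frozen copy of the run configuration
        "teacher_model": model_id,
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "usage": usage,             # e.g. prompt/completion token counts
    }

def append_jsonl(path: str, record: dict) -> None:
    # Append-only: one JSON object per line; existing lines are never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")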
Clean Architecture & Testing
Modular separation: inference, evaluation, data generation, teacher providers. 169 tests cover validation logic, data pipeline, and evaluation framework. Makefile provides stable commands (make test, make eval, make data) for consistent developer experience. Type hints and docstrings throughout.
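For flavor, the kind of contract test the suite contains, written here against the validate_envelope sketch above (hypothetical, not copied from the repo):

def test_envelope_rejects_missing_payload():
    bad = {"schema_version": "v0", "type": "budget"}  # no "payload" key
    errors = validate_envelope(bad)
    assert any("missing keys" in e for e in errors)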
Development Workflow
Setup
git clone <repository-url>
cd pocket-guide
make env # Create virtual environment
source .venv/bin/activate
make test # Run test suite
cp .env.example .env # Configure API keys
Synthetic Data Generation
# Dry-run (no API calls)
python -m pocketguide.teachers.smoke --dry-run
# Real generation with free models only
python -m pocketguide.data_generation.draft
# Allow paid fallback if free models fail
python -m pocketguide.data_generation.draft --fallback_to_paid
Evaluation
make eval # Run full benchmark suite
python -m pocketguide.eval.report_base_v0 --run_dir runs/eval/<timestamp>
Code Quality
make lint # Check linting
make format # Auto-format
Results (Baseline)
Baseline evaluation was completed on the open-source student model before adaptation. The metrics provide a reference point for measuring improvements after fine-tuning.
Detailed results and failure analysis available in evaluation reports under runs/eval/. Full comparison with adapted model will be documented once Milestone 5 (Model Adaptation) is complete.
Safety & Limitations
Current Limitations
This is a demonstration system in active development. The model has not been adapted yet and operates with baseline open-source capabilities. Output quality and reliability are expected to improve after synthetic-data fine-tuning in Milestone 5.
Intended Use
PocketGuide provides general travel guidance and should not replace official government sources for visa requirements, health advisories, or safety warnings. Users must verify critical information independently.
Future Hardening
Planned safety improvements include evidence attribution, uncertainty quantification, and explicit abstention when information reliability is low. Production deployment (Milestone 8) will include additional safeguards and resource constraint documentation.