Machine Learning, LLM, Evaluation, Fine-tuning, Structured Outputs

Building a Domain-Adapted LLM for Travel Guidance

PocketGuide started from a simple frustration: most travel assistants sound helpful until you need dependable answers in a real situation. They can be fluent, but they are often inconsistent about structure, overconfident about uncertain details, and unusable when you lose connectivity. I wanted to build something narrower but more reliable.

So this project was never about making the biggest or most general assistant. It was about building a domain-adapted LLM that behaves predictably, communicates uncertainty, and can run offline on consumer hardware.

Why travel guidance

Travel planning is a strong domain for testing structured generation because the output types are concrete: itineraries, checklists, decision steps, and procedures. If the format breaks, it is obvious. If uncertainty is hidden, it matters.

It is also practical. A lot of tools assume stable internet and low stakes. Real trips do not. You may have poor connectivity, expensive roaming, and questions where a wrong answer is costly. PocketGuide was designed with that reality in mind.

Evaluation before training

I deliberately reversed the order that many LLM projects follow. Before training anything, I built the evaluation harness and wrote strict output contracts.

Every response had to fit a JSON envelope with required fields (summary, assumptions, uncertainty notes, verification steps), and each query type had a typed payload schema. Then I fixed a 20-prompt benchmark to track parse success, schema compliance, and uncertainty-marker presence across every training iteration.
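To make the envelope contract concrete, here is a minimal sketch of a checker for it. The JSON key names (including a "payload" key for the typed per-query schema) are assumptions for illustration; the project's actual keys and schema tooling may differ.

```python
import json

# Hypothetical key names: the four envelope fields described above,
# plus an assumed "payload" key holding the typed, per-query-type data.
REQUIRED_FIELDS = {
    "summary": str,
    "assumptions": list,
    "uncertainty_notes": list,
    "verification_steps": list,
    "payload": dict,
}


def check_envelope(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means compliant."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems
```

Returning a list of violations rather than a single boolean keeps each failure visible, which matters when benchmark failures are what drive the next training iteration.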

The untuned 7B baseline gave me a clear starting point: 80% parse success, 85% uncertainty-marker presence, and almost no envelope compliance. Those numbers were not impressive, but they were stable and measurable, which is exactly what I needed.
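The three tracked metrics can be computed in a single pass over the model's raw responses. The sketch below is illustrative only: the marker phrases and envelope key names are assumptions, not the project's actual lists.

```python
import json

# Hypothetical marker phrases and key names, chosen for illustration.
UNCERTAINTY_MARKERS = ("may ", "might ", "verify", "not certain", "check with")
ENVELOPE_KEYS = {"summary", "assumptions", "uncertainty_notes", "verification_steps"}


def score(responses: list[str]) -> dict[str, float]:
    """Compute parse success, envelope compliance, and uncertainty-marker rates."""
    if not responses:
        raise ValueError("need at least one response to score")

    parsed = compliant = hedged = 0
    for raw in responses:
        # Marker presence is checked on the raw text, parseable or not.
        if any(marker in raw.lower() for marker in UNCERTAINTY_MARKERS):
            hedged += 1
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        parsed += 1
        if isinstance(data, dict) and ENVELOPE_KEYS.issubset(data):
            compliant += 1

    n = len(responses)
    return {
        "parse_success": parsed / n,
        "envelope_compliance": compliant / n,
        "uncertainty_markers": hedged / n,
    }
```

Running the same function over the same fixed prompts after every iteration is what makes the numbers comparable from one run to the next.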

Synthetic data and teacher models

Instead of chasing volume, I focused on data quality. I generated a synthetic instruction set via teacher models (through OpenRouter), intentionally covering visa, customs, health, budget, and itinerary scenarios across regions and difficulty tiers.

The first dataset only had 120 examples, which is small for LLM fine-tuning, but each sample was filtered, deduplicated, and tracked with provenance. That tradeoff was deliberate: better signal density over raw scale.
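As a rough illustration of deduplication that preserves provenance, the sketch below hashes a normalized prompt and keeps the first occurrence. The record fields ("prompt", "teacher") are hypothetical, and exact-match hashing is an assumption; the project may use a different matching rule.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(examples: list[dict]) -> list[dict]:
    """Keep the first example per normalized prompt and record its hash."""
    seen = set()
    kept = []
    for example in examples:
        digest = hashlib.sha256(
            normalize(example["prompt"]).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # The hash travels with the record so later runs can trace it.
        kept.append({**example, "prompt_sha256": digest})
    return kept
```

Storing the hash next to the teacher name lets any training example be traced back to how it was generated.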

Five iterations of adaptation

I ran five LoRA iterations on Llama-2-7B, and each pass was guided by benchmark failures rather than intuition.

Across iterations, parse success moved from 80% -> 95% -> 100%. Uncertainty markers reached 100% early and stayed there. Envelope compliance improved more slowly, peaking around 20% and then fluctuating.

That pattern was useful. It showed that better data and measured iteration can dramatically improve reliability signals, but full schema adherence likely needs architectural support (for example constrained decoding), not just longer fine-tuning.

What the system does now

Today the system runs in two practical modes: HF+PEFT for experimentation and llama.cpp (Q4_K_M) for offline usage. Runtime validation supports strict and lenient parsing, and evaluation reports are timestamped and reproducible. In other words, results are not just better; they are traceable.
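One plausible way to separate strict from lenient parsing is sketched below. The function names are hypothetical, and the lenient rule (take the first decodable JSON object embedded in the text) is an assumption about what "lenient" means here.

```python
import json


def parse_strict(raw: str) -> dict:
    """Accept only output that is exactly one JSON object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def parse_lenient(raw: str) -> dict:
    """Fall back to the first JSON object embedded in surrounding text."""
    try:
        return parse_strict(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass

    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            data, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            start = raw.find("{", start + 1)
            continue
        return data
    raise ValueError("no JSON object found")
```

Strict mode suits benchmarking, where any stray text should count as a failure. Lenient mode suits runtime use, where recovering a usable answer matters more.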

Why this is a good stopping point

PocketGuide reached the milestone I set for this phase: a coherent, measurable, offline-capable system with 100% JSON parse success and reliable uncertainty signaling. Full contract compliance is still partial, but now the limitation is clear and specific rather than vague.

What comes next

The next steps are straightforward: constrained decoding, objectives that weight required fields, and service-level packaging. But even now, the project demonstrates the core point I wanted to prove: evaluation-first domain adaptation, explicit output contracts, and evidence-driven iteration produce reliability gains that can be measured rather than guessed at.
