This project fine-tuned OpenAI Whisper models on Air Traffic Control (ATC) audio from the LDC ATC Corpus to build specialized speech recognition for aviation communications. The system is designed for offline, on-device deployment on iPad via WhisperKit.
Whisper Small (242M params) matches the much larger Whisper Large v3 Turbo (809M params) on word error rate (27.5% vs 27.54%), demonstrating that for this domain dataset quality matters as much as model size. All models use decoder-only LoRA fine-tuning for WhisperKit compatibility on iOS.
| Model | Parameters | Baseline WER | Fine-tuned WER | HuggingFace |
|---|---|---|---|---|
| Whisper Small | 242M | 49% | 27.5% | Trelis/whisper-small-...-6772 |
| Whisper Large v3 Turbo | 809M | 49% | 27.54% | Trelis/whisper-large-v3-turbo-...-filtered |
The models below were trained on rewritten transcripts (sentence case with numeric digits). They produce output in the rewritten format but have slightly higher error rates than the lowercase-trained models above.
| Model | Parameters | WER (client test set) | HuggingFace |
|---|---|---|---|
| Whisper Small (rewritten) | 242M | 29.08% | Trelis/whisper-small-...-rewritten-7628 |
| Whisper Large v3 Turbo (rewritten) | 809M | 30.92% | Trelis/whisper-large-v3-turbo-...-rewritten-1048 |
WER is measured on the primary client test set (pilotgpt-test-0.5s). The rewrite approach converts spoken numbers to digits and text to sentence case. The slight WER increase (roughly 1.5–3.5 percentage points) versus the lowercase models is likely due to a format mismatch with Whisper's pre-training data.
| Dataset | Type | Duration / Size | HuggingFace |
|---|---|---|---|
| Training Dataset | Cleaned (lowercase) | 22 hrs | pilotgpt-unified-all-raw-no-pack-1s-merged-filtered |
| Training Dataset (rewritten) | Cleaned + Rewritten | 22 hrs | pilotgpt-...-filtered-rewritten |
| Test Set | Evaluation | 50 samples / 4.6 min | pilotgpt-test-0.5s |
| Test Set (rewritten) | Evaluation | 50 samples | pilotgpt-test-0.5s-rewritten |
A formatting prompt for converting raw ATC transcripts to sentence case with numeric digits, suitable for post-processing model outputs. Attached separately.
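As an illustration of the rewrite direction only (the delivered prompt is attached separately; this toy rule-based converter with an assumed digit table is not it):

```python
# Toy sketch: spoken ATC digits -> numerals, then sentence case.
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8",
          "nine": "9", "niner": "9"}

def rewrite(text: str) -> str:
    out: list[str] = []
    for w in text.split():
        d = DIGITS.get(w.lower())
        if d is not None and out and out[-1].isdigit():
            out[-1] += d  # merge digit runs: "two seven" -> "27"
        else:
            out.append(d if d is not None else w)
    s = " ".join(out)
    return s[:1].upper() + s[1:]  # sentence case

print(rewrite("cleared for takeoff runway two seven"))
# Cleared for takeoff runway 27
```

A real converter also has to handle "point", "decimal", "hundred", "thousand", and context-dependent groupings (frequencies vs headings vs callsigns), which is why a prompt-based rewrite was used.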
Source: LDC ATC Corpus (LDC94S14A) covering Boston Logan, Reagan National, and Dallas-Fort Worth airports.
From ~72 hours of raw recordings, the pipeline produced ~22 hours of clean, aligned training audio (28,506 samples).
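A quick sanity check on the pipeline yield quoted above, using only the figures in this report:

```python
raw_hours, clean_hours, samples = 72, 22, 28_506

print(f"yield: {clean_hours / raw_hours:.0%} of raw audio retained")
# yield: 31% of raw audio retained
print(f"average clip length: {clean_hours * 3600 / samples:.1f} s")
# average clip length: 2.8 s
```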
Qwen3-ASR-0.6B (600M params, Apache 2.0 license) is a state-of-the-art ASR model released January 2026. It supports 52 languages, streaming/offline unified inference, and long-audio transcription. It may outperform Whisper on ATC audio even before fine-tuning, and fine-tuning on the existing datasets could yield further gains. At 600M parameters, it is smaller than Whisper Turbo (809M) and would likely use less energy on-device, though it is larger than Whisper Small (242M) so energy usage should be validated before committing to deployment. The larger Qwen3-ASR-1.7B variant is also available if on-device constraints allow.
We have done preliminary work fine-tuning Moonshine Base (61.5M params) on the ATC data, achieving 32.8% WER in initial experiments. At roughly 4× smaller than Whisper Small, a tuned Moonshine model could offer a significant reduction in energy consumption on iPad — potentially around 5×. We could deliver a fine-tuned Moonshine model fairly quickly if energy efficiency is a priority, though this is not included in the current project scope.
Additional ATC datasets could further reduce WER. The UWB-ATCC corpus (available for commercial license) offers different accents and conditions that would improve model robustness. Mixing in some standard English speech data could also help prevent degradation on non-ATC portions of transmissions.
The current training data comes from professional LDC recordings. Adding noise augmentation during training — real-world ATC radio noise profiles, static, heterodyning — could improve robustness in actual deployment conditions.
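Noise augmentation of this kind typically mixes noise into clean audio at a controlled signal-to-noise ratio. A minimal NumPy sketch, using white noise as a stand-in for the recorded ATC radio-noise profiles mentioned above:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech, scaled so the result has the requested SNR in dB."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(p_speech / p_noise_scaled) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone @ 16 kHz
noisy = mix_at_snr(clean, rng.standard_normal(16000), snr_db=10.0)
```

In training, the SNR would be drawn randomly per sample (e.g. 0–20 dB) so the model sees a spread of radio conditions.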
Comparison of model sizes relevant to on-device iPad deployment. Smaller models consume less battery and memory.
| Model | Parameters | ATC WER | Status | Notes |
|---|---|---|---|---|
| Moonshine Base | 61.5M | 32.8% (initial tuning) | Preliminary | ~4× smaller than Whisper Small; ONNX ready |
| Whisper Small | 242M | 27.5% | Delivered | Best WER-to-size ratio currently |
| Qwen3-ASR-0.6B | 600M | — | Not yet evaluated | State-of-the-art general ASR; smaller than Turbo |
| Whisper Large v3 Turbo | 809M | 27.54% | Delivered | Marginal improvement over Small |