
PilotGPT

Air Traffic Control Speech Recognition — Project Report
February 2026  •  Prepared by Trelis Research

1. Executive Summary

This project fine-tuned OpenAI Whisper models on Air Traffic Control (ATC) audio from the LDC ATC Corpus to build specialized speech recognition for aviation communications. The system is designed for offline, on-device deployment on iPad via WhisperKit.

- 27.5%: best word error rate (WER)
- 44%: relative error reduction (from a 49% baseline)
- 22 hrs: training audio (from 72 hrs raw)
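As a sanity check on the headline numbers, the relative error reduction follows directly from the baseline and fine-tuned WER:

```python
baseline_wer = 0.49    # Whisper Small before fine-tuning
finetuned_wer = 0.275  # after fine-tuning

# relative error reduction = (baseline - finetuned) / baseline
relative_reduction = (baseline_wer - finetuned_wer) / baseline_wer
print(f"{relative_reduction:.1%}")  # prints 43.9%, reported as ~44%
```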

Whisper Small (242M params) matches the much larger Whisper Large v3 Turbo (809M params) on WER (27.5% vs 27.54%), demonstrating that dataset quality matters as much as model size for this domain. All models use decoder-only LoRA fine-tuning for WhisperKit compatibility on iOS.
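The report does not spell out which decoder modules receive LoRA adapters, but decoder-only fine-tuning is commonly expressed by filtering target modules by parameter name. A minimal sketch, using hypothetical parameter names that follow the Hugging Face Whisper layout and assuming adapters go on the decoder's q/v attention projections:

```python
# Hypothetical parameter names following the Hugging Face Whisper layout
param_names = [
    "model.encoder.layers.0.self_attn.q_proj.weight",
    "model.encoder.layers.0.self_attn.v_proj.weight",
    "model.decoder.layers.0.self_attn.q_proj.weight",
    "model.decoder.layers.0.self_attn.v_proj.weight",
    "model.decoder.layers.0.encoder_attn.q_proj.weight",
]

def decoder_lora_targets(names):
    """Keep only decoder attention projections as LoRA targets,
    leaving the encoder untouched (the property WhisperKit relies on)."""
    return [n for n in names
            if ".decoder." in n and (".q_proj." in n or ".v_proj." in n)]

targets = decoder_lora_targets(param_names)
# encoder weights are excluded; only decoder q/v projections remain
```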

2. Deliverables

Trained Models

| Model | Parameters | Baseline WER | Fine-tuned WER | HuggingFace |
| --- | --- | --- | --- | --- |
| Whisper Small | 242M | 49% | 27.5% | Trelis/whisper-small-...-6772 |
| Whisper Large v3 Turbo | 809M | 49% | 27.54% | Trelis/whisper-large-v3-turbo-...-filtered |

Trained Models (Rewritten Transcripts)

Models trained on rewritten transcripts (sentence case with numeric digits). These models produce output in the rewritten format but have slightly higher error rates than the lowercase-trained models above.

| Model | Parameters | WER (client test set) | HuggingFace |
| --- | --- | --- | --- |
| Whisper Small (rewritten) | 242M | 29.08% | Trelis/whisper-small-...-rewritten-7628 |
| Whisper Large v3 Turbo (rewritten) | 809M | 30.92% | Trelis/whisper-large-v3-turbo-...-rewritten-1048 |

WER is measured on the primary client test set (pilotgpt-test-0.5s). The rewrite approach converts spoken numbers to digits and text to sentence case. The slight WER increase (~1.5–3.4 percentage points) versus the lowercase models is likely due to a format mismatch with Whisper's pre-training data.

Datasets

| Dataset | Type | Duration / Size | HuggingFace |
| --- | --- | --- | --- |
| Training Dataset | Cleaned (lowercase) | 22 hrs | pilotgpt-unified-all-raw-no-pack-1s-merged-filtered |
| Training Dataset (rewritten) | Cleaned + Rewritten | 22 hrs | pilotgpt-...-filtered-rewritten |
| Test Set | Evaluation | 50 samples / 4.6 min | pilotgpt-test-0.5s |
| Test Set (rewritten) | Evaluation | 50 samples | pilotgpt-test-0.5s-rewritten |

Rewrite Prompt

A formatting prompt for converting raw ATC transcripts to sentence case with numeric digits, suitable for post-processing model outputs. Attached separately.
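The actual prompt is attached separately; as an illustration only, here is a minimal sketch of the kind of conversion it performs, assuming ATC-style digit-by-digit readout (including "niner") and simple sentence-casing. Real transcripts need many more cases than this toy mapping handles:

```python
# Spoken-digit vocabulary; "niner" is the ATC pronunciation of nine
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8",
          "nine": "9", "niner": "9"}

def rewrite(transcript: str) -> str:
    """Collapse runs of spoken digits into numerals, then sentence-case."""
    out, run = [], []
    for tok in transcript.split():
        if tok in DIGITS:
            run.append(DIGITS[tok])        # extend the current digit run
        else:
            if run:
                out.append("".join(run))   # flush completed digit run
                run = []
            out.append(tok)
    if run:
        out.append("".join(run))
    text = " ".join(out)
    return text[:1].upper() + text[1:]

rewrite("united three five zero climb and maintain one two thousand")
# "United 350 climb and maintain 12 thousand"
```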

3. Data Pipeline

Source: LDC ATC Corpus (LDC94S14A) covering Boston Logan, Reagan National, and Dallas-Fort Worth airports.

  1. Local transcript parsing — Extracted text and timestamps from S-expression formatted LDC transcripts; normalized to lowercase; removed UNINTELLIGIBLE segments; converted audio to 16kHz mono WAV.
  2. Cloud alignment & segmentation (Trelis Studio) — Neural VAD for speech detection; forced alignment to match transcript words to audio timestamps; quality filtering; segmentation into training-ready chunks.
  3. Dataset creation — Uploaded processed data to HuggingFace Hub.

From ~72 hours of raw recordings, the pipeline produced ~22 hours of clean, aligned training audio (28,506 samples).

4. Recommendations for Further Improvement

ACCURACY

Evaluate Qwen3-ASR for Improved Accuracy

Qwen3-ASR-0.6B (600M params, Apache 2.0 license) is a state-of-the-art ASR model released January 2026. It supports 52 languages, streaming/offline unified inference, and long-audio transcription. It may outperform Whisper on ATC audio even before fine-tuning, and fine-tuning on the existing datasets could yield further gains. At 600M parameters, it is smaller than Whisper Turbo (809M) and would likely use less energy on-device, though it is larger than Whisper Small (242M) so energy usage should be validated before committing to deployment. The larger Qwen3-ASR-1.7B variant is also available if on-device constraints allow.

EFFICIENCY

Moonshine for Energy-Efficient On-Device Inference

We have done preliminary work fine-tuning Moonshine Base (61.5M params) on the ATC data, achieving 32.8% WER in initial experiments. At roughly 4× smaller than Whisper Small, a tuned Moonshine model could offer a significant reduction in energy consumption on iPad — potentially around 5×. We could deliver a fine-tuned Moonshine model fairly quickly if energy efficiency is a priority, though this is not included in the current project scope.

DATA

Expand Training Data

Additional ATC datasets could further reduce WER. The UWB-ATCC corpus (available for commercial license) offers different accents and conditions that would improve model robustness. Mixing in some standard English speech data could also help prevent degradation on non-ATC portions of transmissions.

ROBUSTNESS

Add Real-World Noise Augmentation

The current training data comes from professional LDC recordings. Adding noise augmentation during training — real-world ATC radio noise profiles, static, heterodyning — could improve robustness in actual deployment conditions.
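One standard way to implement this, sketched here with plain float sample lists (a real pipeline would operate on audio arrays and a bank of recorded noise profiles): scale the noise so the mix hits a target signal-to-noise ratio, then add it to the speech.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then mix it into the speech (both are equal-length float sample lists)."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# e.g. mix recorded radio static into each training clip at a 5-20 dB SNR
```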

5. Model Size Comparison

Comparison of model sizes relevant to on-device iPad deployment. Smaller models consume less battery and memory.

| Model | Parameters | ATC WER | Status | Notes |
| --- | --- | --- | --- | --- |
| Moonshine Base | 61.5M | 32.8% (initial tuning) | Preliminary | ~4× smaller than Whisper Small; ONNX ready |
| Whisper Small | 242M | 27.5% | Delivered | Best WER-to-size ratio currently |
| Qwen3-ASR-0.6B | 600M | n/a | Not yet evaluated | State-of-the-art general ASR; smaller than Turbo |
| Whisper Large v3 Turbo | 809M | 27.54% | Delivered | Marginal improvement over Small |