A macOS-only Python application that acts as a system-wide hold-to-dictate tool. The user holds the right Command key, speaks, and on release the spoken audio is transcribed (and optionally translated to English) by a local MLX-Whisper model. The result is pasted into whatever text field is currently focused. A small floating overlay pill shows status throughout, and a menu bar icon provides runtime controls.
Specification for building a macOS background dictation and translation utility powered by MLX-Whisper on Apple Silicon.
A macOS-only Python application that acts as a system-wide hold-to-dictate tool. The user holds the right Command key, speaks, and on release the spoken audio is transcribed (and optionally translated to English) by a local MLX-Whisper model. The result is pasted into whatever text field is currently focused. A small floating overlay pill shows status throughout, and a menu bar icon provides runtime controls.
The entire app lifecycle is governed by three states:
Legal transitions: IDLE → RECORDING (key press), RECORDING → PROCESSING (key release with enough audio), RECORDING → IDLE (cancellation if recording is too short), PROCESSING → IDLE (transcription finished or failed). All other transitions are illegal and must raise an error.
hatchling as the build backend with a pyproject.toml and src/ layout.mlx-whisper (speech recognition engine), numpy (audio array manipulation), sounddevice (microphone capture), pyobjc (macOS framework bindings for Quartz, AppKit, Cocoa).pytest for unit tests.The app requires three macOS permissions. Detect when each is missing and emit clear, actionable error messages pointing to the relevant System Settings path.
CGEventTapCreate returns None, this permission is missing.sounddevice.Organize the code into clearly separated modules with single responsibilities. The boundaries are:
Implement a thread-safe finite state machine with the three states and transition rules described above. Use an enum for states. Protect state access with a threading.Lock. Accept an optional on_change(old_state, new_state) callback that fires outside the lock after every successful transition (to prevent deadlocks if the callback reads state). Raise a custom error on illegal transitions. Provide convenience predicates (is_idle(), is_recording(), is_processing()).
Define a dataclass with all configurable fields and sensible defaults:
| Field | Type | Default | Purpose |
|---|---|---|---|
| model | str | A large MLX Whisper model repo id | Local path or HF repo id for the Whisper model |
| translate | bool | True | Translate to English (True) vs. transcribe only (False) |
| language | str or None | None | ISO-639-1 source language code; None means auto-detect |
| sample_rate | int | 16000 | Audio sample rate (Whisper's native rate) |
| min_duration | float | 0.3 | Reject recordings shorter than this (seconds) |
| interim_interval | float | 1.0 | Minimum seconds between interim decode runs |
| interim_preview | bool | False | Show interim transcription in the overlay while recording |
| in_field_preview | bool | False | Paste interim text directly into the focused field |
| restore_clipboard | bool | True | Restore previous clipboard contents after paste |
| log_level | str | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
| foreground | bool | False | Run as a foreground app with dock icon instead of background |
Parsing precedence (lowest to highest priority):
~/Library/Application Support/reverse-babelfish/preferences.json)RB_, e.g. RB_MODEL, RB_TRANSLATE)CLI arguments must default to None internally so "not specified" is distinguishable from "explicitly set" — only override lower layers when the user explicitly passed a flag.
Boolean environment variable parsing: "0", "false", "no" (case-insensitive) map to False; everything else maps to True.
Provide a function to persist user preferences (translate, language) to the JSON file — called when the user changes settings via the status bar menu so they survive restarts.
CLI flags to support: --model, --translate/--no-translate, --language, --min-duration, --interim-interval, --interim-preview/--no-interim-preview, --in-field-preview/--no-in-field-preview, --no-restore-clipboard, --foreground, --log-level.
Detect the right Command key (keyCode 54) globally using a Quartz CGEventTap.
Critical macOS detail: Right Command is a modifier key. It does NOT fire kCGEventKeyDown/kCGEventKeyUp. It ONLY fires kCGEventFlagsChanged. The detection logic must:
kCGEventFlagsChanged events only.kCGKeyboardEventKeycode from the event and verify it equals 54 (right Command).kCGEventFlagMaskCommand in the event flags to determine whether the key went down (flag set, key not previously held) or up (flag cleared, key was held)._held boolean to detect edges and ignore auto-repeat.Run the CGEventTap on a private CFRunLoop in a daemon thread so it never blocks the main thread. Accept on_press and on_release callbacks. Always return the event from the tap callback — failing to return blocks all keyboard input system-wide. Wrap the callback body in try/except so errors never kill the tap.
Use sounddevice.InputStream for non-blocking microphone capture. Record mono float32 at 16 kHz with a blocksize of ~1024 samples (~64ms). Accumulate audio chunks in a list protected by a threading.Lock. Provide methods to start, stop, get the concatenated audio as a 1-D numpy array, and query the buffered duration. Accept an optional on_chunk callback for interim decode — invoke it from the sounddevice callback thread, wrapped in try/except.
Wrap mlx_whisper in a pipeline class that handles model resolution, lazy loading, and concurrency.
Model resolution logic:
~), use it directly./ (looks like a Hugging Face repo id), resolve from the local HF cache only using huggingface_hub.snapshot_download(repo_id=..., local_files_only=True). Raise a clear error if the model is not cached — do not download.Lazy loading: Import mlx_whisper only when loading the model (not at module import time) to keep startup fast. Provide a load() method called once at startup.
Inference methods: transcribe(audio) for final full-utterance decode after key release, and transcribe_chunk(audio) for interim partial decode. Both call mlx_whisper.transcribe() under the hood.
Runtime option updates: Provide set_options(task, language) to change translate/transcribe mode and source language at runtime (called when the user toggles settings in the menu bar).
Concurrency guard: MLX-Whisper is NOT safe for concurrent inference calls. Use a threading.Lock (_infer_lock) around ALL mlx_whisper.transcribe() calls. Interim and final decodes must be serialized.
Implement clipboard-based text pasting into the active app:
NSPasteboard.kCGEventFlagMaskCommand flags, posting both via CGEventPost.Live Preview Injector: Implement a separate class for in-field preview during recording. It tracks whether interim text has been pasted and uses Cmd+Z to undo the previous interim paste before inserting updated text. Methods: update(text) to replace interim text, finalize(text) to undo interim and inject final text, cancel() to undo interim and clean up. Use a lock for thread safety.
Build a floating status pill using PyObjC/AppKit:
NSPanel with borderless + non-activating style mask.NSFloatingWindowLevel + 1), non-activating (must NEVER steal keyboard focus), mouse-ignoring (setIgnoresMouseEvents_(True)), and visible on all Spaces/desktops.NSOperationQueue.mainQueue().Create a macOS menu bar item with runtime controls:
NSStatusBar.systemStatusBar() to create a status item."mic" for idle, "mic.fill" for recording, "waveform" for processing. Fall back to text titles if SF Symbols are unavailable.NSMenu with these items:
NSObject subclass for menu action targets (required by PyObjC's Objective-C runtime).Provide a simple setup_logging(level) function that configures the root logger with a stderr handler, a timestamp + level + module name format, and clears existing handlers on re-call. All modules use logging.getLogger(__name__).
The main module wires all subsystems together in an App class.
By default, the CLI spawns a detached child process and the parent exits immediately:
RB_BACKGROUND_CHILD=1) to detect the child. If the variable is set, this is the child — proceed with normal app startup.--foreground is passed, skip detach and run in the calling terminal.start_new_session=True and stdin/stdout/stderr redirected to devnull for complete detachment.The App class constructor creates all subsystems and wires their callbacks:
on_change callback that logs transitions and updates the status bar icon.NSApplication instance.NSApplicationActivationPolicyAccessory for background mode or NSApplicationActivationPolicyRegular for foreground.finishLaunching().SIGINT handler for clean Ctrl+C shutdown.NSApp.run(). It blocks Python's signal handling. Instead, manually pump events in a while loop using nextEventMatchingMask_untilDate_inMode_dequeue_ with a ~200ms timeout. Check a _running flag each iteration to support clean shutdown.When the right Command key is pressed:
When the right Command key is released:
min_duration. If too short, transition RECORDING → IDLE (cancel), hide the overlay, cancel live preview if active, and return.When interim_preview is enabled, the audio chunk callback drives throttled interim decodes:
interim_interval, with a hard minimum floor of ~0.25s) and no interim decode is in-flight, snapshot the buffer and spawn a one-shot interim thread.pipeline.transcribe_chunk(), then:
Runs after key release on a background thread:
pipeline.transcribe(audio).pipeline.set_options(), persist preferences.pipeline.set_options(), persist preferences.In-field preview implies interim preview. Enabling in-field must also enable interim. Disabling interim must also disable in-field.
The stop() method disables the hotkey listener, stops audio capture, stops the status bar, and sets _running = False to exit the event loop. It is triggered by Ctrl+C (SIGINT) or the Quit menu item.
The app uses multiple threads that must be carefully coordinated:
| Thread | Role | Constraints |
|---|---|---|
| Main thread (AppKit) | Event dispatch, overlay UI, status bar | All UI mutations must happen here |
| Hotkey listener (daemon) | CFRunLoop for CGEventTap | Must not block — callbacks return quickly |
| sounddevice callback | Audio chunk delivery | Real-time audio thread — must not block or allocate |
| ASR worker | Full transcription after key release | One-shot per utterance |
| Interim ASR worker | Throttled partial decode during recording | One-shot, generation-gated |
All locks and their purpose:
Never hold two locks simultaneously. Fire callbacks outside locks to prevent deadlocks.
Right Command is a modifier key. Never use keyDown/keyUp for it. Only kCGEventFlagsChanged fires for modifier keys on macOS.
Always return the event from the CGEventTap callback. Failing to return blocks all keyboard input system-wide.
Overlay must never steal focus. Use NSWindowStyleMaskNonactivatingPanel and setIgnoresMouseEvents_(True). If the overlay steals focus, the user's cursor leaves their text field and the paste target changes.
Do not use NSApp.run(). It blocks Python's signal handling. Manually pump events with nextEventMatchingMask_untilDate_inMode_dequeue_ in a loop.
MLX inference is not thread-safe. Always serialize with a lock. Concurrent interim + final transcribe calls will crash or corrupt.
Clipboard timing matters. After placing text on the clipboard, pause ~50ms before sending Cmd+V so the target app's event loop sees the new content. After pasting, pause ~150ms before restoring the original clipboard so the app finishes consuming it.
Stale interim results. Use a generation counter that increments on every new recording and on key release. Discard results from prior generations silently.
Config precedence. CLI must override env, env must override saved preferences, saved preferences must override defaults. CLI args that weren't passed must remain None so they don't clobber lower layers.
In-field preview implies interim preview. Enabling in-field must auto-enable interim. Disabling interim must auto-disable in-field.
Live preview uses Cmd+Z. The injector undoes previous interim pastes before inserting updated text. This works with standard Cocoa text fields but may not work in all apps — this is an accepted limitation.
Background detach. The default launch forks a detached child. The child uses an env var guard to prevent infinite spawn loops. --foreground bypasses this.
State machine callbacks fire outside the lock. This is deliberate — it prevents deadlocks when the callback needs to read or transition state.
Model path resolution is offline-only. If the model is a Hugging Face repo id, resolve from the local cache. Never download. Fail with a clear message if the model is missing.
Write tests using pytest. Keep tests deterministic, fast, and free of macOS framework dependencies — mock or avoid AppKit/Quartz in unit tests.
on_change callback receives correct (old, new) pairs in order.tmp_path + monkeypatch) to avoid reading real preference files during tests.Set up a virtual environment, install the package in editable mode, run tests, then launch:
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
pytest
reverse-babelfish --foreground --log-level DEBUG
Use --foreground during development to keep the app attached to the terminal for log visibility. Use --log-level DEBUG to see state transitions and hotkey press/release events.
Build incrementally so each layer can be tested before wiring the next: