Why this matters
Most plugin chord detectors live in the cloud. Your audio goes to a server, gets processed there, comes back as a chord name. That’s a brittle dependency. When the company gets acquired, when the API changes, when the subscription model “evolves,” your tool stops working. We weren’t going to ship a plugin that depended on someone else’s server.
So Harmony Wheel runs entirely on your machine. The model is a 2 MB file bundled inside the plugin. No internet required after activation. No audio ever leaves your computer.
The history (what we tried)
FFT template matching. The first version used a plain Fast Fourier Transform: measure how much energy sits at each frequency in the signal, then match the strongest frequencies to chord templates. This works on a piano in a quiet room. It falls apart the moment you add drums, vocals, or a guitarist who plays slightly out of tune. The FFT sees harmonics and noise as just more frequency content; the template matcher can’t tell the difference.
Spotify’s Basic Pitch. The second version tried Spotify’s open-source note transcription model. It transcribes audio into MIDI notes pretty well. It also got disqualified on principle. And anyway, a note transcription model is the wrong tool for the job: you don’t need every note in a song to identify a chord, and a single missed note can change everything. Transcribe C – E – G and lose the E and you’ve got a power chord, not C major.
Chroma vectors. A chroma vector compresses all 88 keys of a piano into 12 pitch classes — C, C#, D, etc. — by summing the energy across octaves. Faster than FFT template matching, simpler than full transcription. Still gets confused by drums and overtones. Still doesn’t know that some notes in the signal aren’t part of the chord.
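To make the folding step concrete, here is a minimal sketch of a chroma vector plus a simple template match. It is an illustration, not our shipped code: the function names, the cosine-similarity scoring, and the four-chord template set are all assumptions made for the example, and the input is a hand-written list of (frequency, energy) peaks rather than real audio.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to the nearest of the 12 pitch classes (0 = C)."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12

def chroma(peaks):
    """Fold (frequency, energy) pairs into 12 bins by summing energy across octaves."""
    bins = [0.0] * 12
    for freq_hz, energy in peaks:
        bins[pitch_class(freq_hz)] += energy
    return bins

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def triad(root, third, fifth):
    """Binary 12-bin template with the three chord tones set to 1."""
    template = [0.0] * 12
    for pc in (root, third, fifth):
        template[pc] = 1.0
    return template

# Hypothetical, deliberately tiny template set for the example.
TEMPLATES = {
    "C": triad(0, 4, 7),
    "Cm": triad(0, 3, 7),
    "G": triad(7, 11, 2),
    "Am": triad(9, 0, 4),
}

def best_template(vector, templates=TEMPLATES):
    """Return the name of the template most similar to the chroma vector."""
    return max(templates, key=lambda name: cosine(vector, templates[name]))

# C3, E4 and G5 sit in three different octaves but fold into one C major triad.
full_chord = chroma([(130.81, 1.0), (329.63, 0.8), (783.99, 0.6)])
print(best_template(full_chord))

# Drop the E and the C and Cm templates score exactly the same.
no_third = chroma([(130.81, 1.0), (783.99, 0.6)])
print(cosine(no_third, TEMPLATES["C"]) == cosine(no_third, TEMPLATES["Cm"]))
```

The second case is the weakness in miniature: nothing in the vector says which energy belongs to the chord, so a missing or spurious pitch class changes the answer.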
Harmonic-Percussive Source Separation (HPSS). HPSS is a pre-processing filter that tries to strip percussion from a mix before chord detection looks at it. It helps on some full-mix recordings. It hurts solo piano, because Basic Pitch-style note detection depends on the percussive attack transients that HPSS throws away. We couldn’t pick a safe default, on or off, and asking the user to know which one to choose was a design failure.
Each one improved one thing and broke two. They all shared the same root problem: they were trying to identify chords without ever having been taught what a chord is.
What we shipped
The version that ships uses CREMA (Convolutional and Recurrent Estimators for Music Analysis), an academic model published by Brian McFee at NYU’s Music and Audio Research Lab. It’s a small convolutional recurrent neural network (CRNN), released under the open-source ISC license and bundled inside our plugin as a 2 MB file.
A CRNN is two ideas stacked. The convolution layers slide little pattern-detectors across the audio’s frequency spectrum, learning to recognize “this looks like a major triad” or “this looks like a dominant seventh.” The recurrent layers run over time, so the model knows that what you’re playing right now is partly defined by what you played a half-second ago — a G chord that resolves to C reads differently than a G chord that resolves to D minor.
What’s it trained on? Annotated chord transcriptions of pop music — the Isophonics Beatles set, the McGill Billboard set, the RWC dataset. These are academic music-information-retrieval corpora: chord labels written by musicologists, aligned to commercial recordings. The model learns the relationship between audio frequencies and chord identity. It doesn’t contain or reproduce the source recordings. It has no idea what a large-language-model training scrape is. It just learned, from labeled examples, what a chord looks like in audio.
The audio path before CREMA sees it: we compute an HCQT, a Harmonic Constant-Q Transform. A Constant-Q transform is like an FFT, but the bins are spaced logarithmically instead of linearly. Linear spacing makes sense for engineers: even Hz steps. Logarithmic spacing makes sense for musicians: even semitone steps. The “Harmonic” part stacks several Constant-Q transforms, each starting at a whole-number multiple of the base frequency, so a note’s fundamental and its overtones line up across the layers and the model can see their relationship in one shot. CREMA takes the HCQT, runs it through its convolution and recurrent layers, and outputs a chord label every ~200 ms.
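The difference between linear and logarithmic bin spacing is easy to see in numbers. The sketch below is only an illustration: the sample rate, FFT size, lowest bin (C1 at about 32.7 Hz), and 12 bins per octave are assumed values for the example, not the plugin’s actual settings.

```python
def fft_bin_frequencies(sample_rate, n_fft, count):
    """Centers of the first `count` FFT bins: evenly spaced in Hz."""
    return [k * sample_rate / n_fft for k in range(count)]

def cqt_bin_frequencies(f_min, bins_per_octave, count):
    """Centers of the first `count` constant-Q bins: evenly spaced in pitch."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(count)]

C1_HZ = 32.703  # assumed lowest bin for this example

fft_bins = fft_bin_frequencies(sample_rate=44100, n_fft=4096, count=5)
cqt_bins = cqt_bin_frequencies(f_min=C1_HZ, bins_per_octave=12, count=25)

# FFT: every step is the same number of Hz.
fft_steps = [b - a for a, b in zip(fft_bins, fft_bins[1:])]

# Constant-Q: every step is the same ratio, one semitone at 12 bins per octave.
cqt_ratios = [b / a for a, b in zip(cqt_bins, cqt_bins[1:])]

print(round(fft_steps[0], 3))   # Hz between neighbouring FFT bins
print(round(cqt_ratios[0], 5))  # frequency ratio between neighbouring constant-Q bins
print(cqt_bins[12] / cqt_bins[0], cqt_bins[24] / cqt_bins[0])  # 12 bins up = one octave
```

With these assumed settings the FFT bins are roughly 10.8 Hz apart everywhere, which is several bins per semitone high on the keyboard and less than one bin per semitone down in the bass. The constant-Q bins keep one bin per semitone at every octave.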
On top of that, we added:
- A sustain envelope so you can click a wedge and hold a chord — synth voice fades in, holds, fades out on release.
- The wheel’s harmonic context — the engine biases CREMA’s output toward chord functions that fit the current keyspace, so a borrowed iv chord lights the right wedge instead of just showing as out-of-key.
- A real-time inference loop running about five times per second on a background thread, so the audio thread doesn’t stutter and the chord display feels responsive.
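One way to picture the harmonic-context step is as a reweighting of the model’s chord probabilities. What follows is a rough sketch under stated assumptions, not the engine’s real logic: the function name, the weight table, and the example probabilities are all hypothetical.

```python
def bias_toward_key(chord_probs, key_weights, default_weight=0.5):
    """Scale each chord's probability by a key-dependent weight, then renormalize."""
    weighted = {
        chord: prob * key_weights.get(chord, default_weight)
        for chord, prob in chord_probs.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return dict(chord_probs)  # nothing to rescale; fall back to the raw output
    return {chord: value / total for chord, value in weighted.items()}

# Hypothetical weights for C major: diatonic chords at full weight,
# the borrowed iv (Fm) slightly lower, everything else at default_weight.
C_MAJOR_WEIGHTS = {
    "C": 1.0, "Dm": 1.0, "Em": 1.0, "F": 1.0, "G": 1.0, "Am": 1.0,
    "Fm": 0.9,
}

# Made-up model output where an out-of-key chord narrowly wins on raw probability.
model_output = {"Fm": 0.40, "F#dim": 0.42, "F": 0.18}

raw_best = max(model_output, key=model_output.get)
biased = bias_toward_key(model_output, C_MAJOR_WEIGHTS)
biased_best = max(biased, key=biased.get)

print(raw_best, "->", biased_best)
```

In this made-up case the raw winner is F#dim, and after reweighting the borrowed Fm comes out on top. The weights here are arbitrary; the point is only that key context can break near-ties without overriding a confident prediction.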
The improvement over our earlier attempts wasn’t subtle. It was a different kind of accurate.
Credits: CREMA is © Brian McFee, NYU MARL, ISC license. ONNX Runtime is © Microsoft, MIT license. Harmony Wheel is built on the open-source JUCE framework.