Why this matters

Most plugin chord detectors live in the cloud. Your audio goes to a server, gets processed there, comes back as a chord name. That’s a brittle dependency. When the company gets acquired, when the API changes, when the subscription model “evolves,” your tool stops working. We weren’t going to ship a plugin that depended on someone else’s server.

So Harmony Wheel runs entirely on your machine. The model is a 2 MB file bundled inside the plugin. No internet required after activation. No audio ever leaves your computer.

The history (what we tried)

FFT template matching. The first version used a plain Fast Fourier Transform: measure how much energy sits at each frequency, then match the strongest peaks to chord templates. This works on a piano in a quiet room. It falls apart the moment you add drums, vocals, or a guitarist who plays slightly out of tune. The FFT sees harmonics and noise as just more frequency content; the template matcher can’t tell the difference.

Spotify’s Basic Pitch. The second version tried Spotify’s open-source note transcription model. It transcribes audio into MIDI notes pretty well — and yes, handing Spotify any credit grates. The saving grace: it’s open-source and runs entirely on your machine, so no Spotify server ever touches your audio. But a note-transcription model is the wrong tool for naming a chord: you don’t need every note in a song to identify a chord, and a single missed note changes everything. Transcribe C – E – G, lose the E, and you’ve got a power chord, not C major. So it never became the namer — it survives in the shipped app only as a note front-end, backed up by chroma.
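To make the missing-note problem concrete, here is a minimal Python sketch of naming a chord from a set of transcribed MIDI notes. The template table and the `name_chord` helper are illustrative assumptions, not Harmony Wheel’s actual code.

```python
# Hypothetical sketch: name a chord from transcribed MIDI note numbers.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Interval sets (semitones above the root) for a few chord qualities.
TEMPLATES = {
    frozenset({0, 4, 7}): "major",
    frozenset({0, 3, 7}): "minor",
    frozenset({0, 7}): "5 (power chord)",
}

def name_chord(midi_notes):
    """Return a chord name for the given MIDI notes, or None if no template fits."""
    pitch_classes = {n % 12 for n in midi_notes}
    # Try each sounding pitch class as the root and look for a matching template.
    for root in sorted(pitch_classes):
        intervals = frozenset((pc - root) % 12 for pc in pitch_classes)
        quality = TEMPLATES.get(intervals)
        if quality is not None:
            return f"{NOTE_NAMES[root]} {quality}"
    return None

print(name_chord([60, 64, 67]))  # C, E, G  -> C major
print(name_chord([60, 67]))      # C, G (the E was missed) -> C 5 (power chord)
```

One dropped note moves the input from one template to another, which is why a namer built directly on transcription output is fragile.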

Chroma vectors. A chroma vector compresses all 88 keys of a piano into 12 pitch classes (C, C♯, D, and so on) by summing the energy across octaves. Faster than FFT template matching, simpler than full transcription. Still gets confused by drums and overtones. Still doesn’t know that some notes in the signal aren’t part of the chord.
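The octave folding can be sketched in a few lines of Python. This assumes you already have a magnitude spectrum as two parallel lists (bin frequencies and magnitudes); `chroma_from_spectrum` is a hypothetical helper for illustration, not the shipped DSP.

```python
import math

def chroma_from_spectrum(freqs_hz, magnitudes, a4_hz=440.0):
    """Fold spectral energy into 12 pitch classes (index 0 = C, 9 = A)."""
    chroma = [0.0] * 12
    for f, mag in zip(freqs_hz, magnitudes):
        if f <= 0:
            continue  # skip DC; log2 is undefined at 0 Hz
        midi = 69 + 12 * math.log2(f / a4_hz)  # fractional MIDI note number
        pitch_class = int(round(midi)) % 12    # nearest semitone, octave discarded
        chroma[pitch_class] += mag * mag       # energy = magnitude squared
    total = sum(chroma)
    # Normalize so the 12 values sum to 1 (leave all zeros for silence).
    return [c / total for c in chroma] if total > 0 else chroma

# C4, C5, E4, G4 at equal magnitude: both Cs land in the same pitch class.
chroma = chroma_from_spectrum([261.63, 523.25, 329.63, 392.00], [1, 1, 1, 1])
```

Note that a drum hit or an overtone adds energy to whichever pitch class it happens to fall on, which is exactly the confusion described above.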

Harmonic-Percussive Source Separation (HPSS). HPSS is a pre-processing filter that tries to strip percussion from a mix before chord detection looks at it. Helps in some full-mix recordings. Hurts solo piano, because Basic Pitch-style note detection depends on the percussive attack transients that HPSS throws away. We couldn’t safely default it on or off, and asking the user to know which to pick was a design failure.

Each one improved one thing and broke two. They all shared the same root problem: they were trying to identify chords without ever having been taught what a chord is.

What we shipped

The version that ships uses CREMA — Convolutional Recurrent Estimation of Musical Attributes — an academic model published by Brian McFee at NYU’s Music and Audio Research Lab. It’s a small convolutional recurrent neural network (CRNN). Released under the open-source ISC license, bundled inside our plugin as a 2 MB file.

A CRNN is two ideas stacked. The convolution layers slide little pattern-detectors across the audio’s frequency spectrum, learning to recognize “this looks like a major triad” or “this looks like a dominant seventh.” The recurrent layers run over time, so the model knows that what you’re playing right now is partly defined by what you played a half-second ago — a G chord that resolves to C reads differently than a G chord that resolves to D minor.

What’s it trained on? Annotated chord transcriptions of pop music — the Isophonics Beatles set, the McGill Billboard set, the RWC dataset. These are academic music-information-retrieval corpora: chord labels written by musicologists, aligned to commercial recordings. The model learns the relationship between audio frequencies and chord identity. It doesn’t contain or reproduce the source recordings. It has no idea what a large-language-model training scrape is. It just learned, from labeled examples, what a chord looks like in audio.

Before CREMA sees the audio, we compute an HCQT (Harmonic Constant-Q Transform). A constant-Q transform is like an FFT, but the bins are spaced logarithmically instead of linearly. Linear spacing makes sense for engineers: even Hz steps. Logarithmic spacing makes sense for musicians: even semitone steps. The “Harmonic” part stacks several constant-Q transforms, each starting at a different harmonic multiple of the base frequency, so a note’s fundamental and its overtones line up at the same position and the model can see their relationship in one shot. CREMA takes the HCQT, runs it through its convolution and recurrent layers, and outputs a chord label every ~200 ms.
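The bin spacing is easy to see in code. The Python sketch below only computes where the bins sit, not the transform itself. The starting frequency, bin counts, and harmonic list are assumptions chosen for illustration, not CREMA’s actual configuration, and the harmonic stack follows the published HCQT formulation, where each layer starts at an integer multiple of the base frequency.

```python
def cq_center_frequencies(f_min_hz, n_bins, bins_per_octave=12):
    """Center frequencies of a constant-Q transform: equal steps in semitones."""
    return [f_min_hz * 2 ** (k / bins_per_octave) for k in range(n_bins)]

def hcqt_center_frequencies(f_min_hz, n_bins, harmonics=(1, 2, 3), bins_per_octave=12):
    """One constant-Q grid per harmonic h, each starting at h * f_min_hz."""
    return {h: cq_center_frequencies(f_min_hz * h, n_bins, bins_per_octave)
            for h in harmonics}

# Starting at C1 (about 32.70 Hz), one bin per semitone.
freqs = cq_center_frequencies(32.70, 25)
ratio = freqs[1] / freqs[0]      # the same ratio between every pair of neighbours
octave = freqs[12] / freqs[0]    # twelve bins up is one octave, a ratio of 2

stack = hcqt_center_frequencies(32.70, 13)
# Bin 0 of the h=2 layer sits where the 2nd harmonic of bin 0 of the h=1 layer falls.
```

Because bin k in every layer refers to the same note’s fundamental or overtone, a convolution looking across layers sees a note and its harmonics at one position.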

On top of that, we added:

  • A sustain envelope so you can click a wedge and hold a chord — synth voice fades in, holds, fades out on release.
  • The wheel’s harmonic context — the engine biases CREMA’s output toward chord functions that fit the current keyspace, so a borrowed iv chord lights the right wedge instead of just showing as out-of-key.
  • A real-time inference loop running about five times per second on a background thread, so the audio thread doesn’t stutter and the chord display feels responsive.
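One way to picture the key-context bias from the list above: reweight the model’s chord probabilities by how well each chord fits the current key, then renormalize. This is a minimal Python sketch under assumptions. The weights, the chord-label strings, and the `bias_toward_key` helper are hypothetical, not the shipped engine.

```python
def bias_toward_key(chord_probs, key_weights, default_weight=0.5):
    """Reweight chord probabilities by key fit and renormalize to sum to 1."""
    scored = {chord: p * key_weights.get(chord, default_weight)
              for chord, p in chord_probs.items()}
    total = sum(scored.values())
    if total == 0:
        return dict(chord_probs)  # nothing to reweight; leave the input untouched
    return {chord: s / total for chord, s in scored.items()}

# Key of C major (hypothetical weights): diatonic chords get full weight,
# the borrowed iv (F minor) slightly less, anything else gets default_weight.
C_MAJOR_WEIGHTS = {"C:maj": 1.0, "D:min": 1.0, "E:min": 1.0, "F:maj": 1.0,
                   "G:maj": 1.0, "A:min": 1.0, "F:min": 0.8}

raw = {"F:min": 0.40, "G#:maj": 0.45, "F:maj": 0.15}
biased = bias_toward_key(raw, C_MAJOR_WEIGHTS)
best = max(biased, key=biased.get)
print(best)  # F:min
```

In this toy case the raw output slightly prefers an out-of-key chord, and the bias tips it to the borrowed iv instead.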

The improvement over our earlier attempts wasn’t subtle. It was a different kind of accurate.

The methods, defined

CREMA carries the naming, but the shipped detector runs several methods at once and routes between them depending on the sound in front of it. Here’s the whole toolkit, end to end.

Basic Pitch — Spotify’s open-source note-transcription model. We don’t love giving Spotify the credit, but the model is open-source and runs entirely on your machine — no Spotify server, no account, no telemetry, nothing leaves your computer. It writes the audio out as individual notes: which pitches are sounding, and when. Sharp on clear attacks; on a soft, few-overtone source — a vintage electric piano — it under-reports, catching one note of a three-note chord. So we back it with chroma.

Chroma — the twelve pitch classes (C, C♯, D … B) with every octave folded together. Rather than name exact notes, it measures how much energy sits on each of the twelve. That makes it robust to timbre: it reads the chord tones when a note-by-note model misses them. We run two flavours — a classic FFT chromagram, and a Goertzel variant (a sharper version tuned to exact note frequencies) that fills in the notes Basic Pitch drops. It’s our own DSP, not a bundled library.
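For the curious, the Goertzel idea fits in a few lines. The Python sketch below measures signal power at one exact frequency, then sums the results per pitch class. The MIDI range and the `goertzel_chroma` helper are assumptions for illustration, not our production code.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Power of `samples` at exactly `target_hz`, via the Goertzel recurrence."""
    w = 2.0 * math.pi * target_hz / sample_rate
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Squared magnitude of the single frequency component.
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def goertzel_chroma(samples, sample_rate, midi_notes=range(36, 84)):
    """Sum Goertzel power at each note's exact frequency into 12 pitch classes."""
    chroma = [0.0] * 12
    for midi in midi_notes:
        note_hz = 440.0 * 2 ** ((midi - 69) / 12)
        if note_hz < sample_rate / 2:  # stay below the Nyquist frequency
            chroma[midi % 12] += goertzel_power(samples, sample_rate, note_hz)
    return chroma

# A 440 Hz sine (the note A) sampled at 8 kHz for 0.1 s.
fs = 8000
tone = [math.sin(2 * math.pi * 440 * n / fs) for n in range(800)]
chroma = goertzel_chroma(tone, fs)
strongest = chroma.index(max(chroma))  # index 9 is pitch class A
```

Unlike an FFT, whose bins fall on a fixed linear grid, each Goertzel filter is aimed at one note’s exact frequency, so no energy is lost to bins that straddle two notes.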

CREMA — Convolutional Recurrent Estimation of Musical Attributes, the academic model described above (Brian McFee, NYU). It reads the spectrum and names the chord directly, with a little memory of what came just before. Strongest on dense, full-mix material.

NNLS — Non-Negative Least Squares. Not machine learning — it’s math: it explains the spectrum as a stack of note-and-overtone templates whose amounts can’t go negative, then reads the chord from the closest fit. Because it separates a note’s fundamental from its overtones, it’s strongest on sustained, harmonically-rich solo sources — organ, piano, electric piano — where plain chroma gets fooled by the overtones. It also reports how well it fit (its “residual”), which drives the router.
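Here is a toy version of that fit in Python. It uses a simple multiplicative-update solver, which is our stand-in for illustration and not necessarily the solver in the shipped plugin. The four-bin “spectrum” and the three note templates are made up.

```python
import math

def nnls_multiplicative(templates, spectrum, iterations=500, eps=1e-12):
    """Approximately minimize ||spectrum - sum_j x[j] * templates[j]|| with x >= 0.

    Multiplicative updates keep x non-negative; they require non-negative
    templates and spectrum (true for magnitude spectra).
    Returns (weights, residual).
    """
    n_notes, n_bins = len(templates), len(spectrum)
    x = [1.0] * n_notes
    # Correlation of each template with the observed spectrum (A^T b).
    atb = [sum(t[i] * spectrum[i] for i in range(n_bins)) for t in templates]

    def model_for(weights):
        return [sum(weights[j] * templates[j][i] for j in range(n_notes))
                for i in range(n_bins)]

    for _ in range(iterations):
        model = model_for(x)
        for j in range(n_notes):
            atax = sum(templates[j][i] * model[i] for i in range(n_bins))
            x[j] *= atb[j] / (atax + eps)

    model = model_for(x)
    residual = math.sqrt(sum((spectrum[i] - model[i]) ** 2 for i in range(n_bins)))
    return x, residual

# Toy templates over 4 bins: each note is a fundamental plus a half-strength overtone.
low_a  = [1.0, 0.0, 0.5, 0.0]   # overtone lands in bin 2
low_b  = [0.0, 1.0, 0.0, 0.5]
high_a = [0.0, 0.0, 1.0, 0.0]   # a different note whose fundamental is bin 2

observed = [2.0, 0.0, 1.0, 0.0]  # only low_a is sounding, at twice template strength
weights, residual = nnls_multiplicative([low_a, low_b, high_a], observed)
```

The energy in bin 2 gets attributed to `low_a`’s overtone rather than to a second note, and the residual comes out close to zero, which signals a clean decomposition.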

HPSS — Harmonic-Percussive Source Separation. A pre-processing filter, not a chord namer: it splits the spectrum into sustained/pitched content versus drums and transients, and strips the percussion so the detector sees a stable harmonic signal. Useful on a full mix; off by default, because it smears clean solo instruments.
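A common way to do this split is median filtering: sustained notes form horizontal lines in a spectrogram, drum hits form vertical ones. The Python sketch below is a toy version of that textbook approach on a tiny grid. Whether the shipped filter works exactly this way is not something this sketch claims, and the helper names are hypothetical.

```python
from statistics import median

def _median_filter(values, width):
    """Sliding median; the window shrinks at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def harmonic_part(spec, width=3):
    """spec[f][t] holds magnitudes. Returns spec with percussive cells zeroed."""
    n_f, n_t = len(spec), len(spec[0])
    # Smoothing along time keeps sustained (harmonic) energy.
    harm = [_median_filter(row, width) for row in spec]
    # Smoothing along frequency keeps broadband (percussive) energy; perc[t][f].
    perc = [_median_filter([spec[f][t] for f in range(n_f)], width)
            for t in range(n_t)]
    # Binary mask: keep a cell when it looks at least as harmonic as percussive.
    return [[spec[f][t] if harm[f][t] >= perc[t][f] else 0.0
             for t in range(n_t)] for f in range(n_f)]

# 5x5 toy spectrogram: a sustained tone in row 2, a drum hit in column 2.
spec = [[1.0 if (f == 2 or t == 2) else 0.0 for t in range(5)] for f in range(5)]
clean = harmonic_part(spec)
# Row 2 (the tone) survives; the rest of column 2 (the hit) is removed.
```

The same mechanism explains the smearing on solo instruments: a piano attack is also a short broadband event, so it gets treated like a drum hit.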

The router — the glue. It reads NNLS’s residual to decide who names the chord: a low residual means a solo instrument it can decompose cleanly, so NNLS wins; a high one means a dense mix, so it defers to CREMA. Then the wheel’s own engine maps whatever name comes back onto the current key and scale degree.
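In code, the routing decision reduces to a threshold check. This Python sketch assumes the residual is normalized to a 0..1 range; the 0.3 threshold and the `route_chord` helper are hypothetical placeholders, not the values we ship.

```python
def route_chord(nnls_chord, nnls_residual, crema_chord, residual_threshold=0.3):
    """Pick which method names the chord, based on how well NNLS fit the spectrum.

    nnls_residual is assumed normalized to 0..1 (0 = perfect fit).
    Returns (chord_name, method_used).
    """
    if nnls_residual <= residual_threshold:
        return nnls_chord, "nnls"   # clean decomposition: likely a solo instrument
    return crema_chord, "crema"     # poor fit: likely a dense mix

print(route_chord("C:maj", 0.08, "C:maj7"))  # ('C:maj', 'nnls')
print(route_chord("C:maj", 0.62, "A:min"))   # ('A:min', 'crema')
```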

Two of these are machine-learned (Basic Pitch and CREMA); chroma and NNLS are plain DSP; HPSS is a DSP pre-filter. None of them phone home.


Credits: CREMA is © Brian McFee, NYU MARL, ISC license. ONNX Runtime is © Microsoft, MIT license. Harmony Wheel is built on the open-source JUCE framework.