*Spectrogram of swelling trumpet sound*

Art+Logic’s Incubator project has made a lot of progress. In a previous post I mentioned that Dr. Scott Hawley’s technique to classify audio involved converting audio to an image and using a Convolutional Neural Network (CNN) to classify the audio based on this image. That image is a spectrogram. I’m going to go into some detail about what we do to create one, and why to the best of my ability.

## Basics

The spectrogram shows the energy within a set of frequency ranges. The vertical axis represents the frequencies, the horizontal axis represents time. When represented as a grayscale image, the brighter the pixel, the more energy in that frequency range (`y`

) at that time range (`x`

).

## The Base — The FFT

To create the spectrogram we have to extract that frequency data from the audio signal. This is done with a Fourier Transform algorithm, commonly implemented as the Fast Fourier Transform (FFT). The variation we need to use is the Short Time Fourier Transform (STFT) so we can see how the frequency information changes over the time. In general, a Fourier Transform takes a signal, for us a buffer full of discrete samples taken of a signal over time (i.e. audio), isolates contiguous frequency ranges into what are called bins, then measures the energy present in each bin over some length of time. More on [STFT] later.

JUCE provides an implementation of an FFT that we can use to implement the STFT in its `dsp`

module which it documents as not necessarily fast but good enough for many uses. So far it works for us. We may be lucky in that we’ve been developing on macOS, where JUCE uses Apple’s Accelerate library.

The FFT class is initialized with an `order`

argument. The order is the exponent such that the number of samples processed by the FFT will be two raised to that number, called `fftSize`

. It wasn’t immediately obvious to me what the size was when I started on this code. Is it the number of bins, or is it the number of samples the FFT will produce? For JUCE and Accelerate it’s both.

I thought I would be able to simply pass it the number of bins I wanted, and define the length separately. Using `order`

this way makes sense when you consider that the FFT works best when the number of samples it will process is a power of two. Additionally, the Inverse Fourier Transform can be performed, in which case the input is the bins and there would need to be `fftSize`

bins.

The total number of bins, in our case 2048, includes positive and negative frequency values. The two halves will be mirrored values, and we’ll eventually only keep the positive half of them.

```
/// The class to make spectrograms from audio file data
class ASpectroMaker {
//...
dsp::FFT fFft;
//...
};
ASpectroMaker::ASpectroMaker(int fftBins /*...*/ ): fFft(std::log2(fftBins))
// ...
{}
```

## The Iteration — The STFT

In STFT The *Short Time* part means we will take overlapping subsets of the signal and perform the transform on each chunk. So each column of the spectrogram represents the Fourier transform output from one of these chunks. The distance between chunks is called the hop size. The hop size should be less than the chunk size.

```
// One audio channel of FFT data over time, really 2-dimensional
using Spectrum = std::vector < float > ;
ASpectrogram ASpectroMaker::STFT(const AudioSampleBuffer & signal, size_t hop) {
const float * data = signal.getReadPointer(0);
const size_t dataCount = signal.getNumSamples();
// fftSize will be the number of bins we used to initialize the ASpectroMaker.
ptrdiff_t fftSize = fFft.getSize();
// See below for more about numHops
ptrdiff_t numHops = 1 L + static_cast < long > ;
((dataCount - fftSize) / hop);
//...
// Ignore the negative frequency information but leave the center bin.
size_t numRows = 1 UL + (fftSize / 2 UL);
// fFft works on the data in place, and needs twice as much space as the input size.
std::vector < float > fftBuffer(fftSize * 2 UL);
// While data remains
{
std::memcpy(fftBuffer.data(), data, fftSize * sizeof(float));
// prepare fft data...
fFft.performFrequencyOnlyForwardTransform(fftBuffer.data());
// ...copy the frequency information from fftBuffer to the spectrum
// Next chunk
data += hop;
}
//...
}
```

## Windowing

Because we are breaking apart our signal into chunks, the first and last values in each buffer will probably not be zero. When that’s true, the FFT will produce noise in the output due to the discontinuity of the signal. Imagine a flat signal with just zero before and after the chunk. We need to smooth the begining and end to make sure they transition smoothly to the full amplitude of the signal. To do this we use a windowing function. There are many choices for windowing functions with various tradeoffs, and JUCE provides several options.

The Hann function provides good performance and meets our requirements (and it was what Dr. Hawley used). Notice that we initialize it with the `fftSize + 1`

, not the hop size. This threw me for a minute during development, but because we are windowing we have to overlap the chunks, moving by hop size, but processing `fftSize`

samples per chunk, where `hop < fftSize`

. Looking at the Hann function it only allows the maximum signal amplitude near the center, the rest of the time it attenuates the signal. So to represent the original signal more accurately we overlap the chunks so that at some iteration the full amplitude will be well represented and in others it still contributes some to the calculations.

```
/// The class to make spectrograms from audio file data
class ASpectroMaker {
//...
dsp::FFT fFft;
dsp::WindowingFunction < float > fWindow;
//...
};
ASpectroMaker::ASpectroMaker(int fftBins /*...*/ ): fFft(std::log2(fftBins)), fWindow(fFft.getSize() + 1, dsp::WindowingFunction < float >::hann, false)
// ...
{}
```

Now apply the windowing to the chunk of samples before passing it to the FFT.

```
ASpectrogram ASpectroMaker::STFT(const AudioSampleBuffer& signal, size_t hop)
{
// ...Initialize variables as before...
// While data remains
{
std::memcpy(fftBuffer.data(), data, fftSize * sizeof(float));
fWindow.multiplyWithWindowingTable(fftBuffer.data(), fftSize);
fFft.performFrequencyOnlyForwardTransform(fftBuffer.data());
// ...copy the frequency information from fftBuffer to the spectrum
// Next chunk start
data += hop;
}
//...
}
```

I’ll skip how the data is read into the `Spectrum`

where `ASpectrogram`

keeps per-channel information, but it is straightforward. Just remember to discard the bins corresponding to negative frequencies. And we flip the data vertically so that higher frequencies are at the top, lower frequencies are at the bottom.

There’s a more interesting thing we need to do after that.

## We Don’t Hear Hertz

The FFT bins are ranges of Hertz (Hz). Hertz are a linear representation of frequency, but our hearing’s perception is logarithmic. That means we don’t perceive much difference between 10,000 Hz and 10,500 Hz, but by 20,000 Hz we will. It also means we hear a lot of change between 20 Hz and 500 Hz. A lot of the resolution in the linear spacing of the Hz bins is wasted given our perception.

In order that our classification system finds features in the spectrogram that are more likely to correspond to the way we hear the sound, we should convert from Hertz to another measurement that matches our hearing. For the method Dr. Hawley brought us that is… well it’s commonly called Mels. I won’t presume to understand or explain everything that is going on in the conversion to Mels, but it boils down to a matrix multiplication with a specially constructed conversion matrix that stretches and squeezes information from our evenly spaced linear Hertz bins into bins that better represent our perception of the sound’s frequencies.

The conversion matrix ends up being `fftSize x mels`

. Multiplying the frequency data which is `n x fftsize`

with it will produce the `n x mels`

desired output where `n`

is the width of our spectrogram. And we reduce the total size of the spectrogram data by as much. In our application we go from the Hz representation `1025 x 259`

to `96 x 259`

in the Mels representation.

## The Final Transforms

Once we have converted to Mels, we make a final transformation from amplitudes to deciBels (db) while normalizing and clamping to a maximum dB. To use the *power to dB* formula we need powers rather than amplitudes. This just means squaring the signal, i.e. multiply it by itself. Notice that in the `power-to-db`

code it again squares the values. It needs both.

```
// TransformChannels will modify the data in each channel
// (aka Spectrum) in the spectrogram in place.
// Power to dB while limiting to a minimum magnitude.
TransformChannels(magnitudes, [](float mag) {
constexpr double kMin = 1e-5 * 1e-5;
constexpr double kReference = 1.0 f * 1.0 f;
return 10.0 * std::log10(std::max(kMin, (double) mag * mag)) -
10.0 * std::log10(std::max(kMin, kReference));
});
// Limit to maximum dB
auto maxDb = magnitudes.MinMax().second - kMaxDb;
TransformChannels(magnitudes, [maxDb](float dB) {
return std::max(dB, maxDb);
});
```

## View From The Top

Here is the top-level `ASpectroMaker`

code snippet that goes from audio file to the final spectrogram. When it finishes `spectrogram`

contains the final Mel frequency information.

```
// Read a little more than 3 seconds to get the amount of data needed
// to produce 259 columns in the spectrogram
auto signal = this->ReadAudioSignal(file, 3.1);
auto hzFreqs = this->STFT(ToMono(signal).buffer, kHopLength);
TransformChannels(hzFreqs, [](auto x) { return x * x; });
spectrogram.AddChannel(ASpectroMaker::HzToMels(hzFreqs, 0, signal.sampleRate, fNumMelBins));
spectrogram.Reshape(fNumMelBins, hzFreqs.Shape().second);
ASpectroMaker::AmplitudesToDbs(spectrogram);
```

## Concluding

You now have a two-dimensional spectrogram represented as a set of floating point values in a vector. This is suitable to export, say as a NumPy array, or conversion to an actual image representation in color. We do both.

Hopefully this article has let you get a little better understanding of the how and the why, and you can avoid some of the confusion I experienced during development.

Some notes.

- We discard the phase information the FFT can produce; this means we cannot reliably convert back to audio from the spectrogram
- Dr. Hawley’s Python implementation used librosa and NumPy for much of the audio processing; we based our implementation on that code, and used his parameters
- The JUCE spectrogram demo was also very helpful