// Vibrary – an Incubator Lab Project by Art+Logic

Vibrary · Machine Learning · Neural Networks · Audio Classification

// Overview
In early 2018, Art+Logic called for applicants to our first Incubator Lab. We selected Dr. Scott Hawley’s idea to use machine learning to classify audio based on the user’s preferences.

Dr. Hawley had already developed a Python-based application with a graphical interface, and he was interested in creating one with a standard desktop application experience that would be easier to use for a wider audience. That meant it would need to be easily distributable and installable, along with a polished GUI.

What are Spectrograms?

[Image: matrix of values from a fortissimo trumpet ensemble recording]

[Image: the corresponding spectrogram]

Dr. Hawley’s classifier library uses a machine-learning (ML) technique called a convolutional neural network (CNN). Its input is a spectrogram, an image produced from an audio file that shows how the signal’s frequency content changes over time; its output is a set of values used to identify the type of the audio.
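For illustration, a spectrogram can be computed with a short-time Fourier transform: slide a window along the audio, take the magnitude of each windowed slice’s Fourier transform, and stack the results into an image. This numpy sketch is only a stand-in for Panotti’s actual preprocessing, whose details differ:

```python
import numpy as np

def spectrogram(audio, window_size=512, hop=128):
    """Rows are frequency bins, columns are time slices, values in dB."""
    window = np.hanning(window_size)
    frames = [np.abs(np.fft.rfft(audio[start:start + window_size] * window))
              for start in range(0, len(audio) - window_size + 1, hop)]
    return 20 * np.log10(np.array(frames).T + 1e-10)

# One second of a 440 Hz tone at 22.05 kHz, standing in for a loaded audio file.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
image = spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(image.shape)  # (frequency bins, time slices)
```

For a pure tone like this, the resulting image is a single bright horizontal band at the tone’s frequency; richer sounds produce the textured patterns a CNN can learn to distinguish.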

// About Neural Networks
A neural network represents a function, mathematical or otherwise, that can be tailored to a particular task through a process called training. It is called a network because each unit in the function, called a node, can be connected to other nodes, so that information from a source node is fed into a target node. The network is generally organized into layers, where one layer’s nodes feed only into the next layer’s nodes. Each node can perform an operation on its inputs, transforming them before feeding the result to the next layer.
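As a rough sketch, a two-layer network of such nodes might look like this in Python. The sizes and weights here are arbitrary placeholders, not anything learned:

```python
import numpy as np

def relu(x):
    # A common per-node transformation: pass positives through, zero out negatives.
    return np.maximum(0.0, x)

# A tiny network: 3 inputs -> 4 hidden nodes -> 2 outputs.
# Each node computes a weighted sum of the previous layer's outputs,
# then applies its transformation before feeding the next layer.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 4))   # connections from inputs to hidden layer
w2 = rng.normal(size=(4, 2))   # connections from hidden layer to outputs

def forward(inputs):
    hidden = relu(inputs @ w1)  # first layer transforms the raw inputs
    return hidden @ w2          # second layer combines the hidden features

print(forward(np.array([1.0, 0.5, -0.2])))
```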

Training is an automated process in which a set of examples, each paired with its correct result, is fed through the network. The network’s output is compared to the correct answer, and the function’s parameters are adjusted to try to improve the results. This is repeated until the accuracy of the function stops improving.
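That loop can be sketched in a few lines. This is an illustrative gradient-descent example with made-up data (learning y = 2x with a single adjustable parameter), not Vibrary’s training code:

```python
import numpy as np

# Examples paired with their correct answers: here, the target is y = 2x.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs

weight = 0.0  # the function's single adjustable parameter
lr = 0.05     # how aggressively to adjust after each comparison

for step in range(200):
    predictions = weight * xs           # feed the examples through the "network"
    error = predictions - ys            # compare output to the correct answers
    weight -= lr * (error * xs).mean()  # adjust the parameter to reduce the error

print(round(weight, 3))  # converges to 2.0
```

A real network repeats the same compare-and-adjust cycle over millions of parameters at once.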

Dr. Hawley’s Panotti convolutional neural network takes an image as an input. The network’s first few layers extract features from the image. Subsequent layers use that information to decide upon the classification of that image as the final output.

// Solution
Art+Logic analyzed Dr. Hawley’s application and created a UI/UX prototype that we then used to work out ways to hide the complexity of using neural networks to classify audio. This prototype served as the spec for the C++ application, which performs three major functions: training, classification, and retrieval.

Training

Training involves helping the user assemble a set of examples for each type of sound they wish to identify, uploading those examples to the training server, and then downloading the trained network’s parameters for storage and use on the user’s computer.

Training can be computationally expensive. Vibrary supports training on a remote computer so that the user isn’t required to own the expensive GPU hardware that can accelerate training significantly.

Classification

Classification allows the user to assign the results of the neural network to audio files beyond those used in training. Results are stored in a local database.
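A minimal sketch of such a local store, using SQLite as in Vibrary’s stack; the schema and helper name here are hypothetical, not Vibrary’s actual tables:

```python
import sqlite3

# A local database of classification results: file paths paired with the
# tag the network predicted and its confidence in that prediction.
conn = sqlite3.connect(":memory:")  # a real app would use a file on disk
conn.execute("""
    CREATE TABLE classifications (
        path       TEXT PRIMARY KEY,
        tag        TEXT NOT NULL,
        confidence REAL
    )""")

def store_result(path, tag, confidence):
    # Re-classifying a file replaces its previous tag.
    conn.execute("INSERT OR REPLACE INTO classifications VALUES (?, ?, ?)",
                 (path, tag, confidence))
    conn.commit()

store_result("samples/kick_01.wav", "kick", 0.97)
store_result("samples/snare_03.wav", "snare", 0.88)
```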

Retrieval

Retrieval lets the user find audio files matching a type by searching the local database of classified files.
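Retrieval then reduces to a query against that database. Another SQLite sketch with a hypothetical schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE classifications (path TEXT, tag TEXT)")
conn.executemany("INSERT INTO classifications VALUES (?, ?)",
                 [("samples/kick_01.wav", "kick"),
                  ("samples/snare_03.wav", "snare"),
                  ("samples/kick_02.wav", "kick")])

def find_by_tag(tag):
    # Retrieval is a straightforward lookup in the local database.
    rows = conn.execute(
        "SELECT path FROM classifications WHERE tag = ? ORDER BY path", (tag,))
    return [path for (path,) in rows]

print(find_by_tag("kick"))  # ['samples/kick_01.wav', 'samples/kick_02.wav']
```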

// Goals + Objectives

Ease Of Use

Using a neural network, and especially training one, is complicated. Vibrary needed to make this as easy as possible while remaining useful to users at all skill levels.

To that end, Vibrary:

  • Allows the user to focus on only their audio, hiding CNN details
  • Provides a simple UI for handling large sets of audio files
  • Gives guidance on creating training data sets
  • Uses a simple tag metaphor as the search criteria
  • Auto-completes tag names to prevent entry errors
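The tag auto-completion in the last point can be as simple as case-insensitive prefix matching against the tags already in the database; this sketch is illustrative, not Vibrary’s implementation:

```python
def complete_tag(prefix, known_tags):
    """Suggest existing tags matching a prefix, case-insensitively,
    so the user picks a known tag instead of mistyping a new one."""
    prefix = prefix.lower()
    return sorted(t for t in known_tags if t.lower().startswith(prefix))

tags = ["kick", "Kalimba", "snare", "synth-pad"]
print(complete_tag("k", tags))   # ['Kalimba', 'kick']
print(complete_tag("sn", tags))  # ['snare']
```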


MVP Demo

The target feature set was kept small, ignoring or postponing anything not directly related to finding audio with specific qualities. There are many ways we could have allowed users to have more control over the whole process. We favored simplicity and a single paradigm.

Quick Development

We kept development bare-bones without creating future headaches. In particular, the remote training server was left to work as it did in Dr. Hawley’s application rather than building something like a REST API behind a web interface.

// Challenges

TensorFlow

TensorFlow is Google’s powerful and useful open-source machine learning library. However, managing custom builds of TensorFlow for C++, along with its dependencies, in a way that allowed easy distribution of the application falls outside its typical use case.

Quick Development vs Ease Of Use

Hiding and managing the details of the following tasks exists in direct tension with the goal of quick development:

  • Creating data sets
  • Uploading to the server
  • Triggering remote training
  • Locally creating the neural network for classification
  • Error recovery

It takes work to hide the complexity of those tasks while still keeping users informed of progress and protecting them from loss of work in error conditions.

// Technology Stack

Application

JUCE framework,
TensorFlow, SQLite, C++

Server

AWS Ubuntu Linux server instance,
Panotti, TensorFlow, Python

// The Team

Dr. Scott Hawley

Chief Scientist

Brett Porter

Project Manager

Jason Bagley

Developer

Daisey Traynham

UI/UX

Paul Hershenson

Project Supervisor

Jonathan Gonzalez

DB Consultant

John Tlusty

Tester

Britt Traynham

Consultant & Tester

J. Carlos Perez

Marketing

// Results

Art+Logic took the project to nearly feature-complete status, developing an application that can handle the three major features: training, classification, and retrieval.

Art+Logic is now open-sourcing Vibrary, encouraging the community to take over and expand it in ways we might not imagine so that it can be useful to as many people as need it.