Moviegoer: Audio Features — Voice Tone

Tim Lee
3 min read · Aug 24, 2020


This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

After extracting visual features from movie frames, we turn toward the audio track. A film’s audio mix is made of three components: the score, the diegetic sound effects, and the dialogue. We’ll focus on the dialogue for now. We’ll begin using librosa, a powerful library for audio analysis, and then use a pre-trained model to analyze voice tone.

Audio Visualization

Audio analysis is heavily rooted in signal processing — these concepts should be familiar to electrical engineers. To process audio, we’ll have to convert audio files into usable representations in both the time domain and frequency domain. Luckily, we can use the librosa library. Created by LabROSA at Columbia University, this library contains all sorts of tools for audio processing.

librosa makes it easy to create two fundamental audio visualizations: the waveform plot and the spectrogram.

The waveform plots amplitude vs. time. We can think of the amplitude as power, or volume. Here’s a waveform of a female character in Booksmart speaking a sentence for eight seconds.

Waveplot: Amplitude vs. Time

Since this is a stereo track, the left channel amplitude is above the axis and the right channel is below. The loudest portions of the sentence are denoted by the peaks.

Next, we have the spectrogram, which represents frequency content over time. This plot shows the strongest frequencies in the audio signal at each moment, and those frequencies define the character of the sound. As an example, we can compare spectrograms of a male voice and a female voice. Male and female voices occupy different fundamental-frequency ranges: roughly 80–180 Hz for men and 165–255 Hz for women.

Female Spectrogram: Intensity of Frequencies vs. Time
Male Spectrogram: Intensity of Frequencies vs. Time. The frequencies of his voice are lower than the female’s

Voice Tone

Because the eventual goal of the project is to quantify emotion, we’ll want to measure the emotion carried in voice tone. Voice emotion analysis is a popular subject, so plenty of pre-trained models are available. We used a model from GitHub user MITESHPUTHRANNEU, trained on the well-known RAVDESS dataset, in which male and female actors deliver lines with various emotions.

We were able to use the model on a plug-and-play basis with Keras’ models.load_model(). Given a 2.5-second clip, it identifies the speaker’s gender and one of five emotions. Of course, we could train our own model on the RAVDESS dataset, but that’s been done satisfactorily many times already, and we can gratefully build on this work to further progress in the Moviegoer project.

