Moviegoer: Unifying Features

This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

We’ve pored over the three basic elements of a movie file: the visuals, the audio, and the subtitles. Now it’s time to bring them all together. We’ve been extracting features from each of these three data streams, but if we tie them together, they can reinforce one another. We’re getting closer to building a small prototype of a machine that can “watch” a movie.

In Booksmart (2019), this karaoke scene has red as its dominant RGB color, contains only musical dialogue, and has audio distinctly shaped as “in-universe” sound, tinnier than a music track overlaid on the soundtrack.

Let’s take a look at the four categories of comprehension we’ve previously defined — the four types of knowledge needed to understand a film. In each, we can use features from all three data streams.


Scene Boundaries

A film can be broken down into individual, self-contained scenes. There are a few clues that will tip us off to scene boundaries. There’s usually a bit of “breathing room” after the previous scene. It may start with an establishing shot, such as a shot of a restaurant exterior if the scene will take place inside. This establishing shot might be accompanied by sound effects (and labeled subtitles) like “crickets chirping” or “indistinct chatter”. We can also look for stretches of time without dialogue: if the subtitles go ten seconds without any dialogue, this might be a scene boundary. Finally, we can look for the building block of movies, the two-character dialogue scene, by looking for shot/reverse-shot patterns with facial recognition.
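The dialogue-gap clue is easy to sketch. Here’s a minimal, hypothetical example: given subtitle cues as (start, end, text) tuples with times in seconds (in practice, parsed from an .srt file), it flags any silence of ten seconds or more as a candidate scene boundary. The function name and cue format are illustrative, not from the actual Moviegoer codebase.

```python
def find_dialogue_gaps(cues, min_gap=10.0):
    """Return (gap_start, gap_end) pairs where no subtitle is on screen.

    `cues` is a list of (start, end, text) tuples, sorted by start time;
    times are in seconds. Gaps of at least `min_gap` seconds are flagged
    as candidate scene boundaries.
    """
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(cues, cues[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps

# Example: a 13.8-second silence between the second and third cues
# is flagged; the 0.5-second pause between the first two is not.
cues = [(0.0, 4.5, "Let's get out of here."),
        (5.0, 7.2, "After you."),
        (21.0, 24.0, "Table for two, please.")]
print(find_dialogue_gaps(cues))  # [(7.2, 21.0)]
```

In a real pipeline this would be one signal among several; a gap alone isn’t proof of a scene boundary, but combined with an establishing shot or a labeled sound effect, it becomes much stronger evidence.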


Character Identification

We can use facial recognition and speech recognition libraries to identify faces and voices. With proper clustering techniques, we can track these faces and voices throughout the entire film. We can use the subtitles to assign character names to these faces and voices, looking for things like self-introductions (“My name is Adam”) or direct addresses (“How are you, Violet?”). Advanced facial recognition libraries can even tell us the age, race, and emotional state of these characters.
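The subtitle half of this idea can be sketched with two simple patterns, one for self-introductions and one for direct addresses. This is a hypothetical illustration, not the project’s actual implementation; real dialogue would need many more patterns and some filtering of false positives (e.g., “How are you, honey?”).

```python
import re

# Self-introductions: "My name is Adam", "my name is adam", etc.
SELF_INTRO = re.compile(r"\bmy name is ([A-Z][a-z]+)", re.IGNORECASE)

# Direct addresses: a capitalized name after a comma, ending the
# sentence, as in "How are you, Violet?"
DIRECT_ADDRESS = re.compile(r", ([A-Z][a-z]+)[.!?]")

def extract_names(line):
    """Return any character names suggested by a line of dialogue."""
    return SELF_INTRO.findall(line) + DIRECT_ADDRESS.findall(line)

print(extract_names("My name is Adam."))      # ['Adam']
print(extract_names("How are you, Violet?"))  # ['Violet']
```

A name extracted this way still has to be linked to the right face or voice; the natural move is to associate it with whoever is on screen (or speaking) when the line appears.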

Plot and Events

We can attribute dialogue to characters with clever use of subtitle and onscreen visual clues. Once we figure out what each character is saying, we can determine what they’re talking about, which might be of importance to them. We can also look for clues about where the current scene takes place, like “elevator dings” in the subtitles, or dialogue about ordering with a waiter in a restaurant.
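The location clue can be sketched as a lookup: SDH-style subtitles label non-dialogue sounds in brackets or parentheses (e.g. “[elevator dings]”), and matching those labels against a keyword table gives a rough location guess. The keyword table below is purely illustrative; a real system would need a far larger one.

```python
import re

# Match text inside [brackets] or (parentheses) — where SDH subtitles
# put labeled sound effects.
SOUND_LABEL = re.compile(r"[\[(]([^\])]+)[\])]")

# Illustrative keyword table mapping sound labels to location hints.
LOCATION_HINTS = {
    "elevator": "elevator",
    "crickets": "outdoors",
    "chatter": "public space",
}

def guess_location(subtitle_line):
    """Return a location hint if a labeled sound effect suggests one."""
    for label in SOUND_LABEL.findall(subtitle_line):
        for keyword, location in LOCATION_HINTS.items():
            if keyword in label.lower():
                return location
    return None

print(guess_location("[elevator dings]"))      # elevator
print(guess_location("(indistinct chatter)"))  # public space
```

Restricting the search to bracketed labels matters: a character *saying* the word “elevator” is much weaker evidence of location than the subtitler explicitly labeling an elevator sound.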

Emotional and Style Features

We can guess the emotional energy of a scene by looking at the visuals and listening to the audio. Stylized scenes might be full of flashy lights and specific color schemes. Montages might be free of dialogue and feature rapid cuts. A scene’s music has clues as well; we may look into identifying its tempo, scale, and tone.
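Two of these signals, cut rate and the presence of dialogue, combine into a simple heuristic: rapid cuts with no dialogue suggest a montage. The function and its 20-cuts-per-minute threshold are illustrative assumptions, not tuned values from the project.

```python
def scene_energy(shot_boundaries, has_dialogue, duration):
    """Roughly classify a scene's energy.

    `shot_boundaries` is a list of cut timestamps (seconds),
    `duration` is the scene length in seconds. The 20 cuts/minute
    threshold is an illustrative guess, not a tuned value.
    """
    cuts_per_minute = len(shot_boundaries) / (duration / 60.0)
    if cuts_per_minute >= 20 and not has_dialogue:
        return "montage"
    if cuts_per_minute >= 20:
        return "high energy"
    return "low energy"

# A 60-second scene with 30 cuts and no dialogue reads as a montage;
# the same length with two cuts and dialogue reads as low energy.
print(scene_energy([t * 2.0 for t in range(30)], False, 60.0))  # montage
print(scene_energy([10.0, 40.0], True, 60.0))                   # low energy
```

Audio features like tempo would slot in the same way: extract a number per scene (e.g., beats per minute), pick a threshold, and feed it into the classification alongside cut rate and dialogue.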

Wanna see more?

Unlocking the emotional knowledge hidden within the world of cinema.