Moviegoer: Can a machine watch a movie?

Tim Lee
4 min read · Jul 6, 2020

This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

If we wanted to teach robots to understand human behavior, we would have them watch movies. Cinema is the medium that most closely approximates reality. At their simplest, films mimic basic human behavior. At their deepest, they convey emotional reactions to psychological experiences. But movies are incredibly difficult for a machine to interpret. Whether we realize it or not, there are many filmmaking conventions we take for granted (e.g. the passage of time between scenes, dramatic music over a conversation, montages). We humans understand how these shape a film, but can a robot?

The Moviegoer project has the lofty goal of unlocking the enormous wealth of emotional data within cinema by turning films into structured data. Rather than create a single, all-encompassing model, Moviegoer will be pieced together iteratively, with various co-reinforcing modules based on transfer learning or filmmaking domain knowledge. There's been much research in fields like emotional facial recognition and speech analysis, and there are plenty of pre-trained models freely available. But none of these can be applied to cinema until we can decode a movie's near-infinite possibilities into structured data.

Though portions of this project will be heavily reliant on pre-trained models and off-the-shelf libraries, I’ll be training original models and developing original algorithms, rooted in my expertise in filmmaking, cinematography, and film editing. (I’ve produced, directed, shot, and edited a feature-length movie, with another on the way.) Throughout this project, I’ll be providing visual explanations or examples from famous movies — I hope to enlighten the reader on various aspects of filmmaking as I justify my design decisions.

Roll Credits

Why movies?

Movies are chronologically linear, with clear cause and effect. Characters are established, emotional experiences happen to them, and they change. Consider a small-scale example: we see a character smiling, and two seconds later he's frowning. What happened in those two seconds? Someone said "I hate you" to him. We recognize this specific piece of dialogue as the antecedent, or stimulus, of his emotional change.

Though some films are freeform, open-ended, and experimental, many follow a specific recipe (think Acts 1, 2, and 3), and can be interpreted according to this formula. Interpreting a film as a holistic piece of work is a future goal, but for now, we’ll start smaller.

As a starting point, the project scope is limited to the two-character dialogue scene, the fundamental building block of nearly every film. From an information theory perspective, two-character dialogue scenes are very dense. No distractions, just two characters speaking and advancing the plot. In the future, advancements made in certain modules will pave the way for widening the scope beyond two-character dialogue scenes.
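
To make "structured data" a little more concrete, here's a hypothetical sketch of what a record for one two-character dialogue scene might look like. Every field name here is illustrative, not the project's actual schema:

```python
# Hypothetical schema for one two-character dialogue scene; field names are
# illustrative placeholders, not Moviegoer's actual data model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueLine:
    speaker: str           # "A" or "B"
    text: str              # subtitle text for this line
    start: float           # seconds from scene start
    end: float
    emotion: str = ""      # e.g. "happy", "angry", filled in by later analysis

@dataclass
class DialogueScene:
    start_frame: int
    end_frame: int
    characters: List[str] = field(default_factory=list)
    lines: List[DialogueLine] = field(default_factory=list)
```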

At the moment, there are three main modules, each focused on a specific task in turning films into structured data.

Shot Recognition — CNN Image Recognition

There are a handful of very common cinematography shot types used in most movies. Recognizing them can help identify types of scenes, or certain cause-and-effect beats. The first shot type recognized was the medium close-up, a shot commonly used in two-character dialogue scenes. We've trained a convolutional neural network (using an original, hand-labeled dataset of 11,000+ frames) to recognize these images.

Typical Medium Close-Up Shots
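
As a rough sketch of this kind of binary frame classifier (the framework and layer sizes here are illustrative, not the model actually trained for the project):

```python
# Minimal sketch of a medium-close-up vs. other-shot classifier, assuming
# frames have been extracted and resized to 224x224 RGB. Layer sizes are
# illustrative only.
from tensorflow.keras import layers, models

def build_mcu_classifier(input_shape=(224, 224, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # 1 = medium close-up, 0 = anything else
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```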

Scene Boundary Identification — Shot Clustering and Original Algorithm

Movies can be broken down into individual scenes, self-contained units of dialogue and action. In the current scope, we're looking to identify two-character dialogue scenes. These scenes have a distinct pattern: we see character A speak, then character B, back to A, then B, and so on. Using the VGG16 image recognition model and hierarchical agglomerative clustering (HAC), we've grouped frames into shots, and then created an original algorithm to look for these A/B/A/B shot patterns (assisted by the CNN image classifier above).

An A/B/A/B Shot Pattern
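
Here's a minimal sketch of the clustering step, with a toy alternation check standing in for the original pattern-finding algorithm. The distance threshold and helper names are placeholders:

```python
# Group sampled frames into shots via VGG16 features + hierarchical clustering,
# then crudely test for an A/B/A/B alternation. Thresholds are placeholders.
import numpy as np
from itertools import groupby
from scipy.cluster.hierarchy import linkage, fcluster
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def frame_features(frame_paths):
    batch = np.stack([
        preprocess_input(image.img_to_array(image.load_img(p, target_size=(224, 224))))
        for p in frame_paths
    ])
    return extractor.predict(batch)

def cluster_shots(frame_paths, distance_threshold=150.0):
    # One cluster label per frame; consecutive frames sharing a label form a shot.
    features = frame_features(frame_paths)
    tree = linkage(features, method="ward")
    return fcluster(tree, t=distance_threshold, criterion="distance")

def has_abab_pattern(frame_labels, min_shots=4):
    # Collapse per-frame labels into a shot sequence, then check whether it
    # alternates between exactly two clusters (A/B/A/B...).
    shots = [label for label, _ in groupby(frame_labels)]
    if len(set(shots)) != 2 or len(shots) < min_shots:
        return False
    return all(a != b for a, b in zip(shots, shots[1:]))
```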

Dialogue Attribution — Voice Clustering, Facial Analysis, and Subtitle Parsing

With scene boundaries identified, we can analyze individual scenes. The biggest task is dialogue attribution: determining which character is speaking. A scene contains three tracks of data: visual, audio, and subtitles. We need to tie the onscreen characters (faces) in the frames to the voices in the audio and to the written dialogue in the subtitles. We'll glean clues from each of the three data tracks to attribute dialogue.

Identifying Voices with Speaker Diarization
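
As a simplified illustration of the voice-clustering piece (mean MFCCs stand in for a proper speaker-embedding model, and the segment paths are placeholders):

```python
# Cluster speech segments from a scene's audio track into speaker groups.
# Mean MFCCs are a crude stand-in for a real speaker-embedding model.
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

def segment_embedding(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def cluster_speakers(segment_paths, n_speakers=2):
    embeddings = np.stack([segment_embedding(p) for p in segment_paths])
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
    return dict(zip(segment_paths, labels))  # e.g. {"seg_01.wav": 0, "seg_02.wav": 1, ...}
```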

Stay tuned for more updates! I’ll be posting regular updates on individual aspects of the project, diving deeper into both the data science and film theory that drive development.

