Moviegoer: Vision Features — Faces

Tim Lee
3 min read · Oct 12, 2020


This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

We’ve been using computer vision to draw conclusions about many aspects of a film’s visuals, but we can really learn a lot from faces. Faces contain lots of “data”, the most apparent being a character’s current emotional state. They also carry basic demographic information about characters (age, race, gender). We can even infer a few cinematography- and structure-related features from how big a face is, or where it sits in the frame. Let’s take a look at what we can learn.

In Plus One (2019), we can recognize every time Ben’s onscreen because he’s introduced himself

Face Encoding

We’ve previously figured out how to find self-introductions, like “I’m Ben”. When we read a self-introduction like this in the subtitles, we can generally assume that the onscreen face is Ben’s. Using the Python library face_recognition, we can save his facial encoding and recognize him whenever he appears onscreen.
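A minimal sketch of how this works with face_recognition, assuming we’ve already exported the relevant frames as image files (the file paths below are hypothetical):

```python
import face_recognition

# Frame where the subtitle says "I'm Ben"; assume the onscreen face is his
intro_frame = face_recognition.load_image_file("frames/ben_intro.jpg")
ben_encoding = face_recognition.face_encodings(intro_frame)[0]

# Any later frame: does Ben appear?
later_frame = face_recognition.load_image_file("frames/later_frame.jpg")
for encoding in face_recognition.face_encodings(later_frame):
    # compare_faces returns one boolean per known encoding
    if face_recognition.compare_faces([ben_encoding], encoding)[0]:
        print("Ben is onscreen")
```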

Face Clustering

Even if we don’t have a self-introduction, we can still identify when a unique face appears in multiple frames. Using hierarchical agglomerative clustering (HAC), with encodings generated via Keras/TensorFlow, we can cluster similar face encodings and find the frames in which each unique face appears. In the below example, we don’t know Alice, but we know that her face appears in about half of the frames of the scene, across from Ben. (He mentions her name several times during this scene, but more on that in a future NLP post.)

We don’t know who this is, but her face appears in about half of the scene (opposite Ben)
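Here’s a sketch of the clustering step using scikit-learn’s HAC implementation (the project’s actual stack may differ). The frame paths are hypothetical, and 0.6 is face_recognition’s conventional match tolerance, used here as the distance threshold:

```python
import numpy as np
import face_recognition
from sklearn.cluster import AgglomerativeClustering

frame_paths = ["frames/scene_001.jpg", "frames/scene_002.jpg"]  # hypothetical

# One encoding per detected face, tagged with the frame it came from
encodings, frame_ids = [], []
for i, path in enumerate(frame_paths):
    image = face_recognition.load_image_file(path)
    for encoding in face_recognition.face_encodings(image):
        encodings.append(encoding)
        frame_ids.append(i)

# HAC with a distance threshold rather than a fixed cluster count,
# since we don't know how many characters are in the scene
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, linkage="average"
).fit_predict(np.array(encodings))

for label in sorted(set(labels)):
    frames = sorted({frame_ids[i] for i, l in enumerate(labels) if l == label})
    print(f"face {label} appears in frames {frames}")
```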

Face Counting and Primary Character

We can count the number of faces found in a frame. We can also determine whether a face is the frame’s “primary character”. If there’s only one face, it’s the primary character. If there are multiple faces, we check their sizes: if any face is significantly larger than the others, we designate it the primary character of the frame.
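As a sketch (the size ratio below is an illustrative guess, not the project’s actual threshold):

```python
import face_recognition

def primary_character(image_path, size_ratio=1.5):
    """Return the box of the frame's primary character's face, or None."""
    image = face_recognition.load_image_file(image_path)
    boxes = face_recognition.face_locations(image)  # (top, right, bottom, left)
    if not boxes:
        return None
    areas = [(bottom - top) * (right - left)
             for top, right, bottom, left in boxes]
    ranked = sorted(zip(areas, boxes), reverse=True)
    if len(boxes) == 1 or ranked[0][0] >= size_ratio * ranked[1][0]:
        return ranked[0][1]
    return None  # multiple faces of similar size: no primary character
```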

Mirrored Shots (Shot/Reverse-Shot)

Generally, two-character conversations use the shot/reverse-shot model. We see a medium close-up of Character A on the left side of the screen, then we cut to a medium close-up of Character B on the right side of the screen. Then we cut back and forth between A and B. The shots are usually mirrored: Character A’s face is the same size as Character B’s. We can take advantage of this convention by looking for pairs of shots that feature two different characters with roughly equal face sizes, one at the left rule-of-thirds alignment point and the other at the right.

Compared to the above image, the camera has zoomed out, but equally for both characters: their faces are still mirrored, about the same distance from center and the same size
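A rough sketch of this mirrored-shot check, assuming one face per shot and treating x = 1/3 and x = 2/3 as the rule-of-thirds alignment points. The tolerances are illustrative, and confirming (via their encodings) that the two faces belong to different characters is omitted for brevity:

```python
import face_recognition

def face_position_and_size(image_path):
    """Horizontal center of the first detected face (as a fraction of
    frame width) and its area in pixels, or None if no face is found."""
    image = face_recognition.load_image_file(image_path)
    boxes = face_recognition.face_locations(image)
    if not boxes:
        return None
    top, right, bottom, left = boxes[0]
    return (left + right) / 2 / image.shape[1], (bottom - top) * (right - left)

def mirrored_pair(path_a, path_b, pos_tol=0.10, size_tol=0.25):
    """Do two shots look like a shot/reverse-shot pair?"""
    a, b = face_position_and_size(path_a), face_position_and_size(path_b)
    if a is None or b is None:
        return False
    (x_a, area_a), (x_b, area_b) = a, b
    # One face near the left third, the other near the right third
    on_thirds = (abs(x_a - 1 / 3) < pos_tol and abs(x_b - 2 / 3) < pos_tol) or \
                (abs(x_a - 2 / 3) < pos_tol and abs(x_b - 1 / 3) < pos_tol)
    # Roughly equal face sizes
    similar = abs(area_a - area_b) / max(area_a, area_b) < size_tol
    return on_thirds and similar
```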

Mouth Open

To assist with dialogue attribution, we can detect whether a character’s mouth is open.
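One simple heuristic, sketched here with face_recognition’s facial landmarks: compare the gap between the lips to the thickness of the top lip. The ratio is an illustrative guess, not a tuned value:

```python
import numpy as np
import face_recognition

def mouth_open(image_path, gap_ratio=0.6):
    """Heuristic: is the first detected face's mouth open?"""
    image = face_recognition.load_image_file(image_path)
    landmarks = face_recognition.face_landmarks(image)
    if not landmarks:
        return None
    top_lip = np.array(landmarks[0]["top_lip"])        # (x, y) points
    bottom_lip = np.array(landmarks[0]["bottom_lip"])
    lip_thickness = top_lip[:, 1].max() - top_lip[:, 1].min()
    # Gap between the bottom edge of the top lip and the top edge of
    # the bottom lip (y grows downward in image coordinates)
    lip_gap = bottom_lip[:, 1].min() - top_lip[:, 1].max()
    return lip_gap > gap_ratio * lip_thickness
```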

Emotion and Demography

The deepface library can automatically predict a face’s age, gender, race, and emotional state. This is very resource-intensive, so we’re using it sparingly for now.
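For reference, a minimal call, with a hypothetical face-crop path (deepface’s exact return format varies by version; around the time of writing, analyze returned a dict of predictions):

```python
from deepface import DeepFace

# Each action loads its own pretrained model, which is what makes this
# call expensive; passing a tight face crop keeps the input small
result = DeepFace.analyze(img_path="frames/face_crop.jpg",
                          actions=["age", "gender", "race", "emotion"])
print(result["age"], result["gender"],
      result["dominant_race"], result["dominant_emotion"])
```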

Wanna see more?
