This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).
As audiences, we go into films not knowing any of the characters. With the exception of documentaries and docu-dramatizations, there are no onscreen labels of character names. Instead, we have to pay attention to the dialogue, listening for mentions of character names. Luckily, all dialogue has already been transcribed in easily readable subtitles.
If a machine is to understand a film, it needs to process and remember character names. We can feed in the subtitle file and use named entity recognition (NER) to identify any potential names.
Dialogue-Based Namedrops
When writing a screenplay, writers understand that the audience needs to learn character names. Thus, a character's name is usually spoken aloud when he or she is introduced onscreen. This typically comes in the form of another character addressing them by name, like "good to see you again, Michael", or of the character introducing themselves, such as "my name is Jessica, and I'll be taking care of you tonight". Since these lines are dialogue, they're transcribed directly in the subtitle file.
We can use the NLP library spaCy to perform NER on all subtitle text, identifying named entities such as people, companies, organizations, and countries. For now, we'll use it to gather all possible character names, count the most common, and build a pool of candidate character names.
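Below is a minimal sketch of this step. The model name (en_core_web_sm) and the sample subtitle lines are illustrative assumptions, not Moviegoer's actual pipeline:

```python
from collections import Counter

import spacy

# Load a small English pipeline (assumed model; any spaCy English model works)
nlp = spacy.load("en_core_web_sm")

# Illustrative subtitle lines; in practice these come from the parsed .srt file
subtitle_lines = [
    "Good to see you again, Michael.",
    "My name is Jessica, and I'll be taking care of you tonight.",
    "Michael, the car's outside.",
]

# Run NER over every line and tally only PERSON entities,
# ignoring companies, organizations, countries, etc.
name_counts = Counter()
for doc in nlp.pipe(subtitle_lines):
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name_counts[ent.text] += 1

# The most frequently mentioned names form the pool of candidate characters
print(name_counts.most_common(10))
```

Counting frequency matters here: a protagonist's name gets dropped throughout the film, while a one-off mention ("say hi to Bob for me") shouldn't make the candidate pool.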
Off-Screen Speaker Labeling in Subtitles
We have one more source of character names in subtitles. Since subtitles are designed for the hard-of-hearing, they contain clarifications beyond the dialogue itself. Most subtitle lines are direct transcriptions of dialogue spoken by the onscreen character; this is readily apparent to the audience, since the onscreen character is moving his or her lips. However, when someone speaks from offscreen, this causes confusion. Hearing audiences can recognize the voice and identify the speaker, but hearing-impaired viewers don't have this luxury. So subtitles usually prefix offscreen dialogue with the speaker's name, in the form "AMY: Am I driving?".
We were able to identify these and compile them into a list in our previous post. They have the added bonus of definitively identifying the speaker, and speaker identification is a big hurdle of this project. Because these lines list the speaker, we can create a subset of dialogue lines we know were spoken by specific characters.
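Here's a hedged sketch of how those labels might be pulled out. The all-caps-name-followed-by-a-colon pattern is an assumption about common SDH formatting, and the helper name is hypothetical:

```python
import re

# Assumed SDH convention: an all-caps speaker name, a colon, then the dialogue
SPEAKER_LABEL = re.compile(r"^([A-Z][A-Z\s.'-]+):\s*(.+)$")

def labeled_dialogue(lines):
    """Return (speaker, line) pairs for subtitles that name an offscreen speaker."""
    pairs = []
    for line in lines:
        match = SPEAKER_LABEL.match(line.strip())
        if match:
            speaker, dialogue = match.groups()
            # Normalize "AMY" to "Amy" so it matches NER-extracted names
            pairs.append((speaker.strip().title(), dialogue))
    return pairs

print(labeled_dialogue(["AMY: Am I driving?", "No, I am."]))
# [('Amy', 'Am I driving?')]
```

Unlabeled lines simply fall through, so the output is exactly the subset of dialogue with a known speaker.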
Wanna see more?
- Repository: Moviegoer
- Project Intro: Can a Machine Watch a Movie?
- Previous Story: Subtitle Features — Data Cleaning
- Next Story: Subtitle Features — Subtitle Data Structures