This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).
Subtitles are not traditionally part of the filmmaking process. They do not appear when a film is screened in movie theaters, and are only added after production, localized into different languages for home video or streaming services. There is little creativity or decision-making in their creation; they're simply a ground-truth transcription of the spoken dialogue and descriptions of other audio. But that's the power of being able to read the subtitles: we can determine exactly what's being said onscreen.
We’ll eventually want to apply NLP (natural-language processing) to interpret these subtitles, but first we’ll need to process and clean them up.
Processing
Subtitle files can be extracted from a movie in the form of a .srt file. These are basically just text files, but with very strict formatting. Each subtitle has a unique ID, a start and end time (indicating that the text should be displayed from HH:MM:SS,mmm to HH:MM:SS,mmm), and one or two lines for the actual subtitle text to be displayed.
Since the formatting is so rigid, we can feed the .srt file into the Python library pysrt to automatically create objects for each subtitle. We can save these start and end timestamps for later, when we want to tie subtitle dialogue to actual onscreen action, but for now we’re just interested in the actual text.
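Here's a minimal sketch of that loading step using pysrt; the filename is a placeholder:

```python
import pysrt

# Parse the .srt file; each numbered block becomes a SubRipItem
subs = pysrt.open('movie.srt', encoding='utf-8')  # placeholder filename

for sub in subs:
    print(sub.index)           # the subtitle's unique ID
    print(sub.start, sub.end)  # SubRipTime timestamps, e.g. 00:01:42,520
    print(sub.text)            # one or two lines, separated by '\n'
```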
Line Separation and Concatenation
First, we’ll properly separate subtitles into individual lines. Remember that subtitle text is either one or two lines. This leads to three cases, each handled in the sketch after this list:
- One line: This is a single-line piece of dialogue or auditory description. This doesn’t require any cleaning.
- Two lines, one speaker: This is a piece of dialogue spoken by a single character that spans both lines. These should be concatenated.
- Two lines, two speakers: This is two separate speakers, each speaking a small piece of dialogue. These should be separated into two lines.
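Here's one way to sketch that logic. It assumes the common .srt convention that two-speaker subtitles prefix each speaker's line with a dash; the helper name is ours, not part of any library:

```python
def split_lines(text):
    """Separate one subtitle's text into individual dialogue lines.

    Assumes two-speaker subtitles mark each speaker with a leading dash,
    e.g. "- Hello." on one line and "- Hi there." on the next.
    """
    lines = text.split('\n')
    if len(lines) == 1:
        return lines                                          # single line: as-is
    if all(line.lstrip().startswith('-') for line in lines):
        return [line.lstrip('- ').strip() for line in lines]  # two speakers: separate
    return [' '.join(lines)]                                  # one speaker: concatenate
```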
Text Cleaning
Since subtitles are created for the hearing-impaired, they also convey non-dialogue information important to the film, such as a character laughing, or the identity of a character speaking from off-screen. We’ll want to parse each line and watch out for these: we’ll remove them for the purposes of data cleaning for NLP input, but make note of them for later analyses. Here are a few examples of what we’ll be cleaning, followed by a cleaning sketch after the list:
- Italics: An entire line is italicized, denoted by the HTML tags “<i>” and “</i>”. Italics are often used to designate narration, or someone speaking offscreen, such as over the phone. We’ll discard the HTML tags but keep the rest of the text.
- Music: Song lyrics begin and end with a music note, regardless of whether they’re diegetic (like characters singing karaoke) or non-diegetic (like music overlaid on a montage). We’ll discard all of these and not use them in NLP analysis.
- Parenthetical: Full-line parentheticals describe sound effects and non-dialogue sounds from characters, like “(grunting)”. We’ll remove these lines from the NLP input.
- Laughter: A character’s laughter is often included in the subtitles, as something like “(laughter)”. Fortunately, there are only a few laughter strings to look for, so we can just create a list containing phrases like “(chuckles)”, “(laughing)”, and “(laughter)”. We’ll remove these from the subtitle text but keep all the other dialogue.
- Speaker: When a character is speaking from off-screen, their name will be displayed with their subtitle text. When watching a film, we can recognize the voice of an off-screen speaker, but the hearing-impaired don’t have that luxury. We’ll remove the off-screen character’s name from the text, but we can save this name for later, since it directly tells us who’s speaking.
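Here's a cleaning sketch covering the cases above. The regexes and the speaker pattern (an all-caps “NAME:” prefix) are assumptions about typical .srt formatting, not a guaranteed standard:

```python
import re

LAUGHTER = ['(chuckles)', '(laughing)', '(laughter)']   # extend as needed

def clean_subtitle(text):
    """Strip non-dialogue markup from one subtitle line.

    Returns the NLP-friendly text plus the metadata we noted along
    the way. The "NAME:" speaker pattern is an assumed convention.
    """
    meta = {'music': False, 'laughter': False, 'speaker': None}

    if '♪' in text:                                     # song lyrics: flag and discard
        meta['music'] = True
        return '', meta

    text = re.sub(r'</?i>', '', text)                   # drop italics tags, keep text

    if re.fullmatch(r'\(.*\)', text.strip()):           # full-line parenthetical
        return '', meta

    for phrase in LAUGHTER:
        if phrase in text:
            meta['laughter'] = True
            text = text.replace(phrase, '')             # remove laughter, keep dialogue

    match = re.match(r'([A-Z][A-Z .\-]+):\s*(.*)', text)  # off-screen speaker name
    if match:
        meta['speaker'] = match.group(1).strip()        # save who's speaking for later
        text = match.group(2)

    return text.strip(), meta
```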
DataFrame
With the text cleaned and various supplementary information gathered (speaker names, yes/no flags for music and laughter, etc.), we can populate a DataFrame containing the original text, start/end times, NLP-friendly text, and more.
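Putting it together, here's a rough assembly loop built on the hypothetical helpers above; the column names are illustrative, not the project's actual schema:

```python
import pandas as pd

rows = []
for sub in subs:                                          # parsed earlier with pysrt
    for line in split_lines(sub.text):
        cleaned, meta = clean_subtitle(line)
        rows.append({'original_text': line,
                     'start_time': sub.start.to_time(),   # SubRipTime -> datetime.time
                     'end_time': sub.end.to_time(),
                     'cleaned_text': cleaned,
                     'music': meta['music'],
                     'laughter': meta['laughter'],
                     'speaker': meta['speaker']})

subtitle_df = pd.DataFrame(rows)
```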
Wanna see more?
- Repository: Moviegoer
- Project Intro: Can a Machine Watch a Movie?
- Previous Story: Audio Features — Sound Effects
- Next Story: Subtitle Features — Character Enumeration