Moviegoer: Subtitle Features — Data Cleaning

This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

Subtitles are not traditionally part of the filmmaking process. They don't appear when a film is screened in theaters; they're added after production and localized into different languages for home video or streaming services. There's little creativity or decision-making in their creation: they're simply a ground-truth transcription of the spoken dialogue, plus descriptions of other audio. But that's exactly their power: by reading the subtitles, we can determine exactly what's being said onscreen.

We’ll eventually want to apply NLP (natural-language processing) to interpret these subtitles, but first we’ll need to process and clean them up.

In “Booksmart” (2019), subtitles convey both dialogue and unspoken character sounds, like a chuckle


Subtitle files can be extracted from a movie in the form of a .srt file. These are basically just text files, but with very strict formatting. Each subtitle has a unique ID, a start and end timestamp (indicating when the text should appear and disappear, in HH:MM:SS,MIL format), and one or two lines of actual subtitle text to be displayed.
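An illustrative (made-up) fragment of a .srt file shows the structure: ID, timestamp range, then the text.

```
12
00:01:42,337 --> 00:01:44,589
- You coming?
- In a minute.

13
00:01:45,215 --> 00:01:46,800
(chuckles)
```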

A few subtitles

Since the formatting is so rigid, we can feed the .srt file into the Python library pysrt to automatically create objects for each subtitle. We can save these start and end timestamps for later, when we want to tie subtitle dialogue to actual onscreen action, but for now we’re just interested in the actual text.
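pysrt handles all of this parsing for us, but to see what's involved, here's a minimal stdlib-only sketch of parsing that rigid format (the regex and function names are my own, not pysrt's API):

```python
import re

# Matches one .srt block: numeric ID, "start --> end" timestamps,
# then one or two lines of subtitle text, ending at a blank line.
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.+?)(?:\n\n|\Z)",
    re.DOTALL,
)

def parse_srt(raw_text):
    """Return a list of dicts with id, start, end, and text for each subtitle."""
    subs = []
    for match in SRT_BLOCK.finditer(raw_text):
        sub_id, start, end, text = match.groups()
        subs.append({
            "id": int(sub_id),
            "start": start,
            "end": end,
            "text": text.strip(),
        })
    return subs

sample = """1
00:01:42,337 --> 00:01:44,589
- You coming?
- In a minute.

2
00:01:45,215 --> 00:01:46,800
(chuckles)
"""

subs = parse_srt(sample)
```

In practice, `pysrt.open()` gives us the same information as full-fledged objects, with the timestamps already parsed into comparable time values.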

Line Separation and Concatenation

First, we'll properly separate subtitles into individual lines. Remember that a subtitle's text is either one or two lines. This leads to three cases:

- a single line, which can be taken as-is
- two lines spoken by the same character, which should be concatenated into one line
- two lines spoken by two different characters (conventionally each prefixed with a dash), which should be separated into individual lines

Two characters speaking in a single two-line subtitle
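The three-case logic above can be sketched as follows (function name and dash heuristic are mine; dash-prefixed speaker turns are a common subtitling convention, not a formal part of the .srt spec):

```python
def split_subtitle_lines(text):
    """Split a subtitle's text into one utterance per speaker."""
    lines = text.split("\n")
    if len(lines) == 1:
        return lines  # case 1: single line, keep as-is
    if all(line.lstrip().startswith("-") for line in lines):
        # case 2: two speakers; strip the leading dashes and separate
        return [line.lstrip().lstrip("-").strip() for line in lines]
    # case 3: one speaker's sentence wrapped across two lines
    return [" ".join(line.strip() for line in lines)]
```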

Text Cleaning

Since subtitles are also created for the hearing-impaired, they convey non-dialogue information important to the film, such as a character laughing, or the identity of a character speaking from off-screen. We'll want to parse each line and watch out for these: we'll remove them for the purposes of data cleaning for NLP input, but make note of them for later analyses. Here are a few examples of what we'll be cleaning:

In Lost in Translation (2003), the audience isn’t supposed to understand the Japanese-language dialogue. These are all replaced with parentheticals, which we can ignore for NLP purposes.
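A cleaning pass like this might look as follows. The patterns here are illustrative, not exhaustive: real subtitle files vary in how they mark speakers, sounds, and music.

```python
import re

def clean_line(line):
    """Strip non-dialogue annotations from a subtitle line, returning
    NLP-friendly text plus the speaker name and flags for what was removed."""
    flags = {
        "laughter": bool(re.search(r"\((?:chuckles|laughs|laughter)\)", line, re.I)),
        "music": "♪" in line,
    }
    # off-screen speaker labels like "CHARLOTTE:" at the start of a line
    speaker = None
    m = re.match(r"^([A-Z][A-Z .']+):\s*", line)
    if m:
        speaker = m.group(1).title()
        line = line[m.end():]
    # parenthetical/bracketed sound descriptions and music notes
    line = re.sub(r"[\(\[][^)\]]*[\)\]]", "", line)
    line = line.replace("♪", "")
    return line.strip(), speaker, flags
```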


With the text cleaned and the supplementary information noted (speaker names, y/n flags for music or laughter, etc.), we can populate a DataFrame containing the original text, start/end times, NLP-friendly text, and more.
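A sketch of that final assembly, with illustrative rows and column names of my own choosing:

```python
import pandas as pd

# Illustrative rows; in practice each row comes from a parsed, cleaned subtitle.
rows = [
    {"start": "00:01:42,337", "end": "00:01:44,589",
     "original_text": "- You coming?", "nlp_text": "You coming?",
     "speaker": None, "laughter": False, "music": False},
    {"start": "00:01:45,215", "end": "00:01:46,800",
     "original_text": "(chuckles)", "nlp_text": "",
     "speaker": None, "laughter": True, "music": False},
]

subtitle_df = pd.DataFrame(rows)
```

Keeping both `original_text` and `nlp_text` side by side lets us feed clean input to NLP models while still being able to trace each result back to what was actually displayed onscreen.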
