Moviegoer: Subtitle Features — Subtitle Data Structures

Tim Lee
3 min read · Sep 27, 2020


This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

Movie subtitles are simple, primarily consisting of an index/ID number, one or two lines of text, a start time, and an end time. But we can organize them into structured data, so that a machine can parse and read them to try to understand a movie.
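For reference, here’s a minimal sketch of a single .srt cue and one way to parse it. The sample dialogue and the regex are illustrative assumptions, not the project’s actual parsing code.

```python
import re

# An .srt cue: an index, start/end timestamps, and one or two lines of text.
# The dialogue below is invented; two-speaker cues conventionally prefix
# each line with a dash.
raw_cue = """1
00:00:12,345 --> 00:00:15,678
- Are you hungry?
- A little."""

CUE_PATTERN = re.compile(
    r"(?P<index>\d+)\s*\n"
    r"(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(?P<text>.+)",
    re.DOTALL,
)

cue = CUE_PATTERN.match(raw_cue).groupdict()
print(cue["index"], cue["start"], cue["end"])  # 1 00:00:12,345 00:00:15,678
print(cue["text"])
```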

We’ve organized subtitles into three data structures for the purposes of conducting analyses rooted in film theory and NLP (natural language processing). The goal is to pass in a subtitle .srt file and have these Pandas dataframes automatically populated with subtitle data. Let’s see an example from Before Sunrise (1995), a movie consisting entirely of naturalistic conversations between two characters falling in love.

Basic Extract

The first dataframe extracts the index number, text, and start/end times. We also do some very minor parsing of the subtitle text. A subtitle may have two lines (separated by a newline character), which might both be spoken by the same character, or by two different characters. In the first case, we combine the two into one piece of text by removing the newline character. If the two lines are independent, attributed to two separate characters, we break them into a top line and a bottom line. In Before Sunrise, for example, the first subtitle has two lines of text, each spoken by a different character.
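Here’s a rough sketch of how that splitting could populate the dataframe. The column names and the leading-dash heuristic for two-speaker cues are assumptions for illustration, not the project’s actual schema.

```python
import pandas as pd

def split_cue_text(text):
    """Split a cue's raw text into (full_text, top_line, bottom_line)."""
    lines = text.split("\n")
    if len(lines) == 2 and all(l.lstrip().startswith("-") for l in lines):
        # Two speakers: keep the lines separate as top/bottom.
        return None, lines[0].lstrip("- ").strip(), lines[1].lstrip("- ").strip()
    # One speaker: join the wrapped lines into a single piece of text.
    return " ".join(lines), None, None

# Invented cues for illustration.
cues = [
    {"index": 1, "start": "00:00:12,345", "end": "00:00:15,678",
     "text": "- Are you hungry?\n- A little."},
    {"index": 2, "start": "00:00:16,000", "end": "00:00:18,500",
     "text": "There's a cafe\nat the end of the car."},
]

rows = []
for cue in cues:
    full_text, top_line, bottom_line = split_cue_text(cue["text"])
    rows.append({"index": cue["index"], "start": cue["start"], "end": cue["end"],
                 "text": full_text, "top_line": top_line, "bottom_line": bottom_line})

subtitle_df = pd.DataFrame(rows)
print(subtitle_df[["index", "text", "top_line", "bottom_line"]])
```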

Dialogue, Descriptions, and Actions

Next, we analyze each line individually. Note that splitting two-speaker subtitles has resulted in more rows than the original subtitle file has, so the index numbers no longer align. We can do non-NLP analyses here, such as identifying speaker names (speaker names are usually included only when the speaker is offscreen). We can also look for non-speech cues like laughter. Speaker names and laughter cues are eventually removed, so that we’re left with just dialogue.
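A hedged sketch of that cleanup step: the patterns below assume speaker names appear as an all-caps “NAME:” prefix and non-speech sounds appear in brackets, which are common subtitle conventions but not necessarily the project’s exact heuristics.

```python
import re

# Offscreen speaker names often appear as "NAME:" at the start of a line;
# non-speech sounds like laughter often appear in parentheses or brackets.
SPEAKER_PATTERN = re.compile(r"^[A-Z][A-Z .]*:\s*")
NONSPEECH_PATTERN = re.compile(r"[\(\[].*?[\)\]]")

def clean_line(line):
    speaker_match = SPEAKER_PATTERN.match(line)
    speaker = speaker_match.group(0).rstrip(": ") if speaker_match else None
    laughter = bool(re.search(r"laugh", line, re.IGNORECASE))
    # Strip the speaker tag and any bracketed sounds, leaving pure dialogue.
    cleaned = NONSPEECH_PATTERN.sub("", SPEAKER_PATTERN.sub("", line)).strip()
    return {"speaker": speaker, "laughter": laughter, "dialogue": cleaned}

print(clean_line("CELINE: (laughs) That's a good idea."))
# {'speaker': 'CELINE', 'laughter': True, 'dialogue': "That's a good idea."}
```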

Sentences

With cleaned dialogue, we can then feed each piece of text into spaCy, an NLP library, and conduct sentence boundary detection: the process of automatically breaking the text down into individual sentences. This is helpful when a long sentence spans multiple subtitles, or when a single subtitle contains multiple sentences. This dataframe has one row per sentence, so we can conduct NLP analyses.
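A minimal sketch of that segmentation step, assuming the standard en_core_web_sm model; the joined dialogue string is invented for illustration.

```python
import spacy

# Load a small English pipeline (any spaCy model with a parser or
# sentencizer can detect sentence boundaries).
nlp = spacy.load("en_core_web_sm")

# Dialogue joined from consecutive subtitles, so a sentence split across
# two cues is reassembled before segmentation.
dialogue = ("This first sentence was split across two subtitles. "
            "This second one fit inside a single subtitle.")

doc = nlp(dialogue)
for sent in doc.sents:
    print(sent.text)
# This first sentence was split across two subtitles.
# This second one fit inside a single subtitle.
```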

As an example, we can identify an instance of a self-introduction. Celine has just given her name, and we can recognize this as a self-introduction because spaCy has identified “Celine” as a proper noun. (We don’t want to be confused by a sentence like “I’m hungry.”) This is a key piece of information, because we can now tie her name to her onscreen face and voice encodings. We’ll be able to identify her face and voice as belonging to Celine throughout the entire film.
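A hedged sketch of that check, using spaCy’s part-of-speech tags; the detect_self_introduction helper is hypothetical, not the project’s actual function.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def detect_self_introduction(sentence):
    """Return the introduced name if the sentence looks like "I'm <Name>"."""
    doc = nlp(sentence)
    for i, token in enumerate(doc):
        # Look for a first-person copula ("I'm", "I am") followed by a token
        # that spaCy tags as a proper noun (PROPN).
        if (token.lemma_ == "be" and i > 0 and doc[i - 1].lower_ == "i"
                and i + 1 < len(doc) and doc[i + 1].pos_ == "PROPN"):
            return doc[i + 1].text
    return None

print(detect_self_introduction("I'm Celine."))  # Celine
print(detect_self_introduction("I'm hungry."))  # None
```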

