This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).
As audiences, we go into films not knowing any of the characters. With the exception of documentaries and docu-dramatizations, there are no onscreen labels of character names. Instead, we have to pay attention to the dialogue, listening for mentions of character names. Luckily, all dialogue has already been transcribed in easily readable subtitles.
If a machine is to understand a film, it needs to process and remember character names. We can feed in the subtitle file and use named entity recognition (NER) to identify any potential names.
Dialogue-Based Namedrops
When writing a screenplay, writers understand that the audience needs to learn character names. Thus, a character's name is usually spoken aloud when he or she is introduced onscreen. This typically comes in the form of another character addressing them by name, like "good to see you again, Michael", or of the character introducing themselves, such as "my name is Jessica, and I'll be taking care of you tonight". Since these lines are dialogue, they're transcribed directly in the subtitle file.
We can use the NLP library spaCy to perform NER on all subtitle text, identifying named entities such as people, companies, organizations, and countries. For now, we'll use it to gather all possible character names, count the most common, and build a pool of candidate character names.
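Below is a minimal sketch of this step. The model name (en_core_web_sm) and the sample subtitle lines are illustrative assumptions, not Moviegoer's actual pipeline:

```python
from collections import Counter

import spacy

# Load a small English pipeline (assumed model; any spaCy English model works)
nlp = spacy.load("en_core_web_sm")

# Illustrative subtitle lines; in practice these come from the parsed .srt file
subtitle_lines = [
    "Good to see you again, Michael.",
    "My name is Jessica, and I'll be taking care of you tonight.",
    "Michael, the car's outside.",
]

# Run NER over every line and tally only PERSON entities,
# ignoring companies, organizations, countries, etc.
name_counts = Counter()
for doc in nlp.pipe(subtitle_lines):
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name_counts[ent.text] += 1

# The most frequently mentioned names form the pool of candidate characters
print(name_counts.most_common(10))
```

Counting frequency matters here: a protagonist's name gets dropped throughout the film, while a one-off mention ("say hi to Bob for me") shouldn't make the candidate pool.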
Off-Screen Speaker Labeling in Subtitles
We have one more source of character names in subtitles. Since subtitles are designed for the hard-of-hearing, they contain clarifications beyond the dialogue itself. Most subtitle lines are direct transcriptions of dialogue spoken by the onscreen character; this is readily apparent to the audience, since the onscreen character is moving his or her lips. However, when someone speaks from offscreen, this causes confusion. Hearing audiences can recognize the voice and identify the speaker, but hearing-impaired viewers don't have this luxury. So subtitles usually prefix offscreen dialogue with the speaker's name, in the form "AMY: Am I driving?".
We were able to identify these and compile them into a list in our previous post. They have the added bonus of definitively identifying the speaker, and speaker identification is a big hurdle of this project. Because these lines list the speaker, we can create a subset of dialogue lines we know were spoken by specific characters.
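Here's a hedged sketch of how those labels might be pulled out. The all-caps-name-followed-by-a-colon pattern is an assumption about common SDH formatting, and the helper name is hypothetical:

```python
import re

# Assumed SDH convention: an all-caps speaker name, a colon, then the dialogue
SPEAKER_LABEL = re.compile(r"^([A-Z][A-Z\s.'-]+):\s*(.+)$")

def labeled_dialogue(lines):
    """Return (speaker, line) pairs for subtitles that name an offscreen speaker."""
    pairs = []
    for line in lines:
        match = SPEAKER_LABEL.match(line.strip())
        if match:
            speaker, dialogue = match.groups()
            # Normalize "AMY" to "Amy" so it matches NER-extracted names
            pairs.append((speaker.strip().title(), dialogue))
    return pairs

print(labeled_dialogue(["AMY: Am I driving?", "No, I am."]))
# [('Amy', 'Am I driving?')]
```

Unlabeled lines simply fall through, so the output is exactly the subset of dialogue with a known speaker.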
Wanna see more?
- Repository: Moviegoer
- Project Intro: Can a Machine Watch a Movie?
- Previous Story: Subtitle Features — Data Cleaning
- Next Story: Subtitle Features — Subtitle Data Structures