This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).
Now that we can extract features from the visual, audio, and subtitle tracks, we can use them in tandem to gather some important information about movies. This is a small example of what we can accomplish.
We want to create a composite “average” face encoding based on the times a character introduces himself with a line like “I’m Ben” or “My name is Ben”. We’ll search for these self-introductions, gather the frames where he appears onscreen as he’s introducing himself, calculate his facial encoding (a numerical representation of what he looks like) in each, and then average these into a composite encoding of Ben. With this, we can identify him throughout the film.
We can generate a list of potential characters based on the number of times they’re mentioned in the dialogue or listed as offscreen speakers in subtitle clarifications. Using this list, we can reasonably infer the main characters.
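As a rough sketch of how this candidate list might be assembled (assuming the subtitles have already been parsed into plain-text lines; the function name, the minimum-mention threshold, and the caps-followed-by-a-colon speaker convention are all assumptions for illustration):

```python
import re
from collections import Counter

def candidate_characters(subtitle_lines, min_mentions=3):
    """Count how often each name is mentioned in dialogue or appears as an
    offscreen-speaker label (e.g. "BEN: Hello."), and return frequent names."""
    mentions = Counter()
    for line in subtitle_lines:
        # Offscreen speakers are often labelled in caps followed by a colon.
        speaker_tag = re.match(r"^([A-Z]{2,}):", line)
        if speaker_tag:
            mentions[speaker_tag.group(1).title()] += 1
        # Naively treat capitalized words as possible name mentions.
        for word in re.findall(r"\b[A-Z][a-z]+\b", line):
            mentions[word] += 1
    return [name for name, count in mentions.most_common() if count >= min_mentions]
```

This is deliberately naive (sentence-initial words get counted too), but a frequency cutoff like this is enough to surface the main characters.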
With this list, we can search through the dialogue for any time a character introduces themselves. Once we have the timestamps of these self-introductions, we can calculate the (visual) frames during which these lines of dialogue are spoken. In this example, Ben introduces himself several times throughout the film, so we have a few frames to check.
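A hedged sketch of that search, assuming the subtitles are available as (start time in seconds, text) pairs and a constant frame rate; the pattern list and function name are illustrative, not the project’s exact implementation:

```python
import re

INTRO_PATTERNS = [r"\bI'?m {name}\b", r"\bmy name is {name}\b"]

def self_intro_frames(subtitles, name, fps=24.0):
    """Find subtitle lines where `name` introduces themself and convert each
    subtitle's start time into a frame number."""
    frames = []
    for start_seconds, text in subtitles:
        for pattern in INTRO_PATTERNS:
            if re.search(pattern.format(name=name), text, flags=re.IGNORECASE):
                frames.append(int(start_seconds * fps))
                break
    return frames
```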
From each frame, we collect the facial encodings. However, a usable encoding isn’t guaranteed: sometimes Ben’s face will be obscured, or he won’t even be onscreen. So we’ll compare each of these facial encodings to one another. If a majority of the encodings (roughly) match one another, we’ll consider those to be accurate representations of Ben’s face.
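Something like the following could perform that majority check, assuming the face_recognition library (which produces 128-dimensional encodings) and at most one prominent face per candidate frame; the 0.6 tolerance is a common default rather than a project-specific setting:

```python
import face_recognition
import numpy as np

def encodings_matching_majority(frame_images, tolerance=0.6):
    """Compute a face encoding for each candidate frame, then keep only the
    encodings that (roughly) match a majority of the others."""
    encodings = []
    for image in frame_images:
        found = face_recognition.face_encodings(image)
        if found:                       # face may be obscured or absent
            encodings.append(found[0])
    keep = []
    for candidate in encodings:
        distances = face_recognition.face_distance(np.array(encodings), candidate)
        matches = np.sum(distances <= tolerance) - 1    # exclude self-match
        if matches >= len(encodings) // 2:
            keep.append(candidate)
    return keep
```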
From there, we can take an average of those encodings, to create a composite representation of what Ben looks like. We can then compare this encoding to all of the faces in the film, identifying when Ben appears onscreen.
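Averaging and matching could then look roughly like this; again, the face_recognition calls and the tolerance value are assumptions for the sake of a runnable sketch:

```python
import face_recognition
import numpy as np

def build_composite(encodings):
    """Average the verified encodings into a single composite encoding."""
    return np.mean(encodings, axis=0)

def frames_with_character(composite, frame_images, tolerance=0.6):
    """Return the indices of frames containing a face that matches the composite."""
    appearances = []
    for index, image in enumerate(frame_images):
        for encoding in face_recognition.face_encodings(image):
            if face_recognition.face_distance([composite], encoding)[0] <= tolerance:
                appearances.append(index)
                break
    return appearances
```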
This exercise required lots of analysis of individual frames, looking for Ben’s face in each. This was somewhat computationally expensive, since we recalculated facial encodings every time we wanted to look for his face. It would be much easier if we could calculate each frame’s encodings once and save that data to be looked up later. That will be the next focus of effort: data serialization.
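A minimal sketch of what that lookup might eventually look like, using pickle purely for illustration; the file name and data structure are placeholders, not the project’s actual serialization scheme:

```python
import pickle

def save_encodings(encodings_by_frame, path="film_encodings.pkl"):
    """Serialize per-frame encodings so later lookups skip recomputation."""
    with open(path, "wb") as f:
        pickle.dump(encodings_by_frame, f)

def load_encodings(path="film_encodings.pkl"):
    """Load previously computed encodings instead of re-running face detection."""
    with open(path, "rb") as f:
        return pickle.load(f)
```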
Wanna see more?
- Repository: Moviegoer
- Project Intro: Can a Machine Watch a Movie?
- Previous Story: Vision Features — Faces
- Next Story: Data Serialization