This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).
Attributing Dialogue by Analyzing Frames, Audio, and Subtitles
With scenes identified, we can begin to analyze what’s in those scenes. The biggest task is dialogue attribution: determining which character speaks each line of dialogue. Since the subtitle file contains the actual dialogue, we don’t need to conduct any sort of speech-to-text transcription.
To establish the characters, we can independently cluster face encodings and voice encodings. But we don’t have a way to automatically tie the two together. We must use various features in the three tracks of data (frames, audio, and subtitles) to link faces to voices. Only then can we attribute dialogue from the subtitles to these characters. There are two challenges: the subtitles do not indicate which character is speaking, and each stream of data uses a separate timestamp system.
Face and Voice Clustering
We need to identify the two characters in the scene by their faces and voices.
For facial clustering, we use the face_recognition library to create the face encodings, then AgglomerativeClustering from the scikit-learn library to cluster them. We separate them into faces A and B.
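Here’s a minimal sketch of that step, assuming a handful of frame images saved to disk and scikit-learn’s AgglomerativeClustering; the file names and the two-cluster assumption are illustrative:

```python
import face_recognition
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Gather one 128-dimensional encoding per face detected in each sampled frame
encodings = []
for path in ["frame_0100.jpg", "frame_0250.jpg", "frame_0400.jpg"]:  # hypothetical frames
    image = face_recognition.load_image_file(path)
    encodings.extend(face_recognition.face_encodings(image))

# Cluster the encodings into the scene's two primary characters: face A and face B
clusterer = AgglomerativeClustering(n_clusters=2)
face_labels = clusterer.fit_predict(np.array(encodings))  # 0 -> face A, 1 -> face B
```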
For voices, we use the pyAudioAnalysis library to perform speaker diarization. Diarization breaks the audio track into a list of who-spoke-when (or who-spoke-last). These voices are separated into voices M and N.
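A sketch of the diarization call, assuming a recent pyAudioAnalysis release where the function is named speaker_diarization (older releases call it speakerDiarization); the filename is illustrative:

```python
from pyAudioAnalysis import audioSegmentation as aS

# Diarize the scene's audio into two speakers: voice M and voice N.
# The result includes one speaker label per short analysis window, so we can
# later look up which voice is active at any point in the scene.
diarization = aS.speaker_diarization("scene_audio.wav", n_speakers=2)
```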
Visual, Audio, and Subtitle Attribution Flags
Though we have faces onscreen and voices on the audio track, we don’t yet know how they correspond. To tie them together, we need to identify the frames in which someone is speaking. We’ve assembled a series of flags that serve as clues (a sketch combining them follows the list):
- Visual, Mouth Open: When a primary character is onscreen, we can reasonably assume they’re speaking if their mouth is open.
- Audio, Audible Sound: A character can only be speaking if there isn’t a period of silence.
- Subtitle, Subtitle Onscreen: A character can only be speaking if there are subtitles onscreen.
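A sketch of how these three flags might be computed, using face_recognition’s lip landmarks for the visual check; the thresholds, helper names, and single-face assumption are all illustrative:

```python
import face_recognition
import numpy as np

def mouth_open_flag(frame_image, gap_ratio=0.04):
    """Visual flag: is the first detected face's mouth open in this frame?"""
    locations = face_recognition.face_locations(frame_image)
    landmarks = face_recognition.face_landmarks(frame_image, locations)
    if not landmarks:
        return False
    top_lip = np.array(landmarks[0]["top_lip"])
    bottom_lip = np.array(landmarks[0]["bottom_lip"])
    lip_gap = bottom_lip[:, 1].mean() - top_lip[:, 1].mean()  # image y grows downward
    top, right, bottom, left = locations[0]
    return lip_gap > gap_ratio * (bottom - top)  # gap relative to face height

def audible_sound_flag(audio_window, rms_threshold=0.01):
    """Audio flag: does this window of float samples contain audible sound (not silence)?"""
    return np.sqrt(np.mean(np.square(audio_window))) > rms_threshold

def subtitle_onscreen_flag(frame_time, subtitle_spans):
    """Subtitle flag: does any subtitle's (start, end) span cover this timestamp?"""
    return any(start <= frame_time <= end for start, end in subtitle_spans)
```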
Speaker Identification
We need to tie the faces and voices together, either as (Face A to Voice M and Face B to Voice N) or (Face A to Voice N and Face B to Voice M). For each scenario, we count the frames that support it (frames where all three of the above attribution flags are true) and pick the pairing with more supporting frames.
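Here’s a sketch of that tally, assuming each frame has already been annotated with its three flags, the face onscreen, and the diarized voice active at that moment; the frame-record format is illustrative:

```python
from collections import Counter

def match_faces_to_voices(frames):
    """Return the face-to-voice pairing supported by the most frames.
    `frames` is a list of dicts with keys: mouth_open, audible, subtitled,
    onscreen_face ('A' or 'B'), and active_voice ('M' or 'N')."""
    votes = Counter()
    for f in frames:
        if f["mouth_open"] and f["audible"] and f["subtitled"]:
            # This frame is evidence that the onscreen face owns the active voice
            votes[(f["onscreen_face"], f["active_voice"])] += 1
    scenario_1 = votes[("A", "M")] + votes[("B", "N")]  # A<->M and B<->N
    scenario_2 = votes[("A", "N")] + votes[("B", "M")]  # A<->N and B<->M
    return {"A": "M", "B": "N"} if scenario_1 >= scenario_2 else {"A": "N", "B": "M"}
```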
Subtitle Attribution
After tying together the faces and voices, we can attribute the written dialogue from the subtitle file to either character. The challenge here lies in the fact that the subtitle file doesn’t designate the speaker; it’s up to the audience to infer who’s speaking. We can use what we learned during speaker diarization to figure out which character is speaking when, and then attribute each line to the correct character. This requires a number of functions related to time, because of the difference between the frame and subtitle time systems, and the need for a time offset in case the subtitles and audio track don’t perfectly align.
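A sketch of the timing glue, with hypothetical helper names; the frame rate, diarization window step, and midpoint-lookup approach are assumptions:

```python
def frame_to_seconds(frame_number, fps=23.976):
    """Place a frame index on the same clock as the audio (fps value is illustrative)."""
    return frame_number / fps

def srt_time_to_seconds(hours, minutes, seconds, milliseconds):
    """Convert an SRT-style HH:MM:SS,mmm timestamp to seconds."""
    return hours * 3600 + minutes * 60 + seconds + milliseconds / 1000

def attribute_subtitle(sub_start, sub_end, voice_labels, window_step, voice_to_face, offset=0.0):
    """Attribute one subtitle line to a character.
    Shifts the subtitle's span by `offset` seconds (in case subtitles and audio
    don't perfectly align), looks up the diarization label active at the span's
    midpoint, then maps that voice to the face it was paired with."""
    midpoint = (sub_start + sub_end) / 2 + offset
    voice = voice_labels[int(midpoint / window_step)]
    return voice_to_face[voice]
```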
Wanna see more?
- Repository: Moviegoer
- Project Intro: Can a Machine Watch a Movie?
- Previous Story: Scene Boundary Partitioning
- Next Story: Four Categories of Comprehension