Moviegoer — Scene Boundary Identification

Tim Lee
Dec 14, 2020

This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

Without any structure, a film is just a collection of a few thousand frames of images and a very long audio track. Conversations bleed into one another, characters appear and disappear without reason, and we teleport from one location to the next. We can begin to organize a film by dividing it into individual scenes. We'll use Lost in Translation (2003) as an example.

To start, we'll just be identifying two-character dialogue scenes. These are the most basic building blocks of films: just two characters speaking together with no distractions, purely advancing the plot with their dialogue. In modern filmmaking, these scenes are usually shot in a specific manner. We can take advantage of this by looking for specific patterns of shots.

The A/B/A/B Pattern

Two-character dialogue scenes usually follow a very distinct pattern. Character A speaks, then Character B, then back to A, then back to B, etc. We cut back and forth between the two characters.

We look for these two anchor shots, the shots of the two characters that form the A/B/A/B pattern. We'll be looking through every frame in the film, trying to find instances of this A/B/A/B pattern. We have a few existing dataframes to use: we've previously clustered similar frames into "shots". For example, all of the frames with Charlotte are grouped into a single shot cluster.

Anchor Shot A: Charlotte on the left

We have another unique shot of Bob.

Anchor Shot B: Bob on the right
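To make the search concrete, here's a minimal sketch of the pattern scan, assuming the film has already been reduced to an ordered list of shot-cluster IDs. The function name and cluster numbers below are illustrative, not Moviegoer's actual schema.

```python
def find_abab_patterns(shot_clusters):
    """Yield (start_index, cluster_a, cluster_b) for every run of four
    consecutive shots that alternate between two clusters: A/B/A/B."""
    for i in range(len(shot_clusters) - 3):
        a, b = shot_clusters[i], shot_clusters[i + 1]
        if a != b and shot_clusters[i + 2] == a and shot_clusters[i + 3] == b:
            yield i, a, b

# Hypothetical shot sequence: cluster 3 could be Charlotte's shot, 12 Bob's
shots = [7, 3, 12, 3, 12, 5]
print(list(find_abab_patterns(shots)))   # [(1, 3, 12)]
```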

We've located an A/B/A/B pattern: four shots where A and B alternate. The next step in the workflow is checking for faces, making sure that shot A has a face on the left and shot B has a face on the right (or the reverse; the assignment is arbitrary). These two shots have a face on the left and a face on the right, so they pass the test.
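The face-position test might look something like the sketch below. It assumes, hypothetically, that a prior face-detection pass gives us the horizontal center of the largest face in each anchor shot, normalized across the frame width.

```python
def faces_oppose(face_x_a, face_x_b, center=0.5):
    """The two anchor shots should frame their faces on opposite halves of
    the screen: one left of center, the other right (either way around)."""
    return (face_x_a < center) != (face_x_b < center)

# Hypothetical normalized face centers: Charlotte framed left, Bob framed right
print(faces_oppose(0.31, 0.68))   # True, so this A/B pair passes the test
```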

Expanding the Scene

We can expand the scene by checking for additional appearances of these two anchor shots nearby. Currently, our scene consists of only the A/B/A/B pattern: four alternating appearances of the two anchor shots. But there might be other shots, cutaways, interrupting the anchor shots. Cutaways are shots that belong to the scene but aren't anchor shots: a closeup of an object, for example, or a POV shot of whatever a character is looking at.

So if the true scene is B/C/A/B/A/B, we look for that additional B at the beginning, and designate the intervening shot C as a cutaway. Starting with the A/B/A/B pattern, we expand the scene by looking before the first A and after the last B.
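Here's a rough sketch of that expansion step, under the same assumptions as the earlier snippets. The max_gap parameter (how many non-anchor shots we're willing to step over) is an illustrative choice, not the project's actual setting.

```python
def expand_scene(shot_clusters, start, a, b, max_gap=1):
    """Grow the scene outward from the A/B/A/B core. Any non-anchor shot we
    step over on the way to another anchor shot is treated as a cutaway."""
    first, last = start, start + 3
    # Look before the first A for more anchor shots
    i = first - 1
    while i >= 0 and first - i <= max_gap + 1:
        if shot_clusters[i] in (a, b):
            first = i
            i = first - 1
        else:
            i -= 1
    # Look after the last B for more anchor shots
    j = last + 1
    while j < len(shot_clusters) and j - last <= max_gap + 1:
        if shot_clusters[j] in (a, b):
            last = j
            j = last + 1
        else:
            j += 1
    cutaways = [k for k in range(first, last + 1) if shot_clusters[k] not in (a, b)]
    return first, last, cutaways

# Hypothetical sequence B/C/A/B/A/B, where C (cluster 9) is a cutaway
shots = [12, 9, 3, 12, 3, 12]
print(expand_scene(shots, start=2, a=3, b=12))   # (0, 5, [1])
```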

Cutaway shot: a two-shot of both characters

By applying this workflow, we can identify scenes throughout the entire film. This is the first scene we identified in Lost in Translation, a famously quiet film with limited dialogue, and it is indeed the first scene in which Bob and Charlotte have a conversation.

This algorithm has a lot more success with traditionally filmed, mainstream films. For example, when we applied it to Plus One (2019), a romantic comedy, we found 18 scenes. Two-character dialogue scenes, with two characters spitting sharp dialogue at each other, are a staple of rom-coms.

Wanna see more?
