How to Get a Computer to Watch a Game Trailer

…and why you’d even want to!

Remember headlines like this one, from about five years ago?

I Forced a Bot to Watch 1,000 Hours of The A-Team, and Had It Write an Episode

It was all faked, but in 2024 we have GPT-4V(ision), from which we can build tools that watch a video, understand its context, and even draw meaningful insights from it.

While impressive demos of AI analyzing live video exist, they’re not yet widely available, and they’re likely, at first, to focus on general-purpose applications. Imagine, instead, a system that can watch every game trailer, pick up on gameplay mechanics and themes that might not appear in tags or game descriptions, and make value judgements, connections, or even warnings (“I know you hate jump scares; give these games a pass.”).

Working Up From Still Images

A local Armenian(!) salsa music performance.

I’ll use the above image to illustrate a point that I feel people miss when they think about machine vision + LLMs: unlike traditional approaches to machine vision, which focus on objects and their relative positions, adding an LLM lets the system build up context and offer some insight. For instance, you can feed GPT-4V the above image and ask it to do the following:

Me: Could you thoroughly describe this image, including the people, any items or other objects, writing, etc., and then make a judgement as to what's going on in the scene?

And you’ll get something like this as output:

GPT-4V: This image shows a live music performance by a salsa band named "Black Sea Salsa" as indicated by the banner in the background. The band consists of ten members visible on stage. There are four female vocalists in the front wearing red dresses, three of whom are holding microphones. Behind them, there are six instrumentalists, including a pianist on the left playing a grand piano, a guitarist next to the piano, and four brass players standing in a row with two trumpets, a trombone, and a saxophone…

It missed a few details, but it’s otherwise quite solid.
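If you’d like to reproduce that call yourself, it only takes a few lines. Here’s a minimal sketch using the OpenAI Python client; the model name and filename are placeholders, and you’ll need your own API key in the environment:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_image(path: str) -> str:
    """Send a local image to GPT-4V and ask for a description plus a judgement."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder; use whichever vision model you have access to
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Could you thoroughly describe this image, including the people, "
                    "any items or other objects, writing, etc., and then make a "
                    "judgement as to what's going on in the scene?"
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content


print(describe_image("salsa_band.jpg"))  # hypothetical filename
```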

Here’s Where It Gets Inferential

When I ask it to interpret the scene further, it makes some impressive inferences:

GPT-4V: The venue appears to be a bar or club, with a casual atmosphere as evidenced by a beer glass on a table in the foreground and a coat hanging on a chair. The floor is wooden, and the lighting is subdued with no spotlight on the performers, contributing to an intimate setting. No audience members are visible, suggesting the photo was taken either before the performance started or from an angle that doesn't capture the audience.

The event seems to be a live salsa music performance, and the presence of the musicians and their setup indicates that this is likely a scheduled event rather than an impromptu session.

This level of detail and accuracy from a computer analyzing a photo was literally science fiction just a few years ago. It’s also the basis for having a computer digest a full trailer.

So, How to Do a Game Trailer?

What is video but a bunch of still frames strung together with a catchy musical number?

Here’s the trailer for one of our games, Drunken Robot Pornography. Don’t worry; it’s the cleanest game on the Internet, and watching it will not get you fired from your job.
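Concretely, step one is sampling still frames from the trailer at known timestamps. Here’s a minimal sketch of one way to do that with OpenCV; the five-second interval and the output filenames are my own choices for illustration, not requirements:

```python
import cv2  # pip install opencv-python


def sample_frames(video_path: str, every_seconds: float = 5.0) -> list[tuple[float, str]]:
    """Save one frame every `every_seconds` from the video as a JPEG.

    Returns a list of (timestamp_in_seconds, saved_filename) pairs.
    """
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = frame_count / fps if fps else 0.0

    frames = []
    t = 0.0
    while t < duration:
        capture.set(cv2.CAP_PROP_POS_MSEC, t * 1000)  # seek to the timestamp
        ok, frame = capture.read()
        if not ok:
            break
        filename = f"frame_{int(t):04d}s.jpg"
        cv2.imwrite(filename, frame)
        frames.append((t, filename))
        t += every_seconds

    capture.release()
    return frames


frames = sample_frames("drp_trailer.mp4")  # hypothetical filename
```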

If you (a human) watch a trailer for a game, you might already know a bit about that game, so I wanted to give GPT-4V some of that meta-knowledge. I used a Python script to grab Steam metadata and compile it into an LLM-ingestible format containing the name, description, and tags.
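For reference, here’s roughly the kind of script I mean (a minimal sketch against Steam’s public appdetails endpoint, not my exact script; note that this endpoint exposes genres and categories rather than the full list of user-applied store tags, so treat the “tags” portion as an approximation):

```python
import requests


def fetch_steam_metadata(app_id: int) -> str:
    """Fetch the name, description, and genre/category labels for a Steam app,
    flattened into a block of text an LLM can ingest."""
    url = "https://store.steampowered.com/api/appdetails"
    data = requests.get(url, params={"appids": app_id}).json()[str(app_id)]["data"]

    labels = [g["description"] for g in data.get("genres", [])]
    labels += [c["description"] for c in data.get("categories", [])]

    return (
        f"Game: {data['name']}\n"
        f"Description: {data['short_description']}\n"
        f"Tags: {', '.join(labels)}"
    )


print(fetch_steam_metadata(123456))  # substitute the game's actual Steam app ID
```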

I then send this in alongside an initial frame like this one:

I give it a timestamp and ask it to deconstruct the shot:

Me: Here's a frame at 0s. Describe what's happening here.

It breaks it down thusly:

GPT-4V: In this still frame from the game "Drunken Robot Pornography":

1. The perspective is first-person.

2. The environment appears to be an outdoor platform with a futuristic cityscape in the background.

3. The player seems to be standing on hexagonal tiles.

4. (Etc.)

The result’s pretty good, owing (in part) to the contextualizing metadata I sent along with the image. Here’s the next frame I send:

I tell it the frame’s from 5s in, and it replies like so:

GPT-4V: In this frame:

1. It mimics a news broadcast by "WORLD SKY NEWS".

2. Below, a ticker says "TOP STORY: SOMETHING INSIPID ABOUT SOME STUPID CELEBRITY".

3. The image shows a swarm of robots with multiple arms, centered around a larger, circular structure with beams of light—presumably lasers—emitting from them.

4. The scene suggests chaotic action, likely related to the robot attack mentioned in the headline.

5. (Etc.)

Awesome. It understands that this is a still from a simulated news broadcast, it reads the on-screen text, and it gets the action right. I then continue the process for the remainder of the frames—these guys:

All the input and output are stacked together as part of a conversation that’s fed back into the LLM at each turn, so it remembers everything that has happened.
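Under the hood, that conversation is just a growing list of messages that gets re-sent with every new frame. Here’s a minimal sketch of the loop (the model name, filenames, and metadata string are illustrative assumptions, not my exact script):

```python
import base64
from openai import OpenAI

client = OpenAI()

# The Steam metadata (fetched earlier) goes into the system prompt, so every
# frame is interpreted with that context already in place.
steam_metadata = "Game: Drunken Robot Pornography\nDescription: (short description)\nTags: (tags)"

messages = [{"role": "system",
             "content": "You are analyzing frames from a game trailer.\n\n" + steam_metadata}]


def analyze_frame(timestamp_s: int, image_path: str) -> str:
    """Append one frame to the running conversation and return the model's breakdown."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    messages.append({
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Here's a frame at {timestamp_s}s. Describe what's happening here."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    })
    reply = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder vision model
        messages=messages,
        max_tokens=500,
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # the history keeps growing
    return reply


for t, path in [(0, "frame_0000s.jpg"), (5, "frame_0005s.jpg")]:  # ...and so on for each frame
    analyze_frame(t, path)

# With every frame and reply in the history, ask for the overall report.
messages.append({"role": "user",
                 "content": "List everything you've learned from this trailer."})
report = client.chat.completions.create(
    model="gpt-4-vision-preview", messages=messages, max_tokens=1000,
).choices[0].message.content
print(report)
```

From there, I ask it to break everything down: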

Me: List everything you’ve learned from this trailer.

Here’s the report:

 

1. Visual Style

  • The game features a vibrant and colorful art style, with futuristic environments.

2. Gameplay Elements

  • First-person shooter mechanics with an emphasis on agility and flying via a jetpack.

  • Bullet-hell elements, with numerous enemies and projectiles to dodge.

  • Power-ups and upgrades available within levels to enhance the player's abilities.

 

These are true, if not particularly interesting—it’s all stuff on the back of the box. But it’s nice to know GPT-4V is paying attention.

 

3. Level Design

  • Varied structures with different heights and platforms, suggesting vertical gameplay.

  • Open spaces and sky-based backdrops indicating large arenas for battles.

 

Now, we’re getting somewhere! The metadata we sent over never talks about the level layout, but it’s able to infer that from the video. Indeed, you spend most of your time traversing the environment vertically, and the arenas are much larger than you might see in a similar FPS.

 

4. Enemies/Combat

  • Large boss robots (Titans) with multiple parts and weapons.

  • Smaller enemy drones that swarm and attack the player.

  • Intense combat scenarios with numerous enemies and laser beams to navigate through.

 

Aha! That’s a decent mental leap. We can, of course, tug on that thread with a specific follow-up question:

Me: What can you infer about the enemies in this game?

The reply:

  1. Titans (Boss Enemies):

    • Titans are large, formidable robots that serve as the primary bosses in the game. They are likely varied in design and capabilities, requiring different strategies to defeat. Titans seem to have weak points, like glowing cores, that players must target, and they likely have multiple phases or attack patterns.

  2. Drones and Lesser Robots:

    • Besides Titans, the game features smaller, spherical drones and other lesser robots that act as standard enemies. These enemies swarm the player, contributing to the bullet-hell aspect of the game and keeping the player constantly engaged.

  3. Environmental Hazards as Enemies:

    • Some elements in the environment, like laser beams, might function as both hazards and enemies. These require the player to be aware of their surroundings at all times and add to the complexity of combat scenarios.

  4. Varied Attack Patterns:

    • Enemies, particularly the Titans, seem to have varied and complex attack patterns. This includes shooting lasers, deploying drones, or creating barriers, which means players must adapt their tactics continually throughout a battle.

  5. Boss AI and Adaptive Strategies:

    • Although not explicitly shown, the complexity and intensity of the battles suggest that the Titans and other enemies might have advanced AI that adapts to the player’s actions, possibly increasing the challenge as the battle progresses.

But why the heck do we want to do this!?

This post focuses mostly on the “how it works,” but there are a handful of areas where this can be useful:

  • Improved Game Categorization: On Steam, devs and gamers can apply tags to games, but given the sheer volume of tags available, they’re not always applied consistently (especially for lesser-played titles)—we could augment that with automatic labeling here. Also, this process could identify novel gameplay mechanics beyond what those tags can describe (see my post on same here).

  • Timestamped Metadata Generation: We can automate a "trailer map", pinpointing exact moments of key features (e.g., combat at 0:30; chargen at 1:05); there’s a small sketch of what that map might look like after this list. Recommendation platforms (like We ❤ Every Game, which my startup is working on) might offer customized trailer cuts based on user preferences, e.g., showing only exploration sequences for players who are curious about world-building.

  • Accessibility: Auto-generation of audio descriptions for visually impaired gamers. Do subtitle glasses work? Time will tell!

  • Enhanced Game Recommendations: By grokking the visual content of game trailers, AI can significantly improve game recommendation systems. Instead of relying solely on text descriptions or player behavior, recommendations could be based on demonstrated gameplay elements, art style, and pacing shown in trailers. This could lead to more accurate and diverse recommendations, helping players discover games they might otherwise miss.
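To make the “trailer map” idea a bit more concrete, here’s the sort of structure the per-frame analysis could be rolled up into (the field names and labels are purely illustrative):

```python
import json

# Hypothetical "trailer map": one entry per stretch of the trailer, with the
# labels GPT-4V attached to it. The schema is illustrative, not a spec.
trailer_map = {
    "app_id": 123456,               # substitute the game's Steam app ID
    "trailer": "drp_trailer.mp4",
    "moments": [
        {"start": "0:00", "end": "0:05", "labels": ["first-person", "jetpack", "futuristic cityscape"]},
        {"start": "0:05", "end": "0:10", "labels": ["simulated news broadcast", "robot swarm"]},
        {"start": "0:30", "end": "0:55", "labels": ["combat", "bullet-hell", "Titan boss"]},
    ],
}

print(json.dumps(trailer_map, indent=2))
```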

So, What Next?

One of my favorite marketing messaging exercises is to tell someone about a game I’m working on and have them explain it back to me—how they re-tell it and what they focus on tells me a lot about the product, the positioning, and the message.

As such, my next experiment with game trailers and AI will be to see if devs can use this tech to give a “first read” of a trailer and use the analysis to determine whether it’s telling the story they want to tell. For example:

  • Emphasis Check: A developer creates a trailer for their puzzle-platformer, emphasizing a time-manipulation mechanic. If the AI viewing it is able to pick that out, the hook’s probably coming across strongly—and if not, it might be worth focusing further on the mechanic.

  • Tone Alignment: A team making a dark, narrative-driven RPG gets feedback from the AI suggesting their trailer conveys a light-hearted, action-packed adventure (and, minus the AI bit, this has happened before!). This mismatch may help them realize they need to rework the video to better reflect the game's somber tone and story-focused gameplay.

  • Unexpected Strengths: A small studio making a retro-style shooter is surprised when the analysis praises the environmental storytelling and level design. This unexpected feedback helps them realize they have a unique selling point they hadn't considered emphasizing in their marketing.

As of today, I wouldn’t want to exclusively rely on a machine for the above, but stuff like this can serve as a useful initial feedback tool, helping developers ensure their trailers effectively communicate their games' core features, tone, and target audience. Studios of all sizes can always benefit from a gut-check!

More Reading

I’ll share more when I have it, but if you haven’t peeked already, can you guess how many Steam games are using AI?

Ichiro Lambe

Ichiro is a 30-year gaming industry veteran. He co-founded Worlds Apart (later Sony Online Denver), and founded award-winning indie Dejobaan Games.

At Valve, Ichiro helped establish Steam Labs to help bridge the gap between the platform’s vast library of games and the millions of gamers who would love them.

He is now CEO of Totally Human Media, a new startup building the world’s most comprehensive games recommendation engine, We Love Every Game.

Next

Creating a Virtual Influencer to Cover Every Game