With a bit of input from Gemini about its own API, I wrote two shell scripts to:
1. Provide a text-only alternative to an audio or video file.
2. Create a timestamped description of the visuals in a video file, not including audio-only information.
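A minimal sketch of what the request side of the first script might look like; this is not the actual script. It assumes the file is small enough to send inline as base64 (larger files would need a separate upload step), and the helper names `json_escape` and `build_payload`, the prompt wording, the MIME type, and the `GEMINI_API_KEY` variable are all placeholders made up for illustration.

```shell
#!/bin/sh
# Sketch only: json_escape and build_payload are hypothetical helper names.

# Escape backslashes and double quotes so a single-line prompt
# can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a generateContent-style request body: one text part (the prompt)
# and one inline_data part holding the base64-encoded media file.
# Usage: build_payload PROMPT MIME_TYPE FILE
build_payload() {
  prompt=$(json_escape "$1")
  data=$(base64 < "$3" | tr -d '\n')
  printf '{"contents":[{"parts":[{"text":"%s"},{"inline_data":{"mime_type":"%s","data":"%s"}}]}]}\n' \
    "$prompt" "$2" "$data"
}

# Example call (assumes curl is installed and GEMINI_API_KEY is set;
# the endpoint is as I understand it from the public docs, so check it first):
# build_payload "Transcribe this audio. Add paragraph breaks for readability." \
#     audio/mpeg clip.mp3 |
#   curl -s -H 'Content-Type: application/json' -d @- \
#     "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GEMINI_API_KEY"
```

The escaping here only handles quotes and backslashes, so it suits a one-line prompt; a multi-line prompt would be safer to build with a real JSON tool.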
Gemini definitely has some flaws when working with video, and I've only tried it on a few files. Interestingly, it picked up enough context clues from a one-minute clip of The Big Bang Theory to correctly identify a character who is never named in that clip. The flip side is that it doesn't know how to hold off on naming characters in its descriptions until the clip itself names them. If I give it permission to use context clues from the audio, it uses later clues to describe earlier parts of the file.
It's also not perfect at separating audio from video. For instance, my visual-only description had sentences like, "Sheldon walks into the kitchen and makes a request." and "Sheldon begins to sing." I could always mute the audio when sending the video, but I was hoping I could get Gemini to understand the difference and produce reasonably useful video descriptions.
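For reference, the mute-before-sending fallback could be sketched with ffmpeg roughly like this. It assumes ffmpeg is installed; `strip_audio` is a hypothetical helper name, not part of either script.

```shell
#!/bin/sh
# Sketch only: strip_audio is a hypothetical helper name.
# Copy the video stream untouched (-c:v copy) and drop all audio (-an),
# so the model only ever receives the picture.
# Usage: strip_audio INPUT OUTPUT
strip_audio() {
  ffmpeg -loglevel error -i "$1" -an -c:v copy "$2"
}

# Example: strip_audio clip.mp4 clip-silent.mp4
```

The trade-off is that a silent file also removes any chance of the model using dialogue to work out what is happening on screen.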
I'm mostly using Gemini 2.0 Flash because I *think* it's still free for now, but I'll compare it with 1.5 Pro at some point and keep tweaking the prompts.
For audio, it's great. I tell it to break the text into paragraphs for readability and to put line breaks in sung passages, and it does this quite well. It can't always identify sounds, but it tries pretty hard.