# Captioning all my YouTube videos with AI
Jon Gjengset

Every month or two, I get an email asking whether I could enable captions on my YouTube videos. I also get asked on Twitter, on Reddit, and even on the orange site. Unfortunately, every time I’m forced to give the same answer: I already have auto-captioning enabled on my videos, but for some reason YouTube sometimes simply does not generate captions. The most common case appears to be because the video is too long (somewhere around 2h), but I’ve seen it happen for shorter videos as well.
Each time I give that reply, it makes me sad. It means that someone who expressed an interest in learning from my videos was (at least in part) prevented from doing so, and that sucks. So, with the last email I got, I decided to finally do something about it.
Ages ago, a co-worker of mine suggested I might be able to use AI to generate captions for my videos. There are a bunch of such services around these days, but the one he linked me to was Gladia. So when I finally decided to generate captions for all my videos, that’s where I started. The API is pretty straightforward: you send them a video or audio file (or even a YouTube URL), and they return a list of captions, each with an associated start time and end time[^1]. That list can then pretty easily be turned into SRT or VTT caption files (Gladia also supports producing them directly, though I didn’t use that feature). Seemed easy enough!
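Conceptually, each caption in that list can be modeled with a struct like this (a minimal sketch; the field names mirror how I refer to captions later in this post, not necessarily Gladia’s exact response schema):

```rust
/// One caption as returned by the transcription API.
/// NOTE: illustrative shape only; check Gladia's docs for the exact
/// JSON schema before deriving `serde::Deserialize` against it.
struct Caption {
    /// Start of the caption, in seconds from the start of the audio.
    time_begin: f64,
    /// End of the caption, in seconds from the start of the audio.
    time_end: f64,
    /// The transcribed text for this time range.
    transcription: String,
}
```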
Unfortunately, it turns out that Gladia (and many other similar platforms) has a max limit on the length of the file it is able to caption. For Gladia, it’s currently 135 minutes (though they recommend you split your audio files into ~60-minute chunks). Now, if you’ve watched my videos, you know that most of them are longer than that, so some smartness was needed (more on that in a second).
I also faced another issue: my video backlog is somewhere around 250 hours of video. At the time of writing, Gladia charges €0.000193 per second of audio (which seems to be roughly where the industry has landed), which works out to €174. That’s not nothing, especially with a bit of trial and error needed to get the aforementioned splitting right. Luckily, when I reached out to them pointing out that I wanted to caption a bunch of programming teaching resources, and was willing to share my experience and code afterwards, they graciously agreed to cover the cost of the bulk captioning. Yay!
With that out of the way, let’s get to the how. You can also just look at the code directly if you want!
## Generating captions for long videos

My videos vary a fair bit in length. The shortest are 60-90m (so within the Gladia length limit), while the longest one is 7h20m. Some may call that too long, but that’s outside the scope of this post. This raises the question: how do you generate captions for a 7 hour long video in bursts of approximately 60 minutes? The naive approach is to just split the video into 60 minute chunks, caption each one independently, and then join them together, but this presents a few problems:
- You may cut the video mid-sentence, leading to a broken caption.
- You may cut the video during a short silence where the next caption should follow on from a sentence just before the silence. Splitting here would lead to odd-looking captions where the next caption appears to start a new sentence.
- Depending on how you cut, you may end up with slightly-offset captions in later segments if the cutting isn’t using precise timecodes.
- You may end up with “weird ends” where the last segment is only a few seconds long, possibly without any captions. This isn’t inherently a problem, though it does mean that progress can appear kind of random.
Instead, you have to be slightly smarter about how you cut. Here’s what I landed on.
First, get the audio file for the video locally somehow. If it’s a YouTube video, you can use a tool like yt-dlp or youtube-dl to grab it:
```
$ yt-dlp -x -f 'bestaudio' "https://www.youtube.com/watch?v=kCj4YBZ0Og8"
```

In my case, I have the files for all my videos locally, so I just used those.
Next, take the length of the video and divide it by 60 minutes. Round that number up to the nearest integer value. Then divide the length of the video by that integer, and call the result `seg`. That’s how long we’ll make each segment.
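In code, that arithmetic looks something like this (a small sketch; `segment_length` is a hypothetical helper, not something from my actual tool):

```rust
/// Split `total` seconds of audio into equal-length segments of at
/// most `max` seconds each, returning the per-segment length.
fn segment_length(total: f64, max: f64) -> f64 {
    // How many segments do we need? Round up so none exceeds `max`.
    let n = (total / max).ceil();
    // Spread the audio evenly across that many segments.
    total / n
}
```

For a 7h20m video and 60-minute chunks, that’s 26400s / 3600s = 7.33, rounded up to 8 segments of 3300s (55 minutes) each.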
Then, extract the first segment (of length `seg`) with[^2]:
```
$ ffmpeg -i "$audiofile" -vn -c:a libopus -b:a 192k -f ogg -t "$seg" "$segmentfile"
```

It’s tempting to use `-acodec copy` here, but don’t: it leads to inaccurate cutting. We need to re-encode to get exactly-accurate cuts of the audio. So, we export to Opus audio in an Ogg container: it is modern, compact, and has good encoders. FLAC would be nice, but hits the 500MB file size limit too often. I decided against AAC since some AAC encoders are really bad.
Later segments can be extracted with:

```
$ ffmpeg -ss "$start" -i "$audiofile" -vn -c:a libopus -b:a 192k -f ogg -t "$seg" "$segmentfile"
```

Note that the last segment has to be extracted without the `-t` flag to make up for any rounding errors!
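If you’re driving ffmpeg from a program rather than a shell script, the extraction boils down to something like this (a sketch; `extract_segment` is a hypothetical wrapper around the commands above):

```rust
use std::process::Command;

/// Extract one audio segment of the input to `out` (an .ogg file).
/// `start` and `len` are in seconds; pass `len = None` for the last
/// segment so rounding errors don't cut off the tail.
fn extract_segment(
    audiofile: &str,
    out: &str,
    start: f64,
    len: Option<f64>,
) -> std::io::Result<()> {
    let mut cmd = Command::new("ffmpeg");
    cmd.arg("-ss").arg(start.to_string());
    cmd.arg("-i").arg(audiofile);
    // Drop the video stream and re-encode to Opus in Ogg for exact cuts.
    cmd.args(["-vn", "-c:a", "libopus", "-b:a", "192k", "-f", "ogg"]);
    if let Some(len) = len {
        cmd.arg("-t").arg(len.to_string());
    }
    cmd.arg(out);
    let status = cmd.status()?;
    assert!(status.success(), "ffmpeg failed for segment at {start}s");
    Ok(())
}
```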
Sometimes, the audio stream in a video has a delay relative to the video stream. This might be to correct audio/video sync, or to make an intro sound line up with the visuals. This makes things weird because the caption timestamps are relative to the video. You can check whether this is the case for a given video file with this command:
```
$ ffprobe -i "$videofile" -show_entries stream=start_time \
    -select_streams a -hide_banner -of default=noprint_wrappers=1:nokey=1
```

If this prints anything but 0, you’ll have to adjust the ffmpeg invocation for the first segment to include `-af adelay=$offset_in_ms`. Note you have to have both the audio and video to run this command, so remove `-x -f 'bestaudio'` if you’re grabbing a YouTube video that might have such a delay.
But how do you set `$start` for each segment?
First, ship the extracted audio segment to the Gladia API. Then, in the captions you get back[^3], walk backwards from the last caption, and look for the largest inter-caption gap in, say, the last 30 seconds of the segment. The intuition here is that the longest gap is the one least likely to be in the middle of a sentence. You can also improve this heuristic by looking at what the caption immediately before the gap ends with. For example, if it ends with “,” or “…”, maybe skip that gap, as the next caption is probably related and shouldn’t be split apart.
Once you’ve found that gap, set `$start` to be the time half-way through that gap. Discard all captions that follow `$start` from the current segment, then repeat the whole process for the next segment. Keep in mind that all captions you get back from the API need to have `$start` added to their time codes!
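Here’s what that heuristic might look like in code, using the `Caption` struct from earlier (a sketch; scanning forwards over the whole window finds the same largest gap as walking backwards would):

```rust
/// Find the cut point for the next segment: the middle of the largest
/// inter-caption gap within the last `window` seconds of this segment.
fn find_cut(captions: &[Caption], segment_end: f64, window: f64) -> f64 {
    let mut best_gap = 0.0;
    let mut cut = segment_end;
    for pair in captions.windows(2) {
        let (a, b) = (&pair[0], &pair[1]);
        // Only consider gaps in the final `window` seconds.
        if a.time_end < segment_end - window {
            continue;
        }
        // Skip gaps where the previous caption looks mid-sentence.
        if a.transcription.ends_with(',') || a.transcription.ends_with('…') {
            continue;
        }
        let gap = b.time_begin - a.time_end;
        if gap > best_gap {
            best_gap = gap;
            // Cut half-way through the gap.
            cut = a.time_end + gap / 2.0;
        }
    }
    cut
}
```

Discarding the trailing captions is then just a filter on `time_begin < cut`, and offsetting the next segment’s captions is an addition on their time codes.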
Once you have all the captions from all the segments, all that remains is to write them out in the SRT format (one of the caption file formats that YouTube supports). The format is:
```
$number
$caption.time_begin --> $caption.time_end
$caption.transcription
```

where `$number` starts at 1 for the first caption and increases by one for each subsequent caption, and the timestamps are printed like this:
```rust
fn seconds_to_timestamp(fracs: f64) -> String {
    let mut is = fracs as i64;
    assert!(is >= 0);
    let h = is / 3600;
    is -= h * 3600;
    let m = is / 60;
    is -= m * 60;
    let s = is;
    let frac = fracs.fract();
    let frac = format!("{:.3}", frac);
    let frac = if let Some(frac) = frac.strip_prefix("0.") {
        format!(",{frac}")
    } else if frac == "1.000" {
        // 0.9995 would be truncated to 1.000 at {:.3}
        String::from(",999")
    } else if frac == "0" {
        // integral number of seconds
        String::from(",000")
    } else {
        unreachable!("bad fractional second: {} -> {frac}", fracs.fract())
    };
    format!("{h:02}:{m:02}:{s:02}{frac}")
}
```

Note that the SRT format is pretty strict. You must have an empty line between each caption, you must give them consecutive sequence numbers starting at 1, you must use `-->`, and you must format the timestamps with exactly three fractional digits.
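Putting those pieces together, producing the full file is just a loop (a sketch; `to_srt` is a hypothetical helper building on `seconds_to_timestamp` and the `Caption` struct from earlier):

```rust
use std::fmt::Write;

/// Render all captions as one SRT document.
fn to_srt(captions: &[Caption]) -> String {
    let mut out = String::new();
    for (i, c) in captions.iter().enumerate() {
        // Sequence numbers are consecutive and start at 1.
        writeln!(out, "{}", i + 1).unwrap();
        writeln!(
            out,
            "{} --> {}",
            seconds_to_timestamp(c.time_begin),
            seconds_to_timestamp(c.time_end)
        )
        .unwrap();
        writeln!(out, "{}", c.transcription).unwrap();
        // The empty line between captions is mandatory.
        writeln!(out).unwrap();
    }
    out
}
```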
I’ve coded up this whole process in this project on GitHub: https://github.com/jonhoo/gladia-captions. I’ve also filed a feature request for Gladia to support something like this natively.
## Mapping video files back to YouTube

If all you wanted to do was play someone else’s YouTube video with captions, then you’re basically done. Just pass the SRT file to your video player along with the YouTube URL (if it supports it) or the video file you downloaded, and you’re good to go.
If, like me, you want to update the YouTube video’s captions, you next need to figure out which YouTube video each caption belongs to. If you downloaded the audio from YouTube originally, or have a neatly organized video backup archive, then this is trivial. In my case, my local video archive only has the video category and the recording time of each video, so there’s no real connection to the originating YouTube video. So, I had to also find a way to map the videos back to their respective YouTube uploads.
To do this, I wrote a program that first queries the YouTube API for all my videos and extracts their id, title, publication date, and duration. Then, it walks all the video files in a given directory and determines their timestamp (from the file name) and duration (using symphonia). Finally, for each local file, it checks if any of the YouTube videos have a duration that differs by at most single-digit seconds, and a publication date that differs by at most a day. If any such video is found, it associates the two, and outputs the YouTube id and title in the name of the caption file for that local file.
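In sketch form, the matching looks something like this (the types are simplified stand-ins for what the YouTube API and symphonia actually give you; the tolerances are the ones described above):

```rust
/// A video on YouTube, as listed by the YouTube Data API.
struct YouTubeVideo {
    id: String,     // used to name the caption file
    title: String,  // ditto
    published: i64, // publication date as a unix timestamp
    duration: f64,  // seconds
}

/// A local recording, with metadata from the file name and symphonia.
struct LocalFile {
    recorded: i64, // recording time as a unix timestamp
    duration: f64, // seconds
}

/// Find the YouTube upload that corresponds to a local file, if any:
/// the durations must differ by under ten seconds, and the publication
/// date must be within a day of the recording time.
fn matching_upload<'a>(
    local: &LocalFile,
    uploads: &'a [YouTubeVideo],
) -> Option<&'a YouTubeVideo> {
    uploads.iter().find(|v| {
        (v.duration - local.duration).abs() < 10.0
            && (v.published - local.recorded).abs() < 24 * 60 * 60
    })
}
```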
Armed with the caption files and the mapping back to YouTube videos, I really wanted to automate the process of uploading the captions as well. It’s not too important for captioning new videos, but when doing the backlog of almost 80 videos, that’s a lot of clicking through the YouTube Studio UI. Now, there is an API for uploading captions, but unfortunately there are two complications:
- It accesses private data, which requires OAuth 2.0 authentication. A simple API key won’t do it. It’s totally possible to implement OAuth 2.0 authentication from a command-line tool, it’s just annoying.
- YouTube’s upload API uses a particular kind of request encoding (chunked transfer encoding) that isn’t supported by the Rust HTTP library I’m using at the moment.
So I instead opted to do this part in Python (for now) based on the code in the “Try it” box on the caption API page (and the instructions for running it). This required getting a set of OAuth credentials from Google Cloud Console (not too bad since I already had an “application” there for my API key), adding myself as a “Test user” under “OAuth consent screen”, and tweaking the code a bit. The end result is this Python script, which you should easily be able to adapt to a slightly different use-case.
## End result

It’s worth noting that the YouTube API has a pretty strict free quota, and that uploading captions consumes a fair bit of that quota (450 units of the 10k daily limit per upload). This means that in practice you can only upload about 20 captions a day through the API before YouTube will cut you off for the day. And getting that limit bumped is annoying.
All my videos, including the super long ones, will soon have English captions (once the YouTube API allows), and I no longer need to apologize for YouTube auto-captioning’s shortcomings 🎉
[^1]: Gladia sometimes returns captions that are overly long. I haven’t found it to be an outright problem (like YouTube rejecting the captions), it’s just a bit awkward to read when it happens. It’d be nice if there was a way to limit the max caption length, so I’ve filed a feature request.

[^2]: It’s a little unfortunate that the audio has to be split on the client side, especially given that the Gladia API supports providing YouTube URLs directly. It’d be so convenient if one could instead tell Gladia specifically which part of the posted URL’s audio to caption. So, I’ve filed a feature request.

[^3]: The Gladia API returns a fairly big JSON payload because it also includes word-level timestamps. I didn’t need those here, but there isn’t currently a way to omit them.