March 04, 2010

An Experiment with YouTube's New Auto-Captioning

Greetings. In a move of potentially enormous positive importance to hearing-impaired Internet users, Google's YouTube today announced the deployment of its "auto-captioning" capability across the entire universe of YouTube videos.

The rapid expansion of uncaptioned video on the Internet has increasingly put hearing-impaired users at a disadvantage, but the decidedly nontrivial work required to caption videos, especially for producers with limited resources, has until now greatly limited the number of YouTube videos available with full or even partial captioning.

I've captioned some of my own videos in the past, both with the help of rudimentary tools and completely manually, and I can attest that it can be quite tedious indeed.

So given today's YouTube announcement (and the discovery that some of my YouTube videos were already enabled for auto-captioning), I decided to run a quickie experiment using one of my previously uncaptioned videos.

Automatic speech recognition is a very difficult task, especially in the presence of music or noise, and YouTube notes that auto-captioning must be expected to be imperfect. I was interested in seeing how well it would function to speed up the process of hand-tuning a completely accurate captioning transcription.

Executive summary: It helps a great deal, to say the least!

The video I used for this experiment was my "Is Net Neutrality a Communist Plot?" satire.

This video includes a number of aspects particularly useful for this test. While most of the voiceover is not accompanied by backing music, there are sections of narration that are layered on music beds. Also, the audio for almost the entire length of the production is mixed with a purpose-built "noise" track to simulate a rotting old reel of film, and there are various other audio timing artifacts that I had manually introduced as well.

You can inspect the results by watching the video and enabling captioning.

At the lower right of the YouTube playback window is an upward-pointing arrow. Hover over it (or click) and you'll see a "CC" option that you can click to enable captioning (it will then turn red). Also, if you hover over the small left-pointing arrow on the left side of the CC option, you can choose between two captioning tracks.

"Hand-Tuned" is the default and is my final captioning track, created by correcting and tuning the results of YouTube's automatically transcribed captioning. "Machine Transcription" is the original, unmodified captioning track that YouTube generated on its own.

The automatic captioning track obviously contains many errors (and is rather humorous in places). But we definitely must keep in mind (a) that the presence of music and noise naturally degrades the machine-transcription process, and especially (b) the enormous time-saver that having this automatic track -- even with its errors -- represents when it comes to creating a hand-tuned and polished final captioning track.

I can't emphasize this latter point enough. Being able to use the automatically generated track as a foundation allowed me to create a finished captioning track in a fraction of the time that would have been required when working with a script from scratch.

Obviously, results will vary from video to video, but I would expect that many videos, particularly ones with quiet backgrounds, will yield rather spectacular results with a minimum of hand tuning, and in many cases will be highly useful to users without any tuning at all.

And of course, we can expect that over time the quality of the auto-captioning transcriptions will only improve.

This is a big day for Internet video accessibility in general, and for YouTube in particular. Kudos to the YouTube teams!


Posted by Lauren at March 4, 2010 08:17 PM
Twitter: @laurenweinstein
Google+: Lauren Weinstein