CEA-608 captions in Media Source Extensions with WebKitGTK
Recently, I have been working on WebKitGTK support for in-band text tracks in Media Source Extensions, so far just for WebVTT in MP4. Eventually, I noticed a page that seemed to be using a CEA-608 track - most likely unintentionally, not expecting it to be handled - so I decided to take a look at how that might work. You can find the resulting PR here: https://github.com/WebKit/WebKit/pull/47763
Now, if you’re not already familiar with subtitle and captioning formats, particularly CEA-608, you might assume they must be straightforward compared to audio and video. After all, it’s just a bit of text and some timestamps, right?
However, even WebVTT, as a text-based format, already has plenty of unsupported or poorly supported features that don’t mesh well with MSE - for details on those open questions, take a look at Alicia’s session on the topic: https://github.com/w3c/breakouts-day-2025/issues/14
Quick introduction to CEA-608 #
CEA-608, also known as line 21 captions, encodes captions as a fixed-bitrate stream of byte pairs in an analog NTSC broadcast. As the name suggests, these pairs are transmitted during the vertical blanking interval, on line 21 (and line 284, for the second field) - imagine this as the mostly blank area “above” the visible image. This provides space for up to 4 channels of captioning, plus some additional metadata about the programming, though due to the very limited bandwidth, these capabilities were rarely used to their full extent.
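To make the byte-pair framing a bit more concrete, here is a minimal sketch (with hypothetical names) of how a single pair could be classified after stripping the odd-parity bit that each byte carries. The parity scheme and the 0x10-0x1F control-code range come from the spec; everything else is my own simplification and deliberately ignores extended characters, XDS and channel selection.

```cpp
#include <bit>      // std::popcount, C++20
#include <cstdint>
#include <optional>

enum class PairKind { Padding, Text, ControlCode, ParityError };

struct DecodedPair {
    PairKind kind;
    uint8_t first { 0 };  // 7-bit values with the parity bit stripped
    uint8_t second { 0 };
};

static std::optional<uint8_t> stripOddParity(uint8_t byte)
{
    // A valid CEA-608 byte has an odd number of set bits in total.
    if (std::popcount(byte) % 2 == 0)
        return std::nullopt;
    return byte & 0x7F;
}

DecodedPair classifyPair(uint8_t rawFirst, uint8_t rawSecond)
{
    auto first = stripOddParity(rawFirst);
    auto second = stripOddParity(rawSecond);
    if (!first || !second)
        return { PairKind::ParityError };

    if (*first >= 0x10 && *first <= 0x1F)
        return { PairKind::ControlCode, *first, *second }; // e.g. RCL, EOC, PACs...

    if (*first >= 0x20)
        return { PairKind::Text, *first, *second }; // two printable characters

    return { PairKind::Padding }; // null pairs and everything we ignore here
}
```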
While digital broadcasts provide captioning defined by its successor standard, CEA-708, this newer format still allows embedding 608 byte pairs. This is still quite common, and is made possible by later standards defining a digital encoding, known as Caption Distribution Packets. These packets are also what enable CEA-608 tracks in MP4.
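To sketch how that embedding looks in practice: 608 pairs travel alongside 708 data as 3-byte “cc triplets”, each carrying a validity flag, a type field, and the two data bytes. The following is a rough illustration based on my reading of CEA-708 - the function is hypothetical and skips all header parsing and error checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BytePair { uint8_t first; uint8_t second; };

// Hypothetical helper: walk a buffer of 3-byte cc triplets (as found in
// cc_data / Caption Distribution Packets) and collect the CEA-608 byte
// pairs for NTSC field 1, ignoring field 2 and the CEA-708 service data.
std::vector<BytePair> extract608Field1(const uint8_t* triplets, size_t length)
{
    std::vector<BytePair> pairs;
    for (size_t i = 0; i + 2 < length; i += 3) {
        bool ccValid = triplets[i] & 0x04;   // is this triplet meaningful?
        uint8_t ccType = triplets[i] & 0x03; // 0/1: 608 field 1/2, 2/3: 708 data
        if (ccValid && ccType == 0)
            pairs.push_back({ triplets[i + 1], triplets[i + 2] });
    }
    return pairs;
}
```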
Current issues, and where to go in the future #
The main issue I’ve encountered in trying to make CEA-608 work in an MSE context lies in its origin as a fixed-bitrate stream - there is no concept of cues, no defined start or end, just one continuous stream.
As WebKit internally understands only WebVTT cues, we rely on GStreamer’s cea608tott element for the conversion to WebVTT. Essentially, this element needs to create cues with well-defined timestamps, which works well enough if we have the entire stream present on disk.
However, when 608 is present as a track in an MSE stream, how do we tell if the “current” cue is continued in the next SourceBuffer? Currently, cea608tott will just wait for more data, and emit another cue once it encounters a line break, or its current line buffer fills up, but this also means the final cue will be swallowed, because there will never be “more data” to allow for that decision.
The solution would be to always cut cues off at SourceBuffer boundaries, which means cues might appear awkwardly split to the viewer. Overall, this conversion to VTT won’t reproduce the captions as they were intended to be viewed, at least not currently. In particular, roll-up mode can’t easily be emulated using WebVTT.
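To make the boundary-flushing idea concrete, here is a purely conceptual sketch - not the actual WebKit or cea608tott code - of an accumulator that emits whatever text is pending as a cue when the appended segment ends, instead of waiting for more data that may never arrive:

```cpp
#include <functional>
#include <string>
#include <utility>

struct Cue { double start; double end; std::string text; };

// Conceptual sketch only: accumulate caption text as it is decoded and
// cut the pending cue off at the end of each appended segment.
class CueAccumulator {
public:
    explicit CueAccumulator(std::function<void(const Cue&)> emit)
        : m_emit(std::move(emit)) { }

    void appendText(double timestamp, const std::string& text)
    {
        if (m_pending.empty())
            m_start = timestamp;
        m_pending += text;
        m_lastTimestamp = timestamp;
    }

    // Normal case: a line break (or a full line buffer) finishes a cue.
    void lineBreak() { flush(); }

    // MSE case: the appended segment ended, so emit the partial cue now
    // rather than waiting for data the next append may or may not provide.
    void segmentEnded() { flush(); }

private:
    void flush()
    {
        if (m_pending.empty())
            return;
        m_emit({ m_start, m_lastTimestamp, m_pending });
        m_pending.clear();
    }

    std::function<void(const Cue&)> m_emit;
    std::string m_pending;
    double m_start { 0 };
    double m_lastTimestamp { 0 };
};
```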
The other issue is that, for the current patch, I’ve assumed CEA-608 captions will be present as a separate MP4 track, while in practice they’re usually injected into the video stream, which will be harder to handle well.
Finally, there is the risk of breaking existing websites that might have unintentionally left CEA-608 captions in, and don’t handle a surprise duplicate text track well.
Takeaway #
While this patch only provides experimental support so far, I feel this has given me valuable insight into how in-band text tracks can work with various formats aside from just WebVTT. Ironically, CEA-608 even avoids some of WebVTT’s issues - there are no gaps or overlapping cues to worry about, for example.
Either way, I’m looking forward to improving on WebVTT’s pain points, and maybe adding other formats eventually!