Subtitle generation via ASR (Automatic Speech Recognition)

Could we add the possibility of generating subtitles for recordings, or even for live programs on the fly?

An eighth-gen quad-core i5 can achieve 5x speed when using the tiny.en model in Whisper.

Just like with hardware transcoding, having a GPU would speed things up quite a bit and allow the larger, more accurate models to be used.

If somebody is interested in playing with it, just install it via pip and follow the Command-line usage section.
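If you'd rather drive it from Python than the command line, a minimal sketch (the file name and model choice here are just placeholders):

```python
# pip install -U openai-whisper   (ffmpeg must also be available on PATH)
import whisper

# tiny.en is the fastest English-only model; swap in medium.en or large-v3 if you have the hardware
model = whisper.load_model("tiny.en")

# transcribe() accepts any audio/video file that ffmpeg can decode
result = model.transcribe("recording.mp3")
print(result["text"])
```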

@lsudduth

EDIT: There is also a standalone executable

Here are some benchmarks

What have you tried it on and how accurate is it for you?

Something like this would be the final piece for things like ah4c and ADBTuner. Closed captions are the only thing I really miss.

It just works. Even random foreign videos can have their subtitles generated. I always use the largest model available: medium.en for English, or large-v3 for international content.

I imagine that, with the system requirements needed to run this, it won't be making it into Channels DVR any time soon.

Even without a GPU the speed is quite acceptable, and with the smaller models it exceeds real time:

| Size | Parameters | English-only model | Multilingual model | Required RAM | Relative speed |
|--------|------------|--------------------|--------------------|--------------|----------------|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |

Something like EzDubs could work

Does it output something like an SRT, or is it designed to be integrated into another application?

I would love to try it on some of my PlayOn recordings. The timing on the subtitles gets a bit off for some reason.

By default all supported subtitle formats are produced:

$ whisper --help

  --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                        format of the output file; if not specified, all
                        available formats will be produced (default: all)
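If you're scripting it from Python instead of using the CLI, the segments returned by transcribe() can also be written out as SRT cues by hand. A rough sketch (file names are placeholders, and the CLI's --output_format srt already does this for you):

```python
import whisper

def to_timestamp(seconds: float) -> str:
    # SRT timestamps look like 00:01:02,345
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("small.en")
result = model.transcribe("recording.mp4")

# Each segment carries start/end times in seconds plus the recognized text
with open("recording.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f"{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}\n")
        srt.write(f"{seg['text'].strip()}\n\n")
```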

Check out this project: GitHub - collabora/WhisperLive: A nearly-live implementation of OpenAI's Whisper.

It can already transcribe HLS streams live:

client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/bbc_1xtra.isml/bbc_1xtra-audio%3d96000.norewind.m3u8")

It outputs text, and if I'm understanding the documentation correctly, there's a way to return a string of two lines of text, each with a maximum of 50 characters.
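For context, the call above is made on WhisperLive's client object. A rough sketch of a full invocation against a locally running WhisperLive server follows; the constructor arguments are from my reading of the project README and may differ between versions:

```python
# pip install whisper-live   -- the WhisperLive server must be running separately, e.g. on localhost:9090
from whisper_live.client import TranscriptionClient

# Point the client at the transcription server, then feed it a live HLS URL
client = TranscriptionClient("localhost", 9090, lang="en", model="small")
client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/bbc_1xtra.isml/bbc_1xtra-audio%3d96000.norewind.m3u8")
```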


EzDubs is making a lot of progress with live translation. Maybe somebody could reach out to them and ask about an integration with Channels?

@babsonnexus Would the WhisperLive program I posted above be a good candidate for your project, similar to how you integrated the MPD-to-HLS feature?

I can't say I'm totally up-to-date with all this. Can you give me a summary of what you are requesting and how you imagine it might work? Like, are you thinking PLM would intercept the stream, use Whisper Live to add subtitles, and then serve that stream up to Channels?

I've started working on a project to do this. I already have a bash script version that works, and I'm working on a Python version to share. It uses Whisper to create SRT files, then ffmpeg to embed the subtitles into an mp4 file. The mp4 file is renamed to *.mpg so that the database doesn't need to be updated. The mp4 masquerading as an mpg file doesn't seem to bother the Fire TV client or the web interface; it just works.

I have this running on an i7-7700K system with an Nvidia 2040 GPU, so older hardware, but it works well! I'm using a medium model for transcription and it works fine, only occasionally transcribing the wrong word; it's actually about as accurate as live news broadcasts manage. This system works after the file is recorded, and it takes only a couple of minutes to run.
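This isn't the actual script, but a rough sketch of how that kind of pipeline could look, assuming the Whisper CLI and ffmpeg are installed; the paths, model choice, and the mov_text soft-subtitle codec are my assumptions:

```python
import subprocess
from pathlib import Path

def add_captions(recording: Path, model: str = "medium.en") -> None:
    """Transcribe a recording with Whisper, embed the SRT, and keep the original .mpg name."""
    workdir = recording.parent
    srt = recording.with_suffix(".srt")
    muxed = recording.with_suffix(".captioned.mp4")

    # 1. Whisper CLI writes <basename>.srt next to the recording
    subprocess.run(
        ["whisper", str(recording), "--model", model,
         "--output_format", "srt", "--output_dir", str(workdir)],
        check=True,
    )

    # 2. Remux into mp4 with the SRT embedded as a soft subtitle track (no re-encode)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(recording), "-i", str(srt),
         "-map", "0", "-map", "1", "-c", "copy", "-c:s", "mov_text", str(muxed)],
        check=True,
    )

    # 3. Swap the captioned file in under the original name so the Channels database is untouched
    recording.unlink()
    muxed.rename(recording)

add_captions(Path("/dvr/TV/Some Show 2024-01-01.mpg"))
```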

One way to incrementally and seamlessly support this may be to support .srt files for recordings. They are currently supported only for imported personal content. It doesn't seem like it would be difficult to add this capability for recordings.

It might even be possible to create and update an .srt file in real time, providing captions for live events such as news broadcasts that are only a few seconds behind. That would make it possible to simply wait until the captions catch up to see them fully synchronized. My idea for this is to rely on a captions service co-located with the Channels server that takes the audio and spits out captions as fast as it can.

Adding the captions after the recording completes is also possible. I'm doing it now, but since .srt files are not supported for recordings, I'm transcoding the recording to embed the subtitles directly. It takes between 2 and 10 minutes to fully process a 1-hour recording. It's still in early development, but it's actually working!

Better late than never. :wink:

I think there would be a lot of interest in trying this out in Docker/OliveTin. I watch TV delayed anyway, but even post-processing recordings would be awesome.