Subtitle generation via ASR (Automatic Speech Recognition)

Could we add the ability to generate subtitles for recordings, or even for live programs on the fly?

An eighth-gen quad-core i5 can achieve about 5x real-time speed using the tiny.en model in Whisper.

Just like with hardware transcoding, a GPU would speed things up quite a bit and allow the larger, more accurate models to be used.

If somebody is interested in playing with it, just install it via pip and follow the Command-line usage section of the README.
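
If you'd rather poke at it from Python than the command line, it's just a few lines. A minimal sketch, assuming you've run pip install -U openai-whisper and have ffmpeg on your PATH (recording.mp3 is a placeholder filename):

    import whisper

    # Load the English-only tiny model; swap in medium.en or large if you have the RAM
    model = whisper.load_model("tiny.en")

    # transcribe() extracts audio via ffmpeg, so video files work too
    result = model.transcribe("recording.mp3")
    print(result["text"])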

@lsudduth

EDIT: There is also a standalone executable.

Here are some benchmarks.

What have you tried it on and how accurate is it for you?

Something like this would be the final piece for things like ah4c and ADBTuner. Closed captions are the only thing I really miss.

It just works. Even random foreign videos can have their subtitles generated. I always use the largest model available: medium.en for English, or large-v3 for international content.

I imagine that, with the system requirements needed to run this, it won't be making it into Channels DVR any time soon.

Even without a GPU the speed is quite acceptable, and with the smaller models it exceeds real time:

| Size   | Parameters | English-only model | Multilingual model | Required RAM | Relative speed |
|--------|------------|--------------------|--------------------|--------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB        | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB        | ~16x           |
| small  | 244 M      | small.en           | small               | ~2 GB        | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB        | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB       | 1x             |

Something like EzDubs could work.

Does it output something like an SRT, or is it designed to be integrated into another application?

I would love to try it on some of my PlayOn recordings. The timing on the subtitles gets a bit off for some reason.

By default, all supported subtitle formats are produced:

$ whisper --help

  --output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                        format of the output file; if not specified, all
                        available formats will be produced (default: all)
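
So for a single recording you can keep it to one format, e.g. (recording.mp4 is a placeholder filename):

    $ whisper recording.mp4 --model medium.en --output_format srt

That drops recording.srt in the current directory next to whatever else you asked for.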

Check out this project: GitHub - collabora/WhisperLive: A nearly-live implementation of OpenAI's Whisper.

It can already transcribe HLS streams live:

client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/bbc_1xtra.isml/bbc_1xtra-audio%3d96000.norewind.m3u8")
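
For context on that one-liner: it assumes a WhisperLive server is already running and a TranscriptionClient has been constructed. A rough sketch of the full setup based on the project's README (parameter names may differ between versions, so treat it as illustrative):

    from whisper_live.client import TranscriptionClient

    # Connect to a WhisperLive server already running on localhost:9090
    client = TranscriptionClient(
        "localhost",
        9090,
        lang="en",
        translate=False,
        model="small",
    )

    # Point it at a live HLS stream and it transcribes as the stream plays
    client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/bbc_1xtra.isml/bbc_1xtra-audio%3d96000.norewind.m3u8")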

It outputs text, and if I'm understanding the documentation correctly, there's a way to return a string of two lines of text, each with a maximum of 50 characters.
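
I'm not sure what WhisperLive calls those options, but the standalone whisper package has exactly that kind of control in its subtitle writers, which may be where the 2 x 50 figure comes from. A hedged sketch using whisper's Python API (recording.mp4 is a placeholder; word-level timestamps are required for line wrapping, and these options only exist in newer releases):

    import whisper
    from whisper.utils import get_writer

    model = whisper.load_model("small.en")

    # Word-level timestamps are needed before the writer can re-wrap lines
    result = model.transcribe("recording.mp4", word_timestamps=True)

    # Cap each cue at two lines of at most 50 characters each
    writer = get_writer("srt", ".")
    writer(result, "recording.mp4",
           {"max_line_width": 50, "max_line_count": 2, "highlight_words": False})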


EzDubs is making a lot of progress with live translation. Maybe somebody could reach out to them and ask about an integration with Channels?

@babsonnexus Would the WhisperLive program I posted above be a good candidate for your project, similar to how you integrated the MPD-to-HLS feature?

I can't say I'm totally up-to-date with all this. Can you give me a summary of what you are requesting and how you imagine it might work? Like, are you thinking PLM would intercept the stream, use WhisperLive to add subtitles, and then serve that stream up to Channels?