OMX Hardware Transcoding on Raspberry Pi

source: h264 1080p 4mbit

encode=h264_omx: speed=1.29x

scale=720p encode=h264_omx: speed=0.949x

encode=libx264: speed=0.226x

scale=720p encode=libx264: speed=0.341x

source: mpeg2 1080i 10mbit

encode=h264_omx: speed=1.36x

deint=blend encode=h264_omx: speed=1.17x

scale=720p encode=h264_omx: speed=0.833x

deint=blend scale=720p encode=h264_omx: speed=0.723x

encode=libx264: speed=0.34x

scale=720p encode=libx264: speed=0.211x

deint=blend encode=libx264: speed=0.204x

deint=blend scale=720p encode=libx264: speed=0.342x

It seems like performance is right on the edge of being usable, at least for one stream.

If a stable aarch64 solution emerges for the Pi 4, that would be enough to push things over the edge and allow us to use a lot more of the optimizations available on the chip.

Very interesting (and good delivery timing!). Is there an option to pass transcoded material on unscaled no matter what (i.e. to skip the scale step)? For me software 720p transcoding is already quite stable, so just doing 1080i->1080p would keep you above board.

I'm surprised your libx264 results are so much slower. What's the default software scaling to h.264 in ffmpeg? It was about as fast as omx for me (with my freshly compiled version).

And using Channels' ffmpeg, here's what I get (just looking at the status below the web viewer's pane and letting it settle):

1080i to 1080p, 10mbit, speed=0.85
1080i to 720p, 6mbit, speed=0.75

I know I overclocked, but that shouldn't be more than a 10% effect.

Ah, I was missing some of the speed flags the DVR is using when testing ffmpeg myself manually:

-preset veryfast -x264opts "subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none"
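
For anyone wanting to reproduce the manual test, here is a rough sketch in Python of how those flags could be assembled into a full ffmpeg command line. The file names, the `build_x264_args` helper, the `scale=-2:720` filter and the audio passthrough are my assumptions for illustration, not what the DVR actually runs; only the `-preset` and `-x264opts` values come from the flags quoted above.

```python
import shlex

# x264 speed options quoted in the post above.
X264_OPTS = "subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none"


def build_x264_args(src, dst, scale_720p=False):
    """Return an ffmpeg argv list for a software (libx264) transcode.

    src and dst are hypothetical file names; the scale filter and the
    audio passthrough are assumptions, not the DVR's real settings.
    """
    args = ["ffmpeg", "-i", src]
    if scale_720p:
        # -2 keeps the aspect ratio and rounds the width to an even number.
        args += ["-vf", "scale=-2:720"]
    args += ["-c:v", "libx264", "-preset", "veryfast", "-x264opts", X264_OPTS]
    args += ["-c:a", "copy", dst]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command instead of running it.
    print(shlex.join(build_x264_args("sample.ts", "out.ts", scale_720p=True)))
```

ffmpeg reports the `speed=` figure in its progress line while it runs, which is where numbers like the ones in this thread come from.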

I also eked a little bit more performance out of the deinterlacing by enabling NEON.

Using mpeg2 sample now:

encode=libx264: speed=0.761x

scale=720p encode=libx264: speed=0.896x

deint=blend encode=libx264: speed=0.761x

deint=blend scale=720p encode=libx264: speed=0.837x

And with the 1080p h264 sample:

encode=libx264: speed=0.793x

scale=720p encode=libx264: speed=0.998x

It appears Raspbian is bound to stick with a 32-bit kernel/userland for at least 4 more years to support Pi Zero, etc. Running a 32-bit userland on a 64-bit kernel is possible but may disable OMX/MMAL.

In any case, I hadn't counted on real-time transcoding, and am actually pretty amazed it works. And it's absolutely rock solid so far for what I had planned to use it for.

But it seems like just turning on OMX and limiting the output to no scaling + moderate bit rates would make it possible to use right now!

If there are options for 64bit kernel+user level for Pi, would you then have to distribute two versions? This person is getting close (with a Gentoo build).

Glad you have one to play with; looking forward to seeing what you come up with.

If Raspbian switched to armv8l or aarch64 that would make things easier for us since we could enable optimizations unconditionally on those builds. Right now we have to support armv7l across a wide variety of CPUs, many of which don't have NEON available.

Unfortunately, while my tests with h264_omx worked well on the command line, as soon as I tried with the DVR everything blew up. In Linear deinterlacing mode (60fps), the encoder locks up completely. It also seems to have several timestamp handling bugs, so it's not working correctly with options like -copyts used by the DVR.

I'm testing with remote access to @maddox's pi4 right now, but once mine shows up next week I should have more time to play with it. It is surprising how well software encoding works already, and it makes sense to try to enable hardware encoding to save the CPU for other work.

BTW does h264_mmal work for you?

Much more in line with what I was seeing, especially with a bit of overclock. Not sure if NEON was compiled into the shipped ffmpeg though...

I stripped the mmal configs out when I recompiled with libx264 support (since I only have OTA MPEG2 streams here). Not working for you?

How does 1080i -> 720p 6mbps via OMX work now with your faster de-interlacing?

Progress isn't always linear :) Deinterlace is a separate tool in the chain though, yes? So can't blame OMX?

I can imagine a 3rd party springing up to build and pre-install an RPi 4-based Channels Appliance, complete with 3D printed Channels-logo fan case, for <$65. Just add a disk. Heck, build them yourselves and bundle it for free or $25 or something with a 2 year DVR subscription...

Glad to have attracted your interest here.

The improvements were minor; unfortunately it's still not enough to do a real-time encode:

deint + 720p@4mbit = 0.773x
deint + 480p@4mbit = 0.895x
deint + 480p@2mbit = 0.903x
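
To put those numbers in perspective, here is a quick back-of-the-envelope sketch (the function name is made up for illustration). At a speed below 1.0x the encoder needs more than a minute of wall-clock time per minute of video, so a live stream falls steadily behind.

```python
def seconds_to_encode(source_seconds, speed):
    """Wall-clock seconds needed to encode source_seconds of video
    at the given ffmpeg speed factor (1.0 means real time)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return source_seconds / speed


# Speed factors reported above.
results = {
    "deint + 720p@4mbit": 0.773,
    "deint + 480p@4mbit": 0.895,
    "deint + 480p@2mbit": 0.903,
}

for label, speed in results.items():
    needed = seconds_to_encode(60, speed)
    print(f"{label}: {needed:.1f}s per minute of video, {needed - 60:.1f}s behind")
```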

I have a test build of the DVR that uses OMX if you want to try it out:

curl -XPUT http://127.0.0.1:8089/updater/check/2019.08.17.0110

Awesome, thanks for this. Works pretty well for live programming; it's already usable. With a 1080i source and 1080p output at any bitrate, it starts at a healthy 1.5x transcode (and slowly drops, I guess as it hits a full buffer and converges on real-time 1x??). Not sure if it respects the bitrate setting, but all 1080p outputs work well with either 1080i or 720p source. CPU is 130% or below. Really impressive.

I did encounter one hitch: after a minute or two in the web viewer the stream just stopped flat, and these warnings were in the log:

2019/08/16 21:29:39 [WRN] Buffer for 10654FDA ch11.1 is more than 50% full (clients=1, len=16777684)
2019/08/16 21:29:44 [WRN] Buffer for 10654FDA ch11.1 is more than 50% full (clients=1, len=16777684)
2019/08/16 21:29:49 [WRN] Buffer for 10654FDA ch11.1 is more than 75% full (clients=1, len=25166132)
2019/08/16 21:29:51 [WRN] Buffer for 10654FDA ch11.1 is more than 50% full (clients=1, len=16777420)
2019/08/16 21:29:56 [WRN] Buffer for 10654FDA ch11.1 is more than 75% full (clients=1, len=25167052)
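
Out of curiosity I sketched a small parser for those warnings. Dividing `len` by the reported threshold gives an upper bound on the buffer capacity. The readings sit just above 16 MiB and 24 MiB, which would be consistent with a 32 MiB buffer, but that is my guess and not something the log states. The regex and function names are mine.

```python
import re

# Matches the warning format shown in the log above.
LINE_RE = re.compile(r"more than (\d+)% full \(clients=(\d+), len=(\d+)\)")


def parse_warning(line):
    """Return (threshold_percent, length_bytes), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(3))


def capacity_upper_bound(threshold_percent, length_bytes):
    """If length exceeds threshold% of the capacity, the capacity must be
    smaller than length / (threshold / 100)."""
    return length_bytes * 100 / threshold_percent


if __name__ == "__main__":
    sample = [
        "2019/08/16 21:29:39 [WRN] Buffer for 10654FDA ch11.1 is more than 50% full (clients=1, len=16777684)",
        "2019/08/16 21:29:49 [WRN] Buffer for 10654FDA ch11.1 is more than 75% full (clients=1, len=25166132)",
    ]
    for line in sample:
        pct, length = parse_warning(line)
        bound = capacity_upper_bound(pct, length)
        print(f"{pct}% warning at {length} bytes -> capacity below {bound / 2**20:.3f} MiB")
```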

As you found, rescaling hits hard. So the 720p output setting with 1080i source could only achieve 0.85x or so. Even 720p source to 576p output grinds to a halt. But during either of these, it's still only using 130% CPU. I gather de-interlacing and scaling are single threaded? Possible to double up threads?

And unfortunately, hardware transcoding doesn't work at all for streaming pre-recorded segments. Just spins and refuses to start the stream at all (despite advertising early speeds up to 1000x). You'd think pulling from disk would be easier than juggling an incoming network stream, so maybe just some setting there.

Great start!

Try the updated command above for recording playback fix.

Works, but takes up to a minute for the stream to start or to move playhead.

Does the experimental new transcoder checkbox make any difference?

Better, but not as good as a live stream. Down to maybe 10-15s. Oddly says transcoding at 0.3-0.5x.

Can you try the curl command with 2019.08.17.0445

I'm only a not-super-technical “user”, but my original thought was: wouldn’t “remux” be better?

Sorry to bother you if this is an “off the wall” question.