Stream.mpg M3U Buffering Bug

Ran into what is likely an obscure stream.mpg buffering bug. Have moved a couple of my network cameras (home observation/security/wildlife) to Channels via a custom M3U of h264 RTSP streams. Since Channels has no .strm support (hint hint), this is only useful for live viewing. Use additional ffmpeg instances (I-frame-only extraction) feeding mjpeg-server (since ffserver is discontinued), which in turn feeds my existing custom home network camera software to provide last-x-hour recording and web-based viewing with last-x-hour fast seek.
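For reference, the I-frame extraction leg looks roughly like the sketch below; the camera URL and quality value are placeholders, and the mjpeg-server side is omitted since its invocation depends on the local setup:

# Decode only keyframes from the RTSP feed and emit MJPEG on stdout,
# which is then fed to mjpeg-server (invocation omitted).
ffmpeg -rtsp_transport tcp -skip_frame nokey -i rtsp://camera.lan:554/stream \
    -vsync vfr -q:v 4 -f mjpeg pipe:1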

Mjpeg-server uses a combination of HTTP chunked (yuck) plus multipart plus content-length-prefixed encoding. This standards-compliant though dubious approach has exposed a bug, likely in my HTTP ingestion code, that causes a hang after many hours of running.
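For anyone unfamiliar with that framing, the response looks roughly like the following; the boundary name and sizes are illustrative, and the chunked transfer-encoding framing underneath is omitted:

HTTP/1.1 200 OK
Content-Type: multipart/x-mixed-replace; boundary=frame
Transfer-Encoding: chunked

--frame
Content-Type: image/jpeg
Content-Length: 48213

<JPEG bytes>
--frame
Content-Type: image/jpeg
Content-Length: 47992

<JPEG bytes>
...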

Unexpectedly, that hang impacts Channels. A slow or misperforming stream listener causes Channels to buffer the stream (to allow it to catch up). A good idea, but if the slow listener fails and Channels runs out of buffer, it kills all streams and all listeners; the Channels clients themselves fail (the Channels tvOS client gets a "disconnect" failure).

2021/09/07 12:04:25.653946 [WRN] Buffer for M3U-Cameras ch7002 is more than 50% full (clients=2, len=33564016)
2021/09/07 12:05:37.948777 [WRN] Buffer for M3U-Cameras ch7002 is more than 75% full (clients=2, len=50351156)
2021/09/07 12:06:36.001105 [WRN] Buffer for M3U-Cameras ch7002 is more than 95% full (clients=2, len=63768472)
2021/09/07 12:06:47.532552 [WRN] Buffer for M3U-Cameras ch7002 is more than 99% full (clients=2, len=66443900)

Adding buffering to accommodate slow listeners is good. Killing the well-performing listener streams in favor of a broken listener seems like a bug. If a listener is too far behind, Channels should drop it and continue providing the stream to the functioning listeners.

While I will resolve my listener issue, it seems like the underlying Channels buffering issue should be fixed to ensure robust stream playback. Thanks.

The buffer is a ring and drops old data when hitting 100%. The messages printed are warnings about this impending behavior.

I'm not sure why you would be seeing disconnects. Are there no log messages when that happens?

Longest log sequence I could find:

2021/09/08 15:21:03.663363 [TNR] Sharing existing connection to M3U-Cameras for ch7002 Cam-2 (clients=2, len=0)
2021/09/08 15:21:10.172106 [TNR] Sharing existing connection to M3U-Cameras for ch7002 Cam-2 (clients=2, len=0)
2021/09/08 15:25:04.863824 [WRN] Buffer for M3U-Cameras ch7002 is more than 50% full (clients=1, len=33574976)
2021/09/08 15:26:29.889915 [WRN] Buffer for M3U-Cameras ch7002 is more than 75% full (clients=1, len=50332300)
2021/09/08 15:27:37.770516 [WRN] Buffer for M3U-Cameras ch7002 is more than 95% full (clients=1, len=63758564)
2021/09/08 15:27:51.406799 [WRN] Buffer for M3U-Cameras ch7002 is more than 99% full (clients=1, len=66454804)
2021/09/08 15:37:27.254975 [TNR] Closed connection to M3U-Cameras for ch7002 Cam-2
2021/09/08 15:37:27.255174 [HLS] ffmpeg: rtsp-7002:  av_interleaved_write_frame(): Broken pipe
2021/09/08 15:37:27.255191 [HLS] ffmpeg: rtsp-7002:  Error writing trailer of pipe:: Broken pipe

A fixed-capacity ring buffer makes sense, but it appears to be hitting some internal limit and then failing. This has happened multiple times, so it is not a one-off. It takes 3-10 hours to replicate; hoping to attach gdb next time and figure out the root cause.

Found my bug and can now duplicate the Channels streaming bug on demand. Start watching an M3U channel on a tvOS client and then:

curl "http://dvr.lxc.lan:8089/devices/ANY/channels/7002/stream.mpg" >/dev/null

Once curl has been running for a minute or so (to establish data flow), press Control-Z to suspend it. Within ten minutes, Channels will overflow its ring buffer and the well-behaved tvOS client will fail with "Connection Lost".
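The same repro can be scripted if that is easier; a minimal sketch assuming the same host/port as the curl command above, with rough timing values:

# Start a stream consumer, let data flow, then freeze the process so it
# stops reading while keeping its TCP connection open.
curl -s "http://dvr.lxc.lan:8089/devices/ANY/channels/7002/stream.mpg" >/dev/null &
CURL_PID=$!
sleep 60                  # let the stream establish data flow
kill -STOP "$CURL_PID"    # suspend curl; same effect as Control-Z in a terminal
# Within ~10 minutes the ring buffer fills and the other clients disconnect.
# kill -CONT "$CURL_PID"  # resume (or kill) the stalled client to clean up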

When the misbehaving client stops accepting data (but its connection remains open), the network send buffer associated with the Channels stream for that client fills. Not sure if Channels is using blocking sends or what the internal issue is, but that full network buffer prevents the ring buffer from force-dropping the oldest data, and the well-behaved Channels clients fail.

Hope this is enough to track down. Probably very rare in normal operation, but guessing it gets triggered in the wild occasionally.


Interesting. I wonder if this is specific to RTSP based streams somehow.

Wondered the same thing. As I'm still fixing the underlying bug on my side, I did not have time to experiment with other stream sources. Might try against Pluto if I find some free cycles.

Did this in the background. The same problem occurs with M3U-fed channels such as the Pluto network, so it is not specific to RTSP.

FYI-- this bug impacts all tuner sources, including HDHR when tuner sharing is enabled. Hardly surprising, since it is triggered by what is effectively extreme network congestion on the output side of your stream replication.

Though I cannot be 100% certain, it looks like if you pause a Channels client (Apple TV in my case) and its history buffer fills, that also interrupts data distribution to the other clients. Unpausing the client unblocks data distribution, but then the logs are spammed with:

2021/09/15 09:41:45.071425 [HLS] ffmpeg: rtsp-7003:  [mpegts @ 0x6777e80] Non-monotonous DTS in output stream 0:0; previous: 4289818122, current: 2326469459; changing to 4289818123. This may result in incorrect timestamps in the output file.

Am guessing this is because the data loss meant that the stream became non-contiguous. The logs overflowed with ffmpeg "non-monotonous DTS" (shouldn't that be non-monotonic?) error messages. (Because the spam exceeded my 10MB log limit, I could not confirm whether the same buffering warnings preceded this.)

This also looks like a knock-on bug in ffmpeg. Shouldn't it resync against the updated DTS values at some point? It's not as if the missing data is going to magically reappear. The only way to fix it was to terminate the RTSP source endpoint (thus terminating the Channels ffmpeg collector) and then restart it so that Channels would launch a fresh ffmpeg collector.

Channels DVR can be a very powerful front-end for security cameras, but this buffering bug makes it fragile in an application that requires high reliability.

The "non-monotonous DTS" issue reported above is not related to the Channels ring-buffer overrun issues-- it is caused by an ffmpeg bug related to long streaming sessions of raw h264, where ffmpeg overruns a 32-bit timestamp value.

Recently added two Raspberry Pi based network cameras (one with Cam V2 and one with Cam HQ) as potential long-term replacements for commercial cameras. ffmpeg-static transports the h264-encoded video stream to Channels via an RTSP rebroadcaster. The v4l2 tools allow nice customization of the video feed (ISO, exposure, etc.). However, after 12+ hours, ffmpeg rolls over the timestamps. Have tried many options (use_wallclock_as_timestamps, correct_ts_overflow, etc.) without luck. A periodic force-restart of ffmpeg, as sketched below, seems to be the only rather ugly fix.
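A minimal sketch of that restart workaround, assuming a v4l2 source on /dev/video0 and a placeholder RTSP publish URL (device, resolution, and restart interval are guesses, not my actual setup):

# Relaunch ffmpeg on a fixed interval so its internal timestamps reset
# before the 32-bit rollover is reached. use_wallclock_as_timestamps is
# shown, but as noted above it did not prevent the rollover on its own.
while true; do
    timeout 12h ffmpeg -f v4l2 -input_format h264 -video_size 1920x1080 \
        -use_wallclock_as_timestamps 1 -i /dev/video0 \
        -c:v copy -f rtsp rtsp://rtsp-rebroadcaster.lan:8554/cam1
    sleep 2               # brief pause before launching a fresh ffmpeg instance
done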

Interestingly, Channels itself does not seem to care about the non-monotonic DTS frames, and the streams continue to play fine in both the Web UI and the Apple TV client. However, it spams the logs with tens of megabytes of useless data (erasing everything else if using any kind of log management). Those same streams are subsequently consumed by another ffmpeg instance that pulls the I-frames for injection into my security camera software. Ironically, ffmpeg #2 fails when the DTS errors start.

We have finally managed to find a fix for this, which is available under a new experimental option. Please try it out and let us know if it resolves the issue.