ChromeCast - Stream Calls Failing when Stalled for a Long Time

I'm attempting to play a live stream on ChromeCast. The stream is cast fine and playback starts appropriately.
However, once the stream has been playing for a while (somewhere between 2 and 15 minutes), the player stops playing and I get MediaStatus.IDLE_REASON_ERROR in my RemoteMediaClient.Callback.
Looking at the console logs from the ChromeCast, I can see that 3-4 calls have failed. Here are the logs:
14:50:26.931 GET https://... 0 ()
14:50:27.624 GET https://... 0 ()
14:50:28.201 GET https://... 0 ()
14:50:29.351 GET https://... 0 ()
14:50:29.947 media_player.js:64 [1381.837s] [cast.player.api.Host] error: cast.player.api.ErrorCode.NETWORK/3126000
Looking at Cast MediaPlayer.ErrorCode, error 312* is:
Failed to retrieve the media (bitrated) playlist m3u8 file with three retries.
Developers need to validate that their playlists are indeed available. It could be the case that a user cannot reach the playlist as well.
I checked: the playlist was available. So I thought perhaps the server wasn't responding, and looked at the response logs for the network calls.
Successful Request
Stalled Request
Note that the stall time far exceeds the usual stall time.
ChromeCast isn't making these calls at all; the requests simply sit stalled for a long time until they are cancelled. All the successful requests are stalled for less than 14 ms (mostly under 2 ms).
The Network Analysis Timing Breakdown provides three reasons for stalling:
There are higher priority requests.
There are already six TCP connections open for this origin, which is the limit (applies to HTTP/1.0 and HTTP/1.1 only).
The browser is briefly allocating space in the disk cache.
While I do believe the first one should not be the case, the latter two can be. However, in both cases I believe the fault lies with cast.player.
Am I doing something wrong?
Has anyone else faced the same issue? Is there any way to either fix it or come up with a workaround?
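One possible sender-side workaround is to watch for the error and simply re-issue the load. Below is a minimal sketch, assuming the Cast v3 sender APIs named above (RemoteMediaClient, MediaStatus) plus a mediaInfo object already built for the original load; it recovers after the failure rather than preventing the stalled playlist fetches:

import com.google.android.gms.cast.MediaInfo;
import com.google.android.gms.cast.MediaLoadOptions;
import com.google.android.gms.cast.MediaStatus;
import com.google.android.gms.cast.framework.media.RemoteMediaClient;

// Crude recovery: when the player goes IDLE with IDLE_REASON_ERROR,
// re-load the live stream from the sender side.
void installErrorReload(final RemoteMediaClient remoteMediaClient, final MediaInfo mediaInfo) {
    remoteMediaClient.registerCallback(new RemoteMediaClient.Callback() {
        @Override
        public void onStatusUpdated() {
            MediaStatus status = remoteMediaClient.getMediaStatus();
            if (status != null
                    && status.getPlayerState() == MediaStatus.PLAYER_STATE_IDLE
                    && status.getIdleReason() == MediaStatus.IDLE_REASON_ERROR) {
                remoteMediaClient.load(mediaInfo,
                        new MediaLoadOptions.Builder().setAutoplay(true).build());
            }
        }
    });
}

For a live stream this at worst rejoins at the live edge, so the user sees a brief rebuffer instead of a dead player.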

Related

What do the fields in rpcdebug -c's dmesg output mean?

I'm trying to track down a stall that may be on the client host or may be on the server side. Unfortunately this is all at the kernel RPC level, with some if not all of it outside my code's control...
I'm something of a neophyte when it comes to RPC debugging, so any references would be helpful.
I have this output in dmesg when rpcdebug -c -m all is run:
[2109401.599881] -pid- flgs status -client- --rqstp- -timeout ---ops--
[2109401.600055] 51580 0880 0 ffff9af4c4da9800 ffff9af4c4416600 15000 ffffffffc0b26680 nfsv3 GETATTR a:call_status [sunrpc] q:xprt_pending
[2109401.600300] 51581 0880 0 ffff9af4c4da9800 ffff9af42465f800 15000 ffffffffc0b26680 nfsv3 GETATTR a:call_status [sunrpc] q:xprt_pending
I get the PID, flags, and timeout, but:
What are the "client" and "rqstp" fields supposed to mean? If I see either duplicated in subsequent outputs, does that mean the RPC is stalled? And which way?
Is "xprt_pending" a "waiting to send the RPC" queue? If that queue was "delayq," I would know what that means in the context of the problem we're trying to diagnose. But this state doesn't seem to be explained anywhere I find in Google. (And my Google-Fu is usually better than this...)
What is the "ffffffffc0b26680" supposed to be? It repeats all over the output, for EVERY RPC listed.
I'm trying to avoid running with rpcdebug set, because I'm dealing with an intermittent stall, and I'd rather not slow EVERYTHING down in the hopes of catching the stall.

FB Messenger API - Receiving double requests

I have a working FB Bot built with Ruby which allows players to play a scavenger hunt.
Sometimes though, when I have multiple players in a team, FB sends me a player's 'Answer' webhook twice. I looked into it and at first thought it was to do with the 20-second timeout that applies when FB gets no 200 OK response (Docs here). After checking the logs, though, I am receiving the second webhook from FB only 14 seconds later. See below:
# Webhook #1
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153642358, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
# Webhook #2 (14 seconds later)
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153656901, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
Notice both are exactly the same apart from the first "time" attribute (14 secs later).
Due to a number of methods and calls that I process after receiving the first webhook, the 200 OK response is only sent back to FB once I have finished sending my messages in response (hence the 14-second delay).
So I have two questions:
Is the 14-second delay too long, and is that why FB is resending? If so, how can I send a 200 OK response straight away (head :ok)?
Is it another issue entirely?
Also ensure that "Echo" is disabled: go to Settings > Webhooks and edit the events.
An asynchronous platform like Node.js is recommended. In my case I work with AWS SQS: I have workers that process the requests without blocking (they don't wait), and I return 200 "ok" to FB right away so that FB doesn't send the message to my webhook again.
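A minimal Java illustration of that "ack first, work later" shape (hypothetical: the bot in the question is Ruby, and the port, path, and handler names here are made up); the handler acknowledges immediately and pushes the slow game logic onto a background worker:

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WebhookServer {
    public static void main(String[] args) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/webhook", exchange -> {
            byte[] body = exchange.getRequestBody().readAllBytes();
            // Acknowledge right away so FB's retry timer never fires.
            exchange.sendResponseHeaders(200, -1); // -1 = no response body
            exchange.close();
            // The slow part (game logic, sending replies) happens off the request path.
            worker.submit(() -> handleAnswer(new String(body)));
        });
        server.start();
    }

    private static void handleAnswer(String payload) {
        // Parse the messaging events and send the replies here;
        // this part may legitimately take many seconds.
    }
}

In Rails terms this is head :ok at the top of the controller action plus a background job for the replies.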
Another approach may be to store the mid in a database and check on each request whether that mid already exists; if it does, don't process the message. I used DynamoDB (AWS) with TTL enabled, so the database cleans itself every hour, erasing old requests.
I think it is the 15-second wait before replying; it was also happening to me, as Facebook auto-retries when you don't reply fast enough. Te EEe Te's idea is solid: write some mechanism to cache mids and check whether a message is a duplicate before processing it.
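To make the mid-caching idea concrete, here is a small in-memory sketch in Java (names are made up; a store with TTL such as the DynamoDB setup described above is the durable version of the same thing):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MidDeduper {
    private static final long TTL_MILLIS = 60 * 60 * 1000; // forget mids after an hour
    private final Map<String, Long> seen = new ConcurrentHashMap<>();

    // Returns true the first time a mid is offered, false for FB's retries.
    public boolean firstDelivery(String mid) {
        long now = System.currentTimeMillis();
        // Opportunistic cleanup so the map doesn't grow without bound.
        seen.entrySet().removeIf(e -> now - e.getValue() > TTL_MILLIS);
        return seen.putIfAbsent(mid, now) == null;
    }
}

Call firstDelivery(mid) on each incoming messaging event and just return 200 for duplicates.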

WinAPI - Why does ICMPSendEcho2Ex report false timeouts when Timeout is set below 1000ms?

Edit: I started asking this as a PowerShell / .Net question and couldn't find any reference to it on the internet. With feedback, it appears to be a WINAPI issue, so this is an edit/rewrite/retag, but much of the testing and background still references .Net for that reason.
Summary
The WINAPI ping function IcmpSendEcho2 appears to have a timing bug if the ping timeout parameter is set below 1000ms (1 second). This causes it to return intermittent false timeout errors. Rather than a proportional "lower timeout = more fails" behaviour, it appears to be a hard cutoff: >=1000ms gives the expected behaviour, while <=999ms triggers false timeouts, often in an alternating success/fail/success/fail pattern.
I call them false timeouts partly because I have a WireShark packet capture showing a reply packet coming back well within the timeout, partly because a 1ms change shouldn't be a significant amount of time when the replies normally have 500-800ms of headroom, and partly because I can run two concurrent sets of pings with different timeouts and see different behaviour between the two.
In the comments of my original .Net question, @wOxxOm has:
located the Open Source .Net code where System.Net.NetworkInformation.Ping() wraps the WinAPI and there appears to be no specific handling of timeouts there, it's passed down to the lower layer directly - possibly line 675 with a call to UnsafeNetInfoNativeMethods.IcmpSendEcho2()
and @Lieven Keersmaekers has investigated and found things beyond my skill level to interpret:
"I can second this being an underlying WINAPI problem. Following a success and timedout call into IPHLPAPI!IcmpSendEcho2Ex: the 000003e7 parameter is on the stack, both set up an event and both return into IPHLPAPI!IcmpEchoRequestComplete with the difference of the success call's eax register containing 00000000 and the timedout call's eax register containing 00000102 (WAIT_TIMEOUT)
"Compiling a 64bit C# version, there's no more calls into IPHLPAPI. A consistent thing that shows up is clr.dll GetLastError() returning WSA_QOS_ADMISSION_FAILURE for timeouts. Also consistent in my sample is the order of thread executions between a success and a timeout call being slightly different."
This StackOverflow question hints that the WSA_QOS_ADMISSION_FAILURE might be a misleading error, and is actually IP_REQ_TIMED_OUT.
Testing steps:
Pick a distant host and set some continuous pings running. (The IP in my examples belongs to Baidu.cn (China) and has ping replies of around 310ms to me.) Expected behaviour for all of them: almost entirely ping replies, with occasional drops due to network conditions.
PowerShell / .Net, with 999ms timeout, actual result is bizarre reply/drop/reply/drop patterns, far more drops than expected:
$Pinger = New-Object -TypeName System.Net.NetworkInformation.Ping
while ($true) {
    $Pinger.Send('111.13.101.208', 999)
    Start-Sleep -Seconds 1
}
command prompt ping.exe with 999ms timeout, actual result is more reliable (edit: but later findings call this into question as well):
ping 111.13.101.208 -t -w 999
PowerShell / .Net, with 1000ms timeout, actual result is as expected:
$Pinger = New-Object -TypeName System.Net.NetworkInformation.Ping
while ($true) {
    $Pinger.Send('111.13.101.208', 1000)
    Start-Sleep -Seconds 1
}
It's repeatable with C# as well, but I've edited that code out now it seems to be a WinAPI problem.
Example screenshot of these running side by side
On the left, .Net with 999ms timeout and 50% failure.
Center, command prompt, almost all replies.
On the right, .Net with 1000ms timeout, almost all replies.
Earlier investigations / Background
I started with a 500ms timeout, and the quantity of false timeouts seems to vary depending on the ping reply time of the remote host:
pinging something 30ms away reports TimedOut around 1 in 10 pings
pinging something 100ms away reports TimedOut around 1 in 4 pings
pinging something 300ms away reports TimedOut around 1 in 2 pings
This is from the same computer, on the same internet connection, sending the same amount of data (32-byte buffer) with the same 500ms timeout setting, and with no other heavy bandwidth use. I run no antivirus networking filter outside the Windows 10 defaults, and two other people I know have confirmed this frequent TimedOut behaviour on their computers (now two more have in the comments), with more or fewer timeouts, so it ought not to be my network card, drivers, or ISP.
WireShark packet capture of ping reply which is falsely reported as a timeout
I ran the ping by hand four times to a ~100ms away host, with a 500ms timeout, with WireShark capturing network traffic. PowerShell screenshot:
WireShark screenshot:
Note that the WireShark log records 4 requests leaving, 4 replies coming back, each with a time difference of around 0.11s (110 ms) - all well inside the timeout, but PowerShell wrongly reports the last one as a timeout.
Related questions
Googling shows me heaps of issues with System.Net.NetworkInformation.Ping, but none that look the same, e.g.:
System.Net.NetworkInformation.Ping crashing - it crashes if allocated/destroyed in a loop in .Net 3.5 because its internals get wrongly garbage collected. (.Net 4 here and not allocating in a loop)
Blue screen when using Ping - 6+ years of ping being able to BSOD Windows (not debugging an Async ping here)
https://github.com/dotnet/corefx/issues/15989 - it doesn't timeout if you set a timeout to 1ms and a reply comes back in 20ms, it still succeeds. False positive, but not false negative.
The documentation for Ping() warns that a low timeout might still report success, but I can't see that it warns the timeout might falsely report failure if set below 1 second.
Edit: Now that I'm searching for ICMPSendEcho2, I have found exactly this problem documented before, in a blog post from May 2015: https://www.frameflow.com/ping-utility-flaw-in-windows-api-creating-false-timeouts/ - it finds the same behavior but offers no further explanation. They say that ping.exe is affected, whereas I originally thought it wasn't. They also say:
"In our tests we could not reproduce it on Windows Server 2003 R2 nor on Windows Server 2008 R2. However it was seen consistently in Windows Server 2012 R2 and in the latest builds of Windows 10."
Why? What's wrong with the timeout handling that makes it completely ignore ping responses coming into the network stack? Why is 1000ms a significant cutoff?

AJAX query weird delay between DNS lookup and initial connection on Chrome but not FF, what is it?

I have an AJAX query on my client that passes two parameters to a server:
var url = window.location.origin + "/instanceStats";
$.getJSON(url, { "unit": unit, "stat": stat }, function(data) {
    instanceData[key] = data;
    var count = showInstanceStats(targetElement, unit, stat, limiter);
});
The server itself is a very simple Python Flask application. On that particular URL, it grabs the "unit" and "stat" parameters from the query to determine the name of a CSV file and line within that file, grabs the line, and sends the data back to the client formatted as JSON (roughly 1KB).
Here is the funny thing: When I measure the time it takes for the data to come back, I observe that some queries are fast (between 20 and 40 ms), and some queries are slow (between 320 and 350 ms). Varying the "stat" parameter (i.e. selecting a different line in the CSV) doesn't seem to have any impact. The fast and slow queries usually switch back and forth (i.e. all even queries are fast, all odd ones are slow). The Python server itself reports roughly the same time for each query.
AJAX itself doesn't seem to have any impact either, as I can take the url that is constructed in the JS and paste it into the browser myself and get the same behavior. Here are some measurements from two subsequent queries:
Fast: http://i.imgur.com/VQ7qopd.png
Slow: http://i.imgur.com/YuG0ROM.png
This seems to be Chrome-specific; I've tried it on Firefox, and the same experiment yields roughly the same query time every time (between 30 and 50 ms). This is unfortunate, as I want to deploy on both Chrome and Firefox.
What's causing this behavior, and how can I fix it?
I've run into this also. It only seems to happen when using localhost. If you use 127.0.0.1 (or even the computer name), it will not have the extra delay.
I'm having it too, and it's exactly the same: my Node.js application serves Ajax requests, and no matter which /url I request, it's either 30ms or 300ms, switching back and forth: odd requests are long, even requests are short.
The thing I see in Chrome Web Inspector (aka Chrome DevTools) is that there is a long gap between "DNS lookup" and "Initial Connection".
They say it's OCSP related here:
http://www.webpagetest.org/forums/showthread.php?tid=12357
OCSP is some kind of certificate validation protocol:
https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol
Moving from localhost to 127.0.0.1 seems to fix it: response times are 30ms now.

XBee - XBee-API and multiple endpoints

Using Andrew Rapp's XBee-API, how can I sample I/O data via a coordinator from more than two endpoints?
I have 17 Series 1 XBees. I have programmed one to be a coordinator (API mode = 2) and the rest to be endpoints. Using XBee-API I am sending a Force I/O Sample ("IS") remote AT command, unicast to each endpoint. This works perfectly well when there are up to two endpoints, but as soon as a third is added, one of the three always becomes non-responsive (times out with XBeeTimeoutException). It's not always the same physical unit that stops responding, but it is always the third one (for example, if I send Force I/O Sample to Device1, Device2, and Device3, Device3 will time out, and if I change the order to Device3, Device1, Device2, then Device2 will time out).
If I set up more than three XBees, about 1 out of 3 will time out - but not every third one.
I've verified that the XBees themselves are fine. I've searched the Internet and Stack Overflow in particular to no avail. I've tried using a simple ZNetRemoteAtRequest. I've tried opening and closing the XBee coordinator serial connection once for all three devices, once per device, and once per program run. I've tried varying the distance between the coordinator and endpoints (never more than five feet apart). I've tried different coordinator configuration parameters (from the Digi documentation). I've tried changing out the XBee for the coordinator.
This is the code I'm using to send the Force I/O Sample request to each endpoint and read the response:
xbee = new XBee(); // Coordinator
xbee.open("/dev/ttyUSB0", 115200); // Happens before any of the endpoints are contacted
... // Loop through known endpoint addresses
XBeeRequest request = new ZBForceSampleRequest(new XBeeAddress64(endpointAddress));
ZNetRemoteAtResponse response =
        (ZNetRemoteAtResponse) xbee.sendSynchronous(request, remoteXBeeTimeout);
if (response.isOk()) {
    // Process response payload
}
... // End loop and finally close coordinator connection
What might help polling I/O samples from more than two endpoints?
EDIT: I found that Andrew Rapp's XBee-API library fakes multithreaded behavior, which causes the synchronization issues described in this question. I wrote a replacement library that is actually multithreaded and correctly maps responses from multiple XBee endpoints: https://github.com/steveperkins/xbee-api-for-java-1-4. When I wrote it, Java 1.4 was necessary for use on the BeagleBone, Plug, and Zotac single-board PCs, but it's an easy conversion to 1.7+.
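For anyone curious what "correctly maps responses" means in practice, here is a minimal, hypothetical sketch of the usual fix: tag each outgoing frame with its frame ID and let a single serial-reader thread hand each response to whichever caller sent that ID. This is illustrative modern Java, not the API of either library mentioned above:

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class FrameIdCorrelator {
    // Pending requests keyed by the XBee API frame ID (1-255).
    private final Map<Integer, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

    // Sender side: register before writing the frame to the serial port,
    // then block on (or compose with) the returned future.
    public CompletableFuture<byte[]> register(int frameId) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        pending.put(frameId, future);
        return future;
    }

    // Reader side: the single serial-reader thread calls this for every
    // inbound frame; responses may arrive in any order from any endpoint.
    public void onResponse(int frameId, byte[] payload) {
        CompletableFuture<byte[]> future = pending.remove(frameId);
        if (future != null) {
            future.complete(payload); // wakes exactly the caller that sent this frame ID
        }
        // else: unsolicited frame (e.g. a periodic I/O sample); route it elsewhere
    }
}

With responses matched by frame ID rather than by arrival order, a slow reply from one endpoint can no longer be mistaken for the answer to a different request, no matter how many endpoints are polled.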
Are you using hardware flow control on your serial port? Is it possible that you're sending requests out while the local XBee has deasserted CTS (i.e., asking you to stop sending)? I assume you're running at 115200 bps, so the XBee serial port can keep up with the network data rate.
Can you turn on debugging information, or connect some port monitoring hardware/software to display the data going over the serial port to the local XBee?
