Why would socket.write hang indefinitely? - ruby

What would make a write call to a TCPSocket hang indefinitely?
lotsOfBytes = # a really large number of bytes, like 1 or 2 MB of data
socket = TCPSocket.new # some config
socket.write(lotsOfBytes) # this line hangs
I am trying to debug an issue where a get_multi operation sent to memcached with a large number of keys hangs indefinitely, and it does so on a line that resembles that code snippet. I'm trying to better understand how the low-level sockets on which this library is built are expected to work.

What are the values of the following attributes on your TCPSocket:
Is keep-alive activated, and with what value?
Is a timeout set, and to what value?
A Wireshark capture would make it much easier to see what happens just before the connection hangs.

tcpdump: are there any attempts to send anything?
netstat: to see the output queue.
Does it work with a small number of bytes in your environment?
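On the underlying mechanics: socket.write blocks when the kernel's send buffer is full and the peer (here, memcached) has stopped draining its receive buffer; TCP flow control then stalls the sender indefinitely, because plain write has no timeout. A minimal sketch of a bounded write using write_nonblock and IO.select (the helper name and timeout value are illustrative, not from any library):

```ruby
require 'socket'

# Write +data+ to +socket+, raising if the kernel send buffer stays full
# for more than +timeout+ seconds (i.e. the peer has stopped reading).
# Illustrative helper, not part of any library.
def write_with_timeout(socket, data, timeout)
  until data.empty?
    begin
      written = socket.write_nonblock(data)
      data = data.byteslice(written..-1)
    rescue IO::WaitWritable
      # No room in the send buffer; wait (bounded) for it to drain.
      unless IO.select(nil, [socket], nil, timeout)
        raise "write timed out after #{timeout}s: peer not reading?"
      end
    end
  end
end
```

With a plain socket.write the same situation simply blocks forever, which would match the hang you're seeing if memcached stops reading while it assembles a large get_multi response.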


What do the fields in rpcdebug -c's dmesg output mean?

I'm trying to track down a stall that may be on the client host or may be on the server side. Unfortunately this is all at the kernel RPC level, with some of it not controlled by my code.
I'm something of a neophyte when it comes to RPC debugging, so any references would be helpful.
I have this output in dmesg when rpcdebug -c -m all is run:
[2109401.599881] -pid- flgs status -client- --rqstp- -timeout ---ops--
[2109401.600055] 51580 0880 0 ffff9af4c4da9800 ffff9af4c4416600 15000 ffffffffc0b26680 nfsv3 GETATTR a:call_status [sunrpc] q:xprt_pending
[2109401.600300] 51581 0880 0 ffff9af4c4da9800 ffff9af42465f800 15000 ffffffffc0b26680 nfsv3 GETATTR a:call_status [sunrpc] q:xprt_pending
I get the PID, flags, and timeout, but:
What are the "client" and "rqstp" fields supposed to mean? If I see either duplicated in subsequent outputs, does that mean the RPC is stalled? And in which direction?
Is "xprt_pending" a "waiting to send the RPC" queue? If that queue were "delayq," I would know what that means in the context of the problem we're trying to diagnose. But this state doesn't seem to be explained anywhere I can find via Google. (And my Google-fu is usually better than this...)
What is the "ffffffffc0b26680" supposed to be? It repeats all over the output, for EVERY RPC listed.
I'm trying to avoid running with rpcdebug set, because I'm dealing with an intermittent stall, and I'd rather not slow EVERYTHING down in the hopes of catching the stall.

MongoDB-Java performance with rebuilt Sync driver vs Async

I have been testing MongoDB 2.6.7 for the last couple of months using YCSB 0.1.4. I have captured good data comparing SSD to HDD and am producing engineering reports.
After my testing was completed, I wanted to explore the allanbank async driver. When I got it up and running (I am not a developer, so it was a challenge for me), I first wanted to try the rebuilt sync driver. I found performance improvements of 30-100%, depending on the workload, and was very happy with it.
Next, I tried the async driver. I was not able to see much difference between it and my results with the native driver.
The command I'm running is:
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://192.168.0.13:27017/ycsb -p mongodb.writeConcern=strict -threads 96
Over the course of my testing (mostly with the native driver), I have experimented with more and fewer threads than 96; turned on "noatime"; tried both xfs and ext4; disabled hyperthreading; disabled half my 12 cores; put the journal on a different drive; changed sync from 60 seconds to 1 second; and checked the network bandwidth between the client and server to ensure it's not oversubscribed (10GbE).
Any feedback or suggestions welcome.
The async move exceeded my expectations. My experience is with the Python sync driver (pymongo) and the async driver (motor), and the async driver achieved greater than 10x the throughput. Further, motor still uses pymongo under the hood but adds the async ability; that could easily be the case with your allanbank driver as well.
Often the dramatic changes come from threading policies and OS configurations.
Async needn't and shouldn't use any more threads than there are cores on the VM or machine. For example, if your server code is spawning a new thread per incoming connection, then all bets are off. Start by looking at the way the driver is being utilized: a 4-core machine should use <= 4 incoming threads.
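That sizing policy is easy to sketch (in Ruby here for brevity; the same shape applies to a Java driver): a fixed pool of one worker per core draining a shared queue, instead of a thread per connection. Etc.nprocessors and the doubling "work" are just stand-ins:

```ruby
require 'etc'

# One worker thread per core, all draining a single shared queue --
# as opposed to spawning a thread per incoming connection.
jobs    = Queue.new
results = Queue.new
workers = Etc.nprocessors.times.map do
  Thread.new do
    while (job = jobs.pop) != :stop
      results << job * 2 # stand-in for real request handling
    end
  end
end

100.times { |i| jobs << i }
workers.size.times { jobs << :stop }
workers.each(&:join)
```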
On the OS level, you may have to fine-tune parameters like net.core.somaxconn, net.core.netdev_max_backlog, sys.fs.file_max, and the nofile entries in /etc/security/limits.conf. The best place to start is nginx-related performance guides; nginx is the server that spearheaded, or at least caught the attention of, many Linux sysadmin enthusiasts. Contrary to popular lore, you should reduce your keep-alive timeout rather than lengthen it. The default keep-alive timeout is some absurd number of seconds (4 hours); you might want to cut the cord at 1 minute. Basically, think short, sweet relationships with your client connections.
Bear in mind that Mongo itself is not async, so you can use a Mongo driver pool. Nevertheless, don't let the driver get stalled on slow queries; cut them off in 5 to 10 seconds using the Java equivalents of the following (pymongo) settings. I'm just cutting and pasting here, with no recommendations.
# Specifies a time limit for a query operation. If the specified time is exceeded, the operation will be aborted and ExecutionTimeout is raised. If max_time_ms is None no limit is applied.
# Raises TypeError if max_time_ms is not an integer or None. Raises InvalidOperation if this Cursor has already been used.
CONN_MAX_TIME_MS = None

# socketTimeoutMS: (integer) How long (in milliseconds) a send or receive on a socket can take before timing out. Defaults to None (no timeout).
CLIENT_SOCKET_TIMEOUT_MS = None

# connectTimeoutMS: (integer) How long (in milliseconds) a connection can take to be opened before timing out. Defaults to 20000.
CLIENT_CONNECT_TIMEOUT_MS = 20000

# waitQueueTimeoutMS: (integer) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to None (no timeout).
CLIENT_WAIT_QUEUE_TIMEOUT_MS = None

# waitQueueMultiple: (integer) Multiplied by max_pool_size to give the number of threads allowed to wait for a socket at one time. Defaults to None (no waiters).
CLIENT_WAIT_QUEUE_MULTIPLY = None
Hopefully you will have the same success. I was ready to bail on Python before going async.

How to read a constant stream of NMEA http data using ruby

I have the iPhone app called 'gps2ip', which launches a web server you can visit to get streaming NMEA data.
You can directly connect to this stream using qgis to get an updated location position on your map. I'd like to access this stream programmatically.
If I type http://192.168.1.116:11123 into my browser's URL bar (where 192.168.1.116 is the IP of my smartphone, as indicated by the gps2ip app),
I get a constant stream of newline-separated NMEA strings in my Safari/Chrome/Mozilla browser window, with new lines of data continually appended at the bottom:
GPS 2 IP Server started. "exit" to finish.
$GPGGA,005730,3403.415,N,07914.488,W,1,8,0.9,13.6,M,46.9,M,0,2*56
$GPRMC,005730,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*66
$GPGGA,005730,3403.415,N,07914.488,W,1,8,0.9,13.6,M,46.9,M,0,2*56
$GPRMC,005730,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*66
$GPGGA,005731,3403.415,N,07914.488,W,1,8,0.9,13.7,M,46.9,M,0,2*56
$GPRMC,005731,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*67
$GPGGA,005731,3403.415,N,07914.488,W,1,8,0.9,13.7,M,46.9,M,0,2*56
$GPRMC,005731,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*67
$GPGGA,005732,3403.415,N,07914.488,W,1,8,0.9,13.6,M,46.9,M,0,2*54
$GPRMC,005732,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*64
$GPGGA,005732,3403.415,N,07914.488,W,1,8,0.9,13.6,M,46.9,M,0,2*54
$GPRMC,005732,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*64
$GPGGA,005733,3403.415,N,07914.488,W,1,8,0.9,13.5,M,46.9,M,0,2*56
$GPRMC,005733,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*65
$GPGGA,005733,3403.415,N,07914.488,W,1,8,0.9,13.5,M,46.9,M,0,2*56
$GPRMC,005733,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*65
$GPGGA,005734,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*50
$GPRMC,005734,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*62
$GPGGA,005734,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*50
$GPRMC,005734,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*62
$GPGGA,005735,3403.415,N,07914.488,W,1,8,0.9,13.3,M,46.9,M,0,2*56
$GPRMC,005735,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*63
$GPGGA,005735,3403.415,N,07914.488,W,1,8,0.9,13.3,M,46.9,M,0,2*56
$GPRMC,005735,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*63
$GPGGA,005736,3403.415,N,07914.488,W,1,8,0.9,13.2,M,46.9,M,0,2*54
$GPRMC,005736,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*60
$GPGGA,005736,3403.415,N,07914.488,W,1,8,0.9,13.2,M,46.9,M,0,2*54
$GPRMC,005736,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*60
$GPGGA,005737,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*53
$GPRMC,005737,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*61
$GPGGA,005737,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*53
$GPRMC,005737,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*61
$GPGGA,005738,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*5C
$GPRMC,005738,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*6E
$GPGGA,005738,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*5C
$GPRMC,005738,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*6E
$GPGGA,005739,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*5D
$GPRMC,005739,A,3403.415,N,07914.488,W,0.00,,120115,003.1,W*6F
$GPGGA,005739,3403.415,N,07914.488,W,1,8,0.9,13.4,M,46.9,M,0,2*5D
I know how to parse these lines of NMEA code into latitude/longitude pairs; I just need to be able to easily access "the last one" inside a Ruby environment.
I want to be able to pluck the last one or two lines and parse the NMEA strings manually, but I haven't figured out a way to "sip from the firehose" of data without generating an error message.
When I try this:
require 'open-uri'
open("http://192.168.1.116:11123")
I get this error:
Net::HTTPBadResponse: wrong status line: "GPS 2 IP Server started. \"exit\" to finish."
Where "GPS 2 IP Server Started. 'exit' to finish." is of course the first line of the response.
What Ruby gem should I use to sip from this firehose of data? Apparently open-uri expects HTTP headers, and my stream has none of that; I just need to read plain text.
Since this isn't the HTTP protocol, you'll need the more generic Socket. You can do something like this:
require 'socket'

s = TCPSocket.new '192.168.1.116', 11123
while line = s.gets
  puts line
end
s.close
Depending on how fast the data arrives and how long it takes to process each line, you may need to investigate putting each line into a queue such as Sidekiq so that multiple workers can process lines simultaneously.
Since what you have is a never-ending stream of data, you can't just grab the last message. You must consume the stream, and decide when you've received enough to start your processing. You could probably take advantage of the GPRMC message to do that.
In nmea_plus (full disclosure, I wrote it), you could do it this way:
require 'socket'
require 'nmea_plus'

allowable_delay = 3   # threshold to consider the stream as fresh
caught_up = false     # flag to say whether we've crossed the threshold

io_source = TCPSocket.new('192.168.1.116', 11123)
source_decoder = NMEAPlus::SourceDecoder.new(io_source)
source_decoder.each_message do |msg|
  case msg.data_type
  when 'GPRMC'
    if Time.now - msg.utc_time < allowable_delay
      caught_up = true
    end
  when 'GPGLL'
    if caught_up
      puts "Fix: #{msg.latitude}, #{msg.longitude}"
    end
  end
end
By default, the decoder ignores any lines that don't parse as NMEA.
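For reference, if you do end up plucking lines and parsing them by hand, a $GPGGA sentence can be decoded with a checksum check (XOR of the bytes between '$' and '*') and a ddmm.mmm-to-decimal-degrees conversion. A minimal sketch, with simplified field handling (a library copes with empty fields and other talker IDs far better):

```ruby
# Decode a $GPGGA sentence by hand: validate the checksum (XOR of the
# bytes between '$' and '*'), then convert the ddmm.mmm lat/lon fields
# to signed decimal degrees. Simplified sketch, not production-grade.
def parse_gpgga(line)
  body, sum = line.strip.match(/\A\$(.*)\*(\h\h)\z/)&.captures
  return nil unless body && body.bytes.reduce(:^) == sum.to_i(16)
  f = body.split(',', -1)
  return nil unless f[0] == 'GPGGA'
  lat = f[2].to_f
  lon = f[4].to_f
  lat = (lat / 100).floor + (lat % 100) / 60.0 # ddmm.mmm -> degrees
  lon = (lon / 100).floor + (lon % 100) / 60.0
  lat = -lat if f[3] == 'S'
  lon = -lon if f[5] == 'W'
  [lat, lon]
end
```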

Jmeter TCP Sampler

We are running JMeter to connect to a TCP socket through BinaryTCPClientImpl, and we are getting response code 500.
Response message: org.apache.jmeter.protocol.tcp.sampler.ReadException
JMeter version: 2.9
Any help is appreciated.
If this is the error
ERROR - jmeter.protocol.tcp.sampler.TCPSampler: org.apache.jmeter.protocol.tcp.sampler.ReadException:
at org.apache.jmeter.protocol.tcp.sampler.BinaryTCPClientImpl.read(BinaryTCPClientImpl.java:140)
at org.apache.jmeter.protocol.tcp.sampler.TCPSampler.sample(TCPSampler.java:414)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:429)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:257)
at java.lang.Thread.run(Unknown Source)
then you have two options. The first (and much easier, if it applies to you) is to use LengthPrefixedBinaryTCPClientImpl. If your responses always carry a fixed-size length prefix, you can simply set the tcp.binarylength.prefix.length property and go about your business.
If that is not the case, then your other option is to extend org.apache.jmeter.protocol.tcp.sampler.TCPClient. It may help to get in touch with the team that implemented the client for this proprietary protocol, because after all, they have implemented something that works. You'll probably have to extend it to do something like what LengthPrefixedBinaryTCPClientImpl does: read N bytes. This runs the risk of reading too many or too few bytes, though. If your application server ever miscalculates the size of its output, you suffer the consequences: either another timeout, or leaving extra bytes in the buffer and reading them on the next iteration (and then cascading errors).
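For what it's worth, the length-prefix framing that LengthPrefixedBinaryTCPClientImpl relies on is simple to illustrate outside JMeter (Ruby here; the 4-byte big-endian header is an assumption for the sketch): the reader learns exactly how many bytes to consume, so nothing is left in the buffer and nothing is over-read.

```ruby
require 'socket'

# Length-prefixed framing: each message is preceded by a fixed-size
# header carrying its byte length, so the reader never has to guess.
def send_framed(io, payload)
  io.write([payload.bytesize].pack('N')) # 4-byte big-endian length
  io.write(payload)
end

def read_framed(io)
  header = io.read(4) or return nil # nil at EOF
  io.read(header.unpack1('N'))
end
```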

WinInet: timeout management in FTP put

My program puts a file onto a remote host using FTP. For some unavoidable reasons, the remote host needs some time to acknowledge the final packet of the data transmission: more time than the default timeout, which in my experience is around 30 seconds.
Therefore I wanted to increase the timeout to 5 minutes, using this code:
DWORD dwTimeout = 300000; // 5 minutes
pFtpConnection->SetOption(                    // KB176420: this has no effect on some
    INTERNET_OPTION_SEND_TIMEOUT, dwTimeout); // old versions of IE.
pFtpConnection->SetOption(
    INTERNET_OPTION_RECEIVE_TIMEOUT, dwTimeout);
pFtpConnection->SetOption(                    // NB: Docs say these 2 are not implemented.
    INTERNET_OPTION_DATA_SEND_TIMEOUT, dwTimeout);
pFtpConnection->SetOption(                    // Our own tests show that they are!
    INTERNET_OPTION_DATA_RECEIVE_TIMEOUT, dwTimeout);
This is MFC code which boils down to calling
InternetSetOption(hConnection, INTERNET_OPTION_XXX, &dwTimeout, sizeof(dwTimeout))
The problem is that this code apparently fails to modify the timeout on a non-negligible proportion of the computers where the program is used.
How can I reliably set the data connection timeout?
TIA,
Serge Wautier.
It looks like this WinInet issue can't be solved reliably.
I eventually switched from WinInet to Ultimate TCP/IP.
