BigQuery streaming inserts: persistent or new HTTP connection on every insert? - ruby

I am using google-api-ruby-client for streaming data into BigQuery. Whenever there is a request, it is pushed into Redis as a queue, and then a Sidekiq worker tries to insert it into BigQuery. I think this involves opening a new HTTPS connection to BigQuery for every insert.
The way I have it set up is:
Events post every 1 second or when the batch size reaches 1 MB (one megabyte), whichever occurs first. This is per worker, so the BigQuery API may receive tens of HTTP posts per second over multiple HTTPS connections.
This is done using the API client provided by Google.
Now the question: for streaming inserts, which is the better approach?
A persistent HTTPS connection. If so, should it be a global connection shared across all requests, or something else?
Opening a new connection for every insert, as we are doing now with google-api-ruby-client.

I think it's much too early to talk about these optimizations. Other context is also missing, such as whether you have exhausted the kernel's TCP connections, how many connections are in TIME_WAIT state, and so on.
Until the worker pool reaches 1,000 connections per second on the same machine, you should stick with the default mode the library offers.
Otherwise, optimizing anything here would require a lot more context and a deep understanding of how it all works.
On the other hand, you can batch more rows into the same streaming insert request (see the sketch after the limits below); the limits are:
Maximum row size: 1 MB
HTTP request size limit: 10 MB
Maximum rows per second: 100,000 rows per second, per table.
Maximum rows per request: 500
Maximum bytes per second: 100 MB per second, per table
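As an illustration only, here is a minimal sketch of packing queued events into fewer insertAll calls with google-api-ruby-client (the google-api-client gem). The project, dataset, and table names, the events array, and its payload shape are placeholder assumptions, not part of the original question:

require 'google/apis/bigquery_v2'
require 'googleauth'

bigquery = Google::Apis::BigqueryV2::BigqueryService.new
bigquery.authorization = Google::Auth.get_application_default(
  ['https://www.googleapis.com/auth/bigquery.insertdata']
)

# Pack up to 500 rows (the per-request limit) into each streaming insert,
# instead of one HTTPS request per row.
events.each_slice(500) do |batch|
  rows = batch.map do |event|
    Google::Apis::BigqueryV2::InsertAllTableDataRequest::Row.new(
      insert_id: event[:id],   # placeholder dedup key so BigQuery can drop retried rows
      json: event[:payload]    # placeholder hash of column => value
    )
  end
  request = Google::Apis::BigqueryV2::InsertAllTableDataRequest.new(rows: rows)
  response = bigquery.insert_all_table_data('my-project', 'my_dataset', 'events', request)
  warn response.insert_errors.inspect if response.insert_errors && response.insert_errors.any?
end

Keep an eye on the 10 MB request size limit as well when sizing the batches.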
Read my other recommendations
Google BigQuery: Slow streaming inserts performance
I will also try to give some context to better understand the complex situation when ports are exhausted.
Let's say on a machine you have a pool of 30,000 ports and 500 new connections per second (typical):
1 second goes by: you now have 29,500 ports left
10 seconds go by: you now have 25,000
30 seconds go by: you now have 15,000
at 59 seconds you get down to 500,
at 60 seconds the first 500 come back out of TIME_WAIT, so you stay at 29,500 ports in use, and it keeps rolling along at 29,500. Everyone is happy.
Now say that you're seeing an average of 550 connections a second.
Suddenly there aren't any available ports to use.
So, your first option is to bump up the range of allowed local ports; easy enough, but even if you open it up as much as you can and go from 1025 to 65535, that's still only about 64,000 ports; with a 60-second TCP_TIMEWAIT_LEN, you can sustain an average of roughly 1,000 connections a second, and that is still without any persistent connections in use.
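The same back-of-the-envelope arithmetic as a quick sketch (the numbers are the illustrative ones from above, not measurements):

ephemeral_ports  = 64_000   # local port range widened to roughly 1025..65535
tcp_timewait_len = 60       # seconds a closed socket keeps its port in TIME_WAIT

sustainable_rate = ephemeral_ports / tcp_timewait_len
puts "~#{sustainable_rate} new connections per second"   # => ~1066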
This port exhaustion is discussed in more detail here: http://www.gossamer-threads.com/lists/nanog/users/158655

Related

How to get high rps with JMeter load testing https endpoint

I'm trying to test my HTTPS endpoint with JMeter. I want to make at least 10,000 requests per second, but when I set the number of threads to 10,000 I get far fewer rps, around 500.
I've tried setting the number of threads to 1,000 and to 100, and surprisingly I get the same number of rps. I'm using the HTTP Sampler and "Use Keep-Alive" is set to true. When I look at the statistics I see that with 100 threads it makes use of Keep-Alive and connect_time is around 100 ms, but when the number of threads is higher, connect_time grows; it's as if it stops reusing the connections.
I know this isn't a server issue, because I've tried testing the same endpoint with Yandex.Tank and phantom, and it can easily sustain 10,000 requests per second; the problem is that it can't use response data to make further requests, which is why I have to use JMeter for this task.
This can be done by using the "Stepping Thread Group". It will allow you to send 10,000 requests per second for up to a specified time. Refer to the image below.
Stepping Thread Group
Download the jar from the link below.
https://jmeter-plugins.org/wiki/SteppingThreadGroup/
I assume you are trying to achieve this using one machine. Try with multiple machines, or JMeter distributed mode.
https://jmeter.apache.org/usermanual/jmeter_distributed_testing_step_by_step.pdf
https://www.blazemeter.com/blog/how-to-perform-distributed-testing-in-jmeter/
https://blazemeter.com/blog/3-common-issues-when-running-jmeter-scripts-and-how-solve-them/
I am assuming the issue is with the machine, which is not able to generate that much load. Usually I use at most 300 threads per machine, but it depends on the machine's configuration. Just check whether the machine is the bottleneck and whether multiple machines can generate more load, assuming the server itself is not the problem.
Hope this helps.
Update: usually 200-500 threads can be handled by modern machines.
Please check the links below for some more info:
1. How do threads and number of iterations impact test and what is JMeter's max. thread limit
2. https://www.blazemeter.com/blog/what%e2%80%99s-the-max-number-of-users-you-can-test-on-jmeter/

Creation of parallel threads for bulk request handling?

I have a REST service and want to handle almost 100 requests in parallel. I have set the number of threads and the number of connections to create to 100 in my application.yml, but I still did not see 100 connections created to handle requests.
Here is what I did in my application.yml:
server.tomcat.max-threads=100
server.tomcat.max-connections=100
I am using YourKit to see the internals, but when I start, only 10 connections are created to handle requests. Even when I send multiple requests, the count of request-handling threads does not increase; it stays at 10. See the attachment I took from YourKit.
You're setting the maximum number of threads, not the minimum. Tomcat in this case has decided the minimum should be 10.
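A minimal config sketch of what that could look like, assuming a Spring Boot setup; note that the exact property name varies by Spring Boot version (older releases use server.tomcat.min-spare-threads, newer ones server.tomcat.threads.min-spare), so check your version's reference:

server.tomcat.max-threads=100
# ask Tomcat to keep at least this many worker threads alive (property name varies by version)
server.tomcat.min-spare-threads=100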

Pooling Not Reusing INACTIVE Sessions

This is a general question about flow.
Lately we have started getting warnings from .NET about Connection timeout or Connection must be open for this operation.
We are working with an Oracle DB, and we set up a job running every 5 seconds that counts how many connections there are (both ACTIVE and INACTIVE) for the w3wp process (we are querying gv$session).
The max pool size for each WS (we have 2) is 300, meaning 600 connections in total.
We noticed that we do indeed reach the 600 sessions before the crash; however, many of those 600 sessions are INACTIVE.
I would expect those sessions to be reused, since they are INACTIVE at the moment.
In addition, the prev_sql_id of most of these INACTIVE sessions is: SELECT PARAMETER, VALUE FROM SYS.NLS_DATABASE_PARAMETERS WHERE PARAMETER IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET').
Is this normal behavior?
Furthermore, after recycling, the connection count is of course small (around 30), but one minute later it jumps to 200. Again, the majority are INACTIVE sessions.
What is the best way to understand what these sessions are and troubleshoot this?
Thanks!

ElasticSearch gives error about queue size

RemoteTransportException[[Death][inet[/172.18.0.9:9300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1#12ae9af];
Does this mean I'm doing too many operations in one bulk at one time, or too many bulks in a row, or what? Is there a setting I should be increasing or something I should be doing differently?
One thread suggests "I think you need to increase your 'threadpool.bulk.queue_size' (and possibly 'threadpool.index.queue_size') setting due to recent defaults." However, I don't want to arbitrarily increase a setting without understanding the fault.
I lack the reputation to reply to the comment as a comment.
It's not exactly the number of bulk requests made; it is actually the total number of shards that will be updated on a given node by the bulk calls. This means the contents of the actual bulk operations inside the bulk request matter. For instance, if you have a single node with a single index, running on an 8-core box, with 60 shards, and you issue a bulk request with indexing operations that affect all 60 shards, you will get this error message with a single bulk request.
If anyone wants to change this, you can see the splitting happening inside of org.elasticsearch.action.bulk.TransportBulkAction.executeBulk() near the comment "go over all the request and create a ShardId". The individual requests happen a few lines down around line 293 on version 1.2.1.
You want to increase the number of bulk threads available in the thread pool. ES sets aside threads in several named pools for use on various tasks. These pools have a few settings: type, size, and queue size.
From the docs:
The queue_size allows to control the size of the queue of pending requests that have no threads to execute them. By default, it is set to -1 which means it is unbounded. When a request comes in and the queue is full, it will abort the request.
To me that means you have more bulk requests queued up, waiting for a thread from the pool to execute one of them, than your current queue size allows. The documentation seems to indicate that the queue size defaults to both -1 (the quoted text says that) and 50 (the call-out for bulk in the docs says that). You could take a look at the source to be sure for your version of ES, or set the higher number and see if your bulk issues simply go away.
ES thread pool settings doco
Elasticsearch 1.3.4
Our system: 8 cores x 2
4 bulk workers, each inserting 300,000 messages per minute => about 20,000 per second in total
I also hit that exception! I then set this config:
elasticsearch.yml
threadpool.bulk.type: fixed
threadpool.bulk.size: 8 # availableProcessors
threadpool.bulk.queue_size: 500
Source:
// Build one bulk request and send it in a single round trip.
// `documents` stands in for whatever collection the worker has buffered.
BulkRequestBuilder bulkRequest = es.getClient().prepareBulk();
bulkRequest.setReplicationType(ReplicationType.ASYNC)
           .setConsistencyLevel(WriteConsistencyLevel.ONE);
for (String document : documents) {
    bulkRequest.add(es.getClient()
        .prepareIndex(esIndexName, esTypeName)
        .setSource(document.getBytes("UTF-8")));
}
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
On a 4-core machine, I would set bulk.size to 4.
After that, no more errors.
I was having this issue, and my solution ended up being to increase ulimit -Sn and ulimit -Hn for the elasticsearch user. I went from 1024 (the default) to 99999 and things cleaned right up.

Occasional slow requests on Heroku

We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.
This is an example of a request which normally takes ~150ms (and 19 out of 20 times that is how long it takes). You can see that on this request it took about 4 seconds, or between 1 and 2 orders of magnitude longer.
Some things to note:
the database was not the bottleneck, and it spent only 25ms doing db queries
we have more than sufficient dynos, so I don't think this was the bottleneck (20 double dynos running unicorn with 5 workers each; we get only 1,000 requests per minute with an average response time of 150 ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute). In other words, we had 40x the needed capacity on dynos when this measurement was taken.
So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only things I can think of are a noisy-neighbor problem on the boxes, or inconsistent performance in the routing layer. If anyone has additional info or ideas I would be curious. Thank you.
I have been chasing a similar problem myself, with not much luck so far.
I suppose the first order of business would be to recommend New Relic. It may have some more info for you on these cases.
Second, I suggest you look at queue times: how long your request was queued. Look at New Relic for this, or do it yourself with the "start time" HTTP header that Heroku adds to your incoming request (just print now() minus "start time" as your queue time; there is a small sketch at the end of this answer).
When those failed me in my case, I tried coming up with things that could go wrong, and here's an (unorthodox? weird?) list:
1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
2) Log performance -- Heroku collects all your stdout using its "Logplex", which it then drains to your own defined log drains and services such as Papertrail. There is no documentation on the performance of this, and writes to stdout from your process could theoretically block for periods while Heroku is flushing any buffers it might have there.
3) Getting a DB connection -- not sure which framework you are using, but maybe you have a connection pool that you are getting DB connections from, and that took time? It won't show up as query time, it'll be blocking time for your process.
4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and look for correlation between the metrics and times where I saw increased instances of "sporadic slow requests". It didn't help me, but might help you :)
Do let us know what you come up with.
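For point 2, here is a minimal Rack middleware sketch of the "print now() minus start time" idea, assuming the router's timestamp arrives in the X-Request-Start header as milliseconds since the epoch (check your router's actual format, since some stacks prefix it with "t=" or use different units):

require 'logger'

class QueueTimeLogger
  def initialize(app, logger = Logger.new($stdout))
    @app = app
    @logger = logger
  end

  def call(env)
    if (start = env['HTTP_X_REQUEST_START'])
      queued_ms = (Time.now.to_f * 1000).to_i - start.to_i
      # Negative values usually mean clock skew between the router and the dyno.
      @logger.info("queue_time_ms=#{queued_ms}") if queued_ms >= 0
    end
    @app.call(env)
  end
end

# In a Rails app this might be registered with something like:
#   config.middleware.insert_after ActionDispatch::RequestId, QueueTimeLogger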
