As following the limits described here
I'm interested whether batchUpdat including n requests items would treated as a single query or n ones?
And are there any other limits related this call exists? Maybe payload total size or something?
Related
I have a set up where I am using the gremlin-core library to query a remote Janusgraph server. The data size is moderate for now but will increase in the future.
A few days ago, I saw the "Max frame length of 65536 has been exceeded" error on my client. The value for the maxContentLength parameter in my server yaml file was set to default (65536). I dug up the code and realized that I am sending a large array of vertex ids as a query parameter to fetch vertices. I applied a batch to the array with a size of 100 vertex ids per batch and it resolved the issue.
After sometime I started seeing this error again in my client logs. This time around, there was no query with a large number of parameters being sent to the server. I saw a proposed solution on this topic which said that I need to set the maxContentLength parameter on the client-side as well. I did that and the issue got resolved. However, it raised a few questions regarding the configuration parameters, their values and their impact on the query request/response size.
Is the maxContentLength parameter related to the response size of a query? If yes, how do I figure out the value for this parameter with respect to my database size?
Are there any other parameters that dictate the maximum size of the query parameters in the request? If yes, which are they and how do they relate to the size of the query parameters?
Are there any parameters that dictate the size of a query response? If yes, which are they and how do they relate to the size of the query response?
The answers to these questions are crucial for me to make a robust server that will not break under the onslaught of data.
Thanks in advance
Anya
The maxContentLength is the number of bytes a single "message" can contain as a request or a response. It serves the same function as similar settings in web servers to allow filtering of obviously invalid requests. The setting has little to do with database size and more to do with the types of requests you are making and the nature of your results. For requests, I tend to think it atypical for a request to exceed 65k in most situations. Folks who exceed that size are typically trying to do batch loading or are using code generated scripts (the latter is typically problematic, but I won't go into details). For responses, 65k may not be enough depending on the nature of your queries. For example, the query:
g.V().valueMap(true)
will return all vertices in your database as an Iterator<Map> and Gremlin Server will stream those result back in batches controlled by the resultIterationBatchSize (default is 64). So if you have 128 vertices in your database Gremlin Server will stream back two batches of results behind the scenes. If those two batches are each below maxContentLength in size then no problems. If your batches are bigger than that (because you have say, 1000 properties on each vertex) then you either need to
limit the data you return - e.g. return fewer properties
increase maxContentLength
lower the resultIterationBatchSize
Also note that the previous query is very different from something like:
g.V().valueMap(true).fold()
because the fold() will realize all the vertices into a list in memory and then that list must be serialized as a whole. There is only 1 result (i.e. List<Map> with 128 vertices) and thus nothing to batch, so its much more likely that you would exceed the maxContentLength here and lowering the resultIterationBatchSize wouldn't even help. You're only recourse would be to increase maxContentLength or alter the query to allow batching to kick in to hopefully break up that large chunk of data to fit in the maxContentLength.
Setting your maxContentLength to 2mb or larger shouldn't be too big a deal. If you need to go higher for requests, then I'd be curious what the reason was for that. If you need to go much higher for responses, then perhaps I'd take a look at my queries and see if there's a better way to limit the data I'm returning or to see if there's a nicer way to get Gremlin Server streaming to work for me.
Reading the documentation on here I am still not clear of the following points:
Are there any limits on the size of the API request to the Google cloud Storage bucket? (We need to transfer PDFs from a CRM to Google Cloud Storage)
How many files we can send (it mentions a limit of 1000 writes per second) is that the same thing?
Files are the objects you store in Cloud Storage, so the maximum size limit 5 TB should be considered for individual files. There is no limit to write across multiple objects, however buckets support roughly 1000 writes per second and then scale up as needed.
For parallel uploads, please take note that up to 32 objects/files can be composed in a single composition request and a per-project composition rate limit of approximately 1,000 source objects per second.
I also recommend that you take a look at best practices on how to ramp up the request rate
You go with more than 1000 writes per second per bucket - the GCS infrastructure will scale automatically to accommodate it. The only requirement is not to ramp up the load too quickly, namely "double the request rate no faster than every 20 minutes".
https://cloud.google.com/storage/docs/request-rate#ramp-up
i am calculating distances between two places using the google api.
https://maps.googleapis.com/maps/api/distancematrix/json?&origins={origin}&destinations={destination}&key={api_key}
i have a api key, which has usage limit of 2500 requests per day.
i am calculating multiple distances in my .py program.
when the key usage limits exceeds, i get query over limit error.
I want to know how many hits are left in my api.
is there any way of doing it programatically?
I wanted to add this a comment but I'm not allowed to.
See here: https://stackoverflow.com/a/8714638/5914299
But why not check for the day and keep a counter. When reached 2500 stop requesting the API and when an new day arrives reset the counter?
I've got a class in parse with 1-4k records per user. This needs to be replaced from time to time (actually these are records representing multiple timetables).
The problem I'm facing that deleting and inserting these records is a ton of requests. Is there maybe a method to delete and insert a bunch of records, that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data in one record, but then I faced the size limit for records (128 KB). Using any sub format(like a db or file onside a record) would be really tedious, cause the app is targeting nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is facing the req/s limit (or rather, as docs state req/min).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
Well, a possible solution would be also to redesing my datasets and use Array columns or something, but I'd rather avoid it if possible.
I think you could try Parse.Object.saveAll which batch processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use a saveAll/DestroyAll (or DeleteAll?) and anything -All that parse provides in its SDK.
You'd still reach a 1000 objects limit, but to counter that you can loop using the .skip property of a request.
Set a limit of 1000 and skip of 0, do the query, then increase the skip value by the previous limit, and so on. And you'd have 2 or 3 requests of a size of 1000 each time. You stop the loop when your results count is smaller than your limit. If it's not, then you query again and set the skip to the limit x loopcount.
Now you say you're facing size issues, maybe you can reduce that query limit to, say, 400, and your loop would just run for longer until your number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large amount of small records which add up to significant size (up to 500KB JSON [~1.5MB XML] in my current plans).
So I've chosen a middle path - I implemented sort of vertical partitions.
What I have is a master User record which holds array of pointers to other class (called Entries). Entries have only 2 fields - ID of school record and data which is type Array.
I decided to split "partitions" every 1000 records, which is about ~60-70KB per record, but in my calculations may go up to ~100KB.
I also made field names in json 1 letter, cause every letter in 1000 records is like 1 or 2 KB, depending on encoding.
Actually that approach made PHP code like twice as fast and there is a lot less usage on network and remote database (1000 times less inserts/destroys basically).
So, that is my solution, if anybody has any other ideas, please post it as answer here, cause probably I'm not the only one with such problem and that certainly isn't the only solution.
I believe there should be a formula to calculate bulk indexing size in ElasticSearch. Probably followings are the variables of such a formula.
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder If anyone know or use a mathematical formula. If not, how people decide their bulk size? By trial and error?
Read ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, try with 20 KiB, then with 10 KiB, ... dichotomy
Use bulk size in KiB (or equivalent), not document count !
Send data in bulk (no streaming), pass redundant info API url if you can
Remove superfluous whitespace in your data if possible
Disable search index updates, activate it back later
Round-robin across all your data nodes
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5MB, it also allows you to set a flush interval but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
I was searching about it and i found your question :)
i found this in elastic documentation
.. so i will investigate the size of my documents.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size
In my case, I could not get more than 100,000 records to insert at a time. Started with 13 million, down to 500,000 and after no success, started on the other side, 1,000, then 10,000 then 100,000, my max.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the http bulk api (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did, all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, SSD averaging at 40 MB/s and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply spreading the load over a cluster.
Actually, there is no clear way of finding out the exact upper limit for the bulk update. An important factor to consider in the bulk update is request data volume not only the no. of documents
An excerpt from link
How Big Is Too Big?
The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually I'm facing some problems related to bulk API. There is one parameter that impact the bulk api. It's the number of index inside a bulk request.