I have a set up where I am using the gremlin-core library to query a remote Janusgraph server. The data size is moderate for now but will increase in the future.
A few days ago, I saw the "Max frame length of 65536 has been exceeded" error on my client. The value for the maxContentLength parameter in my server yaml file was set to default (65536). I dug up the code and realized that I am sending a large array of vertex ids as a query parameter to fetch vertices. I applied a batch to the array with a size of 100 vertex ids per batch and it resolved the issue.
After sometime I started seeing this error again in my client logs. This time around, there was no query with a large number of parameters being sent to the server. I saw a proposed solution on this topic which said that I need to set the maxContentLength parameter on the client-side as well. I did that and the issue got resolved. However, it raised a few questions regarding the configuration parameters, their values and their impact on the query request/response size.
Is the maxContentLength parameter related to the response size of a query? If yes, how do I figure out the value for this parameter with respect to my database size?
Are there any other parameters that dictate the maximum size of the query parameters in the request? If yes, which are they and how do they relate to the size of the query parameters?
Are there any parameters that dictate the size of a query response? If yes, which are they and how do they relate to the size of the query response?
The answers to these questions are crucial for me to make a robust server that will not break under the onslaught of data.
Thanks in advance
Anya
The maxContentLength is the number of bytes a single "message" can contain as a request or a response. It serves the same function as similar settings in web servers to allow filtering of obviously invalid requests. The setting has little to do with database size and more to do with the types of requests you are making and the nature of your results. For requests, I tend to think it atypical for a request to exceed 65k in most situations. Folks who exceed that size are typically trying to do batch loading or are using code generated scripts (the latter is typically problematic, but I won't go into details). For responses, 65k may not be enough depending on the nature of your queries. For example, the query:
g.V().valueMap(true)
will return all vertices in your database as an Iterator<Map> and Gremlin Server will stream those result back in batches controlled by the resultIterationBatchSize (default is 64). So if you have 128 vertices in your database Gremlin Server will stream back two batches of results behind the scenes. If those two batches are each below maxContentLength in size then no problems. If your batches are bigger than that (because you have say, 1000 properties on each vertex) then you either need to
limit the data you return - e.g. return fewer properties
increase maxContentLength
lower the resultIterationBatchSize
Also note that the previous query is very different from something like:
g.V().valueMap(true).fold()
because the fold() will realize all the vertices into a list in memory and then that list must be serialized as a whole. There is only 1 result (i.e. List<Map> with 128 vertices) and thus nothing to batch, so its much more likely that you would exceed the maxContentLength here and lowering the resultIterationBatchSize wouldn't even help. You're only recourse would be to increase maxContentLength or alter the query to allow batching to kick in to hopefully break up that large chunk of data to fit in the maxContentLength.
Setting your maxContentLength to 2mb or larger shouldn't be too big a deal. If you need to go higher for requests, then I'd be curious what the reason was for that. If you need to go much higher for responses, then perhaps I'd take a look at my queries and see if there's a better way to limit the data I'm returning or to see if there's a nicer way to get Gremlin Server streaming to work for me.
Related
If I were to bypass the limit of 10 results in ElasticSearch by adding a size parameter to my query as described here, could that cause performance problems to my ES cluster?
It will depend on various parameters
Number of request ES is getting every second/milli-second.
Size of individual document.
Out of total number of request, how many are unique. If we are hitting same
query multiple time, then results are returned from cache.
Size of query.
With the increase in number of documents, response size and time will also increase.
This will hamper the performance of application where these results are getting are displayed / delivered. So e.g. UI will go slow to parse all the result and display.
Going for pagination will be future safe as well.
I'm trying to develop a Scala microservice for data management for an Oracle database. I'm using JDBC drivers to connect to it.
Reading the answers to the performance questions regarding JDBC driver compared to the .NET one, I've understood that one of the more effective vehicle to tune the JDBC reading performance is to set the Fetch Size through the method ResultSet.setFetchSize.
I've tried connecting to an Oracle database to fetch real data for a real business case, with a fixed number of record returned by the DB, and I've measured an exponential behavior of the elapsed time. In particular, fetching 10,000 rows from the database without setting the fetch size resulting in a ridicolously large amount of fetch time, but specifying a fetch size larger than 1,000 resulting in a little amount of time gained (roughly 100 ms over 1 s).
Here's my questions regarding this topic:
I suppose that incrementing too much the fetch size would consume resources inopportunely for a little gain, so is there an even rough method to estimate the size of the ResultSet before actually fetching it? I've read about the following technique:
result.last();
result.getRow();
but this would mean scroll the entire ResultSet, and I was wondering if there's any even rough accurate technique to evaluate the count;
I've estimated that a good fetch size would be 1/10th of the number of record selected, but is there a documented rule to try to automatically estimate the correct fetch size for the largest number of cases?
Please do not set fetch size too large, unless you have network bottleneck between application and database. The larger the fetch size, the more memory consumed.
In my experience, 1024 - 2048 will lead to best performance most of the time. See
https://docs.oracle.com/javase/tutorial/jdbc/basics/retrieving.html discussing some details, but the default setting is usually best.
Do not try to get the total numbers of rows in the result set, it is not the best practice.
And finally, I want to point out that based on the hundreds of thousands of time optimize about JVM and jit, the bottleneck seems never happens on fetch size of JDBC after you set it with 1000-2000, but on the SQL performance, applications or resource limit and etc.
In JDBC the default fetch size is 10, but I guess that's not the best fetch size when I have a million rows. I understand that a fetch size too low reduces performance, but also if the fetch size is too high.
How can I find the optimal size? And does this have an impact on the DB side, does it chew up a lot of memory?
If your rows are large then keep in mind that all the rows you fetch at once will have to be stored in the Java heap in the driver's internal buffers. In 12c, Oracle has VARCHAR(32k) columns, if you have 50 of those and they're full, that's 1,600,000 characters per row. Each character is 2 bytes in Java. So each row can take up to 3.2MB. If you're fetching rows 100 by 100 then you'll need 320MB of heap to store the data and that's just for one Statement. So you should only increase the row prefetch size for queries that fetch reasonably small rows (small in data size).
As with (almost) anything, the way to find the optimal size for a particular parameter is to benchmark the workload you're trying to optimize with different values of the parameter. In this case, you'd need to run your code with different fetch size settings, evaluate the results, and pick the optimal setting.
In the vast majority of cases, people pick a fetch size of 100 or 1000 and that turns out to be a reasonably optimal setting. The performance difference among values at that point are generally pretty minimal-- you would expect that most of the performance difference between runs was the result of normal random variation rather than being caused by changes in the fetch size. If you're trying to get the last iota of performance for a particular workload in a particular configuration, you can certainly do that analysis. For most folks, though, 100 or 1000 is good enough.
The default value of JDBC fetch size property is driver specific and for Oracle driver it is 10 indeed.
For some queries fetch size should be larger, for some smaller.
I think a good idea is to set some global fetch size for whole project and overwrite it for some individual queries where it should be bigger.
Look at this article:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
there is description on how to set up fetch size globally and overwrite it for carefully selected queries using different approaches: Hibernate, JPA, Spring jdbc templates or core jdbc API. And some simple benchmark for oracle database.
As a rule of thumb you can:
set fetchsize to 50 - 100 as global setting
set fetchsize to 100 - 500 (or even more) for individual queries
JDBC does have default prefetch size of 10. Check out
OracleConnection.getDefaultRowPrefetch in JDBC Javadoc
tl;dr
How to figure out the optimal fetch size for the select query
Evaluate some maximal amount of memory (bytesInMemory)
4Mb, 8Mb or 16Mb are good starts.
Evaluate the maximal size of each column in the query and sum up
those sizes (bytesPerRow)
...
Use this formula: fetch_size = bytesInMemory / bytesPerRow
You may adjust the formula result to have predictable values.
Last words, test with different bytesInMemory values and/or different queries to appreciate the results in your application.
The above response was inspired by the (as of this writing attic) Apache MetaModel project. They found an answer for this exact question. To do so, they built a class for calculating a fetch size given a maximal memory amount. This class is based on an Oracle whitepaper explaining how Oracle JDBC drivers manage memory.
Basically, the class is constructed with a maximal memory amount (bytesInMemory). Later, it is asked a fetch size for a Query (an Apache Metamodel class). The Query class helps find the number of bytes (bytesPerRow) a typical query results row would have. The fetch size is then calculated with the below formula:
fetch_size = bytesInMemory / bytesPerRow
The fetch size is also adjusted to stay in this range : [1,25000]. Other adjustments are made along during the calculation of bytesPerRow but that's too much details for here.
This class is named FetchSizeCalculator. The link leads to the full source code.
I've got a class in parse with 1-4k records per user. This needs to be replaced from time to time (actually these are records representing multiple timetables).
The problem I'm facing that deleting and inserting these records is a ton of requests. Is there maybe a method to delete and insert a bunch of records, that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data in one record, but then I faced the size limit for records (128 KB). Using any sub format(like a db or file onside a record) would be really tedious, cause the app is targeting nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is facing the req/s limit (or rather, as docs state req/min).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
Well, a possible solution would be also to redesing my datasets and use Array columns or something, but I'd rather avoid it if possible.
I think you could try Parse.Object.saveAll which batch processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use a saveAll/DestroyAll (or DeleteAll?) and anything -All that parse provides in its SDK.
You'd still reach a 1000 objects limit, but to counter that you can loop using the .skip property of a request.
Set a limit of 1000 and skip of 0, do the query, then increase the skip value by the previous limit, and so on. And you'd have 2 or 3 requests of a size of 1000 each time. You stop the loop when your results count is smaller than your limit. If it's not, then you query again and set the skip to the limit x loopcount.
Now you say you're facing size issues, maybe you can reduce that query limit to, say, 400, and your loop would just run for longer until your number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large amount of small records which add up to significant size (up to 500KB JSON [~1.5MB XML] in my current plans).
So I've chosen a middle path - I implemented sort of vertical partitions.
What I have is a master User record which holds array of pointers to other class (called Entries). Entries have only 2 fields - ID of school record and data which is type Array.
I decided to split "partitions" every 1000 records, which is about ~60-70KB per record, but in my calculations may go up to ~100KB.
I also made field names in json 1 letter, cause every letter in 1000 records is like 1 or 2 KB, depending on encoding.
Actually that approach made PHP code like twice as fast and there is a lot less usage on network and remote database (1000 times less inserts/destroys basically).
So, that is my solution, if anybody has any other ideas, please post it as answer here, cause probably I'm not the only one with such problem and that certainly isn't the only solution.
I believe there should be a formula to calculate bulk indexing size in ElasticSearch. Probably followings are the variables of such a formula.
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder If anyone know or use a mathematical formula. If not, how people decide their bulk size? By trial and error?
Read ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, try with 20 KiB, then with 10 KiB, ... dichotomy
Use bulk size in KiB (or equivalent), not document count !
Send data in bulk (no streaming), pass redundant info API url if you can
Remove superfluous whitespace in your data if possible
Disable search index updates, activate it back later
Round-robin across all your data nodes
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5MB, it also allows you to set a flush interval but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
I was searching about it and i found your question :)
i found this in elastic documentation
.. so i will investigate the size of my documents.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size
In my case, I could not get more than 100,000 records to insert at a time. Started with 13 million, down to 500,000 and after no success, started on the other side, 1,000, then 10,000 then 100,000, my max.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the http bulk api (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did, all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, SSD averaging at 40 MB/s and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply spreading the load over a cluster.
Actually, there is no clear way of finding out the exact upper limit for the bulk update. An important factor to consider in the bulk update is request data volume not only the no. of documents
An excerpt from link
How Big Is Too Big?
The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually I'm facing some problems related to bulk API. There is one parameter that impact the bulk api. It's the number of index inside a bulk request.