HBase response size - memory-management

I have a bunch of rows in HBase that store data of varying sizes (0.5 MB to 120 MB). When the scanner cache is set to, say, 100, the response sometimes gets too large and the region server dies. I tried but couldn't find a solution. Can someone help me figure out:
What is the maximum response size that HBase supports?
Is there a way to limit the response size on the server so that the result is capped at a particular value (the answer to the first question) and returned as soon as that limit is reached?
What happens if a single record exceeds this limit? There should be a way to increase it but I don't know how.

1. What is the maximum response size that HBase supports?
It is Long.MAX_VALUE, represented by the constant DEFAULT_HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE:
public static long DEFAULT_HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE = Long.MAX_VALUE;
2. Is there a way to limit the response size at the server so that the result is capped at a particular value and returned as soon as that limit is reached?
You can use the property hbase.client.scanner.max.result.size to handle this. It lets you cap what a scanner returns in one go by size rather than by row count: it is the maximum number of bytes returned when calling a scanner's next method.
3. What happens if a single record exceeds this limit? There should be a way to increase it but I don't know how.
The complete record (row) will still be returned even if it exceeds the limit.
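For illustration, here is a minimal Java client sketch of applying that cap; the 2 MB value is purely illustrative, and the per-scan setter is an alternative to the global property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;

Configuration conf = HBaseConfiguration.create();
// cap the bytes returned per scanner RPC (illustrative value)
conf.setLong("hbase.client.scanner.max.result.size", 2L * 1024 * 1024);

// the same cap can also be set per scan instead of globally
Scan scan = new Scan();
scan.setCaching(100);                      // rows per RPC, as in the question
scan.setMaxResultSize(2L * 1024 * 1024);   // bytes per RPC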

Related

java.lang.OutOfMemoryError: Direct buffer memory during Druid ingestion task

I'm ingesting data into Druid using Kafka ingestion task.
The test data is 1 message/second. Each message has 100 numeric columns and 100 string columns. Number values are random. String values are taken from a pool of 10k random 20 char strings. I have sum, min and max aggregations for each numeric column.
Config is the following:
Segment granularity: 15 mins.
Intermediate persist period: 2 mins.
druid.processing.buffer.sizeBytes=26214400
druid.processing.numMergeBuffers=2
druid.processing.numThreads=1
The Druid docs say that sane max direct memory size is
(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) *
druid.processing.buffer.sizeBytes
where "The + 1 factor is a fuzzy estimate meant to account for the segment decompression buffers and dictionary merging buffers."
According to the formula I need 100 MB of direct memory, but I get java.lang.OutOfMemoryError: Direct buffer memory even when I set the max direct memory to 250 MB. The error is not consistent: sometimes I get it, sometimes I don't.
My target is to calculate max direct memory before I start the task and to not get the error during task execution. My guess is that I need to calculate this "+1 factor" precisely. How can I do this?
In my experience, that formula has been pretty good, with the caveat that a MB here is 1024 KB, not 1000 KB. But I am quite surprised it still gave you the error with 250 MB. How are you setting the direct memory size? And are you using a MiddleManager with Peons? The peons do the actual work, so you have to set the max direct memory on the peons, not on the MiddleManager. You do that with the following parameter in the MiddleManager's runtime.properties. This is what I have on mine:
druid.indexer.runner.javaOptsArray=["-server", "-Xms200m", "-Xmx200m", "-XX:MaxDirectMemorySize=220m", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager", "-XX:+ExitOnOutOfMemoryError", "-XX:+HeapDumpOnOutOfMemoryError", "-XX:HeapDumpPath=/var/log/druid/task/"]
You also have to set the other processing properties this way:
druid.indexer.fork.property.druid.processing.buffer.sizeBytes
druid.indexer.fork.property.druid.processing.numMergeBuffers
druid.indexer.fork.property.druid.processing.numThreads
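For example, carrying the values from the question over into the MiddleManager's runtime.properties would look like this (a sketch only, not a tuning recommendation):

druid.indexer.fork.property.druid.processing.buffer.sizeBytes=26214400
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.numThreads=1

With those values the documented formula gives (1 + 2 + 1) * 26214400 = 104857600 bytes, i.e. roughly 100 MiB as the lower bound that -XX:MaxDirectMemorySize on the peon must cover.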

Max frame length of 65536 has been exceeded

I have a set up where I am using the gremlin-core library to query a remote Janusgraph server. The data size is moderate for now but will increase in the future.
A few days ago, I saw the "Max frame length of 65536 has been exceeded" error on my client. The maxContentLength parameter in my server YAML file was set to the default (65536). I dug into the code and realized that I was sending a large array of vertex ids as a query parameter to fetch vertices. I batched the array into chunks of 100 vertex ids per request and that resolved the issue.
After some time I started seeing this error again in my client logs. This time around, there was no query with a large number of parameters being sent to the server. I saw a proposed solution on this topic which said that I need to set the maxContentLength parameter on the client side as well. I did that and the issue got resolved. However, it raised a few questions about the configuration parameters, their values, and their impact on the query request/response size.
Is the maxContentLength parameter related to the response size of a query? If yes, how do I figure out the value for this parameter with respect to my database size?
Are there any other parameters that dictate the maximum size of the query parameters in the request? If yes, which are they and how do they relate to the size of the query parameters?
Are there any parameters that dictate the size of a query response? If yes, which are they and how do they relate to the size of the query response?
The answers to these questions are crucial for me to make a robust server that will not break under the onslaught of data.
Thanks in advance
Anya
The maxContentLength is the number of bytes a single "message" can contain as a request or a response. It serves the same function as similar settings in web servers: it allows filtering of obviously invalid requests. The setting has little to do with database size and more to do with the types of requests you are making and the nature of your results. For requests, I tend to think it atypical for a request to exceed 65k in most situations. Folks who exceed that size are typically trying to do batch loading or are using code-generated scripts (the latter is typically problematic, but I won't go into details). For responses, 65k may not be enough depending on the nature of your queries. For example, the query:
g.V().valueMap(true)
will return all vertices in your database as an Iterator<Map>, and Gremlin Server will stream those results back in batches controlled by resultIterationBatchSize (the default is 64). So if you have 128 vertices in your database, Gremlin Server will stream back two batches of results behind the scenes. If those two batches are each below maxContentLength in size then there are no problems. If your batches are bigger than that (because you have, say, 1000 properties on each vertex) then you need to do one of the following:
limit the data you return - e.g. return fewer properties
increase maxContentLength
lower the resultIterationBatchSize
Also note that the previous query is very different from something like:
g.V().valueMap(true).fold()
because the fold() will realize all the vertices into a list in memory, and then that list must be serialized as a whole. There is only 1 result (i.e. a List<Map> with 128 vertices) and thus nothing to batch, so it's much more likely that you would exceed maxContentLength here, and lowering resultIterationBatchSize wouldn't even help. Your only recourse would be to increase maxContentLength or alter the query to allow batching to kick in and hopefully break up that large chunk of data to fit within maxContentLength.
Setting your maxContentLength to 2mb or larger shouldn't be too big a deal. If you need to go higher for requests, then I'd be curious what the reason was for that. If you need to go much higher for responses, then perhaps I'd take a look at my queries and see if there's a better way to limit the data I'm returning or to see if there's a nicer way to get Gremlin Server streaming to work for me.
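For reference, a minimal sketch of raising the limit on the Java driver side to match the server; the host name and the 2 MB value are illustrative, and the same number should also be set as maxContentLength in the server's gremlin-server.yaml so both sides agree:

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

Cluster cluster = Cluster.build("janusgraph-host")      // illustrative host name
        .port(8182)
        .maxContentLength(2 * 1024 * 1024)              // must cover the largest single request/response frame
        .create();
Client client = cluster.connect();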

Max scrollable time for elasticsearch

What is the max scrollable time that can be set for a scrolling search?
Documentation:
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#api-scroll
If you're asking this kind of question, you're probably not using Scroll in ES the way it was intended. You want to use scroll when you know for sure that you need to return ALL matching records.
Great use case for Scroll
I want to pull back 1,000,000 records from ES to be written to a CSV file. This is a perfect use case for scroll. You need to return 1M rows, but you don't want to return them all as 1 chunk from the database. Instead you can chunk them into ~1000 record chunks, write the chunk to the CSV file, then get the next chunk. Your scroll keep alive can be set to 1 minute and you'll have no problems.
Bad use case for Scroll
A user is viewing the first 50 records and at some time in the future, they may or may not want to view the next 50 records.
For a use case like this you want to use the Search After API
There is no one-size-fits-all value for the max scroll time.
Scan & Scroll is meant to scan through a large number of records in chunks. The max value for each chunk has to be found by incremental increases until you hit the breaking point, as it depends on your cluster resources, network latency, and cluster load.
We had a 3-node test setup with about 1 billion records and 1 TB of data. I was able to scroll through the entire index with a scroll size of 5000 and a 5m timeout. However, there were lots of timeouts with those values. From our analysis, we observed that scroll timeouts were heavily dependent on cluster load and network latency. So we finally settled on a size of 3500 and a 4m timeout.
So I would recommend the following:
Incrementally increase the size and timeout values to find the max values for your network.
Once you have the max values, reduce them a notch to accommodate failures due to cluster load and latency.
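As an illustration, here is a hedged Java sketch of that pattern using the Elasticsearch high-level REST client, with the 3500 size and 4m keep-alive mentioned above; the index name and the client setup are assumptions:

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

static void scrollEverything(RestHighLevelClient client) throws java.io.IOException {
    SearchRequest search = new SearchRequest("my_index");   // illustrative index name
    search.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()).size(3500));
    search.scroll("4m");                                    // keep-alive per chunk, not for the whole scan
    SearchResponse response = client.search(search, RequestOptions.DEFAULT);
    String scrollId = response.getScrollId();
    while (response.getHits().getHits().length > 0) {
        // process the current chunk here, then request the next one under a fresh 4m keep-alive
        SearchScrollRequest next = new SearchScrollRequest(scrollId);
        next.scroll("4m");
        response = client.scroll(next, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
    }
    ClearScrollRequest clear = new ClearScrollRequest();
    clear.addScrollId(scrollId);
    client.clearScroll(clear, RequestOptions.DEFAULT);      // release the server-side scroll context
}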

How to figure out the optimal fetch size for the select query

In JDBC the default fetch size is 10, but I guess that's not the best fetch size when I have a million rows. I understand that too low a fetch size reduces performance, but so does one that is too high.
How can I find the optimal size? And does this have an impact on the DB side; does it chew up a lot of memory?
If your rows are large then keep in mind that all the rows you fetch at once will have to be stored in the Java heap in the driver's internal buffers. In 12c, Oracle has VARCHAR(32k) columns, if you have 50 of those and they're full, that's 1,600,000 characters per row. Each character is 2 bytes in Java. So each row can take up to 3.2MB. If you're fetching rows 100 by 100 then you'll need 320MB of heap to store the data and that's just for one Statement. So you should only increase the row prefetch size for queries that fetch reasonably small rows (small in data size).
As with (almost) anything, the way to find the optimal size for a particular parameter is to benchmark the workload you're trying to optimize with different values of the parameter. In this case, you'd need to run your code with different fetch size settings, evaluate the results, and pick the optimal setting.
In the vast majority of cases, people pick a fetch size of 100 or 1000 and that turns out to be a reasonably optimal setting. The performance differences among values in that range are generally pretty minimal; you would expect that most of the performance difference between runs was the result of normal random variation rather than being caused by changes in the fetch size. If you're trying to get the last iota of performance for a particular workload in a particular configuration, you can certainly do that analysis. For most folks, though, 100 or 1000 is good enough.
The default value of JDBC fetch size property is driver specific and for Oracle driver it is 10 indeed.
For some queries fetch size should be larger, for some smaller.
I think a good idea is to set a global fetch size for the whole project and override it for individual queries where it should be bigger.
Look at this article:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
It describes how to set the fetch size globally and override it for carefully selected queries using different approaches: Hibernate, JPA, Spring JDBC templates, or the core JDBC API, along with a simple benchmark for an Oracle database.
As a rule of thumb you can:
set fetchsize to 50 - 100 as global setting
set fetchsize to 100 - 500 (or even more) for individual queries
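For illustration, here is what overriding the default for a single query looks like in plain JDBC; the query text and the value of 500 are placeholders:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

static void streamLargeResult(Connection conn) throws SQLException {
    // hypothetical query; the relevant part is the setFetchSize call
    try (PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM customers")) {
        ps.setFetchSize(500);                // rows fetched per round trip instead of the driver default
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process rs.getLong("id") and rs.getString("name") here
            }
        }
    }
}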
The Oracle JDBC driver does have a default prefetch size of 10. Check out OracleConnection.getDefaultRowPrefetch in the Oracle JDBC Javadoc.
tl;dr
How to figure out the optimal fetch size for the select query
Evaluate some maximal amount of memory (bytesInMemory); 4 MB, 8 MB or 16 MB are good starting points.
Evaluate the maximal size of each column in the query and sum up those sizes (bytesPerRow).
...
Use this formula: fetch_size = bytesInMemory / bytesPerRow
You may adjust the formula result to have predictable values.
Lastly, test with different bytesInMemory values and/or different queries to see how the results play out in your application.
The above response was inspired by the Apache MetaModel project (in the Apache Attic as of this writing). They found an answer to this exact question. To do so, they built a class for calculating a fetch size given a maximal memory amount. The class is based on an Oracle whitepaper explaining how Oracle JDBC drivers manage memory.
Basically, the class is constructed with a maximal memory amount (bytesInMemory). Later, it is asked a fetch size for a Query (an Apache Metamodel class). The Query class helps find the number of bytes (bytesPerRow) a typical query results row would have. The fetch size is then calculated with the below formula:
fetch_size = bytesInMemory / bytesPerRow
The fetch size is also adjusted to stay within the range [1, 25000]. Other adjustments are made during the calculation of bytesPerRow, but that's too much detail for here.
This class is named FetchSizeCalculator. The link leads to the full source code.
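A minimal sketch of that calculation (not the actual MetaModel class, just the formula with the clamp described above):

// bytesInMemory: memory budget for one fetch; bytesPerRow: estimated maximum row size
static int calculateFetchSize(long bytesInMemory, long bytesPerRow) {
    long fetchSize = bytesInMemory / Math.max(1L, bytesPerRow);
    // keep the result in the [1, 25000] range mentioned above
    return (int) Math.max(1L, Math.min(25_000L, fetchSize));
}

For example, an 8 MB budget with rows estimated at 2 KB each yields a fetch size of 4096.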

Parse, replacing large (several thousands) number of records

I've got a class in Parse with 1-4k records per user. These need to be replaced from time to time (they are actually records representing multiple timetables).
The problem I'm facing is that deleting and inserting these records takes a ton of requests. Is there maybe a method to delete and insert a bunch of records that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data into one record, but then I hit the size limit for records (128 KB). Using any sub-format (like a DB or a file inside a record) would be really tedious, because the app targets nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is the requests/second limit (or rather, as the docs state, requests/minute).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
Well, a possible solution would also be to redesign my datasets and use Array columns or something, but I'd rather avoid that if possible.
I think you could try Parse.Object.saveAll which batch processes the save() function.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use saveAll/destroyAll (or deleteAll?) and any of the -All methods that Parse provides in its SDK.
You'd still hit the 1000-object limit per query, but to counter that you can loop using the .skip property of a request.
Set a limit of 1000 and a skip of 0, do the query, then increase the skip value by the previous limit, and so on. You'd have 2 or 3 requests of a size of 1000 each time. You stop the loop when your result count is smaller than your limit; if it's not, you query again and set the skip to the limit times the loop count.
Now you say you're facing size issues, maybe you can reduce that query limit to, say, 400, and your loop would just run for longer until your number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
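A rough sketch of that loop with the Parse Android SDK, assuming a batched-update scenario; the class name "Entries" and the field being updated are illustrative, and error handling is omitted:

import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import java.util.List;

static void replaceInBatches() throws ParseException {
    int limit = 1000;
    int skip = 0;
    List<ParseObject> page;
    do {
        ParseQuery<ParseObject> query = ParseQuery.getQuery("Entries");  // illustrative class name
        query.setLimit(limit);
        query.setSkip(skip);
        page = query.find();                        // one request for up to 1000 objects
        for (ParseObject o : page) {
            o.put("d", "new-timetable-data");       // illustrative modification
        }
        ParseObject.saveAll(page);                  // one batched request instead of up to 1000 saves
        skip += limit;
    } while (page.size() == limit);
}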
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large number of small records which add up to a significant size (up to 500 KB of JSON [~1.5 MB of XML] in my current plans).
So I've chosen a middle path: I implemented a sort of vertical partitioning.
What I have is a master User record which holds an array of pointers to another class (called Entries). Entries has only 2 fields: the ID of the school record and a data field of type Array.
I decided to split partitions every 1000 records, which comes to about 60-70 KB per record, though by my calculations it may go up to ~100 KB.
I also made the JSON field names 1 letter long, because every letter repeated across 1000 records adds roughly 1-2 KB, depending on encoding.
That approach actually made the PHP code about twice as fast, and there is far less load on the network and the remote database (basically 1000 times fewer inserts/destroys).
So, that is my solution. If anybody has any other ideas, please post them as an answer here, because I'm probably not the only one with this problem and this certainly isn't the only solution.
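For concreteness, a rough sketch of one such partition with the Parse Android SDK; the 1-letter field names follow the convention described above, while the exact names and the User array field are illustrative:

import com.parse.ParseObject;
import com.parse.ParseUser;
import java.util.Arrays;
import java.util.List;

List<Object> chunk = Arrays.asList(/* up to ~1000 compact timetable entries */);
ParseObject entry = new ParseObject("Entries");
entry.put("s", "school-123");        // "s": school record ID (illustrative value)
entry.put("d", chunk);               // "d": Array column holding this partition's records
entry.saveInBackground();

ParseUser user = ParseUser.getCurrentUser();
user.add("timetables", entry);       // hypothetical array-of-pointers field on the master User record
user.saveInBackground();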

Resources