Column Qualifier value max size - hadoop

I am thinking about using our HBase as a content management store. I have somewhat large XML documents (5 MB+) which I would like to store. Is there a limit to the number of bytes I can store in a single column qualifier?

The default value is 10 MB, but you can change it through the hbase.client.keyvalue.maxsize property in hbase-site.xml. If you expect your data to be very large, you can keep the data in HDFS and store pointers to it in HBase.
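For illustration, here is a minimal Java sketch (not from the original answer) of raising the client-side limit programmatically and writing a large cell. The table name "content", the column family/qualifier, the row key, and the 20 MB figure are all assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeCellWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Raise the client-side cell size limit to 20 MB (default is 10 MB).
        conf.set("hbase.client.keyvalue.maxsize", String.valueOf(20 * 1024 * 1024));

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("content"))) {
            byte[] xmlBytes = Bytes.toBytes("<document>...</document>"); // placeholder payload
            Put put = new Put(Bytes.toBytes("doc-0001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("xml"), xmlBytes);
            table.put(put);
        }
    }
}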

Related

Is there a limit on the number of elements in an array column and/or the total size in Athena / Presto?

I've been looking at the Athena and PrestoDB documentation and can't find any reference to a limit on the number of elements in an array column and/or the maximum total size. Files will be in Parquet format, but that's negotiable if Parquet is the limiting factor.
Is this known?
MORE CONTEXT:
I'm going to be pushing data into a Firehose that will emit Parquet files to S3, which I plan to query using Athena. The data is a one-to-many mapping of S3 URIs to a set of IDs, e.g.
s3://bucket/key_one, 123
s3://bucket/key_one, 456
....
s3://bucket/key_two, 321
s3://bucket/key_two, 654
...
Alternately, I could store in this form:
s3://bucket/key_one, [123, 456, ...]
s3://bucket/key_two, [321, 654, ...]
Since Parquet is compressed I'm not concerned with the size of the files on S3. The repeated URIs should get taken care of by the compression.
What's of more concern is the number of calls I need to make to Firehose in order to insert records. In the first case there's one record per (object, ID) tuple, of which there are approximately 6000 per object. There's a "batch" call, but it's limited to 500 records per batch, so I would end up making multiple calls. This code will execute in a Lambda function, so I'm trying to save execution time wherever possible.
There should not be any explicit limit on the Presto/Athena side for the number of elements in an array column type. Eventually it comes down to JVM limitations, which are huge. Just make sure that you have enough node memory available to handle these fields. It would also be worth reviewing your use case and avoiding very large column values (of array type).
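As a rough sketch of the batching concern in the question (not from either post): if each (object, ID) tuple is its own record, the records have to be chunked into groups of at most 500 before each Firehose batch call. The String record format and the sendBatch placeholder below are assumptions; the real call would go through the AWS SDK's PutRecordBatch operation.

import java.util.ArrayList;
import java.util.List;

public class FirehoseBatcher {
    private static final int MAX_BATCH_SIZE = 500; // Firehose batch-call limit from the question

    // Split the records into chunks of at most 500 and hand each chunk to the sender.
    public static void sendInBatches(List<String> records) {
        for (int start = 0; start < records.size(); start += MAX_BATCH_SIZE) {
            int end = Math.min(start + MAX_BATCH_SIZE, records.size());
            sendBatch(new ArrayList<>(records.subList(start, end)));
        }
    }

    // Placeholder: a real Lambda would build SDK Record objects here and invoke
    // the Firehose PutRecordBatch operation for the delivery stream.
    private static void sendBatch(List<String> batch) {
        System.out.println("Would send a batch of " + batch.size() + " records");
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 6000; i++) {
            records.add("s3://bucket/key_one," + i); // roughly the (object, ID) tuples described
        }
        sendInBatches(records); // 6000 records -> 12 batch calls
    }
}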

How to coalesce large partitioned data into a single directory in Spark/Hive

I have a requirement:
Huge data is partitioned and inserted into Hive. To bind this data, I am using DF.coalesce(10). Now I want to bind this partitioned data into a single directory. If I use DF.coalesce(1), will the performance decrease, or is there another way to do so?
From what I understand, you are trying to ensure that there are fewer files per partition. So, by using coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL"), where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive: df.repartition($"COL")
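A minimal Java sketch of that suggestion, assuming the Hive partition column is named COL and that the source and target table names are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class RepartitionByHivePartition {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-by-partition-column")
                .enableHiveSupport()
                .getOrCreate();

        // Placeholder source table; replace with the real input.
        Dataset<Row> df = spark.table("source_table");

        // Repartitioning by the Hive partition column sends all rows for a given
        // COL value to one task, so each partition directory gets a single file.
        df.repartition(col("COL"))
          .write()
          .mode(SaveMode.Overwrite)
          .partitionBy("COL")
          .saveAsTable("target_table");

        spark.stop();
    }
}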

Is there a maximum size for the string data type in Hive?

I've Googled a ton but haven't found it anywhere. Or does that mean Hive can support an arbitrarily large string data type as long as the cluster allows it? If so, where can I find the largest size of the string data type that my cluster can support?
Thanks in advance!
The current documentation for Hive lists STRING as a valid datatype, distinct from VARCHAR and CHAR. See the official Apache doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Strings
It wasn't immediately apparent to me that STRING was indeed its own type, but if you scroll down you'll see several cases where it's used distinctly from the others.
While perhaps not authoritative, this page indicates the max length of a STRING is 2GB. http://www.folkstalk.com/2011/11/data-types-in-hive.html
By default, the columns metadata for Hive does not specify a maximum data length for STRING columns.
The driver has the parameter DefaultStringColumnLength, whose default value is 255.
A connection string with this parameter set to its maximum value would look like this: jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;
(https://github.com/exasol/virtual-schemas/issues/118)
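A minimal JDBC sketch using that connection string, assuming the Hive JDBC driver is on the classpath; the host/port, credentials, and the queried table and column are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveStringColumnLength {
    public static void main(String[] args) throws Exception {
        // Connection string from the answer above, with DefaultStringColumnLength at its maximum.
        String url = "jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT long_text_col FROM some_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // print the STRING values
            }
        }
    }
}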
"In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant." from
https://learning.oreilly.com/library/view/programming-hive/9781449326944/ch03.html#Collection-Data-Types

How does HBase enable Random Access to HDFS?

Given that HBase is a database with its files stored in HDFS, how does it enable random access to a singular piece of data within HDFS? By which method is this accomplished?
From the Apache HBase Reference Guide:
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.
Scanning both chapters didn't reveal a high-level answer for this question.
So how does HBase enable random access to files stored in HDFS?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example: a table may contain 10 TB of data. But, the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, version. If you know what the starting key is for each block, you can determine one file to access, and what the byte-offset (block) is to start reading to see where you are in the binary search.
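For illustration, a minimal Java client sketch of such a random read (the table, column family, qualifier, and row key are placeholders); the region lookup and block-index seek described above all happen behind this single Get call.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            // The client locates the region holding this key (via hbase:meta) and the
            // region server uses its HFile indexes to seek the matching block and row.
            Get get = new Get(Bytes.toBytes("some-row-key"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}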
HBase accesses HDFS files by using HFiles. You can check this URL for the details: http://hbase.apache.org/book/hfilev2.html

HBase bulk load usage

I am trying to import some HDFS data to an already existing HBase table.
The table I have was created with 2 column families, and with all the default settings that HBase comes with when creating a new table.
The table is already filled up with a large volume of data, and it has 98 online regions.
The type of row keys it has are of the form (simplified version):
2-CHARS_ID + 6-DIGIT-NUMBER + 3 X 32-CHAR-MD5-HASH.
Example of key: IP281113ec46d86301568200d510f47095d6c99db18630b0a23ea873988b0fb12597e05cc6b30c479dfb9e9d627ccfc4c5dd5fef.
The data I want to import is on HDFS, and I am using a Map-Reduce process to read it. I emit Put objects from my mapper, which correspond to each line read from the HDFS files.
The existing data has keys which will all start with "XX181113".
The job is configured with:
HFileOutputFormat.configureIncrementalLoad(job, hTable)
Once I start the process, I see it configured with 98 reducers (equal to the online regions the table has), but the issue is that 4 reducers got 100% of the data split among them, while the rest did nothing.
As a result, I see only 4 folder outputs, which have a very large size.
Are these files corresponding to 4 new regions which I can then import to the table? And if so, why only 4, while 98 reducers get created?
Reading HBase docs
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
confused me even more as to why I get this behaviour.
Thanks!
The number of reducers that actually receive data doesn't depend on the number of regions you have in the table, but rather on how the data is split across regions (each region contains a range of keys). Since you mention that all your new data starts with the same prefix, it is likely that it only fits into a few regions.
You can pre-split your table so that the new data is divided among more regions.
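A minimal Java sketch of pre-splitting, assuming the table were being created fresh; the table name, column families, and split points (chosen inside the shared "XX181113" key prefix, before the MD5 hex portion) are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("presplit_table"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf2"))
                    .build();

            // Split points inside the key range of the incoming data, so the
            // bulk-load output (and its reducers) spreads across several regions.
            byte[][] splitKeys = new byte[][] {
                    Bytes.toBytes("XX1811134"),
                    Bytes.toBytes("XX1811138"),
                    Bytes.toBytes("XX181113c")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}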
