maxCombinedSplitSize property in hive? - hadoop

There is a property in Pig named 'pig.maxCombinedSplitSize', which specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached.
Is there a similar property in hive for specifying the size of data to be processed by a single map?
I am trying the command below, but it doesn't work:
SET hive.maxCombinedSplitSize=64mb;
Any suggestions?

Try this:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=67108864;
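If the mapred.* names have no effect on a newer Hadoop/Hive build, the newer property names are worth trying as well. A minimal sketch for a 64 MB target, assuming Hadoop 2.x style names (the value must be given in bytes, so 64 MB = 67108864):
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- lower and upper bounds, in bytes, on the size of each combined split
set mapreduce.input.fileinputformat.split.minsize=67108864;
set mapreduce.input.fileinputformat.split.maxsize=67108864;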

Related

hive compaction using insert overwrite partition

I am trying to address the small-files problem by compacting the files under Hive partitions with an INSERT OVERWRITE PARTITION command in Hadoop.
Query :
SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1,col2,col3 from tbl1
WHERE year=2016 and month=03 and day=11;
Input Files:
For testing purposes I have three files under the Hive partition (2016/03/11) in HDFS, 40 MB each.
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
For example, my block size is 128 MB, so I would like to create only one output file. But I am getting 3 different compressed files.
Please help me find the Hive configuration to control the size (and number) of the output files. If I am not using compression, I get a single file.
Hive Version : 1.1
It's interesting that you are still getting 3 files when you specify the partition and use compression, so you may want to look into dynamic partitioning, or drop the partitioning and focus on the number of mappers and reducers being created by your job. If your files are small, I can see how you would want them all in one file on your target, but then I would also question the need for compression on them.
The number of files created in your target is directly tied to the number of reducers or mappers. If the SQL you write needs a reduce phase, the number of files created will be the same as the number of reducers used in the job, which can be controlled by setting the number of reducers:
set mapred.reduce.tasks = 1;
In your example SQL there most likely wouldn't be any reducers used, so the number of files in the target is equal to the number of mappers used, which is equal to the number of files in the source. It isn't as easy to control the number of output files on a map-only job, but there are a number of configuration settings that can be tried.
Set this to combine small input files so that fewer mappers are spawned; the default is false.
set hive.hadoop.supports.splittable.combineinputformat = true;
Try setting a threshold in bytes for the input files; anything under this threshold will be considered for conversion to a map join, which can affect the number of output files.
set hive.mapjoin.smalltable.filesize = 25000000;
As for compression, I would play with changing the codec being used just to see if that makes any difference in your output.
set hive.exec.orc.default.compress = ZLIB, SNAPPY, etc...
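Since you only see three files when compression is enabled, it looks like the small-file merge step is being skipped for the compressed, map-only job. One workaround (my suggestion, not something from your original query) is to force a reduce stage with DISTRIBUTE BY so that a single reducer writes a single file; a sketch using the table and partition from the question:
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- one reducer writes one file per partition; DISTRIBUTE BY forces the reduce stage
set mapred.reduce.tasks=1;
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11
DISTRIBUTE BY year;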

Is there a maximum size for the string data type in Hive?

I Googled a ton but haven't found it anywhere. Or does that mean Hive can support an arbitrarily large string data type as long as the cluster allows it? If so, where can I find the largest size of the string data type that my cluster can support?
Thanks in advance!
The current documentation for Hive lists STRING as a valid datatype, distinct from VARCHAR and CHAR. See the official Apache doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Strings
It wasn't immediately apparent to me that STRING was indeed its own type, but if you scroll down you'll see several cases where it's used distinctly from the others.
While perhaps not authoritative, this page indicates the max length of a STRING is 2GB. http://www.folkstalk.com/2011/11/data-types-in-hive.html
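To make the distinction concrete, here's a small sketch; the table and column names are made up, and the point is only that STRING carries no declared length while VARCHAR and CHAR do:
-- free_text: no maximum length declared in the table metadata
-- bounded:   values longer than 100 characters are truncated on write
-- fixed:     padded or truncated to exactly 10 characters
CREATE TABLE string_demo (
  free_text STRING,
  bounded   VARCHAR(100),
  fixed     CHAR(10)
);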
By default, the column metadata for Hive does not specify a maximum data length for STRING columns.
The driver has a DefaultStringColumnLength parameter; its default value is 255, and the maximum is 32767.
A connection string with this parameter set to maximum size would look like this: jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;
(https://github.com/exasol/virtual-schemas/issues/118)
"In the “looser” world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant." from
https://learning.oreilly.com/library/view/programming-hive/9781449326944/ch03.html#Collection-Data-Types

file larger than field limit with Hcatalog

I'm working in standalone mode (our cluster is not configured yet). I'm trying to create a new table from a file with HCatalog, but I get the following error.
field larger than field limit (131072)
This value seems to be the value of io.file.buffer.size, which is configured to 131072. Am I right? But the description of this option is 'Size of read/write buffer used in SequenceFiles', and my file is a plain text file, so I'm not sure this is the right property to change.
Any idea?
I guess it's one of the following:
The field delimiter set in your Hive CREATE statement is not the right one, so a field read into the buffer exceeded the maximum allowed length (see the sketch after this answer).
The field delimiter is set right, but some field is genuinely very long or is missing its delimiter. If that's the case, you need to somehow pre-process the file to make sure it doesn't contain such records.
A similar question and answer here
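For the first case, the fix is simply to declare the delimiter that the file actually uses in the table definition. A minimal sketch, assuming a tab-delimited text file; the table name, columns, and path are placeholders:
CREATE EXTERNAL TABLE my_table (
  id   INT,
  body STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/input';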

Control the split size with Avro Input Format in Hadoop

I need to read Avro records serialized in Avro files in HDFS. To do that, I use AvroKeyInputFormat, so my mapper is able to work with the records it reads as keys.
My question is, how can I control the split size? With the text input format, you define the size in bytes. Here I would need to define how many records each split will consist of.
I would like to handle every file in my input directory as one big file. Do I have to use CombineFileInputFormat? Is it possible to use it with Avro?
Splits honor logical record boundaries, and the min and max split sizes are given in bytes; the text input format, for example, won't break a line across splits even though the split boundaries are defined in bytes.
To have each file in its own split, you can either set the max split size to Long.MAX_VALUE, or you can override the isSplitable method in your code and return false.

Column Qualifier value max size

I am thinking about using our HBase as a content management store. I have somewhat large XML documents (5 MB+) which I would like to store. Is there a limit to the number of bytes I can store in a single column qualifier?
The default limit is 10 MB, but you can change it through the hbase.client.keyvalue.maxsize property in hbase-site.xml. If you expect your data to be very large, you can keep the data in HDFS and store pointers to it in HBase.
