Increase write speed in hive for ORC files - hadoop

Currently an insert overwrite table T1 select * from T2; will take around 100 minutes in my cluster. Table T1 is ORC formatted and T2 is text formatted. I am reading a 60 GB of text data from T2 and inserting into ORC table T1(10 GB after insertion). If i use text format for both tables insert will take around 50 min. In both cases what are the things we can do to improve write speed( I have large tables coming in) or any other suggestions??

I have recently derived an approach which splits the source file into partitions this takes around 6mins from text table to orc table in hive for 100GB data.
Approach below
Before inserting the file into text table
1.split the file into small partitions in unix location using split command
2.then remove the original file from the path and just keep the files splitted.
Inserting into text table
3.now load the data into text table
4.it will take some mins to load and u can see that there will be same number of partitions as you have done at unix level
Inserting into orc table
Ex: you have splitted the actual file into let say 20 partitions
then you would see 20 tasks/containers being run on the cluster to load into the orc table which is very much faster than the other
solutions which i came across
#despicable-me

That is probably a normal behaviour as when you write data from text to text - it just writes data line by line from one file into another. Text-to-ORC will do some more work besides of it. Comparing to the text-to-text operation, text-to-orc importing will perform additional bucket-partition operations and compression operations to you data. That is the resaon of your time impacts. ORC format gives two main benefits upon text format:
save of space due to compression
improve access time to work with the data
Usually the INSERT operation is a single time operation, while access operations will be very frequent. So it usually makes sence to spend some more time at the beginning on importing the data and then have a huge benefite in saving space due to optimized storage of the data and
in optimized access time to this data

Related

Hive queries on external S3 table very slow

We have our dataset in s3 (parquet files) in the below format, data divided as multiple parquet files based on the row number
data1_1000000.parquet
data1000001_2000000.parquet
data2000001_3000000.parquet
...
Created hive table on top of it using,
CREATE EXTERNAL TABLE parquet_hive (
foo string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
Totally there 22000 parquet files and the size of the folder is nearly 300GB. When i run the count query on this table in Hive, it is taking 6 hours to return the result which is nearly 7 billion records. How can we make it faster? Can i create partition or index on the table or this is the time it usually take when pulling data from s3. Can anyone advice, what is wrong here.
Thanks.

Hive. Check stripe size for existing ORC storage

I have two scripts which parse data from raw logs and write it into ORC tables in HIVE. One script creates more columns and another less. Both tables partitioned by date field.
As the result I have ORC tables with different sizes of files.
Table with larger number of columns consists of many small files (~4MB per file inside each partition) and tables with less columns consists of few large files (~250 MB per file inside each partition).
I suppose it happens because of stripe.size setting in ORC. But I don't know how to check size of stripe for existing table. Commands like "show create" and "describe" don't reveal any custom settings, it means that stripe size for tables should be equal to 256 MB.
I'm looking for any advice to check stripe.size for existing ORC table.
Or explanation how file size inside ORC tables depends on data in that tables.
P.s.It matters later when I'm reading from that tables with Map Reduce and there are small number of reducers for tables with big files.
Try the Hive ORC File Dump Utility: ORC File Dump Utility.

hive merge properties not working for small files

I am trying to insert data into dynamic partitioned table which is creating lots of small files, i have set hive properties as below but i still see small files in partitioned folder, the size per task nor the avgfile size seems to be working for me as the files in partitioned folder are above the size per task i gave.
Any help will be greatly appreciated
hive.merge.mapfiles=true;
hive merge mapredfiles = true
hive.merge.size.per.task=10000;
hive.merge.smallfiles.avgsize=100;
Your example shows you setting the average size to 100 bytes which would create a lot of small files and is most likely being ignored because the files are already larger than that. Try increasing this value to an average of 128MB(134217728) which should on average increase the size of the files being merged after your job is complete.
set hive.merge.smallfiles.avgsize = 134217728;
This can happen when you execute multiple inserts into a single Hive table. 1 single insert can result in one or more files under the HDFS location.
I have managed this situation by executing below command - this will compact the table and will merge all files in one (or bigger ones)
There's one restriction though, you can't have indexes in your hive tables to execute the merge command.
I have also tested from Spark SQL over ORC files - (1.5.2) and it works fine.
ALTER TABLE schema.table PARTITION (month = '01') CONCATENATE
Hope it helps
Working with Small files in hive is a common problem and it can also be resolved by using CombineHiveInputFormat for input format. Also use ORC files by deafault:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
This will help to run hive job faster for given small files in hive.

How to improve performance of loading data from NON Partition table into ORC partition table in HIVE

I'm new to Hive Querying, I'm looking for best practices to retrieve data from Hive table. we have enabled TeZ has execution engine and enabled vectorization.
We want to make reporting from Hive table, I read from TEZ document that it can be used for real time reporting. Scenario is from my WEB Application, I would like to show result from Hive Query Select * from Hive table on UI, but for any query, in the hive command prompt takes minimum 20-60 secs even though hive table has 60 GB data ,.
1) Can any one tell me how to show real time reporting by querying Hive table and show results immediately on UI within 10-30 secs
2) Another problem we have identified is, Initially we have Un-Partitioned table pointing to a Blob/File in HDFS,it is of size 60 GB with 200 columns, when we dump the data from Un-Partitioned table to ORC table(ORC table is partitioned), it takes 3 + hrs, Is there a way to improve performance in dumping data into ORC table.
3) When we do querying on Non Partition table with bucketing, inserting to hive table and querying taking less time than select query on ORC table, but has the number of records in hive table increase ORC table's SELECT query is better than table with buckets. Is there a way to improve performance for small data sets also. Since it is initial phase, every month we load 50 GB data into Hive table. but it can increase, we looking improve performance of loading data into Orc partitioned table.
4) TEZ supports interactive, less latency and drill down support for reports. How to enable my drill down reports to get data from Hive ( which should be interactive) within in Human response time i.e 5-40 sec.
we are testing with 4 Nodes each Node is having 4 cpu cores and 7 GB RAM and 3 disk attached to each VM.
Thanks,
Mahender
In order to improve the speed of inserting data to ORC table, you can try playing around with following parameters:
hive.exec.orc.memory.pool
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size
hive.exec.orc.default.buffer.size
dfs.blocksize
Also, you might see, whether compression might also help you. For example:
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
Hope it helps!
First of all. HIVE is not meant for real time data processing. No matter how small the data may be the query will take a while to return data.
Real power of hive lies in batch processing huge amount of data.

How Load distributed data in Hive works?

My target is to perform a SELECT query using Hive
When I have a small data on a single machine (namenode), I start by:
1-Creating a table that contains this data: create table table1 (int col1, string col2)
2-Loading the data from a file path: load data local inpath 'path' into table table1;
3-Perform my SELECT query: select * from table1 where col1>0
I have huge data, of 10 millions rows that doesn't fit into a single machine. Lets assume Hadoop divided my data into for example 10 datanodes and each datanode contains 1 million row.
Retrieving the data to a single computer is impossible due to its huge size or would take alot of time in case it is possible.
Will Hive create a table at each datanode and perform the SELECT query
or will Hive move all the data a one location (datanode) and create one table? (which is inefficient)
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map side job. So the data block is processed locally at every node. There is no need to collect data centrally.

Resources