Hive index rebuild too slow in compare with PostgreSQL - hadoop

I am trying to compare same functionality on my PostgreSQL data warehouse and newly created Hive data warehouse on same box with same data and same table structure . I am trying to understand Hive benefits, but... Despite the fact that data load into PostgreSQL running 3 times slower - the index creation/rebuild on PostgreSQL is 20 times faster, the index doesn't need to be rebuild every time like in Hive.
My question is: what I am missing in Hive configuration?
My setup is:
CREATE TABLE mytable
(
aa int,
bb string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/spaces/hadoop/hadoopfs';
LOAD DATA LOCAL INPATH '/data/Informix94/spaces/postgres/myfile_big' OVERWRITE INTO TABLE mytable;
CREATE INDEX mytable_indx ON TABLE mytable(aa) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD LOCATION '/data/spaces/hadoop/hadoopfs';
set hive.optimize.autoindex=true;
set hive.optimize.index.filter=true;
alter index mytable_indx ON mytable rebuild;
My Box is VM with 3 G ram with PostgreSQL running on it and taking ~ 1 G ram. He is serving as metadata store. I am using most recent stable versions of CentOS, Hadoop, Hive and didn't changed Hive default setting except matadata store location and statistics disabling.
The result:
index rebuild takes 4798 seconds on 260.000.000 rows or 80 seconds on 5.000.000 rows.

Hive only works well when your data doesn't fit on a single machine anymore. So the results you are seeing are expected results. So once you've collected Terabytes or Petabytes of data you'll be much happier with hive. In the use-case you describe PostgreSQL would be a much better match.

Related

HIVE query takes very long time to fetch 20 gb records

Hi I have a hive table on HBASE that has 200gb of records .
I am running simple hive query to fetch 20 gb records .
But this takes around 4 hours of time .
I can not create partition on HIVE table cause it is integrated on HBASE.
Please suggest any idea to improve performance
This is my HIVE query
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FundamentalAnalytic/FundamentalAnalytic_2014.txt'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
select * from hbase_table_FundamentalAnalytic where FilePartition='ThirdPartyPrivate' and FilePartitionDate='2014';
If you can, then I think Apache Phoenix will speed things up.
https://phoenix.apache.org/faq.html
Very simple and intuitive to use and super fast.

How Hive reads data even after dropping from hdfs?

I have an external table in hive and pointing to HDFS location. By mistake I have ran the job to load the data into HDFS two times.
Even after deleting the duplicate file from HDFS hive is showing the data count two times(i.e. including deleted duplicate data file count).
select count(*) from tbl_name -- returns double time
But ,
select count(col_name) from tbl_name -- returns actual count.
Same table when I tried from Impala after
INVALIDATE METADATA
I could see only data count which is available in HDFS(not duplicate).
How can hive give count as double even after deleting from physical location(hdfs) , does it read from statistics?
Hive is using statistics for computing cont(*). You deleted files manually (not using Hive) that is why the stats is wrong.
The solution is:
to switch-off statistics usage in such cases:
set hive.compute.query.using.stats=false;
to analyze table as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;

How to improve performance of loading data from NON Partition table into ORC partition table in HIVE

I'm new to Hive Querying, I'm looking for best practices to retrieve data from Hive table. we have enabled TeZ has execution engine and enabled vectorization.
We want to make reporting from Hive table, I read from TEZ document that it can be used for real time reporting. Scenario is from my WEB Application, I would like to show result from Hive Query Select * from Hive table on UI, but for any query, in the hive command prompt takes minimum 20-60 secs even though hive table has 60 GB data ,.
1) Can any one tell me how to show real time reporting by querying Hive table and show results immediately on UI within 10-30 secs
2) Another problem we have identified is, Initially we have Un-Partitioned table pointing to a Blob/File in HDFS,it is of size 60 GB with 200 columns, when we dump the data from Un-Partitioned table to ORC table(ORC table is partitioned), it takes 3 + hrs, Is there a way to improve performance in dumping data into ORC table.
3) When we do querying on Non Partition table with bucketing, inserting to hive table and querying taking less time than select query on ORC table, but has the number of records in hive table increase ORC table's SELECT query is better than table with buckets. Is there a way to improve performance for small data sets also. Since it is initial phase, every month we load 50 GB data into Hive table. but it can increase, we looking improve performance of loading data into Orc partitioned table.
4) TEZ supports interactive, less latency and drill down support for reports. How to enable my drill down reports to get data from Hive ( which should be interactive) within in Human response time i.e 5-40 sec.
we are testing with 4 Nodes each Node is having 4 cpu cores and 7 GB RAM and 3 disk attached to each VM.
Thanks,
Mahender
In order to improve the speed of inserting data to ORC table, you can try playing around with following parameters:
hive.exec.orc.memory.pool
hive.exec.orc.default.stripe.size
hive.exec.orc.default.block.size
hive.exec.orc.default.buffer.size
dfs.blocksize
Also, you might see, whether compression might also help you. For example:
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate = true;
Hope it helps!
First of all. HIVE is not meant for real time data processing. No matter how small the data may be the query will take a while to return data.
Real power of hive lies in batch processing huge amount of data.

How Load distributed data in Hive works?

My target is to perform a SELECT query using Hive
When I have a small data on a single machine (namenode), I start by:
1-Creating a table that contains this data: create table table1 (int col1, string col2)
2-Loading the data from a file path: load data local inpath 'path' into table table1;
3-Perform my SELECT query: select * from table1 where col1>0
I have huge data, of 10 millions rows that doesn't fit into a single machine. Lets assume Hadoop divided my data into for example 10 datanodes and each datanode contains 1 million row.
Retrieving the data to a single computer is impossible due to its huge size or would take alot of time in case it is possible.
Will Hive create a table at each datanode and perform the SELECT query
or will Hive move all the data a one location (datanode) and create one table? (which is inefficient)
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map side job. So the data block is processed locally at every node. There is no need to collect data centrally.

Not able to apply dynamic partitioning for a huge data set in Hive

I have a table test_details with some 4 million records. Using the data in this table, I have to create a new partitioned table test_details_par with records partitioned on visit_date. Creating the table is not a challenge, but when I come to the part where I have to INSERT the data using Dynamic Partitions, Hive gives up when I try to insert data for more number of days. If I do it for 2 or 3 days the Map Reduce jobs runs successfully but for more days it fails giving a JAVA Heap Space Error or GC Error.
A Simplified Snapshot of my DDLs is as follows:
CREATE TABLE test_details_par( visit_id INT, visit_date DATE, store_id SMALLINT);
INSERT INTO TABLE test_details_par PARTITION(visit_date) SELECT visit_id, store_id, visit_date FROM test_details DISTRIBUTE BY visit_date;
I have tried setting these parameters, so that Hive executes my job in a better way:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode = 10000;
Is there anything that I am missing to run the INSERT for a complete batch without specifying the dates specifically?
Neels,
Hive 12 and below have well-known scalability issues with dynamic partitioning that will be addressed with Hive 13. The problem is that Hive attempts to hold a file handle open for each and every partition it writes out, which causes out of memory and crashes. Hive 13 will sort by partition key so that it only needs to hold one file open at a time.
You have 3 options as I see
Change your job to insert only a few partitions at a time.
Wait for Hive 13 to be released and try that (2-3 months to wait).
If you know how, build Hive from trunk and use it to complete your data load.

Resources