I have a question about a Hive table we created: when we run select * from table where <condition>;, it returns results immediately without invoking an MR job. When I create an identical duplicate table and run the same query, an MR job is invoked. What could be the possible reason for this?
I found the answer. The reason was that the Hive ANALYZE command had been issued on the table. Once you execute ANALYZE, Hive stores the row count and file size in the metastore, so when you run select count(*) from table, it fetches the result directly from the metastore instead of invoking a MapReduce job.
You can also issue the ANALYZE command at the column level:
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
Documentation link:
https://cwiki.apache.org/confluence/display/Hive/StatsDev
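For illustration, here is a minimal sketch (the table name mydb.sales is an assumption) that populates the statistics and then answers count(*) from the metastore:
ANALYZE TABLE mydb.sales COMPUTE STATISTICS;
ANALYZE TABLE mydb.sales COMPUTE STATISTICS FOR COLUMNS;   -- optional column-level statistics
DESCRIBE FORMATTED mydb.sales;    -- numRows, numFiles, totalSize now appear under Table Parameters
SELECT count(*) FROM mydb.sales;  -- served from the metastore (with hive.compute.query.using.stats=true), no MR job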
Local mode (Hive not invoking MR) depends on several conditions (see HIVE-1408):
hive.exec.mode.local.auto=true/false - Lets Hive determine whether to run in local mode automatically.
hive.exec.mode.local.auto.input.size.max=1G - When hive.exec.mode.local.auto is true, input bytes should be less than this for local mode.
hive.exec.mode.local.auto.input.files.max=4 - When hive.exec.mode.local.auto is true, the number of tasks should be less than this for local mode.
If the tables have the same data, my guess is that there is a difference in the number of tasks spawned when querying the two tables, causing one query to run in local mode and the other to spawn an MR job.
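As a rough sketch, these properties can be checked and adjusted per session with SET (property names and defaults vary between Hive versions, so treat the values as examples):
SET hive.exec.mode.local.auto;                      -- show the current value
SET hive.exec.mode.local.auto=true;                 -- let Hive choose local mode automatically
SET hive.exec.mode.local.auto.input.files.max=4;    -- file/task threshold quoted above; the input-size threshold is set the same way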
Related
Problem statement: I have an original external table with a row count of 1000. I copied its underlying data to a temp location and created a backup table pointing to that temp location. After running MSCK REPAIR, the two table counts do not match.
Is there any reason for this? Could you please help me understand the reason behind it?
Answering and clarifying a few things here:
Stats can be fetched either directly from the metastore or by reading through the underlying data. This behaviour is controlled by the property hive.compute.query.using.stats:
a. When it is set to TRUE, Hive will answer a few queries like min, max, and count(1) purely using statistics stored in the metastore.
b. When it is set to FALSE, Hive will spawn a YARN job to read through the data and compute the count. This is usually time consuming, depending on the amount of data, since it is not a direct fetch from the statistics stored in the Hive metastore.
So, if we want correct results to be returned when hive.compute.query.using.stats is set to TRUE, we need to make sure the statistics for the table are up to date.
You can check whether the value is set to TRUE or FALSE by running the following in Hive:
SET hive.compute.query.using.stats;
MSCK REPAIR does not do file-level checks. It looks only for directory-level changes; for example, if you have a partitioned table and add a partition directory manually in HDFS, running MSCK REPAIR is what makes Hive aware of it.
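As an illustration (assuming a partitioned backup table mydb.backup_table with a partition column dt and a partition directory copied in manually), the repair plus a stats refresh could look like this:
MSCK REPAIR TABLE mydb.backup_table;                 -- register partition directories added outside Hive
SHOW PARTITIONS mydb.backup_table;                   -- confirm the new partitions are now visible
ANALYZE TABLE mydb.backup_table PARTITION(dt='2015-01-01') COMPUTE STATISTICS;  -- refresh stats so a metastore-served count(*) matches the data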
I have a table in Hive and I want to fetch all of its data. The problem is that:
select * from tbl;
Gives me very different results than:
select count(*) from tbl;
Why is that? The second query seems to run a Hadoop MapReduce job, while the first does not; it simply returns the results. The table is not partitioned or bucketed, and it is stored in text (CSV) format.
When you submit a Hive query, Hive converts it into one or more stages. A stage could be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or another task Hive needs to perform.
select * from table_name;
This query simply scans the entire table and dumps the output to the screen, which is why you see different log output on the console.
By contrast, select count(*) from table_name just reads the table statistics stored in the Hive metastore and returns the result from there (when those statistics are available); it does not run any MapReduce job.
You can run the command below in the Hive console to see this information:
hive> describe formatted table_name;
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles xx
numRows xxxxxxxx
In Hadoop, aggregation, conditional, and arithmetic operations require a processing engine to compute the result. Whenever you submit this type of query, it is internally translated into a MapReduce program; the MapReduce program is executed on behalf of the query, produces its result for Hive, and Hive displays it on your screen. That is why you see different behaviour between the two queries.
You can put the EXPLAIN keyword in front of the query to see the query plan and other information.
Please refer to Chapter 10 of the Programming Hive book to learn more about Hive's EXPLAIN feature.
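For instance, a quick way to compare the two queries from the question (table name tbl as above) is to check whether the plan contains a MapReduce stage:
EXPLAIN SELECT * FROM tbl;          -- typically just a Fetch Operator, so no MapReduce job
EXPLAIN SELECT count(*) FROM tbl;   -- shows a Map Reduce stage unless the result can come straight from table statistics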
At my organization we are trying to use Hive or Pig as an alternative to Netezza.
Primary goal: reduce processing time
Netezza processing time: 90 minutes
Target: complete the process within 30 minutes
How the process works:
The process maintains incremental history. There are two tables, history_table and new_table. The history table holds the complete history and new_table holds the updated records, so every day the updated records are added to the history table. The process relies on very complex stored procedures (joins, deletes, inserts, updates), and the same process is applied to multiple tables. Each history table holds on the order of a billion records.
Questions I have:
Does Hive/Pig perform better than Netezza?
Are Hive UDFs a good alternative to stored procedures, given that I want to build a generic process for multiple tables (where I can pass the table name as an argument)?
Which performs better, Hive or Pig, for really complex joins with multiple conditions, dynamic generation of CREATE statements, and exception handling?
Use Impala, which is essentially Netezza on Hadoop. Try Kudu for combined real-time and batch workloads, or use HBase for real time and Impala for batch; note that you can also query HBase through Impala.
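Whichever engine you pick, the stored-procedure update/insert logic usually turns into a set-based statement. A rough Hive sketch of the daily merge (assuming both tables share the same schema and an id key; table and column names are placeholders) could look like this:
-- step 1: build the merged result (unchanged history rows plus the fresh records)
DROP TABLE IF EXISTS history_table_merged;
CREATE TABLE history_table_merged AS
SELECT * FROM (
  SELECT h.* FROM history_table h
  LEFT OUTER JOIN new_table n ON h.id = n.id
  WHERE n.id IS NULL          -- history rows that were not updated today
  UNION ALL
  SELECT n.* FROM new_table n -- today's inserted/updated records
) merged;
-- step 2: swap the merged result back into the history table
INSERT OVERWRITE TABLE history_table
SELECT * FROM history_table_merged;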
I recently did an integration between Hive and HBase. I created a Hive table with the HBase SerDe, and when I insert records into the Hive table they are loaded into the HBase table. I am trying to understand what happens if the insert into the Hive-HBase table fails partway through (HBase service failure / network issue). I assume the records that were already loaded into HBase will remain, and when I rerun the operation I will end up with two copies of the data with different timestamps (assuming, say, 10K of 20K records were inserted before the failure occurred).
What is the best way to insert records into HBase ?
Can Hive provide a check to see whether the data is already there?
Is MapReduce the best option for scenarios like these? I could write a MapReduce program that reads data from Hive and checks record by record in HBase before inserting, to make sure there are no duplicate writes.
Any help on this would be greatly appreciated.
Yes, you will have two versions of the data when you rerun the load operation. But that's OK, since the extra version will get cleaned up on the next compaction. As long as your inserts are idempotent (which they most likely are), you won't have a problem.
At Lithium+Klout, we use a custom-built HBase SerDe that writes HFiles instead of using Puts to insert the data. We generate the HFiles and then use the bulk-load tool to load all of the data after the job has completed. That's another way you can integrate Hive and HBase.
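For reference, a typical Hive table backed by HBase through the storage handler (table, column family, and column names below are placeholders) is declared roughly like this:
CREATE TABLE hive_hbase_tbl (rowkey STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "hbase_tbl");
-- re-running the load writes the same row keys again (new cell versions), which is what keeps the operation idempotent
INSERT OVERWRITE TABLE hive_hbase_tbl
SELECT id, value FROM source_hive_table;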
Hive 0.13 takes a SHARED lock on the entire database (I see a node like LOCK-0000000000 as a child of the database node in ZooKeeper) when running a select statement on any table in the database. Hive creates a shared lock on the entire schema even for a plain select statement; this blocks CREATE/DELETE statements on other tables in the database until the original query finishes and the lock is released.
Does anybody know a way around this? The following link suggests turning concurrency off, but we can't do that: we are replacing the entire table and have to make sure that no select statement is accessing the table while we replace its contents.
http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3C0eba01cfc035$3501e4f0$9f05aed0$#com%3E
use mydatabase;
select count(*) from large_table limit 1;  -- this table is very large, and hive.support.concurrency=true
In another Hive shell, while the first query is executing:
use mydatabase;
create table sometable (id string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE ;
The problem is that the create table statement does not run until the first query (the select) has finished.
Update:
We are using Cloudera's distribution of Hive CDH-5.2.1-1 and we are seeing this issue.
I don't think Hive 0.13 behaves that way. Please check your ResourceManager and verify that you have enough memory when you are executing multiple Hive queries.
As you know, each Hive query triggers a MapReduce job, and if YARN does not have enough resources the new job will wait until the previously running job completes. Please approach your issue from the memory point of view.
All the best !!
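If you want to confirm that it really is the lock (and not YARN resources) that is blocking the create table, you can inspect the lock state from another session while the select is running:
SHOW LOCKS;                          -- list all current locks
SHOW LOCKS large_table EXTENDED;     -- show who holds the lock on the table and in which mode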