Hive not respecting mapreduce.job.reduces - hadoop

A hive insert statement of the following form:
insert into my_table select * from my_other_table;
is using ONE reducer - even though just prior the following had been executed:
set mapreduce.job.reduces=80;
Is there a way to force Hive to use more reducers? There is no clear reason why this particular query would use a single reducer, given there is no ORDER BY clause at the end.
BTW, the source and destination tables are both stored as Parquet.

SELECT * FROM table; in Hive does not use any reducers - it is a map-only job.
One way to force Hive to use reducers in a SELECT * is to GROUP BY all of the fields, for example:
SELECT field1, field2, field3 FROM table GROUP BY field1, field2, field3;
Note, though, that this will remove duplicate records.

In the insert query you mentioned, Hive will try to write all the data into a single file, because it is just a select * statement. Hence 1 reducer.
But if you use bucketing, Hive will use the same number of reducers as you have buckets.
If you have 128 buckets, Hive will fire 128 reducer tasks and ultimately create 128 files.
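As a rough illustration of the bucketing approach (the bucketed table name, its columns, and the bucket count below are hypothetical, not from the original question):
-- Hypothetical bucketed destination table; 16 buckets chosen arbitrarily.
create table my_table_bucketed (id bigint, payload string)
clustered by (id) into 16 buckets
stored as parquet;
-- Older Hive versions may also need: set hive.enforce.bucketing=true;
insert into my_table_bucketed select id, payload from my_other_table;
Each bucket is written by its own reducer, so this insert should run 16 reduce tasks and produce 16 output files.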

Related

Overwrite multiple partitions at once Hadoop

I have a partitioned external Hive table that I have to overwrite with some records.
There are a lot of dates that we need to reload and the queries are a bit heavy.
What we want to know is whether it is possible to load two or more different partitions simultaneously.
For example, 3 (or more) processes running in parallel like:
Process1
insert overwrite table table_prod partition (data_date)
select * from table_old where data_date=20221110;
Process2
insert overwrite table table_prod partition (data_date)
select * from table_old where data_date=20221111;
Process3
insert overwrite table table_prod partition (data_date)
select * from table_old where data_date=20221112;
Short answer is yes, you can.
The real question is how - because you have to consider the large volume of data.
Option 1 - Yes, you can use a shell script or some scheduler tool. But the query you're using is going to be slow; you can use static partitioning, which is way faster.
insert overwrite table table_prod partition (data_date=20221110) -- note that the partition value is specified
select
col1, col2... -- exclude the data_date column from the select list
from table_old where data_date=20221110;
Option 2 - You can also use dynamic partitioning to load all the partitions at once. This is a performance-intensive operation, but you don't have to create a shell script or any other process.
insert overwrite table table_prod partition (data_date)
select * from table_old;
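A minimal sketch of the dynamic-partition variant; the two set statements are standard Hive properties that such inserts usually require (they are not part of the original answer):
-- Allow dynamic partitioning without a static partition key.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The partition column (data_date) must be the last column produced by the select.
insert overwrite table table_prod partition (data_date)
select * from table_old;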

Combine Multiple Hive Tables as single table in Hadoop

Hi, I have multiple Hive tables, around 15-20. All the tables share a common schema. I need to combine all of them into a single table. The single table will be queried from a reporting tool, so performance also needs to be taken care of.
I tried this:
create table new as
select * from table_a
union all
select * from table_b;
Is there a more efficient way to combine all the tables? Any help will be appreciated.
Hive will process in parallel if you set "hive.exec.parallel" to true. With "hive.exec.parallel.thread.number" you can specify the number of parallel threads. This would increase the overall efficiency.
If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
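A hedged sketch combining both suggestions, using the table names from the question (combined_table is a hypothetical name and the thread count is arbitrary):
-- Let independent stages of the query run in parallel.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8; -- arbitrary; tune for your cluster
-- Combine the tables with UNION ALL (keeps duplicates, no deduplication cost).
create table combined_table as
select * from table_a
union all
select * from table_b;
-- Add further "union all select * from table_x" branches for the remaining tables.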

select all but few columns in impala

Is there a way to replicate the below in impala?
SET hive.support.quoted.identifiers=none
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal') SELECT `(A)?+.+` FROM MyTxtTable WHERE A='SumVal'
Basically I have a table in Hive stored as text with 1000 fields, and I need a select that drops the field A. The above works for Hive but not Impala; how can I do this in Impala without listing the other 999 fields directly?

Hive count(*) query is not invoking mapreduce

I have external tables in Hive. I am trying to run a select count(*) from table_name query, but the query returns instantaneously and gives a result which, I think, is already stored. The result returned by the query is not correct. Is there a way to force a MapReduce job and make the query execute each time?
Note: this behavior is not seen for all external tables, but for some of them.
Versions used: Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I have found a method that kicks off MR to count the number of records on an ORC table.
ANALYZE TABLE table_name PARTITION(partition_columns) COMPUTE STATISTICS;
--OR
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative for count(*), but it provides the latest count of records in the table.
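A hedged illustration with hypothetical table and partition names; after the ANALYZE, the gathered row count can be read back from the table/partition metadata:
-- Runs a job to gather statistics for every partition of the table.
analyze table web_logs partition(log_date) compute statistics;
-- Inspect the stored statistics for one partition; look for numRows in the output.
describe formatted web_logs partition(log_date='2015-06-01');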
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce for count(*) of an ORC file since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
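For reference, the dump utility mentioned above is invoked through the hive binary; the HDFS path below is hypothetical:
hive --orcfiledump /apps/hive/warehouse/mydb.db/mytable/000000_0
This prints the ORC file's metadata, including the row counts stored per stripe.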
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, you must add a non-deterministic clause e.g. "where MYKEY is null". Assuming that MYKEY is a String, otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the below:
hive> set hive.fetch.task.conversion=none;
Set this in your Hive session and then trigger the select count(*) operation in the same session to mandate MapReduce.
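Putting the suggestions together, a minimal sketch; hive.compute.query.using.stats is an additional standard Hive property (not mentioned in the answers above) that controls whether count(*) is answered from stored statistics:
-- Disable the fetch-task shortcut so simple queries go through MapReduce.
set hive.fetch.task.conversion=none;
-- Stop the optimizer from answering count(*) from stored statistics.
set hive.compute.query.using.stats=false;
select count(*) from table_name; -- should now launch a MapReduce job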

Insert overwrite local directory launching map reduce jobs for a simple query

I have two hive queries
select * from tab1 limit 3;
This returns the 3 rows quickly without launching any map reduce jobs;
Now, if I ask the same query to write its output to a local directory as:
INSERT OVERWRITE LOCAL DIRECTORY "/tmp/query1/" select * from tab1 limit 3;
this query launches a MapReduce job that scans through all the files of the table before returning 3 rows, and the table in question is a big one, so scanning through the whole thing takes a long time.
Why is there a difference in execution style of both queries?
A simple explanation is:
When you execute a simple select * from tab1 limit 3 query in Hive, it accesses the raw data files from HDFS and presents the output like a view on top of the files stored in HDFS - basically a dfs -cat 'filepath'. A MapReduce job is not triggered in this case, hence the query completes faster. If you modify your query to pull even one column, like select col1 from tab1 limit 3, a MapReduce job is triggered and the part files are scanned in parallel to pull out the results, thus consuming some cumulative CPU time.
The same thing happens when you hit a query like INSERT OVERWRITE LOCAL DIRECTORY "/tmp/query1/" select * from tab1 limit 3;
To find out more about how Hive translates queries into MapReduce jobs, you can use the EXPLAIN keyword in front of your query. This should make things clearer.
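A quick illustration of the EXPLAIN suggestion, applied to the query from the question:
-- Prints the plan (map-only vs. map-reduce stages) without executing the query.
explain
insert overwrite local directory "/tmp/query1/"
select * from tab1 limit 3;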
