Hive MapReduce job clarification on selecting columns - hadoop

MapReduce jobs on Hive statements
When I run the following statement in Hive:
hive> SELECT * FROM USERS LIMIT 100;
it doesn't launch a MapReduce job, because we are selecting everything from the table and only limiting the number of records it returns.
But when I run the following:
hive> select age,occupation from users limit 100;
it actually kicks off a MapReduce job.
Does that mean applying a column-level projection requires a MapReduce job, even though I have not applied any kind of filter?

Whenever you run a plain 'select *', a fetch task is created rather than a MapReduce task; it simply dumps the data as-is without doing anything to it. This is equivalent to a:
hadoop fs -cat $file_name
Whereas whenever you do a 'select column', a map-only job internally projects that particular column and emits the output.

When you write select * from table_name, the underlying file is read directly, while if you select a specific column, a map-only job is launched; no reduce phase is needed, since we are only projecting a column, not aggregating.
SELECT * FROM table_name;                --> will not launch an MR job (fetch task)
SELECT column FROM table_name;           --> will launch a map-only job (no reduce)
SELECT MAX(column_name) FROM table_name; --> will launch a full MR job (map + reduce)
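One way to check which plan Hive will choose is EXPLAIN: the plan for a fetch-converted query shows a Fetch Operator instead of a MapReduce stage. A sketch, reusing the table and columns from the question above:

```sql
-- Plan contains only a Fetch Operator: no MapReduce job is launched.
EXPLAIN SELECT * FROM users LIMIT 100;

-- Plan contains a Map Reduce stage (map + reduce) for the aggregation.
EXPLAIN SELECT MAX(age) FROM users;
```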

Related

Force HiveServer2 to run MapReduce job

I am using hive-1.1.0.
Submitting read-only queries with no predicates to HiveServer2 via Beeline will cause HiveServer2 to read the data from HDFS itself without spawning a MapReduce job:
SELECT * FROM my_table LIMIT 100;
For very large datasets this can cause HiveServer2 to hold onto a lot of memory leading to long garbage collection pauses. Adding a "fake" predicate will cause HiveServer2 to run the MapReduce job as desired; e.g.
SELECT * FROM my_table WHERE (my_id > 0 OR my_id <= 0) LIMIT 100;
By "fake", I mean a predicate that does not matter; the above example predicate will always be true.
Is there a setting to force HiveServer2 to always run the MapReduce job without having to add bogus predicates?
I am not talking about when HiveServer2 determines it can run a MapReduce job locally; I have this disabled entirely:
> SET hive.exec.mode.local.auto;
+----------------------------------+--+
| set |
+----------------------------------+--+
| hive.exec.mode.local.auto=false |
+----------------------------------+--+
but queries without predicates are still read entirely by HiveServer2 causing issues.
Any guidance much appreciated.
Thanks!
Some select queries can be converted to a single FETCH task, without map-reduce at all.
This behavior is controlled by hive.fetch.task.conversion configuration parameter.
Possible values are: none, minimal and more.
If you want to disable fetch task conversion, set it to none:
set hive.fetch.task.conversion=none;
minimal will trigger a FETCH-only task for:
SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.
more will trigger a FETCH-only task for:
SELECT with any kind of expressions including UDFs, FILTER, LIMIT only
(including TABLESAMPLE and virtual columns).
Read also about the hive.fetch.task.conversion.threshold parameter; more details here: Hive Configuration Properties
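As a sketch of how these settings fit together (the threshold value shown is the common default and is illustrative):

```sql
-- Disable fetch-task conversion entirely; every query runs as a job:
set hive.fetch.task.conversion=none;

-- Or keep conversion enabled, but only for small inputs:
set hive.fetch.task.conversion=more;
-- Inputs larger than this many bytes still run as a job:
set hive.fetch.task.conversion.threshold=1073741824;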

Hive + Tez :: A join query stuck at last 2 mappers for a long time

I have a views table joined with a temp table, with the parameters below intentionally enabled.
hive.auto.convert.join=true;
hive.execution.engine=tez;
The Code Snippet is,
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM VIEWS TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
In the normal scenario, both tables can have any number of records.
However, in the SCU_TMP table, only 10-50 records are expected per User ID.
But in some cases, a couple of User IDs come with around 10k-20k records in the SCU temp table, which creates a cross-product effect.
In such cases, the job runs forever with just one mapper left to complete.
Is there any way to optimise this and make it run gracefully?
I was able to solve it with the query below.
set hive.exec.reducers.bytes.per.reducer=10000;
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM (SELECT USER_ID,TIME,MULTI_DIM_ID,SV1 FROM VIEWS SORT BY TIME) TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
The problem arises because when a single user ID dominates the table, the join for that user is processed by a single mapper, which gets stuck.
Two modifications were made:
1) Replaced the table name with a subquery, which added a sorting step before the join.
2) Reduced the hive.exec.reducers.bytes.per.reducer parameter to 10 KB.
The SORT BY TIME in step (1) added a shuffle phase that evenly distributed the data, which was previously skewed by User ID.
Reducing the bytes-per-reducer parameter resulted in the data being distributed across all available reducers.
With these two enhancements, a 10-12 hour run was reduced to 45 minutes.
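To confirm this kind of key skew before tuning, a quick frequency check on the join key might look like the following (a sketch against the temp table from the question):

```sql
-- Which User IDs dominate the temp table?
SELECT user_id, COUNT(*) AS cnt
FROM scu_tmp
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20;
```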

Hive on Tez Pushdown Predicate doesn't work in view using window function on partitioned table

Using Hive on Tez, running this query against this view causes a full table scan even though the table is partitioned on regionid and id. The same query in Cloudera Impala takes 0.6 s to complete; on Hortonworks Data Platform with Hive on Tez it takes 800 s. I've come to the conclusion that in Hive on Tez, using a window function prevents the predicate from being pushed down to the inner select, causing the full table scan.
CREATE VIEW latestposition AS
WITH t1 AS (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY regionid, id, deviceid order by ts desc) AS rownos FROM positions
)
SELECT *
FROM t1
WHERE rownos = 1;
SELECT * FROM latestposition WHERE regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432';
I've tried joining this table to itself using the MAX function to get the latest record; it works and finishes in a few seconds, but that is still too slow for my use case. Also, if I remove the window function, the predicate gets pushed down and the query returns in milliseconds.
If anyone has any ideas it would be much appreciated.
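For reference, the MAX-based self-join mentioned above might be sketched like this (column names are taken from the view definition; the exact query the asker used is an assumption):

```sql
-- Keep only the row with the latest ts per (regionid, id, deviceid)
SELECT p.*
FROM positions p
JOIN (SELECT regionid, id, deviceid, MAX(ts) AS max_ts
      FROM positions
      GROUP BY regionid, id, deviceid) latest
  ON  p.regionid = latest.regionid
  AND p.id       = latest.id
  AND p.deviceid = latest.deviceid
  AND p.ts       = latest.max_ts;
```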
For anyone that is interested, I posted this question on the Hortonworks Community forum. The good guys over there raised a bug for this issue on the Hive Jira and are actively working on it.
https://community.hortonworks.com/questions/8880/hive-on-tez-pushdown-predicate-doesnt-work-in-part.html
https://issues.apache.org/jira/browse/HIVE-12808
This is expected behavior. To avoid the full table scan you have to apply the WHERE condition inside the subquery, like this (which you cannot do through the view). This is a limitation of most databases: analytic functions are applied only after the data is filtered, since they materialize intermediate results internally.
WITH t1 AS (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY regionid, id, deviceid order by ts desc) AS rownos FROM positions <where condition>
)
SELECT *
FROM t1
WHERE rownos = 1;

Hive job not starting

I have created a new VM with hadoop on it
(used this guide: http://alanxelsys.com/2014/02/01/hadoop-2-2-single-node-installation-on-centos-6-5/ )
I have created some tables with data on it.
When I run a "select * from table" query, I get the result,
but when I try to run "select * from table where..."
I get these lines:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
(Now I am waiting and nothing happens.)
Does anyone have an idea what I can do to fix it?
Thanks for the help!

Forcing a reduce phase or a second map reduce job in hive

I am running a hive query of the following form:
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT /*+ MAPJOIN(...) */ * FROM ...
Because of the MAPJOIN, the result does not require a reduce phase. The map phase uses about 5000 mappers, and it ends up taking about 50 minutes to complete the job. It turns out that most of this time is spent copying those 5000 files to the local directory.
To try to optimize this, I replaced SELECT * ... with SELECT DISTINCT * ... (I know in advance that my results are already distinct, so this doesn't actually change my result), in order to force a second map reduce job. The first map reduce job is the same as before, with 5000 mappers and 0 reducers. The second map reduce job now has 5000 mappers and 3 reducers. With this change, there are now only 3 files to be copied, rather than 5000, and the query now only takes a total of about 20 minutes.
Since I don't actually need the DISTINCT, I'd like to know whether my query can be optimized in a less kludge-y way, without using DISTINCT?
What about wrapping your query in another SELECT, and maybe a useless WHERE clause, to make sure it kicks off a job?
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT *
FROM (
SELECT /*+ MAPJOIN(...) */ *
FROM ..
) x
WHERE 1 = 1
I'll run this when I get a chance tomorrow and delete this part of the answer if it doesn't work. If you get to it before me then great.
Another option would be to take advantage of the virtual columns for file name and line number to force distinct results. This complicates the query and introduces two meaningless columns, but has the advantage that you no longer have to know in advance that your results will be distinct. If you can't abide the useless columns, wrap it in another SELECT to remove them.
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT {{enumerate every column except the virtual columns}}
FROM (
SELECT DISTINCT /*+ MAPJOIN(...) */ *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM ..
) x
Both solutions are more kludge-y than what you came up with, but have the advantage that you are not limited to queries with distinct results.
We get another option if you aren't limited to Hive. You could get rid of the LOCAL and write the results to HDFS, which should be fast even with 5000 mappers. Then use hadoop fs -getmerge /result/dir/on/hdfs/ to pull the results into the local filesystem. This unfortunately reaches out of Hive, but maybe setting up a two step Oozie workflow is acceptable for your use case.
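Sketched end to end (paths are illustrative; hadoop fs -getmerge takes the HDFS source directory plus a local destination file):

```sql
-- Step 1, in Hive: write to an HDFS directory instead of LOCAL
INSERT OVERWRITE DIRECTORY '/result/dir/on/hdfs'
SELECT /*+ MAPJOIN(...) */ * FROM ...;
```

Step 2, from the shell: hadoop fs -getmerge /result/dir/on/hdfs/ result.txt concatenates the part files into a single local file.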
