I have created a new VM with Hadoop on it
(used this guide: http://alanxelsys.com/2014/02/01/hadoop-2-2-single-node-installation-on-centos-6-5/ )
I have created some tables with data in them.
When I run a "select * from table" query, I get the result.
When I try to run "select * from table where ...", I get these lines:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
(Now I am waiting and nothing happens.)
Anyone have an idea what I can do to fix it?
Thanks for the help!
I am using hive-1.1.0.
Submitting queries to HiveServer2 via Beeline which are read-only and contain no predicates will cause HiveServer2 to try to read the data from HDFS itself without spawning a MapReduce job:
SELECT * FROM my_table LIMIT 100;
For very large datasets this can cause HiveServer2 to hold onto a lot of memory leading to long garbage collection pauses. Adding a "fake" predicate will cause HiveServer2 to run the MapReduce job as desired; e.g.
SELECT * FROM my_table WHERE (my_id > 0 OR my_id <= 0) LIMIT 100;
By "fake", I mean a predicate that does not matter; the above example predicate will always be true.
Is there a setting to force HiveServer2 to always run the MapReduce job without having to add bogus predicates?
I am not talking about when HiveServer2 determines it can run a MapReduce job locally; I have this disabled entirely:
> SET hive.exec.mode.local.auto;
+----------------------------------+--+
|               set                |  |
+----------------------------------+--+
| hive.exec.mode.local.auto=false  |  |
+----------------------------------+--+
but queries without predicates are still read entirely by HiveServer2, causing issues.
Any guidance much appreciated.
Thanks!
Some select queries can be converted to a single FETCH task, without map-reduce at all.
This behavior is controlled by hive.fetch.task.conversion configuration parameter.
Possible values are: none, minimal and more.
If you want to disable fetch task conversion, set it to none:
set hive.fetch.task.conversion=none;
minimal will trigger a FETCH-only task for:
SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.
more will trigger a FETCH-only task for:
SELECT with any kind of expression, including UDFs, FILTER, LIMIT only
(including TABLESAMPLE and virtual columns).
Read also about the hive.fetch.task.conversion.threshold parameter, and find more details here: Hive Configuration Properties
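For the question above, a minimal sketch (my_table is a placeholder; the threshold default shown is from recent Hive versions and may differ in yours):
-- force MapReduce even for predicate-free reads, so HiveServer2 does not scan HDFS itself
set hive.fetch.task.conversion=none;
SELECT * FROM my_table LIMIT 100;  -- now launches a MapReduce job
-- or keep the conversion but cap it: inputs larger than this many bytes fall back to MapReduce
set hive.fetch.task.conversion.threshold=1073741824;  -- 1 GB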
I am trying to run a join query in Hive on the following two tables:
select b.location from user_activity_rule a inner join user_info_rule b where a.uid=b.uid and a.cancellation=true;
Query ID = username_20180530154141_0a187506-7aca-442a-8310-582d335ad78d
Total jobs = 1
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Execution log at: /tmp/username/username_20180530154141_0a187506-7aca-442a-8310-582d335ad78d.log
2018-05-30 03:41:51 Starting to launch local task to process map join; maximum memory = 2058354688
Execution failed with exit status: 2
Obtaining error information
Task failed!
Task ID:
Stage-4
Logs:
/tmp/username/hive.log
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
What does this error mean, and how do I resolve it?
This happens when the job you are trying to run runs out of memory. One way of overcoming this is to use this command:
set hive.auto.convert.join = false;
This disables the automatic conversion of the join into a map-side join, which is the step that runs out of memory in the local task.
This can also happen when the number of concurrent users is high (at peak times). Alternatively, you can run the query when not many users are active; there should then be enough free memory for your job to consume what it needs. This alternative can be adopted when there are few nodes in the Dev environment and you are sure that there will be no memory issues in production.
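If you would rather keep map joins, a couple of knobs are worth trying instead (a sketch; defaults vary by Hive version):
-- raise the fraction of heap the local map-join task may use (default 0.90)
set hive.mapjoin.localtask.max.memory.usage=0.95;
-- only auto-convert a join when the small table is below this many bytes (default 25000000)
set hive.mapjoin.smalltable.filesize=10000000;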
Instead of WHERE, you can use an ON clause as below and try:
SELECT b.location FROM user_activity_rule a JOIN user_info_rule b ON(a.uid=b.uid) WHERE a.cancellation="true";
First of all, make sure the HADOOP_USER that you used to run SQL can run MapReduce.
Then, use SQL like the following:
set hive.auto.convert.join = false;
select b.location
from user_activity_rule a
inner join user_info_rule b
  on a.uid = b.uid
where a.cancellation = true;
To test a scenario, I executed the statements below in Hive, and now none of my normal queries run; every execution, on any table, returns the same error.
The commands I executed initially were:
set dfs.block.size=1073741824;
select * from l_rate where CONVERSION_START_DATE='20160701'
Later I executed the following:
set dfs.block.size=${hiveconf:test}
select * from ${hiveconf:test} limit 10
However, I stopped the testing above and returned to my normal tasks. Now I can't run even normal queries:
select * from country limit 10;
Now I am getting the error below for all executions on different tables:
FAILED: RuntimeException java.lang.NumberFormatException: For input
string: "${hiveconf:test} select * from ${hiveconf:test} limit 10
Please help me get rid of this error! I even logged out of my session in Qubole and reconnected, but it doesn't help.
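For what it's worth, the error text suggests the set command without a terminating semicolon swallowed the following query as the value of dfs.block.size. A sketch of how that might be cleared within a session:
-- restore all session-level overrides to their defaults
reset;
-- or explicitly put back a valid numeric value (128 MB here)
set dfs.block.size=134217728;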
MapReduce Jobs on Hive Statements
When I run the following statement in Hive:
hive> SELECT * FROM USERS LIMIT 100;
It doesn't launch a MapReduce job, because we are selecting everything from the table and only limiting the number of records it returns.
But when I do the following:
hive> select age,occupation from users limit 100;
it actually kicks off a MapReduce job.
Does that mean applying column-level projection requires a MapReduce job, even though I have not applied any kind of filter?
Whenever you run a plain 'select *', a fetch task is created rather than a MapReduce task; it just dumps the data as-is without doing anything to it. This is equivalent to:
hadoop fs -cat $file_name
Whereas whenever you do a 'select column', a map job internally projects that particular column and gives the output.
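Whether projection also qualifies for a fetch task depends on hive.fetch.task.conversion (see the earlier answer); a sketch of both behaviors:
set hive.fetch.task.conversion=more;
select age, occupation from users limit 100;  -- FETCH-only, no MapReduce job
set hive.fetch.task.conversion=minimal;
select age, occupation from users limit 100;  -- launches a map-only job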
When you write select * from table_name, the whole file is simply read, while if you select a column, a map-only job is launched; no reduce task is needed, since we are selecting the whole column.
Select * from table_name; --> will not launch an MR job
Select column from table_name; --> will launch a map-only job (no reducers)
Select MAX(column_name) from table_name; --> will launch a full MR job (map and reduce)
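You can verify which plan Hive chooses with EXPLAIN, for example:
EXPLAIN SELECT MAX(column_name) FROM table_name;
-- the output includes a Reduce Operator Tree for the aggregation,
-- while the map-only and fetch cases show none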
I am running a hive query of the following form:
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT /*+ MAPJOIN(...) */ * FROM ...
Because of the MAPJOIN, the result does not require a reduce phase. The map phase uses about 5000 mappers, and it ends up taking about 50 minutes to complete the job. It turns out that most of this time is spent copying those 5000 files to the local directory.
To try to optimize this, I replaced SELECT * ... with SELECT DISTINCT * ... (I know in advance that my results are already distinct, so this doesn't actually change my result), in order to force a second map reduce job. The first map reduce job is the same as before, with 5000 mappers and 0 reducers. The second map reduce job now has 5000 mappers and 3 reducers. With this change, there are now only 3 files to be copied, rather than 5000, and the query now only takes a total of about 20 minutes.
Since I don't actually need the DISTINCT, I'd like to know whether my query can be optimized in a less kludge-y way, without using DISTINCT?
What about wrapping your query in another SELECT, and maybe a useless WHERE clause, to make sure it kicks off a job?
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT *
FROM (
SELECT /*+ MAPJOIN(...) */ *
FROM ..
) x
WHERE 1 = 1
I'll run this when I get a chance tomorrow and delete this part of the answer if it doesn't work. If you get to it before me then great.
Another option would be to take advantage of the virtual columns for file name and line number to force distinct results. This complicates the query and introduces two meaningless columns, but has the advantage that you no longer have to know in advance that your results will be distinct. If you can't abide the useless columns, wrap it in another SELECT to remove them.
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT {{enumerate every column except the virtual columns}}
FROM (
SELECT DISTINCT /*+ MAPJOIN(...) */ *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM ..
) x
Both solutions are more kludge-y than what you came up with, but have the advantage that you are not limited to queries with distinct results.
We get another option if you aren't limited to Hive. You could get rid of the LOCAL and write the results to HDFS, which should be fast even with 5000 mappers. Then use hadoop fs -getmerge /result/dir/on/hdfs/ <local_file> to pull the results down into a single file on the local filesystem. This unfortunately reaches out of Hive, but maybe setting up a two-step Oozie workflow is acceptable for your use case.
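A minimal sketch of that two-step approach (paths and table names are placeholders):
-- step 1, in Hive: write to HDFS instead of the local filesystem
INSERT OVERWRITE DIRECTORY '/tmp/query_results'
SELECT /*+ MAPJOIN(small_table) */ *
FROM big_table JOIN small_table ON big_table.id = small_table.id;
-- step 2, outside Hive: merge the many output files while copying down
--   hadoop fs -getmerge /tmp/query_results local_results.txt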