Hive cross join fails on local map join - hadoop

Is there a direct way to address the following error, or, more generally, a better way to use Hive to get the join I need? Output to a stored table isn't a requirement; I'd be content with an INSERT OVERWRITE LOCAL DIRECTORY to a CSV.
I am trying to perform the following cross join. ipintegers is a 9 GB table, and geoiplite is 270 MB.
CREATE TABLE iplatlong_sample AS
SELECT ipintegers.networkinteger, geoiplite.latitude, geoiplite.longitude
FROM geoiplite
CROSS JOIN ipintegers
WHERE ipintegers.networkinteger >= geoiplite.network_start_integer AND ipintegers.networkinteger <= geoiplite.network_last_integer;
I use CROSS JOIN on ipintegers instead of geoiplite because I have read that the rule is for the smaller table to be on the left, larger on the right.
The map and reduce stages complete to 100% according to Hive, but then:
2015-08-01 04:45:36,947 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8767.09 sec
MapReduce Total cumulative CPU time: 0 days 2 hours 26 minutes 7 seconds 90 msec
Ended Job = job_201508010407_0001
Stage-8 is selected by condition resolver.
Execution log at: /tmp/myuser/.log
2015-08-01 04:45:38 Starting to launch local task to process map join; maximum memory = 12221153280
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID: Stage-8
Logs:
/tmp/myuser/hive.log
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
MapReduce Jobs Launched:
Job 0: Map: 38 Reduce: 1 Cumulative CPU: 8767.09 sec HDFS Read: 9438495086 HDFS Write: 8575548486 SUCCESS
My hive config:
SET hive.mapred.local.mem=40960;
SET hive.exec.parallel=true;
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate = true;
SET hive.optimize.skewjoin = true;
SET mapred.compress.map.output=true;
SET hive.stats.autogather=false;
I have varied SET hive.auto.convert.join between true and false but with the same result.
Here are the errors in the output log from /tmp/myuser/hive.log
$ tail -12 -f /tmp/myuser/hive.log
2015-08-01 07:30:46,086 ERROR exec.Task (SessionState.java:printError(419)) - Execution failed with exit status: 3
2015-08-01 07:30:46,086 ERROR exec.Task (SessionState.java:printError(419)) - Obtaining error information
2015-08-01 07:30:46,087 ERROR exec.Task (SessionState.java:printError(419)) -
Task failed!
Task ID:
Stage-8
Logs:
2015-08-01 07:30:46,087 ERROR exec.Task (SessionState.java:printError(419)) - /tmp/myuser/hive.log
2015-08-01 07:30:46,087 ERROR mr.MapredLocalTask (MapredLocalTask.java:execute(268)) - Execution failed with exit status: 3
2015-08-01 07:30:46,094 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
I am running the Hive client on the master, a Google Cloud Platform n1-highmem-8 instance (8 CPUs, 52 GB), and the workers are n1-highmem-4 (4 CPUs, 26 GB), but I suspect that after MAP and REDUCE a local join (as implied by the log) takes place on the master. Regardless, in bdutil I configured the JAVAOPTS for the worker nodes (n1-highmem-4) as well.
SOLUTION EDIT: The solution is to organize the range data into a range tree.

I don't think it is possible to perform this kind of cross join by brute force: just multiply the row counts and it gets a little out of hand. You need some optimizations, which I don't think Hive is capable of yet.
But this problem can actually be solved in O(N1 + N2) time, provided your data is sorted (which Hive can do for you): you walk both lists simultaneously; on each step you take an IP integer, add any intervals that start at or before it, drop the intervals that have already ended, and emit the matching tuples. Pseudocode:
active = []                                            # intervals currently covering x
ipintegers = iterator(ipintegers_sorted_file)          # IP integers, sorted ascending
intervals = iterator(intervals_sorted_on_start_file)   # intervals, sorted by start
current = next(intervals, None)
for x in ipintegers:
    # pull in every interval whose start is <= x
    while current is not None and current.start <= x:
        active.append(current)
        current = next(intervals, None)
    # drop intervals that have already ended
    active = [i for i in active if i.end >= x]
    # every remaining interval covers x
    for i in active:
        output_match(i, x)
Now, if you have an external script/UDF function that knows how to read the smaller table and gets ip integers as input and spits matching tuples as output, you can use hive and SELECT TRANSFORM to stream the inputs to it.
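For illustration, a rough sketch of what that could look like, assuming a hypothetical script match_ranges.py that implements the sweep above, reads the exported ranges from a local file, and prints tab-separated networkinteger, latitude, longitude rows (the file names and the ORDER BY subquery are assumptions, not tested code):
-- ship the hypothetical sweep script and an exported copy of the small ranges table to each task
ADD FILE match_ranges.py;
ADD FILE geoiplite_sorted.csv;
-- stream sorted IP integers through the script; untyped TRANSFORM output columns default to STRING
CREATE TABLE iplatlong_sample AS
SELECT TRANSFORM (networkinteger)
USING 'python match_ranges.py geoiplite_sorted.csv'
AS (networkinteger, latitude, longitude)
FROM (SELECT networkinteger FROM ipintegers ORDER BY networkinteger) sorted_ips;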
Or you can probably just run this algorithm on a local machine with two input files, because this is just O(N), and even 9 gb of data is very doable.
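A hedged sketch of that local route, assuming both tables have been exported to CSV with the integer key in the first column (the file names and the match_ranges.py script are made up for illustration):
# sort both extracts numerically on their first column
sort -t, -k1,1n ipintegers.csv > ipintegers_sorted.csv
sort -t, -k1,1n geoiplite.csv > geoiplite_sorted.csv
# run the sweep over the two sorted files and collect the matching tuples
python match_ranges.py ipintegers_sorted.csv geoiplite_sorted.csv > iplatlong_sample.csv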

Related

how to get mapreduce job number from hive server

If I use the Hive CLI, the log is:
Total MapReduce jobs = 1
Stage-1 is selected by condition resolver.
Launching Job 1 out of 1
But in HiveServer or Beeline, the log is:
INFO : Stage-1 is selected by condition resolver.
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
How can I get the job number? I need to calculate the job progress and print it.

How to skip failed map tasks in hadoop streaming

I am running a hadoop streaming mapreduce job which has 26895 map tasks in total. However, one task that deals with a certain input always fails. So I set mapreduce.map.failures.maxpercent=1 and want to skip the failed tasks, but the job was still not successful.
Kind % Complete Num Tasks Pending Running Complete Killed Failed/Killed Task Attempts
map 100.00% 26895 0 0 26894 1 8 / 44
reduce 100.00% 1 0 0 0 1 0 / 1
What can I do to skip this?
There is a configuration available for this.
Specify mapred.max.map.failures.percent and mapred.max.reduce.failures.percent in mapred-site.xml to set the failure threshold. Both default to 0.
These properties are deprecated now; use the following properties for this purpose instead:
mapreduce.map.failures.maxpercent
mapreduce.reduce.failures.maxpercent
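If editing mapred-site.xml isn't convenient, the same properties can be passed per job as generic options on the streaming command line. A sketch, assuming a Hadoop 1.x layout (the jar path, input/output paths, mapper, and reducer are placeholders):
# allow up to 1% of map and reduce tasks to fail without failing the whole job
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D mapreduce.map.failures.maxpercent=1 \
  -D mapreduce.reduce.failures.maxpercent=1 \
  -input /path/to/input \
  -output /path/to/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
Note that the -D generic options have to appear before the streaming-specific options.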

Hive takes long time to launch hadoop job

I am a newbie to Hadoop and Hive. I am using Hive integrated with Hadoop to execute queries. When I submit any query, the following log messages appear on the console:
Hive history file=/tmp/root/hive_job_log_root_28058#hadoop2_201203062232_1076893031.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004
Kill Command = //opt/hadoop_installation/hadoop-0.20.2/bin/../bin/hadoop job -kill job_201203062223_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-03-06 22:32:26,707 Stage-1 map = 0%, reduce = 0%
2012-03-06 22:32:29,716 Stage-1 map = 100%, reduce = 0%
2012-03-06 22:32:38,748 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203062223_0004
MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 HDFS Read: 8107686 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
The 'Starting Job' line is what starts the hadoop job (that's what I believe). It takes a long time to start the job; once this line has executed, the map and reduce operations run swiftly. My questions are:
Is there any way to make the launch of the hadoop job faster? Is it possible to skip this phase?
Where does the value of 'Kill Command' come from?
Please let me know if any further inputs are required.
1) Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004
ANS: your HQL query > translated to a hadoop job > hadoop does some background work (planning resources, data locality, the stages needed to process the query, launch configs, job and task id generation, etc.) > launch mappers > sort && shuffle > reduce (aggregation) > result to HDFS.
The above flow is part of the hadoop job life cycle, so none of it can be skipped.
You can monitor your job status with the job id (job_201203062223_0004) in the JobTracker web UI at http://namenode:port/jobtracker.jsp.
2) Kill Command = $HADOOP_HOME/bin/hadoop job -kill job_201203062223_0004
Ans: you are shown these lines before your mappers launch because hadoop works on big data, and a job may take more or less time depending on your dataset size; so if at any point you want to kill the job, this line tells you how. It is shown for every hadoop job, and printing this info line does not take much time.
Some additions with respect to your comments:
Hive is not meant for low-latency jobs; immediate, real-time results are not possible.
(Please check Hive's intended purposes in the Apache Hive documentation.)
The launching overhead (see question 1: hadoop does some background work) exists in Hive and cannot be avoided.
Even for small datasets, this launching overhead is there in hadoop.
PS: if you really expect quick, in-time results, please look at Shark.
First, Hive is a tool that replaces your MapReduce work with HQL. In the background it has lots of predefined functions and MapReduce programs. When you run an HQL query, the Hadoop cluster does a lot of things: finding the data blocks, allocating tasks, and so on.
Second, you can kill a job with the hadoop shell command.
If your job id is AAAAA, you can execute the command below to kill it:
$HADOOP_HOME/bin/hadoop job -kill AAAAA
The launch of a hadoop job can be delayed due to unavailability of resources. If you use YARN, you can see that the job is in the ACCEPTED state but not yet running; this means some other ongoing job has consumed all your resources and the new query is waiting to run.
You can kill the older job using the hadoop job -kill <job_id> command or wait for it to finish.
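For example, to see what is queued and what is currently holding the resources on YARN:
# applications waiting for resources
yarn application -list -appStates ACCEPTED
# applications currently running (and holding the resources)
yarn application -list -appStates RUNNING
# a YARN application can also be killed directly
yarn application -kill <application_id>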

hive collect_set crashes query

I've got the following table:
hive> describe tv_counter_stats;
OK
day string
event string
query_id string
userid string
headers string
And I want to perform the following query:
hive -e 'SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
However, this query fails. But the following query works fine:
hive -e 'SELECT
day,
event,
query_id,
COUNT(1) AS count
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
where I remove the collect_set call. So my question: does anybody have an idea why collect_set might fail in this case?
UPDATE: Error message added:
Diagnostic Messages for this Task:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 10.49 sec HDFS Read: 109136387 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 10 seconds 490 msec
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
Error: GC overhead limit exceeded
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
Error: GC overhead limit exceeded
UPDATE 2:
I altered the query so that it now looks like this:
hive -e '
SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC";
SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
However, then I get the following error:
Diagnostic Messages for this Task:
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
This is probably a memory problem, since collect_set aggregates data in memory.
Try increasing the heap size and enabling concurrent GC (by setting Hadoop's mapred.child.java.opts to e.g. -Xmx1g -XX:+UseConcMarkSweepGC).
This answer has more information about "GC overhead limit" error.
I had the same exact problem and came across this question, so I thought I'd share the solution I found.
The underlying problem is most likely that Hive is trying to do the aggregation on the mapper side, and the heuristics it uses to manage the in-memory hashmaps for that approach are thrown off by data that is "wide but shallow" -- i.e. in your case, if there are very few user_id values per day/event/query_id group.
I found an article that explains various ways to address this issue, but most of them are just optimizations to the full-out nuclear option: disable mapper-side aggregations entirely.
Using set hive.map.aggr = false; should do the trick.
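Applied to the query from the question, that looks roughly like this (the original query with only the setting added up front):
hive -e '
SET hive.map.aggr = false;
SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv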

Hive - Queries on Partitions return nothing

I have a table that is being partitioned by a specific start date (ds). I can query the latest partition (the previous day's data) and it will use the partition fine.
hive> select count(1) from vtc4 where ds='2012-11-01' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 46281957 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
151225
Time taken: 35.007 seconds
However, when I try to query earlier partitions, hive seems to read the partition fine, but does not return any results.
hive> select count(1) from vtc4 where ds='2012-10-31' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.64 sec HDFS Read: 37754168 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 640 msec
OK
0
Time taken: 29.07 seconds
However, if I tell hive to run the query against the date field inside the table itself, and don't use the partition, I get the correct result.
hive> select count(1) from vtc4 where date_started >= "2012-10-31 00:00:00" and date_started < "2012-11-01 00:00:00" ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 63 Reduce: 1 Cumulative CPU: 453.52 sec HDFS Read: 16420276606 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 minutes 33 seconds 520 msec
OK
123201
Time taken: 265.874 seconds
What am I missing here? I'm running Hadoop 1.0.3 and Hive 0.9. I'm pretty new to hive/hadoop, so any help would be appreciated.
Thanks.
EDIT 1:
hive> describe formatted vtc4 partition (ds='2012-10-31');
Partition Value: [2012-10-31 ]
Database: default
Table: vtc4
CreateTime: Wed Oct 31 12:02:24 PDT 2012
LastAccessTime: UNKNOWN
Protect Mode: None
Location: hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31
Partition Parameters:
transient_lastDdlTime 1351875579
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.191 seconds
The partition folders exist, but when I try to do a hadoop fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31 it says the file/directory does not exist. If I browse to that directory using the web interface, I can get into the folder as well as see the /part-m-000* files. If I do a fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01 it works fine.
Seems like either a permissions thing, or something funky with either Hive's or the namenode's metadata. Here's what I would try (a rough command sketch follows the steps):
copy the data in that partition to some other location in hdfs. You may need to do this as the hive or hdfs user, depending on how your permissions are set up.
alter table vtc4 drop partition (ds='2012-10-31');
alter table vtc4 add partition (ds='2012-10-31');
copy the data back into that partition on hdfs
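A sketch of those steps in commands, using the warehouse path from the question (the /tmp backup location is made up, and you may need to run this as the hive or hdfs user):
# 1. back up the partition's data first: dropping a managed-table partition deletes its files
hadoop fs -cp /user/hive/warehouse/vtc4/ds=2012-10-31 /tmp/vtc4_backup_ds_2012-10-31
# 2. drop and re-add the partition so the metastore re-registers it
hive -e "ALTER TABLE vtc4 DROP PARTITION (ds='2012-10-31');
ALTER TABLE vtc4 ADD PARTITION (ds='2012-10-31');"
# 3. copy the data back into the re-created partition directory
hadoop fs -cp /tmp/vtc4_backup_ds_2012-10-31/* /user/hive/warehouse/vtc4/ds=2012-10-31/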
Another thing with Hive partitions is that they sometimes don't get registered in the metadata system when created outside of Hive (e.g. from Spark SQL). You can also try MSCK REPAIR TABLE vtc4; after any changes to the partitions so they are reflected correctly.
