Problems with Hive query in Cloudera - Hadoop

I can run all other queries in Hive, but when I do a join it just gets stuck:
hive> select count (*) from tab10 join tab1;
Warning: Map Join MAPJOIN[13][bigTable=tab10] in task 'Stage-2:MAPRED' is a cross product
Query ID = root_20160406145959_b57642e0-7499-41a0-914c-0004774fe4ac
Total jobs = 1
Execution log at: /tmp/root/root_20160406145959_b57642e0-7499-41a0-914c-0004774fe4ac.log
2016-04-06 03:00:03 Starting to launch local task to process map join; maximum memory = 2058354688
2016-04-06 03:00:03 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2016-04-06 03:00:03 Uploaded 1 File to: file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (280 bytes)
2016-04-06 03:00:03 End of local task; Time Taken: 0.562 sec.
It's hung at this point and doesn't spawn any MapReduce tasks at all. What could be wrong?
I did see this in hive.log.
2016-04-06 15:00:00,124 INFO [main]: ql.Driver (Driver.java:launchTask(1643)) - Starting task [Stage-5:MAPREDLOCAL] in serial mode
2016-04-06 15:00:00,125 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(159)) - Generating plan file file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10006/plan.xml
2016-04-06 15:00:00,233 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(288)) - Executing: /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/bin/hadoop jar /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/jars/hive-exec-1.1.0-cdh5.5.2.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10006/plan.xml -jobconffile file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10007/jobconf.xml
There is nothing beyond this. Does anyone know how to fix this?

You need to increase the heap memory used by the Hadoop JVM. Open the mapred-site.xml file and add this property:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
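If editing mapred-site.xml is not convenient, the same property can usually be overridden per session from the Hive shell (assuming the cluster does not restrict overriding it):
hive> SET mapred.child.java.opts=-Xmx1024m;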

Related

Not able to insert data into HBase table using Pig

If I run:
data = LOAD 'hdfs:/user/zzz/Pokemon.csv' USING PigStorage(',') AS (serial_no:int,name:chararray,type1:chararray,type2:chararray,total:int,hp:int,attack:int,defence:int,sp_attk:int,sp_def:int,speed:int);
the data loads successfully, as I can see by dumping it. But after that, when I run:
STORE data INTO 'hbase://pokemons' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:name,cf:type1,cf:type2,cf:total,cf:hp,cf:attack,cf:defence,cf:sp_attk,cf:sp_def,cf:speed');
the problem arises, as you can see below:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
3.2.1 0.17.0 zzz 2019-12-11 12:57:34 2019-12-11 12:57:43 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1576044193401_0008 data MAP_ONLY Message: Job failed! hbase://pokemons,
Input(s):
Failed to read data from "hdfs:/user/zzz/Pokemon.csv"
Output(s):
Failed to produce result in "hbase://pokemons"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1576044193401_0008
2019-12-11 12:57:43,115 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
I'm not exactly sure what happened, but I do know that current Pig is not tested against Hadoop version 3 or higher. This is tracked in https://issues.apache.org/jira/browse/PIG-5253

Hive terminal hangs on inserting data using INSERT command

I am trying to insert data into an external Hive table in Hive 1.2 from another table using the INSERT command:
INSERT INTO perf_tech_security_detail_extn_fltr partition
(created_date)
SELECT seq_num,
action,
sde_timestamp,
instrmnt_id,
dm_lstupddt,
grnfthr_ind,
grnfthr_tl_dt,
grnfthr_frm_dt,
ftc_chge_rsn,
Substring (sde_timestamp, 0, 10)
FROM tech_security_detail_extn_fltr
WHERE Substring (sde_timestamp, 0, 10) = '2018-05-02';
But the Hive shell hangs at:
hive> SET hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.enforce.bucketing=true;
hive> INSERT INTO PERF_TECH_SECURITY_DETAIL_EXTN_FLTR partition (created_date) select seq_num, action, sde_timestamp, instrmnt_id, dm_lstupddt, grnfthr_ind, grnfthr_tl_dt, grnfthr_frm_dt, ftc_chge_rsn, substring (sde_timestamp,0,10) from TECH_SECURITY_DETAIL_EXTN_FLTR where substring (sde_timestamp,0,10)='2018-05-02';
Query ID = tcs_20180503215950_585152fd-ecdc-4296-85fc-d464fef44e68
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 100
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
The Hive logs are as below:
2018-05-03 21:28:01,703 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) -
2018-05-03 21:28:01,716 ERROR [main]: mr.ExecDriver (ExecDriver.java:execute(400)) - yarn
2018-05-03 21:28:01,758 INFO [main]: client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at /0.0.0.0:8032
2018-05-03 21:28:01,903 INFO [main]: fs.FSStatsPublisher (FSStatsPublisher.java:init(49)) - created : hdfs://localhost:9000/datanode/nifi_data/perf_tech_security_detail_extn_fltr/.hive-staging_hive_2018-05-03_21-27-59_433_5606951945441160381-1/-ext-10001
2018-05-03 21:28:01,960 INFO [main]: client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at /0.0.0.0:8032
2018-05-03 21:28:01,965 INFO [main]: exec.Utilities (Utilities.java:getBaseWork(389)) - PLAN PATH = hdfs://localhost:9000/tmp/hive/tcs/576b0aa3-059d-4fb2-bed8-c975781a5fce/hive_2018-05-03_21-27-59_433_5606951945441160381-1/-mr-10003/303a392c-2383-41ed-bc9d-78d37ae49f39/map.xml
2018-05-03 21:28:01,967 INFO [main]: exec.Utilities (Utilities.java:getBaseWork(389)) - PLAN PATH = hdfs://localhost:9000/tmp/hive/tcs/576b0aa3-059d-4fb2-bed8-c975781a5fce/hive_2018-05-03_21-27-59_433_5606951945441160381-1/-mr-10003/303a392c-2383-41ed-bc9d-78d37ae49f39/reduce.xml
2018-05-03 21:28:22,009 INFO [main]: ipc.Client (Client.java:handleConnectionTimeout(832)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); maxRetries=45
2018-05-03 21:28:42,027 INFO [main]: ipc.Client (Client.java:handleConnectionTimeout(832)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); maxRetries=45
..........................................................
I have also tried to insert data normally into an unpartitioned table, but even that is not working:
INSERT INTO emp values (1 ,'ROB')
I am not sure why you have not written TABLE before the table name, like this:
INSERT INTO TABLE emp
VALUES (1, 'ROB'), (2, 'Shailesh');
Write the commands properly and they should work.
Resolved
MapReduce was not running due to a wrong framework name, so I edited the property mapreduce.framework.name in mapred-site.xml.
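For reference, a minimal mapred-site.xml entry for that property (assuming the jobs should run on a YARN cluster) would look like:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>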
In a cluster environment, the property yarn.resourcemanager.hostname is key to avoiding this problem. It worked great for me; a minimal yarn-site.xml sketch follows the monitoring commands below.
Use these commands to monitor YARN:
yarn application -list
yarn node -list
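A minimal sketch of the yarn-site.xml entry mentioned above (the hostname is a placeholder; substitute your ResourceManager's host and set it on every node):
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager.example.com</value>
</property>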

Hadoop stuck at “running job”

I want to run the Hadoop word count program from the docs, but the program gets stuck at the “Running job” step:
16/09/02 10:51:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/02 10:51:13 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/09/02 10:51:13 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/09/02 10:51:14 INFO input.FileInputFormat: Total input paths to process : 1
16/09/02 10:51:14 INFO mapreduce.JobSubmitter: number of splits:2
16/09/02 10:51:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1472783047951_0003
16/09/02 10:51:14 INFO impl.YarnClientImpl: Submitted application application_1472783047951_0003
16/09/02 10:51:14 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1472783047951_0003/
16/09/02 10:51:14 INFO mapreduce.Job: Running job: job_1472783047951_0003
The tracking URL http://hadoop-master:8088/proxy/application_1472783047951_0003/ shows the job, and an ApplicationMaster is running on http://hadoop-slave2:8042.
Since it gets stuck on WordCount, it also gets stuck on Hive:
hive (default)> select a, b, count(1) as cnt from newtb group by a, b;
Query ID = hadoop_20160902110124_d2b2680b-c493-4986-aa84-f65794bfd8e4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1472783047951_0004, Tracking URL = http://hadoop-master:8088/proxy/application_1472783047951_0004/
Kill Command = /opt/hadoop-2.6.4/bin/hadoop job -kill job_1472783047951_0004
There is nothing wrong with select *:
hive (default)> select * from newtb;
OK
1 2 3
1 3 4
2 3 4
5 6 7
8 9 0
1 8 3
Time taken: 0.101 seconds, Fetched: 6 row(s)
So I think there is something wrong with MapReduce. There is enough disk and memory. How can I solve it?
You are having issues because the ApplicationMaster is unable to start containers and run the job. First try restarting your system; if that doesn't change anything, you have to adjust the memory allocations in yarn-site.xml and mapred-site.xml. Go with basic memory settings (a sketch follows below).
This guide is useful:
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#yarn-configuration_1
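A minimal sketch of such basic memory settings, assuming a small node with roughly 4 GB of RAM available to YARN (adjust the values to your hardware):
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>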
As I was running Hadoop on Ubuntu as a guest on VMware, I just increased the amount of RAM allocated to Ubuntu from 2 GB to 4 GB; after that it could continue and finish the job.

Cloudera Oozie sqoop2 job hangs, running forever with “Heart beat Heart beat”

I am trying to run Sqoop jobs in parallel using Oozie, but two jobs are stuck after 95% and the other two are in the ACCEPTED state. I have already increased the YARN maximum resource memory and also added
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>50</value>
</property>
to mapred-site.xml, but nothing helped. Please help.
YARN Cluster Metrics:
Apps Submitted 4
Apps Pending 2
Apps Running 2
Apps Completed 0
Containers Running 4
Memory Used 10GB
Memory Total 32GB
Memory Reserved 0B
VCores Used 4
VCores Total 24
VCores Reserved 0
Active Nodes 4
Decommissioned Nodes 0
Lost Nodes 0
Unhealthy Nodes 0
Rebooted Nodes 0
----------
Sysout Log
========================================================================
3175 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
3198 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.5-cdh5.2.0
3212 [main] WARN org.apache.sqoop.tool.BaseSqoopTool - Setting your password on the command-line is insecure. Consider using -P instead.
3213 [main] INFO org.apache.sqoop.tool.BaseSqoopTool - Using Hive-specific delimiters for output. You can override
3213 [main] INFO org.apache.sqoop.tool.BaseSqoopTool - delimiters with --fields-terminated-by, etc.
3224 [main] WARN org.apache.sqoop.ConnFactory - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
3280 [main] INFO org.apache.sqoop.manager.oracle.OraOopManagerFactory - Data Connector for Oracle and Hadoop is disabled.
3293 [main] INFO org.apache.sqoop.manager.SqlManager - Using default fetchSize of 1000
3297 [main] INFO org.apache.sqoop.tool.CodeGenTool - Beginning code generation
3951 [main] INFO org.apache.sqoop.manager.OracleManager - Time zone has been set to GMT
4023 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM PT_PRELIM_FINDING_V t WHERE 1=0
4068 [main] INFO org.apache.sqoop.orm.CompilationManager - HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce
5925 [main] INFO org.apache.sqoop.orm.CompilationManager - Writing jar file: /tmp/sqoop-nobody/compile/0dab11f6545d8ef69d6dd0f6b9041a50/PT_PRELIM_FINDING_CYTOGEN_V.jar
5937 [main] INFO org.apache.sqoop.mapreduce.ImportJobBase - Beginning import of PT_PRELIM_FINDING_V
5962 [main] INFO org.apache.sqoop.manager.OracleManager - Time zone has been set to GMT
5981 [main] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies.
6769 [main] INFO org.apache.sqoop.mapreduce.db.DBInputFormat - Using read commited transaction isolation
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Thanks #abeaamase.
I asked our DBA to increase the Oracle database maximum processes to 750 and the maximum session pool to around 1.5 times the process limit, i.e. 1125 (roughly the change sketched below).
This solved the issue; it had nothing to do with YARN memory. Unfortunately, in sqoop2 this exception is not handled.
Please feel free to add more answers if you feel this explanation is not appropriate.
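For anyone who needs to make the DBA-side change themselves, a rough sketch of the Oracle commands involved (assuming an spfile-managed instance; the exact values, and the restart, depend on your environment):
-- Raise the process and session limits; these take effect after an instance restart.
ALTER SYSTEM SET processes = 750 SCOPE = SPFILE;
ALTER SYSTEM SET sessions = 1125 SCOPE = SPFILE;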

Loading data from HDFS to HBASE

I'm using Apache Hadoop 1.1.1 and Apache HBase 0.94.3. I want to load data from HDFS into HBase.
I wrote a Pig script for this purpose. First I created the table in HBase, and then wrote the Pig script to load the data from HDFS into HBase. But it is not loading the data into the HBase table, and I don't know where it's going wrong.
Below is the command used to create the HBase table (in the HBase shell):
create 'mydata','mycf'
Below is the Pig script to load the data from HDFS into HBase:
A = LOAD '/user/hduser/Dataparse/goodrec1.txt' USING PigStorage(',') as (c1:int, c2:chararray,c3:chararray,c4:int,c5:chararray);
STORE A INTO 'hbase://mydata'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'mycf:c1,mycf:c2,mycf:c3,mycf:c4,mycf:c5');
After executing the script, it says:
2014-04-29 16:01:06,367 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-04-29 16:01:06,376 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2014-04-29 16:01:06,382 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.1 0.12.0 hduser 2014-04-29 15:58:07 2014-04-29 16:01:06 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201403142119_0084 A MAP_ONLY Message: Job failed! Error - JobCleanup Task Failure, Task: task_201403142119_0084_m_000001 hbase://mydata,
Input(s):
Failed to read data from "/user/hduser/Dataparse/goodrec1.txt"
Output(s):
Failed to produce result in "hbase://mydata"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201403142119_0084
2014-04-29 16:01:06,382 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
Please help me see where I'm going wrong.
You have specified too many columns in the output to HBase. You have 5 input columns and 5 output column mappings, but HBaseStorage treats the first column of the relation as the row key, so there should only be 4 column mappings in the output.
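A sketch of the corrected STORE, keeping c1 as the row key and mapping only the remaining four fields (names taken from the original script):
STORE A INTO 'hbase://mydata'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'mycf:c2,mycf:c3,mycf:c4,mycf:c5');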
