Error When Loading Data from HDFS and Writing to HBase using Pig - hadoop

How do I load the output data of a MapReduce program, which is in HDFS, into HBase?
I tried running the following Pig commands to load the data from HDFS into HBase:
A = LOAD 'hdfs://b**/user/user1/development/hbase/output/part-00000' USING PigStorage('\t') as (strdata1:chararray, strdata2:chararray);
STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:strdata2');
where:
hdfs://b**/user/user1/development/hbase/output/part-00000 is the MapReduce output,
mydata is the HBase table name created, and
mycf is the column family name.
I am getting the following error:
ERROR 2017: Internal error creating job configuration.
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:673)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
at org.apache.pig.PigServer.execute(PigServer.java:1190)
at org.apache.pig.PigServer.access$100(PigServer.java:128)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1517)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:362)
at org.apache.pig.PigServer.executeBatch(PigServer.java:329)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:406)
at org.apache.pig.Main.main(Main.java:107)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:71)
at org.apache.hadoop.fs.Path.<init>(Path.java:45)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:476)
... 15 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at java.net.URI.checkPath(URI.java:1787)
at java.net.URI.<init>(URI.java:735)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)

Just remove the hbase:// scheme from the data source string, like this:
STORE A INTO 'mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:strdata2');
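For reference, the full script after that change might look like this (a minimal sketch; the input path and schema are taken from the question above, and HBaseStorage treats the first field of each tuple as the row key):
-- load the tab-delimited MapReduce output from HDFS
A = LOAD 'hdfs://b**/user/user1/development/hbase/output/part-00000'
    USING PigStorage('\t') AS (strdata1:chararray, strdata2:chararray);
-- store into the existing HBase table 'mydata'; strdata1 becomes the row key
STORE A INTO 'mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:strdata2');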

Related

JNI error while creating external table using gphdfs protocol : greenplum

1) Completed the "One-time HDFS Protocol Installation" using this link - http://gpdb.docs.pivotal.io/4360/admin_guide/load/topics/g-one-time-hdfs-protocol-installation.html#topic20
2) Copied the CSV file to HDFS at the path data/etl/ext01
3) Created the external table using the following command:
create external table orgData(orghk varchar(200),eff_datetime timestamp, source varchar(20), handle_id varchar(200), created_by_d varchar(100), created_datetime timestamp)
location ('gphdfs://<hostname>:8020/data/etl/ext01/part-r-00000-3eae416a-d0ff-4562-a762-d53469d42cd2.csv')
Format 'CSV' (DELIMITER ',')
However, after executing the command select * from orgData, I am getting the following error:
ERROR: external table gphdfs protocol command ended with error. Error: A JNI error has occurred, please check your installation and try again (seg1 slice1 <hostname2>:40000 pid=4977)
Detail:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.valid
Command: 'gphdfs://<hostname>:8040/data/etl/ext01/part-r-00000-3eae416a-d0ff-4562-a762-d53469d42cd2.csv'
External table orgdata, file gphdfs://<hostname>:8040/data/etl/ext01/part-r-00000-3eae416a-d0ff-4562-a762-d53469d42cd2.csv
Am I missing something?
Can you verify that you set JAVA_HOME and HADOOP_HOME on ALL segments and then restarted the cluster?
gpssh -f clusterHostfile -e 'egrep "(JAVA_HOME|HADOOP_HOME)" ~/.bashrc | wc -l'
You should see the number 2 from each host in the cluster.
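If the variables are missing, pushing them out and restarting might look like this (a hedged sketch; the paths and the hostfile name are placeholders for your environment):
# append the two variables to ~/.bashrc on every host (paths are hypothetical)
gpssh -f clusterHostfile -e 'echo "export JAVA_HOME=/usr/java/latest" >> ~/.bashrc'
gpssh -f clusterHostfile -e 'echo "export HADOOP_HOME=/usr/lib/hadoop" >> ~/.bashrc'
# restart the Greenplum cluster so the segments pick up the new environment
gpstop -r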

A lot of AlreadyBeingCreatedException and LeaseExpiredException when writing parquet from spark

I have several parallel Spark jobs doing the same thing; they work on separate input/output dirs, and at the end they write results to Parquet from a DataFrame, using one of the columns as a partitioner. Jobs with the biggest inputs often fail: some executors start to fail with the exceptions below, then a stage fails and starts recalculating the failed partition. If the number of failed stage attempts reaches 4, the whole job is cancelled (sometimes it doesn't reach 4, and the whole job finishes successfully).
Stages fail with these failure reasons (from the Spark UI):
org.apache.spark.shuffle.FetchFailedException
Connection closed by peer
I tried to find clues on the Internet, and it seems the reason may be speculative execution, but I don't enable it in Spark. Any other ideas what the reason could be?
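(For what it's worth, a quick way to confirm speculation really is off is to read the flag back from the job's own conf; a minimal sketch, assuming the sqlContext from the job code below:)
// speculation defaults to false; this just confirms nothing overrode it
String speculation = sqlContext.sparkContext().getConf().get("spark.speculation", "false");
System.out.println("spark.speculation = " + speculation);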
Spark job code:
sqlContext
.createDataFrame(finalRdd, structType)
.write()
.partitionBy(PARTITION_COLUMN_NAME)
.parquet(tmpDir);
Exceptions in executors:
16/09/14 11:04:06 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006023_0/partition=2/part-r-06023-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_1489398656_198] for client [10.117.102.72], because this file is already being created by [DFSClient_NONMAPREDUCE_-2049022202_200] on [10.117.102.15]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141105_0001_m_006489_0/partition=2/part-r-06489-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318361396): File does not exist. Holder DFSClient_NONMAPREDUCE_-1428957718_196 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141105_0001_m_006310_0/partition=2/part-r-06310-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_-419723425_199] for client [10.117.102.44], because this file is already being created by [DFSClient_NONMAPREDUCE_596138765_198] on [10.117.102.35]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_005877_0/partition=2/part-r-05877-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318359423): File does not exist. Holder DFSClient_NONMAPREDUCE_193375828_196 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_005621_0/partition=2/part-r-05621-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_498917218_197] for client [10.117.102.36], because this file is already being created by [DFSClient_NONMAPREDUCE_-578682558_197] on [10.117.102.16]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006311_0/partition=2/part-r-06311-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318359109): File does not exist. Holder DFSClient_NONMAPREDUCE_-60951070_198 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3284)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006215_0/partition=2/part-r-06215-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318359393): File does not exist. Holder DFSClient_NONMAPREDUCE_-331523575_197 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006311_0/partition=2/part-r-06311-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_1869576560_198] for client [10.117.102.44], because this file is already being created by [DFSClient_NONMAPREDUCE_-60951070_198] on [10.117.102.70]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
Spark UI: (screenshot of the failed stages omitted)
We use Spark 1.6 (CDH 5.8).

Load CSV data to HBase using pig or hive

Hi, I have created a Pig script which loads data into HBase. My CSV file is stored in HDFS at /hbase_tables/zip.csv.
Pig Script
register /home/hduser/pig-0.12.0/lib/pig-0.8.0-core.jar;
A = LOAD '/hbase_tables/zip.csv' USING PigStorage(',') as (id:chararray, zip:chararray, desc1:chararray, desc2:chararray, income:chararray);
STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('zip:zip,desc:desc1,desc:desc2,income:income');
When I execute it, it gives the error below.
Pig Stack Trace
ERROR 2017: Internal error creating job configuration.
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:667)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
at org.apache.pig.PigServer.execute(PigServer.java:1190)
at org.apache.pig.PigServer.access$100(PigServer.java:128)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1517)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:362)
at org.apache.pig.PigServer.executeBatch(PigServer.java:329)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:510)
at org.apache.pig.Main.main(Main.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:71)
at org.apache.hadoop.fs.Path.<init>(Path.java:45)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:470)
... 20 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
... 23 more
Please let me know how I can import the CSV data file into HBase, or if you have any alternate solution.
Seems like your problem is with the "Relative path" in the absolute URI: hbase://mydata_logs.
Are you sure the path is correct?
Probably the table mydata_logs does not exist. Start hbase shell and type list. Is your table mydata_logs on the list?
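If it isn't, you could create it first (a hedged example; the column family names here are guessed from the STORE statement in your question):
hbase shell
> create 'mydata', 'zip', 'desc', 'income'
> list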
I had the same task once and have a fully working solution (actually, I'm not sure about the commas in your third line of code):
%default hbase_home `echo \$HBASE_HOME`;
%default tmp '/user/alexander/tmp/users_dump/k14'
set zookeeper.znode.parent '/hbase-unsecure';
set hbase.zookeeper.quorum 'dmp-hbase.local';
register $hbase_home/lib/zookeeper-3.4.5.jar;
register $hbase_home/hbase-0.94.20.jar;
UsersHdfs = LOAD '$tmp' using PigStorage('\t', '-schema');
store UsersHdfs into 'hbase://user_test' using
org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'id:DEFAULT id:last_modified birth:year gender:female gender:male','-caster HBaseBinaryConverter'
);
That code works for me; maybe the problem is in your HBase configs.
You could provide your .csv file and we could talk about it in more detail.

SerDe problems with Hive 0.12 and Hadoop 2.2.0-cdh5.0.0-beta2

The title is a bit weird as I'm having difficulties narrowing down the problem. I used my solution on Hadoop 2.0.0-cdh4.4.0 and hive 0.10 without issues.
I can't create a table using this SerDe: https://github.com/rcongiu/Hive-JSON-Serde
first try:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.<init>(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V
second try:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Could not initialize class org.openx.data.jsonserde.objectinspector.JsonObjectInspectorFactory
I can create a table with this SerDe: https://github.com/cloudera/cdh-twitter-example
I created an external table with tweets from Flume. I can't do a "SELECT * FROM tweets;":
FAILED: RuntimeException org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDejava.lang.RuntimeException: java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDe
I can do SELECT id, text FROM tweets;
I can do a SELECT COUNT(*) FROM tweets;
I can't self-join this table:
Execution log at: /tmp/jochen.debie/jochen.debie_20140311121313_164611a9-b0d8-4e53-9bda-f9f7ac342aaf.log
2014-03-11 12:13:30 Starting to launch local task to process map join; maximum memory = 257294336
Execution failed with exit status: 2
Obtaining error information
Task failed!
Task ID:
Stage-5
The execution log mentioned above:
2014-03-11 12:13:30,331 ERROR mr.MapredLocalTask (MapredLocalTask.java:executeFromChildJVM(324)) - Hive Runtime Error: Map local work failed
org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDejava.lang.RuntimeException: java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDe
Does anyone know how to fix this or at least show me where the problem is?
EDIT: Could it be a problem that I built the SerDe on Hadoop 2.0.0-cdh4.4.0 and Hive 0.10?
From what I've seen, Hive 0.11+ has a bug in joins with custom SerDes.
https://github.com/Esri/gis-tools-for-hadoop/issues/9
You might try the workaround of copying the JAR file containing the SerDe class to $HIVE_HOME/lib.
(I see in your question you got a ClassNotFoundException both in a join and in other cases; so far, the times I have encountered it were all with joins.)
[Edit] Another workaround is to use HADOOP_CLASSPATH:
env HADOOP_CLASSPATH=some.jar:other.jar hive ...
[Edit] The workaround applies to Hive versions 0.11 and 0.12; 0.13 and above contain the fix for HIVE-6670.
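Concretely, the two workarounds might look like this (a sketch; the jar name, path, and query file are hypothetical):
# workaround 1: copy the SerDe jar into Hive's own lib directory
cp /path/to/json-serde-with-dependencies.jar $HIVE_HOME/lib/
# workaround 2: put the jar on HADOOP_CLASSPATH just for this invocation
env HADOOP_CLASSPATH=/path/to/json-serde-with-dependencies.jar hive -f my_join_query.hql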

Can't Store pig relation into Hbase

Hi, I am trying to store a Pig relation into HBase.
store result INTO 'hbase://hourlyAggregation' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount');
This runs fine in local mode. When I tried to run Pig in mapred mode, my job failed, and the log shows no error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
Details at logfile: /home/HadoopUser/pig_1384412383791.log
Pig Stack Trace
---------------
ERROR 2244: Job failed, hadoop does not return any error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:119)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:107)
================================================================================
My profile is as follows:
export JAVA_HOME=/home/hadoop/jdk1.6.0_39
export HADOOP_HOME=$MY_HOME/hadoop-0.20.2-cdh3u4
export CLASSPATH=$JAVA_HOME/lib/tools.jar:.
export HIVE_HOME=$MY_HOME/hive-0.7.1-cdh3u4
export PIG_HOME=$MY_HOME/pig-0.8.1-cdh3u4
export HBASE_HOME=$MY_HOME/hbase-0.90.6-cdh3u4
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$PIG_CLASSPATH"
Please help me with this.
I even tried registering the zookeeper and hbase jars in pig_home/lib.
JT log
14-Nov-2013 14:48:29 (17sec)
java.lang.RuntimeException: could not instantiate 'org.apache.pig.backend.hadoop.hbase.HBaseStorage' with arguments '[countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount]'
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:502)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc(POStore.java:218)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:85)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:68)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:278)
at org.apache.hadoop.mapred.Task.initialize(Task.java:511)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:306)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
Registering the hbase and zookeeper jars in PIG_HOME/lib and the guava jar from HBase's lib did work. Thanks Lorand Bendig for your support.
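For anyone hitting the same error, the fix amounts to registering the client jars at the top of the script; a sketch, using the $HBASE_HOME from the profile above (exact jar names and versions will differ per install):
-- register HBase, ZooKeeper and Guava so map tasks can instantiate HBaseStorage
register $HBASE_HOME/hbase-0.90.6-cdh3u4.jar;
register $HBASE_HOME/lib/zookeeper-<version>.jar;
register $HBASE_HOME/lib/guava-<version>.jar;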
