Load CSV data into HBase using Pig or Hive (Hadoop)

Hi, I have created a Pig script that loads data into HBase. My CSV file is stored in HDFS at /hbase_tables/zip.csv.
Pig Script
register /home/hduser/pig-0.12.0/lib/pig-0.8.0-core.jar;
A = LOAD '/hbase_tables/zip.csv' USING PigStorage(',') as (id:chararray, zip:chararray, desc1:chararray, desc2:chararray, income:chararray);
STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('zip:zip,desc:desc1,desc:desc2,income:income');
When I execute it, I get the error below:
Pig Stack Trace
ERROR 2017: Internal error creating job configuration.
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:667)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
at org.apache.pig.PigServer.execute(PigServer.java:1190)
at org.apache.pig.PigServer.access$100(PigServer.java:128)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1517)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:362)
at org.apache.pig.PigServer.executeBatch(PigServer.java:329)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:510)
at org.apache.pig.Main.main(Main.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:71)
at org.apache.hadoop.fs.Path.<init>(Path.java:45)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:470)
... 20 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
... 23 more
Please let me know how I can import the CSV data file into HBase, or whether you have any alternative solution.

It seems your problem is the "Relative path in absolute URI: hbase://mydata_logs" error.
Are you sure the path is correct?

Probably the table mydata_logs does not exist. Start hbase shell and type list. Is your table mydata_logs in the list?
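For example, you can check from the HBase shell whether the target table exists and create it if it does not (a sketch; the column families zip, desc and income are taken from the HBaseStorage mapping in the question):
hbase shell
# show existing tables
list
# create the target table with the column families used in the STORE statement
create 'mydata', 'zip', 'desc', 'income'
describe 'mydata'
exit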

I had the same task once and have a fully working solution (actually, I'm not sure about the commas in the third line of your code):
%default hbase_home `echo \$HBASE_HOME`;
%default tmp '/user/alexander/tmp/users_dump/k14'
set zookeeper.znode.parent '/hbase-unsecure';
set hbase.zookeeper.quorum 'dmp-hbase.local';
register $hbase_home/lib/zookeeper-3.4.5.jar;
register $hbase_home/hbase-0.94.20.jar;
UsersHdfs = LOAD '$tmp' using PigStorage('\t', '-schema');
store UsersHdfs into 'hbase://user_test' using
org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'id:DEFAULT id:last_modified birth:year gender:female gender:male','-caster HBaseBinaryConverter'
);
That code works for me; maybe the problem is in your HBase configuration.
If you could share your .csv file, we could discuss it in more detail.
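As a starting point, here is a minimal sketch that adapts the pattern above to the CSV and column mapping from the question; the jar names, paths and ZooKeeper settings are assumptions and must match your cluster:
%default hbase_home `echo \$HBASE_HOME`;
register $hbase_home/lib/zookeeper-3.4.5.jar;     -- adjust to the zookeeper jar shipped with your HBase
register $hbase_home/hbase-0.94.20.jar;           -- adjust to your HBase version
set zookeeper.znode.parent '/hbase-unsecure';     -- assumption: depends on your cluster
set hbase.zookeeper.quorum 'your-zookeeper-host'; -- assumption: depends on your cluster
A = LOAD '/hbase_tables/zip.csv' USING PigStorage(',')
    AS (id:chararray, zip:chararray, desc1:chararray, desc2:chararray, income:chararray);
STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'zip:zip,desc:desc1,desc:desc2,income:income');
Note that on some Pig versions the 'hbase://' prefix itself triggers the mydata_logs URI error; in that case store into plain 'mydata' instead, as shown in the last related answer below.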

Related

Google Cloud connector for Hadoop doesn't work with Pig

I'm using Hadoop with HDFS 2.7.1.2.4 and Pig 0.15.0.2.4 (Hortonworks HDP 2.4) and am trying to use the Google Cloud Storage Connector for Spark and Hadoop (bigdata-interop on GitHub). It works correctly when I try, say,
hadoop fs -ls gs://bucket-name
But when I try the following in Pig (in mapreduce mode):
data = LOAD 'gs://softline/o365.avro' USING AvroStorage();
data = STORE data INTO 'gs://softline/o366.avro' USING AvroStorage();
Pig fails with the following errors:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: java.lang.IllegalArgumentException: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.checkPath(GoogleHadoopFileSystemBase.java:741)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:466)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:701)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:163)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.setWorkingDirectory(GoogleHadoopFileSystemBase.java:1094)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:235)
... 18 more
If needed, I can post the logs of the GCS connector.
Has anybody used Pig with this connector? Any help would be appreciated.
TL;DR: explicitly set mapreduce.job.working.dir=/user/root/ when starting the Pig job.
If a working directory has not been explicitly set during job submission then Hadoop will set the working directory to be the working directory of the default filesystem. When using HDFS as your default FS the working directory will generally be something like 'hdfs://namenode:port/user/<your username>'.
When PigInputFormat#getSplits is called, it fetches the FileSystem associated with the path of the input that it is operating on. In this case the filesystem is an instance of GoogleHadoopFileSystem. Pig then inspects the path of its input and if the path is non-local calls FileSystem#setWorkingDirectory(job.getWorkingDirectory()). The problem here is that the job's working directory is 'hdfs://namenode:port/user/<your username>' which GoogleHadoopFileSystem will reject as a path to set as its own working directory (as it only supports 'gs://' paths).
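For example, with the value from the TL;DR above (a sketch; the -D property must come before other Pig arguments, and the script name is a placeholder):
# pass the working directory explicitly as a Hadoop property when launching the Pig job
pig -Dmapreduce.job.working.dir=/user/root/ -x mapreduce your_script.pig
Alternatively, the same property can be set inside the script with set mapreduce.job.working.dir '/user/root/'; before the LOAD/STORE statements.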

PIG Script Error: java.lang.NoSuchMethodError: org.apache.thrift.protocol.TProtocol.getScheme

I am running a Pig script in mapreduce mode. The script reads an RCFile (containing Thrift-serialized data stored in GZIP-compressed format), deserializes it using a UDF, extracts certain fields from the Thrift struct, and stores them.
Some of the mappers fail with the following error:
2015-12-23 03:07:45,638 FATAL [Thread-5] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchMethodError: org.apache.thrift.protocol.TProtocol.getScheme()Ljava/lang/Class;
at com.xxx.yyy.thrift.dto.LatLong.read(LatLong.java:553)
at com.twitter.elephantbird.util.ThriftUtils.readSingleFieldNoTag(ThriftUtils.java:318)
at com.twitter.elephantbird.util.ThriftUtils.readFieldNoTag(ThriftUtils.java:352)
at com.twitter.elephantbird.mapreduce.input.RCFileThriftTupleInputFormat$TupleReader.getCurrentTupleValue(RCFileThriftTupleInputFormat.java:74)
at com.twitter.elephantbird.pig.load.RCFileThriftPigLoader.getNext(RCFileThriftPigLoader.java:46)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:204)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:553)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Here's my script:
REGISTER '/user/ameya/libs/geo-analysis-1.0.0-SNAPSHOT.jar';
REGISTER '/user/ameya/libs/libthrift-0.8.0.jar';
REGISTER '/user/ameya/libs/thrift-0.8-types-1.1.29-SNAPSHOT.jar';
REGISTER '/user/ameya/libs/libs/elephant-bird-pig-4.7.jar';
REGISTER '/user/ameya/libs/libs/elephant-bird-rcfile-4.7.jar';
REGISTER '/user/ameya/libs/libs/elephant-bird-core-4.7.jar';
REGISTER '/user/ameya/libs/libs/elephant-bird-hadoop-compat-4.7.jar';
REGISTER '/user/ameya/libs/libs/hive-0.4.1.jar';
REGISTER '/user/ameya/libs/libs/libs/hive-serde-0.13.3.jar';
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
thrift = LOAD '$input' USING com.twitter.elephantbird.pig.load.RCFileThriftPigLoader('com.xxx.yyy.thrift.dto.LatLong');
final = FOREACH thrift GENERATE (requestLatLong is not null ? requestLatLong.latitude : null) AS req_ll_lat,
(requestLatLong is not null ? requestLatLong.longitude : null) AS req_ll_lng;
STORE final INTO '$output';
I am using libthrift-0.8.0.jar, in which the class TProtocol does define the getScheme() method (with public access). Interestingly, not all of the mappers fail, just a few of them; but that is enough to fail my job. Could this be a classpath issue?
I tried searching for this issue but could not find relevant answers. Can someone please help me find some leads to fix this?
Found the reason: the class org.apache.thrift.protocol.TProtocol was defined in two jars, libthrift-0.8.0.jar and hive-0.4.1.jar. The copy in hive-0.4.1.jar does not have the getScheme() method, so whenever hive-0.4.1.jar was picked up first on the classpath, the mappers could not find getScheme().
I am not sure why the behavior was not consistent across all mappers. Any comments explaining that would be helpful.
I replaced hive-0.4.1.jar with hive-exec-0.13.3.jar and the issue was resolved.
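If you need to verify which jar is shadowing the class, something like the following can help (a sketch, assuming local copies of the jars; the paths are placeholders):
# see which jars contain the conflicting class
unzip -l libthrift-0.8.0.jar | grep 'org/apache/thrift/protocol/TProtocol.class'
unzip -l hive-0.4.1.jar | grep 'org/apache/thrift/protocol/TProtocol.class'
# check whether a given copy actually declares getScheme()
javap -classpath libthrift-0.8.0.jar org.apache.thrift.protocol.TProtocol | grep getScheme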

Error while loading .gz files in Pig Script

I have a large number of .gz files in HDFS. I am trying to load them using PigStorage to process the data, and I get the following exception:
java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
at java.io.InputStream.read(InputStream.java:85)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:139)
at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:55)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:530)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:363)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
This might be the result of some of the files being corrupt.
Does Pig have any kind of error handling with which I can skip the corrupt files? Following is the sample code I am using:
cal = load '$inputdir/CAL/*/*/*/*/*/*/*.gz' USING PigStorage('\t');
It probably has to do with corrupted files.
You can use the mapred.max.map.failures.percent setting (or mapred.max.reduce.failures.percent but this has nothing to do with your case) to control the percentage of failures you're ok with ignoring.
The caveat is that a single mapper can read multiple .gz files, so a corrupted file that fails a mapper makes it skip files that are fine as well.
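For example, at the top of the Pig script (a sketch; the 5% threshold is arbitrary, and on newer MapReduce versions the property may go by its mapreduce.* equivalent):
-- tolerate up to 5% failed map tasks, e.g. mappers killed by corrupt .gz input
set mapred.max.map.failures.percent 5;
cal = LOAD '$inputdir/CAL/*/*/*/*/*/*/*.gz' USING PigStorage('\t');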

Writing snappy compressed data to a hive table

I've created a Hive table and now I want to load Snappy-compressed data into it. Therefore I did the following:
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
CREATE TABLE toydata_table (id STRING, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
Then I created a CSV file called toydata.csv with the following content:
A,Value1
B,Value2
C,Value3
I compressed this file with snzip ( https://github.com/kubo/snzip ) by running
/usr/local/bin/snzip -t snappy-java toydata.csv
which produces toydata.csv.snappy. After that I returned to the Hive CLI and loaded the data with LOAD DATA LOCAL INPATH "toydata.csv.snappy" INTO TABLE toydata_table;. But when I try to query that table, I get the following error message:
hive> select * from toydata_table;
OK
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:62)
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:189)
at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:175)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:433)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:515)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:489)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1471)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I did the exact same thing with gzip, and with gzip it works fine. So why does this part fail?
Please install the Snappy compression codec on your cluster. If you want to confirm whether Snappy is installed, look for the libsnappy.so file among your native libraries.
You also need to start the Hive shell with the --auxpath parameter and provide the snappy jar, e.g.: hive --auxpath /home/user/snappy1.0.4.1.jar.
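For example (a sketch; the native-library directory and the jar path are assumptions that depend on your distribution):
# confirm the native snappy library is available to Hadoop
ls /usr/lib/hadoop/lib/native/ | grep -i libsnappy
# start the Hive CLI with the snappy jar on the aux path
hive --auxpath /home/user/snappy1.0.4.1.jar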

Error When Loading Data from HDFS and Writing to HBase using Pig

How do I load the output of a MapReduce program, which is in HDFS, into HBase?
I tried running the following Pig commands to load the data from HDFS into HBase:
A = LOAD 'hdfs://b**/user/user1/development/hbase/output/part-00000' USING PigStorage('\t') as (strdata1:chararray, strdata2:chararray);
STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:strdata2');
where hdfs://b**/user/user1/development/hbase/output/part-00000 is the MapReduce output, mydata is the HBase table that was created, and mycf is the column family name.
I am getting the following error:
ERROR 2017: Internal error creating job configuration.
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:673)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:147)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
at org.apache.pig.PigServer.execute(PigServer.java:1190)
at org.apache.pig.PigServer.access$100(PigServer.java:128)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1517)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:362)
at org.apache.pig.PigServer.executeBatch(PigServer.java:329)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:169)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:406)
at org.apache.pig.Main.main(Main.java:107)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:71)
at org.apache.hadoop.fs.Path.<init>(Path.java:45)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:476)
... 15 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
at java.net.URI.checkPath(URI.java:1787)
at java.net.URI.<init>(URI.java:735)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
Just remove the hbase:// scheme from the data source string, like this:
STORE A INTO 'mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:strdata2');
