How to put data into HBase without using Java - hadoop

Is there any way to read data from a file and put it into an HBase table without using any Java? I tried to store data from a Pig script using
sample = LOAD '/mapr/user/username/sample.txt' AS (all:chararray);
STORE sample INTO 'hbase://sampledata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mysampletable:intdata');
but this gave the following error message:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable
ERROR org.apache.pig.tools.grunt.Grunt java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable

Pig seems like a good way to import data into HBase. Check what Armon suggested about setting $PIG_CLASSPATH.
Another option for bulk loading data into HBase is to use the dedicated ImportTsv (Tab-Separated Values) and CompleteBulkLoad tools:
http://hbase.apache.org/book/ops_mgt.html#importtsv
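A rough sketch of that workflow, reusing the table and column family from the Pig example (the input and staging paths are placeholders, and the input file is assumed to already be on HDFS):

# Parse the TSV file and write HFiles to a staging directory instead of issuing live Puts
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,mysampletable:intdata \
  -Dimporttsv.bulk.output=/tmp/sampledata_hfiles \
  sampledata /user/username/sample.txt
# Hand the generated HFiles over to the region servers (the CompleteBulkLoad step)
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/sampledata_hfiles sampledata

Dropping -Dimporttsv.bulk.output makes ImportTsv write directly into the table with ordinary Puts instead of bulk-loading HFiles.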

Well, there's the Stargate REST interface, which is usable from any language. It's not perfect, but it's worth a look.
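As a minimal sketch, assuming the REST gateway has been started (e.g. with 'hbase rest start') and listens on the default port 8080, a single cell can be written with curl. 'row1' and '42' are made-up sample values; the row key, column, and value all have to be base64-encoded in the JSON body:

ROW=$(echo -n 'row1' | base64)
COL=$(echo -n 'mysampletable:intdata' | base64)
VAL=$(echo -n '42' | base64)
# PUT a single cell through the REST gateway
curl -X PUT -H 'Content-Type: application/json' \
  "http://localhost:8080/sampledata/row1/mysampletable:intdata" \
  -d "{\"Row\":[{\"key\":\"$ROW\",\"Cell\":[{\"column\":\"$COL\",\"\$\":\"$VAL\"}]}]}"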

You just need to make sure that $PIG_CLASSPATH also points at hbase.jar.
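For example, something along these lines before launching the script (load_sample.pig is a placeholder name; 'hbase classpath' prints every jar the local HBase install needs, which is a bit more than hbase.jar alone but avoids guessing version-specific paths):

# Put the HBase jars (and their dependencies) on Pig's classpath
export PIG_CLASSPATH="$(hbase classpath):$PIG_CLASSPATH"
pig -x mapreduce load_sample.pig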

Related

AutoMLSearch with EvalML returning an error

I am getting the following error message while trying to run AutoMLSearch with EvalML:
"All pipelines in the current AutoML batch produced a score of np.nan on the primary objective <evalml.objectives.standard_metrics.LogLossBinary object at 0x7f74defbe790>."
I tried the following solution to rectify this, but still no luck:
https://github.com/alteryx/evalml/issues/3154
Any suggestions?

ParDo did not have a ParDoPayload

I've written a Beam pipeline in Go that runs successfully on my local machine, but when I add --runner=dataflow to run it on Google Cloud Dataflow, I get a vague error during job setup saying that a ParDo is missing a ParDoPayload. The stack trace is entirely Java, so I'm not sure how to translate this back into my Go code to figure out what I'm missing.
I've gone through and used beam.RegisterFunction() for all of my functions that emit and also used beam.RegisterType() for the top-level struct I'm passing around.
Any ideas how this error connects to the code I've written / how I can debug?
java.lang.RuntimeException: ParDo did not have a ParDoPayload
at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:327)
at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:97)
at java.util.function.Function.lambda$andThen$1(Function.java:88)
at org.apache.beam.runners.dataflow.worker.graph.CreateRegisterFnOperationFunction.apply(CreateRegisterFnOperationFunction.java:207)
at org.apache.beam.runners.dataflow.worker.graph.CreateRegisterFnOperationFunction.apply(CreateRegisterFnOperationFunction.java:74)
at java.util.function.Function.lambda$andThen$1(Function.java:88)
at java.util.function.Function.lambda$andThen$1(Function.java:88)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:346)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
at org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.start(DataflowRunnerHarness.java:195)
at org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.main(DataflowRunnerHarness.java:123)
Caused by: org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.InvalidProtocolBufferException: Protocol message had invalid UTF-8.
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.InvalidProtocolBufferException.invalidUtf8(InvalidProtocolBufferException.java:141)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.Utf8$DecodeUtil.handleTwoBytes(Utf8.java:1909)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.Utf8$DecodeUtil.access$700(Utf8.java:1883)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.Utf8$UnsafeProcessor.decodeUtf8(Utf8.java:1411)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.Utf8.decodeUtf8(Utf8.java:340)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.CodedInputStream$ArrayDecoder.readStringRequireUtf8(CodedInputStream.java:804)
at org.apache.beam.model.pipeline.v1.RunnerApi$FunctionSpec.<init>(RunnerApi.java:55936)
at org.apache.beam.model.pipeline.v1.RunnerApi$FunctionSpec.<init>(RunnerApi.java:55897)
at org.apache.beam.model.pipeline.v1.RunnerApi$FunctionSpec$1.parsePartialFrom(RunnerApi.java:56565)
at org.apache.beam.model.pipeline.v1.RunnerApi$FunctionSpec$1.parsePartialFrom(RunnerApi.java:56559)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:883)
at org.apache.beam.model.pipeline.v1.RunnerApi$ParDoPayload.<init>(RunnerApi.java:10363)
at org.apache.beam.model.pipeline.v1.RunnerApi$ParDoPayload.<init>(RunnerApi.java:10320)
at org.apache.beam.model.pipeline.v1.RunnerApi$ParDoPayload$1.parsePartialFrom(RunnerApi.java:12633)
at org.apache.beam.model.pipeline.v1.RunnerApi$ParDoPayload$1.parsePartialFrom(RunnerApi.java:12627)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:100)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:120)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:125)
at org.apache.beam.vendor.grpc.v1p21p0.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
at org.apache.beam.model.pipeline.v1.RunnerApi$ParDoPayload.parseFrom(RunnerApi.java:11130)
at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:325)
... 10 more

Sequence file reading issue using spark Java

I am trying to read a sequence file generated by Hive using Spark. When I try to access the file, I get org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:
I have tried the usual workarounds for this issue, like making the class serializable, but I still face it. I am writing the code snippet here; please let me know what I am missing.
Is it because of the BytesWritable data type or something else that is causing the issue?
JavaPairRDD<BytesWritable, Text> fileRDD = javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
    public String call(Tuple2<BytesWritable, Text> row) {
        return row._2.toString() + "\n";
    }
}).collect();
Here is what was needed to make it work.
Because we use HBase to store our data and this reducer outputs its result to an HBase table, Hadoop is telling us that it doesn't know how to serialize our data. That is why we need to help it: inside setUp(), set the io.serializations property.
You can do the same in Spark accordingly:
conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});

Hbase Import Table Error

I was trying to import data from one HBase (v0.98.4) to another HBase (v0.98.13).
I have exported the data using the below command -
hbase org.apache.hadoop.hbase.mapreduce.Driver export 'tblname' /path/
But I am not able to import it using the below command -
hbase org.apache.hadoop.hbase.mapreduce.Driver import 'tblname' /hdfs/path/
I get some deprecation messages as well as an exception thrown.
Is it because of version conflicts between the source and destination databases?
I happened to solve it. All I had to do was create an empty table with the same metadata and then import into it. :)
Try using the direct Export and Import commands for HBase versions above 0.94, as sketched below. Maybe you are using the generalized MapReduce Driver class and passing export and import as arguments, when the actual Export and Import classes are available. Hope it helps. Happy coding.
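A rough sketch of invoking those classes directly, reusing the table name and path placeholders from the question:

# Export the table to a dump on HDFS
hbase org.apache.hadoop.hbase.mapreduce.Export 'tblname' /hdfs/path/
# Import the dump into a table of the same name (created beforehand with matching column families) on the target cluster
hbase org.apache.hadoop.hbase.mapreduce.Import 'tblname' /hdfs/path/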

Hadoop new API - Set OutputFormat

I'm trying to set the OutputFormat of my job to MapFileOutputFormat using:
jobConf.setOutputFormat(MapFileOutputFormat.class);
I get this error: mapred.output.format.class is incompatible with new reduce API mode
I suppose I should use the setOutputFormatClass() method of the new Job class, but the problem is that when I try to do this:
job.setOutputFormatClass(MapFileOutputFormat.class);
it expects me to use this class: org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.
In Hadoop 1.0.x there is no such class; it only exists in earlier versions (e.g. the 0.x line).
How can I solve this problem?
Thank you!
This problem has no decent, easily implementable solution.
I gave up and used sequence files, which fit my requirements too.
Have you tried the following?
import org.apache.hadoop.mapreduce.lib.output.*;
...
LazyOutputFormat.setOutputFormatClass(job, MapFileOutputFormat.class);
