Apache Pig type casting - hadoop

I'm using Apache Pig to do some data processing work. I wrote a Pig Latin script like this:
raw = Load 'data.csv' USING MyLoader();
repaired = FOREACH raw GENERATE MyRepairFunc(*);
filtered = FOREACH repaired GENERATE $0 AS name:chararray, $3 AS age:int;
DUMP filtered;
Pig raised an error:
java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:115)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:124)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:281)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:274)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
It's a data casting problem. Because the raw data may contain some damaged records, I cannot declare the schema when loading, since that could cause data loss.
What should I do to fix this? Thanks a lot.

You should cleanse your raw data before analyzing it.
There is a Pig UDF that tries to cleanse raw data against an expected pattern, but it was not merged into the main branch:
PIG-3735 UDF to data cleanse the dirty data with expected pattern
You can try to cleanse raw data with your favorite tools.
Please refer to the tools recommended in
https://infocus.emc.com/david_dietrich/the-dirty-little-secret-of-big-data-projects/
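As an illustration of cleansing before loading, here is a minimal Python sketch that splits records into clean and damaged sets instead of dropping anything. It assumes comma-delimited input with the age in the fourth field (matching $3 in the question); the function name and sample rows are hypothetical, and the real format read by MyLoader may differ.

```python
import csv
import io


def split_clean_and_damaged(lines, age_index=3, min_fields=4):
    """Separate rows whose age field parses as an integer from rows that do not."""
    clean, damaged = [], []
    for row in csv.reader(lines):
        try:
            if len(row) < min_fields:
                raise ValueError("too few fields")
            int(row[age_index].strip())  # raises ValueError on non-integer text
            clean.append(row)
        except ValueError:
            # Keep damaged rows so they can be repaired or inspected later.
            damaged.append(row)
    return clean, damaged


# Hypothetical sample data: one good row, one bad age, one truncated row.
sample = io.StringIO(
    "alice,x,y,30\n"
    "bob,x,y,not-a-number\n"
    "carol,x\n"
)
clean, damaged = split_clean_and_damaged(sample)
print(len(clean), "clean rows,", len(damaged), "damaged rows")
```

Writing the clean rows to one file and the damaged rows to another keeps every record available, so the clean file can be loaded with a declared schema while the damaged one is handled separately.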

Related

How to specify schema while reading parquet file with pyspark?

While reading a parquet file stored in hadoop with either scala or pyspark an error occurs:
// Scala
var dff = spark.read.parquet("/super/important/df")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 52 elided
or
sql_context.read.parquet(output_file)
results in the same error.
The error message is pretty clear about what has to be done: "Unable to infer schema for Parquet. It must be specified manually."
But where can I specify it?
Spark 2.1.1, Hadoop 2.5; the dataframes are created with the help of pyspark. The files are partitioned into 10 pieces.
This error usually occurs when you try to read an empty directory as parquet.
For example, if you create an empty DataFrame, write it as parquet and then read it back, this error appears.
You could check whether the DataFrame is empty with rdd.isEmpty() before writing it.
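On the reading side, a similar guard is to confirm that the directory holds at least one data file before calling read.parquet. The following is a minimal sketch, assuming the path is on a local filesystem (an HDFS path would need the Hadoop filesystem API instead); the helper name is hypothetical. It skips names starting with "_" or ".", such as _SUCCESS markers and .crc files, which Spark also ignores when listing input files.

```python
import tempfile
from pathlib import Path


def has_parquet_data(directory):
    """Return True if the directory tree contains at least one non-empty data file."""
    root = Path(directory)
    if not root.is_dir():
        return False
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Ignore anything under or named like a marker/hidden entry (_SUCCESS, .crc, _temporary).
        parts = path.relative_to(root).parts
        if any(part.startswith(("_", ".")) for part in parts):
            continue
        if path.stat().st_size > 0:
            return True
    return False


# Small demonstration on a temporary local directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_parquet_data(tmp)
    (Path(tmp) / "_SUCCESS").write_bytes(b"")
    marker_only_result = has_parquet_data(tmp)
    (Path(tmp) / "part-00000.parquet").write_bytes(b"PAR1")
    with_data_result = has_parquet_data(tmp)

print(empty_result, marker_only_result, with_data_result)
```

The check only looks at file presence and size; it does not validate that the files are well-formed parquet.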
I have done a quick implementation for the same
Hope this Helps!!...

Deserialize protobuf column with Hive

I am really new to Hive, so I apologize if there are any misconceptions in my question.
I need to read a Hadoop sequence file into a Hive table. The sequence file contains Thrift binary data, which can be deserialized using the SerDe2 that comes with Hive.
The problem now is: One column in the file is encoded with Google protobuf, so when thrift SerDe processes the sequence file it does not process the protobuf encoded column properly.
I wonder if there's a way in Hive to deal with this kind of protobuf encoded columns that are nested inside a thrift sequence file, so that each column could be parsed properly?
Thank you so much for any possible help!
I believe you should use a different SerDe to deserialize the protobuf format.
Maybe you can refer to this:
https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive

Read Snappy compressed Hive RCFile in Apache Pig

Trying to read Hive files in Pig using http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html
The files begin with the strings RCF, SnappyCodec and hive.io.rcfile.column.number; they are binary files. Moreover, they are partitioned over multiple directories (like /day=20140701).
However, a simple script that loads, groups and counts rows prints nothing to the output. If I add "ILLUSTRATE" like this:
rows = LOAD ... using HiveColumnarLoader ...;
ILLUSTRATE rows;
I get an error like this:
2014-07-17 14:16:43,086 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
...
I'm not sure whether this is because of the Snappy compression or some problem with how I specify the schema (I copied it from Hive's describe table output).
Could anyone please confirm that HiveColumnarLoader works with snappy compressed files or propose another approach?
Thanks in advance!
Have you tried the HCatLoader?
rows = LOAD 'tablename' using org.apache.hcatalog.pig.HCatLoader();

Loading protobuf format file into pig script using loadfunc pig UDF

I have very little knowledge of Pig. I have a data file in protobuf format that I need to load in a Pig script, so I need to write a LoadFunc UDF for it. Say the function is Protobufloader().
My Pig script would be:
A = LOAD 'abc_protobuf.dat' USING Protobufloader() as (name, phonenumber, email);
All I wish to know is how to get the file input stream. Once I have the file input stream, I can parse the data from protobuf format into Pig tuples.
PS: thanks in advance
Twitter's open source library elephant bird has many such loaders:
https://github.com/kevinweil/elephant-bird
You can use LzoProtobufB64LinePigLoader and LzoProtobufBlockPigLoader.
https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load
To use it, you just need to do:
define ProtoLoader com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('your.proto.class.name');
a = load '/your/file' using ProtoLoader;
b = foreach a generate field1, field2;
After loading, it will be automatically translated to pig tuples with proper schema.
However, they assume that your data is written as serialized protocol buffers and compressed with LZO.
They have corresponding writers as well, in package com.twitter.elephantbird.pig.store.
If your data format is a bit different, you can adapt their code to your custom loader.

Running Pig query over data stored in Hive

I would like to know how to run Pig queries over data stored in Hive format. I have configured Hive to store compressed data (using this tutorial http://wiki.apache.org/hadoop/Hive/CompressedStorage).
Before that I just used the normal Pig load function with Hive's delimiter (^A). But now Hive stores data in sequence files with compression. Which load function should I use?
Note that I don't need close integration like that mentioned here: Using Hive with Pig. I just need to know which load function to use to read compressed sequence files generated by Hive.
Thanks for all the answers.
Here's what I found out:
Using HiveColumnarLoader makes sense if you store data as an RCFile. To load a table with it, you need to register some jars first:
register /srv/pigs/piggybank.jar
register /usr/lib/hive/lib/hive-exec-0.5.0.jar
register /usr/lib/hive/lib/hive-common-0.5.0.jar
a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('ts int, user_id int, url string');
To load data from a sequence file you have to use PiggyBank (as in the previous example). The SequenceFileLoader from PiggyBank should handle compressed files:
register /srv/pigs/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/user/hive/warehouse/table' USING SequenceFileLoader AS (int, int);
This doesn't work with Pig 0.7 because the loader is unable to read the BytesWritable type and translate it to a Pig type, so you get this exception:
2011-07-01 10:30:08,589 WARN org.apache.pig.piggybank.storage.SequenceFileLoader: Unable to translate key class org.apache.hadoop.io.BytesWritable to a Pig datatype
2011-07-01 10:30:08,625 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class org.apache.hadoop.io.BytesWritable to a Pig datatype
at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:132)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
How to compile piggybank is described here: Unable to build piggybank -> /home/build/ivy/lib does not exist
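Since the choice of loader depends on whether the table is stored as an RCFile or a sequence file, it can help to look at the first bytes of one of the table's files. The following is a minimal sketch, assuming the file has been copied to the local filesystem; the function names are hypothetical. It relies on sequence files starting with the bytes SEQ and on newer RCFiles starting with RCF. As far as I know, older Hive versions wrote RCFiles with a SEQ header, so a "sequencefile" result is a hint rather than proof.

```python
def detect_hive_file_format(header):
    """Guess the container format from the first bytes of a file."""
    if header[:3] == b"RCF":
        return "rcfile"
    if header[:3] == b"SEQ":
        # May also be an RCFile written by an older Hive version (assumption, see note above).
        return "sequencefile"
    return "unknown"


def detect_from_path(path):
    """Read the first four bytes of a local file and classify them."""
    with open(path, "rb") as handle:
        return detect_hive_file_format(handle.read(4))


print(detect_hive_file_format(b"RCF\x01"))
print(detect_hive_file_format(b"SEQ\x06"))
print(detect_hive_file_format(b"1\x01alice"))
```

A plain delimited text file, such as one using Hive's ^A delimiter, is reported as "unknown".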

Resources