Running a Pig query over data stored in Hive - hadoop

I would like to know how to run Pig queries over data stored in Hive format. I have configured Hive to store compressed data (using this tutorial: http://wiki.apache.org/hadoop/Hive/CompressedStorage).
Before that I just used the normal Pig load function with Hive's default delimiter (^A). But now Hive stores the data in compressed sequence files. Which load function should I use?
Note that I don't need close integration like the one mentioned here: Using Hive with Pig; just which load function to use to read the compressed sequence files generated by Hive.
Thanks for all the answers.

Here's what I found out:
Using HiveColumnarLoader makes sense if you store the data as an RCFile. To load a table with it you need to register some jars first:
register /srv/pigs/piggybank.jar
register /usr/lib/hive/lib/hive-exec-0.5.0.jar
register /usr/lib/hive/lib/hive-common-0.5.0.jar
a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('ts int, user_id int, url string');
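Once the relation is loaded, the columns named in the schema string should be addressable like ordinary Pig fields (a quick sanity check, assuming the loader exposes the names from the example schema above):
b = FOREACH a GENERATE ts, user_id;
c = LIMIT b 10;
DUMP c;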
To load data from a sequence file you have to use PiggyBank (as in the previous example). The SequenceFile loader from PiggyBank should handle compressed files:
register /srv/pigs/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/user/hive/warehouse/table' USING SequenceFileLoader AS (int, int);
This doesn't work with Pig 0.7 because it's unable to read the BytesWritable type and cast it to a Pig type, so you get this exception:
2011-07-01 10:30:08,589 WARN org.apache.pig.piggybank.storage.SequenceFileLoader: Unable to translate key class org.apache.hadoop.io.BytesWritable to a Pig datatype
2011-07-01 10:30:08,625 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class org.apache.hadoop.io.BytesWritable to a Pig datatype
at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:132)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
How to compile piggybank is described here: Unable to build piggybank -> /home/build/ivy/lib does not exist

Related

How to specify schema while reading parquet file with pyspark?

While reading a Parquet file stored in Hadoop with either Scala or PySpark, an error occurs:
// scala
var dff = spark.read.parquet("/super/important/df")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 52 elided
or
sql_context.read.parquet(output_file)
results in the same error.
The error message is pretty clear about what has to be done: Unable to infer schema for Parquet. It must be specified manually.
But where can I specify it?
Spark 2.1.1, Hadoop 2.5; the DataFrames are created with the help of PySpark. The files are partitioned into 10 pieces.
This error usually occurs when you try to read an empty directory as Parquet.
If, for example, you create an empty DataFrame, write it out as Parquet, and then read it back, this error appears.
You could check whether the DataFrame is empty with rdd.isEmpty() before writing it.
I have done a quick implementation for the same.
Hope this helps!

Apache Pig type casting

I'm using Apache Pig to do some data processing work. I wrote a Pig Latin script like this:
raw = LOAD 'data.csv' USING MyLoader();
repaired = FOREACH raw GENERATE MyRepairFunc(*);
filtered = FOREACH repaired GENERATE $0 AS name:chararray, $3 AS age:int;
DUMP filtered;
Pig raised an error:
java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:115)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:124)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:281)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:274)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
It's a data casting problem. Because the raw data may contain damaged records, I cannot declare the schema at load time without risking data loss.
What should I do to fix this? Thanks a lot.
You should clean your raw data (data cleansing) before analyzing it.
There is a Pig UDF that tries to cleanse raw data against an expected pattern, but it has not been merged into the main branch:
PIG-3735 UDF to data cleanse the dirty data with expected pattern
You can also try to cleanse the raw data with your favorite tools.
Please refer to the tools recommended in
https://infocus.emc.com/david_dietrich/the-dirty-little-secret-of-big-data-projects/
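For example, one way to guard the cast inside Pig itself is to load the suspect column as chararray, filter out rows that are not numeric, and only then cast. A minimal sketch, assuming comma-delimited input with the name in the first column and the age in the fourth:
raw = LOAD 'data.csv' USING PigStorage(',') AS (name:chararray, col1:chararray, col2:chararray, age_str:chararray);
-- drop records where the age field is missing or not purely numeric
clean = FILTER raw BY age_str IS NOT NULL AND age_str MATCHES '[0-9]+';
typed = FOREACH clean GENERATE name, (int)age_str AS age;
DUMP typed;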

How to store Avro format in HDFS using PIG?

After processing the input data, I have a Java object. I've created an Avro schema for storing the object in an Avro file. I'm stuck at writing the object to HDFS using the schema. Can anyone walk me through the process of writing the object using a Pig script and a corresponding UDF?
I suppose you are using a UDF since you use Java.
So you just have to return the result of your UDF as a Pig tuple.
Then you get a relation with your data ready to store.
Finally, you can use the STORE command with AvroStorage.
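A minimal sketch of the STORE side, assuming a hypothetical UDF MyUdf that returns a (name, age) tuple; the jar paths and output path are placeholders, and AvroStorage (from PiggyBank) also needs the Avro and json-simple jars on the classpath:
REGISTER /path/to/piggybank.jar
REGISTER /path/to/avro-1.7.x.jar
REGISTER /path/to/json-simple-1.1.jar
raw = LOAD '/user/me/input' USING PigStorage();
-- flatten the tuple returned by the UDF so its fields become the relation's schema
processed = FOREACH raw GENERATE FLATTEN(MyUdf(*)) AS (name:chararray, age:int);
-- with no arguments AvroStorage derives the Avro schema from the Pig schema;
-- an explicit Avro schema can also be passed as a JSON string argument
STORE processed INTO '/user/me/output/avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();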

Read Snappy compressed Hive RCFile in Apache Pig

I'm trying to read Hive files in Pig using HiveColumnarLoader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html
The files are binary and begin with the words RCF, SnappyCodec, and hive.io.rcfile.column.number. Moreover, they are partitioned over multiple directories (like /day=20140701).
However, a simple script that loads, groups, and counts rows prints nothing to the output. If I try to add ILLUSTRATE like this:
rows = LOAD ... using HiveColumnarLoader ...;
ILLUSTRATE rows;
I get an error like this:
2014-07-17 14:16:43,086 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
...
I'm not sure whether it is because of the Snappy compression or some trouble with specifying the schema (I copied it from Hive's DESCRIBE table output).
Could anyone please confirm that HiveColumnarLoader works with Snappy-compressed files, or propose another approach?
Thanks in advance!
Have you tried the HCatLoader?
rows = LOAD 'tablename' using org.apache.hcatalog.pig.HCatLoader();
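With HCatLoader, the RCFile format and the Snappy codec should be handled by the input format recorded in the metastore, and partition columns show up as ordinary fields. A minimal sketch, assuming the table lives in the default database and that day is the partition column from the question (start Pig with HCatalog on the classpath, e.g. pig -useHCatalog):
rows = LOAD 'default.tablename' USING org.apache.hcatalog.pig.HCatLoader();
-- partition columns like 'day' can be filtered on directly
one_day = FILTER rows BY day == '20140701';
grpd = GROUP one_day ALL;
cnt = FOREACH grpd GENERATE COUNT(one_day);
DUMP cnt;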

ORCfile storage implementation in Pig

Does anybody know how to use ORC files for input/output in Pig?
I found some kind of support for RCFiles in elephant-bird, but it seems the ORC format is not supported...
Could you please provide a sample of using Pig to read/store ORC files?
Support for ORC storage in Pig is not yet committed and is under active development. Refer to Apache JIRA PIG-3558. Once it lands, you will be able to access ORC files from your Pig script like this:
load 'foo.orc' using OrcStorage();
...
store .. using OrcStorage('-c SNAPPY');
Alternatively, define an HCatalog table stored as ORC using the HCat CLI. Then LOAD the relation in Pig using org.apache.hcatalog.pig.HCatLoader(), or STORE it using org.apache.hcatalog.pig.HCatStorer().
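A sketch of that HCatalog route, assuming a source table orc_table and a target table orc_table_out in the default database, both created beforehand through the hcat/hive CLI (the table and column names here are placeholders):
-- hcat/hive CLI:
--   CREATE TABLE orc_table (name STRING, age INT) STORED AS ORC;
--   CREATE TABLE orc_table_out (name STRING, age INT) STORED AS ORC;
-- Pig, started with HCatalog on the classpath (e.g. pig -useHCatalog):
a = LOAD 'default.orc_table' USING org.apache.hcatalog.pig.HCatLoader();
b = FILTER a BY age IS NOT NULL;
STORE b INTO 'default.orc_table_out' USING org.apache.hcatalog.pig.HCatStorer();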
