Read Snappy-compressed Hive RCFile in Apache Pig - hadoop

I'm trying to read Hive files in Pig using http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html
The files are binary and begin with the words RCF, SnappyCodec and hive.io.rcfile.column.number. Moreover, they are partitioned over multiple directories (like /day=20140701).
However, a simple script that loads, groups and counts rows prints nothing to the output. If I add ILLUSTRATE like this:
rows = LOAD ... using HiveColumnarLoader ...;
ILLUSTRATE rows;
I get an error like this:
2014-07-17 14:16:43,086 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
...
I'm not sure whether it is because of the Snappy compression or some problem with specifying the schema (I copied it from Hive's DESCRIBE table output).
Could anyone please confirm that HiveColumnarLoader works with Snappy-compressed files, or propose another approach?
Thanks in advance!

Have you tried HCatLoader? It reads the table definition from the Hive metastore, so the storage format (RCFile + Snappy) and the partitions are picked up for you:
rows = LOAD 'tablename' using org.apache.hcatalog.pig.HCatLoader();
(Run Pig with the -useHCatalog flag so the HCatalog jars are on the classpath.)

Related

Does BigQuery support loading HDFS-style partitioned data?

In HDFS, partitioned data is stored as multiple files like
hdfs://user/hive/warehouse/TABLE_NAME/column_1="VALUE"/column_2="VALUE"/000000
Does BigQuery support loading these files as they are, or is it necessary to flatten the data into a single file?
Nothing is mentioned in the documentation about loading the files as they are.
Multiple files under the same directory can be loaded into BigQuery with a wildcard URI, so there is no need to flatten them.
Below is a sample command:
bq load --replace --quote "" -F"\t" ${db_name}.${tgt_table_name}\$${bq_partition} gs://bucket_name/folder/*
Let me know if it helps.
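For reference, a rough Python equivalent of the same load using the google-cloud-bigquery client (a sketch only: the project, dataset, table, partition and bucket names are placeholders, and the options are meant to mirror the flags of the bq command above):
# Sketch: load every file under a GCS folder into a single partition.
# All names below are placeholders, not taken from the question.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",                                        # matches -F"\t"
    quote_character="",                                          # matches --quote ""
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # matches --replace
)
load_job = client.load_table_from_uri(
    "gs://bucket_name/folder/*",                 # wildcard picks up all files, no flattening
    "my_project.my_dataset.my_table$20140701",   # partition decorator, as in the CLI example
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish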

How to specify schema while reading parquet file with pyspark?

While reading a parquet file stored in Hadoop with either Scala or pyspark, an error occurs:
#scala
var dff = spark.read.parquet("/super/important/df")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 52 elided
or
sql_context.read.parquet(output_file)
results in the same error.
The error message is pretty clear about what has to be done: Unable to infer schema for Parquet. It must be specified manually.
But where can I specify it?
Spark 2.1.1, Hadoop 2.5; the dataframes are created with pyspark. The files are partitioned into 10 pieces.
This error usually occurs when you try to read an empty directory as Parquet; Spark has no data files to infer a schema from.
If, for example, you create an empty DataFrame, write it as Parquet and then read it back, this error appears.
You could check whether the DataFrame is empty with rdd.isEmpty() before writing it; a quick sketch of both workarounds follows.
Hope this helps!
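A minimal pyspark sketch of both points (only /super/important/df comes from the question; the field names and the output path are assumptions): pass an explicit schema to the reader so nothing has to be inferred, and check for emptiness before writing so you never produce a Parquet directory without data files.
# Sketch only: the schema fields and the output path are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("parquet-schema-example").getOrCreate()

# 1) Specify the schema manually so Spark does not try to infer it from the files.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.schema(schema).parquet("/super/important/df")

# 2) Guard the write: only write the DataFrame if it actually contains rows.
if not df.rdd.isEmpty():
    df.write.mode("overwrite").parquet("/super/important/df_out")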

How to skip file headers in impala external table?

I have a 78 GB file on HDFS.
I need to create an Impala external table over it to perform some grouping and aggregation on the data.
Problem
The file contains headers.
Question
Is there any way to skip the headers while reading the file and run queries on the rest of the data?
I could copy the file to local disk, remove the headers and copy the updated file back to HDFS, but that is not feasible because the file is too large.
Any suggestions would be appreciated. Thanks in advance.
UPDATE and DELETE row operations are not available in Hive/Impala, so you have to simulate the DELETE:
Load the data file into a temporary Hive/Impala table
Use INSERT INTO or CREATE TABLE AS SELECT on the temporary table to create the required table
A straightforward approach would be to run the HDFS data through Pig to filter out the headers and generate a new HDFS dataset formatted so that Impala could read it cleanly.
A more arcane approach would depend on the format of the HDFS data. For example, if both header and data lines are tab-delimited, then you could read everything using a schema with all STRING fields and then filter or partition out the headers before doing aggregations.

ORCfile storage implementation in Pig

Does anybody know how to read and write ORC files in Pig?
I found some support for RCFiles in elephant-bird, but it seems the ORC format is not supported...
Could you please provide a sample of accessing/storing ORC files in Pig?
Support for ORC storage in Pig is not yet committed and is under active development; refer to Apache JIRA PIG-3558. Once that lands, you will be able to access ORC files from your Pig script like this:
load 'foo.orc' using OrcStorage();
...
store .. using OrcStorage('-c SNAPPY');
Alternatively, define an HCatalog table stored as ORC using the HCat CLI, then LOAD the relation in Pig using org.apache.hcatalog.pig.HCatLoader() or STORE it using org.apache.hcatalog.pig.HCatStorer().

Filtering Using MapReduce in Hadoop

I want to filter records from a given file based on some criteria: if the value of the third field equals some value, then retrieve that record and save it to an output file. I am taking a CSV file as input. Can anyone suggest something?
The simplest way would probably be to use Pig, with something like:
orig = load 'filename.csv' using PigStorage(',') as (first,second,third:chararray,...);
filtered_orig= FILTER orig by third=="somevalue";
store filtered_orig into 'newfilename' using PigStorage(',');
If you need scalability, you can use Hadoop in the following way:
Install Hadoop, install Hive, and put your CSV files into HDFS.
Define the CSV file as an external table (http://hive.apache.org/docs/r0.8.1/language_manual/data-manipulation-statements.html) and then you can write SQL against the CSV file. The query results can then be exported back to CSV.
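If you want to stay at the plain MapReduce level instead of Pig or Hive, filtering is a map-only job, so a small Hadoop Streaming mapper is enough. A minimal Python sketch (the target value "somevalue" and the third-field index are assumptions matching the Pig example above):
#!/usr/bin/env python
# filter_mapper.py: map-only Hadoop Streaming job that keeps only CSV records
# whose third field equals "somevalue" (value and column index are assumptions).
import csv
import sys

TARGET = "somevalue"

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
for record in reader:
    # Emit only matching, well-formed records; no reducer is needed.
    if len(record) >= 3 and record[2] == TARGET:
        writer.writerow(record)

# Run it roughly like this (paths are placeholders):
# hadoop jar hadoop-streaming-*.jar -D mapreduce.job.reduces=0 \
#   -input /data/filename.csv -output /data/filtered \
#   -mapper filter_mapper.py -file filter_mapper.py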
