How to specify schema while reading parquet file with pyspark?

How to specify schema while reading parquet file with pyspark? - hadoop

While reading a parquet file stored in hadoop with either scala or pyspark an error occurs:
#scala
var dff = spark.read.parquet("/super/important/df")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
... 52 elided
or
sql_context.read.parquet(output_file)
results in the same error.
Error message is pretty clear about what has to be done: Unable to infer schema for Parquet. It must be specified manually.;.
But where can I specify it?
Spark 2.1.1, Hadoop 2.5, dataframes are created with a help of pyspark. Files are partitioned into 10 peaces.

This error usually occurs when you try to read an empty directory as parquet.
If for example you create an empty DataFrame, you write it in parquet and then read it, this error appears.
You could check if the DataFrame is empty with rdd.isEmpty() before write it.

I have done a quick implementation for the same
Hope this Helps!!...

Related

Ab-Initio dml-to-hive

I am using AbInitio and attempting to have my results from my query in my Input Table populated into hdfs. I am wanting the format in parquet. I tried using the dml to hive text but the following is my results and I am not sure what this means.
$ dml-to-hive text $AI_DML/myprojectdml.dml
Usage: dml-to-avro <record_format> <output_file>
or: dml-to-avro help
<record-format> is one of:
<filename> Read record format from file
-string <string> Read record format from string
<output_file> is one of:
<filename> Output Avro schema to file
- Output Avro schema to standard output
I also tried using the Write Hive Table component but I receive the following error:
[B276]
The internal charset "XXcharset_NONE" was encountered when a valid character set data
structure was expected. One possible cause of this error is that you specified a
character set to the Co>Operating System that is misspelled or otherwise incorrect.
If you cannot resolve the error please contact Customer Support.
Any help would be great, I am trying to have my output to hdfs in parquet.
Thanks,
Chris Richardson

I know this is a late reply, but if you're still working on this or somebody else stumbles onto this like I did, I think I've found a solution.
I used dml-to-hive to create a DML for parquet format and write it to a file.
dml-to-hive parquet current.dml > parquet.dml
Once this dml is created, you can use it on the in port of the "Write HDFS" component. Double click the component, go to Port tab, click Radio button "Use File" and then point it to parquet.dml
Then, just set the WRITE_FORMAT choice to parquet and give it a whirl. I was able to create parquet, orc, and avro files using the above process.

Deserialize protobuf column with Hive

I am really new to Hive, I apologize if there are any misconceptions in my question.
I need to read a hadoop Sequence File into a Hive table, the sequence file is thrift binary data, which could be deserialized using SerDe2 that comes with Hive.
The problem now is: One column in the file is encoded with Google protobuf, so when thrift SerDe processes the sequence file it does not process the protobuf encoded column properly.
I wonder if there's a way in Hive to deal with this kind of protobuf encoded columns that are nested inside a thrift sequence file, so that each column could be parsed properly?
Thank you so much for any possible help!

I believe you should use some other serde to deserialize the proto buff format,
may be you can refer this,
https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive

Adding parquet-avro support to scalding

How can I create a Scalding Source that will handle conversions between avro and parquet.
The solution should:
1. Read from parquet format and convert to avro memory representation
2. Write avro objects into a parquet file
Note: I noticed Cascading has a module for leveraging thrift and parquet. It occurs to me that this would be a good place to start looking. I also opened a thread on google-groups/scalding-dev

Try our latest changes in this fork -
https://github.com/epishkin/scalding/tree/parquet_avro/scalding-parquet

Read Snappy compressed Hive RCFile in Apache Pig

Trying to read Hive files in Pig using http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html
Fies have RCF, SnappyCodec and hive.io.rcfile.column.number words in its beginning, they are binary files. Moreover they are partitioned over multiple directories (like /day=20140701).
However simple script of loading, grouping and counting rows prints nothing to output. If I try to add "ILLUSTRATE" like this:
rows = LOAD ... using HiveColumnarLoader ...;
ILLUSTRATE rows;
I get error like this:
2014-07-17 14:16:43,086 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
...
I'm not sure, whether it is because of Snappy compression or some trouble with specifying schema (I copied it from hive, describe table).
Could anyone please confirm that HiveColumnarLoader works with snappy compressed files or propose another approach?
Thanks in advance!

Have you tried the HCatLoader?
rows = LOAD 'tablename' using org.apache.hcatalog.pig.HCatLoader();

Running Pig query over data stored in Hive

I would like to know how to run Pig queries stored in Hive format. I have configured Hive to store compressed data (using this tutorial http://wiki.apache.org/hadoop/Hive/CompressedStorage).
Before that I used to just use normal Pig load function with Hive's delimiter (^A). But now Hive stores data in sequence files with compression. Which load function to use?
Note that don't need close integration like mentioned here: Using Hive with Pig, just what load function to use to read compressed sequence files generated by Hive.
Thanks for all the answers.

Here's what I found out:
Using HiveColumnarLoader makes sense if you store data as a RCFile. To load table using this you need to register some jars first:
register /srv/pigs/piggybank.jar
register /usr/lib/hive/lib/hive-exec-0.5.0.jar
register /usr/lib/hive/lib/hive-common-0.5.0.jar
a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.HiveColumnarLoader('ts int, user_id int, url string');
To load data from Sequence file you have to use PiggyBank (as in previous example). SequenceFile loader from Piggybank should handle compressed files:
register /srv/pigs/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/user/hive/warehouse/table' USING SequenceFileLoader AS (int, int);
This doesn't work with Pig 0.7 because it's unable to read BytesWritable type and cast it to Pig type and you get this exception:
2011-07-01 10:30:08,589 WARN org.apache.pig.piggybank.storage.SequenceFileLoader: Unable to translate key class org.apache.hadoop.io.BytesWritable to a Pig datatype
2011-07-01 10:30:08,625 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class org.apache.hadoop.io.BytesWritable to a Pig datatype
at org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:132)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
How to compile piggybank is described here: Unable to build piggybank -> /home/build/ivy/lib does not exist

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to specify schema while reading parquet file with pyspark? - hadoop

This error usually occurs when you try to read an empty directory as parquet. If for example you create an empty DataFrame, you write it in parquet and then read it, this error appears. You could check if the DataFrame is empty with rdd.isEmpty() before write it.

I have done a quick implementation for the same Hope this Helps!!...

Related

Ab-Initio dml-to-hive

Deserialize protobuf column with Hive

Adding parquet-avro support to scalding

Read Snappy compressed Hive RCFile in Apache Pig

Running Pig query over data stored in Hive

Categories

Resources