How to convert a Parquet file to an Avro file? - hadoop

I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums and they suggested using AvroParquetReader.
AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
But I am not sure how to include AvroParquetReader. I am not able to import it at all.
I can read this file using spark-shell and maybe convert it to some JSON, and then that JSON can be converted to Avro. But I am looking for a simpler solution.

If you are able to use Spark DataFrames, you will be able to read the parquet files natively in Apache Spark, e.g. (in Python pseudo-code):
df = spark.read.parquet(...)
To save the files, you can use the spark-avro Spark Package. To write the DataFrame out as Avro, it would be something like:
df.write.format("com.databricks.spark.avro").save("...")
Don't forget that you will need to include the right version of the spark-avro Spark Package for your version of Spark (e.g. 3.1.0-s2.11 corresponds to spark-avro package 3.1 built with Scala 2.11, which matches the default Spark 2.0 cluster). For more information on how to use the package, please refer to https://spark-packages.org/package/databricks/spark-avro.
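Putting the read and write steps together, a minimal PySpark sketch of the whole conversion might look like this. The paths are placeholders, and it assumes the spark-avro package has already been added to the job (for example via spark-submit --packages, with coordinates matching your Spark and Scala versions):
# Sketch: convert Parquet to Avro with Spark and the Databricks spark-avro package.
# The input/output paths are placeholders; adjust them to your environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-avro").getOrCreate()

# Read the Parquet data natively.
df = spark.read.parquet("hdfs:///data/input.parquet")

# Write it back out in Avro format via the spark-avro data source.
df.write.format("com.databricks.spark.avro").save("hdfs:///data/output_avro")

spark.stop()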
Some handy references include:
Spark SQL Programming Guide
spark-avro Spark Package.

Related

How to run analytics on Parquet files in a non-Hadoop environment

We are generating Parquet files using Apache NiFi in a non-Hadoop environment, and we need to run analytics on those Parquet files.
Apart from using Apache frameworks like Hive, Spark, etc., is there any open source BI or reporting tool which can read Parquet files, or is there any other workaround for this? In our environment we have the Jasper reporting tool.
Any suggestion is appreciated. Thanks.
You can easily process Parquet files in Python:
To read/write Parquet files, you can use pyarrow or fastparquet.
To analyze the data, you can use Pandas (which can even read/write Parquet itself, using one of the implementations mentioned in the previous item behind the scenes).
To get a nice interactive data exploration environment, you can use Jupyter Notebook.
All of these work in a non-Hadoop environment.
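For example, a minimal pandas sketch using the pyarrow engine (the file names are placeholders):
# Sketch: load a Parquet file into pandas and run a quick analysis, no Hadoop needed.
import pandas as pd

df = pd.read_parquet("data.parquet", engine="pyarrow")  # or engine="fastparquet"
print(df.describe())                                    # quick summary statistics
df.to_parquet("results.parquet", engine="pyarrow")      # write results back out
The same code can be run interactively inside a Jupyter Notebook for exploration.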

Read protocol buffer files in Apache Beam

I have a bunch of protobuf files in GCS and I would like to process them through Dataflow (Java SDK), and I am not sure how to do that.
Apache Beam provides AvroIO to read Avro files:
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
        .from("gs://my_bucket/path/to/records-*.avro"));
Is there anything similar for reading protobuf files?
Thanks in advance

Import MySQL data to HDFS using Apache NiFi

I am a learner of Apache NiFi and am currently exploring how to import MySQL data to HDFS using Apache NiFi.
Please guide me on creating the flow by providing a doc describing the end-to-end flow.
I have searched several sites; it is not available.
To import MySQL data, you would create a DBCPConnectionPool controller service, pointing at your MySQL instance, driver, etc. Then you can use any of the following processors to get data from your database (please see the documentation for usage of each):
ExecuteSQL
QueryDatabaseTable
GenerateTableFetch
Once the data is fetched from the database, it is usually in Avro format. If you want it in another format, you will need to use some conversion processor(s) such as ConvertAvroToJSON. When the content of the flow file(s) is the way you want it, you can use PutHDFS to place the files into HDFS.

namenode.LeaseExpiredException during df.write.parquet when reading from a non-HDFS source

I have Spark code that runs on a YARN cluster and converts CSV to Parquet using the Databricks library.
It works fine when the CSV source is HDFS. But when the CSV source is non-HDFS, which is usually the case, I come across this exception.
It should not happen, as the same code works for an HDFS CSV source.
Complete link to the issue:
https://issues.apache.org/jira/browse/SPARK-19344
As discussed in the comments:
When the files are on the driver node but not accessible by the worker nodes, the read will fail.
When reading an input file (e.g. with spark.read in Spark 2.0), the files should be accessible by all executor nodes (e.g. when the files are on HDFS, Cassandra, etc.).
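As a minimal illustration (the paths are hypothetical), the input location has to be readable by every executor, not only by the driver:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Works on a cluster: HDFS (or another shared store) is visible to all executors.
df = spark.read.csv("hdfs:///data/input.csv", header=True)
df.write.parquet("hdfs:///data/output.parquet")

# Likely to fail on a multi-node cluster: the file exists only on the driver node.
# df = spark.read.csv("file:///home/user/local_input.csv", header=True)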

Hadoop Pig or Streaming and Zip Files

Using Pig or Hadoop Streaming, has anyone loaded and uncompressed a zipped file? The original CSV file was compressed using PKZIP.
Not sure if this helps, because it's mainly focused on using MapReduce in Java, but there is a ZipFileInputFormat available for Hadoop. Its use via the Java API is described here:
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
The main part of this is the ZipFileRecordReader, which uses Java's ZipInputStream to process each ZipEntry. The Hadoop reader is probably not going to work for you out of the box, because it passes the file path of each ZipEntry as the key and the ZipEntry contents as the value.
