How to run analytics on Parquet files in a non-Hadoop environment - parquet

We are generating Parquet files using Apache NiFi in a non-Hadoop environment, and we need to run analytics on these Parquet files.
Apart from using Apache frameworks like Hive, Spark, etc., is there any open-source BI or reporting tool that can read Parquet files, or is there some other workaround for this? In our environment we have the Jasper reporting tool.
Any suggestion is appreciated. Thanks.

You can easily process Parquet files in Python:
To read/write Parquet files, you can use pyarrow or fastparquet.
To analyze the data, you can use Pandas (which can even read/write Parquet itself, using one of the implementations mentioned in the previous item behind the scenes).
To get a nice interactive data exploration environment, you can use Jupyter Notebook.
All of these work in a non-Hadoop environment.
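For example, a minimal sketch of loading one of the NiFi-generated Parquet files with pandas and pyarrow; the file path and column name below are placeholders:

# A minimal sketch: read a Parquet file produced by NiFi and explore it with pandas.
# Assumes pyarrow and pandas are installed (pip install pyarrow pandas); the path
# "events.parquet" and the column "some_column" are placeholders.
import pandas as pd

df = pd.read_parquet("events.parquet", engine="pyarrow")  # fastparquet also works as the engine

# Ordinary pandas analytics from this point on.
print(df.describe())
print(df.groupby("some_column").size())

From here you can export results to CSV or feed them into a reporting tool such as Jasper.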

Related

namenode.LeaseExpiredException while df.write.parquet when reading from non-hdfs source

I have Spark code that runs on a YARN cluster and converts CSV to Parquet using the Databricks library.
It works fine when the CSV source is HDFS, but when the CSV source is non-HDFS, which is usually the case, I get this exception.
It should not happen, since the same code works for an HDFS CSV source.
Complete link to the issue :
https://issues.apache.org/jira/browse/SPARK-19344
As discussed in the comments:
When the files are on the driver node but not accessible by the worker nodes, the read will fail.
When reading input files (e.g. with spark.read in Spark 2.0), the files must be accessible by all executor nodes (e.g. when the files are on HDFS, Cassandra, etc.).
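As a hedged illustration, the paths below are placeholders for any storage that all executors can reach (HDFS, S3, a shared mount, etc.); a plain local path only works if it exists on every executor or you run in local mode:

# Sketch (PySpark 2.x): read a CSV that every executor can reach, then write Parquet.
# "hdfs:///shared/..." is a placeholder; "file:///home/user/data.csv" would fail on a
# cluster unless that path exists on every node.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

df = spark.read.csv("hdfs:///shared/input/data.csv", header=True, inferSchema=True)
df.write.parquet("hdfs:///shared/output/data.parquet")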

How to convert parquet file to Avro file?

I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums and they suggested using AvroParquetReader.
AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
But I am not sure how to include AvroParquetReader; I am not able to import it at all.
I can read this file using spark-shell and maybe convert it to some JSON, and then that JSON can be converted to Avro. But I am looking for a simpler solution.
If you are able to use Spark DataFrames, you will be able to read the Parquet files natively in Apache Spark, e.g. (in Python pseudo-code):
df = spark.read.parquet(...)
To save the files as Avro, you can use the spark-avro Spark package. Writing the DataFrame out as Avro looks something like:
df.write.format("com.databricks.spark.avro").save("...")
Don't forget that you will need to include the right version of the spark-avro Spark package for your version of Spark (e.g. 3.1.0-s2.11 corresponds to spark-avro package 3.1 using Scala 2.11, which matches the default Spark 2.0 cluster). For more information on how to use the package, please refer to https://spark-packages.org/package/databricks/spark-avro.
Some handy references include:
Spark SQL Programming Guide
spark-avro Spark Package.
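Putting the two snippets together, an end-to-end sketch might look like the following; the paths are placeholders and the package coordinate is only an example for a Scala 2.11 / Spark 2.0 setup:

# Sketch of the Parquet -> Avro conversion described above, assuming Spark 2.x with
# the spark-avro package on the classpath, e.g. started via something like:
#   spark-submit --packages com.databricks:spark-avro_2.11:3.1.0 convert.py
# Input and output paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-avro").getOrCreate()

df = spark.read.parquet("/path/to/input.parquet")
df.write.format("com.databricks.spark.avro").save("/path/to/output_avro")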

Different tools available for creating data pipelines

I need to create data pipelines in Hadoop. I have data import, export, and data-cleaning scripts set up, and now I need to tie them together in a pipeline.
I have been using Oozie for data import and export schedules, but now I need to integrate R scripts for the data-cleaning process as well.
I see that Falcon is used for the same purpose.
How do I install Falcon in Cloudera?
What other tools are available to create data pipelines in Hadoop?
2) For your second question, I'm tempted to answer NiFi from Hortonworks; since this post on LinkedIn it has grown a lot and it is very close to replacing Oozie. At the time of writing, the main difference between Oozie and NiFi is where they run: NiFi runs on an external cluster, while Oozie runs inside Hadoop.

Different file process in hadoop

I have installed Hadoop and Hive. I can process and query xls and tsv files using Hive. I want to process other files such as docx, pdf, and ppt. How can I do this? Is there any separate procedure to process these files in AWS? Please help me.
There isn't any difference in consuming those files compared to any other Hadoop platform. For easy access and durable storage, you may put those files in S3.
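For example, a rough sketch of staging such documents in S3 with boto3; the bucket name and paths are made up, and AWS credentials are assumed to be configured:

# Sketch: stage documents in S3 so they are durably stored and reachable from the cluster.
# Bucket name and file paths are placeholders.
import boto3

s3 = boto3.client("s3")
for local_path, key in [("reports/q1.pdf", "docs/q1.pdf"),
                        ("slides/intro.pptx", "docs/intro.pptx")]:
    s3.upload_file(local_path, "my-document-bucket", key)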

Hadoop Basics: What do I do with the output?

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)
I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.
Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.
This question (Hadoop and MySQL Integration) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way...
Enlighten me.
At Foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.
I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using Thrift directly is very easy and allows you to use any programming language.
If you're dealing with Hadoop directly (and can't use this), you should check out Sqoop, built by Cloudera.
Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
Hope that helps.
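As a rough illustration of the Thrift approach from Python: PyHive is just one of several Thrift-based Hive clients, and the host, table, and query below are assumptions:

# Minimal sketch of querying Hive over its Thrift interface (HiveServer2) from Python,
# in the spirit of the job-server approach above. Host, port, username, and the query
# are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl")
cursor = conn.cursor()
cursor.execute("SELECT report_date, metric, value FROM nightly_output")

for row in cursor.fetchall():
    # Push each row wherever it needs to go: a CSV, a spreadsheet, another warehouse.
    print(row)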
