Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on Hadoop/HDFS libraries. Is it possible to use Parquet outside of HDFS? Or what is the minimum dependency?

Investigating the same question, I found that apparently it's not possible for the moment.
I found this Git issue, which proposes decoupling Parquet from the Hadoop API. Apparently it has not been done yet.
In the Apache Jira I found an issue which asks for a way to read a Parquet file outside Hadoop. It is unresolved at the time of writing.
EDIT:
Issues are not tracked on GitHub anymore (the first link above is dead). A newer issue I found is located on Apache's Jira with the following headline:
make it easy to read and write parquet files in java without depending on hadoop

Since it is just a file format, it is obviously possible to decouple Parquet from the Hadoop ecosystem. Nowadays the simplest approach I could find is through Apache Arrow; see here for a Python example.
Here is a small excerpt from the official PyArrow docs:
Writing
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))

table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')
Reading
pq.read_table('example.parquet', columns=['one', 'three'])
EDIT:
With Pandas directly
It is also possible to use pandas directly to read and write DataFrames. This makes it as simple as my_df.to_parquet("myfile.parquet") and my_df = pd.read_parquet("myfile.parquet").
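A minimal sketch of that round-trip, assuming a Parquet engine such as pyarrow (or fastparquet) is installed; no Hadoop is involved:
import pandas as pd

# Write a DataFrame to a Parquet file and read it back;
# requires a Parquet engine (pyarrow or fastparquet), but no Hadoop.
df = pd.DataFrame({'one': [-1, 2.5], 'two': ['foo', 'bar']})
df.to_parquet("myfile.parquet")
my_df = pd.read_parquet("myfile.parquet")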

You don't need HDFS/Hadoop to consume Parquet files. There are different ways to consume Parquet:
You could access it using Apache Spark (see the PySpark sketch below).
If you are on AWS, you can load or access it directly from Redshift or Athena.
If you are on Azure, you can load or access it from SQL Data Warehouse or SQL Server.
Similarly in GCP as well.
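A minimal PySpark sketch of the Spark route, reading a Parquet file straight from the local filesystem (the path and column name are placeholders):
from pyspark.sql import SparkSession

# Read a Parquet file from the local filesystem; no HDFS URI involved.
spark = SparkSession.builder.appName("parquet-local").getOrCreate()
df = spark.read.parquet("file:///tmp/example.parquet")
df.select("one").show()
spark.stop()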

Late to the party, but I've been working on something that should make this possible: https://github.com/jmd1011/parquet-readers.
This is still under development, but a final implementation should be out within a month or two of writing this.
Edit: Months later, and still working on this! It is under active development, just taking longer than expected.

What type of data do you have in Parquet? You don't require HDFS to read Parquet files; it is definitely not a prerequisite. We use Parquet files at Incorta for our staging tables. We do not ship with a dependency on HDFS; however, you can store the files on HDFS if you want. Obviously, we at Incorta can read directly from the Parquet files, but you can also use Apache Drill to connect: use file:/// as the connection instead of hdfs:///. See below for an example.
To read or write Parquet data, you need to include the Parquet format in the storage plugin format definitions. The dfs plugin definition includes the Parquet format.
{
  "type" : "file",
  "enabled" : true,
  "connection" : "file:///",
  "workspaces" : {
    "json_files" : {
      "location" : "/incorta/tenants/demo//drill/json/",
      "writable" : false,
      "defaultInputFormat" : "json"
    }
  }
}

Nowadays you don't need to rely on Hadoop as heavily as before.
Please see my other post: How to view Apache Parquet file in Windows?

Related

AutoLoader with a lot of empty parquet files

I want to process some parquet files (with snappy compression) using AutoLoader in Databricks. A lot of those files are empty or contain just one record. Also, I cannot change how they are created, nor compact them.
Here are some of the approaches I tried so far:
I created a Python notebook in Databricks and tried using AutoLoader to load the data. When I run it for a single table/folder, I can process it without a problem. However, when calling that notebook in a for loop for other tables (for item in active_tables_metadata: -> dbutils.notebook.run("process_raw", 0, item)), I only get empty folders in the target.
I created a Databricks Workflows Job and called the same notebook for each table/folder (sending the name/path of the table/folder via a parameter). This way, every table/folder was processed.
I used DBX to package the Python scripts into a wheel and use it inside Databricks Workflows Job tasks as entrypoints. When doing this, I managed to create the same workflow as in point 2 above but, instead of calling a notebook, I am calling a Python script (specified in the entrypoint of the task). Unfortunately, this way I only get empty folders in the target.
I copied all functions used in the DBX Python wheel over to a Python notebook in Databricks and ran the notebook for one table/folder. I only got an empty folder in the target.
I have set the following AutoLoader configurations:
"cloudFiles.tenantId"
"cloudFiles.clientId"
"cloudFiles.clientSecret"
"cloudFiles.resourceGroup"
"cloudFiles.subscriptionId"
"cloudFiles.format": "parquet"
"pathGlobFilter": "*.snappy"
"cloudFiles.useNotifications": True
"cloudFiles.includeExistingFiles": True
"cloudFiles.allowOverwrites": True
I use the following readStream configurations:
spark.readStream.format("cloudFiles")
.options(**CLOUDFILE_CONFIG)
.option("cloudFiles.format", "parquet")
.option("pathGlobFilter", "*.snappy")
.option("recursiveFileLookup", True)
.schema(schema)
.option("locale", "de-DE")
.option("dateFormat", "dd.MM.yyyy")
.option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
.load(<path-to-source>)
And the following writeStream configurations:
df.writeStream.format("delta")
.outputMode("append")
.option("checkpointLocation", <path_to_checkpoint>)
.queryName(<processed_table_name>)
.partitionBy(<partition-key>)
.option("mergeSchema", True)
.trigger(once=True)
.start(<path-to-target>)
My preferred solution would be to use DBX, but I don't know why the job succeeds while I only see empty folders in the target location. This is very strange behavior; I think AutoLoader times out after reading only empty files for some time!
P.S. The same thing also happens when I use Parquet Spark streaming instead of AutoLoader.
Do you know of any reason why this is happening and how I can overcome this issue?
Are you specifying the schema of the streaming read? (Sorry, can't add comments yet)

How to read a SQL table in NiFi?

I am trying to create a basic flow in NiFi:
read a table from SQL
process it in Python
write back another table in SQL
It is as simple as that.
But I am facing issues when I try to read the data in Python.
As far as I have learned, I need to use sys.stdin/stdout.
It only reads and writes as below.
import sys
import pandas as pd

# Read the incoming flowfile content from stdin as CSV,
# then write it back to stdout for the next processor.
file = pd.read_csv(sys.stdin)
file.to_csv(sys.stdout, index=False)
Below you can find the processor properties for QueryDatabaseTableRecord, ExecuteStreamCommand, and PutDatabaseRecord, plus the error message (screenshots omitted), but I don't think they are the issue.
There's a much easier way to do this if you're running 1.12.0 or newer: ScriptedTransformRecord. It's like ExecuteScript except it works on a per-record basis. This is what a simple Groovy script for it looks like:
// Split FullName into FirstName and LastName on whitespace
def fullName = record.getValue("FullName")
def nameParts = fullName.split(/[\s]{1,}/)
record.setValue("FirstName", nameParts[0])
record.setValue("LastName", nameParts[1])
record
It's a new processor, so there's not that much documentation on it yet aside from the (very good) documentation bundled with it, so samples might be sparse at the moment. If you want to use it and run into issues, feel free to join the nifi-users mailing list and ask for more detailed help.

PYSPARK - Reading, Converting and splitting a EBCDIC Mainframe file into DataFrame

We have an EBCDIC mainframe format file that is already loaded into the Hadoop HDFS system. The file comes with the corresponding COBOL structure as well. We have to read this file from HDFS, convert the data into ASCII format, and split it into a DataFrame based on its COBOL structure. I've tried some options which didn't seem to work. Could anyone please suggest some proven or working approaches?
For Python, take a look at the copybook package (https://github.com/zalmane/copybook). It supports most copybook features, including REDEFINES and OCCURS, as well as a wide variety of PIC formats.
pip install copybook
import copybook
root = copybook.parse_file('sample.cbl')
For parsing into a PySpark dataframe, you can use a flattened list of fields and use a UDF to parse based on the offsets:
offset_list = root.to_flat_list()
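A rough sketch of that UDF idea, assuming the raw records have already been decoded from EBCDIC (e.g. with Python's cp037 codec) and loaded via spark.read.text into a column named value; the field attributes used here (name, start, length) are illustrative assumptions, not the copybook package's documented API:
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

# Build a struct schema from the flattened field list and slice each
# fixed-width record by offset. Field attribute names are assumptions.
schema = StructType([StructField(f.name, StringType(), True) for f in offset_list])

@udf(returnType=schema)
def split_record(line):
    return tuple(line[f.start:f.start + f.length] for f in offset_list)

parsed_df = raw_df.select(split_record("value").alias("rec")).select("rec.*")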
Disclaimer: I am the maintainer of https://github.com/zalmane/copybook.
Find the COBOL Language Reference manual and research the intrinsic functions DISPLAY-OF and NATIONAL-OF. The link: https://www.ibm.com/support/pages/how-convert-ebcdic-ascii-or-ascii-ebcdic-cobol-program.

How to convert hadoop sequence file to json format?

As the name suggests, I'm looking for some tool which will convert the existing data from hadoop sequence file to json format.
My initial googling has only shown results related to Jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So, in effect, I'm looking for some tool/utility which will take a Hadoop sequence file as input and produce output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper Java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
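Not the Java Mapper suggested above, but the same idea sketched in PySpark, assuming the sequence file holds Text keys and values (paths are placeholders):
import json
from pyspark import SparkContext

# Read the SequenceFile and write each key/value pair as one JSON line.
sc = SparkContext(appName="seqfile-to-json")
pairs = sc.sequenceFile("file:///path/to/input.seq")
pairs.map(lambda kv: json.dumps({"key": kv[0], "value": kv[1]})) \
     .saveAsTextFile("file:///path/to/output_json")
sc.stop()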
I thought there must be some tool which does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then why do so if something already exists that does just that?
Anyway, I figured out how to do it using Jaql. A sample query which worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

Spark: Export graph data to anything (Hive, text, etc)

I have a Spark Graph that I created this way
val graph = Graph(vertices, edges, defaultArticle).cache
My vertices are an RDD[(Long, (String, Option[String], List[String], Option[String]))] and my edges are an RDD[Edge[Long]].
How do I save this graph/edges/vertices to Hive, a text file, or anything else, and how would I read it back? I looked into the Spark SQL docs and the Spark core docs, but I still don't succeed. If I do saveAsTextFile(), then when I read it back it's an RDD[String], which is not what I need...
EDIT: Daniel has provided an answer to save as an object file... I'm still interested in understanding how to save and read the object above as a Hive table. Thanks!
Instead of rdd.saveAsTextFile()/sc.textFile() use rdd.saveAsObjectFile()/sc.objectFile(). This will use normal Java serialization for each line, stored as a Hadoop SequenceFile.
