PYSPARK - Reading, converting and splitting an EBCDIC mainframe file into a DataFrame (ASCII)

We have an EBCDIC mainframe-format file which is already loaded into HDFS. The file has a corresponding COBOL copybook structure as well. We have to read this file from HDFS, convert the data to ASCII, and split it into a DataFrame based on its COBOL structure. I've tried some options which didn't seem to work. Could anyone please suggest a proven or working approach?

For Python, take a look at the copybook package (https://github.com/zalmane/copybook). It supports most copybook features, including REDEFINES and OCCURS, as well as a wide variety of PIC formats.
pip install copybook
import copybook
root = copybook.parse_file('sample.cbl')
For parsing into a PySpark dataframe, you can use a flattened list of fields and use a UDF to parse based on the offsets:
offset_list = root.to_flat_list()
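A rough sketch of that idea follows; it is only a sketch, and the attribute names name, start, and length on the flattened fields are assumptions here, so check the copybook documentation for the actual API:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructType, StructField

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one fixed-width record per row, already converted to ASCII.
raw_df = spark.read.text("hdfs:///data/converted_ascii_file")

# Build a schema and an offset table from the flattened copybook fields
# (attribute names below are assumptions).
schema = StructType([StructField(f.name, StringType()) for f in offset_list])
offsets = [(f.start, f.length) for f in offset_list]

@udf(returnType=schema)
def split_record(record):
    # Slice the fixed-width record into one value per copybook field.
    return tuple(record[start:start + length] for (start, length) in offsets)

df = raw_df.select(split_record("value").alias("rec")).select("rec.*")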
Disclaimer: I am the maintainer of https://github.com/zalmane/copybook

Find the COBOL Language Reference manual and research the intrinsic functions DISPLAY-OF and NATIONAL-OF. See: https://www.ibm.com/support/pages/how-convert-ebcdic-ascii-or-ascii-ebcdic-cobol-program.
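If you end up doing the conversion on the Python side instead of in COBOL, the standard library codecs already cover common EBCDIC code pages. A minimal sketch, assuming code page 037 and a fixed record length (both are assumptions; check the copybook and the mainframe transfer settings):

RECORD_LENGTH = 100  # hypothetical fixed record length taken from the copybook

with open("mainframe_file.bin", "rb") as f:  # hypothetical local copy of the HDFS file
    while True:
        record = f.read(RECORD_LENGTH)
        if not record:
            break
        # cp037 is one common EBCDIC code page; yours may be cp500, cp1047, etc.
        print(record.decode("cp037"))

Note that a plain character decode like this only handles display (text) fields; packed-decimal (COMP-3) and binary fields need dedicated unpacking, which is exactly where a copybook-aware parser helps.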

Related

Autoload Feature File Format

With the Auto Loader feature, as per the documentation, the cloudFiles.format configuration supports json, csv, text, parquet, binary and so on. I wanted to know whether there is support for XML?
For streaming file data sources, the supported file formats are text, CSV, JSON, ORC and Parquet. My assumption is that only these streaming file formats are supported.
Not sure if you got a chance to go through https://github.com/databricks/spark-xml for handling more complex XML files with the spark-xml library. If you want to make use of it, Auto Loader won't help.
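If spark-xml does fit your case, a plain batch read (not Auto Loader) could look like the sketch below; the package version, rowTag value and path are assumptions:

# Needs the spark-xml package on the cluster, e.g.
# --packages com.databricks:spark-xml_2.12:0.15.0   (version is an assumption)
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")        # hypothetical element that marks one row
      .load("/mnt/landing/*.xml"))       # hypothetical path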

Mapreduce XML input format - to build custom format

If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record is on its own line of the input file, and the Mapper class is called for each line to get a key/value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
Being new to Hadoop MapReduce, is there any article/link/video that shows the steps to build a custom input format?
thanks
nath
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So what I mean is that there is no need to write a custom input format, since the Mahout library already provides one.
I am not sure whether you are going to read or write, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat.
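If you end up on Spark rather than plain MapReduce, the same XmlInputFormat idea is usable through newAPIHadoopFile. A rough PySpark sketch, assuming the XmlInputFormat class bundled with the spark-xml jar and its xmlinput.start/xmlinput.end configuration keys (class name, keys, tags and path are all assumptions here):

conf = {"xmlinput.start": "<record>", "xmlinput.end": "</record>"}  # hypothetical row delimiters

rdd = sc.newAPIHadoopFile(
    "hdfs:///data/input.xml",                   # hypothetical path
    "com.databricks.spark.xml.XmlInputFormat",  # assumed input format class
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf)

xml_records = rdd.values()  # each value is one <record>...</record> block as text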

Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on Hadoop/HDFS libraries. Is it possible to use Parquet outside of HDFS? What is the minimal dependency?
Investigating the same question, I found that apparently it's not possible for the moment.
I found this GitHub issue, which proposes decoupling Parquet from the Hadoop API. Apparently it has not been done yet.
In the Apache Jira I found an issue which asks for a way to read a Parquet file outside Hadoop. It was unresolved at the time of writing.
EDIT:
Issues are not tracked on GitHub anymore (the first link above is dead). A newer issue I found is located on Apache's Jira with the following headline:
make it easy to read and write parquet files in java without depending on hadoop
Since it is just a file format, it is obviously possible to decouple Parquet from the Hadoop ecosystem. Nowadays, the simplest approach I could find is through Apache Arrow; see here for a Python example.
Here is a small excerpt from the official PyArrow docs:
Writing
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))

table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')
Reading
pq.read_table('example.parquet', columns=['one', 'three'])
EDIT:
With Pandas directly
It is also possible to use pandas directly to read and write DataFrames. This makes it as simple as my_df.to_parquet("myfile.parquet") and my_df = pd.read_parquet("myfile.parquet").
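A minimal round trip along those lines (a Parquet engine such as pyarrow or fastparquet must be installed for pandas to delegate to):

import pandas as pd

df = pd.DataFrame({"one": [-1.0, 2.5], "two": ["foo", "bar"]})
df.to_parquet("myfile.parquet")        # pandas hands off to pyarrow/fastparquet
df2 = pd.read_parquet("myfile.parquet")
print(df2)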
You don't need HDFS/Hadoop to consume Parquet files. There are different ways to consume Parquet:
- You could access it using Apache Spark (see the sketch below).
- If you are on AWS, you can directly load or access it from Redshift or Athena.
- If you are on Azure, you can load or access it from SQL Data Warehouse or SQL Server.
- Similarly in GCP as well.
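For the Spark route, a minimal PySpark sketch (local file path assumed, no HDFS involved):

from pyspark.sql import SparkSession

# A local SparkSession is enough; no HDFS cluster is required.
spark = SparkSession.builder.appName("parquet-local").getOrCreate()

df = spark.read.parquet("file:///tmp/example.parquet")  # hypothetical local path
df.show()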
Late to the party, but I've been working on something that should make this possible: https://github.com/jmd1011/parquet-readers.
This is still under development, but a final implementation should be out within a month or two of writing this.
Edit: Months later, and still working on this! It is under active development, just taking longer than expected.
What type of data do you have in Parquet? You don't require HDFS to read Parquet files; it is definitely not a prerequisite. We use Parquet files at Incorta for our staging tables. We do not ship with a dependency on HDFS; however, you can store the files on HDFS if you want. Obviously, we at Incorta can read directly from the Parquet files, but you can also use Apache Drill to connect: use file:/// as the connection instead of hdfs:///. See below for an example.
To read or write Parquet data, you need to include the Parquet format in the storage plugin format definitions. The dfs plugin definition includes the Parquet format.
{
  "type" : "file",
  "enabled" : true,
  "connection" : "file:///",
  "workspaces" : {
    "json_files" : {
      "location" : "/incorta/tenants/demo//drill/json/",
      "writable" : false,
      "defaultInputFormat" : "json"
    }
  }
}
Nowadays you don't need to rely on Hadoop as heavily as before.
Please see my other post: How to view Apache Parquet file in Windows?

How to convert hadoop sequence file to json format?

As the title suggests, I'm looking for a tool which will convert existing data from a Hadoop sequence file to JSON format.
My initial googling has only turned up results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So, in effect, I'm looking for a tool/utility which takes a Hadoop sequence file as input and produces output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Java mapper program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
I thought there must be some tool which does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then why do so if you already have something which does just that?
Anyway, I figured out how to do it using jaql. Here's a sample query that worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});
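If a local Spark installation is an option, a small PySpark sketch can do the same conversion without jaql (the paths, and the assumption that the keys/values are common Writables, are hypothetical here):

import json
from pyspark import SparkContext

sc = SparkContext(appName="seq-to-json")

# Reads (key, value) pairs; common Writable types are converted to Python types automatically.
pairs = sc.sequenceFile("file:///path/to/local.seq")  # hypothetical path

# Serialize each pair as one JSON object per output line.
pairs.map(lambda kv: json.dumps({"key": kv[0], "value": kv[1]})) \
     .saveAsTextFile("file:///path/to/json_output")    # hypothetical path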

Pig - load Word documents (.doc & .docx) with pig

I can't load Microsoft Word documents (.doc or .docx) with Pig. When I try to do so using TextLoader(), PigStorage(), or no loader at all, it doesn't work; the output is just weird symbols.
I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to program one at the moment.
I would like to put all the .doc file content in a single chararray bag so that I can later use a filter function to process it.
How could I do this?
Thanks
They are right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write a UDF to load the files directly into Pig, or do some preprocessing to convert all .doc and .docx files into .txt files so that Pig loads those .txt files instead. This link may help you get started in finding a way to convert the files.
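For the preprocessing route, here is a minimal sketch using the python-docx package (it covers .docx only; legacy .doc files would need a different tool such as antiword, and the file names below are placeholders):

# pip install python-docx
from docx import Document

def docx_to_txt(in_path, out_path):
    # Extract the plain paragraph text so Pig's text loaders can read it.
    doc = Document(in_path)
    text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)

docx_to_txt("report.docx", "report.txt")  # hypothetical file names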
However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.
Update: Here are a couple of resources I've used for writing my Java (Load) UDFs in the past. One, Two.
