I am in the process of understanding the Parquet file format, and there doesn't appear to be a formal specification for it. For example, what is the layout of the metadata? What I do see is a lot of code implementations.
Any help would be appreciated.
Thanks,
Marc
The Apache Parquet format has a formal specification, which resides at https://github.com/apache/parquet-format. Changes are discussed either in the form of a pull request or, if they are larger, on the mailing list of the Apache Parquet project: https://parquet.apache.org/community/.
We have an EBCDIC mainframe-format file which is already loaded into the Hadoop HDFS system. The file has the corresponding COBOL structure as well. We have to read this file from HDFS, convert the data into ASCII format, and split it into a DataFrame based on its COBOL structure. I've tried some options which didn't seem to work. Could anyone please suggest a proven or working approach?
For Python, take a look at the Copybook package (https://github.com/zalmane/copybook). It supports most copybook features, including REDEFINES and OCCURS, as well as a wide variety of PIC formats.
pip install copybook
import copybook
root = copybook.parse_file('sample.cbl')
For parsing into a PySpark dataframe, you can get a flattened list of fields and use a UDF that parses each record based on the field offsets:
offset_list = root.to_flat_list()
Disclaimer: I am the maintainer of https://github.com/zalmane/copybook
Find the COBOL Language Reference manual and research the functions DISPLAY-OF and NATIONAL-OF. See: https://www.ibm.com/support/pages/how-convert-ebcdic-ascii-or-ascii-ebcdic-cobol-program.
If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record sits on its own line of the input file, and the Mapper class is called once per line to produce a key-value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
I'm new to Hadoop MapReduce; is there any article/link/video that shows the steps to build a custom input format?
thanks
nath
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So what I mean is that there is no need to write a custom input format, since the Mahout library already provides one.
I am not sure whether you are going to read or write, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat.
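For illustration, here is a rough sketch of a driver and mapper wired up to Mahout's XmlInputFormat. The package name (org.apache.mahout.classifier.bayes.XmlInputFormat) and the xmlinput.start / xmlinput.end configuration keys come from older Mahout releases, and the <record> tags are just placeholders, so check them against the Mahout version you actually use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlJobDriver {

    // Each map() call receives one complete <record>...</record> block as the value.
    public static class XmlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text xmlChunk, Context ctx)
                throws java.io.IOException, InterruptedException {
            // Parse xmlChunk with an XML parser here; this sketch just passes it through.
            ctx.write(new Text("record"), xmlChunk);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the input format which tags delimit one logical record.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = Job.getInstance(conf, "xml-to-records");
        job.setJarByClass(XmlJobDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XmlMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The point is that XmlInputFormat addresses the splitting problem described above: its record reader scans forward from each split boundary to the next start tag, so each mapper only ever sees whole records.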
I am reading a table into an object and need to generate a pass-through EBCDIC file from it. This is a Spring Batch step. There were some suggestions to use JRecord to write an aggregator and a FlatFileItemWriter.
Any clues?
JRecord is a possible solution; I cannot say whether there is a better solution for you or not, as I do not know anything about Spring Batch. This is perhaps more of an extended comment than a pure answer.
JRecord reads/writes files using a file schema (or file description). Normally this file schema is a COBOL copybook, although it can also be an XML schema; the schema can also be defined in the program if need be. Given you want to write EBCDIC files, I would think a COBOL copybook will be needed at some stage.
JRecord also supports mainframe/COBOL sequential file structures (FB - fixed-width files), which is what you want.
JRecord allows access to fields either by field name or by field index (or field id). Note that record_Type_Index is used to handle files with multiple record types (e.g. header-record, detail-record, footer-record files).
outLine.getFieldValue(record_Type_Index, field_Index).set(...)
or
outLine.getFieldValue("Field-Name").set(...)
Bruce Martin (author of JRecord)
Discussion continued on the JRecord forum:
https://sourceforge.net/p/jrecord/discussion/678634/thread/2709ab72/?limit=25#c009/8287
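To make the JRecord side of this concrete, below is a rough sketch of writing a fixed-width (FB) EBCDIC file via JRecord's IOBuilder interface. The copybook name and the field names are made-up placeholders, and the exact builder/class names can differ between JRecord versions, so check this against the javadocs of the release you use:

import net.sf.JRecord.JRecordInterface1;
import net.sf.JRecord.Common.Constants;
import net.sf.JRecord.Details.AbstractLine;
import net.sf.JRecord.IO.AbstractLineWriter;
import net.sf.JRecord.def.IO.builders.ICobolIOBuilder;

public class WriteEbcdicFile {
    public static void main(String[] args) throws Exception {
        // "customer.cbl" and the field names below are hypothetical placeholders.
        ICobolIOBuilder ioBuilder = JRecordInterface1.COBOL
                .newIOBuilder("customer.cbl")
                .setFont("cp037")                                 // EBCDIC code page
                .setFileOrganization(Constants.IO_FIXED_LENGTH);  // FB / fixed-width records

        AbstractLineWriter writer = ioBuilder.newWriter("customer.ebcdic.bin");
        try {
            AbstractLine outLine = ioBuilder.newLine();
            outLine.getFieldValue("CUSTOMER-ID").set(12345);
            outLine.getFieldValue("CUSTOMER-NAME").set("SMITH");
            writer.write(outLine);
        } finally {
            writer.close();
        }
    }
}

One design note: setFont controls the output character set, so the EBCDIC conversion happens inside JRecord itself rather than in your own code.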
As the title suggests, I'm looking for a tool that will convert existing data from a Hadoop sequence file to JSON format.
My initial googling has only turned up results related to Jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get its data in the corresponding JSON format.
So, in effect, I'm looking for a tool/utility that will take a Hadoop sequence file as input and produce output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper Java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
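As a sketch of what that might look like (my own illustration, not something that ships with Hadoop): a map-only job that reads the file with SequenceFileInputFormat and uses Jackson's ObjectMapper to emit one JSON line per key/value pair. It assumes the Writables in your file render usefully via toString(); for custom Writables you would map them to a POJO or Map first.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import com.fasterxml.jackson.databind.ObjectMapper;

// Map-only job: in the driver, set the input format to SequenceFileInputFormat
// and the number of reducers to 0, so the JSON lines are written straight out.
public class SequenceFileToJsonMapper
        extends Mapper<Writable, Writable, NullWritable, Text> {

    private final ObjectMapper jackson = new ObjectMapper();

    @Override
    protected void map(Writable key, Writable value, Context ctx)
            throws IOException, InterruptedException {
        // Serialise via toString(); swap in a proper POJO for custom Writables.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("key", key.toString());
        record.put("value", value.toString());
        ctx.write(NullWritable.get(), new Text(jackson.writeValueAsString(record)));
    }
}

Since the file sits on your local machine, the same idea also works without MapReduce: open the file with SequenceFile.Reader in a plain main() and print the JSON lines yourself.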
I thought there must be some tool that does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then again, why do so if something already exists that does just that?
Anyway, I figured out how to do it using Jaql. Here is a sample query that worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});
I just started using Scalding and am trying to find examples of reading a text file and writing to a Hadoop sequence file.
Any help is appreciated.
You can use com.twitter.scalding.WritableSequenceFile (please note that you have to use the fully qualified name, otherwise it picks up the Cascading one). Hope this helps.