With the Auto Loader feature, the documentation says the cloudFiles.format option supports json, csv, text, parquet, binary, and so on. I wanted to know whether there is support for XML?
For streaming file data sources, the supported file formats are text, CSV, JSON, ORC, and Parquet. My assumption is that only these streaming file formats are supported.
Not sure whether you got a chance to go through https://github.com/databricks/spark-xml; the spark-xml library handles more complex XML files. If you want to make use of it, though, Auto Loader won't help, since it is a batch reader rather than a streaming source.
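For example, a one-off batch read with spark-xml looks roughly like this (a sketch: the rowTag value "record" and the input path are placeholders, and the library must be installed on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# spark-xml registers the "xml" data source; rowTag names the XML element
# that becomes one DataFrame row ("record" is a placeholder here)
df = (spark.read.format("xml")
      .option("rowTag", "record")
      .load("/mnt/raw/*.xml"))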
I'm using NiFi 1.11.4 to read CSV files from an SFTP server, do a few transformations, and then drop them off on GCS. Some of the files contain no content, only a header line. During my transformations I convert the files to the AVRO format, but when converting back to CSV, no output file is produced for the files whose content is empty.
I have the following settings for the Processor:
And for the Controller:
I did find the following topic: "How to use ConvertRecord and CSVRecordSetWriter to output header (with no data) in Apache NiFi?", and its comments explicitly mention that ConvertRecord should cover this since 1.8. Sadly, either I understood that incorrectly and it does not work this way, or my setup is wrong.
While I could make it work by explicitly writing the schema as a header line to the empty files, I wanted to know whether there is a more elegant way.
This is my first time using Ruby. I'm writing an application that parses data and performs some calculations based on it; the source of the data is a JSON file. I'm aware I can use JSON.parse() here, but I'm trying to write my program so that it will work with other sources of data. Is there a clear-cut way of doing this? Thank you.
When your source file is JSON, use JSON.parse; do not implement a JSON parser on your own. If the source file is a CSV, use the CSV class.
When your application should be able to read multiple different formats, just add one reader class for each data type (JSONReader, CSVReader, etc.), and then decide from the file extension which reader to use, as sketched below.
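A minimal sketch of that pattern, written in Python for illustration (the Ruby version is analogous, and all class and function names here are made up):

import csv
import json
import os

class JSONReader:
    def read(self, path):
        with open(path) as f:
            return json.load(f)

class CSVReader:
    def read(self, path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

# map file extensions to readers
READERS = {".json": JSONReader(), ".csv": CSVReader()}

def read_data(path):
    ext = os.path.splitext(path)[1].lower()
    return READERS[ext].read(path)

The calculation code then only ever sees the parsed data, never the file format.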
We have an EBCDIC mainframe-format file that is already loaded into HDFS, along with its corresponding COBOL copybook structure. We have to read this file from HDFS, convert the data into ASCII, and split it into a DataFrame based on its COBOL structure. I've tried some options that didn't seem to work. Could anyone please suggest a proven or working approach?
For Python, take a look at the copybook package (https://github.com/zalmane/copybook). It supports most Copybook features, including REDEFINES and OCCURS, as well as a wide variety of PIC formats.
pip install copybook
import copybook
root = copybook.parse_file('sample.cbl')
For parsing into a PySpark DataFrame, you can use a flattened list of fields and a UDF that slices each field out of the record by its offset:
offset_list = root.to_flat_list()
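A rough sketch of that UDF (the field attributes name, start, and length are assumptions about what to_flat_list() returns, so check the package docs; raw_df stands for a DataFrame with one fixed-width record per row in a value column):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# slice each fixed-width field out of the record by offset and length
parse_udf = udf(
    lambda line: [line[f.start:f.start + f.length] for f in offset_list],
    ArrayType(StringType()),
)
df = raw_df.withColumn("fields", parse_udf("value"))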
Disclaimer: I am the maintainer of https://github.com/zalmane/copybook.
Find the COBOL Language Reference manual and research the intrinsic functions DISPLAY-OF and NATIONAL-OF. See https://www.ibm.com/support/pages/how-convert-ebcdic-ascii-or-ascii-ebcdic-cobol-program.
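If you would rather do the character conversion off the mainframe instead of in COBOL, note that Python ships EBCDIC codecs (code page 037, among others), so plain DISPLAY fields can be decoded directly; a tiny sketch (binary COMP fields cannot be decoded this way and still need the copybook offsets):

raw = b"\xc8\x85\x93\x93\x96"  # the string "Hello" in EBCDIC code page 037
text = raw.decode("cp037")     # decode EBCDIC bytes to a Python string
print(text)                    # -> Hello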
If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record sits on its own line of the input file and calls the Mapper class once per line to get a key-value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
I'm new to Hadoop MapReduce; is there any article/link/video that shows the steps to build a custom input format?
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format like XML that is not inherently splittable?
Solution
MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine-learning system, which provides an XmlInputFormat.
So, what I mean is, there is no need to write a custom input format, since the Mahout library already provides one.
I am not sure whether you want to read or write XML, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, note that XmlInputFormat extends TextInputFormat.
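Conceptually, XmlInputFormat just scans the byte stream for a configured start tag and end tag (Mahout wires these through the xmlinput.start and xmlinput.end configuration keys) and emits everything in between as one record. A toy re-implementation of that idea in Python, just to show the mechanics (the file name and tags are placeholders):

def xml_records(stream, start_tag, end_tag):
    # find start_tag, then the matching end_tag, and yield the
    # enclosed span as one record -- the core of XmlInputFormat
    buf = stream.read()
    pos = 0
    while True:
        s = buf.find(start_tag, pos)
        if s == -1:
            return
        e = buf.find(end_tag, s)
        if e == -1:
            return
        yield buf[s:e + len(end_tag)]
        pos = e + len(end_tag)

for record in xml_records(open("books.xml"), "<book>", "</book>"):
    print(record)

The real class additionally respects HDFS split boundaries, so each mapper only emits records whose start tag falls inside its own split.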
I want to read PDF files using Hadoop; how is that possible? I only know that Hadoop can process txt files, so is there any way to parse PDF files to txt? Any suggestions are welcome.
An easy way would be to create a SequenceFile to contain the PDF files. A SequenceFile is a binary file format, and you could make each of its records one PDF. To do this, you would create a class derived from Writable that contains the PDF bytes and any metadata you need. Then you could use any Java PDF library, such as PDFBox, to manipulate the PDFs.
Processing PDF files in Hadoop can also be done by extending the FileInputFormat class; call the subclass WholeFileInputFormat. In WholeFileInputFormat you override the getRecordReader() method (and return false from isSplitable(), so files are never chopped up); each PDF is then received as one individual input split, which can be parsed to extract the text. This link gives a clear example of how to extend FileInputFormat.
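As an aside, if you can use Spark on the cluster instead of raw MapReduce, sc.binaryFiles already gives you this whole-file-per-record behavior. A sketch (the pypdf dependency and the input path are assumptions here; any PDF library would do the extraction step):

import io
from pyspark.sql import SparkSession
from pypdf import PdfReader  # assumed to be installed on the executors

spark = SparkSession.builder.getOrCreate()

def pdf_to_text(path_and_bytes):
    path, data = path_and_bytes
    reader = PdfReader(io.BytesIO(data))
    # join the extracted text of every page into one string
    return path, "\n".join(page.extract_text() or "" for page in reader.pages)

# binaryFiles yields (path, contents) pairs -- one whole PDF per record,
# analogous to WholeFileInputFormat
texts = spark.sparkContext.binaryFiles("hdfs:///pdfs/*.pdf").map(pdf_to_text)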