ArrayIndexOutOfBoundsException in SplitText processor - hortonworks-data-platform

I am using the NiFi 1.1.1 package. I applied the patch files to the source code by referring to the link below:
https://github.com/apache/nifi/pull/1214/files.
Now the SplitText processor works fine if the Header Line Count is given as 0 or as a value above 1.
The SplitText processor splits the data properly and generates flow files if the input file contains 10,000 rows or fewer.
If the input file contains more than 20,000 rows, it does not split the data and an ArrayIndexOutOfBoundsException occurs.
Please find the error details below.

Related

NiFi ReplaceText Processor inserting Empty strings

I'm trying to convert a fixed-width text file to a pipe-delimited text file. I'm using NiFi's ReplaceText processor to do this. These are my processor configurations:
Replacement Strategy-Regex Replace
Evaluation Mode-Line-by-Line
Line-by-Line Evaluation Mode-All
Search Value- (.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})
replacement Value- ${'$1':trim()}${literal('|'):unescapeXml()}${'$3':trim()}${literal('|'):unescapeXml()}${'$4':trim()}${literal('|'):unescapeXml()}${'$5':toDecimal()}${literal('|'):unescapeXml()}${'$8':trim()}${literal('|'):unescapeXml()}${'$9':trim():toNumber()}${literal('|'):unescapeXml()}${'$10':trim()}${literal('|'):unescapeXml()}${'$11':toNumber()}${literal('|'):unescapeXml()}${'$12':toDecimal()}${literal('|'):unescapeXml()}${'$13':trim()}${literal('|'):unescapeXml()}${header:substring(63,69)}
I'm trying to split each record according to the column lengths provided to me, trim the spaces, and parse the values into different types. In the process I observe that some columns in the output file are randomly empty strings, even though the corresponding records in the fixed-width file contain data. I can't figure out why the expression evaluation is inserting zero-length strings randomly in the file. When I try with a small set of records (about 100) from the original file, it works fine. My original file has 12 million records in it.
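One way to narrow this down is to run the same pattern outside NiFi against a sample of the rows and see which lines fail to match it; a line that is shorter or longer than the fixed-width layout is a likely source of the empty output. This is only a minimal sketch, not what ReplaceText does internally; the class name and the error handling are mine, and only the regex is taken from the Search Value above.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Standalone check of the fixed-width layout: the regex below is the same
    // Search Value used in ReplaceText; everything else is illustrative.
    public class FixedWidthCheck {

        private static final Pattern ROW = Pattern.compile(
            "(.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})");

        // Converts one fixed-width line to a pipe-delimited line, trimming each field.
        public static String toPipeDelimited(String line) {
            Matcher m = ROW.matcher(line);
            if (!m.matches()) {
                // Lines that do not match the layout (e.g. short or padded lines)
                // are worth logging - they are a common cause of odd replacements.
                throw new IllegalArgumentException(
                    "Line does not match the layout (length " + line.length() + ")");
            }
            StringBuilder out = new StringBuilder();
            for (int g = 1; g <= m.groupCount(); g++) {
                if (g > 1) out.append('|');
                out.append(m.group(g).trim());
            }
            return out.toString();
        }
    }

If some of the 12 million lines throw here, those are the rows to inspect first.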

Apache Nifi - Split a large Json file into multiple files with a specified number of records

I am a newbie to NiFi and would like some guidance, please.
We want to split a large JSON file into multiple files, each with a specified number of records. I am able to split a file into individual records using SplitJson with the JsonPath Expression set to $..*. I have also added an UpdateAttribute processor with filename set to ${filename}_${fragment.index} so that we keep the sequence of the files, as order is important.
However, we might want to have, say, 100,000 records split into 100 files of 1,000 records each. What is the easiest way to do this?
Thanks very much in advance
There is a SplitRecord processor. You can define the number of records to be written per split file, for example:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Records Per Split: 3
I have tested this with the following records:
id
1
...
8
and the input is split into 3 files with ids (1,2,3), (4,5,6), and (7,8).
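For intuition, the splitting itself is simple; here is a plain-Java sketch of what SplitRecord with Records Per Split does for a CSV input, numbering the output the way ${filename}_${fragment.index} would. The file names and the 1,000-record chunk size are placeholders, not NiFi settings.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    // Chunk a CSV file into fixed-size splits, repeating the header in each
    // split and numbering the outputs like ${filename}_${fragment.index}.
    public class ChunkCsv {
        public static void main(String[] args) throws IOException {
            Path input = Paths.get("input.csv");   // placeholder input file
            int recordsPerSplit = 1000;            // "Records Per Split"

            List<String> lines = Files.readAllLines(input);
            String header = lines.get(0);
            int fragmentIndex = 0;
            for (int start = 1; start < lines.size(); start += recordsPerSplit) {
                int end = Math.min(start + recordsPerSplit, lines.size());
                List<String> chunk = new ArrayList<>();
                chunk.add(header);                 // keep the header in every split
                chunk.addAll(lines.subList(start, end));
                Files.write(Paths.get(input.getFileName() + "_" + fragmentIndex++), chunk);
            }
        }
    }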

NiFi: how to get maximum timestamp from first column?

NiFi version 1.5
I have a CSV file that arrives for the first time like this:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
I used ListFile -> FetchFile to get the CSV file.
Ten minutes later, I get the appended CSV file:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
2019/02/04 02:00,86787,99.5,0
2019/02/04 03:00,86117,91.5,0
Here, how do we get only the new records (the last two)? I do not want to process the first two records that have already been processed.
My thought process is that we need to get the maximum datetime, store it in an attribute, and use QueryRecord, but I do not know which processor to use to get the maximum datetime.
Is there any better solution?
This is currently an open issue (NIFI-6047), but there has been a community contribution to address it, so you may see the DetectDuplicateRecord processor in an upcoming release of NiFi.
In the meantime, a workaround may be to split up the CSV rows, create a compound key using ExtractText, and then use DetectDuplicate.
This doesn't seem to be a job that is best solved in NiFi, as you need to keep state about what you have already processed. An alternative would be to delete what you have already processed; then you can assume that whatever is in the file has not been processed yet.
"Here, how do we get only the new records (the last two)? I do not want to process the first two records that have already been processed."
From my understanding, the actual question is 'how do we process/ingest CSV rows as they are written to the file?'.
Description of the TailFile processor from the NiFi documentation:
"Tails" a file, or a list of files, ingesting data from the file as it
is written to the file. The file is expected to be textual. Data is
ingested only when a new line is encountered (carriage return or
new-line character or combination)
This solution is appropriate when you don't want to move or delete the actual file.
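If you do want to follow the attribute idea from the question (keep the maximum datetime seen so far and only pass newer rows), the logic looks roughly like the sketch below, written as plain Java rather than as a NiFi flow. The state-file path and the yyyy/MM/dd HH:mm pattern are assumptions taken from the sample data.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.List;

    // Keep the maximum datetime from the first column between runs and only
    // emit rows that are newer than it. Paths and the format are assumptions.
    public class NewRowsOnly {
        private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm");

        public static void main(String[] args) throws IOException {
            Path csv = Paths.get("data.csv");        // the (appended) CSV file
            Path state = Paths.get("last-seen.txt"); // where the max datetime is kept

            LocalDateTime lastSeen = Files.exists(state)
                ? LocalDateTime.parse(Files.readString(state).trim(), FMT)
                : LocalDateTime.MIN;

            LocalDateTime max = lastSeen;
            List<String> lines = Files.readAllLines(csv);
            for (String line : lines.subList(1, lines.size())) {  // skip the header
                LocalDateTime ts = LocalDateTime.parse(line.split(",", 2)[0], FMT);
                if (ts.isAfter(lastSeen)) {
                    System.out.println(line);        // this is a "new" record
                    if (ts.isAfter(max)) max = ts;
                }
            }
            if (max.isAfter(LocalDateTime.MIN)) {
                Files.writeString(state, max.format(FMT));  // remember for the next run
            }
        }
    }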

Record definition in MapReduce on different types of data-sets in Hadoop?

I want to understand the definition of a Record in Hadoop MapReduce for data types other than Text.
Typically, for text data, a record is a full line terminated by a newline.
Now, if we want to process XML data, how does this data get processed? That is, what would the record definition be that the mapper works on?
I have read that there are the concepts of InputFormat and RecordReader, but I didn't understand them well.
Can anyone help me understand the relationship between InputFormat and RecordReader for various types of datasets (other than text), and how the data gets converted into records that the mapper works on?
Let's start with some basic concepts.
From the perspective of a file:
1. A file is a collection of rows.
2. A row is a collection of one or more columns, separated by a delimiter.
3. A file can be of any format: text file, Parquet file, ORC file.
Different file formats store rows (columns) in different ways, and the choice of delimiter also differs.
From the perspective of HDFS:
1. A file is a sequence of bytes.
2. HDFS has no idea of the logical structure of the file, i.e. rows and columns.
3. HDFS doesn't guarantee that a row will be contained within one HDFS block; a row can span two blocks.
InputFormat: the code that knows how to read the file chunks from splits, and at the same time ensures that if a row extends into another split, it is considered part of the first split.
RecordReader: as you read a split, some code (the RecordReader) has to understand how to interpret a row from the bytes read from HDFS (see the minimal sketch after the link below).
For more info:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
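To make the relationship concrete, here is a minimal InputFormat/RecordReader pair that simply delegates to Hadoop's built-in LineRecordReader, so it behaves like TextInputFormat (one line = one record). The class names are mine; for XML you would replace the delegate with code that scans for a start tag and an end tag and returns everything in between as a single record.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Minimal InputFormat/RecordReader pair: the InputFormat decides how the
    // file is split and which reader to use; the RecordReader turns the raw
    // bytes of a split into (key, value) records handed to the mapper.
    public class LineBackedInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new LineBackedRecordReader();
        }

        public static class LineBackedRecordReader
                extends RecordReader<LongWritable, Text> {

            private final LineRecordReader delegate = new LineRecordReader();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException {
                delegate.initialize(split, context);   // open the split, skip a partial first line
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                return delegate.nextKeyValue();        // advance to the next record (line)
            }

            @Override
            public LongWritable getCurrentKey() {
                return delegate.getCurrentKey();       // byte offset of the line
            }

            @Override
            public Text getCurrentValue() {
                return delegate.getCurrentValue();     // the line itself
            }

            @Override
            public float getProgress() throws IOException {
                return delegate.getProgress();
            }

            @Override
            public void close() throws IOException {
                delegate.close();
            }
        }
    }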

how to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names, and the second and subsequent rows specify values for those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns, values might be specified only for the first 10.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like the file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in the other rows?
Should I split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could then read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between all of these, I think you have all the contextual information you are referring to, and you can do file-specific transformations. Even though the transformation logic is different for different files, the mapper output needs to have the same format.
If you are using a reducer, you can set the number of reducers to one to force all output to aggregate into one file.
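A sketch of those last two points, using the standard new MapReduce API: getting the current file's name inside the mapper via the input split, and forcing a single output file with one reducer. FileAwareMapper and the placeholder transformation are mine, not part of Hadoop.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Mapper that knows which input file it is reading, so it can apply a
    // file-specific transformation. The transformation itself is a placeholder.
    public class FileAwareMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        private String fileName;

        @Override
        protected void setup(Context context) {
            // The split being processed tells us which input file this mapper sees.
            fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Apply the transformation that corresponds to this particular file.
            // (Column metadata would come from the side-data file mentioned above.)
            String transformed = value.toString();      // placeholder transformation
            context.write(NullWritable.get(), new Text(fileName + "," + transformed));
        }

        // In the driver, one reducer forces a single aggregated output file:
        //   job.setNumReduceTasks(1);
    }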
