NiFi: change flowfile name to unique number

My ConsumeKafka processor generates flowfile names that are alphanumeric. Since I have to write the files to HDFS and need a number instead of an alphanumeric value, I tried converting the name using ${now():toNumber()}.
The number that gets generated is not a unique value, and these records fail to write to HDFS because of duplicate filenames. How can I get a unique filename that is not alphanumeric?
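One hedged option, assuming a NiFi release that provides the random() Expression Language function, is to append a random long to the timestamp so that two flowfiles named in the same millisecond still get distinct numeric names, e.g. setting filename to ${now():toNumber()}${random()} in an UpdateAttribute processor. (${UUID()} would be guaranteed unique, but it is alphanumeric, so it does not meet the numeric requirement.)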

Related

NiFi ReplaceText Processor inserting Empty strings

I'm trying to convert a fixed-width text file to a pipe-delimited text file, using NiFi's ReplaceText processor. These are my processor configurations:
Replacement Strategy: Regex Replace
Evaluation Mode: Line-by-Line
Line-by-Line Evaluation Mode: All
Search Value: (.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})
Replacement Value: ${'$1':trim()}${literal('|'):unescapeXml()}${'$3':trim()}${literal('|'):unescapeXml()}${'$4':trim()}${literal('|'):unescapeXml()}${'$5':toDecimal()}${literal('|'):unescapeXml()}${'$8':trim()}${literal('|'):unescapeXml()}${'$9':trim():toNumber()}${literal('|'):unescapeXml()}${'$10':trim()}${literal('|'):unescapeXml()}${'$11':toNumber()}${literal('|'):unescapeXml()}${'$12':toDecimal()}${literal('|'):unescapeXml()}${'$13':trim()}${literal('|'):unescapeXml()}${header:substring(63,69)}
I'm trying to split each record according to the column lengths provided to me, trimming spaces and parsing the values to different types. In this process I observe that some columns in the output file are randomly empty strings, even though the corresponding records in the fixed-width file contain data. I can't figure out why the expression evaluation inserts zero-length strings at random in the file. When I try with a small set of records (some 100 records) from the original file, it works fine. My original file has 12 million records in it.
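The widths above add up to 155 characters per record, which suggests one thing worth checking outside NiFi. Here is a minimal standalone sketch (not from the original post; the class name and I/O are illustrative) that runs the same regex over a sample line and reports records that do not match, since in Regex Replace mode ReplaceText passes non-matching lines through unchanged and a stray short or long line can masquerade as odd output:

import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthCheck {
    // Widths copied from the Search Value above; they total 155 characters.
    private static final Pattern ROW = Pattern.compile(
        "(.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})");

    public static void main(String[] args) {
        String line = args[0]; // one record from the fixed-width file
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            // A line that is not exactly 155 characters wide fails to match.
            System.err.println("no match: length=" + line.length());
            return;
        }
        // Trim every captured group and join with '|' to mimic the replacement.
        StringJoiner out = new StringJoiner("|");
        for (int g = 1; g <= m.groupCount(); g++) {
            out.add(m.group(g).trim());
        }
        System.out.println(out);
    }
}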

Apache Nifi - Split a large Json file into multiple files with a specified number of records

I am a newbie to NiFi and would like some guidance, please.
We want to split a large JSON file into multiple files with a specified number of records. I am able to split a file into individual records using SplitJson with the JsonPath expression set to $..*. I have also added an UpdateAttribute processor with filename set to ${filename}_${fragment.index} so that we preserve the sequence of the files, as order is important.
However, we might want, say, 100,000 records split into 100 files of 1,000 records each. What is the easiest way to do this?
Thanks very much in advance
There is a SplitRecord processor that lets you define the number of records per output file, for example:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Records Per Split: 3
I tested with these records:
id
1
...
8
and it is split into 3 files with ids (1,2,3), (4,5,6), and (7,8).
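As far as I know, SplitRecord also writes the standard fragment.index and fragment.count attributes to each split, so the ${filename}_${fragment.index} naming from the question should still preserve the ordering.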

How to output multiple values with the same key in reducer?

I have a bunch of text files which are categorized and I would like to create a sequence file for each category in which the key is the category name and the value consists of all the textual content of all the files for the category.
I have a NoSQL database which has only two columns. Each row represents a file: the first column is the category name and the second is the absolute address of the text file stored on HDFS. My mapper reads the database and outputs pairs in which the key is the category and the value is the absolute address. On the reducer side, I have the addresses of all the files for each category, and I would like to create one sequence file per category in which the key is the category name and the value consists of all the textual content of all the files belonging to that category.
A simple solution is to iterate through the pairs (in the reducer), open the files one by one, append their content to a String variable, and at the end create a sequence file using MultipleOutputs. However, as the file sizes may be large, appending the content to a single String may not be possible. Is there any way to do this without using a String variable?
Since you have all the file paths in the reducer, you can read the content of those files and append it using a StringBuilder to save memory, then discard the StringBuilder reference afterwards. If avoiding String is your concern, StringBuilder is a quick fix. The I/O operations involved in accessing and reading the files are resource intensive; the data itself, however, should be fine given the architecture of reducers in Hadoop. A sketch follows.
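A hedged sketch of that approach, assuming the mapper emits (category, hdfsPath) pairs as Text and that category names are safe to use as file names; the class and variable names are illustrative, not from the original post:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text category, Iterable<Text> paths, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // StringBuilder avoids the copy-on-concat cost of String, but the
        // whole category still has to fit in the task's heap.
        StringBuilder content = new StringBuilder();
        for (Text p : paths) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(p.toString()))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append('\n');
                }
            }
        }
        // One output file per category, named after the category itself.
        mos.write(category, new Text(content.toString()), category.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}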
You can also think of using a combiner, although that is mainly used to reduce the traffic between map and reduce. You could take advantage of preparing part of the sequence file at the combiner and the remainder at the reducer level. Of course this is valid only if the content can be appended as it comes, and not based on a specific order.

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data being passed as key/value pairs to the mapper in the Hadoop MapReduce framework? I understand that key/value pairs hold significance when they are passed to the reducers, as they cater to the partitioning of data coming from the mappers, and values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record number from the input file as the key and the text line as the value in the mapper function. But if we decide to implement a custom input format, the keys are chosen by us, and there could be values corresponding to the same key.
How does this get handled in the mapper stage? Does the mapper ignore duplicate keys and treat the records as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn.
So the mapper treats records with the same key as separate records.
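To make that concrete, here is a minimal pass-through mapper (the class name is illustrative). With the default TextInputFormat the key is actually the byte offset of the line rather than a record number, and map() is simply invoked once per record, so two records carrying the same key are just two independent calls:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // No de-duplication happens here; every record reaches map() intact,
        // whether or not an earlier record carried the same key.
        context.write(key, value);
    }
}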

how to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names, and the second and subsequent rows specify values for those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns, values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in subsequent rows?
Split file into lines, transform (map) each line in parallel, join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest passing file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between these, I think you have all the contextual information you are referring to, and you can do file-specific transformations; a sketch follows. Even though the transformation logic differs from file to file, the mapper output needs to have the same format.
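A hedged sketch combining both suggestions, assuming the metadata file was shipped in the driver with job.addCacheFile(new URI("/meta/columns.txt#columns.txt")) and that the job uses TextInputFormat, which produces FileSplits; the path and names here are illustrative assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class EtlMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String[] columns;
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file is symlinked into the task's working directory
        // under the name given after the '#' fragment in the driver.
        try (BufferedReader r = new BufferedReader(new FileReader("columns.txt"))) {
            columns = r.readLine().split(",");
        }
        // TextInputFormat produces FileSplits, which carry the file path.
        InputSplit split = context.getInputSplit();
        if (split instanceof FileSplit) {
            fileName = ((FileSplit) split).getPath().getName();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Choose the file-specific transformation using fileName and columns,
        // then emit rows in the one common output format.
        context.write(new Text(fileName), value);
    }
}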
If you are using a reducer, you could set the number of reducers to one to force all output to aggregate into one file.
