How to implement the equivalent of the Aggregator EIP in Nifi - apache-nifi

I'm very experienced with Apache Camel and EIPs and am struggling to understand how to implement the equivalents in Nifi. I understand that Nifi uses a different paradigm (flow-based programming), but I don't think what I'm trying to do is unreasonable.
In a nutshell, I want the contents of each file to be sent to many REST services, and I want to aggregate the responses into a single document which will be stored in Elasticsearch. I might also do some further processing and cleanup to improve what is stored (but this isn't my immediate issue).
The screenshot is a quick mock-up of what I'm trying to achieve but I don't understand enough about Nifi to know how to implement this pattern correctly.

If you are going to take a single piece of data and then fork to multiple parts of the flow and then converge back, there needs to be a way for MergeContent to know which pieces go together.
There are generally two ways this can be done...
The first is using MergeContent in "defragment mode". Think of this as reversing a split operation that was performed by one of the split processors like SplitText. For example, you split a file of 100 lines into 100 flow files of 1 line each, then do some stuff to each one, then want to converge back. The split processors produce a standard set of split attributes (described in the docs of the processors) and the defragment mode knows how to bin the splits accordingly and merge them back together. This probably doesn't apply to your example since you didn't start with a split processor.
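For reference, the split processors write a standard set of fragment attributes to each split (names per the SplitText documentation), and defragment mode bins and re-orders on them:

  fragment.identifier        - same value for every split produced from one parent flow file
  fragment.index             - the position of the split within the parent (used to restore order)
  fragment.count             - the total number of splits produced from the parent
  segment.original.filename  - the filename of the parent flow file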
The second approach is the "Correlation Attribute" in MergeContent. This tells MergeContent to only merge flow files together that have the same value for the attribute specified. In your example, when a file gets picked up by GetFile and sent to 3 InvokeHttp processors, there are 3 flow files created, and they all should have their "filename" attribute set to the name of the file picked up from disk. So telling MergeContent to correlate on filename should do the trick, probably along with setting the min and max number of entries to the number you expect (3 here), and a max bin age in case one of them fails or hangs.
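Roughly, that configuration might look like this (the property names come from MergeContent; the specific values are assumptions for the 3-way fan-out described above):

  MergeContent
    Merge Strategy             = Bin-Packing Algorithm
    Correlation Attribute Name = filename
    Minimum Number of Entries  = 3
    Maximum Number of Entries  = 3
    Max Bin Age                = 5 min     (flush the bin even if one of the three responses never arrives)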


Apache Nifi MergeContent output data inconsistent?

Fairly new to using Nifi. Need help with the design.
I am trying to create a simple flow with dummy CSV files (for now) in an HDFS dir and prepend some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow: SplitText -> ReplaceText -> MergeContent
(This may be a poor way to achieve what I am trying to get, but I saw somewhere that UUID is the best bet when it comes to generating a unique session id, so I thought of extracting each line from the incoming data into its own flowfile and generating a UUID for it.)
But somehow, as you can see, the order of the data is getting messed up. The first 3 rows are not the same in the output; with the test data I am using (50,000 entries), those rows end up somewhere else in the file. Multiple tests show the order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the defragment mode in the merge, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can keep the data in the same order, with a unique session_id and timestamp for each record. Is there some parameter I need to change or modify to get the correct output? I am open to suggestions if there is a better way as well.
First of all, thank you for such an elaborate and detailed response. I think you cleared up a lot of doubts I had about how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this using a clean template again. It could be some parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split the data and join it back to the corresponding files. I don't specifically (yet) need to join the outputs together again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file can be very large, meaning more than a million records in a single file. Won't that be an issue with the I/O in the cluster, since that would mean each record = one flowfile = one unique UUID? What is a comfortable number of flowfiles that Nifi can handle? (I know it depends on the cluster config and will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, template, or processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
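A minimal sketch of that setup (the reader/writer controller-service names are typical choices, not something prescribed above; the record paths are the ones mentioned):

  UpdateRecord
    Record Reader              = CSVReader           (schema includes session_id, timestamp, plus the data columns)
    Record Writer              = CSVRecordSetWriter  (same schema)
    Replacement Value Strategy = Literal Value
    /session_id                = ${UUID()}
    /timestamp                 = ${now()}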
This looks promising. Can you share a simple template that pulls files from HDFS > processes them > writes them back to HDFS, but without splitting?
I am reluctant to share the template due to restrictions. But let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)

Nifi: how to avoid copying file that are partially written

I am trying to use Nifi to get a file from an SFTP server. Potentially the file can be big, so my question is how to avoid getting the file while it is still being written. I am planning to use ListSFTP+FetchSFTP, but I am also okay with GetSFTP if it can avoid copying partially written files.
thank you
In addition to Andy's solid answer, you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair and doing some metadata-based routing.
After ListSFTP, each flowfile will have attributes such as 'file.lastModifiedTime' and others. You can read about them here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute processor in between the List and Fetch to detect objects that, at least based on the reported last modified time, are 'too new'. You could route those to a processor that is just a slow pass-through, to intentionally wait a bit, and then run them back through the first router until they are 'old enough'. Now, this is admittedly a power-user approach, but it does give you a lot of flexibility and control. The approach I'm mentioning here is not foolproof, as the source system may not report the last modified time correctly, it may not mean the source file is done being written, etc. But it gives you additional options IF you cannot do the definitely correct thing that Andy talks about.
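A rough sketch of that router (the attribute name comes from the ListSFTP documentation linked above; the date format and the five-minute threshold are assumptions to tune for your system):

  RouteOnAttribute (between ListSFTP and FetchSFTP)
    Routing Strategy = Route to Property name
    too.new          = ${file.lastModifiedTime:toDate("yyyy-MM-dd'T'HH:mm:ssZ"):toNumber():gt(${now():toNumber():minus(300000)})}

Flow files matching 'too.new' go to the delay/loop-back path; everything else (the 'unmatched' relationship) continues on to FetchSFTP.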
If you have control over the process which writes the file, a common pattern to solve this is to initially write the file with a specific naming structure, such as a name beginning with a dot (.). After the successful write operation, the file is renamed without the leading dot and is then picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files which is set to true by default, meaning those processors will not operate on or return files beginning with the dot character.
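For illustration (the filename is hypothetical):

  Producer writes to:      .bigfile.csv    (ignored while Ignore Dotted Files = true, the default)
  Producer renames it to:  bigfile.csv     (only after the write completes)
  GetSFTP / ListSFTP:      now pick up bigfile.csv, which is guaranteed to be complete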
There is a minimum file age property you can use. The last modification time gets updated as the file is being written. Setting this value to something other than 0 will help fix the problem:
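For example (assuming the processor you use exposes the property, as described above; the 10-minute value is just an illustration and should be longer than a typical write takes):

  Minimum File Age = 10 min    (a file is not picked up until its last-modified time is at least this old)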

Running hadoop for processing sources in full sky maps

I have a few tens of full sky maps, in binary format (FITS), of about 600MB each.
For each sky map I already have a catalog of the positions of a few thousand sources, i.e. stars, galaxies, radio sources.
For each source I would like to:
open the full sky map
extract the relevant section, typically 20MB or less
run some statistics on them
aggregate the outputs to a catalog
I would like to run hadoop, possibly using python via the streaming interface, to process them in parallel.
I think the input to the mapper should be each record of the catalogs,
then the python mapper can open the full sky map, do the processing and print the output to stdout.
Is this a reasonable approach?
If so, I need to be able to configure hadoop so that a full sky map is copied locally to the nodes that are processing one of its sources. How can I achieve that?
Also, what is the best way to feed the input data to hadoop? For each source I have a reference to the full sky map, a latitude and a longitude.
Though it doesn't sound like a few tens of sky maps make for a very big data set, I've used Hadoop successfully as a simple way to write distributed applications/scripts.
For the problem you describe, I would try implementing a solution with Pydoop, and specifically Pydoop Script (full disclaimer: I'm one of the Pydoop developers).
You could set up a job that takes as input the list of sections of the sky map that you want to process, serialized in some sort of text format with one record per line. Each map task should process one of these; you can achieve this split easily with the standard NLineInputFormat.
You don't need to copy the sky map locally to all the nodes as long as the map tasks can access the file system on which it is stored. Using the pydoop.hdfs module, the map function can read the section of the sky map that it needs to process (given the coordinates it received as input) and then emit the statistics as you were saying so that they can be aggregated in the reducer. pydoop.hdfs can read from both "standard" mounted file systems and HDFS.
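A minimal sketch of what such a Pydoop Script job might look like; the input record layout, the helper functions and the FITS handling are placeholders (assumptions), not something prescribed by the answer above:

# skymap_stats.py -- a sketch of a Pydoop Script job, run with something like
#   pydoop script skymap_stats.py /input/source_list.txt /output
# where every line of source_list.txt is "<hdfs_path_to_map> <latitude> <longitude>";
# one map task per line can be arranged with NLineInputFormat, as noted above.
# extract_section() and compute_stats() are placeholders for your FITS handling.
import pydoop.hdfs as hdfs

def extract_section(data, lat, lon):
    """Placeholder: cut the ~20MB region around (lat, lon) out of the full map."""
    raise NotImplementedError

def compute_stats(section):
    """Placeholder: run the statistics you need on the extracted section."""
    raise NotImplementedError

def mapper(_, line, writer):
    # One catalog record per input line: which map to open and where to look.
    map_path, lat, lon = line.split()
    with hdfs.open(map_path) as f:      # works for HDFS paths and mounted file systems
        data = f.read()                 # or seek()/read(n) to pull only the relevant section
    section = extract_section(data, float(lat), float(lon))
    writer.emit(map_path, compute_stats(section))

def reducer(map_path, stats, writer):
    # Aggregate the per-source statistics of each sky map into one catalog entry.
    writer.emit(map_path, ";".join(str(s) for s in stats))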
Though the problem domain is totally unrelated, this application may serve as an example:
https://github.com/ilveroluca/seal/blob/master/seal/dist_bcl2qseq.py#L145
It uses the same strategy, preparing a list of "coordinates" to be processed, serializing them to a file, and then launching a simple pydoop job that takes that file as input.
Hope that helps!

How to create hadoop input splits that span two files?

My data input files are all of the same length, but, the records therein may span two files (starting at the end of the first file and finishing at the beginning of the second).
Is it possible to create an inputsplit that would allow me to span those two files?
Is it better to create an entirely new set of files so that records do not span more than one file?
I would definitely ensure your records do not span more than one file: you could, theoretically, write your own input format that takes care of this, but the overhead is likely to be considerable, as you would be taking over part of the responsibility which the jobtracker and namenode fulfill for you (in having to ensure that you know which files belong together).
You should be free to tell the jobtracker/namenode where the inputs are and let the processing be truly parallel; you don't want to then have to take back some of that control. IMHO it would partially defeat the object of using Hadoop in the first place.

Using Hadoop to "bucket" data out with a single run

Is it possible to use one Hadoop job run to output data to different directories based on keys?
My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns.
For example,
Anything that starts with /foo/ should go to /year/month/day/hour/foo/file
Anything that starts with /bar/ should go to /year/month/day/hour/bar/file
Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of MapReduce): first, I'd prefer to just iterate over my data one time, instead of running one "grep" job per URL type I'd like to match. How would I split up the output, though? If I key the first with "foo", the second with "bar", and the rest with "other", then don't they all still go to the same reducers? How do I tell Hadoop to output them into different files?
The second problem is related (maybe the same?), I need to break output up by the timestamp in the access log line.
I should note that I'm not looking for code to solve this, but rather the proper terminology and high level solution to look into. If I have to do it with multiple runs, that's alright, but I can't run one "grep" for each possible hour (to make a file for that hour), there must be another way?
You need to partition the data just as you describe. Then you need to have multiple output files. See here (Generating Multiple Output files with Hadoop 0.20+).
