Flume - Can an entire file be considered an event in Flume? - hadoop

I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple directory spooling in Flume, where I specified the source, sink and channel, and it works fine. The disadvantage is that I would have to maintain multiple spooling directories for the multiple file types that go into distinct HDFS folders in order to get greater control over file sizes and other parameters, which makes the configuration repetitive, though easy.
As an alternative, I was advised to use regex interceptors, where multiple files would reside in a single directory and, based on a string in the file, each would be routed to a specific directory in HDFS. The kind of files I am expecting are CSV files where the first line is the header and the subsequent lines are comma-separated values.
With this in mind, I have a few questions.
How do interceptors handle files?
Given that the header line of one CSV would be, say, ID, Name, followed on the next lines by IDs and names, and another file in the same directory would have Name, Address, followed on the next lines by names and addresses, what would the interceptor and channel configuration look like to route them to different HDFS directories?
How does an interceptor handle the subsequent lines that clearly do not match the regex?
Would an entire file even constitute one event or is it possible that one file can actually be multiple events?
Please let me know. Thanks!

For starters, Flume doesn't work on files as such, but on a thing called events. Events are Avro structures which can contain anything, usually a line, but in your case it might be an entire file.
An interceptor gives you the ability to extract information from your event and add that to the event's headers. The latter can be used to configure a target directory structure.
In your specific case, you would want to code a parser that analyses the content of your event and sets a header value, for instance subpath:
if (line.contains("Address")) {
    event.getHeaders().put("subpath", "address");
} else if (line.contains("ID")) {
    event.getHeaders().put("subpath", "id");
}
You can then reference that in your hdfs-sink configuration as follows:
hdfs-a1.sinks.hdfs-sink.hdfs.path = hdfs://cluster/path/%{subpath}
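If it helps, here is a minimal sketch of such an interceptor built on Flume's standard Interceptor and Interceptor.Builder interfaces. The class name, package, and header logic are placeholders, so treat it as a starting point rather than a finished implementation:

package com.example.flume;

import java.nio.charset.Charset;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Sets a "subpath" header based on the event body so the HDFS sink can
// route events to different directories via %{subpath}.
public class SubpathInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // nothing to set up
    }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), Charset.forName("UTF-8"));
        if (body.contains("Address")) {
            event.getHeaders().put("subpath", "address");
        } else if (body.contains("ID")) {
            event.getHeaders().put("subpath", "id");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // nothing to clean up
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new SubpathInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no configuration needed for this sketch
        }
    }
}

You would then register it on the source, for example with hdfs-a1.sources.src-1.interceptors = i1 and hdfs-a1.sources.src-1.interceptors.i1.type = com.example.flume.SubpathInterceptor$Builder (the source and interceptor names here are placeholders).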
As to your question whether an entire file can constitute a single event: yes, that's possible, but not with the spooling directory source. You would have to implement a client class which speaks to a configured Avro source. You would pipe each file into an event and send that off. You could then also set the headers there instead of using an interceptor.
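If you go that route, a rough sketch using the Flume SDK's RpcClient looks like the following. The host, port, and header value are placeholders, and it assumes an Avro source is listening on that port:

import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

// Sends one whole file as a single Flume event to an Avro source.
public class FileToEventClient {

    public static void main(String[] args) throws Exception {
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            File file = new File(args[0]);
            byte[] body = Files.readAllBytes(file.toPath());

            // Set the routing header here instead of in an interceptor.
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("subpath", file.getName().contains("address") ? "address" : "id");

            Event event = EventBuilder.withBody(body, headers);
            client.append(event); // one file -> one event
        } finally {
            client.close();
        }
    }
}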

Related

How to read from a CSV file

The problem:
I have a CSV file. I want to read from it, and use one of the values in it based on the content of my flow file. My flow file will be XML. I want to read the key using EvaluateXPath into an attribute, then use that key to read the corresponding value from the CSV file and put that into a flow file attribute.
I tried following this:
https://community.hortonworks.com/questions/174144/lookuprecord-and-simplecsvfilelookupservice-in-nif.html
but found that requiring several controller services, including a CSV writer, is a bit more than I would think is needed to solve this.
Since you're working with attributes (and only one lookup value), you can skip the record-based stuff and just use LookupAttribute with a SimpleCsvFileLookupService.
The record-based components are for doing multiple lookups per record and/or lookups for each record in a FlowFile. In your case it looks like you have one "record" and you really just want to look up an attribute from another attribute for the entire FlowFile, so the above solution should be more straightforward and easier to configure.
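For what it's worth, a rough sketch of that setup follows. The property names are approximate and from memory, so check them against the component documentation, and the attribute written by EvaluateXPath is assumed here to be called lookup.key:

SimpleCsvFileLookupService (controller service):
  CSV File: /path/to/lookup.csv
  Lookup Key Column: key
  Lookup Value Column: value

LookupAttribute:
  Lookup Service: the SimpleCsvFileLookupService above
  dynamic property: looked.up.value = ${lookup.key}

The dynamic property name becomes the attribute that receives the looked-up value, and its value is the key to look up.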

Pass a directory as an argument to ExecuteStreamCommand

I have a Java program that is designed to process a directory full of data, passed as an argument to the JAR.
input_dir/
file1
file2
How can I tell NiFi to pass a directory to an ExecuteStreamCommand, as an argument, instead of an individual FlowFile?
Is there a way to model a directory as a FlowFile ?
I tried to use GetFile just before ExecuteStreamCommand on the parent directory of input_dir in order to get input_dir, so it would be passed to the stream command.
It didn't work, as GetFile just crawls all the directories looking for actual files when "Recurse Subdirectories" attribute is set to true.
When set to false, GetFile doesn't get any files.
To summarize, I would like to find a way to pass a directory containing data to an ExecuteStreamCommand, not just a single FlowFile.
Hope it makes sense, thank you for your suggestions.
A flow file does not have to be a file from disk, it can be anything. If I am understanding you correctly, you just need a flow file to trigger your ExecuteStreamCommand. You should be able to do this with GenerateFlowFile (set the scheduling strategy appropriately). You can put the directory directly into ExecuteStreamCommand, or if you want it to be more dynamic you can add it as a flow file attribute in GenerateFlowFile, then reference it in ExecuteStreamCommand like ${my.dir} (assuming you called it my.dir in GenerateFlowFile).
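Roughly, the wiring could look like this (processor property names from memory; the directory, jar path, and attribute name are placeholders):

GenerateFlowFile:
  scheduling: timer or cron driven, as often as you want the command to run
  dynamic property: my.dir = /data/input_dir

ExecuteStreamCommand:
  Command Path: /usr/bin/java
  Command Arguments: -jar;/opt/app/processor.jar;${my.dir}
  Argument Delimiter: ;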

merging of flow files in the specified order

I am new to NiFi (using version 1.8.0). I have a requirement to consume Kafka messages, each of which contains a vehicle position as lat,lon. Since each message arrives as a flow file, I need to merge all these flow files and make a JSON file containing the complete path followed by the vehicle. I am using a ConsumeKafka processor to subscribe to messages, an UpdateAttribute processor (properties added are filename:${getStateValue("seq")} and seq:${getStateValue("seq"):plus(1)}) to add a sequence number as the filename (e.g. filename is 1, 2, 3, etc.), and a PutFile processor to write these files to the specified directory. I have configured a FIFO prioritizer on all the success relationships between the above mentioned processors.
Once I have received all the messages, I want to merge all the flow files. For this I know I have to use GetFile, EnforceOrder, MergeContent (merge strategy: bin packing algorithm, merge format: binary concatenation) and PutFile processors, respectively. Is my approach correct? How should I ensure that the files are merged in the order of their names, given that the filename is a sequence number? What should I put in the Order Attribute of the EnforceOrder processor? What should I put in Group Identifier? Are there more custom fields to be added in the EnforceOrder processor?
EnforceOrder processor documentation
1. Group Identifier
This property is evaluated against each flowfile. For your case, use an UpdateAttribute processor to add a group_name attribute, and then use ${group_name} as the Group Identifier property value.
2. Order Attribute
Expression Language is not supported here. You can use filename, or create a new attribute in the UpdateAttribute processor and use that attribute name as the Order Attribute property value.
For a reference usage of the EnforceOrder processor, use this template and upload it to your NiFi instance.
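A rough outline of the flow described above (attribute names are placeholders, and property names should be double-checked against the NiFi 1.8.0 documentation):

GetFile -> UpdateAttribute -> EnforceOrder -> MergeContent -> PutFile

UpdateAttribute:
  group_name: vehicle1 (or whatever identifies one vehicle's set of files)

EnforceOrder:
  Group Identifier: ${group_name}
  Order Attribute: filename (it holds the sequence number and must be an integer)

MergeContent:
  Merge Strategy: Bin-Packing Algorithm
  Merge Format: Binary Concatenation
  Correlation Attribute Name: group_name

Keep the FirstInFirstOutPrioritizer on the connections so the ordered files are not shuffled again before merging.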

Apache NiFi to split data based on condition

Our requirement is to split the flow data based on a condition.
We thought of using the ExecuteStreamCommand processor for that (internally it will use a Java class), but it gives only a single output flow file. We would like to have two flow data files: one for the matched criteria and another for the unmatched.
I looked at the RouteText processor, but it has no feature for plugging in a Java class.
Let me know if anyone has any suggestion.
I think you could use GetMongo to read those definition values and store them in a map accessed by DistributedMapCacheClientService, then use RouteOnContent to route the incoming flowfiles based on the absence/presence of the retrieved values.
If that doesn't work, you could instead route the query result from GetMongo to PutFile and then use ScanContent, which reads from a dictionary file on the file system and routes flowfiles based on the absence/presence of those keywords in the content.
Finally, if all else fails, you can use ExecuteScript to combine those steps into a single processor and route to matched/unmatched relationships. It processes Groovy code easily, so you can directly invoke your existing Java class if necessary.
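As a rough illustration of the RouteOnContent option (the keyword regex and relationship name below are placeholders):

RouteOnContent:
  Match Requirement: content must contain match
  dynamic property: matched = (?i)your-keyword-here

Each dynamic property creates a relationship with that name; flowfiles whose content matches the regex are routed there, and everything else goes to the unmatched relationship.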

Spring MVC Upload File - How is Content Type determined?

I'm using Spring 3's ability to upload a file. I would like to know the best way to validate that a file is of a certain type, specifically a CSV file. I'm rather sure that checking the extension is useless, and currently I am checking the content type of the uploaded file, ensuring that it is of type "text/csv". And just to clarify, this is a file uploaded by the client, meaning I have no control over its origins.
I'm curious how Spring/the browser determines what the content type is? Is this the best/safest way to determine what kind of file has been uploaded? Can I ever be 100% certain?
UPDATE: Again, I'm not wondering how to determine what the content type of a file is, but how the content type gets determined. How does Spring/the browser know that the content type is "text/csv" based on the file uploaded?
You can use the org.springframework.web.multipart.commons.CommonsMultipartFile object. It has a getContentType() method.
Look at the following example: http://www.ioncannon.net/programming/975/spring-3-file-upload-example/
You can just add a simple check on the CommonsMultipartFile object and redirect to an error page if the content type is incorrect.
You can also count the number of commas per line in the file. There should normally be the same number of commas on each line for it to be a valid CSV file.
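A minimal sketch combining both checks (this is only a heuristic on top of the browser-supplied content type, not a guarantee; quoted commas inside fields would defeat the simple count):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.springframework.web.multipart.MultipartFile;

public class CsvUploadValidator {

    // Heuristic check: the declared type must be text/csv and every line must
    // contain the same number of commas as the header line.
    public static boolean looksLikeCsv(MultipartFile file) throws IOException {
        if (!"text/csv".equals(file.getContentType())) {
            return false; // browser-supplied, so not reliable on its own
        }
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(file.getInputStream(), "UTF-8"));
        try {
            String header = reader.readLine();
            if (header == null) {
                return false;
            }
            int expected = countCommas(header);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.length() > 0 && countCommas(line) != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            reader.close();
        }
    }

    private static int countCommas(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ',') {
                count++;
            }
        }
        return count;
    }
}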
Why don't you just take the file name in your validator and split it? The extension is the last element: String[] parts = fileName.split("\\."); String type = parts[parts.length - 1];
OK, in this case I suggest you use the CsvReader Java library. You just have to check your CsvReader object and that's all.
As far as I'm aware, the getContentType() method gets its value from whatever the user agent tells it, so you're right to be wary as this can easily be spoofed.
For binary files you could check the magic number or use a library such as mime-util or jMimeMagic. There's also Files.probeContentType(Path) since Java 7, but it only works with files on disk and bugs have been reported on some OSes.
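If you want to try probeContentType on an upload, a small sketch follows (the class and method names are made up; the upload has to be copied to a real file first, and the result may be null or differ between operating systems):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.springframework.web.multipart.MultipartFile;

public class ContentTypeProbe {

    // Copies the upload to a temp file because probeContentType only works on
    // files that exist on disk; returns e.g. "text/csv", or null if unknown.
    public static String probeUploadedType(MultipartFile file) throws IOException {
        Path temp = Files.createTempFile("upload-", ".tmp");
        try {
            file.transferTo(temp.toFile());
            return Files.probeContentType(temp);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}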
