I have a 20 GB XML file on my local system. I want to split the data into multiple chunks and also remove specific attributes from the file. How can I achieve this using NiFi?
Use the SplitRecord processor and define XML Reader/Writer controller services to read the XML data and write only the required attributes into the resulting XML.
Also set the Records Per Split property to the number of records you want in each split.
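To make the idea concrete outside NiFi, here is a minimal Python sketch of what "split into chunks of N records and drop some attributes" means. It is not how SplitRecord is implemented; the function name `split_records`, the `<records>` wrapper element, and the sample tag and attribute names are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def split_records(source, record_tag, records_per_split, drop_attrs=()):
    """Yield XML strings holding at most `records_per_split` records each."""
    batch = []
    # iterparse streams the document, so the whole file is never parsed at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        for name in drop_attrs:
            elem.attrib.pop(name, None)  # remove unwanted XML attributes
        elem.tail = None  # ignore whitespace that follows the record
        batch.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()  # drop the record's children and text once serialized
        if len(batch) == records_per_split:
            yield "<records>" + "".join(batch) + "</records>"
            batch = []
    if batch:  # last, possibly smaller, chunk
        yield "<records>" + "".join(batch) + "</records>"


# Hypothetical sample data: three records, split two at a time, dropping "ssn".
sample = (
    b"<people>"
    b"<person id='1' ssn='x'><name>A</name></person>"
    b"<person id='2' ssn='y'><name>B</name></person>"
    b"<person id='3' ssn='z'><name>C</name></person>"
    b"</people>"
)
chunks = list(split_records(io.BytesIO(sample), "person", 2, drop_attrs=("ssn",)))
for chunk in chunks:
    print(chunk)
```

In NiFi the equivalent choices are made through configuration: Records Per Split plays the role of `records_per_split`, and the writer's schema decides which fields survive.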
Related
I have a requirement to split millions of records (CSV format) into single rows in Apache NiFi. Currently I am using multiple SplitText processors to achieve this. Is there another way to do this without chaining multiple SplitText processors?
You can use the SplitRecord processor.
You need to create Record Reader and Record Writer controller services first.
Then set Records Per Split to the number of records you want in each split (1 for a single row per flowfile).
Hi all, I am new to NiFi. I want to split a large XML file into multiple chunks using the SplitRecord processor. I am unable to split the records: I get my original file as the output, not multiple chunks. Can anyone help me with this?
To use SplitRecord, you're going to need to create an Avro schema that defines your record. If you have that, you should be able to use the XMLReader to turn it into a record set.
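As a rough illustration of what such a schema looks like, the Python snippet below builds a small Avro record schema and prints it as JSON. The record name, namespace, and fields (`id`, `name`) are hypothetical placeholders; a real schema has to mirror the child elements of each record in your XML.

```python
import json

# Hypothetical record layout: replace the names with those from your own XML.
schema = {
    "type": "record",
    "name": "person",
    "namespace": "example",
    "fields": [
        {"name": "id", "type": "string"},
        # A union with "null" marks the field as optional.
        {"name": "name", "type": ["null", "string"], "default": None},
    ],
}

schema_text = json.dumps(schema, indent=2)
print(schema_text)
```

The printed JSON is the kind of text that can be supplied to the reader and writer services as their schema, assuming they are configured to take the schema from text rather than from a registry.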
I am trying to use NiFi to break up an XML document into multiple flowfiles. The XML contains many elements from a web service. I am trying to process each event separately. I think EvaluateXQuery is the appropriate processor, but I can't figure out how to add my XQuery if the destination is a flowfile rather than an attribute. I know I have to add a property/value pair on the processor's config/properties page, but I can't figure out what the property name should be. Does it matter?
If you only need to extract one element, then yes, add a dynamic property with any name and set the destination to flowfile-content.
You can add multiple dynamic properties to the processor to extract elements into attributes on the outgoing flowfile. If you then want to replace the flowfile content with the attributes, you can use a processor like ReplaceText or AttributesToJSON to combine multiple attributes into the flowfile content.
A couple of things to remember:
Extracting multiple large elements to attributes is an anti-pattern, because attributes are held on the heap and this will hurt performance.
You might be better off splitting the XML file into chunks via SplitXml first, and then extracting a single element per chunk into the flowfile content (or an attribute).
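The split-first, extract-second pattern described above can be sketched in plain Python as follows. This only mimics the idea; the function name `split_events`, the `event` and `id` tag names, and the `event.id` attribute key are assumptions, not anything NiFi defines.

```python
import xml.etree.ElementTree as ET


def split_events(xml_text, event_tag="event", key_field="id"):
    """Yield (content, attributes) pairs, one per direct child named event_tag."""
    root = ET.fromstring(xml_text)
    for child in root.findall(event_tag):
        content = ET.tostring(child, encoding="unicode").strip()
        # Only one small value goes into the attributes; the bulk stays in content.
        attributes = {"event.id": child.findtext(key_field, default="")}
        yield content, attributes


# Hypothetical web-service response with two events.
doc = (
    "<events>"
    "<event><id>7</id><msg>up</msg></event>"
    "<event><id>8</id><msg>down</msg></event>"
    "</events>"
)
pieces = list(split_events(doc))
for content, attributes in pieces:
    print(attributes, content)
```

Keeping the large payload in the content and only a short key in the attributes is the point of the advice above.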
I want to keep my Hive/MySQL table in the NiFi DistributedMapCache. Can someone please help me with an example?
Or please correct me if a Hive table cannot be cached in the NiFi cache at all.
Thanks
Use the SelectHiveQL processor to pull data from the Hive table, with the output format set to CSV and the header disabled.
Use the SplitText processor to split each line into an individual flowfile.
Note: if the flowfile is large, chain several SplitText processors in series to split it down to individual lines.
Use the ExtractText processor to extract the key attribute from the flowfile content.
Use the PutDistributedMapCache processor to store each entry:
Configure and enable the DistributedMapCacheClientService and DistributedMapCacheServer controller services.
Set the Cache Entry Identifier property to the attribute extracted by the ExtractText processor.
Adjust Max cache entry size to fit the flowfile size.
To fetch the cached data, use the FetchDistributedMapCache processor with exactly the same identifier value that was used in PutDistributedMapCache.
In the same way, if you load data from external sources that arrive in Avro format, use the ConvertRecord processor to convert Avro to CSV, then load the data into the distributed cache.
However, loading all the data into the DistributedMapCache is not a best practice for huge datasets; you can use the LookupRecord processor instead.
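The flow above (split into lines, extract a key, put, then fetch by the same key) can be sketched in plain Python, with a dict standing in for the distributed cache. This is only an analogy for the data flow, not NiFi code; the regular expression assumes the key is the first CSV column, and the sample rows are made up.

```python
import re

# Assumption: the first CSV column is the cache key (what ExtractText would capture).
KEY_PATTERN = re.compile(r"^([^,]+),")


def build_cache(csv_text):
    """Split CSV text into lines and store each line under its extracted key."""
    cache = {}
    for line in csv_text.splitlines():  # one "flowfile" per line
        match = KEY_PATTERN.match(line)
        if match is None:
            continue  # skip blank or key-less lines
        cache[match.group(1)] = line  # put: identifier -> content
    return cache


# Hypothetical header-less CSV export of a table.
cache = build_cache("1,alice,NY\n2,bob,LA\n")
print(cache.get("2"))  # fetch with exactly the identifier used for the put
```

A lookup with a key that was never stored returns nothing, which is why the fetch side must use the same identifier value as the put side.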
I am using a custom processor for CSV-to-JSON conversion, which converts the CSV file data into a JSON array of JSON objects.
My requirement is to get the file attributes, such as filename, uuid, and path, and construct a JSON object from them.
Question:
How can I get the related attributes of the file and construct a JSON object from them, appending it to the JSON that was constructed before?
I have only been working with Apache NiFi for a few days, so for now I am just implementing the exact requirements with the custom processor.
I can't speak to which attributes are being written for your custom processor, but there is a set of core attributes that most/all flow files have, such as filename and uuid. If you are using GetFile or ListFile/FetchFile to read in your CSV file, you will have those and a number of other attributes available (see the doc for more info).
When you have a flow file that has the appropriate attributes set, you can use the AttributesToJSON processor to create a JSON object containing a flat list of the specified attributes, and that object can replace the flow file content or become its own attribute (named 'JSONAttributes') depending on the setting of the "Destination" property of AttributesToJSON.
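For a custom processor that wants to do the combination itself, the logic amounts to something like the Python sketch below. The function name `wrap_with_attributes` and the output layout (an object with `attributes` and `records` keys) are assumptions chosen for illustration; missing attributes are written as empty strings here, which is also an assumption about the desired behavior.

```python
import json


def wrap_with_attributes(records_json, attributes, wanted):
    """Combine selected flowfile attributes with an existing JSON array."""
    selected = {name: attributes.get(name, "") for name in wanted}
    return json.dumps({"attributes": selected, "records": json.loads(records_json)})


# Hypothetical attribute map and converter output.
flowfile_attributes = {"filename": "data.csv", "uuid": "1234", "path": "./"}
converted = '[{"col1": "a"}, {"col1": "b"}]'
combined = wrap_with_attributes(converted, flowfile_attributes, ["filename", "uuid", "path"])
print(combined)
```

Wrapping the array in an outer object keeps the file-level attributes separate from the per-row objects; adding them to every row would be another possible layout.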