Need to extract attributes directly from Avro using NiFi - apache-nifi

I have found no way in NiFi to extract attributes directly from Avro so I am using ConvertAvroToJson -> EvaluateJsonPath -> ConvertJsonToAvro as the workaround.
But I would like to write a script that extracts the attributes directly from the Avro flow file in an ExecuteScript processor, to determine whether that is a better approach.
Does anyone have a script to do this? Otherwise, I may end up using the original approach.
Thanks,
Kevin

Here's a Groovy script (which needs the Avro JAR in its Module Directory property) where I let the user specify dynamic properties with path expressions (dot-separated field names) to be evaluated against the Avro file. Ironically it does a GenericData.toString(), which converts the record to JSON anyway, but perhaps there is some code in here you could reuse:
import org.apache.avro.*
import org.apache.avro.generic.*
import org.apache.avro.file.*
import groovy.json.*
import org.apache.commons.io.IOUtils
import java.nio.charset.*

flowFile = session.get()
if (!flowFile) return

final GenericData genericData = GenericData.get()
slurper = new JsonSlurper().setType(JsonParserType.INDEX_OVERLAY)
// Collect the dynamic properties whose names start with "avro.path"; their values are the paths to evaluate
pathAttrs = this.binding?.variables?.findAll { attr -> attr.key.startsWith('avro.path') }
newAttrs = [:]
try {
    session.read(flowFile, { inputStream ->
        def reader = new DataFileStream<>(inputStream, new GenericDatumReader<GenericRecord>())
        GenericRecord currRecord = null
        if (reader.hasNext()) {
            currRecord = reader.next()
            log.info(genericData.toString(currRecord))
            // Convert the first record to JSON and walk the dot-separated path for each property
            record = slurper.parseText(genericData.toString(currRecord))
            pathAttrs?.each { k, v ->
                object = record
                v.value.tokenize('.').each {
                    object = object[it]
                }
                newAttrs[k - "avro.path."] = String.valueOf(object)
            }
            reader.close()
        }
    } as InputStreamCallback)
    newAttrs.each { k, v ->
        flowFile = session.putAttribute(flowFile, k, v)
    }
    session.transfer(flowFile, REL_SUCCESS)
} catch (e) {
    log.error("Error during Avro Path: {}", [e.message] as Object[], e)
    session.transfer(flowFile, REL_FAILURE)
}
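To use it in ExecuteScript, you would add dynamic properties whose names start with avro.path; for example (the field names here are just illustrative), a property named avro.path.id with the value id, or a nested path such as payload.id, would be evaluated against the first record in the Avro file and written back as a flow file attribute named id.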
If you meant to extract Avro metadata vs. fields (I'm not totally sure what you meant by "attributes"), also check MergeContent's AvroMerge, as there is some code in there to pull Avro metadata.

If you are extracting simple patterns from a single Avro record per flow file, ExtractText may be sufficient for you. If you want to take advantage of the new record processing available in Apache NiFi 1.3.0, AvroReader is where you should start, and there is a series of blog posts describing this process in detail. You can also extract Avro metadata with ExtractAvroMetadata.

Related

Process json files in spring batch

I have a zip file containing multiple JSON files. I have unzipped them and then built POJO objects from the JSON using the code below:
reader = new BufferedReader(new FileReader(file));
Gson gson = new GsonBuilder().create();
Element[] people = gson.fromJson(reader, Element[].class);
but I need to process these JSON files one by one using Spring Batch.
Can someone help me achieve this in Spring Batch? I also want to read the JSON files in chunks of 1000.
My json object is very complex. Example:
{
    "students": {
        "subelements": {
            "dep": {
                "data": [
                    "XYZ"
                ]
            }
        }
    }
}
Your data structure is not one of the types you could handle with Spring Batch out-of-the-box. See more details here: https://stackoverflow.com/a/51933062/5019386.
So I think in your case, you would need to create a custom item reader to parse a specific fragment of your input file.
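As a rough sketch (not from the original answer), a custom ItemReader could wrap Gson and hand back one Element at a time so Spring Batch can chunk them (e.g. 1000 per chunk); the Element class and the list of unzipped file paths come from the question, everything else is illustrative:
import com.google.gson.Gson;
import org.springframework.batch.item.ItemReader;

import java.io.FileReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: reads each JSON file into Element[] with Gson and
// emits one Element per read() call so Spring Batch can chunk the items.
public class JsonFileItemReader implements ItemReader<Element> {

    private final Deque<String> files;                  // paths of the unzipped JSON files
    private final Deque<Element> buffer = new ArrayDeque<>();
    private final Gson gson = new Gson();

    public JsonFileItemReader(List<String> filePaths) {
        this.files = new ArrayDeque<>(filePaths);
    }

    @Override
    public Element read() throws Exception {
        // Refill the buffer from the next file until we have items or run out of files
        while (buffer.isEmpty() && !files.isEmpty()) {
            try (FileReader reader = new FileReader(files.poll())) {
                Element[] elements = gson.fromJson(reader, Element[].class);
                if (elements != null) {
                    for (Element e : elements) {
                        buffer.add(e);
                    }
                }
            }
        }
        return buffer.poll(); // null signals end of input to Spring Batch
    }
}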

How to extract and manipulate data within a Nifi processor

I'm trying to write a custom Nifi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the #onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    final ComponentLog logger = getLogger();
    AtomicBoolean error = new AtomicBoolean();
    AtomicReference<String> result = new AtomicReference<>(null);
    // This uses a lambda function in place of a callback for InputStreamCallback#process()
    session.read(flowFile, in -> {
        long start = System.nanoTime();
        // Read the flowfile content into a String
        // TODO: May need to buffer this if the content is large
        try {
            final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
            result.set(new MyMathOperationService().performSomeOperation(contents));
            long stop = System.nanoTime();
            if (getLogger().isDebugEnabled()) {
                final long durationNanos = stop - start;
                DecimalFormat df = new DecimalFormat("#.###");
                getLogger().debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
            }
        } catch (Exception e) {
            error.set(true);
            getLogger().error(e.getMessage() + " Routing to failure.", e);
        }
    });
    if (error.get()) {
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // Again, a lambda takes the place of the StreamCallback#process()
        FlowFile updatedFlowFile = session.write(flowFile, (in, out) -> {
            final String resultString = result.get();
            final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
            // TODO: This can use a while loop for performance
            out.write(resultBytes, 0, resultBytes.length);
            out.flush();
        });
        session.transfer(updatedFlowFile, REL_SUCCESS);
    }
}
Daggett is right that the ExecuteScript processor is a good place to start because it will shorten the development lifecycle (no building NARs, deploying, and restarting NiFi to use it) and when you have the correct behavior, you can easily copy/paste into the generated skeleton and deploy it once.

Parsing multi-format & multi line data file in spring batch job

I am writing a Spring Batch job to process the below-mentioned data file and write it into a DB.
The sample data file is in the format shown below, where I have multiple headers and each header has a bunch of rows associated with it.
I can have millions of records for each header and n number of headers in the flat file I am processing. My requirement is to pick only the few headers I am concerned with.
For all the picked headers I need to pick all of their data rows. Each header and its data format is also different. I can receive either of these formats in my processor and need to write them into my DB.
HDR01
A|41|57|Data1|S|62|Data2|9|N|2017-02-01 18:01:05|2017-02-01 00:00:00
A|41|57|Data1|S|62|Data2|9|N|2017-02-01 18:01:05|2017-02-01 00:00:00
HDR02
A|41|57|Data1|S|62|Data2|9|N|
A|41|57|Data1|S|62|Data2|9|N|
I tried exploring PatternMatchingCompositeLineMapper, where I can map each of the different header patterns I have to a tokenizer and a corresponding FieldSetMapper, but I need to read the body and not the header here.
I don't have any footer to create an end-of-line policy of my own either.
I also tried using AggregateItemReader, but I don't want to club all the records of a header together before I process them.
Each row corresponding to a header should be processed in parallel.
@Bean
public LineMapper myLineMapper() {
    PatternMatchingCompositeLineMapper<Domain> mapper = new PatternMatchingCompositeLineMapper<>();
    final Map<String, LineTokenizer> tokenizers = new HashMap<String, LineTokenizer>();
    tokenizers.put("* HDR01*", new DelimitedLineTokenizer());
    tokenizers.put("*HDR02*", new DelimitedLineTokenizer());
    tokenizers.put("*", new DelimitedLineTokenizer("|"));
    mapper.setTokenizers(tokenizers);
    Map<String, FieldSetMapper<VMSFeedStyleInfo>> mappers = new HashMap<String, FieldSetMapper<VMSFeedStyleInfo>>();
    try {
        mappers.put("* HDR01*", customMapper());
        mappers.put("*HDR02*", customMapper());
        mappers.put("*", customMapper());
    } catch (Exception e) {
        e.printStackTrace();
    }
    mapper.setFieldSetMappers(mappers);
    return mapper;
}
Can somebody provide some inputs as to how I should achieve this?

How to use CombineFileInputFormat on gzip files?

What is the best way to use CombineFileInputFormat on gzip files?
This article will help you build your own InputFormat, with the help of CombineFileInputFormat, to read and process gzip files. The parts below give you an idea of what needs to be done.
Custom InputFormat:
Build your own custom InputFormat, almost the same as CombineFileInputFormat. The key has to be your own Writable class, which holds the filename and offset, and the value is the actual file content. You have to set isSplitable to false (we don't want to split the files) and set maxSplitSize to a value that fits your requirement; based on that value, CombineFileRecordReader decides the number of splits and creates a record reader instance for each split.
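As a rough sketch of what that could look like (FileLineWritable and CombineFileGzipRecordReader are placeholder names for the custom key class and record reader described in this answer, not classes that ship with Hadoop):
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Hypothetical custom InputFormat combining many gzip files per split.
public class CombineGzipFileInputFormat extends CombineFileInputFormat<FileLineWritable, Text> {

    public CombineGzipFileInputFormat() {
        super();
        // Combine files up to roughly 128 MB per split; tune to your requirement.
        setMaxSplitSize(128 * 1024 * 1024);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split an individual (gzip) file
    }

    @Override
    public RecordReader<FileLineWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // CombineFileRecordReader instantiates one CombineFileGzipRecordReader per file in the split
        return new CombineFileRecordReader<>((CombineFileSplit) split, context, CombineFileGzipRecordReader.class);
    }
}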
You then have to build your own custom RecordReader by adding your decompression logic to it.
Custom RecordReader:
The custom RecordReader uses a LineReader and sets the key to the filename and offset, and the value to the actual file content. If the file is compressed, it decompresses it and then reads it. Here is the relevant extract:
private void codecWiseDecompress(Configuration conf) throws IOException {
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path);
    if (codec == null) {
        System.err.println("No Codec Found For " + path);
        System.exit(1);
    }
    String outputUri = CompressionCodecFactory.removeSuffix(path.toString(), codec.getDefaultExtension());
    dPath = new Path(outputUri);
    InputStream in = null;
    OutputStream out = null;
    fs = this.path.getFileSystem(conf);
    try {
        in = codec.createInputStream(fs.open(path));
        out = fs.create(dPath);
        IOUtils.copyBytes(in, out, conf);
    } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
        rlength = fs.getFileStatus(dPath).getLen();
    }
}
Custom Writable Class:
A pair holding the filename and the offset.
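A minimal sketch of such a key class might look like the following (the class name is illustrative; the original article's implementation may differ):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: pairs the source filename with the offset of the record in that file.
public class FileLineWritable implements WritableComparable<FileLineWritable> {

    private Text fileName = new Text();
    private long offset;

    public void set(String fileName, long offset) {
        this.fileName.set(fileName);
        this.offset = offset;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        fileName.write(out);
        out.writeLong(offset);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fileName.readFields(in);
        offset = in.readLong();
    }

    @Override
    public int compareTo(FileLineWritable other) {
        int cmp = fileName.compareTo(other.fileName);
        return cmp != 0 ? cmp : Long.compare(offset, other.offset);
    }
}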

How to open/stream .zip files through Spark?

I have zip files that I would like to open 'through' Spark. I can open .gzip files without a problem because of Hadoop's native codec support, but am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but have been unsuccessful so far.
There was no solution with Python code, and I recently had to read zips in PySpark. While searching for how to do that I came across this question, so hopefully this will help others.
import zipfile
import io

def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))

zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the above code I returned a dictionary with the filename in the zip as the key and the data in each file as the value. You can change it however you want to suit your purposes.
@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail, I was able to search for ZipFileInputFormat Hadoop, and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipfileRecordReader class, I was able to get Spark to perfectly open and read the zip file.
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
The result was a map with one element: the file name as the key and the content as the value, so I needed to transform this into a JavaPairRDD. I'm sure you could probably replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to first get something running.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
    @Override
    public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
        List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();
        InputStream is = new ByteArrayInputStream(textTextTuple2._2.getBytes());
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            Tuple2 newTuple = new Tuple2(line.split("\\t")[0], line);
            newList.add(newTuple);
        }
        return newList;
    }
});
Please try the code below, using the API:
sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class)
I've had a similar issue and I've solved it with the following code:
sparkContext.binaryFiles("/pathToZipFiles/*")
  .flatMap { case (zipFilePath, zipContent) =>
    val zipInputStream = new ZipInputStream(zipContent.open())
    Stream.continually(zipInputStream.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { zipEntry => ??? }
  }
This answer only collects the previous knowledge, and I share my experience.
ZipFileInputFormat
I tried following @Tinku's and @JeffLL's answers, and used the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API, but this did not work for me. And I do not know how I would put the com-cotdp-hadoop lib on my production cluster; I am not responsible for the setup.
ZipInputStream
@Tiago Palma gave good advice, but he did not finish his answer, and I struggled for quite some time to actually get the decompressed output.
By the time I was able to do so, I had to prepare all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135
But the missing part of the mentioned answer is reading the ZipEntry:
import java.util.zip.ZipInputStream
import java.io.BufferedReader
import java.io.InputStreamReader

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
Using the API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class), the file name should be passed in using the conf:
conf = new Job().getConfiguration()
conf.set(PROPERTY_NAME from your input formatter, "Zip file address")
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)
Please find the PROPERTY_NAME from your input formatter for setting the path.
Try:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")
