Talend Open Studio for ESB 5.2 Route to Job Optimisation/Performance Issue - performance

Using Talend ESB 5.2.0, I want to create a mediation route that calls a processing job on the payload of an inbound request to a CXF messaging endpoint; however, my current implementation is suffering performance problems with large payloads.
I've investigated the issue and found that the bottleneck is in marshalling the inbound XML payload from the tRouteInput component into the internal row structure for processing, using a tXMLMap.
Is it possible, using a built-in type converter in the route, to marshal the payload into POJOs or transport objects that are cheaper for the job to process and stream those through instead? Or is there a better way to marshal XML into Talend's internal row structure from a route using a less expensive transform?
Any thoughts would be welcome.
Cheers,
mids

It turns out that the issue was caused by the structure of the inbound XML payload: having more than one loop element mapped to separate output flows in the tXMLMap generates relative links between the items of each output flow, enabling more advanced processing involving the loops if required.
This caused the large memory overhead that led to the poor throughput.
Since we didn't need any of that more advanced processing in the XML-to-row conversion, we overcame the issue by splitting the payload into its distinct loop elements using tReplicate and tExtractXMLField components before mapping out of the XML in separate tXMLMaps, so those links are never auto-generated. - mids
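For anyone curious how the same principle looks outside the Studio, here is a rough stand-alone Java sketch (the element names, file name and fields are invented for illustration; it is not Talend-generated code). Each loop element is mapped in its own independent pass, so no links between the two loops ever need to be built in memory.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import java.util.*;

public class SplitLoopMapping {
    public static void main(String[] args) throws Exception {
        // Hypothetical inbound payload containing two loop elements.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("payload.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Pass 1: map only the first loop element into flat rows.
        NodeList orders = (NodeList) xpath.evaluate(
                "/request/orders/order", doc, XPathConstants.NODESET);
        List<String[]> orderRows = new ArrayList<>();
        for (int i = 0; i < orders.getLength(); i++) {
            Element e = (Element) orders.item(i);
            orderRows.add(new String[] { xpath.evaluate("id", e), xpath.evaluate("amount", e) });
        }

        // Pass 2: map only the second loop element, completely independently.
        NodeList customers = (NodeList) xpath.evaluate(
                "/request/customers/customer", doc, XPathConstants.NODESET);
        List<String[]> customerRows = new ArrayList<>();
        for (int i = 0; i < customers.getLength(); i++) {
            Element e = (Element) customers.item(i);
            customerRows.add(new String[] { xpath.evaluate("id", e), xpath.evaluate("name", e) });
        }

        System.out.println(orderRows.size() + " order rows, " + customerRows.size() + " customer rows");
    }
}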

Related

Best practices for FlowFile content vs FlowFile attributes

I'm just getting started with Apache NiFi and I'm curious whether there are any best practices around using attributes vs. content for a FlowFile. Currently, I have it set up to read a JSON message from a RabbitMQ queue, parse the JSON into attributes and use those attributes for downstream processing. This works, but I feel like it leaves the content of the FlowFile largely unused after the JSON parsing, and I'm wondering if I'm missing something. A lot of the processors seem more geared towards working with attributes, but are there any disadvantages to primarily using attributes for processing?
In my use case, the RabbitMQ message is an event announcing that a new document has been made available, and the flow I'm building will have branching logic based on the document type to extract data from the document via NLP processes. Currently, I'm storing the document text as an attribute, but I'm wondering if there are any size considerations to account for with attributes. Some documents could be hundreds of pages and therefore a lot of text.
Thanks!

How to route based on content with high performance?

In NiFi, I am listening to Kafka on a single topic and, based on the routing logic, calling the respective process group.
However, in the RouteOnContent processor, if we use a regular expression to check for the occurrence of a string, will that hurt performance, and how can we achieve good performance while routing based on a condition?
Would it be more efficient to do the split at the KSQL / stream-processing level into different topics and have NiFi read from the different topics?
Running a regex on the content of each message is an inefficient approach; consider whether you can change your approach to one of the following:
Have your producers write the necessary metadata into a Kafka header, which lets you use the much more efficient RouteOnAttribute processor in NiFi. This is still message-at-a-time, which has throughput limitations.
If your messages conform to a schema, use the more efficient Kafka record processors in NiFi with a QueryRecord approach, which will significantly boost throughput.
If you cannot modify the source data and the regex logic is involved, it may be more efficient to use a small Kafka Streams app to split the topic before processing the data further downstream.
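As a rough illustration of that last option, a minimal Kafka Streams splitter could look like the sketch below (the topic names, the "type" check and the bootstrap address are assumptions). NiFi can then consume each per-type topic with a plain ConsumeKafka processor and skip RouteOnContent entirely.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class TopicSplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-splitter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("inbound-events");

        // Do the content check once here instead of a regex inside NiFi.
        source.filter((key, value) -> value.contains("\"type\":\"invoice\""))
              .to("events-invoice");
        source.filter((key, value) -> !value.contains("\"type\":\"invoice\""))
              .to("events-other");

        new KafkaStreams(builder.build(), props).start();
    }
}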

In NiFi, does usage of the EvaluateJsonPath processor have a performance impact because of attribute creation?

I'm trying to integrate the NiFi REST APIs with my application. By mapping inputs and outputs from my application, I am calling the NiFi REST API to create flows. In my use case, most of the time I will extract JSON values and apply Expression Language functions to them.
To simplify all of the use cases, I am using the EvaluateJsonPath processor to fetch all the attributes via JSONPath and then applying Expression Language functions to them in an extract processor.
Is this the right approach? For JSON-to-JSON manipulation with around 30 keys it is the simplest way, and because I am integrating the NiFi REST APIs with my application I cannot generate JOLT transformation logic dynamically based on the user's mapping.
So, in this case, does the use of the EvaluateJsonPath processor create any performance issues across roughly 50 use cases with different transformation logic? As I saw in the documentation, attribute usage can create performance (memory) issues.
Your concern about having too many attributes in memory should not be an issue here; having 30 attributes per flowfile is higher than usual, but if these are all strings of roughly 100-200 characters or fewer, there should be minimal impact. If you start trying to extract kilobytes of data from the flowfile content into attributes on each flowfile, you will see increased heap usage, but the framework should still be able to handle this until you reach very high throughput (thousands of flowfiles per second on commodity hardware like a modern laptop).
You may want to investigate ReplaceTextWithMapping, as that processor can load from a definition file and handle many replace operations using a single processor.
It is usually a flow design "smell" to have multiple copies of the same flow process with different configuration values (with the occasional exception of database interaction). Rather, see if there is a way you can genericize the process and populate the relevant values for each flowfile using variable population (from the incoming flowfile attributes, the variable registry, environment variables, etc.).

Searching for architecture approach for converting Kafka Message to other formats

We're using Kafka as a broker which takes notifications from different message sources and then routes them to one or more target apps like Slack or e-mail. With this approach it is necessary to convert the Kafka message into different output formats, such as JSON or e-mail, before it is sent to the apps.
I thought of having Spring Boot microservices at the target ends which take the message from Kafka, convert it into the target format using a common template language such as Velocity or FreeMarker, and then forward the converted result to the given target app.
Would you agree with such an approach or are there better ways, some caveats or even no-gos to do it this way? What about performance? Any experience in this?
Thanks for your honest assessment.
Why not have a single serialization format and let each service deserialize the payload for their use case? Templating with something like Velocity or Freemarker seems like a specific concern independent of the data used to populate the template. Maybe focus on broadcasting the raw data.
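For what it's worth, a target-side consumer along the lines discussed above might look roughly like this sketch (the class, topic, group and template names are assumptions, and the actual Slack call is omitted). It deserialises the shared JSON payload first and only then applies its own FreeMarker template.

import com.fasterxml.jackson.databind.ObjectMapper;
import freemarker.template.Configuration;
import freemarker.template.Template;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

import java.io.StringWriter;
import java.util.Map;

@Component
public class SlackNotificationConsumer {

    private final ObjectMapper mapper = new ObjectMapper();
    private final Configuration freemarker;

    public SlackNotificationConsumer() {
        // Templates live on the classpath under /templates, e.g. slack-message.ftl (assumed name).
        freemarker = new Configuration(Configuration.VERSION_2_3_31);
        freemarker.setClassForTemplateLoading(getClass(), "/templates");
    }

    @KafkaListener(topics = "notifications", groupId = "slack-forwarder")
    public void onMessage(String payload) throws Exception {
        // 1. Deserialise the common JSON format into a simple data model.
        Map<String, Object> model = mapper.readValue(payload, Map.class);

        // 2. Render the Slack-specific text from the shared data.
        Template template = freemarker.getTemplate("slack-message.ftl");
        StringWriter out = new StringWriter();
        template.process(model, out);

        // 3. Forward the rendered result to the target app (webhook call omitted).
        System.out.println("Would post to Slack: " + out);
    }
}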

Handling Big Data with OSB Proxy

I have created an OSB proxy service (Messaging Service) which loads the data using an MFL file.
The format of data is:
1/1/2007;00:11:00;2.500;0.000;242.880;10.200;0.000;0.000;0.000;
1/1/2007;00:12:00;2.494;0.000;242.570;10.200;0.000;0.000;0.000;
The total number of data records is 2,075,259.
The total size of the file (.txt or .data) is 130 MB.
What is the best way to handle all of this data so it can be fed into the OSB proxy and transformed into a simple XML file?
I have tested with a small number of records (5,000) and it works as expected, but how should I feed the full data set into the proxy?
Is the MFL transformation a valid idea, or should I create a FileAdapter proxy which will receive the data from a database table?
Please share your suggestions.
Thank you in advance.
ESBs are efficient at handling messages in the order of KBs, not MBs, although this is very subjective and depends a lot on the number of concurrent requests, transactions per second, sizing of hardware etcetera. As Trent points out in a comment, you could implement a claim check pattern and delegate the file transformation to an external utility, such as perl or similar.
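As a rough sketch of that external-utility idea in Java rather than Perl (file names, element names and the field layout are assumptions), the semicolon-delimited file can be streamed into XML record by record, so the proxy only ever handles a reference to the generated file rather than the 130 MB payload itself.

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;

public class DelimitedToXml {
    public static void main(String[] args) throws Exception {
        try (FileWriter target = new FileWriter("power_data.xml");
             BufferedReader in = new BufferedReader(new FileReader("power_data.txt"))) {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(target);
            xml.writeStartDocument();
            xml.writeStartElement("records");

            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(";");
                if (fields.length < 3) continue;   // skip malformed lines
                xml.writeStartElement("record");
                xml.writeAttribute("date", fields[0]);
                xml.writeAttribute("time", fields[1]);
                for (int i = 2; i < fields.length; i++) {
                    xml.writeStartElement("value");
                    xml.writeCharacters(fields[i]);
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }

            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}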
