Multiple data collectors for a job without duplicating records in StreamSets - data-collection

I have a directory consisting of multiple files, and it is shared across multiple data collectors. I have a job that processes those files and writes them to a destination. Because the volume of records is huge, I want to run the job on multiple data collectors, but when I tried, I got duplicate entries in my destination. Is there a way to achieve this without duplicating the records? Thanks

You can use Kafka for this. For example:
Create one pipeline that reads the file names and sends them to a Kafka topic via a Kafka producer.
Create a second pipeline with a Kafka consumer as its origin, and set the consumer group property on it. This pipeline will read the file names and work with the files.
Now you can run multiple pipelines with Kafka consumers in the same consumer group. In this case Kafka will balance the messages within the consumer group by itself, so each file name is delivered to exactly one consumer and you will not get duplicates.
To be sure you won't lose file names, also set the acks = all property on the Kafka producer.
With this scheme you can run as many collectors as your Kafka topic has partitions.
Hope this helps.
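A minimal Java sketch of both halves of that scheme (the broker address, topic name, and group id are made-up values, and processFile is a placeholder for the real file-processing pipeline); every collector runs the consumer side with the same group id:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class FileNameDistribution {

    // Producer side: publish each file name once; acks=all so the record is
    // committed to all in-sync replicas before the send is acknowledged.
    static void publishFileNames(List<String> fileNames) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String name : fileNames) {
                producer.send(new ProducerRecord<>("file-names", name, name)); // assumed topic
            }
        }
    }

    // Consumer side: every collector runs this with the SAME group id, so Kafka
    // assigns each partition (and hence each file name) to exactly one consumer.
    static void consumeFileNames() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "file-workers"); // shared consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("file-names"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    processFile(record.value()); // hand off to the file-processing pipeline
                }
            }
        }
    }

    static void processFile(String fileName) { /* file-processing logic goes here */ }
}
```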

Copying my answer from Ask StreamSets:
At present there is no way to automatically partition directory contents across multiple data collectors.
You could run similar pipelines on multiple data collectors and manually partition the data in the origin using different character ranges in the File Name Pattern configurations. For example, if you had two data collectors, and your file names were distributed across the alphabet, the first instance might process [a-m]* and the second [n-z]*.
One way to do this would be by setting File Name Pattern to a runtime parameter - for example ${FileNamePattern}. You would then set the value for the pattern in the pipeline's parameters tab, or when starting the pipeline via the CLI, API, UI or Control Hub.
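To make the character-range idea concrete, here is a small Java sketch using java.nio's PathMatcher, which evaluates the same glob syntax as a glob-mode File Name Pattern; the file names are invented:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class PartitionCheck {
    public static void main(String[] args) {
        // Each collector instance would be configured with one of these ranges
        // as its File Name Pattern, so no file is picked up twice.
        PathMatcher first = FileSystems.getDefault().getPathMatcher("glob:[a-m]*");

        for (String name : new String[] {"alpha.csv", "metrics.csv", "orders.csv"}) {
            Path p = Path.of(name);
            // Anything outside [a-m]* falls to the second instance's [n-z]* pattern.
            System.out.printf("%s -> collector %d%n", name, first.matches(p) ? 1 : 2);
        }
    }
}
```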

Related

How do we know when a flow is completed in case we have multiple flow files running in parallel?

I have a requirement where we have a template that uses SQL as the source and SQL as the destination, and the data would be more than 100 GB for each table, so the template will be instantiated multiple times based on the tables to be migrated, and each table is also partitioned into multiple flow files. How do we know when the process is completed? Since there are multiple flow files, we are unable to conclude that processing is finished when one hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it provides the count per connection, and it is difficult to fetch the connection ID for each connection and concatenate them, as we have a large number of templates. There is another problem with the reporting task: it provides data on all process groups on the NiFi canvas, which will be a huge amount of data if all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas and help me achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify processor pair.
If you don't want multiple flow files running in parallel:
2 - Set back pressure on the queue.
3 - Specify process-group-level flow file concurrency (recommended, but NiFi 1.12+ only).

Writing multiple entries from a single message in Kafka Connect

If on one topic I receive messages in some format that represent a list of identical structs (e.g. a JSON list or a repeated field in protobuf), could I configure Kafka Connect to write each entry in the list as a separate row (say, in a Parquet file in HDFS, or in a SQL database)? Is this possible using only the bundled converters/connectors?
That is, can I use each Kafka message to represent thousands of records, rather than sending thousands of individual messages?
What would be a straightforward way to achieve this with Kafka Connect?
The bundled message transforms are only capable of one-to-one message manipulations. Therefore, you would have to explicitly produce those flattened lists in some way (directly, or via a stream processing application) if you wanted Connect to write them out as separate records.
Or, if applicable, you can use Hive or Spark to expand that list for later processing.
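For the stream-processing route, here is a minimal Kafka Streams sketch (the topic names and the newline-delimited list format are assumptions; a real application would parse JSON or protobuf instead) that explodes each list message into individual records on a topic Connect can then consume:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Arrays;
import java.util.Properties;

public class ListExploder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "list-exploder");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Toy format: the value is a newline-delimited list of entries.
        // flatMapValues turns one incoming message into many output records,
        // which the Connect sink then writes as separate rows.
        builder.<String, String>stream("list-messages")           // assumed input topic
               .flatMapValues(value -> Arrays.asList(value.split("\n")))
               .to("flattened-records");                          // topic Connect reads from

        new KafkaStreams(builder.build(), props).start();
    }
}
```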

How to wait for GenerateTableFetch queries to finish

My use case is like this. I have some X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in an individual flow file, and pulling with GenerateTableFetch and ExecuteSQL.
I want to be notified, or to trigger some other action, when the import is done for all the tables. At the SplitText processor I have routed the original relationship to a Wait processor on ${filename} with target count ${fragment.count}. This will track how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks the flow file into multiple flow files based on Partition Size, but it does not write attributes like fragment.count which I could use to wait on for each table.
Is there a way I can achieve this? Or is there a way to know, at the end of the entire flow, whether all flow files in the flow have been processed and nothing is queued or in flight?
If you have a standalone instance of NiFi (or are not distributing the flow files among ExecuteSQL nodes in a cluster), then you could use QueryDatabaseTable instead; by default, it only issues all flow files once the entire result set is processed. If all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GenerateTableFetch.
Until NiFi adds support for this, I managed to make it work using MergeContent: use table_name as the Correlation Attribute Name, and then route the merged relationship to the Wait processor using ${merge.count} as the target count.

Kafka Streams Custom processing

There is a requirement for me to process huge files; there could be multiple files that we end up processing in parallel.
Each row in a specific file would be processed against a rule specific to that file.
Once the processing is complete, we would generate an output file based on the processed records.
One option I have thought of: each message pushed to the broker will have the row data + the rule to be applied + some correlation ID (an identifier for that particular file).
I plan to use Kafka Streams and create a topology with a processor which will get the rule along with the message, process it, and sink it.
However (I am new to Kafka Streams, so I may be wrong):
The order in which the messages are processed will not be sequential, as we are processing multiple files in tandem (which is fine, because there isn't a requirement to preserve order; moreover, I want to keep it decoupled). But then how would I bring it to logical closure, i.e. in my processor, how would I come to know that all the records of a file have been processed?
Do I need to maintain the record counts (correlation ID, number of records, etc.) in something like Ignite? I am unsure about that, though.
I guess you can set aside a special key/value record that is sent to the topic at the end of each file, which would signify the closure of that file.
Say the record has a unique key, such as -1, that signifies EOF.
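A minimal Kafka Streams sketch of that idea (the topic names, the EOF marker value, and the rule logic are all assumptions). Because records are keyed by the correlation ID, the EOF marker lands on the same partition as that file's rows and is therefore seen after them:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class FileClosureStream {
    // Hypothetical sentinel: the producer sends one record per file with this
    // value once all rows of the file have been published.
    private static final String EOF_MARKER = "EOF";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "file-rule-processor"); // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Key = correlation ID (file identifier), value = row payload or EOF marker.
        KStream<String, String> rows = builder.stream("file-rows"); // assumed topic name

        // Normal rows: apply the file-specific rule and sink the result.
        rows.filter((fileId, value) -> !EOF_MARKER.equals(value))
            .mapValues(FileClosureStream::applyRule)
            .to("processed-rows");

        // EOF markers: route to a closure topic that a downstream consumer can
        // watch to know when a file is logically complete.
        rows.filter((fileId, value) -> EOF_MARKER.equals(value))
            .to("file-closures");

        new KafkaStreams(builder.build(), props).start();
    }

    private static String applyRule(String row) {
        // Placeholder for the file-specific rule lookup and transformation.
        return row.toUpperCase();
    }
}
```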

Spring Batch & Integration Flat File Generation Pattern

We have a use case where we receive data in flat files, which we load into an Oracle DB using Spring Batch. After the data is loaded into Oracle, we have to distribute it in the form of flat files to several consumers. The data selection criteria depend on pre-decided values in some fields of the data.
We have a design in place that generates a list of objects which can be passed to a Spring Batch job as job parameters to generate the flat files to be sent to the data consumers.
Using a Splitter component, I can put the individual objects into a channel and plug in a JobLaunchingGateway to launch a batch job to generate the flat file.
I need help on how to launch multiple batch jobs in parallel using the JobLaunchingGateway so that I can generate the files in parallel.
A setup is already in place to FTP the files to consumers. We do not need to worry about that.
Use an ExecutorChannel with a task executor before the JobLaunchingGateway.
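A minimal sketch of that wiring with the Spring Integration 5.x Java DSL (the channel names and pool size are assumptions; the upstream splitter is expected to emit JobLaunchRequest messages):

```java
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.integration.launch.JobLaunchingGateway;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.ExecutorChannel;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.messaging.MessageChannel;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ParallelJobLaunchConfig {

    // Thread pool that determines how many jobs run concurrently (size is an assumption).
    @Bean
    public ThreadPoolTaskExecutor jobLaunchExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(4);
        return executor;
    }

    // ExecutorChannel hands each JobLaunchRequest message to a pool thread,
    // so the gateway launches jobs in parallel instead of serially.
    @Bean
    public MessageChannel jobRequestChannel() {
        return new ExecutorChannel(jobLaunchExecutor());
    }

    @Bean
    public IntegrationFlow jobLaunchFlow(JobLauncher jobLauncher) {
        return IntegrationFlows.from("splitterOutput")   // assumed upstream splitter channel
                .channel(jobRequestChannel())
                .handle(new JobLaunchingGateway(jobLauncher))
                .channel("jobResults")                   // JobExecution replies land here
                .get();
    }
}
```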
