How To Handle Large Files Using Spring Integration - spring-boot

I'm working with very large files and using Spring Integration to process them. I want to know the best and most efficient way to handle them using Spring Integration and the provided DSL. I have a test CSV file with around 30K records; I am using the FileSplitter component to read each line into memory and then splitting again on the delimiter to get the columns that I need.
Code snippet below.
IntegrationFlows
    .from(Files.inboundAdapter(new File(inputFilePath))
                .filter(getFileFilters())
                .autoCreateDirectory(true),
            c -> c.poller(Pollers.fixedRate(1000)))          // poll the input directory every second
    .split(Files.splitter())                                  // emit the file line by line
    .channel(c -> c.executor(Executors.newWorkStealingPool()))
    .handle((p, h) -> new MyColumnSelector().getCol((String) p, 1))
    .split(s -> s.applySequence(true).delimiters(","))        // split again on the delimiter to get the columns
    .channel(c -> c.executor(Executors.newWorkStealingPool()))
    .get()
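MyColumnSelector isn't shown above; presumably it's just a small helper that splits a line and returns the column at the requested index, roughly along these lines (simplified sketch only, the actual delimiter and error handling may differ):

public class MyColumnSelector {

    // rough sketch: return the column at the given index, or null if the line is shorter
    public String getCol(String line, int index) {
        String[] columns = line.split(",");   // assumed delimiter
        return index < columns.length ? columns[index] : null;
    }
}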

The issue was the IDE and console logging overhead that was slowing things down. I tested this with the same file without the IDE or any extra logging and it processed significantly faster.
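One simple way to test this outside the IDE, assuming the default Spring Boot/Logback setup, is to dial down console logging in application.properties while measuring:

# reduce console logging while measuring throughput
logging.level.root=WARN
logging.level.org.springframework.integration=WARN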

Related

Reactive Quarkus app behaving differently when run as Java or native

I have a reactive Quarkus app with hibernate-panache-reactive. The problem is that it behaves differently when I run it as a Java app than when I run it as a native app.
The app
1) loads a lot of data from a MySQL DB via hibernate-panache-reactive
2) builds a graph based on the loaded data
3) runs a time-consuming algorithm on the graph
4) loads some more data from the DB based on the results returned from 3)
So initially the code looked something like this:
GraphProcessor graphProcessor = createInitialProcessor();
return Uni.createFrom().item(graphProcessor)
// 1) loading of initial data
.onItem().transformToUni(this::loadDataViaPanaceReactive1)
.onItem().transformToUni(this::loadDataViaPanaceReactive2)
.onItem().transformToUni(this::loadDataViaPanaceReactive3)
// 2) building of graph
.onItem().transform(graphProcessor::processLoadedData)
.onItem().invoke(graphProcessor::loadingComplete) //sync
// 3) running time consuming algorithm on graph
.onItem().transformToMulti(this::runTimeConsumingTask)
.onItem().invoke(this::prepareDBQueries)
// 4) load more data from DB
.onItem().transformToUniAndConcatenate(this::loadMoreData1)
.onItem().transformToUniAndConcatenate(this::loadMoreData2)
.onItem().transformToUniAndConcatenate(this::transformToPublicForm)
.onFailure().invoke(log::error);
That worked fine when run as a Java app, but when I tried to run it as a native app it first complained that the computations in 2) and 3) were taking too long and were blocking the calling thread.
I fixed that by inserting
.emitOn(Infrastructure.getDefaultWorkerPool())
between 1) and 2).
This time I got another error
java.lang.IllegalStateException: HR000069: Detected use of the
reactive Session from a different Thread than the one which was used
to open the reactive Session - this suggests an invalid integration;
original thread: 'vert.x-eventloop-thread-0' current Thread:
'vert.x-eventloop-thread-1'
I fixed that by inserting
.emitOn(Infrastructure.getDefaultExecutor())
between 3) and 4):
GraphProcessor graphProcessor = createInitialProcessor();
return Uni.createFrom().item(graphProcessor)
// 1) loading of initial data
.onItem().transformToUni(this::loadDataViaPanaceReactive1)
.onItem().transformToUni(this::loadDataViaPanaceReactive2)
.onItem().transformToUni(this::loadDataViaPanaceReactive3)
// 2) building of graph
.emitOn(Infrastructure.getDefaultWorkerPool()) // Required for native mode
.onItem().transform(graphProcessor::processLoadedData)
.onItem().invoke(graphProcessor::loadingComplete)
// 3) running time consuming algorithm on graph
.onItem().transformToMulti(this::runTimeConsumingTask)
.onItem().invoke(this::prepareDBQueries)
.emitOn(Infrastructure.getDefaultExecutor()) // Required for native mode
// 4) load more data from DB
.onItem().transformToUniAndConcatenate(this::loadMoreData1)
.onItem().transformToUniAndConcatenate(this::loadMoreData2)
.onItem().transformToUniAndConcatenate(this::transformToPublicForm)
.onFailure().invoke(log::error);
That worked when run in native mode, but now when I run it in Java I get the same exception (Detected use of the reactive Session from a different Thread than the one which was used to open the reactive Session).
The emitOn(Infrastructure.getDefaultExecutor()) should have switched back to the original thread.
The odd thing is also that this exception is not thrown every time I hit the app.
So what am I doing wrong here? What is the best way to handle time-consuming tasks that then have to do some more DB queries afterwards?
You could use .runSubscriptionOn(Executor), but I would need to switch back to the original thread for part 4) again.
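For what it's worth, what I have in mind with .runSubscriptionOn() is roughly this (just a sketch; I haven't verified that it avoids the session problem):
// sketch only: run the subscription (and the synchronous upstream work) on a worker thread,
// then hop back to the default executor before the DB queries in 4)
.onItem().transformToMulti(this::runTimeConsumingTask)
.runSubscriptionOn(Infrastructure.getDefaultWorkerPool())
.emitOn(Infrastructure.getDefaultExecutor())
// 4) load more data from DB
.onItem().transformToUniAndConcatenate(this::loadMoreData1)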
Thanks for your help.

Spring Batch - Is it possible to write a custom restart in spring batch

I have been writing a Spring Batch job in which I have to perform some error handling.
I know that Spring Batch has its own way of handling errors and restarts.
However, when the batch fails and restarts, I want to pass my own values, conditions and parameters (for the restart) which need to be applied before starting/executing the first step.
So, is it possible to write such a custom restart in Spring Batch?
UPDATE 1: (providing a better explanation for the above question)
Let's say that the input to my reader in Step 1 is in the following format:
Table with following columns:
CompanyName1 -> VehicleId1
CN1 -> VID2
CN1 -> VID3
.
.
CN1 -> VID30
CN2 -> VID1
CN2 -> VID2
.
.
CNn -> VIDn
The reader reads this table row by row with a chunk size of 1 (so the row retrieved will be CN -> VID), processes it, and writes it to a File object.
This goes on until all the CN1 data has been written into the File object. When the reader sends the first row with a company name of type CN2, the File object created earlier (for CN1) is stored in a remote location. File object creation then continues for CN2 until we encounter CN3, at which point the CN2 File object is sent for storage to the remote location, and so on.
Now, once you understand this, here's the catch.
Let's say the writer is currently writing the data for company name 2 (CN2) and vehicle ID VID20 (CN2 -> VID20) into the File object. Then, for some reason, we have to stop the job or the job fails. In that case the instance that gets saved will be CN2 -> VID20, so the next time the job runs it will start from CN2 -> VID20.
As you might have guessed, the 19 entries before CN2 -> VID20 that were already written into the File object were permanently lost when the File object got destroyed, and they were never sent in the file to the remote location.
So my question here is this:
Is there a way where I can write my custom restart for the batch where I could tell the job to start from CN2->VID1 instead of CN2->VID20?
If you could think of any other way to handle this scenario then such suggestions are also welcome.
Since you want to write the data of each company in a separate file, I would use the company name as a job parameter. With this approach:
Each job will read the data of a single company and write it to a file
If a job fails, you can restart it and it will resume where it left off (Spring Batch truncates the file to the last known written offset before writing new data), so there is no need for a custom restart "policy".
No need to set the chunk size to 1; that is not efficient. You can use a reasonable chunk size with this approach.
If the number of companies is small enough, you can run the jobs manually. Otherwise, you can get the distinct values with something like select distinct(company_name) from yourTable and write a script/loop to launch batch jobs with different parameters. Bottom line: make each job do one thing and do it well. A minimal launcher loop is sketched below.
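A sketch of such a loop, where the jdbcTemplate, jobLauncher and companyFileJob beans as well as the table/column names are placeholders for your own setup:

// Sketch: launch one job instance per company, with the company name
// passed as an identifying job parameter.
public void launchPerCompany(JdbcTemplate jdbcTemplate,
                             JobLauncher jobLauncher,
                             Job companyFileJob) throws Exception {
    List<String> companies = jdbcTemplate.queryForList(
            "select distinct(company_name) from yourTable", String.class);
    for (String company : companies) {
        JobParameters params = new JobParametersBuilder()
                .addString("companyName", company)   // identifies the job instance
                .toJobParameters();
        jobLauncher.run(companyFileJob, params);     // restart/rerun is scoped to one company
    }
}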

Replicate logic or use existing classes in tests?

I have some concerns about writing too much logic in the spec tests.
So let's assume we have a student with statuses and steps
Student statuses go from
pending -> learning -> graduated -> completed
and steps are:
nil -> learning_step1 -> learning_step2 -> learning_step3 -> monitoring_step1 -> monitoring_step2
With each step forward, a lot of things happen depending on where you are, e.g.
nil -> learning_step1
Student status changes to learning
Writes an action history (which is used by report stats)
Update a contact schedule
learning_step1 -> learning_step2
....the same...
and so ..... until
learning_step3 -> monitoring_step1
Student status changes to graduated
Writes different action histories (which are used by report stats)
Update a contact schedule
and when
monitoring_step2 -> there is no next step
Student status changes to completed
Writes different action histories (which are used by report stats)
Delete any contact schedule
So imagine that I need a test case for a completed student: I would have to think through all the steps that lead to a completed student, and I could also forget to write an action history, which would mess up the reports.
Or ....
Using an already implemented class:
# assuming, as in the example above, that there are 5 steps, I do
StepManager.new(student).proceed # goes status learning and step1
StepManager.new(student).proceed
StepManager.new(student).proceed
StepManager.new(student).proceed # goes status graduated and monitoring1
StepManager.new(student).proceed # advances the student to the 5th step, which is monitoring_step2
StepManager.new(student).next_step # here the student becomes completed
or to make my job easier with something like:
StepManager.new(student).proceed.proceed.proceed.proceed.proceed.proceed
or
StepManager.new(student).complete_student # which in background does the same thing
And by doing that I am sure I will never miss anything. But then the tests wouldn't be so clear about what I am doing.
So should I replicate the logic or use my classes?
Use TDD best practices. In a Rails project, for example, write unit tests for your models and controller actions. Same for services. Test each unit for its expectations. If you need a more complex data state to check, it's recommended to build it with factories such as https://github.com/thoughtbot/factory_bot (or https://github.com/thoughtbot/factory_bot_rails if you're using Rails).

Creating single object DataFrame for predictions

Once I have my classification models trained, I'd like to use them in my web application to make classification predictions on the data that has been collected for a given session.
That is:
1) I have some session data structure that I need to map to a DataFrame row
2) feed the DataFrame row into my ML model to predict the classification
3) use the prediction together with the originating session to show it to the user in the browser.
The examples I've seen so far for creating a DataFrame as input to a Spark pipeline create it from a data source like a file. It seems a bit unwieldy to first create a single POJO or JsonNode, serialize it to a file containing just one record, and then use that file to create the DataFrame to feed the model.
Writing this I also get the feeling that it might not be a great idea to create and tear down the ML pipeline for each request, which seems to follow from this approach.
So maybe I should rather be thinking "Spark Streaming"?
Feed the mapped session data into some kind of message queue and feed that into my Spark pipeline? What kind of "stream" would be appropriate here?
I read somewhere that Spark Streaming consumes the stream in micro-batches and not record by record. That implies some delay until enough records have been collected to fill up the micro-batch (or some preconfigured delay after which the micro-batch is considered "full enough"). What does that mean for the responsiveness of the web application? Can I trigger the micro-batches like every 100 milliseconds?
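As far as I understand, the micro-batch interval is fixed when the StreamingContext is created, so 100 milliseconds would be configured roughly like this (a sketch using the Java API; I have no idea yet whether such a small interval is practical):

SparkConf conf = new SparkConf()
        .setAppName("session-classifier")   // placeholder app name
        .setMaster("local[*]");             // placeholder master
// every micro-batch covers 100 ms of collected input
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.milliseconds(100));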
I would appreciate it if someone could point me in the right direction.
Maybe Spark is not a good fit here and I should switch to Apache Flink?
Thanks in advance, Bernd
OK, by now I have found some ways to solve my problem; maybe this helps someone else:
Use a Sequence containing one tuple and name the columns separately
val df = spark.createDataFrame(
  Seq(("val1", "val2"))
).toDF("label1", "label2")
Using a JSON string
val sqlContext = sparkSession.sqlContext
val jsonData = """{ "label1": "val1", "label2": "val2" }"""
val rdd = sparkSession.sparkContext.parallelize(Seq(jsonData))
val df = sqlContext.read.json(rdd)
NOT working: create from a Sequence of case class objects:
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
val myData= Seq(Feat("value1", "value2"))
val ds: Dataset[Feat]= myData.toDS()
ds.show(10, false)
This compiles ok, but yields an Exception at runtime:
[error] a.a.OneForOneStrategy - java.lang.RuntimeException:
Error while encoding: java.lang.ClassCastException:
es.core.recommender.Feat cannot be cast to es.core.recommender.Feat
I'd love to include more of the stacktrace, but this glorious editor won't let me...
It would be nice to know why this alternative did not work...

carbon-tagger does not translate received metrics to Graphite

I have configured a monitoring system out of the following pieces:
my_app -> pystatsd -> statsdaemon -> carbon-tagger -> graphite (via carbon-cache) -> graph-explorer
But it looks like carbon-tagger only dumps metrics to ElasticSearch, not to Graphite. At the same time, carbon-tagger successfully sends its internal metrics to carbon-cache and they show up in Graph Explorer just fine. I have looked at the carbon-tagger source code and could not find the place where it sends the metrics received from statsdaemon to Graphite. So now I'm confused! How should I configure my monitoring system to dump metrics both to ElasticSearch and to Graphite?
In a nutshell, a correct configuration of the described system should look like this:
my_app -> pystatsd -> statsdaemon -> carbon-relay (or carbon-relay-ng) -> carbon-tagger -> ElasticSearch
                                                                       -> carbon-cache -> graphite -> graph-explorer
That is, statsd/statsdaemon should pass its data to carbon-relay (or carbon-relay-ng), not to carbon-cache directly, and carbon-relay then broadcasts the data to both carbon-tagger and carbon-cache. Also, don't forget that carbon-tagger doesn't understand the pickle format, while the original carbon-relay emits data only via the pickle protocol.
