Spring XD stream failure handling - spring-xd

I have a stream as follows,
source(jms-ibmmq) -> Process -> Process -> sink(jdbc-oracle)
Data ingestion works fine, but as part of my stream there is a possibility that my sink (jdbc-oracle) will be down, or that some problem in the network prevents persistence to the Oracle DB.
What I am asking is how to handle this failure, and what options Spring XD provides out of the box. Is there a pattern that is commonly used to handle failures caused by the processing / sink modules of a stream?

Please see the comments on this JIRA issue; they explain documentation changes we are adding that describe how to configure dead-lettering in the message bus.
In addition, we have provided a mechanism whereby, if all four modules are deployed to the same container (and to all containers that match the deployment criteria), the modules are directly connected, so that an error in the sink is thrown back to the source (causing the JMS message to be rolled back, in your case).
This is achieved by setting the module count property to 0 (meaning deploy on all containers that match the criteria - if any - or all containers, if no criteria).
This feature is available on master (it was added after M7).
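For example (the stream and module names below are illustrative, not taken from the original question), deploying with the count property set to 0 would look something like this in the XD shell:
xd:> stream create --name orders --definition "jms | enrich | transform | jdbc"
xd:> stream deploy --name orders --properties "module.jms.count=0,module.enrich.count=0,module.transform.count=0,module.jdbc.count=0"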

Related

Restarting Partition Step in Spring Batch

Our application is a Spring Batch job running in OpenShift. The application calls another service via REST to fetch records from a database. Both use an nginx sidecar for handling the traffic. Both sidecars restarted for some reason and the Spring Batch job terminated suddenly. I already implemented a retry mechanism using @Retryable, but the logic has not even reached the retry part. The only log I found in the application is given below:
"Encountered an error executing step myPartitionStep in job myJob","level":"ERROR","thread":"main","logClass":"o.s.batch.core.step.AbstractStep","logMethod":"execute","stack_trace":"o.s.b.core.JobExecutionException: Partition handler returned an unsuccessful step
o.s.b.c.p.support.PartitionStep.doExecute(PartitionStep.java:112)
o.s.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
o.s.b.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
o.s.b.c.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:68)
o.s.b.c.j.f.s.state.StepState.handle(StepState.java:68)
o.s.b.c.j.f.support.SimpleFlow.resume(SimpleFlow.java:169)
o.s.b.c.j.f.support.SimpleFlow.start(SimpleFlow.java:144)
o.s.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:137)
o.s.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
o.s.b.c.l.s.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
o.s.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
o.s.b.c.l.s.SimpleJobLauncher.run(SimpleJobLauncher.java:140)
j.i.r.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java)
j.i.r.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:566)
o.s.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
o.s.a.f.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
o.s.b.c.c.a.SimpleBatchConfiguration$PassthruAdvice.invoke(SimpleBatchConfiguration.java:128)
... 13 frames truncated\n"
I am not able to pinpoint what exactly is the reason for this error. It stopped at the partition step, which uses an ItemReader to call another service and fetch the records, and a FlatFileItemWriter which writes the records. We cannot afford to have duplicates in our file. Is it possible to restart the app exactly where it stopped without having duplicates?
The stack trace you shared is truncated, so it is not possible to determine the root cause from it.
Spring Batch supports restarting a failed partitioned step, as long as you use a persistent job repository. You need to restart the same job instance, i.e. use the same job parameters that you used in your first (failed) run. Only failed partitions will be rerun, and each failed partition will resume from where it left off.
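As a rough sketch (the class name and the run.date parameter below are made up for illustration), restarting simply means launching the same Job with the identical identifying parameters; with a persistent job repository, Spring Batch then picks up the failed execution instead of creating a new job instance:
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class JobRestarter {

    private final JobLauncher jobLauncher;
    private final Job myJob;

    public JobRestarter(JobLauncher jobLauncher, Job myJob) {
        this.jobLauncher = jobLauncher;
        this.myJob = myJob;
    }

    public JobExecution restartFailedRun() throws Exception {
        // Use exactly the same identifying parameters as the failed run
        // (NOT a new timestamp), otherwise a brand new job instance is
        // created and every partition runs again from scratch.
        JobParameters sameParameters = new JobParametersBuilder()
                .addString("run.date", "2023-01-15") // hypothetical value from the failed run
                .toJobParameters();

        // Spring Batch finds the failed execution of this job instance and
        // restarts it: completed partitions are skipped, failed partitions
        // resume from their last committed position.
        return jobLauncher.run(myJob, sameParameters);
    }
}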

How to get the validation error message into an attribute from the ValidateRecord processor in NiFi

I am trying to validate JSON using the ValidateRecord processor via AvroSchemaRegistry. I need to store the validation error message in a SQL table, so I tried to capture the error message in an attribute, but I am unable to do so. Any idea how to do it?
After your ValidateRecord processor, you can choose to route flow files which are 'invalid' to a separate log and then route them to your SQL table; you can do the same if they 'fail'. I am assuming that by 'error message' you mean the 'bulletin' which occurs when the processor can neither validate nor invalidate the flow file based on your schema.
A potential solution to this is to use the SiteToSiteBulletinReportingTask
Screenshot of SiteToSiteBulletinReportingTask
You can build a dataflow to receive these bulletin events, manipulate them as you want and store them in a location of your choice for your auditing needs.
From the sounds of it, the SiteToSiteBulletinReportingTask should be able to achieve what you want. To implement this, add a SiteToSiteBulletinReportingTask to the 'Reporting Tasks' in the NiFi Settings (screenshot: Reporting Tasks in NiFi Settings).
You can name your input port and have it flow towards your SQL store and you should have what you're after.
You need to allow NiFi nodes to receive data via site-to-site on the input port and you also need to grant the correct permissions on the root process group so the nodes are able to see the component, view and modify the data.
Side note: I would usually log everything and have all 'failure' and 'invalid' relationships route to log files, which I then put into a store, e.g. HBase/SQL. One suggestion I've seen is to configure the logging subsystem to additionally send specific error categories to your destination of choice (e.g. active notification vs. passive parsing of logs). NiFi leverages a very flexible logback system (an evolution of log4j). The best part: changes to the $NIFI_HOME/conf/logback.xml configuration file do not require an instance restart and will be picked up within 30 seconds or less.
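As a rough, untested sketch of that kind of logback addition (the appender name, file path, and logger below are illustrative, not NiFi defaults), something like this could go into $NIFI_HOME/conf/logback.xml:
<appender name="VALIDATION_ERRORS" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/nifi/validation-errors.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <fileNamePattern>/var/log/nifi/validation-errors.%d{yyyy-MM-dd}.log</fileNamePattern>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
    </encoder>
</appender>
<!-- additionally send ERROR-level messages from the standard processors bundle to the extra file -->
<logger name="org.apache.nifi.processors.standard" level="ERROR" additivity="true">
    <appender-ref ref="VALIDATION_ERRORS"/>
</logger>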

Is it a good idea to use Serilog to write logs directly to Elasticsearch?

I'm evaluating different options about the distributed log server.
In the Java world, as far as I can see, the most popular solution is Filebeat + Kafka + Logstash + Elasticsearch + Kibana.
However, in the .NET world there is Serilog, which can send structured logs directly to Elasticsearch. So the only required components are Elasticsearch + Kibana.
I searched a lot, but there's not much information about this solution in production. I've no idea whether it's enough to handle large volumes of logs.
Can anyone give me some suggestions? Thanks.
I had exactly the same issue. Our system worked with the "classic" ELK-stack architecture, i.e. Filebeat -> Logstash -> Elastic (-> Kibana).
But as we found out, in big projects with a lot of logs Serilog is a much better solution, for the following reasons:
CI\CD - when you have different types of logs with different structures, Serilog's power comes in handy. In Logstash you need to create a different filter to break down each message according to its pattern, which implies strong coupling between the log structure and the Logstash configuration - very bug prone.
Maintenance - because of the easy CI\CD and the single point of change, it is easier to maintain a large amount of logs.
Scalability - Filebeat has a problem handling big chunks of data because of its registry file, which tends to "explode" - from personal experience; see the related Stack Overflow question and elastic-forum question.
Fewer failure points - with Serilog the log is sent directly to Elastic, whereas with Filebeat you have to pass through Logstash - one more place to fail.
Hope it helps you with your evaluation.
Update (Dec 2021):
The Elasticsearch logger provider has been moved to the Elastic ECS DotNet project.
Find the latest version here: https://github.com/elastic/ecs-dotnet/blob/master/src/Elasticsearch.Extensions.Logging/ReadMe.md
The nuget package is here: https://www.nuget.org/packages/Elasticsearch.Extensions.Logging/1.6.0-alpha1
It is still labelled an alpha release (although it has more functionality than the Essential's version), so currently (Dec 2021) you need to specify the version when adding the package:
dotnet add package Elasticsearch.Extensions.Logging --version 1.6.0-alpha1
Disclaimer: I am the author
ORIGINAL ANSWER
There is now also a stand alone logger provider that will write .NET Core logging direct to Elasticsearch, following the Elasticsearch Common Schema (ECS) field specifications, https://github.com/sgryphon/essential-logging/tree/master/src/Essential.LoggerProvider.Elasticsearch
To use this from your .NET Core application, add a reference to the Essential.LoggerProvider.Elasticsearch package:
dotnet add package Essential.LoggerProvider.Elasticsearch
Then, add the provider to the loggingBuilder during host construction, using the provided extension method.
using Essential.LoggerProvider;
// ...
.ConfigureLogging((hostContext, loggingBuilder) =>
{
loggingBuilder.AddElasticsearch();
})
The default configuration will write to a local Elasticsearch running at http://localhost:9200/.
Once you have sent some log events, open Kibana (e.g. http://localhost:5601/) and define an index pattern for "dotnet-*" with the time filter "@timestamp".
This reduces the dependencies even more, as rather than pull in the entire Serilog infrastructure (App -> Microsoft ILogger -> Serilog provider/adapter -> Elasticsearch sink -> Elasticsearch) you now only have (App -> Microsoft ILogger -> Elasticsearch provider -> Elasticsearch).
The ElasticsearchLoggerProvider also writes events following the Elasticsearch Common Schema (ECS) conventions, so is compatible with events logged from other sources, e.g. Beats.

Nifi Error: Failed to establish the connection with AMQP broker

I am trying to read data from the following CAP files:
everything from the alerts folder
http://dd2.weather.gc.ca/alerts/cap/20180205/CWHX/14/
I am using AMQP from http://metpx.sourceforge.net. And when I try to connect to the subscriber from NiFi, I get the following error.
Failed to establish the connection with AMQP broker
This is my cap.conf file:
broker amqp://anonymous:anonymous@dd.weather.gc.ca
directory /data
subtopic alerts.cap.#
accept .*
mirror True
Over the summer, the broker migrated to SSL, so the current URL would be: amqps://anonymous:anonymous@dd.weather.gc.ca
The web page has also moved to: https://github.com/MetPX/sarracenia
Best practice is to put authentication info in ~/.config/sarra/credentials.conf, with a line like:
amqps://anonymous:anonymous@dd.weather.gc.ca
Installing a version from the past year is likely to be a much better experience. It now comes with sample configurations, one of them being ddc_cap-xml.conf, which is the same data you are trying to download.
So the work is:
blacklab% sr_subscribe add ddc_cap-xml.conf
blacklab% sr_subscribe edit ddc_cap-xml.conf
# Change the directory option to suit your case.
blacklab% sr_subscribe foreground ddc_cap-xml.conf
And it should work. It could take many hours to prove it because this particular set (severe weather warnings in Common Alerting Protocol format) is produced only when needed, rather than continuously. (Use start instead of foreground to run as a background daemon.)
To test things, it might be easier to start with dd_swob, which will be a continuous feed.
blacklab% sr_subscribe list dd_swob
broker amqp://anonymous@dd.weather.gc.ca
exchange xpublic
#msg_skip_threshold 60
#on_msg ../msg_skip_old.py
subtopic observations.swob-ml.#
accept .*
In this configuration you need to add a directory option just before the accept line, and it should start downloading data immediately. Once you know it works, switch back to the data set you actually want.
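For example (the local directory path below is just a placeholder), the edited dd_swob configuration would look something like:
broker amqp://anonymous@dd.weather.gc.ca
exchange xpublic
subtopic observations.swob-ml.#
directory /tmp/swob
accept .*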

Spring Batch: Horizontal scaling of Job Repository

I have read a lot about how to enable parallel processing and chunking of an individual job using the Master/Slave paradigm. Consider an already implemented Spring Batch solution that was intended to run on a standalone server. With minimal refactoring, I would like to enable it to scale horizontally and be more resilient in production operation. Speed and efficiency are not a goal.
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
In the linked example, a Job Repository is used that connects to and initializes a database schema for the Job Repository. Job initiation requests are fed to a message queue that a single server, with a single Java process, listens on via Spring JMS. When a request arrives, the listener launches a new Java process that runs the Spring Batch job. If, according to the Job Repository, the job has not been started, it will begin. If the job had previously failed, it will pick up where it left off. If the job is in process, the request is ignored.
The single point of failure is the single server and single listening process for job initiation. I would like to increase resiliency by horizontally scaling identical server instances all competing for who can first grab the job initiation message when it first appears in the queue. That server instance will now attempt to run the job.
I was conceiving that all instances of the JobRepository would share the same schema, so they can all query for when the status is currently in process and decide what they will do. I am unsure though if this schema or JobRepository implementation is meant to be utilized by multiple instances.
Is there a risk that this approach could result in deadlocking the database? There are other constraints that prevent the Partition features of Spring Batch from working for my application.
I decided to build a prototype to test whether the Spring Batch Job Repository schema and SimpleJobRepository can be used in a load-balanced way, with multiple Spring Batch Java processes running concurrently. I was afraid that deadlock scenarios might occur at the database, where all running job processes get stuck.
My Test
I started with the mkyong Spring Batch HelloWorld example and made some changes so that it could be packaged into a JAR that can be executed from the command line. I also removed the initialize-database step defined in the database.config file and manually established a local MySQL server with the proper schema elements. I added a job parameter for time, set to the current time in millis, so that each job instance would be unique.
Next, I wrote a separate Java main class that used the Apache Commons Exec framework to create 50 subprocesses with no wait between them. Each of these processes also has a Thread.sleep of 1 second within its Processor object, so that a number of processes kick off at the same time and all attempt to access the database simultaneously.
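For reference, a minimal sketch of that kind of launcher with Commons Exec might look like the following (the JAR name and the time job parameter are assumptions based on the description above, not the actual prototype code):
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecuteResultHandler;
import org.apache.commons.exec.DefaultExecutor;

public class ConcurrentJobLauncher {

    public static void main(String[] args) throws Exception {
        List<DefaultExecuteResultHandler> handlers = new ArrayList<>();

        // Fire off 50 Spring Batch job processes with no wait between them.
        for (int i = 0; i < 50; i++) {
            // The "time" job parameter makes every launch a unique job instance.
            CommandLine cmd = CommandLine.parse(
                    "java -jar spring-batch-helloworld.jar time=" + System.currentTimeMillis() + "-" + i);

            DefaultExecutor executor = new DefaultExecutor();
            DefaultExecuteResultHandler handler = new DefaultExecuteResultHandler();
            // Asynchronous execution, so all 50 subprocesses run concurrently.
            executor.execute(cmd, handler);
            handlers.add(handler);
        }

        // Wait for every subprocess to finish before exiting.
        for (DefaultExecuteResultHandler handler : handlers) {
            handler.waitFor();
        }
    }
}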
Results
After running this test a number of times in a row, I see that all 50 Spring Batch processes consistently complete successfully and update the same database schema correctly. I see no indication that multiple Spring Batch job processes running on multiple servers and connecting to the same database would interfere with each other on the schema, nor do I see any indication that a deadlock could happen at this time.
So it sounds as if load balancing of Spring Batch jobs without the use of advanced Master/Slave and Step Partitioning approaches is a valid use case.
If anybody would like to comment on my test or suggest ways to improve it I would appreciate it.
Here is an excerpt from the Spring Batch docs on how Spring Batch handles database updates for its repository:
Spring Batch employs an optimistic locking strategy when dealing with updates to the database. This means that each time a record is 'touched' (updated) the value in the version column is incremented by one. When the repository goes back to save the value, if the version number has changed it throws an OptimisticLockingFailureException, indicating there has been an error with concurrent access. This check is necessary, since, even though different batch jobs may be running in different machines, they all use the same database tables.
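To make that concrete, the guarded update the repository issues looks roughly like this (table and column names are from the standard Spring Batch schema; the SQL is simplified):
UPDATE BATCH_JOB_EXECUTION
   SET STATUS = ?, END_TIME = ?, VERSION = VERSION + 1
 WHERE JOB_EXECUTION_ID = ?
   AND VERSION = ?
-- if zero rows are updated, another process changed the row first,
-- and an OptimisticLockingFailureException is thrown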
