Safely processing data coming from a KafkaListener - spring-boot

I'm implementing a Spring Boot app which reads some data from Kafka and provides it to all requesting clients. Let's say I have the following class:
@Component
public class DataProvider {

    private Prices prices;

    public DataProvider() {
        this.prices = Prices.of();
    }

    public Prices getPrices() {
        return prices;
    }
}
Each client may perform GET /api/prices to get info about the newest prices. Live updates about prices are consumed from Kafka. Because an update comes only every 5 seconds, which is not very often, the topic has only one partition.
I tried the very basic option using Kafka Listener:
@Component
public class DataProvider {

    private Prices prices;

    public DataProvider() {
        this.prices = Prices.of();
    }

    public Prices getPrices() {
        return prices;
    }

    @KafkaListener(topics = "test-topic")
    public void consume(String message) {
        Prices prices = Prices.of(message);
        this.prices = prices;
    }
}
Is this approach safe?

The prices field must be volatile, so that the value written by the listener thread is visible to the request-handling threads. Beyond that, you need to decide whether it is acceptable for the data to diverge across requests: one HTTP request may return one snapshot while another, concurrent request returns a newer one, simply because the Kafka consumer has just updated the field.
You could make consume() and getPrices() synchronized, so everyone sees the data as of the same moment. However, the calls will no longer run in parallel, since synchronized allows only one thread at a time to access the object.
Another option for consistency is a ReadWriteLock: getPrices() calls can run in parallel under the read lock, but while consume() holds the write lock, everyone else is blocked until it is done.
So technically, with volatile in place, your code is safe. The only remaining question is whether it is safe from a business point of view.
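A minimal sketch of the volatile variant (not from the original post), reusing the Prices type from the question; if readers must block while an update is applied, a java.util.concurrent.locks.ReentrantReadWriteLock around getPrices() and consume() is the alternative mentioned above:
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class DataProvider {

    // volatile makes the reference written by the Kafka listener thread
    // immediately visible to the HTTP request threads calling getPrices().
    private volatile Prices prices = Prices.of();

    public Prices getPrices() {
        // Readers always see the latest fully constructed Prices instance.
        return prices;
    }

    @KafkaListener(topics = "test-topic")
    public void consume(String message) {
        // Build the new snapshot first, then publish it with a single reference write.
        this.prices = Prices.of(message);
    }
}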

Related

Spring-Boot: scalability of a component

I am trying Spring Boot and thinking about scalability.
Let's say I have a component that does a job (e.g. checking for new mail).
It is done by a scheduled method, e.g.:
@Component
public class MailMan {

    @Scheduled(fixedRateString = "5000")
    private void run() throws Exception {
        //...
    }
}
Now the application gets a new customer. So the job has to be done twice.
How can I scale this component to exist or run twice?
Interesting question, but why multiple components per customer? Can the scheduler not pull the data for every customer on each scheduled run and process the records for each customer? Your component scaling should not be decided based on the entities involved in your application, but on the resource utilization of the component. You can have dedicated component types for processing messages from queues, and the same for REST, and scale them based on how much each of them is being utilized.
Instead of using annotations to schedule a task, you could do the same thing programmatically by using a ScheduledTaskRegistrar. You can register the same bean multiple times, even if it is a singleton.
public class SomeSchedulingConfigurer implements SchedulingConfigurer {

    private final SomeJob someJob; // <-- a bean that is Runnable

    public SomeSchedulingConfigurer(SomeJob someJob) {
        this.someJob = someJob;
    }

    @Override
    public void configureTasks(@NonNull ScheduledTaskRegistrar taskRegistrar) {
        int concurrency = 2;
        IntStream.range(0, concurrency).forEach(
                __ -> taskRegistrar.addFixedDelayTask(someJob, 5000));
    }
}
Make sure the thread executor you are using is large enough to process that number of jobs concurrently; the default executor has exactly one thread :-). Be aware that this approach has scaling limits.
I also recommend adding a delay or skew between jobs, so that they do not all run at exactly the same moment.
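As a hedged sketch of that point (not part of the original answer), the registrar can be given a multi-threaded scheduler inside the same SomeSchedulingConfigurer; the pool size of 4 and the thread name prefix are illustrative values:
@Override
public void configureTasks(@NonNull ScheduledTaskRegistrar taskRegistrar) {
    // Replace the single-threaded default so the registered copies of the job
    // can actually run concurrently.
    ThreadPoolTaskScheduler scheduler = new ThreadPoolTaskScheduler();
    scheduler.setPoolSize(4);                  // illustrative size, tune to your workload
    scheduler.setThreadNamePrefix("some-job-");
    scheduler.initialize();
    taskRegistrar.setTaskScheduler(scheduler);

    int concurrency = 2;
    IntStream.range(0, concurrency).forEach(
            __ -> taskRegistrar.addFixedDelayTask(someJob, 5000));
}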
See SchedulingConfigurer and ScheduledTaskRegistrar for reference.
The job needs to run only once, even with multiple customers. The component itself doesn't need to scale at all. It is just a mechanism to "signal" that some logic needs to run at some moment in time. I would keep the component really thin and just call the desired business logic that handles all the rest, e.g.:
@Component
public class MailMan {

    @Autowired
    private NewMailCollector newMailCollector;

    @Scheduled(fixedRateString = "5000")
    private void run() throws Exception {
        // Collects emails for customers
        newMailCollector.collect();
    }
}
If you want to check for new e-mails per customer, you might want to avoid scheduled tasks in a backend service, as they make the implementation very inflexible.
Better to expose an endpoint that clients can call to trigger that logic.
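A minimal sketch of such an endpoint (not from the original answer), reusing the NewMailCollector bean from above; the controller name, the path, and the per-customer parameter are illustrative assumptions:
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MailController {

    private final NewMailCollector newMailCollector;

    public MailController(NewMailCollector newMailCollector) {
        this.newMailCollector = newMailCollector;
    }

    // A client (or an external scheduler) triggers mail collection on demand.
    // The customerId path variable is hypothetical; the collect() method in the
    // answer above takes no arguments, so a per-customer overload would be needed.
    @PostMapping("/customers/{customerId}/mail/check")
    public ResponseEntity<Void> checkMail(@PathVariable String customerId) {
        newMailCollector.collect();
        return ResponseEntity.accepted().build();
    }
}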

@InboundChannelAdapter in Spring Integration is not running continuously?

I am working in Spring Cloud Data Flow, where I have a scenario of reading from the database and sending the data to a Kafka topic using @InboundChannelAdapter.
Below is the strategy I followed:
-> Create a common list to store the objects when the list is empty.
-> If the list has data, don't poll.
-> Send the values to Kafka one by one by index, and remove that index afterwards.
If I keep the @Bean, only the first object in the list is inserted into the Kafka topic:
{"id":101443442,"name":"Mobile1","price":8000}
If I remove the @Bean, it inserts only empty data into Kafka:
{}
public static List<Product> products;

@Bean
public void initList() {
    products = new ArrayList<>();
}

@Bean
@InboundChannelAdapter(channel = TbeSource.PR1)
public MessageSource<Product> addProducts() {
    if (products.size() == 0) {
        products.add(new Product(101443442, "Mobile1", 8000));
        products.add(new Product(102235434, "book111", 6000));
    }
    MessageBuilder<Product> message = MessageBuilder.withPayload(products.get(0));
    products.remove(0);
    return message::build;
}
What am I doing wrong? I need to send the data frequently by reading from the DB.
It is really not clear what you are asking.
If you are talking about JDBC, then you may consider using the JDBC Source from the out-of-the-box applications for Data Flow.
If you are implementing the logic yourself to pull data from the database, you may consider using a JdbcPollingChannelAdapter from Spring Integration for the same @InboundChannelAdapter reason.
The rest of your logic with that list is not clear. It is strange to see a @Bean on a void method. If you need to initialize products and access it from the MessageSource implementation, you just need private List<Product> products = new ArrayList<>();. Having the property public (and static) is really a bad practice.
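A minimal sketch of the JdbcPollingChannelAdapter suggestion (not from the original answer), assuming a javax.sql.DataSource bean, the TbeSource.PR1 channel, and the Product class from the question; the SQL query, poll interval and row mapping are illustrative:
@Bean
@InboundChannelAdapter(channel = TbeSource.PR1, poller = @Poller(fixedDelay = "5000"))
public MessageSource<Object> productSource(DataSource dataSource) {
    // Polls the table on each trigger and emits the selected rows as the message payload.
    JdbcPollingChannelAdapter adapter =
            new JdbcPollingChannelAdapter(dataSource, "SELECT id, name, price FROM product");
    RowMapper<Product> rowMapper = (rs, rowNum) ->
            new Product(rs.getInt("id"), rs.getString("name"), rs.getInt("price"));
    adapter.setRowMapper(rowMapper);
    return adapter;
}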

Collect Requests and send them in bulk

I'm using ASP.NET Web API. What I'm trying to achieve is to gather all requests in a List<T> and send them in bulk to somewhere else; basically, my requirement is to send the bulk only when the list reaches some count or after some period of time.
Since List<T> is not thread safe, I assume I must use ConcurrentBag<T>. But how do I get the instance of the previously created bag?
public class MyController : ApiController
{
    private IList<object> _requests;

    public MyController()
    {
        _requests = new List<object>();
    }

    public void Post()
    {
        if (_requests.Count < SomeCounter)
            _requests.Add(Request);
        else
            ...Send Bulk..
    }
}

Spring Data Solr @Transactional Commits

I currently have a setup where data is inserted into a database as well as indexed into Solr. These two steps are wrapped in a Spring-managed transaction via the @Transactional annotation. What I've noticed is that spring-data-solr issues an update with the following parameters whenever the transaction is closed: params{commit=true&softCommit=false&waitSearcher=true}
@Transactional
public void save(Object toSave) {
    dbRepository.save(toSave);
    solrRepository.save(toSave);
}
The rate of commits into Solr is fairly high, so ideally I'd like to send data to the Solr index and have Solr auto-commit at regular intervals. I have autoCommit (and autoSoftCommit) set in my solrconfig.xml, but since spring-data-solr is sending those commit parameters, it does a hard commit every time.
I'm aware that I can drop down to the SolrTemplate API and issue commits manually, but I would like to keep the solrRepository.save call within a Spring-managed transaction if possible. Is there a way to modify the parameters that are sent to Solr on commit?
After putting in an IDE debug breakpoint in org.springframework.data.solr.repository.support.SimpleSolrRepository here:
private void commitIfTransactionSynchronisationIsInactive() {
    if (!TransactionSynchronizationManager.isSynchronizationActive()) {
        this.solrOperations.commit(solrCollectionName);
    }
}
I discovered that wrapping my code as @Transactional (plus the other details needed to actually have the framework begin/end my code as a transaction) doesn't achieve what we expect with Spring Data for Apache Solr. The stack trace shows the proxy and transaction interceptor classes for my code's transactional scope, but then it also shows the framework starting its own nested transaction with another proxy and transaction interceptor of its own. When the framework exits the CrudRepository.save() method my code calls, the commit to Solr is performed by the framework's nested transaction, before our outer transaction is exited. So the attempt to batch-process many saves with one commit at the end, instead of one commit for every save, is futile. It seems that, for this area of my code, I'll have to use SolrJ to save (update) my entities to Solr and then follow "my" transaction's exit with a commit.
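A minimal sketch of that SolrJ approach (not from the original answer), assuming a SolrClient instance and an illustrative collection name of "products"; the entities are whatever the surrounding transaction produced:
// Called from the transactional work; exceptions are left to the caller.
void indexBatch(SolrClient solrClient, Collection<?> entities) throws Exception {
    // Index the beans without committing while the business transaction is still running.
    solrClient.addBeans("products", entities);
}

// Invoked once the outer transaction has exited (e.g. in an afterCommit callback).
void commitBatch(SolrClient solrClient) throws Exception {
    // One commit for the whole batch instead of one commit per save.
    solrClient.commit("products");
}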
If you are using Spring Data Solr, I found that the SolrTemplate bean allows you to 'batch' updates when adding data to the Solr index. Through the SolrTemplate bean you can use the saveBeans method, which adds a whole collection to the index and does not commit until the end of the transaction. In my case, I started out using solrClient.add() and it took up to 4 hours for my collection to be saved to the index while iterating over it, since it committed after every single save. By using solrTemplate.saveBeans(Collection<?>), it finishes in just over 1 second, as the commit covers the entire collection. Here is a code snippet:
@Resource
SolrTemplate solrTemplate;

public void doReindexing(List<Image> images) {
    if (images != null) {
        /* CMSSolrImage is a class with @SolrDocument mappings.
         * The List<Image> images is a collection pulled from my database
         * that I want indexed in Solr.
         */
        List<CMSSolrImage> sImages = new ArrayList<CMSSolrImage>();
        for (Image image : images) {
            CMSSolrImage sImage = new CMSSolrImage(image);
            sImages.add(sImage);
        }
        solrTemplate.saveBeans(sImages);
    }
}
The way I've done something similar is to create a custom repository implementation of the save methods.
Interface for the repository:
public interface FooRepository extends SolrCrudRepository<Foo, String>, FooRepositoryCustom {
}
Interface for the custom overrides:
public interface FooRepositoryCustom {

    public Foo save(Foo entity);

    public Iterable<Foo> save(Iterable<Foo> entities);
}
Implementation of the custom overrides:
public class FooRepositoryImpl implements FooRepositoryCustom {

    private SolrOperations solrOperations;

    public FooRepositoryImpl(SolrOperations fooSolrOperations) {
        this.solrOperations = fooSolrOperations;
    }

    @Override
    public Foo save(Foo entity) {
        Assert.notNull(entity, "Cannot save 'null' entity.");
        registerTransactionSynchronisationIfSynchronisationActive();
        this.solrOperations.saveBean(entity, 1000);
        commitIfTransactionSynchronisationIsInactive();
        return entity;
    }

    @Override
    public Iterable<Foo> save(Iterable<Foo> entities) {
        Assert.notNull(entities, "Cannot insert 'null' as a List.");
        if (!(entities instanceof Collection<?>)) {
            throw new InvalidDataAccessApiUsageException("Entities have to be inside a collection");
        }
        registerTransactionSynchronisationIfSynchronisationActive();
        this.solrOperations.saveBeans((Collection<? extends Foo>) entities, 1000);
        commitIfTransactionSynchronisationIsInactive();
        return entities;
    }

    private void registerTransactionSynchronisationIfSynchronisationActive() {
        if (TransactionSynchronizationManager.isSynchronizationActive()) {
            registerTransactionSynchronisationAdapter();
        }
    }

    private void registerTransactionSynchronisationAdapter() {
        TransactionSynchronizationManager.registerSynchronization(SolrTransactionSynchronizationAdapterBuilder
                .forOperations(this.solrOperations).withDefaultBehaviour());
    }

    private void commitIfTransactionSynchronisationIsInactive() {
        if (!TransactionSynchronizationManager.isSynchronizationActive()) {
            this.solrOperations.commit();
        }
    }
}
and you also need to provide a SolrOperations bean for the right solr core:
@Configuration
public class FooSolrConfig {

    @Bean
    public SolrOperations getFooSolrOperations(SolrClient solrClient) {
        return new SolrTemplate(solrClient, "foo");
    }
}
Footnote: auto commit is (to my mind) conceptually incompatible with a transaction. An auto commit is a promise from Solr that it will try to start writing the data to disk within a certain time limit. Many things might stop that from actually happening, however: a timely power or hardware failure, errors between the document and the schema, etc. But the client won't know that Solr failed to keep its promise, and the transaction will see a success when it actually failed.

Web API concurrency and scalability

We are faced with the task of converting a REST service based on custom code to Web API. The service handles a substantial number of requests and operates on data that can take some time to load, but once loaded it can be cached and used to serve all of the incoming requests. The previous version of the service had one thread responsible for loading the data and getting it into the cache; to prevent IIS from running out of worker threads, clients would get a "come back later" response until the cache was ready.
My understanding of Web API is that it has asynchronous behavior built in by operating on tasks, and as a result the number of requests does not directly relate to the number of physical threads being held.
In the new implementation of the service I am planning to let the requests wait until the cache is ready and then make a valid reply. I have made a very rough sketch of the code to illustrate:
public class ContactsController : ApiController
{
    private readonly IContactRepository _contactRepository;

    public ContactsController(IContactRepository contactRepository)
    {
        if (contactRepository == null)
            throw new ArgumentNullException("contactRepository");
        _contactRepository = contactRepository;
    }

    public IEnumerable<Contact> Get()
    {
        return _contactRepository.Get();
    }
}
public class ContactRepository : IContactRepository
{
    private readonly Lazy<IEnumerable<Contact>> _contactsLazy;

    public ContactRepository()
    {
        _contactsLazy = new Lazy<IEnumerable<Contact>>(LoadFromDatabase,
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    public IEnumerable<Contact> Get()
    {
        return _contactsLazy.Value;
    }

    private IEnumerable<Contact> LoadFromDatabase()
    {
        // This method could take a long time to execute.
        throw new NotImplementedException();
    }
}
Please do not put too much value in the design of the code; it is only constructed to illustrate the problem and is not how we did it in the actual solution. IContactRepository is registered in the IoC container as a singleton and is injected into the controller. The Lazy with LazyThreadSafetyMode.ExecutionAndPublication ensures that only the first thread/request runs the initialization code; the following requests are blocked until the initialization completes.
Would Web API be able to handle 1000 requests waiting for the initialization to complete, while other requests not hitting this Lazy are being serviced, without IIS running out of worker threads?
Returning Task<T> from the action will allow the code to run on the background thread (ThreadPool) and release the IIS thread. So in this case, I would change
public IEnumerable<Contact> Get()
to
public Task<IEnumerable<Contact>> Get()
Remember to return a started task, otherwise the thread will just sit and do nothing.
The Lazy implementation, while it can be useful, has little to do with the behaviour of Web API, so I am not going to comment on that. With or without Lazy, a task-based return type is the way to go for long-running operations.
I have two blog posts on this which are probably useful to you: here and here.
