How to deal with access collision while loading data from external sources? - spring-boot

I have a Spring service that retrieves data from several external APIs very often (every 1000-2000 ms). Loading the data takes 100-200 ms. If a client asks for the data while a load is in progress, the data may not be available yet, because it takes some time to process the returned payload (I convert it from the format the API returns into the internal storage format, a java.util.Map instance).
This service is heavily used in the system and clients query the data very often, so a collision (a moment when the data is not available) is very likely. Maybe it would be ideal to pause the method that clients use to access the data while data is being loaded from an external API. Or use some type of semaphore?
Q: Is there a general method / best practice to avoid such a collision in a Spring application?
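One general pattern that addresses this (a sketch with hypothetical names, not an answer taken from the thread) is to build the new Map off to the side and publish it atomically, so readers never see a partially converted result and never have to wait:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class ExternalDataService {

    // Readers always see the last fully processed snapshot.
    private final AtomicReference<Map<String, Object>> snapshot =
            new AtomicReference<>(Collections.emptyMap());

    // Called by the scheduled loader every 1000-2000 ms.
    public void refresh() {
        Map<String, Object> fresh = loadAndConvertFromExternalApis(); // takes 100-200 ms
        snapshot.set(fresh); // atomic swap; no reader ever blocks or sees partial data
    }

    // Called by clients; never waits for a load in progress.
    public Map<String, Object> getData() {
        return snapshot.get();
    }

    private Map<String, Object> loadAndConvertFromExternalApis() {
        // hypothetical: call the external APIs and convert the result to the internal Map format
        return Collections.emptyMap();
    }
}
```

With this approach there is nothing to pause or lock: clients simply keep reading the previous snapshot until the new one has been swapped in.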

Related

Does storing another service's data violate the Single Responsibility Principle of Microservice

Say I have a service that manages warehouses (data that is not updated very frequently). I have a sales service that requires the list of stores (to search through and use as necessary). If I get the list of stores from the store service and save it (let's say in Redis) inside my sales service, but ensure that Redis is updated whenever the list of stores changes, would that violate the single responsibility principle of microservice architecture?
No, it does not; it is actually quite a common approach in microservice architecture for a service to store a copy of related data from other services and use some mechanism to keep it in sync (usually asynchronous communication via a message broker).
Storing a copy of the data does not transfer ownership of that data away from the service that manages it.
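As a rough illustration of that sync mechanism (a sketch only; the topic name, payload shape and Redis layout are assumptions, and Kafka is just one possible broker), the sales service could keep its local copy up to date by listening for store-changed events:

```java
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class StoreReplicaUpdater {

    // Hypothetical event payload; assumes a JSON message converter is configured for the listener.
    public static class StoreChangedEvent {
        public String storeId;
        public String storeJson; // serialized representation of the store
    }

    private final StringRedisTemplate redis;

    public StoreReplicaUpdater(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Keep the local (Redis) copy in sync; ownership of the data stays with the store service.
    @KafkaListener(topics = "store-events")
    public void onStoreChanged(StoreChangedEvent event) {
        redis.opsForValue().set("store:" + event.storeId, event.storeJson);
    }
}
```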
It is common, and there is a microservice pattern for it (CQRS).
If you need some information from other services/microservices to join with your own data, then you need to store that information.
Whenever you make a design decision between always issuing requests against the downstream system and using a local copy, you are essentially making a trade-off between performance and data freshness:
If you always issue RPC calls, then you prefer data freshness over performance.
How frequently you need to issue those RPC calls has a direct impact on performance.
If you utilize caching to gain performance, then there is a chance of serving stale data (depending on your business this might be okay or unacceptable).
Cache invalidation is a pretty tough problem domain, so it can cause headaches.
Caching another microservice's data does not violate data ownership, because caching only reads the data; it does not delete or update existing records. It is similar to having a single leader (master) with multiple followers, or a read-write lock. As long as there is only one place where data can be created, modified or deleted, data ownership is implemented the right way.
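If you do cache, the freshness side of that trade-off is usually bounded with a TTL. A minimal Spring sketch (the client interface, cache name and 60-second expiry are assumptions, and caching must be enabled with @EnableCaching):

```java
import java.util.List;

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class StoreCatalog {

    // Hypothetical REST client for the store service.
    public interface StoreServiceClient {
        List<String> fetchAllStoreIds();
    }

    private final StoreServiceClient storeServiceClient;

    public StoreCatalog(StoreServiceClient storeServiceClient) {
        this.storeServiceClient = storeServiceClient;
    }

    // Results are cached; staleness is bounded by the cache's expiry, e.g. with Caffeine:
    //   spring.cache.cache-names=stores
    //   spring.cache.caffeine.spec=expireAfterWrite=60s
    @Cacheable("stores")
    public List<String> listStoreIds() {
        return storeServiceClient.fetchAllStoreIds(); // RPC only on cache miss / after expiry
    }
}
```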

When to use Redis and when to use DataLoader in a GraphQL server setup

I've been working on a GraphQL server for a while now and although I understand most of the aspects, I cannot seem to get a grasp on caching.
When it comes to caching, I see both DataLoader and Redis mentioned, but it's not clear to me when I should use which, or how I should use them.
I take it that DataLoader is used more at the field level, to counter the N+1 problem? And I guess Redis operates at a higher level then?
If anyone could shed some light on this, I would be most grateful.
Thank you.
DataLoader is primarily a means of batching requests to some data source. However, it does optionally utilize caching on a per request basis. This means, while executing the same GraphQL query, you only ever fetch a particular entity once. For example, we can call load(1) and load(2) concurrently and these will be batched into a single request to get two entities matching those ids. If another field calls load(1) later on while executing the same request, then that call will simply return the entity with ID 1 we fetched previously without making another request to our data source.
DataLoader's cache is specific to an individual request. Even if two requests are processed at the same time, they will not share a cache. DataLoader's cache does not have an expiration -- and it has no need to since the cache will be deleted once the request completes.
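To make the batching and per-request caching behaviour concrete, here is a hand-rolled Java sketch of the idea; it is not the actual DataLoader API, just an illustration of what load() and a dispatch step do:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// A hand-rolled, per-request batching loader illustrating the DataLoader idea.
public class SimpleBatchLoader<K, V> {

    private final Function<List<K>, Map<K, V>> batchFetch;           // one round trip for many keys
    private final Map<K, CompletableFuture<V>> cache = new HashMap<>(); // per-request cache
    private final List<K> pending = new ArrayList<>();

    public SimpleBatchLoader(Function<List<K>, Map<K, V>> batchFetch) {
        this.batchFetch = batchFetch;
    }

    // load(1) and load(2) called while resolving the same query are queued, not fetched yet.
    public CompletableFuture<V> load(K key) {
        return cache.computeIfAbsent(key, k -> {
            pending.add(k);
            return new CompletableFuture<>();
        });
    }

    // Called once the current batch of resolvers has run: one request for all pending keys.
    public void dispatch() {
        if (pending.isEmpty()) return;
        Map<K, V> results = batchFetch.apply(new ArrayList<>(pending));
        for (K key : pending) {
            cache.get(key).complete(results.get(key));
        }
        pending.clear();
    }
}
```

Calling load(1) and load(2) before dispatch() produces a single batchFetch call, and a later load(1) within the same request returns the already-completed future from the per-request cache.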
Redis is a key-value store that's used for caching, queues, PubSub and more. We can use it to provide response caching, which would let us effectively bypass the resolver for one or more fields and use the cached value instead (until it expires or is invalidated). We can use it as a cache layer between GraphQL and the database, API or other data source -- for example, this is what RESTDataSource does. We can use it as part of a PubSub implementation when implementing subscriptions.
DataLoader is a small library used to tackle a particular problem, namely generating too many requests to a data source. The alternative to using DataLoader is to fetch everything you need (based on the requested fields) at the root level and then letting the default resolver logic handle the rest. Redis is a key-value store that has a number of uses. Whether you need one or the other, or both, depends on your particular business case.

How to solve the two generals problem between event store and persistence layer? - spring-boot

Two Generals Problem - EventStore and persistence layer?
I would like to understand how the industry actually deals with this problem!
Suppose microservice 1 persists object X into database A. At the same time, so that microservice 2 can feed on the data from microservice 1, microservice 1 writes the same object X to an event store B.
Now, the question I have is: where do I write object X first?
If database A first and then event store B: is it fair to roll back the thread at the application level if database A is down? Also, what should the error handling be if database A is online and persisted object X, but event store B is down?
What should the error handling look like if we reverse the order from point 1?
I do understand that in today's world of distributed, highly available systems, an outage is unlikely. But it can happen. I want to understand what needs to be done when either the database or the event store system/cluster is down.
In general you want to avoid relying on a two-phase commit of the kind you describe.
In general (presuming an event-sourced system; not sure if that's implicit in your question or an option for you - perhaps SqlStreamStore might be relevant in your context?), this is typically managed by projecting from a single authoritative set of events on a pull basis: whatever performs the associated action against a downstream system maintains a pointer to how far it has got in projecting events from the base stream, and restarts from there if interrupted.
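A rough sketch of that pull-based projection with a persisted pointer (the EventStoreReader, CheckpointStore and Event types here are hypothetical, not any specific product's API):

```java
import java.util.List;

// Hypothetical abstractions over the authoritative event stream and the projector's checkpoint.
interface EventStoreReader {
    List<Event> readFrom(long position, int maxCount); // events after the given position
}

interface CheckpointStore {
    long load();               // last successfully projected position (0 if none)
    void save(long position);
}

record Event(long position, String payload) {}

public class Projector {

    private final EventStoreReader events;
    private final CheckpointStore checkpoint;

    public Projector(EventStoreReader events, CheckpointStore checkpoint) {
        this.events = events;
        this.checkpoint = checkpoint;
    }

    // Run periodically or in a loop; safe to restart after a crash because it
    // always resumes from the last saved position.
    public void runOnce() {
        long position = checkpoint.load();
        for (Event event : events.readFrom(position, 100)) {
            applyToDownstream(event);          // e.g. update database A / feed microservice 2
            checkpoint.save(event.position()); // advance the pointer only after the action succeeds
        }
    }

    private void applyToDownstream(Event event) {
        // hypothetical downstream action; it must be idempotent, since an event may be
        // re-processed if the process dies between applyToDownstream and checkpoint.save
    }
}
```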
First of all, an event store is a type of persistence which stores the application's state as a series of events, as opposed to a flat persistence that stores only the last projected state.
Suppose microservice 1 persists object X into database A. At the same time, so that microservice 2 can feed on the data from microservice 1, microservice 1 writes the same object X to an event store B.
You are trying to have two sources of truth that must be kept in sync by some sort of distributed transaction, which is not very scalable.
This is an unusual way of using an event store. In general, an event store is the canonical source of information, the single source of truth; you are trying to use it as a communication channel. The event store is the persistence of an event-sourced aggregate (see Domain-Driven Design).
I see two options:
You could refactor your architecture and make object X an event-sourced entity whose persistence is the event store. Then have a read model subscribe to the event store and build a flat representation of object X that is persisted in database A. In other words, write first to the event store and then to database A (but in an eventually consistent manner!). This is a big jump, and you should really think about whether you want to go event-sourced.
You could use CQRS without event sourcing. This means that after every modification, object X emits one or more domain events, which are persisted in database A in the same local transaction as object X itself. Microservice 2 could then subscribe to database A to get the emitted events; how the subscription works depends on the type of database (a sketch of this option follows below).
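A minimal sketch of the "same local transaction" part of option 2 in Spring terms (the entities and repositories are assumptions; delivering the rows of the event table to microservice 2 is a separate concern):

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ObjectXService {

    // Hypothetical persistence abstractions; in practice these would be repositories
    // writing to two tables in database A.
    public interface ObjectXRepository { void save(ObjectX x); }
    public interface DomainEventRepository { void save(DomainEvent e); }
    public record ObjectX(String id) {}
    public record DomainEvent(String type, String objectId) {}

    private final ObjectXRepository objects;
    private final DomainEventRepository events;

    public ObjectXService(ObjectXRepository objects, DomainEventRepository events) {
        this.objects = objects;
        this.events = events;
    }

    // Object X and the domain events it emits are committed in the same local
    // transaction in database A; microservice 2 later reads the event table.
    @Transactional
    public void update(ObjectX x) {
        objects.save(x);
        events.save(new DomainEvent("ObjectXUpdated", x.id()));
    }
}
```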
I have a feeling you are using the event store as a communication channel instead of using it as a database. If you want microservice 2 to feed on the data from microservice 1, then you should communicate via REST services.
Of course, relying on REST services might make you less resilient to outages. In that case, using a piece of technology dedicated to communication would be the right way to go. (I'm thinking MQ/Topics, such as RabbitMQ, Kafka, etc.)
Then, once your services are talking to each other, you will still need to persist your data... but only at one single location.
Therefore, you will need to define where you want to store the data.
Ask yourself:
Who will have the governance of the data persistence?
Is it Microservice1? If so, then every time Microservice2 needs to read the data, it will make a REST call to Microservice1.
Is it the other way around? Microservice2 has the governance of the data, and Microservice1 consumes it?
It could be a third microservice that you haven't even created yet. It depends on how you applied your separation of concerns.
Let's take an example:
Microservice1's responsibility is to process our data to export them in PDF and other formats
Microservice2's responsibility is to expose a service for a legacy partner, that requires our data to be returned in a very proprietary representation.
Who is going to store the data here?
Microservice1 should not be the one to persist the data: its job is only to convert the data to other formats. If it requires some data, it will fetch it from the one that has the governance of the data.
Microservice2 should not be the one to persist the data. After all, maybe we have a number of other Microservices similar to this one, but for other partners, with different proprietary formats.
If there is a service where you can do CRUD operations on the data, that is your guy. If you don't have such a service, maybe you can find an existing Microservice that wouldn't have conflicting responsibilities.
For instance: if I have a Microservice3 that makes sure that every time my ObjectX changes, a PDF representation of it is sent to some address and all my partners are notified that the data is out of date, then this Microservice looks like a good candidate to become the "governor of the data" for this part of the domain, and to be the one-stop shop for writing/reading in the database.

How to ensure data is eventually written to two Azure blobs?

I'm designing a multi-tenant Azure Service Fabric application in which we'll be storing event data in Azure Append-Only blobs.
There'll be two kinds of blobs: merge blobs (one per tenant) and instance blobs (one for each "object" owned by a tenant; there'll be 100K+ of these per tenant).
There'll be a single writer per instance blob. This writer keeps track of the last written blob position and can thereby ensure (using conditional writes) that no other writer has written to the blob since the last successful write. This is an important aspect that we'll use to provide strong consistency per instance.
However, all writes to an instance blob must also eventually (but as soon as possible) reach the single (per tenant) merge blob.
Under normal operation I'd like these merge writes to take place within ~100 ms.
My question is about how best to implement this guaranteed double-write feature:
The implementation must guarantee that data written to an instance blob will eventually also be written to the corresponding merge blob exactly once.
The following inconsistencies must be avoided:
Data is successfully written to an instance blob but never written to the corresponding merge blob.
Data is written more than once to the merge blob.
The easiest way, in my opinion, is to use events: Service Bus, Event Hubs, or any other provider that guarantees an event will be stored and reachable somewhere. It also makes it possible to write events to Blob Storage in batches. In addition, I think it will significantly reduce the pressure on Service Fabric and let you process events on your own schedule.
So you could have a number of Stateless Services or just Web Workers that pick up new messages from a queue and send them in batches to a Stateful Service.
Let's say that is a Merge service. You would need to partition these services, and the best place from which to send a batch of events grouped by partition is such a Stateless Service or Web Worker.
Then you can have a separate Stateful Actor for each object. In your place, though, I would try creating 100k actors (or whatever the real workload is) and see how expensive that would be. If it is too expensive and you cannot afford such machines, then everything could instead be handled in another partitioned Stateless Service.
So now we have the following scheme: something puts logs into the ESB; something picks up these events from the ESB in batches or very frequently, handling transactions and processing errors; it then sends each batch of events to the appropriate Merge service, which stores the data in its state and calls the corresponding actor to do the same thing.
Once the actor has written its data to its state and the service has done the same, the event in the ESB can be marked as processed and removed from the queue. Then you just need to flush the stored data from the Merge service and the actors to Blob Storage once in a while.
If the actor is unable to store an event, then the operation is not complete and the Merge service should not store the data either. If Blob Storage is unreachable for the actors or Merge services, it will become reachable again in the future, and the logs will still be stored since they are saved in state, or at least they could be retrieved from the actors/service manually.
If the Merge service is unreachable, I would store the event in a poison-message queue for later processing, or try to write the logs directly to Blob Storage; that is a bit risky, though the chance of writing to only one kind of storage at that moment is pretty low.
You could use a Stateful Actor for this. You won't need to worry about concurrency, because there is none. In the Actor's state you can keep track of which operations were completed successfully (write 1, write 2).
Still, writing 'exactly once' in a distributed system (without a DTC) is never 100% waterproof.
Some more info about that:
link
link
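To illustrate the "keep track of which operations were successfully completed" idea, here is a sketch with hypothetical blob-append and state abstractions (not Service Fabric's actual actor API):

```java
// Sketch of the per-instance writer: it records which of the two appends has
// already succeeded, so a retry after a crash never repeats a completed write.
public class DoubleWriter {

    enum Progress { NONE, INSTANCE_WRITTEN, MERGE_WRITTEN }

    interface BlobAppender { void append(byte[] data); }              // hypothetical conditional-append client
    interface ProgressStore {                                         // durable state, e.g. reliable actor state
        Progress load(String writeId);
        void save(String writeId, Progress progress);
    }

    private final BlobAppender instanceBlob;
    private final BlobAppender mergeBlob;
    private final ProgressStore progress;

    public DoubleWriter(BlobAppender instanceBlob, BlobAppender mergeBlob, ProgressStore progress) {
        this.instanceBlob = instanceBlob;
        this.mergeBlob = mergeBlob;
        this.progress = progress;
    }

    // Idempotent per writeId: can be retried until it returns without throwing.
    public void write(String writeId, byte[] data) {
        Progress p = progress.load(writeId);
        if (p == Progress.NONE) {
            instanceBlob.append(data);                          // write 1
            progress.save(writeId, Progress.INSTANCE_WRITTEN);
            p = Progress.INSTANCE_WRITTEN;
        }
        if (p == Progress.INSTANCE_WRITTEN) {
            mergeBlob.append(data);                             // write 2
            progress.save(writeId, Progress.MERGE_WRITTEN);
        }
    }
}
```

Note that a crash between an append and the corresponding progress.save(...) can still produce a duplicate append, which is exactly the "never 100% waterproof" caveat above; the merge-blob append would additionally need to be conditional or deduplicated by writeId.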

Java8 Stream or Reactive / Observer for Database Requests

I'm rethinking our Spring MVC application's behavior: is it better to pull data from the database (Java 8 Stream), or to let the database push its data (Reactive / Observable) and use backpressure to control the amount?
Current situation:
User requests the 30 most recent articles
Service does a database query and puts the 30 results into a List
Jackson iterates over the List and generates the JSON response
Why switch the implementation?
It's quite memory-consuming, because we keep all 30 objects in memory at once. That's not needed, because the application processes one object at a time; it should be able to retrieve one object, process it, throw it away, and fetch the next one.
Java8 Streams? (pull)
With java.util.Stream this is quite easy: The Service creates a Stream, which uses a database cursor behind the scenes. And each time Jackson has written the JSON String for one element of the Stream, it will ask for the next one, which then triggers the database cursor to return the next entry.
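For reference, a minimal Spring Data JPA sketch of that pull approach (the Article entity, its publishedAt field and the limit of 30 are assumptions); the stream is backed by a database cursor and must be consumed inside an open transaction and closed afterwards:

```java
import java.util.function.Consumer;
import java.util.stream.Stream;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Assumes a JPA entity "Article" with a "publishedAt" field.
interface ArticleRepository extends JpaRepository<Article, Long> {

    // Backed by a database cursor; rows are fetched lazily as the Stream is consumed.
    @Query("select a from Article a order by a.publishedAt desc")
    Stream<Article> streamAllOrderByPublishedAtDesc();
}

@Service
class ArticleService {

    private final ArticleRepository repository;

    ArticleService(ArticleRepository repository) {
        this.repository = repository;
    }

    // The Stream must be consumed within an open (read-only) transaction and closed afterwards.
    @Transactional(readOnly = true)
    public void writeRecentArticles(Consumer<Article> writer) {
        try (Stream<Article> articles = repository.streamAllOrderByPublishedAtDesc().limit(30)) {
            articles.forEach(writer); // e.g. hand each Article to Jackson's JsonGenerator
        }
    }
}
```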
RxJava / Reactive / Observable? (push)
Here we have the opposite scenario: The database has to push entry by entry and Jackson has to create the JSON String for each element until the onComplete method has been called.
i.e. the Controller tells the Service: give me an Observable<Article>. Then Jackson can ask for as many database entries as it can process.
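A small RxJava 2 sketch of the push variant (wrapping a cursor-backed Stream is just one possible source): Flowable.fromIterable only pulls the next element from the underlying iterator when the subscriber requests more, which is where the backpressure comes from:

```java
import java.util.stream.Stream;

import io.reactivex.Flowable;

public class ArticleFlowableFactory {

    // Wrap a cursor-backed Stream (e.g. from Spring Data) as a backpressured Flowable.
    public static <T> Flowable<T> fromCursor(Stream<T> cursorBackedStream) {
        Iterable<T> iterable = cursorBackedStream::iterator; // iterated exactly once, on subscription
        return Flowable.fromIterable(iterable)
                       .doFinally(cursorBackedStream::close); // release the cursor when done
    }
}
```

The controller can then return the Flowable<Article>, and the (reactive) web layer requests articles only as fast as it can serialize them.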
Differences and concern:
With Streams there's always some delay between asking for the next database entry and retrieving/processing it. This could slow down the JSON response time if the network connection is slow or a huge number of database requests have to be made to fulfill the response.
Using RxJava, there should always be data available to process. And if it's too much, we can use backpressure to slow down the data transfer from the database to our application. In the worst-case scenario the buffer/queue will contain all requested database entries, and the memory consumption will be equal to that of our current solution using a List.
Why am I asking / What am I asking for?
What did I miss? Are there any other pros / cons?
Why did (especially) the Spring Data team extend their API to support Stream responses from the database, if there's always a (short) delay between each database request/response? This could add up to a noticeable delay for a huge number of requested entries.
Is it recommended to go for RxJava (or some other reactive implementation) for this scenario? Or did I miss any drawbacks?
You seem to be talking about the fetch size for an underlying database engine.
If you reduce it to one (fetching and processing one row at a time), then yes, you will save some memory for the duration of the request...
But it usually makes sense to have a reasonable chunk size.
If it is too small, you will have a lot of expensive network round trips. If the chunk size is too large, you risk running out of memory or introducing too much latency per fetch. So it is a compromise, and the right chunk/fetch size depends on your specific use case.
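For what it's worth, one way to tune that chunk size with Spring Data JPA and Hibernate is a fetch-size query hint (the value of 50 and the repository shown are assumptions, with the same assumed Article entity as above):

```java
import java.util.stream.Stream;

import javax.persistence.QueryHint;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.QueryHints;

interface ArticleRepository extends JpaRepository<Article, Long> {

    // The JDBC driver fetches rows from the cursor in chunks of 50 instead of one by one.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "50"))
    Stream<Article> findAllByOrderByPublishedAtDesc();
}
```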
Regarding whether or not to go reactive, I believe that is not the deciding factor here. With RxJava and, say, Cassandra, one can create an Observable from an asynchronous result set, and it is up to the query configuration how many items are fetched and pushed at a time.

Resources