I'm rethinking how our Spring MVC application behaves: is it better to pull data from the database (Java 8 Stream), or to let the database push its data (Reactive / Observable) and use backpressure to control the amount?
Current situation:
User requests the 30 most recent articles
Service does a database query and puts the 30 results into a List
Jackson iterates over the List and generates the JSON response
Why switch the implementation?
It's quite memory consuming, because we keep all 30 objects in memory the whole time. That's not needed, because the application processes one object at a time. The application should be able to retrieve one object, process it, throw it away, and get the next one.
Java8 Streams? (pull)
With java.util.stream.Stream this is quite easy: the Service creates a Stream that uses a database cursor behind the scenes. Each time Jackson has written the JSON String for one element of the Stream, it asks for the next one, which triggers the database cursor to return the next entry.
RxJava / Reactive / Observable? (push)
Here we have the opposite scenario: The database has to push entry by entry and Jackson has to create the JSON String for each element until the onComplete method has been called.
i.e. the Controller tells the Service: give me an Observable<Article>. Then Jackson can ask for as many database entries as it can process.
Differences and concern:
With Streams there's always some delay between asking for the next database entry and retrieving / processing it. This could slow down the JSON response if the network connection is slow or if a large number of database requests are needed to fulfill the response.
Using RxJava there should always be data available to process. And if there's too much, we can use backpressure to slow down the data transfer from the database to our application. In the worst case the buffer/queue will contain all requested database entries, and memory consumption will equal that of our current solution using a List.
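To make the two models concrete, here is a minimal sketch in Python rather than Java or RxJava. The names `pull_articles` and `push_articles` are made up for illustration, and a blocking bounded queue stands in for reactive backpressure (a real reactive library signals demand instead of blocking a thread), so treat it as an analogy, not as how RxJava works internally.

```python
import queue
import threading


def pull_articles(rows):
    """Pull model: the consumer asks for the next row; nothing is buffered."""
    for row in rows:
        yield row


def push_articles(rows, buffer_size):
    """Push model: a producer thread pushes rows into a bounded buffer.

    put() blocks while the buffer is full, which plays the role of
    backpressure: at most buffer_size rows wait in memory at any time.
    """
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel that marks the end of the data

    def producer():
        for row in rows:
            buf.put(row)  # blocks when the consumer falls behind
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item
```

With `buffer_size` equal to the number of requested rows, the push variant degrades to the List-based behaviour described above; with a small `buffer_size`, the consumer usually finds a row waiting without the whole result being held in memory.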
Why am I asking / What am I asking for?
What did I miss? Are there any other pros / cons?
Why did the Spring Data team in particular extend their API to support Stream responses from the database, if there's always a (short) delay between each database request/response? This could add up to a noticeable delay for a large number of requested entries.
Is it recommended to go for RxJava (or some other reactive implementation) for this scenario? Or did I miss any drawbacks?
You seem to be talking about the fetch size for an underlying database engine.
If you reduce it to one (fetching and processing one row at a time), yes you will save some space during the request time...
But it usually makes sense to have a reasonable chunk size.
If it is too small, you will have a lot of expensive network roundtrips. If the chunk size is too large, you risk running out of memory or introducing too much latency per fetch. So it is a compromise, and the right chunk/fetch size depends on your specific use case.
As for whether to go reactive or not, I believe it is not relevant here. With RxJava and, say, Cassandra, one can create an Observable from an asynchronous result set, and it is up to the query (configuration) how many items are fetched and pushed at a time.
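As an illustration of the fetch-size idea, here is a small sketch using Python's built-in sqlite3 module. The `article` table and the `iter_rows` helper are assumptions for the example. An in-memory SQLite database has no network roundtrips, so this only shows the mechanics: the consumer sees one row at a time while rows are fetched in chunks.

```python
import sqlite3


def iter_rows(cursor, fetch_size):
    """Yield rows one by one while fetching them in chunks of fetch_size."""
    while True:
        chunk = cursor.fetchmany(fetch_size)
        if not chunk:
            return
        yield from chunk


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO article (title) VALUES (?)",
    [(f"Article {i}",) for i in range(100)],
)

# The 30 most recent articles, fetched 10 rows per chunk.
cur = conn.execute("SELECT id, title FROM article ORDER BY id DESC LIMIT 30")
titles = [title for _id, title in iter_rows(cur, fetch_size=10)]
```

A `fetch_size` of 1 corresponds to the row-at-a-time case in the question; larger values trade memory for fewer roundtrips.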
Is there a way to redefine the database "transactional" boundary on a spring batch job?
Context:
We have a simple payment processing job that reads x number of payment records, processes and marks the records in the database as processed. Currently, the writer does a REST API call (to the payment gateway), processes the API response and marks the records as processed. We're doing a chunk oriented approach so the updates aren't flushed to the database until the whole chunk has completed. Since, basically the whole read/write is within a transaction, we are starting to see excessive database locks and contentions. For example, if the API takes a long time to respond (say 30 seconds), the whole application starts to suffer.
We can obviously reduce the timeout for the API call to be a smaller value.. but that still doesn't solve the issue of the tables potentially getting locked for longer than desirable duration. Ideally, we want to keep the database transaction as short lived as possible. Our thought is that if the "meat" of what the job does can be done outside of the database transaction, we could get around this issue. So, if the API call happens outside of a database transaction.. we can afford it to take a few more seconds to accept the response and not cause/add to the long lock duration.
Is this the right approach? If not, what would be the recommended way to approach this "simple" job in spring-batch fashion? Are there other batch tools better suited for the task? (if spring-batch is not the right choice).
Open to providing more context if needed.
I don't have a precise answer to all your questions but I will try to give some guidelines.
Since, basically the whole read/write is within a transaction, we are starting to see excessive database locks and contentions. For example, if the API takes a long time to respond (say 30 seconds), the whole application starts to suffer.
Since its inception, the term batch processing or processing data in "batches" is based on the idea that a batch of records is treated as a unit: either all records are processed (whatever the term "process" means) or none of the records is processed. This "all or nothing" semantic is exactly what Spring Batch implements in its chunk-oriented processing model. Achieving such a (powerful) property comes with trade-offs. In your case, you need to make a trade-off between consistency and responsiveness.
We can obviously reduce the timeout for the API call to be a smaller value.. but that still doesn't solve the issue of the tables potentially getting locked for longer than desirable duration.
The chunk size is the parameter with the most impact on transaction behaviour. Try reducing the number of records processed within a single transaction and observe the result. There is no single best value; finding one is an empirical process that also depends on the responsiveness of the API you call while processing a chunk.
Our thought is that if the "meat" of what the job does can be done outside of the database transaction, we could get around this issue. So, if the API call happens outside of a database transaction.. we can afford it to take a few more seconds to accept the response and not cause/add to the long lock duration.
A common technique to avoid doing such updates on a live system is to offload the processing to another datastore and then replicate the updates in a single transaction. The idea is to mark records with a given batch id and copy those records to a different datastore (or even a temporary table within the same datastore) that the batch process can use without impacting the live datastore. Once the processing is done (which could be done in parallel to improve performance), records can be marked as processed in the live system within a single transaction (this is usually very fast and could use the batch id to identify which records to update).
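The batch-id idea can be sketched as follows. This uses Python's sqlite3 module instead of Spring Batch, and the `payment` table, the `batch-1` id, and the `call_gateway` function are all hypothetical. As a simplification, the claimed ids are held in memory rather than copied to a separate datastore or temporary table.

```python
import sqlite3


def call_gateway(payment_id):
    """Hypothetical stand-in for the slow REST call to the payment gateway."""
    return "OK"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment (id INTEGER PRIMARY KEY, status TEXT, batch_id TEXT)"
)
conn.executemany("INSERT INTO payment (status) VALUES (?)", [("NEW",)] * 5)
conn.commit()

batch_id = "batch-1"

# 1) Short transaction: claim the records for this batch.
with conn:
    conn.execute(
        "UPDATE payment SET batch_id = ? "
        "WHERE status = 'NEW' AND batch_id IS NULL",
        (batch_id,),
    )
    claimed = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM payment WHERE batch_id = ?", (batch_id,)
        )
    ]

# 2) No transaction is open here, so a slow API call holds no locks.
results = {pid: call_gateway(pid) for pid in claimed}

# 3) Short transaction: mark the successfully processed records.
with conn:
    conn.executemany(
        "UPDATE payment SET status = 'PROCESSED' "
        "WHERE id = ? AND batch_id = ?",
        [(pid, batch_id) for pid, res in results.items() if res == "OK"],
    )
```

The price of this layout is that a crash between steps 2 and 3 leaves records that were sent to the gateway but not marked, so the job needs a way to reconcile claimed-but-unprocessed records on restart.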
A system is being implemented using microservices. In order to decrease interactions between microservices implemented "at the same level" in an architecture, some microservices will locally cache copies of tables managed by other services. The assumption is that the locally cached table (a) is frequently accessed in a "read mode" by the microservice, and (b) has relatively static content (i.e., more of a "lookup table" than transactional content).
The local caches will maintain synch using inter-service messaging. As the content should be fairly static, this should not be a significant issue/workload. However, on startup of a microservice, there is a possibility that the local cache has gone stale.
I'd like to implement some sort of rolling revision number on the source table, so that microservices with local caches can check this revision number to potentially avoid a re-synch event.
Is there a "best practice" for this approach? Or a "better alternative", given that each microservice is backed by its own database (i.e., no shared database)?
In my opinion you shouldn't be loading the data at startup. Maintaining a version could get a bit complicated.
Cache-Aside Pattern
Generally in a microservices architecture you consider the "cache-aside pattern". You don't build the cache up front but on demand. When you get a request you check the cache; if the value is not there, you load the latest value from the source, store it in the cache, and return the response. From then on it is served from the cache. The benefit is that you don't need to load everything up front. Say you have 200 records while services frequently use only 50 of them: loading all 200 means maintaining cache entries that may never be needed.
Let the requests build the cache; each miss is a one-time DB hit. You can set an expiry on cache entries and let incoming requests rebuild them.
If your data is totally static (it never changes), this pattern may not be worth discussing. But if you have a lookup table that can change even once a week or once a month, you should use this pattern with a longer cache expiration time. Maintaining a version could be costly, but it is really up to you how you want to implement it.
https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside
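A minimal in-process sketch of cache-aside with expiry might look like this. The class name `CacheAside`, the `loader` callback (standing in for the call to the owning service or its database), and the injectable `clock` are assumptions for illustration; a real deployment would more likely use a shared cache such as Redis.

```python
import time


class CacheAside:
    """Cache-aside with a time-to-live: entries are loaded on first use."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # fetches a value from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit, still fresh
        value = self._loader(key)  # miss or expired: one hit on the source
        self._entries[key] = (value, now + self._ttl)
        return value
```

For a lookup table that changes roughly weekly, `ttl_seconds` could be set to hours or days; only keys that are actually requested ever occupy memory.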
We ran into this same issue and have temporarily solved it by using a LastUpdated timestamp comparison (same concept as your VersionNumber). Every night (when our application tends to be least busy) each service publishes a ServiceXLastUpdated message that includes the most recent timestamp at which the data it owns was added/edited. Any other service that subscribes to this data processes the message, and if there's a mismatch it requests all rows "touched" since its last local update so that it can get back in sync.
For us, for now, this is okay, as new services don't tend to come online and be in use the same day. But our plan going forward is that any time a service starts up, it can publish a message for each service it subscribes to, indicating its most recent cache update timestamp. If a "source" service sees that the timestamp is not current, it can send updates to re-sync the data. This has the advantage of sending only the needed updates to the specific service(s) that need them, even though (at least for us) all subscribed services have access to the messages.
We started out using persistent queues, so that if all instances of a microservice were down, the messages would just build up in its queue. There are 2 issues with this that led us to build something better:
1) It obviously doesn't solve the "first startup" scenario as there is no queue for messages to build up in
2) If ANYTHING goes wrong either in storing queued messages or processing them, you end up out of sync. If that happens, you still need a proactive mechanism like we have now to bring things back in sync. So, it seemed worth going this route
I wouldn't say our method is a "best practice" and if there is one I'm not aware of it. But, the way we're doing it (including planned future work) has so far proven simple to build, easy to understand and monitor, and robust in that it's extremely rare we get an event caused by out-of-sync local data.
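The timestamp comparison described above can be sketched like this. `Row`, `SourceService`, and `LocalCache` are hypothetical names, direct method calls stand in for the messaging layer, and deleted rows are not handled (they would need tombstones or a similar mechanism).

```python
from dataclasses import dataclass


@dataclass
class Row:
    key: str
    value: str
    last_updated: int  # e.g. epoch seconds of the last add/edit


class SourceService:
    """Owns the table and can report what changed since a timestamp."""

    def __init__(self, rows):
        self.rows = {r.key: r for r in rows}

    def last_updated(self):
        return max((r.last_updated for r in self.rows.values()), default=0)

    def rows_touched_since(self, ts):
        return [r for r in self.rows.values() if r.last_updated > ts]


class LocalCache:
    """Subscriber-side copy that re-syncs only when it is behind."""

    def __init__(self):
        self.rows = {}
        self.synced_up_to = 0

    def on_last_updated_message(self, source_ts, source):
        if source_ts <= self.synced_up_to:
            return 0  # already current, nothing to fetch
        changed = source.rows_touched_since(self.synced_up_to)
        for row in changed:
            self.rows[row.key] = row
        self.synced_up_to = source_ts
        return len(changed)
```

On startup, a service would call `on_last_updated_message` (or publish its own `synced_up_to`) once, which covers both the "first startup" case and a cache that went stale while the service was down.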
I'm trying to find an architecture for the following scenario. I'm building a REST service that performs some computation that can be quickly batch computed. Let's say that computing 1 "item" takes 50ms, and computing 100 "items" takes 60ms.
However, the nature of the client is that only 1 item needs to be processed at a time. So if I have 100 simultaneous clients, and I write the typical request handler that sends one item and generates a response, I'll end up using 5000ms, but I know I could compute the same in 60ms.
I'm trying to find an architecture that works well in this scenario. I.e., I would like to have something that merges data from many independent requests, processes that batch, and generates the equivalent responses for each individual client.
If you're curious, the service in question is python+django+DRF based, but I'm curious about what kind of architectural solutions/patterns apply here and if anything solving this is already available.
At first you could think of a reverse proxy that detects all pattern-specific queries, collects these queries, and sends them to your application in an HTTP 1.1 pipeline (pipelining is a way to send a large number of queries one after another and receive all HTTP responses in the same order at the end, without waiting for a response after each query).
But:
Pipelining is very hard to do well
You would have to code the reverse proxy yourself, as I do not know of an existing way to do it
One slow response in the pipeline blocks all the other responses
You need an HTTP server able to hand several queries to your application at once, something which never happens unless the HTTP server is coded directly in your application, because HTTP handling is usually built around one query at a time (for example, you never receive 2 queries in a PHP environment: you receive the first one, send the response, and then receive the next one, even if the connection contains 2 queries).
So the better idea would be to do this on the application side. You could identify matching queries and wait for a small amount of time (10ms?) to see whether other queries are also incoming. You will need a way to communicate between several parallel workers here (say you have 50 application workers and 10 of them have received queries that could be handled in the same batch). This communication channel could be a (very fast) database or some shared memory, depending on the technology used.
Then, when enough time has been spent waiting (10ms?) or when a large number of queries has been received, one of the workers could collect all queries, run the batch, and tell all the other workers that a result is available (here again you need a central point of communication, like LISTEN/NOTIFY in PostgreSQL, shared memory, a message queue service, etc.).
Finally every worker is responsible for sending the right HTTP response.
The key here is having a system where the time you lose in coordinating the shared handling of requests is smaller than the time saved by batching several queries together, and where this overhead stays reasonable under low traffic (as there you will always lose time waiting for nothing). And of course you are also adding complexity to the system, making it harder to maintain, etc.
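The wait-then-batch idea can be sketched for the simpler case of a single asyncio process, where an in-memory list replaces the shared memory or database needed between separate workers. `MicroBatcher` and `batch_fn` are hypothetical names, and `batch_fn` is assumed to be a synchronous function that returns one result per input, in order; because it runs on the event loop, a long-running batch would need to be moved to an executor.

```python
import asyncio


class MicroBatcher:
    """Collects items for up to max_wait seconds, then computes them as one batch."""

    def __init__(self, batch_fn, max_wait=0.01, max_size=100):
        self._batch_fn = batch_fn
        self._max_wait = max_wait
        self._max_size = max_size
        self._pending = []  # list of (item, future) pairs
        self._timer = None

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self._max_size:
            self._flush()  # batch is full, do not wait any longer
        elif self._timer is None:
            self._timer = loop.call_later(self._max_wait, self._flush)
        return await fut

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if not batch:
            return
        try:
            results = self._batch_fn([item for item, _ in batch])
        except Exception as exc:
            # Every waiting request sees the same failure.
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)
            return
        for (_, fut), result in zip(batch, results):
            if not fut.done():
                fut.set_result(result)
```

Each request handler would simply `await batcher.submit(item)` and build its own HTTP response from the result. Under low traffic every request pays up to `max_wait` of extra latency, which is the trade-off described above.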
I have a question regarding the performance of RequestFactory and GWT. I have a Domain Entity with 8 fields, and a request that returns around 1000 EntityProxies. The time between firing the request and receiving the response is around 20 seconds. If I do the same but return 10 EntityProxies, the time is 17 seconds, almost the same.
Is this because I'm working in development mode, or will the time be the same when I release the code to the web?
Is there any way to improve the performance? I'm only reading data, so perhaps something that only reads and doesn't write may be the solution?
I read this post with something similar to my problem:
GWT Requestfactory performance suggestions
Thanks a lot.
P.S.: I read somewhere that one solution could be to create an XML document on the server, send it to the client, and recreate the objects there. I don't want to do this because it would really change the design of my app.
Thank you all for the help, I realize now that perhaps using Request Factory to retrieve thousands of records was a mistake.
I initially used a Locator to override the isLive() and find() methods according to this post:
gwt-requestfactory-performance-suggestions
The response time was reduced to about 13 seconds, but it is still too high.
But I solved it easily. Instead of returning 1000+ Entities, I created a new database table in which each field holds all the values of the corresponding original field (1000+ of them) concatenated with a separator (each db field has a length of about 10000). The table has only one record, with around 8 fields.
Something like this:
Field1 | Field2 | Field3
Field1val;Field1val;Field1val;....... | Field2val;Field2val;Field2val;...... | Field3val;Field3val;Field3val;......
I return that one record through RequestFactory to my client, and it reduced the response time a lot, to around 1 second. I parse these large Strings on the client, which takes about 500ms. So instead of wasting around 20 seconds, it now takes around 1-2 seconds to accomplish the same.
By the way I am only displaying information, it is not necessary to Insert, Delete or Update records so this solution works for me.
Thought I could share this solution.
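The column-wise packing can be illustrated in a few lines. This is a Python sketch of the data format only, not GWT code, and `pack_columns` / `unpack_columns` are made-up names. It assumes that no value contains the separator and that all values can be treated as strings.

```python
SEP = ";"


def pack_columns(rows, fields):
    """Turn many rows into one record: one separator-joined string per field."""
    return {f: SEP.join(str(row[f]) for row in rows) for f in fields}


def unpack_columns(packed, fields):
    """Rebuild the rows (as dicts of strings) from the packed record."""
    columns = [packed[f].split(SEP) for f in fields]
    return [dict(zip(fields, values)) for values in zip(*columns)]
```

For example, packing the rows `{"name": "a", "qty": 1}` and `{"name": "b", "qty": 2}` yields `{"name": "a;b", "qty": "1;2"}`. Note that an empty row list packs to empty strings, which unpack to a single row of empty values, so that case needs separate handling.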
Profiling and fixing performance issues in GWT is tricky. Avoid all profiling in GWT Hosted mode; the numbers do not tell you anything useful.
You should profile only in WEB mode.
RequestFactory is, by design, slower than GWT RPC, GWT JSON, etc. This is the trade-off for RF's ability to calculate deltas and send only a small amount of information to the server on save.
You should recheck your application design to avoid loading thousands of proxies. RF is meant for "Form"-like applications. The only reason you might need thousands of proxies is a Grid display. You can probably use a paginated async grid in that scenario.
You should profile your app in order to find out how much time is spent on following steps:
Entities retrieved from the database (server): This can be improved using second level cache and optimized queries
Entities serialized to JSON (server): There is an overhead here because RequestFactory and AutoBean both rely on reflection. You can try to transmit only the Entities that you are actually going to display on the client. Another optimization which greatly reduces latency is to override the isLive method of your EntityLocator and return true
HTTP response from server to client to transmit the data (wire): You can think about using gzip compression to reduce the amount of data that has to be transferred (important if you send a lot of objects over the wire).
De-serialization on the client (client): This should be quite fast. There was a benchmark that showed that AutoBean serialization was one of the fastest ways to serialize JSON. Again this will benefit from not sending the whole object graph over the wire.
One way to improve performance is to use caching. You can use HTML5 localstorage to cache data on the client. This applies specifically to data that doesn't change often.
I have a web page which, upon loading, needs to do a lot of JSON fetches from the server to populate various things dynamically. In particular, it updates parts of a large-ish data structure from which I derive a graphical representation of the data.
So it works great in Chrome; however, Safari and Firefox appear to suffer somewhat. Upon the querying of the numerous JSON requests, the browsers become sluggish and unusable. I am under the assumption that this is due to the rather expensive iteration of said data structure. Is this a valid assumption?
How can I mitigate this without changing the query language so that it's a single fetch?
I was thinking of applying a queue that could limit the number of concurrent Ajax queries (and hence also limit the number of concurrent updates to the data structure)... Any thoughts? Useful pointers? Other suggestions?
In browser-side JS, create a wrapper around jQuery.post() (or whichever method you are using)
that appends the requests to a queue.
Also create a function 'queue_send' that will actually call jQuery.post() passing the entire queue structure.
On the server, create a proxy function called 'queue_receive' that replays the JSON to your server interfaces as though each request came from the browser, collects the results into a single response, and sends it back to the browser.
Browser-side queue_send_success() (success handler for queue_send) must decode this response and populate your data structure.
With this, you should be able to reduce your initialization traffic to one actual request, and maybe consolidate some other requests on your website as well.
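The server half of this scheme could look roughly like the following sketch. It is written in Python with no web framework; the `HANDLERS` table, its URLs, and the request shape (`id`, `url`, `params`) are assumptions for illustration, since the original does not specify a wire format.

```python
import json

# Hypothetical existing endpoints, keyed by the URL the browser would have called.
HANDLERS = {
    "/users": lambda params: {"users": ["ann", "bob"]},
    "/stats": lambda params: {"count": params.get("n", 0)},
}


def queue_receive(body):
    """Replay each queued request against its handler and return one combined response."""
    requests = json.loads(body)
    responses = []
    for req in requests:
        handler = HANDLERS.get(req["url"])
        if handler is None:
            responses.append({"id": req["id"], "error": "unknown url"})
        else:
            result = handler(req.get("params", {}))
            responses.append({"id": req["id"], "result": result})
    return json.dumps(responses)
```

The `id` field lets the browser-side success handler match each result to the request that produced it, so the data structure can be updated once after the whole response has been decoded.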
in particular, it updates parts of a largish data structure from which i derive a graphical representation of the data.
I'd try:
Queuing responses as they come in, then update the structure once
Keeping the representation hidden until the responses are in
Magicianeer's answer is also good - but I'm not sure if it fits your definition of "without changing the query language so that it's a single fetch" - it would avoid re-engineering existing logic.