mongodb many inserts\updates performance - performance

I am using mongodb to store user's events, there's a document for every user, containing an array of events. The system processes thousands of events a minute and inserts each one of them to mongo.
The problem is that I get poor performance for the update operation, using a profiler, I notice that the WriteResult.getError is the one that incur the performance impact.
That makes sense, the update is async, but if one wants to retrieve the operation result he needs to wait until the operation is completed.
My question, is there a way to keep the update async, but only get an exception if error occurs (99.999 of the times there is no error, so the system waits for nothing). I understand it means the exception will be raised somewhere further down the process flow, but I can live with that.
Any other suggestions?
The application is written in Java so we're using the Java driver, but I am not sure it's related.

have you done indexing on your records?
it may be a problem to your performance.
if not done before you should do Indexing on ur collection like
db.collectionName.ensureIndex({"event.type":1})
for more help visit http://www.mongodb.org/display/DOCS/Indexes

Related

Redefine database "transactional" boundary on a spring batch job

Is there a way to redefine the database "transactional" boundary on a spring batch job?
Context:
We have a simple payment processing job that reads x number of payment records, processes and marks the records in the database as processed. Currently, the writer does a REST API call (to the payment gateway), processes the API response and marks the records as processed. We're doing a chunk oriented approach so the updates aren't flushed to the database until the whole chunk has completed. Since, basically the whole read/write is within a transaction, we are starting to see excessive database locks and contentions. For example, if the API takes a long time to respond (say 30 seconds), the whole application starts to suffer.
We can obviously reduce the timeout for the API call to be a smaller value.. but that still doesn't solve the issue of the tables potentially getting locked for longer than desirable duration. Ideally, we want to keep the database transaction as short lived as possible. Our thought is that if the "meat" of what the job does can be done outside of the database transaction, we could get around this issue. So, if the API call happens outside of a database transaction.. we can afford it to take a few more seconds to accept the response and not cause/add to the long lock duration.
Is this the right approach? If not, what would be the recommended way to approach this "simple" job in spring-batch fashion? Are there other batch tools better suited for the task? (if spring-batch is not the right choice).
Open to providing more context if needed.
I don't have a precise answer to all your questions but I will try to give some guidelines.
Since, basically the whole read/write is within a transaction, we are starting to see excessive database locks and contentions. For example, if the API takes a long time to respond (say 30 seconds), the whole application starts to suffer.
Since its inception, the term batch processing or processing data in "batches" is based on the idea that a batch of records is treated as a unit: either all records are processed (whatever the term "process" means) or none of the records is processed. This "all or nothing" semantic is exactly what Spring Batch implements in its chunk-oriented processing model. Achieving such a (powerful) property comes with trade-offs. In your case, you need to make a trade-off between consistency and responsiveness.
We can obviously reduce the timeout for the API call to be a smaller value.. but that still doesn't solve the issue of the tables potentially getting locked for longer than desirable duration.
The chunk-size is the most impactful parameter on the transaction behaviour. What you can do is try to reduce the number of records to be processed within a single transaction and see the result. There is no best value, this is an empirical process. This will also depend on the responsiveness of the API you are calling during the processing of a chunk.
Our thought is that if the "meat" of what the job does can be done outside of the database transaction, we could get around this issue. So, if the API call happens outside of a database transaction.. we can afford it to take a few more seconds to accept the response and not cause/add to the long lock duration.
A common technique to avoid doing such updates on a live system is to offload the processing against another datastore and then replicate the updates in a single transaction. The idea is to mark records with a given batch id and copy those records to a different datastore (or even a temporary table within the same datastore) that the batch process can use without impacting the live datastore. Once the processing is done (which could be done in parallel to improve performance), records can be marked as processed in the live system within in a single transaction (this is usually very fast and could be based on the batch id to identify which records to update).

Auto save performance for rdbms

In my app user types in some content which I would like to auto save as the user types. The save call is not for every keystroke, rather I do autosave only when user pauses for more than 200ms. So in a typical paragraph there are 15-20 server calls. The content will not be read very often, so I need to optimize the writes.
I have to save data on MSSQL Server because of legacy code reasons. I'm getting 10 seconds avg response time in my load test. How do I improve the performance?
One approach I'm considering is instead of directly saving data in mssql I'll save it in Cassandra or redis, then eventually(maybe at regular time intervals) write it to mssql.
Another approach is instead of doing frequent updates, I'll insert new record for each auto save. Then a background process will clean up all records except for latest, every few minutes.
Update:
I replaced the existing logic with simple update calls to 2 tables and now I am seeing improvements. There was a long stored procedure which was taking upto 10 seconds under load. SO for now I have hold on the problem. Still I would like to know is there something I can do on application server layer to reduce frequent DB calls.
It is quite hard to answer yor question directly but here are some hints based on what we do in a multiple active user situation.
If you are writing/triggering on every keystroke, pass the keystroke to a background thread and do not perform the database write, or any network call, while blocking the users typing. A fast typist can hit 20 keystrokes/second, and you cannot afford to introduce latency.
If recording on a web page, you might be able to use localStorage. Do not issue an AJAX style call on every keystroke as there is a limit to outstanding requests. You need to implement some kind of buffered send. Remember that network calls in the real world can be 300mS sort of scale just to traverse the network.
Do you really need to save every keystroke, or is every N seconds acceptable? Every save operation will eventually turn into a disk operation, so you really want to coalesce as many saves as possible. The quickest way to do something is not to do it at all.
If you are recording to a database, then it is often quicker to update an existing row, if you can fetch it by direct key first. Unfortunatly it can sometimes be quicker to insert a new row and clean up excess later. This tends to be true if the table has few indexes. Which is quicker depends on database engine in use and how it is being used. We use both methods.
When using a database keep in mind that they often keep journals of some kind, so if you are updating frequently you might create a large load on the journal files.
If you are using techniques (Using C terminology) like fopen, fwrite these can perform very well, but if you are worried about system failure recovery, you may need to call fsync, which then limits your maximum performance rate. If you need fsync, a database might be better.
You might like to consider writing to a transactionlog table very frequently, and then posting to the real storage every N seconds. For example, if I am typing a customers name I might record every keystroke into a keylog table, and then have a background job read the keylog table and transfer the data to customers table. This helps reduce the operations to the customers table while also allowing the keylog table to be optimised to recording keystrokes. But, at the cost of more code server side.
Overall, you want logic like this
On keyup handler
Add keystroke to background queue
Wake background thread
Background thread
Read/remove ALL data from background queue
If no data, wait for wakeup and repeat
Write to database/network/file etc as one operation. (this can now be syncronous calls)
Optionally some velocity control, simple one is sleep(50mS) or sleep(2s)
Repeat
Keep in mind with the above the user can type and immediately hit close, so your final buffer write might not have flushed yet. You need to handle this.
If you get this correct, the user will not notice any delay. In our usage, we are recording around 1000 keystrokes/sec average, all of which ar routed over private networks to central points. This load is barely a blip, even network monitoring does not see such a small amount of traffic.
Good luck.

Consisntent N1QL Query Couchbase GOCB sdk

I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and need to replay all it's state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is however very slow, writing about 10 000 events in order, takes about 5 sec on my setup.
If I instead write those 10 000 async, using go routines, I can write all of the data in less than one sec.
But, then the writes are in indeterministic order and I can know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to go routine scheduling AFAIK.
What are my options here?
Technically speaking MutationToken and asynchronous operations are not mutually exclusive. It may be able to be done without a change to the client (I'm not sure) but the key here is to take all MutationToken responses and then issue the query with the highest number per vbucket with all of them.
The key here is that given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.

Posting an update request to ElasticSearch without waiting for completion

I have an ElasticSearch index that stores files, sometimes very large ones. Because the underlying Lucene engine is actually doing a complete replacement each time a document is updated, even if I am just modifying the value of one field, the entire document needs to be updated behind the scenes.
For large, multi-MB files this can take a fairly long time (several hundred ms). Since this is done as part of a web application this is not really acceptable. What I am doing right now is forking the process, so the update is called on a separate thread while the request finishes.
This works, but I'm not really happy with this as a long term solution, partially because it means that every time I create a new interface to the search engine I'll have to recode the forking logic. Also it means I basically can't know whether the request is successful or not, or if some kind of error occurred, without writing additional code to log successful or unsuccessful requests somewhere.
So I'm wondering if there is an unknown feature where you can post an UPDATE request to ElasticSearch, and have them return an acknowledgement without waiting for the update task to actually complete.
If you look at the documentation for Snapshot and Restore you'll see when you make a request you can add wait_for_completion=true in order to have the entire process run before receiving the result.
What I want is the reverse — the ability to add ?wait_for_completion=false to a POST request.

Pause the execution of a process for a while in ejb

I think I may not be the first one with this problem.
Sometimes, the user submits a bunch of data to the server, and these data
is going to be displayed in the response page. In order to give users the illusion
that the data submission and process is fast. We usually do this asynchronously.
Now the problem is, for some reason, these data need to go to database first,
and be fetched to appear in the response page. If the response page is displayed
to the user too fast, asynchronous submission may not finish; Now I call
Thread.sleep();
before I call I setResponsePage().
but native thread is not recommended in EJB. Anyone knows alternatives ? Thanks
It's just been discussed in this question: Thread.sleep() in an EJB.
I'd split the logic into two EJBs: one for inserting the user data into DB, and one for fetching it. Your web layer would call one after the other, resulting in two separate transactions, which should be ordered properly by the database (still, it might depend on other factors, like transaction isolation).
EDIT
The problem with sleep() is that you never know how long to wait, so it's almost always a bad idea. I see a case here for Ajax push — your EJB should return immediately with a page to which data will be pushed when processing is complete. I won't advise you further on this topic, as I'm far from expertise in this area.
A still imperfect, but better than sleep(), could be syncing on database locks: first EJB would insert data and lock some record in its transaction, the second EJB would try to lock the same record and read the data. This way the second EJB would wait for minimal time that's needed.

Resources