Optimistic locking over multiple documents - go

I need to update several documents at once, like an RDBMS transaction. For a single document in a key-value store like Couchbase, the best way to do this seems to be optimistic locking, and that would work for me. However, I need to update multiple documents at once.
I need all of the documents to be updated, or none. Is this possible in Couchbase or a similar highly scalable database?
(By the way, I'm using Go.)

There are three approaches:
Take another look at your key/document design and see whether it's possible to combine your multiple documents into one. Then a single atomic update in Couchbase covers everything.
Simulate a transaction: write a suitable document and view definition that produce the same effect while still requiring only a single document update.
Simulate a multi-phase transaction: use a transaction record to track each stage of the update process.
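
A minimal sketch of the third approach, assuming the store offers a per-document compare-and-swap (CAS). `Store`, `Update` and `UpdateAll` are hypothetical names, and the store is an in-memory stand-in rather than the Couchbase SDK. Each update is applied with its CAS; if one fails, the updates already applied are undone in reverse order.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrCASMismatch reports that a document changed since it was read.
var ErrCASMismatch = errors.New("cas mismatch")

// entry is one stored document plus its CAS (version) counter.
type entry struct {
	value string
	cas   uint64
}

// Store is a hypothetical in-memory stand-in for a key-value database
// that offers per-document compare-and-swap.
type Store struct {
	mu   sync.Mutex
	docs map[string]entry
}

func NewStore() *Store { return &Store{docs: map[string]entry{}} }

// Set writes a value unconditionally and bumps the CAS.
func (s *Store) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.docs[key] = entry{value: value, cas: s.docs[key].cas + 1}
}

// Get returns the value and its current CAS.
func (s *Store) Get(key string) (string, uint64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.docs[key]
	return e.value, e.cas, ok
}

// Replace writes value only if the caller's CAS is still current.
func (s *Store) Replace(key, value string, cas uint64) (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.docs[key]
	if !ok || e.cas != cas {
		return 0, ErrCASMismatch
	}
	s.docs[key] = entry{value: value, cas: cas + 1}
	return cas + 1, nil
}

// Update describes one intended change, based on a previously read CAS.
type Update struct {
	Key, NewValue string
	CAS           uint64
}

// applied remembers how to undo one completed step.
type applied struct {
	key, oldValue string
	newCAS        uint64
}

// UpdateAll applies every update, or restores the ones already applied.
func UpdateAll(s *Store, updates []Update) error {
	var log []applied // plays the role of the transaction record
	for _, u := range updates {
		// If Replace succeeds, old is the value that belonged to u.CAS.
		old, _, _ := s.Get(u.Key)
		newCAS, err := s.Replace(u.Key, u.NewValue, u.CAS)
		if err != nil {
			for i := len(log) - 1; i >= 0; i-- {
				// Best-effort undo; it fails if someone wrote in between.
				s.Replace(log[i].key, log[i].oldValue, log[i].newCAS)
			}
			return fmt.Errorf("update %q: %w", u.Key, err)
		}
		log = append(log, applied{u.Key, old, newCAS})
	}
	return nil
}

func main() {
	s := NewStore()
	s.Set("a", "1")
	s.Set("b", "1")
	_, casA, _ := s.Get("a")
	_, casB, _ := s.Get("b")
	s.Set("b", "changed elsewhere") // a concurrent writer invalidates casB
	err := UpdateAll(s, []Update{{"a", "2", casA}, {"b", "2", casB}})
	a, _, _ := s.Get("a")
	fmt.Println(err != nil, a) // "a" has been restored to its old value
}
```

This is not isolated: other readers can see the intermediate state between the first write and the undo. To survive a crash, the undo log would have to be persisted as its own transaction record document before the first update is applied.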

Related

What is the best way of updating data in SpringBoot?

For a PUT request, do I always have to compare against the old data and change only the fields that differ in order to update the existing data? Is it right to check each field for changes?
I do not know of any project which takes the effort to only update fields that were actually changed.
Usually you just overwrite all fields in your table with the new values, as this is the easiest and most reliable way of doing it.
Also consider, that custom logic that decides what to update also needs to be maintained and can have bugs. If you end up having a bug in that logic, most likely, you'll realize that you have data consistency errors which might be unfixable.
When you use Spring Boot, you most likely also use Spring Data JPA and Hibernate, which take care of mapping your objects to your database. In that case, Hibernate decides on the update strategy anyway.
If you are worried about data consistency and concurrent updates to the same record, I would recommend looking into optimistic locking, which is an easy way to handle that issue. It's very easy to set up: just add a version column to your table.

Apache Nifi - Federated Search

My team's been thrown into the deep end and asked to build a federated search of customers over a variety of large datasets, which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool and then push the result into a database, which is then queried to feed an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database from the first flow:-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into an Elasticsearch instance for querying by the application's API, which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing a script for ExecuteScript is the way to go, unless there is an equivalent Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

neo4j slows down after lots of inserts

I'm the owner of the Blockchain2graph project, which reads data from the Bitcoin Core REST API and inserts blocks, addresses and transactions as graph objects in Neo4j.
After some imports, the process is slowing down until the memory is full. I don't want to use CSV imports. My problem is not performance, my goal is to insert things without the application stopping because of memory (even if it takes quite a lot of time)
I'm using spring-boot-starter-data-neo4j.
In my code, I call session.clear from time to time, but it doesn't seem to have an impact. After restarting tomcat8, things go fast again.
As your project is about mass inserts, I wouldn't use an OGM like Spring Data Neo4j for writing the data.
You don't want a session to keep your data around on the client.
Instead, use Cypher directly and send the updates you get from the blockchain API as one batch per request. See my blog post for some examples (some of which we also use in SDN/Neo4j-OGM under the hood).
You can still use SDN for individual entity handling (CRUD); in my book, that's what OGMs are good for: reducing boilerplate.
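
A minimal sketch of the batching idea, written in Go and assuming nothing about the driver in use. The `Address` label and the `address`/`balance` properties are made up for illustration, and `MakeBatches` is a hypothetical helper. Each batch pairs a single `UNWIND` statement with a bounded list of rows, so nothing accumulates in a client-side session.

```go
package main

import "fmt"

// Hypothetical statement: the label and property names are assumptions.
const mergeAddresses = `UNWIND $rows AS row
MERGE (a:Address {address: row.address})
SET a.balance = row.balance`

// Batch pairs one Cypher statement with the parameters for one request.
type Batch struct {
	Cypher string
	Params map[string]any
}

// MakeBatches splits rows into chunks of at most size rows, so each
// request carries a bounded amount of data.
func MakeBatches(rows []map[string]any, size int) []Batch {
	if size < 1 {
		size = 1
	}
	var out []Batch
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		out = append(out, Batch{
			Cypher: mergeAddresses,
			Params: map[string]any{"rows": rows[start:end]},
		})
	}
	return out
}

func main() {
	rows := make([]map[string]any, 0, 2500)
	for i := 0; i < 2500; i++ {
		rows = append(rows, map[string]any{
			"address": fmt.Sprintf("addr-%d", i),
			"balance": i,
		})
	}
	for i, b := range MakeBatches(rows, 1000) {
		// A real importer would send b.Cypher and b.Params to Neo4j here.
		fmt.Println(i, len(b.Params["rows"].([]map[string]any)))
	}
}
```

The batch size of 1000 is only a starting point to tune against the available memory.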
But for more complex read operations that have aggregation, filtering, projection and path matches I'd still use Cypher on an annotated repository method, returning rows that can be mapped to a list of DTOs.

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database setup and use socket.io to have real-time updates that will have the db queried again as well or push the new update to the client.
I am trying to figure out what is the best way to filter the database?
Some more information in regards to what is being queried and what the real time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
I am trying to figure out which is the fastest way for the user to filter.
Also, if the fastest way is not the most cost-effective one, I would like to know the most cost-effective way as well (which I am assuming means fewer db queries)...
It's hard to say because we don't see exact conditions of the filter, but in general:
Mongo generally uses a single index per query condition. Whatever fields that index covers can be filtered efficiently; otherwise it might do a full collection scan, which is slow. If your query uses an index, you are probably doing the most efficient query. (Mongo can still use another index for sorting though.)
Sometimes you will be forced to do processing on client side because Mongo can't do what you want or it takes too many queries.
The least efficient option is to store results somewhere, simply because I/O is slow. It only pays off if you use the stored results as a cache and do not recalculate them.
Also consider overhead and latency of networking. If you have to send lots of data back to the client it will be slower. In general Mongo will do better job filtering stuff than you would do on the client.
From what you describe, if you filter by addresses within a time period, an index on those fields would rule out a lot of documents. You most likely need a compound index, that is, one over multiple fields.
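
As a rough illustration of Method 1 (filtering in the database), the sketch below builds a MongoDB-style filter document from whichever criteria the user has set. It is written in Go with plain maps instead of a driver; `Filter` and `Query` are hypothetical names, and the `city` and `time` field names are taken from the question. A compound index matching this filter could be created in the mongo shell with something like `db.deliveries.createIndex({city: 1, time: 1})`, where the collection name is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Filter holds the optional search criteria; zero values mean "not set".
type Filter struct {
	City     string
	From, To time.Time
}

// Query builds a MongoDB-style filter document from the criteria
// that are actually set.
func (f Filter) Query() map[string]any {
	q := map[string]any{}
	if f.City != "" {
		q["city"] = f.City // equality match, first field of the index
	}
	rng := map[string]any{}
	if !f.From.IsZero() {
		rng["$gte"] = f.From
	}
	if !f.To.IsZero() {
		rng["$lt"] = f.To
	}
	if len(rng) > 0 {
		q["time"] = rng // range match, second field of the index
	}
	return q
}

func main() {
	f := Filter{
		City: "Berlin",
		From: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC),
	}
	b, err := json.Marshal(f.Query())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Putting the equality field before the range field in the index lets the database narrow down by city first and then scan only the matching time range.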

Data Synchronization from Relational Database to Couch DB

I need to synchronize my relational database (Oracle or MySQL) to CouchDB. Does anyone have any idea how this is possible? If it is possible, how can we notify CouchDB of any changes that happen in the relational DB?
Thanks in advance.
First of all, you need to change the way you think about database modeling. Synchronizing to CouchDB is not just creating documents of all your tables, and pushing them to Couch.
I'm using CouchDB for a site in production, I'll describe what I did, maybe it will help you:
From the start, we have been using MySQL as our primary database. I had entities mapped out, including their relations. In an attempt to speed up the front-end I decided to use CouchDB as a content repository. The benefit was to have fully prepared documents, that contained all the relational data, so data could be fetched with much less overhead.
Because the documents can contain related entities - say a question document that contains all answers - I first decided what top-level entities I wanted to push to Couch. In my example, only questions would be pushed to Couch, and those documents would contain the answers, and possibly some metadata, such as tags, user info, etc. When requesting a question on the frontend, I would only need to fetch one document to have all the information I need at that point.
Now for your second question: how to notify CouchDB of changes. In our case, all the changes in our data are done using a CMS. I have a single point in my code which all edit actions call. That's the place where I hooked in a function that persisted the object being saved to CouchDB. The function determines if this object needs persisting (ie: is it a top level entity), then creates a document of this object (think about some sort of toArray function), and fetches all its relations, recursively. The complete document is then pushed to CouchDB.
Now, in your case, the variables here may be completely different, but the basic idea is the same: figure out what documents you want saved, and what they look like. Then write a function that composes these documents and make sure it is called when changes are made to your relational database.
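
The composing function described above might look roughly like this in Go. All type and field names (`Question`, `Answer`, `QuestionDoc`, the `question:<id>` document id scheme) are assumptions for illustration, not the author's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Answer and Question mirror hypothetical relational rows.
type Answer struct {
	ID   int    `json:"id"`
	Body string `json:"body"`
}

type Question struct {
	ID    int
	Title string
	Tags  []string
}

// QuestionDoc is the denormalized document stored in CouchDB.
type QuestionDoc struct {
	ID      string   `json:"_id"`
	Type    string   `json:"type"`
	Title   string   `json:"title"`
	Tags    []string `json:"tags"`
	Answers []Answer `json:"answers"`
}

// Compose embeds the related rows into one top-level document.
func Compose(q Question, answers []Answer) QuestionDoc {
	return QuestionDoc{
		ID:      fmt.Sprintf("question:%d", q.ID),
		Type:    "question",
		Title:   q.Title,
		Tags:    q.Tags,
		Answers: answers,
	}
}

func main() {
	doc := Compose(
		Question{ID: 42, Title: "How do I sync MySQL to CouchDB?", Tags: []string{"couchdb"}},
		[]Answer{{ID: 1, Body: "Push denormalized documents on save."}},
	)
	out, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Calling `Compose` from the single save hook keeps the CouchDB copy in step with every edit made through the application.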
Notifying CouchDB of a change
CouchDB is very simple. Probably the easiest thing is directly updating an existing document. Two ways to implement this come to mind:
The easiest way is a normal CouchDB update: Fetch the current document by id; modify it; then send it back to Couch with HTTP PUT or POST.
If you have clear application-specific changes (e.g. "the views value was incremented") then writing an _update function seems prudent. Update functions are very simple: they receive an HTTP query and a document; they modify the document; and then CouchDB stores the new version. You write update functions in JavaScript and they run on the server. It is a great way to "compress" common actions into simpler (and fewer) HTTP queries.
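
The first option (fetch, modify, PUT) can be sketched in Go with only the standard library. CouchDB rejects a PUT whose `_rev` is no longer current with 409 Conflict, so the sketch refetches and retries in that case. `UpdateDoc` is a hypothetical helper, and `fakeCouch` is a tiny in-process stand-in that exists only so the example runs without a real server; against a real CouchDB the URL would look like `http://localhost:5984/<db>/<docid>`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// UpdateDoc fetches a document, applies modify, and PUTs it back.
// On 409 Conflict (someone else saved first) it refetches and retries.
func UpdateDoc(c *http.Client, docURL string, modify func(map[string]any), attempts int) error {
	for i := 0; i < attempts; i++ {
		resp, err := c.Get(docURL)
		if err != nil {
			return err
		}
		if resp.StatusCode != http.StatusOK {
			resp.Body.Close()
			return fmt.Errorf("get: %s", resp.Status)
		}
		var doc map[string]any
		err = json.NewDecoder(resp.Body).Decode(&doc)
		resp.Body.Close()
		if err != nil {
			return err
		}
		modify(doc) // doc still carries _rev, which CouchDB checks on PUT
		body, err := json.Marshal(doc)
		if err != nil {
			return err
		}
		req, err := http.NewRequest(http.MethodPut, docURL, bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		put, err := c.Do(req)
		if err != nil {
			return err
		}
		put.Body.Close()
		switch put.StatusCode {
		case http.StatusCreated, http.StatusAccepted:
			return nil
		case http.StatusConflict:
			continue // stale _rev: fetch the latest version and try again
		default:
			return fmt.Errorf("put: %s", put.Status)
		}
	}
	return fmt.Errorf("gave up after %d conflicting attempts", attempts)
}

// fakeCouch imitates one CouchDB document, for demonstration only.
type fakeCouch struct {
	mu  sync.Mutex
	doc map[string]any
	rev int
}

func (f *fakeCouch) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if r.Method != http.MethodPut {
		json.NewEncoder(w).Encode(f.doc)
		return
	}
	var in map[string]any
	if json.NewDecoder(r.Body).Decode(&in) != nil || in["_rev"] != f.doc["_rev"] {
		w.WriteHeader(http.StatusConflict)
		return
	}
	f.rev++
	in["_rev"] = fmt.Sprintf("%d-demo", f.rev)
	f.doc = in
	w.WriteHeader(http.StatusCreated)
}

// startFakeCouch serves one document whose first revision is "1-demo".
func startFakeCouch(doc map[string]any) (*fakeCouch, *httptest.Server) {
	doc["_rev"] = "1-demo"
	f := &fakeCouch{doc: doc, rev: 1}
	return f, httptest.NewServer(f)
}

func main() {
	couch, srv := startFakeCouch(map[string]any{"_id": "q1", "views": 0.0})
	defer srv.Close()
	err := UpdateDoc(srv.Client(), srv.URL+"/questions/q1", func(doc map[string]any) {
		views, _ := doc["views"].(float64) // JSON numbers decode as float64
		doc["views"] = views + 1
	}, 3)
	fmt.Println(err, couch.doc["views"], couch.doc["_rev"])
}
```

An `_update` function moves the modify step onto the server, which saves the initial GET for simple changes such as incrementing a counter.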
