What are the risks and ramifications of changing the document validation criteria in a running CouchDB database?

To take the simplest possible example:
Start with an empty database.
Add a document.
Add a design document with a validation function that rejects everything.
Replicate that database.
To ask a concrete question to begin with, one that I hope can be answered very quickly by pointing me to the right URL: is the result of this replication defined by some rule, for example that documents are always replicated in the order they were saved, or does the successful replication of the first document depend on whether the design document happened to arrive at the destination first? In the quick experiment I did, both documents were successfully validated, but I'm trying to find out whether that outcome is defined in a spec somewhere or is implementation-dependent.
To ask a followup question that's more handwavey and may not have a single answer: what else can happen, and what sorts of solutions have emerged to manage those problems? It's obviously possible for different servers to simultaneously (and I use that word hesitantly) have different versions of a validation function. I suppose the validators could be backwards compatible, where every new version adds a case to a switch statement that looks up, say, a schema_version attribute of the document. Then if a version 2 document arrives at a server where the version 3 validator is the gatekeeper, it'll be allowed in. If a version 3 document arrives at a version 2 validator, it's a bit more tricky; it presumably depends on whether strictness or leniency is an appropriate default for the application. But can either of those things even happen, or do the replication rules ensure that, even if servers are going up and down, updates and deletes are being done all over the place, and replication connections are intermittent and indirect, a document will never arrive on a given server before its appropriate validation function, and a validation function will never arrive too late to handle one of the documents it was supposed to check?
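For what it's worth, here is how I picture that backwards-compatible validator. This is only a sketch of the dispatch-by-schema_version logic in Python (real CouchDB validators are JavaScript validate_doc_update functions), and the field names and the strict/lenient switch are made up for illustration:

```python
# A model of the backwards-compatible validator idea: dispatch on the
# document's schema_version and decide what to do with versions the
# validator has never seen. Field names (title, date) are hypothetical.

class Forbidden(Exception):
    """Stands in for CouchDB rejecting the write."""

def validate(new_doc, strict_unknown_versions=True):
    version = new_doc.get("schema_version", 1)

    if version == 1:
        if "title" not in new_doc:
            raise Forbidden("v1 documents need a title")
    elif version == 2:
        # v2 added a mandatory 'date' field; everything v1 required still holds.
        if "title" not in new_doc or "date" not in new_doc:
            raise Forbidden("v2 documents need a title and a date")
    else:
        # A newer document has reached an older validator. Strictness vs
        # leniency is an application-level policy decision.
        if strict_unknown_versions:
            raise Forbidden(f"unknown schema_version {version}")
        # lenient: let it through and trust the newer schema

# Example: a v3 document arriving at this (v2-aware) validator.
validate({"schema_version": 3, "title": "x"}, strict_unknown_versions=False)
```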
I could well be overcomplicating this or missing out on some Zen insight, but painful experience has taught me that I'm not clever enough to predict what sorts of states concurrent systems can get themselves into.
EDIT:
As Marcello says in a comment, updates on individual servers have sequence numbers, and replication applies the updates in sequence number order. I had a vague idea that that was the case, but I'm still fuzzy on the details. I'm trying to find the simplest possible model that will give me an idea about what can and can't happen in a complex CouchDB system.
Suppose server A started off empty and has had three document writes made to it, so its state can be represented as the following string:
A1,A2,A3
Suppose server B also has three writes: B1,B2,B3
We replicate A to B, so the state of B is now: B1,B2,B3,A1,A2,A3. Although presumably the A updates have taken a sequence number on entering B, so the state is now: B1, B2, B3, B4(A1), B5(A2), B6(A3).
If I understand correctly, the replicator also makes a record of the fact that everything up to A3 has been replicated to B, and it happens to store this record as part of B's internal state, but I'm wondering if this is an implementation detail that can be disregarded in the simple model.
If you apply those rules, the A updates and the B updates stay in order on any server they're replicated to. Perhaps the only way they could get out of order is if you did something like: replicate A to B, delete A1 on A and A2 on B, replicate A to C, then replicate B to C, leaving a state on C of: A2, A3, B1, B2, B3, B4(A1).
Is this making any sense at all? Maybe strings aren't the right way of visualising it; maybe it's better to think of, I don't know, a bunch of queues (servers) in an airport, with airport staff (replicators) moving people from queue to queue according to certain rules, and put yourself into the mind of someone trying to skip the queue, i.e. somehow get into a queue ahead of someone who's in front of them in their current queue. That has the advantage of personalising the model, but we probably don't want to replicate people in airports.
Or maybe there's some way of explaining it as a Towers of Hanoi type game, although with FIFO queues instead of LIFO stacks.
It's a model I'm hoping to find - absolutely precise as far as behavior is concerned, all irrelevant implementation details stripped away, and using whatever metaphor or imagery makes it easiest to intuit.

The basic use case is simple. CouchDB uses sequence numbers to index database changes and to ask what changes need to be replicated. Order is implicit in this algorithm and what you fear should not happen. As a side note, the replication process only copies the last revision of a document, but this does not change anything about order.
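If it helps to see the mechanism spelled out, here is a toy model of sequence-number replication in Python. It is not CouchDB's implementation; the names (changes_since, pull_from, the checkpoint dictionary) are invented for the sketch, and it ignores revisions entirely, but it shows why order is implicit in the algorithm:

```python
# Toy model of sequence-number replication. Each server assigns its own
# monotonically increasing sequence numbers and a pull records a checkpoint
# per source so it only asks for changes it has not seen yet.

class Server:
    def __init__(self):
        self.seq = 0
        self.log = []          # list of (seq, doc_id, body), in write order
        self.checkpoints = {}  # per source: last sequence already pulled

    def write(self, doc_id, body):
        self.seq += 1
        self.log.append((self.seq, doc_id, body))

    def changes_since(self, seq):
        return [entry for entry in self.log if entry[0] > seq]

    def pull_from(self, source, source_name):
        last = self.checkpoints.get(source_name, 0)
        for seq, doc_id, body in source.changes_since(last):
            self.write(doc_id, body)      # applied in source sequence order
            last = seq
        self.checkpoints[source_name] = last

a, b = Server(), Server()
for i in range(1, 4):
    a.write(f"A{i}", {})
    b.write(f"B{i}", {})
b.pull_from(a, "A")   # B is now B1,B2,B3 followed by A1,A2,A3 (reassigned seqs 4-6)
print([doc for _, doc, _ in b.log])
```

Running it prints ['B1', 'B2', 'B3', 'A1', 'A2', 'A3'], which matches the B1, B2, B3, B4(A1), B5(A2), B6(A3) picture in the question's edit.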

Related

How does immutable data make eventual consistency trivial?

I have been reading Nathan Marz' article about how to beat the CAP theorem with the Lambda Architecture and don't understand how immutable data will make eventual consistency less complex.
The following paragraph is taken from the article:
The key is that data is immutable. Immutable data means there's no such thing as an update, so it's impossible for different replicas of a piece of data to become inconsistent. This means there are no divergent values, vector clocks, or read-repair. From the perspective of queries, a piece of data either exists or doesn't exist. There is just data and functions on that data. There's nothing you need to do to enforce eventual consistency, and eventual consistency does not get in the way of reasoning about the system.
Imagine the following example: I have a distributed insert-only database with two nodes A and B and both hold the record [timestamp=1; id=1; value=10]. Then at the same time, there is an insert against node A which results in [timestamp=2; id=1; value=20] and a read against node B for record with id=1.
How is solving the problem of eventual consistency less complex with that example than for databases with update possibility?
I'm not 100% sure I got it right, but I'll try to explain anyway.
Consider an example - you have 2 databases accepting writes/reads, connected with a network link. The link goes down, resulting in a network partition. We want our system to be CAP available, so we accept writes/reads in both databases.
When working with mutable data structures: suppose a client connected to the 1st database wants to update the value for record X to A, and another client, connected to the 2nd database, wants to update that value to B. Since our system is available, we accept both writes in both databases, but we will have to resolve the conflict once the network partition is gone. This will result in one of the updates being lost.
With immutable data structures, you wouldn't update the data but insert, so both writes would still be there after the network partition is gone. You'd still need some kind of time synchronization in order to preserve the operation order, which can be very tricky (see the comment in the article from Sebastien Diot).
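To make that concrete, here is a minimal sketch (my own illustration, assuming each node keeps an append-only list of (timestamp, id, value) facts and that timestamps are comparable across nodes, which is exactly the tricky part just mentioned):

```python
# With insert-only facts, both nodes keep accepting writes during a
# partition and a later merge is just a set union; nothing is overwritten.

def merge(facts_a, facts_b):
    return sorted(set(facts_a) | set(facts_b))        # union, ordered by timestamp

def current_value(facts, record_id):
    relevant = [f for f in facts if f[1] == record_id]
    return max(relevant)[2] if relevant else None     # latest timestamp wins

node_a = [(1, 1, 10)]                 # (timestamp, id, value)
node_b = [(1, 1, 10)]

node_a.append((2, 1, 20))             # write during the partition, only seen by A
print(current_value(node_b, 1))       # B still answers 10: stale, but consistent with its facts

merged = merge(node_a, node_b)        # partition heals: union the facts
print(current_value(merged, 1))       # 20; no conflict to resolve, nothing was lost
```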

How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server (or servers if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc).
This is obviously quite easy to develop the naive algorithm for, but I am trying to learn how to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and a very large number of nodes (500k+) and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information... which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that some companies remedy this by batch processing a ton of results and storing them with an index for later use... but since my graph is being constantly updated and users want the most up-to-date info, that is not a viable solution)
a large number of users requesting information, which will be quite a load on the servers since they have to process the graph that many times
How do I start facing these challenges? I looked at Hadoop and Spark, but they seem to have high-latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You have immediate information about when to roll back a transaction because of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.
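To illustrate the first (client-side) option, here is a minimal sketch of the split described above: the server only stores raw edges and pushes change notifications, and each client mirrors the raw graph and recomputes locally (here, a BFS shortest distance). All the class and method names are made up, and it ignores networking entirely:

```python
from collections import defaultdict, deque

class GraphServer:
    def __init__(self):
        self.edges = []
        self.subscribers = []

    def add_edge(self, u, v):
        self.edges.append((u, v))
        for client in self.subscribers:          # notify, don't compute
            client.on_edge_added(u, v)

class GraphClient:
    def __init__(self, server):
        self.adj = defaultdict(set)
        server.subscribers.append(self)
        for u, v in server.edges:                # initial sync of raw data
            self.on_edge_added(u, v)

    def on_edge_added(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def shortest_distance(self, src, dst):       # computed client-side
        seen, frontier = {src: 0}, deque([src])
        while frontier:
            node = frontier.popleft()
            if node == dst:
                return seen[node]
            for nbr in self.adj[node]:
                if nbr not in seen:
                    seen[nbr] = seen[node] + 1
                    frontier.append(nbr)
        return None

server = GraphServer()
client = GraphClient(server)
server.add_edge("a", "b")
server.add_edge("b", "c")
print(client.shortest_distance("a", "c"))   # 2
```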

If the Paxos algorithm is modified such that the acceptors accept the first value, or the most recent value, does the approach fail?

I've tried to reason about and understand whether the algorithm fails in these cases, but I can't seem to find an example where it would.
If it doesn't, then why isn't either of these approaches used?
Yes.
Don't forget that in later rounds, leaders may be proposing different values than in earlier rounds. Therefore the first message may have the wrong value.
Furthermore messages may arrive reordered. (Consider a node that goes offline, then comes back online to find messages coming in random order.) The most recent message may not be the most recently sent message.
And finally, don't forget that leaders change. The faster an acceptor can be convinced that it is on the wrong leader, the better.
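To see why arrival order can't matter, here is a simplified sketch of just the acceptor side of single-decree Paxos (not a full implementation, no proposers or learners): the acceptor compares proposal numbers, so "first received" and "most recently received" never enter into it.

```python
# Simplified single-decree Paxos acceptor. Nothing depends on arrival order
# or wall-clock time: only proposal numbers are compared.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (proposal_number, value) or None

    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)   # report any previously accepted value
        return ("nack", None)

    def on_accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

# Messages arriving "out of order": a stale proposal shows up after a newer one.
acc = Acceptor()
acc.on_prepare(5)
print(acc.on_accept(5, "B"))      # accepted
print(acc.on_accept(3, "A"))      # rejected: lower proposal number, even though it arrived later
```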
Rather than asking whether the algorithm fails in such a scenario, consider this: if each node sees different messages lost, delayed, or reordered, is it correct for a node to just accept the first one it happens to receive? Clearly the answer is no.
The algorithm is designed to work when "first" cannot be decided by looking at the timestamp on a message, as clocks on different machines may be out of sync. The algorithm is designed to work when the network paths, distances, and congestion may be different between nodes. Nodes may crash and restart, or hang and resume, making things even more "hostile".
So a five-node cluster could have two nodes trying to be leader, and the other three could each see a random ordering of which leader's message is "first". So what's the right answer in that case? The algorithm has a "correct" answer based on its rules, which ensures a consistent outcome under all "hostile" conditions.
In summary, the point of Paxos is that our intuitive mental model of "first" as programmers assumes a perfect set of clocks, machines, and networks. That doesn't exist in the real world. To see whether things break if you change the algorithm, you need to "attack" the message flow with all the things said above. You will likely find some way to "break" things under any change.

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision tree. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasibility of it, what is the best way to implement such a database? I made a quick prototype in SQL Server that stored a string representation of each state. It worked, but my solver client ran very, very slowly, as it pulled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know per state in your database is its game-theoretic value, i.e. whether it's a win for the player whose turn it is to move, a loss, or a forced draw. You need two bits to encode this information per state.
You then find as compact an encoding as possible for the set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 2^21 bits on your hard disk, i.e. 2^18 bytes. When you analyze an end-game position, you first check if the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then calculate the game-theoretic value of the original node using min/max and store it in the database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left to denote 'not known'; e.g. 00 = not known, 11 = draw, 10 = player to move wins, 01 = player to move loses.)
For example, consider tic-tac-toe. There are nine squares; every one can be empty, "X" or "O". This naive analysis gives you 3^9 = 2^14.26 states, i.e. 15 bits per state, so you would have an array of 2^16 bits.
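Here is a sketch of that two-bit array, assuming you supply your own encode(state), successors(state) and terminal_value(state) functions for the game (those names are placeholders), and using the encoding above (00 = not known, 01 = player to move loses, 10 = player to move wins, 11 = draw):

```python
# Two bits per state, packed into a bytearray, with a recursive min/max solve.

UNKNOWN, LOSS, WIN, DRAW = 0, 1, 2, 3

def make_db(num_states):
    return bytearray((num_states * 2 + 7) // 8)   # 2 bits per state

def get(db, index):
    byte, shift = divmod(index * 2, 8)
    return (db[byte] >> shift) & 0b11

def put(db, index, value):
    byte, shift = divmod(index * 2, 8)
    db[byte] = (db[byte] & ~(0b11 << shift)) | (value << shift)

def solve(state, db, encode, successors, terminal_value):
    idx = encode(state)
    cached = get(db, idx)
    if cached != UNKNOWN:
        return cached
    value = terminal_value(state)                 # WIN/LOSS/DRAW if game over, else None
    if value is None:
        child_values = [solve(s, db, encode, successors, terminal_value)
                        for s in successors(state)]
        if LOSS in child_values:                  # some move puts the opponent in a lost position
            value = WIN
        elif DRAW in child_values:
            value = DRAW
        else:
            value = LOSS                          # every move hands the opponent a win
    put(db, idx, value)
    return value
```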
You undoubtedly want a task queue service of some sort, such as RabbitMQ - probably in conjunction with a database which can store the data once you've calculated it. Alternately, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as adding the outcome of the item it just consumed to the database. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
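A rough sketch of that worker loop, with an in-memory deque and dict standing in for RabbitMQ/SQS and the shared database, and expand(state)/score(state) as placeholders for your own game logic:

```python
from collections import deque

def worker(queue, database, expand, score):
    while queue:
        state = queue.popleft()
        if state in database:              # duplicate path to a known state: skip it
            continue
        database[state] = score(state)     # may be None if not yet solvable
        for successor in expand(state):
            if successor not in database:  # cheap pre-check; the real check is above
                queue.append(successor)

# Usage sketch:
# queue = deque([initial_state]); database = {}
# worker(queue, database, expand=my_moves, score=my_terminal_score)
```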
The first thing that popped into my mind is the Linda-style of a shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding@home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "latent variables", because then you can use the EM algorithm (http://en.wikipedia.org/wiki/Em_algorithm) to hill-climb to promising solutions for this problem from random starts.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
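If it helps, here is a rough sketch of the EM loop for steps 1-5. I've modelled it as a plain two-component mixture rather than a full HMM, the parameter initialisations and the clamping are arbitrary choices, and reports[i][s] is 1 if source s listed candidate event i:

```python
# EM for source reliability: latent z_i in {real, bogus} per candidate event,
# per-source report probabilities a[s] (given real) and b[s] (given bogus).

def em_source_reliability(reports, n_iter=50):
    n_events, n_sources = len(reports), len(reports[0])
    p_true = 0.5                                   # prior that a candidate event is real
    a = [0.8] * n_sources                          # P(source reports | event real)
    b = [0.2] * n_sources                          # P(source reports | event bogus)

    for _ in range(n_iter):
        # E-step: posterior that each event is real, given who reported it
        resp = []
        for row in reports:
            like_true, like_bogus = p_true, 1 - p_true
            for s, x in enumerate(row):
                like_true *= a[s] if x else (1 - a[s])
                like_bogus *= b[s] if x else (1 - b[s])
            resp.append(like_true / (like_true + like_bogus))

        # M-step: re-estimate the prior and each source's report rates
        p_true = sum(resp) / n_events
        w_true = sum(resp)
        w_bogus = n_events - w_true
        for s in range(n_sources):
            hits_true = sum(r * row[s] for r, row in zip(resp, reports))
            hits_bogus = sum((1 - r) * row[s] for r, row in zip(resp, reports))
            a[s] = min(0.99, max(0.01, hits_true / w_true))     # clamp to avoid degeneracy
            b[s] = min(0.99, max(0.01, hits_bogus / w_bogus))

    return resp, a, b   # per-event P(real), per-source reliabilities

# Toy usage: three sources, four candidate events.
reports = [[1, 1, 1],   # reported everywhere: very likely real
           [1, 1, 0],
           [0, 0, 1],   # only one source has it
           [1, 0, 1]]
posteriors, a, b = em_source_reliability(reports)
print([round(r, 2) for r in posteriors])
```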
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though (a rough sketch of the probability-updating idea follows this list):
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August", and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but have a list with both dates and whatever probabilities you think appropriate...)
beware of assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
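As promised above, a rough sketch of the probability bookkeeping, treating each source as independent evidence combined in log-odds space; the per-source likelihood ratios are invented placeholders you would tune and backtest:

```python
import math

# How much more likely each source is to list a real event than a bogus one.
LIKELIHOOD_RATIO = {
    "official_venue_site": 50.0,   # highly accurate, rarely lists bogus events
    "lastfm_user_listing": 2.0,    # user-generated, noisier
    "unknown_blog": 1.2,
}

def confidence(sources_reporting, prior=0.3):
    log_odds = math.log(prior / (1 - prior))
    for source in sources_reporting:
        log_odds += math.log(LIKELIHOOD_RATIO[source])
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(round(confidence(["lastfm_user_listing"]), 2))                         # ~0.46
print(round(confidence(["lastfm_user_listing", "official_venue_site"]), 2))  # ~0.98
```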
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.
