Distributed algorithm design

Distributed algorithm design - algorithm

I've been reading Introduction to Algorithms and started to get a few ideas and questions popping up in my head. The one that's baffled me most is how you would approach designing an algorithm to schedule items/messages in a queue that is distributed.
My thoughts have lead me to browsing Wikipedia on topics such as Sorting,Message queues,Sheduling, Distributed hashtables, to name a few.
The scenario:
Say you wanted to have a system that queued messages (strings or some serialized object for example). A key feature of this system is to avoid any single point of failure. The system had to be distributed across multiple nodes within some cluster and had to consistently (or as best as possible) even the work load of each node within the cluster to avoid hotspots.
You want to avoid the use of a master/slave design for replication and scaling (no single point of failure). The system totally avoids writing to disc and maintains in memory data structures.
Since this is meant to be a queue of some sort the system should be able to use varying scheduling algorithms (FIFO,Earliest deadline,round robin etc...) to determine which message should be returned on the next request regardless of which node in the cluster the request is made to.
My initial thoughts
I can imagine how this would work on a single machine but when I start thinking about how you'd distribute something like this questions like:
How would I hash each message?
How would I know which node a message was sent to?
How would I schedule each item so that I can determine which message and from which node should be returned next?
I started reading about distributed hash tables and how projects like Apache Cassandra use some sort of consistent hashing to distribute data but then I thought, since the query won't supply a key I need to know where the next item is and just supply it...
This lead into reading about peer to peer protocols and how they approach the synchronization problem across nodes.
So my question is, how would you approach a problem like the one described above, or is this too far fetched and is simply a stupid idea...?
Just an overview, pointers,different approaches, pitfalls and benefits of each. The technologies/concepts/design/theory that may be appropriate. Basically anything that could be of use in understanding how something like this may work.
And if you're wondering, no I'm not intending to implement anything like this, its just popped into my head while reading (It happens, I get distracted by wild ideas when I read a good book).
UPDATE
Another interesting point that would become an issue is distributed deletes.I know systems like Cassandra have tackled this by implementing HintedHandoff,Read Repair and AntiEntropy and it seems to work work well but are there any other (viable and efficient) means of tackling this?

Overview, as you wanted
There are some popular techniques for distributed algorithms, e.g. using clocks, waves or general purpose routing algorithms.
You can find these in the great distributed algorithm books Introduction to distributed algorithms by Tel and Distributed Algorithms by Lynch.
Reductions
are particularly useful since general distributed algorithms can become quite complex. You might be able to use a reduction to a simpler, more specific case.
If, for instance, you want to avoid having a single point of failure, but a symmetric distributed algorithm is too complex, you can use the standard distributed algorithm of (leader) election and afterwards use a simpler asymmetric algorithm, i.e. one which can make use of a master.
Similarly, you can use synchronizers to transform a synchronous network model to an asynchronous one.
You can use snapshots to be able to analyze offline instead of having to deal with varying online process states.

Related

Is it possible to apply Machine Learning algorithm to predict Failure in large HPC systems based on a years of systematic data collection?

The provided CSV dataset categories look like the following:
DATE | Hardware Identifier | What Failed | Description of Failure | Action Taken
The complete data can be easily downloaded from the Dropbox service using this link: data.csv
The data is very systematic, the input is very consistent and nicely structured. This data comes from a Computer Failure Data Repository. Additional details can be found on this link at USENIX: PNNL
About the data:
There are somewhat little over 2800 entries of single failure events that were collected over 4 years. Each event is described by the exact date and time when the event took place, what Node in the system failed, what hardware component of that node failed.
About the system:
Consists of 980 nodes processing some heavy calculation for the Molecular Science Computing Facility. Each node is designated by its own, unique ID.
My question:
Is it possible to perform any meaningful Machine Learning technique on such dataset, that would, in the end, be capable of predicting future failures in the system?
For example, would it be possible to train the ML algorithm on the provided dataset in order to predict either:
What node might fail soon (based on Hardware Identifier field)
What (node-piece of hardware) combination might fail soon (based on Hardware Identifier and either What failed or Description of Failure field)
What kind of failure might occur next anywhere in the system (based on What Failed field)
To me, this sounds like a huge classification problem. For example, in the case of (node-piece of hardware that failed), there are several thousands of different possibilities (classes). Having in mind that there are only little over 2800 single failure events described in the table, I don't feel like this would work.
Also, I am confused about how I should feed the data into the algorithm. Should the only input to the algorithm be the DATE field (converted to numeric linear growing time)? That doesn't seem right. Is it possible to feed the algorithm somehow with the time variable combined with some history of recent failure events? Should I restructure data to feed the algorithm with time variable + failure history (that might be limited, for example, to the last 30 days, or to feed the whole failure history of the system)?
May I hear your opinion? Is it possible to train an algorithm from this dataset that could predict any of the above-mentioned failure events (like, i.e. What node will fail next) given some input of the system (I can only think of time as an input for now, but that sounds wrong).
Since I am just starting to get involved with the ML algorithms, my thinking on the topic is probably very narrow and limited, so please feel free to suggest if you feel I should take a completely different approach on this.

Before we go on, remember that these failures are generally considered fairly random, so any results you get will likely be fairly unreliable.
The main problem to consider, is that you have very little data compared to the amount of nodes, slightly less than 3 on average, which means that you have to use some incredibly simple models, that would not give you much advantage over a random guess, for you to even have any certainty in your variables (separate mean time between failure would not have a determinable error, if it is even calculateable). For this I would probably treat each node as a separate test point, and then train a tree based algorithm to try to predict when the last failure in the nodes sequence of failures is, but that also mean that it would only be applicable to a subset of the database. This might be able to vaguely predict whether the node will fall in the near future and what type it would most likely be, but it like be fairly close to the estimate of mean time to failure and most common failure for all nodes.
If you want some meaningful results, you will need to have some attributes of the nodes that you can do the machine learning on, such as hardware components and when they were installed, and then have that as input in the classification. Since the problem will likely behave fairly randomly, you would get more information from trying to solve the regression problem instead of the classification problem, since you can still get good precision on a probabilistic model, even though the classification itself would be highly uncertain.

Developing cluster apps

I'm not sure exactly where (or even how exactly to ask) this question, so I'm hoping someone here can point me in the right direction.
I have a service that I'm building. That service has different objects in memory - each with it's own state. Whenever an object is created it loads the state from the database and hold it. When changes are made to the object they are also persistent to the database.
I would like to scale this service. I have looked at solutions such as akka.net (actor model) and they have a clustering solution. From what I've read, it synchronizes the state with something they call "gossip" where each node sends the state to the other node. I'm not sure that it really possible to convert my working application to akka.net at this point.
I'm wondering exactly how clusters keep state synced between different nodes (I get the gossip concept), what happens if I have machine A that receives a message and at the same time, machine B also receives a message - both change the same state of an object - that will make problems with data integrity between states. My only thought about this is to lock a shared resource, but that defeats the purpose of the cluster.
Keeping state in the database is also not an option since the database becomes a bottleneck and a single point of failure.
I can't seem to find any relevant reading materials online - but I'm also lacking the technical phrases I need to focus on.
In case it's relevant, I'm using .NET Core and c# for development.
Can anyone explain the concept of clustering, how it works and make sure nodes are at sync? or can point to the right direction?

You have a big problem. I think that the way you are thinking about the problem is a bigger problem. Let's go through some basics.
Clustering is used to solve big problems, much like the "eat an elephant" problem. You could to solve this problem design a unique bigger predator with a huge mouth. But history and paleontology has shown us that big predators are not easily sustained (they are expensive on the environment).
So to solve your problem, you could take a bigger stronger server.
Or, you could use clustering.
Clustering solves the "eat the elephant" problem in a very different way. Instead of sending a unique huge predator with a huge mouth to eat the elephant, it will use a concept of distributed and shared processing to eat it one bite at a time. When done properly, ants could eat the elephant. If there are enough of them and the circumstances are correct.
But notice in my example, ants are very small... A single ant will never carry the entire elephant. You could carry the entire elephant if all the ants worked together but then you run into concurrency and locking problems (you must coordinate the ants).
Ants have shown us a much better way to deal with this. They will take a piece of the elephant and deal with the problem in smaller chunks.
In your system you ask how you can sync data across nodes... My question would be why? If you are syncing data then you are mirroring and your problem becomes even bigger (you are cloning the elephant but can only eat the original).
The solution to your problem is to rethink the solution and see if you can break down the problem into smaller pieces.
In Akka and in the Actor pattern the best way to deal with problems is to use smaller "processes" (a single ant). While the process on its own is almost useless, when used in a large scale they can become very powerful. When the architecture is properly done you will notice that taking a flamethrower to ants will not defeat them... More ants will come, they will continue to work on the problem.
Copying and syncing data is not your solution, clustering it is. You must take your data and break it down to a point where you can give it to a single ant. If you can do this then you can use Akka. If this approach seems ludicrous then Akka is not for you.
But consider this... You obviously have concerns over your database backend - you don't want to increase IO and introduce a single point of failure. I would have to agree with you. But you need to rethink things. You could have database mirroring to remove the single point of failure but you are correct that this won't remove the bottleneck. So let's say that mirror removes the single point of failure... Now let's attack the bottleneck portion.
If you can split up your data into small enough chunks that ants can handle it then I would urge you to tell your ants to only report to the database when the data changes... You can read it once on initialization (you need a backend store, don't kid yourself, electricity can be quickly lost... it must be saved somewhere) but if you tell your ants to persist only changed data then you will remove all the queries from the equation which will drastically shift where the load is coming from. Once you only have updates, inserts and deletes to deal with... the whole landscape will be much simpler.
Clustering should be the solution for you, but only if you can take the concept of mirror away from your mind.
Cluster nodes can and will crash... But they can be respawned elsewhere to other nodes, so that you always have a quick system. Only when you deal with a crash or loss of a node/worker process/ant will you have to reload data...
Good luck... you have outlined a formidable problem that I have seen people with software engineering degrees fail at solving.

How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server(servers if need be) that contains a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc).
This is obviously quite easy to develop the naive algorithm for, but then I am trying to learn to scale this so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and the possibility of handling a very large (500k +) nodes and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information which will be quite a load on the servers since it has to process the graph that many times
How do I start facing these challenges? I looked at hadoop and spark, but they seem have high latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-process that section of the graph (a kind of distributed dynamic programming approach), but im not sure how feasible that is.
Thanks!

How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean
"everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You have immediate information for when to roll back a transaction because it of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.

I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.

Differences between OT and CRDT

Can someone explain me simply the main differences between Operational Transform and CRDT?
As far as I understand, both are algorithms that permits data to converge without conflict on different nodes of a distributed system.
In which usecase would you use which algorithm?
As far as I understand, OT is mostly used for text and CRDT is more general and can handle more advanced structures right?
Is CRDT more powerful than OT?
I ask this question because I am trying to see how to implement a collaborative editor for HTML documents, and not sure in which direction to look first. I saw the ShareJS project, and their attempts to support rich text collaboration on the browser on contenteditables elements. Nowhere in ShareJS I see any attempt to use CRDT for that.
We also know that Google Docs is using OT and it's working pretty well for real-time edition of rich documents.
Is Google's choice of using OT because CRDT was not very known at that time? Or would it be a good choice today too?
I'm also interested to hear about other use cases too, like using these algorithms on databases. Riak seems to use CRDT. Can OT be used to sync nodes of a database too and be an alternative to Paxos/Zab/Raft?

Both approaches are similar in that they provide eventual consistency. The difference is in how they do it. One way of looking at it is:
OT does it by changing operations. Operations are sent over the wire and concurrent operations are transformed once they are received.
CRDTs do it by changing state. Operations are made on the local CRDT. Its state is sent over the wire and is merged with the state of a copy. It doesn't matter how many times or in what order merges are made - all copies converge.
You're right, OT is mostly used for text and does predate CRDTs but research shows that:
many OT algorithms in the literature do not satisfy convergence properties
unlike what was stated by their authors
In other words CRDT merging is commutative while OT transformation functions sometimes are not.
From the Wikipedia article on CRDT:
OTs are generally complex and non-scalable
There are different kinds of CRDTs (sets, counters, ...) suited for different kinds of problems. There are some that are designed for text editing. For example, Treedoc - A commutative replicated data type for cooperative editing.

One another notable difference is that:
OT requires a central server for co-ordination.
CRDT can adopt any network topology like P2P over WebRTC and it is resilient to network partitions, which makes it decentralized.
Reference: https://youtu.be/B5NULPSiOGw?t=643 by Martin Kleppmann, author of "Designing Data-Intensive Applications".

How p2p search engines could prevent corruption of distributed index by malicious peers?

As a hobby I'm writing simple and primitive distributed web search engine and it occurred to me it currently has no protection against malicious peers trying to skew search results.
Current architecture of the project is storing inverse index and ranking factors in kad dht with peers updating this inverse index as they crawl web.
I've used google scholar in attempt to find some solution but it seems most of the authors of proposed p2p web search ignore above-mentioned problem.
I think I need some kind of reputation system or trust metrics, but my knowledge in this domain is sufficiently lacking and I would very much appreciate a few pointers.

One way you could avoid this is to only use reliable nodes for storing and retrieving values. The reliability of a node will have to be computed by known-good nodes, and it could be something like the similarity of a node's last few computed ranking factors compared to the same ranking factors computed by known-good nodes (i.e. compare the node's scores for google.com to known-good scores for google.com). Using this approach, you'll need to avoid the "rogue reliable node" problem (for example, by using random checks or reducing all reliability scores randomly).
Another way you could approach this is to duplicate computation of ranking factors across multiple nodes, fetch all of the values at search time, and rank them on the client side (using variance, for example). You could also limit searches to sites that only have >10 duplicate values computed, so that there is some time before new sites are ranked. Additionally, any nodes with values outside of the normal range could be reported by the client in the background, and their reliability scores could be computed this way. This approach is time-consuming for the end user (unless you replicate known-good results to known-good nodes for faster lookups).
Also, take a look at this paper which describes a sybil-proof weak-trust system (which, as the authors explain, is more robust than the impossible sybil-proof strong-trust system): http://www.eecs.harvard.edu/econcs/pubs/Seuken_aamas14.pdf

The problem you are describing is Byzantine General’s problem or Byzantine Fault Tolerance. You can read more about it on wikipedia but there must be plenty of papers written about it.
I don’t remember the exact algorithm, but basically it’s mathematically proven that for t traitors (malicious peers) you will need 3*t + 1 peers in total, in order to detect the traitors.
My general thought would be, this is a huge overhead in implementation and resource waste on the indexing side, and while there is enough research to be done in distributed indexing and distributed search, not many people are tackling it yet. Also the problem has been basically solved with the Byzantine General’s it “just" needs to be implemented on top of an existing (and working) distributed search engine.

If you don't mind having a time delay on index updates, you could opt for a block-chain algorithm similar to what bitcoin uses to secure funds.
Changes to the index (deltas only!) can be represented in a text or binary file format, and crunched by peers who accept a given block of deltas. A malicious peer would have to out-compute the rest of the network for a period of time in order to skew the index in their favor.
I believe the bitcoin hashing algorithm (SHA-256) to be flawed in that custom hardware renders the common users' hardware useless. A block chain using the litecoin's algorithm (scrypt) would work well, because cpus and gpus are effective tools in the computation.
You would weigh the difficulty accordingly, so that news block are produced on a fairly regular schedule -- maybe 2-5 minutes. A user of the search engine could posibly choose to use the index at least 30 minutes old, to guarantee that enough users in the network vouch for its contents.
more info:
https://en.bitcoin.it/wiki/Block_chain
https://en.bitcoin.it/wiki/Block_hashing_algorithm
https://litecoin.info/block_hashing_algorithm
https://www.coinpursuit.com/pages/bitcoin-altcoin-SHA-256-scrypt-mining-algorithms/

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio