Distributed caching systems and how they distribute data

Distributed caching systems and how they distribute data - caching

I'm looking for information on things like ehcache and other alternatives to memcached for a project that will likely involve 3-4 webservers and something like 2-10 million distributed objects that need to be available to all servers.
Specifically, I'm trying to understand how other systems distribute data, whether or not memcached is unique in distributing data among multiple caches, or other caches perform similarly (that is, the property that a given key may exist on any of N servers, and the clients don't care, as opposed to updates on a single server propagating to other caches that essentially act as copies).
For example, in looking at documentation for things like ehcache it's not clear to me if by "distributed" they mean a strategy similar to memcached or something more like "replicated/synchronized".
Edit: although the refs on distributed computing are useful, I'm more interested in how specific implementations behave. e.g. will I be paying for synchronization overhead in some systems?

You are not extremely precise in your question, although I might see where you want to go this is a pretty large field in itself.
You might want to start here: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
Also having a look at Dynamo, BigTable, and all the theoritical questions associated with this (CAP theorem and the presentation by Werner Vogels on this that you can find on infoq).
You have more and more information about this thanks to the multiple videos found about the NoSQL meetups.
Hope it helps,
Edit: about the synchronization overheads, it really depends on the system. every system has specific requirements, Dynamo for example aims at a high availability system that might not be always fully consistent (eventual consistency), so it is meant (by design and because of its requirements) to be a distributed systems in which every write must be accepted and fast. Other systems might behave differently,

I suspect you are after a discussion on consistency across "distributed data". This topic is vast but a good reference on the trade-offs is available here.
In other words, it pretty much depends on your requirements (which aren't very detailed here). If I have misunderstood your question, you can safely disregard my contribution ;-)

The feature or property you are probably looking for is a "shared nothing" architecture. Memcached is an example, e. g. there is no single point of failure, no synchronization or any other traffic between nodes, nodes don't even know each other.
So if this is what you want and you're evaluating a product/project, look for the "shared nothing" term. If it is not mentioned on the first screen, it probably is not a shared nothing architecture ;)

Related

Apache Beam and ETL processes

Given following processes:
manually transforming huge .csv's files via rules (using MS excel or excel like software) & sharing them via ftp
scripts (usually written in Perl or Python) which basically transform data preparing them for other processes.
API's batch reading from files or other origin sources & updating their corresponding data model.
Springboot deployments used (or abused) to in part regularly collect & aggregate data from files or other sources.
And given these problems/ areas of improvement:
Standardization: I'd like to (as far as it makes sense), to propose a unified powerful tool that natively deals with these types of (kind of big) data transformation workflows.
Rising the abstraction level of the processes (related to the point above): Many of the "tasks/jobs" I mentioned above, are seen by the teams using them, in a very technical low level task-like way. I believe having a higher level view of these processes/flows highlighting their business meaning would help self document these processes better, and would also help to establish a ubiquitous language different stakeholders can refer to and think unambiguously about.
IO bottlenecks and resource utilization (technical): Some of those processes do fail more often that what would be desirable, (or take a very long time to finish) due to some memory or network bottleneck. Though it is clear that hardware has limits, resource utilization doesn't seem to have been a priority in many of these data transformation scripts.
Do the Dataflow model and specifically the Apache Beam implementation paired with either Flink or Google Cloud Dataflow as a backend runner, offer a proven solution to those "mundane" topics? The material on the internet mainly focuses on discussing the unified streaming/batch model and also typically cover more advanced features like streaming/event windowing/watermarks/late events/etc, which do look very elegant and promising indeed, but I have some concerns regarding tool maturity and community long term support.

It's hard to give a concrete answer to such a broad question, but I would say that, yes, Beam/Dataflow is a tool that handle this kind of thing. Even though the documentation focuses on "advanced" features like windowing and streaming, lots of people are using it for more "mundane" ETL. For questions about tool maturity and community you could consider sources like Forrester reports that often speak of Dataflow.
You may also want to consider pairing it with other technologies like Arflow/Composer.

How to represent data in an efficient way ? (Graphically Talking)

Before going for further reading, just to let you know this question is vague and do not need one precise answer. To the contrary more answer I get better it will be for me.
The question is : How to represent data in an efficient way ?
I am not talking about representing data into a database or any language.
I am talking about when a program, a report, a page needs to be shown to a user (Static - report- and Dynamic - web pages -) how one should represent the data in order to the user to catch as many information as possible from - almost - the first look. Is there any best-practices, pitfalls to avoid and stuff ?
Edit: Any book/link that can help or that treat about this subject are welcome.

"how one should represent the data in order to the user to catch as many information as
possible from - almost - the first look."
To me, this screams that you need to be speaking to your end-users more. My suggestion would be to mock up the initial layout using something like Balsamiq Mockups (This can be done even if it's a public facing site). Using the mockups will help you visualise the design of the overall page.
"First-look" type views indicates a dashboard which provide overall, high level results.
Now, just to be clear, this is the design and layout of the page and don't confuse this with any web UI tools eg JqueryUI that bring fancy effects to the page.
In terms of links, my suggestion would be thoroughly read through Designing User Interfaces For Business Web Applications from Smashing Magazine (incl. the related links). The one that is probably most relevant is 12 Standard Screen Patterns.
It is a brilliant read and should be, IMO, added to your saved bookmarks.

Effectiveness is always matter then efficiency. Before I express my opinions, I suppose that your question already based on effective solution from user's perspective.
First, data retrieving is about the storage of computer system. If your data can reside totally in the fastest storage(like main memory), keeping data in it is a better strategy than others. But the problem about performance issue is mostly because of non-enough main memories, so the data should be retrived from secondary storages(the slower one) and replace other data in main memory, and produce what you want. So you have to deal with multi-level storage systems.
Second, when you are dealing with multi-level storage systems(as most computer systems), the efficiency ways depend on how much the reductions of access in secondary storages. It's not noly about the gain in loading data from slower storage to faster one, but also, there are sacrifices that the data get kicked out.
In XML, DOM and SAX are two extremities of dealing with multi-level storage systems. In database systems, fully cached indexes are a good solution for performance(when indexes are small enough). In operating systems, file cache is alwasy the one of the most challenging things in computer science.
You can pre-calculating some data before required. You can using more efficient data structures to improve retriving data. You can rudely allocating more main memories to your application. You can... well, buying more memory modules or SSD. Whatever solutions you choose, it's definitely art of fusion in computer science.
Algorithms, data structues, database systems, operating systems, even theories of compilers, these hard metals can help you build a sword which kicks the dragon's ass.

Getting started with massive data

I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!

Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is build in Java, so this can be the language of choice. But it is possible to use ohter languages with Hadoop as well ("pipes" and "streams"). I know, that Python is often used for example.
You can avoid having your data in data bases, if you like to. Originally, Hadoop works with data on the (distributed) file system. But as you already seem to know, there are distributed data bases for Hadoop available.
Did you ever had a look an Mahout? I think that would be a hit for you ;-) Many work you need, may already had been done!?
Read the Quick Start and set up your own (pseudo-distributed?) cluster and run the word-count example.
Let me know, if you have any questions :-) A comment will remind me on this question.

I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistics issues at large scales. Can you load all your data into memory? With Java and a 64 bit JVM you can access as much RAM as you have: for example, command line parameter -Xmx8192M will give you access to 8GB (if you have that much). Matlab, being a Java application, can also benefit from this and work with fairly large datasets.
More importantly, the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working set approach yourself, where you swap data in and out to the disk, and only work on a portion of data at a time. These are sometimes referred to as chunking, batch or even incremental algorithms, depending on the context.
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.

Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java so it's unlikely that it's going to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy but it seems most math calculations are very parallelizable, with little need of threads sharing memory. Either way, you might want to check out Incanter, which is Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to some shootout cited by Peter Norvig (scroll down to the colorful table, you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data while SQL/RDBMS is great (or at least tolerable) for structured data. Changing/adding fields is expensive in RDBMS so if that's going to happen alot, you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer as opposed to having data around that you will periodically ask questions about? You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.

Relation between language and scalability

I came across the following statement in Trapexit, an Erlang community website:
Erlang is a programming language used
to build massively scalable soft
real-time systems with requirements on
high availability.
Also I recall reading somewhere that Twitter switched from Ruby to Scala to address scalability problem.
Hence, I wonder what is the relation between a programming language and scalability?
I would think that scalability depends only on the system design, exception handling etc. Is it because of the way a language is implemented, the libraries, or some other reasons?
Hope for enlightenment. Thanks.

Erlang is highly optimized for a telecommunications environment, running at 5 9s uptime or so.
It contains a set of libraries called OTP, and it is possible to reload code into the application 'on the fly' without shutting down the application! In addition, there is a framework of supervisor modules and so on, so that when something fails, it gets automatically restarted, or else the failure can gradually work itself up the chain until it gets to a supervisor module that can deal with it.
That would be possible in other languages of course too. In C++, you can reload dlls on the fly, load plugsin. In Python you can reload modules. In C#, you can load code in on-the-fly, use reflection and so on.
It's just that that functionality is built in to Erlang, which means that:
it's more standard, any erlang developer knows how it works
less stuff to re-implement oneself
That said, there are some fundamental differences between languages, to the extent that some are interpreted, some run off bytecode, some are native compiled, so the performance, and the availability of type information and so on at runtime differs.
Python has a global interpreter lock around its runtime library so cannot make use of SMP.
Erlang only recently had changes added to take advantage of SMP.
Generally I would agree with you in that I feel that a significant difference is down to the built-in libraries rather than a fundamental difference between the languages themselves.
Ultimately I feel that any project that gets very large risks getting 'bogged down' no matter what language it is written in. As you say I feel architecture and design are pretty fundamental to scalability and choosing one language over another will not I feel magically give awesome scalability...

Erlang comes from another culture in thinking about reliability and how to achieve it. Understanding the culture is important, since Erlang code does not become fault-tolerant by magic just because its Erlang.
A fundamental idea is that high uptime does not only come from a very long mean-time-between-failures, it also comes from a very short mean-time-to-recovery, if a failure happened.
One then realize that one need automatic restarts when a failure is detected. And one realize that at the first detection of something not being quite right then one should "crash" to cause a restart. The recovery needs to be optimized, and the possible information losses need to be minimal.
This strategy is followed by many successful softwares, such as journaling filesystems or transaction-logging databases. But overwhelmingly, software tends to only consider the mean-time-between-failure and send messages to the system log about error-indications then try to keep on running until it is not possible anymore. Typically requiring human monitoring the system and manually reboot.
Most of these strategies are in the form of libraries in Erlang. The part that is a language feature is that processes can "link" and "monitor" each other. The first one is a bi-directional contract that "if you crash, then I get your crash message, which if not trapped will crash me", and the second is a "if you crash, i get a message about it".
Linking and monitoring are the mechanisms that the libraries use to make sure that other processes have not crashed (yet). Processes are organized into "supervision" trees. If a worker process in the tree fails, the supervisor will attempt to restart it, or all workers at the same level of that branch in the tree. If that fails it will escalate up, etc. If the top level supervisor gives up the application crashes and the virtual machine quits, at which point the system operator should make the computer restart.
The complete isolation between process heaps is another reason Erlang fares well. With few exceptions, it is not possible to "share values" between processes. This means that all processes are very self-contained and are often not affected by another process crashing. This property also holds between nodes in an Erlang cluster, so it is low-risk to handle a node failing out of the cluster. Replicate and send out change events rather than have a single point of failure.
The philosophies adopted by Erlang has many names, "fail fast", "crash-only system", "recovery oriented programming", "expose errors", "micro-restarts", "replication", ...

Erlang is a language designed with concurrency in mind. While most languages depend on the OS for multi-threading, concurrency is built into Erlang. Erlang programs can be made from thousands to millions of extremely lightweight processes that can run on a single processor, can run on a multicore processor, or can run on a network of processors. Erlang also has language level support for message passing between processes, fault-tolerance etc. The core of Erlang is a functional language and functional programming is the best paradigm for building concurrent systems.
In short, making a distributed, reliable and scalable system in Erlang is easy as it is a language designed specially for that purpose.

In short, the "language" primarily affects the vertical axii of scaling but not all aspects as you already eluded to in your question. Two things here:
1) Scalability needs to be defined in relation to a tangible metric. I propose money.
S = # of users / cost
Without an adequate definition, we will discussing this point ad vitam eternam. Using my proposed definition, it becomes easier to compare system implementations. For a system to be scalable (read: profitable), then:
Scalability grows with S
2) A system can be made to scale based on 2 primary axis:
a) Vertical
b) Horizontal
a) Vertical scaling relates to enhancing nodes in isolation i.e. bigger server, more RAM etc.
b) Horizontal scaling relates to enhancing a system by adding nodes. This process is more involving since it requires dealing with real world properties such as speed of light (latency), tolerance to partition, failures of many kinds etc.
(Node => physical separation, different "fate sharing" from another)
The term scalability is too often abused unfortunately.
Too many times folks confuse language with libraries & implementation. These are all different things. What makes a language a good fit for a particular system has often more to do with the support around the said language: libraries, development tools, efficiency of the implementation (i.e. memory footprint, performance of builtin functions etc.)
In the case of Erlang, it just happens to have been designed with real world constraints (e.g. distributed environment, failures, need for availability to meet liquidated damages exposure etc.) as input requirements.
Anyways, I could go on for too long here.

First you have to distinguish between languages and their implementations. For instance ruby language supports threads, but in the official implementation, the thread will not make use of multicore chips.
Then, a language/implementation/algorithm is often termed scalable when it supports parallel computation (for instance via multithread) AND if it exhibits a good speedup increase when the number of CPU goes up (see Amdahl Law).
Some languages like Erlang, Scala, Oz etc. have also syntax (or nice library) which help writing clear and nice parallel code.

In addition to the points made here about Erlang (Which I was not aware of) there is a sense in which some languages are more suited for scripting and smaller tasks.
Languages like ruby and python have some features which are great for prototyping and creativity but terrible for large scale projects. Arguably their best features are their lack of "formality", which hurts you in large projects.
For example, static typing is a hassle on small script-type things, and makes languages like java very verbose. But on a project with hundreds or thousands of classes you can easily see variable types. Compare this to maps and arrays that can hold heterogeneous collections, where as a consumer of a class you can't easily tell what kind of data it's holding. This kind of thing gets compounded as systems get larger. e.g. You can also do things that are really difficult to trace, like dynamically add bits to classes at runtime (which can be fun but is a nightmare if you're trying to figure out where a piece of data comes from) or call methods that raise exceptions without being forced by the compiler to declare the exception. Not that you couldn't solve these kinds of things with good design and disciplined programming - it's just harder to do.
As an extreme case, you could (performance issues aside) build a large system out of shell scripts, and you could probably deal with some of the issues of the messiness, lack of typing and global variables by being very strict and careful with coding and naming conventions ( in which case you'd sort of be creating a static typing system "by convention"), but it wouldn't be a fun exercise.

Twitter switched some parts of their architecture from Ruby to Scala because when they started they used the wrong tool for the job. They were using Ruby on Rails—which is highly optimised for building green field CRUD Web applications—to try to build a messaging system. AFAIK, they're still using Rails for the CRUD parts of Twitter e.g. creating a new user account, but have moved the messaging components to more suitable technologies.

Erlang is at its core based on asynchronous communication (both for co-located and distributed interactions), and that is the key to the scalability made possible by the platform. You can program with asynchronous communication on many platforms, but Erlang the language and the Erlang/OTP framework provides the structure to make it manageable - both technically and in your head. For instance: Without the isolation provided by erlang processes, you will shoot yourself in the foot. With the link/monitor mechanism you can react on failures sooner.

What are the faster Paxos-related algorithms for consensus in distributed systems?

I've read Lamport's paper on Paxos. I've also heard that it isn't used much in practice, for reasons of performance. What algorithms are commonly used for consensus in distributed systems?

Not sure if this is helpful (since this is not from actual production information), but in our "distributed systems" course we've studied, along with Paxos, the Chandra-Toueg and Mostefaoui-Raynal algorithms (of the latter our professor was especially fond).

Check out the Raft algorithm for a consensus algorithm that is optimized for ease of understanding and clarity of implementation. Oh... it is pretty fast as well.
https://ramcloud.stanford.edu/wiki/display/logcabin/LogCabin
https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

If performance is an issue, consider whether you need all of the strong consistency guarantees Paxos gives you. See e.g. http://queue.acm.org/detail.cfm?id=1466448 and http://incubator.apache.org/cassandra/. Searching on Paxos optimised gets me hits, but I suspect that relaxing some of the requirements will buy you more than tuning the protocol.

The Paxos system I run (which supports really, really big web sites) is halfway in-between Basic-Paxos Multi-paxos. I plan on moving it to a full Multi-Paxos implementation.
Paxos isn't that great as a high-throughput data storage system, but it excels in supporting those systems by providing leader election. For example, say you have a replicated data store where you want a single master for performance reasons. Your data store nodes will use the Paxos system to choose the master.
Like Google Chubby, my system is run as a service and can also store data as configuration container. (I use configuration loosely; I hear Google uses Chubby for DNS.) This data doesn't change as often as user input so it doesn't need high throughput write SLAs. Reading, on the other hand, is extremely quick because it is fully replicated and you can read from any node.
Update
Since writing this, I have upgraded my Paxos system. I am now using a chain-consensus protocol as the primary consensus system. The chain system still utilizes Basic-Paxos for re-configuration—including notifying chain nodes when the chain membership changes.

Paxos is optimal in terms of performance of consensus protocols, at least in terms of the number of network delays (which is often the dominating factor). It's clearly not possible to reliably achieve consensus while tolerating up to f failures without a single round-trip communication to at least (f-1) other nodes in between a client request and the corresponding confirmation, and Paxos achieves this lower bound. This gives a hard bound on the latency of each request to a consensus-based protocol regardless of implementation. In particular, Raft, Zab, Viewstamped Replication and all other variants on consensus protocols all have the same performance constraint.
One thing that can be improved from standard Paxos (also Raft, Zab, ...) is that there is a distinguished leader which ends up doing more than its fair share of the work and may therefore end up being a bit of a bottleneck. There is a protocol known as Egalitarian Paxos which spreads the load out across multiple leaders, although it's mindbendingly complicated IMO, is only applicable to certain domains, and still must obey the lower bound on the number of round-trips within each request. See the paper "There Is More Consensus in Egalitarian Parliaments" by Moraru et al for more details.
When you hear that Paxos is rarely used due to its poor performance, it is frequently meant that consensus itself is rarely used due to poor performance, and this is a fair criticism: it is possible to achieve much higher performance if you can avoid the need for consensus-based coordination between nodes as much as possible, because this allows for horizontal scalability.
Snarkily, it's also possible to achieve better performance by claiming to be using a proper consensus protocol but actually doing something that fails in some cases. Aphyr's blog is littered with examples of these failures not being as rare as you might like, where database implementations have either introduced bugs into good consensus-like protocols by way of "optimisation", or else developed custom consensus-like protocols that fail to be fully correct in some subtle fashion. This stuff is hard.

You should check the Apache Zookeeper project. It is used in production by Yahoo! and Facebook among others.
http://hadoop.apache.org/zookeeper/
If you look for academic papers describing it, it is described in a paper at usenix ATC'10. The consensus protocol (a variant of Paxos) is described in a paper at DSN'11.

Google documented how they did fast paxos for their megastore in the following paper: Link.

With Multi-Paxos when the leader is galloping it can respond to the client write when it has heard that the majority of nodes have written the value to disk. This is as good and efficient as you can get to maintain the consistency guarantees that Paxos makes.
Typically though people use something paxos-like such as zookeeper as an external service (dedicated cluster) to keep critical information consistent (who has locked what, who is leader, who is in a cluster, what's the configuration of the cluster) then run a less strict algorithm with less consistency guarantees which relies upon application specifics (eg vector clocks and merged siblings). The short ebook distributed systems for fun and profit as a good overview of the alternatives.
Note that lots of databases compete on speed by using risky defaults which risk consistency and can loose data under network partitions. The Aphry blog series on Jepson shows whether well know opensouce systems loose data. One cannot cheat the CAP Theorem; if you configure systems for safety then they end up doing about the same messaging and same disk writes as paxos. So really you cannot say Paxos is slow you have to say "a part of a system which needs consistency under network partitions requires a minimum number of messages and disk flushes per operation and that is slow".

There are two general blockchain consensus systems:
Those that produce unambiguous 100% finality given a defined set of
validators
Those which do not provide 100% finality but instead
rely on high probability of finality.
The first generation blockchain consensus algorithms (Proof of Work, Proof of Stake, and BitShares’ Delegated Proof of Stake) only offer high probability of finality that grows with time. In theory someone could pay enough money to mine an alternative “longer” Bitcoin blockchain that goes all the way back to genesis.
More recent consensus algorithms, whether HashGraph, Casper, Tendermint, or DPOS BFT all adopt long-established principles of Paxos and related consensus algorithms. Under these models it is possible to reach unambiguous finality under all network conditions so long as more than ⅔ of participants are honest.
Objective and unambiguous 100% finality is a critical property for all blockchains that wish to support inter-blockchain communication. Absent 100% finality, a reversion on one chain could have irreconcilable ripple effects across all interconnected chains.
The abstract protocol for these more recent protocols involves:
Propose block
All participants acknowledge block (pre-commitment)
All participants acknowledge when ⅔+ have sent them pre-commitments
(commitment)
A block is final once a node has received ⅔+ commitments
Unanimous agreement on finality is guaranteed unless ⅓+
are bad and evidence of bad behavior is available to all
It is the technical differences in the protocols that give rise to real-world impact on user experience. This includes things such as latency until finality, degrees of finality, bandwidth, and proof generation / validation overhead.
Look for more details on delegated proof of stake by eos here

Raft is more understandable, and faster alternative of Paxos. One of the most popular distributed systems which uses Raft is Etcd. Etcd is the distributed store used in Kubernetes.
It's equivalent to Paxos in fault-tolerance.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio