Can someone explain to me, simply, the main differences between Operational Transform and CRDT?
As far as I understand, both are algorithms that allow data to converge without conflict across different nodes of a distributed system.
In which use case would you use which algorithm?
As far as I understand, OT is mostly used for text and CRDT is more general and can handle more advanced structures, right?
Is CRDT more powerful than OT?
I'm asking because I am trying to figure out how to implement a collaborative editor for HTML documents, and I'm not sure which direction to look in first. I saw the ShareJS project and its attempts to support rich-text collaboration in the browser on contenteditable elements. Nowhere in ShareJS do I see any attempt to use CRDTs for that.
We also know that Google Docs uses OT, and it works pretty well for real-time editing of rich documents.
Did Google choose OT because CRDTs were not well known at the time, or would OT still be a good choice today?
I'm also interested in other use cases, like using these algorithms in databases. Riak seems to use CRDTs. Can OT also be used to sync the nodes of a database, as an alternative to Paxos/Zab/Raft?
Both approaches are similar in that they provide eventual consistency. The difference is in how they do it. One way of looking at it is:
OT does it by transforming operations. Operations are sent over the wire, and concurrent operations are transformed once they are received.
CRDTs do it by merging states. Operations are applied to the local CRDT. Its state is sent over the wire and merged with the state of any other copy. It doesn't matter how many times or in what order merges are made: all copies converge.
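To make the "merge in any order" point concrete, here is a minimal sketch of one of the simplest state-based CRDTs, a grow-only counter; the class and method names are illustrative and not taken from any particular library.

```python
# Minimal sketch of a state-based CRDT: a grow-only counter (G-Counter).
# Names are illustrative, not from any particular CRDT library.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # per-node counters; this is the state that gets shipped

    def increment(self, n=1):
        # Operations are applied locally; no coordination is needed.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Merge is an element-wise max: commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5, no matter how often or in what order you merge
```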
You're right, OT is mostly used for text and does predate CRDTs, but research shows that:
many OT algorithms in the literature do not satisfy convergence properties,
unlike what was stated by their authors
In other words, CRDT merging is commutative, while OT transformation functions sometimes are not.
From the Wikipedia article on CRDT:
OTs are generally complex and non-scalable
There are different kinds of CRDTs (sets, counters, ...) suited for different kinds of problems. There are some that are designed for text editing. For example, Treedoc - A commutative replicated data type for cooperative editing.
Another notable difference is that:
OT generally requires a central server for coordination.
CRDTs can work over any network topology, such as P2P over WebRTC, and are resilient to network partitions, which makes decentralized operation possible.
Reference: https://youtu.be/B5NULPSiOGw?t=643 by Martin Kleppmann, author of "Designing Data-Intensive Applications".
I want to build a recommendation system, and the target is to deal with a really big data set, like 1 TB of data.
Each user has a really huge number of items; however, the number of users is small, like thousands or tens of thousands.
I have searched on Google and found some open-source recommendation engines based on Hadoop, like Mahout. I guess they may be able to deal with such big data, but I'm not sure.
I also found some engines written in C++, Python, even PHP. I don't think scripting languages can deal with such big data, because memory can't hold the whole dataset.
Or am I wrong? Could someone give me some recommendations?
Your question title is:
Which opensource recommendation system should I choose to deal with
big dataset?
and in the first line you say
I want to build a recommendation system, and the target is to deal with a really big data set, like 1 TB of data.
And you are asking for a recommendation as an answer.
To answer your second question first: in my experience of building recommender systems, I would advise you not to build one from the ground up if you can avoid it. Recommender systems are complex and can use a wide range of techniques to provide a user with a recommendation. So my recommendation is: unless you are really committed, and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering, look to implement an existing recommender system rather than building your own.
In terms of which open source recommender system you should choose, this is actually pretty difficult to answer with great accuracy. Let me try to answer this by breaking it down.
Consider the open source license, its restrictions and your requirements.
Consider which algorithm you want to use to make recommendations
Consider the environment you will be running your recommender system on.
I recommend you look more into the algorithm side, as it will be the determining factor in which tool you can use, or whether you will need to roll your own. Start reading here http://www.ibm.com/developerworks/library/os-recommender1/ for a very brief insight into the different approaches that recommender systems use. In summary, the different approaches are:
Content based
Neighbourhood / Collaborative filtering based
Constraint based
Graph-based
In your case to keep things relatively straightforward it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:
Neighbourhood Collaborative Filtering is quite intuitive to understand and it can be relatively easy to implement.
With this method you can also justify your recommendations to your users in a basic way
There is no requirement to build a model for training, and the processing of neighbours can be done "offline", to provide quick recommendations to the end user.
Storing neighbours is actually quite memory efficient, which means better scalability. Something it sounds like you will need lots of.
The user-based part of my suggestion is because it sounds like you have fewer users than you do items. In user-based nearest-neighbour collaborative filtering, the predicted rating of a new item I for user U is calculated by looking at the other users who have also rated item I and are most similar to user U. Because you have fewer users than items in your system, it will be faster to compute user-based collaborative filtering than item-based collaborative filtering.
Within user-based collaborative filtering you need to consider what rating normalisation (mean-centering vs z-score) you want to use, the similarity weight computation method (e.g. cosine vs Pearson correlation vs other similarity measures), the neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any dimensionality reduction methods (SVD, SVD++) you want to implement (with a large dataset like yours you will want to seriously consider dimensionality reduction).
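To make that concrete, here is a toy sketch of user-based collaborative filtering with mean-centring and cosine similarity; the ratings data, function names and the choice of k are all made up for illustration, and a real system would use Mahout or similar rather than this.

```python
import math

# Toy in-memory ratings {user: {item: rating}}; purely illustrative data.
ratings = {
    "u1": {"i1": 4, "i2": 1, "i3": 5},
    "u2": {"i1": 5, "i2": 2},
    "u3": {"i2": 5, "i3": 1},
}

def mean(user):
    r = ratings[user]
    return sum(r.values()) / len(r)

def cosine(u, v):
    # Cosine similarity over the items both users have rated.
    common = set(ratings[u]) & set(ratings[v])
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (math.sqrt(sum(ratings[u][i] ** 2 for i in common)) *
           math.sqrt(sum(ratings[v][i] ** 2 for i in common)))
    return num / den if den else 0.0

def predict(user, item, k=10):
    # Mean-centred, similarity-weighted average over the k most similar
    # neighbours who have rated the item.
    neighbours = sorted(((cosine(user, v), v) for v in ratings
                         if v != user and item in ratings[v]), reverse=True)[:k]
    den = sum(abs(sim) for sim, _ in neighbours)
    if not den:
        return mean(user)
    num = sum(sim * (ratings[v][item] - mean(v)) for sim, v in neighbours)
    return mean(user) + num / den

print(predict("u2", "i3"))
```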
So really, instead of looking for an open source tool that will be able to process your data set, you should consider your algorithm choice first, then look for a tool that has an implementation of this algorithm, and then assess whether it can process the volume involved in your dataset.
Having said all of that, if you do choose to go down the user-based collaborative filtering route then I am confident that Apache Mahout will be able to solve your problem, and if not, it will certainly help you understand the complexity involved in building your own (just look at their source code).
Please note: the advice is really to consider the algorithm choice first. "Good" recommender systems are so much more than just being able to process a large dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk, user trust, and finally scalability. You should also consider how you are going to perform experiments and evaluate your recommendations; remember, if the recommendations you are churning out are rubbish and they are turning your users off, then there is no point in having a recommender system!
It is such a big area with lots to think about that there is probably no single tool that is going to help you with everything, so be prepared to do a lot of reading and research, as well as trying out lots of different open source tools.
With that said, start by looking at Apache Mahout. Going back to the breakdown of the three areas I said you should think about:
It has a commercial-friendly open-source license,
it has really great implementations of the algorithms you are likely to need, and
it can work on distributed environments (read scalable).
Hope that helps, and good luck.
I understand almost all of the interaction types specified by the SCORM data model element cmi.interactions.n.type (true_false, multiple_choice, fill_in, long_fill_in, matching, performance, sequencing, likert, numeric, other); what remains is to understand the performance type. I found an explanation by Ostyn, but it remains ambiguous.
The Performance interaction is the most flexible and rich of the
standard interaction types in SCORM. It allows the capture of a number
of arbitrary steps performed by a learner, along with information
about every step. (Claud Ostyn)
AFAIK it does exactly that, i.e. it stores arbitrary data related to an ambiguous, non-standard interaction (e.g. a 3D simulation). LMSs are not supposed to do anything with interaction data anyway, at least not regarding completion and grading, so it is mostly used by instructional designers who need deeper insight into what the learners are doing so they can adjust the training, e.g. exercise difficulty.
If you have a list of texts and a person interested in certain topics, what algorithms deal with choosing the most relevant texts for that person?
I believe this is quite a complex topic, so as an answer I expect a few directions for studying the various methodologies of text analysis, text statistics, artificial intelligence, etc.
Thank you.
There are quite a few algorithms out there for this task. At least way too many to mention them all here. First some starting points:
Topic discovery and recommendation are two quite distinct tasks, although they often overlap. If you have a stable user base, you might be able to give very good recommendations without any topic discovery.
Discovering topics and assigning names to them are also two different tasks. This means it is often easier to tell that text A and text B share a similar topic than to explicitly state what this common topic might be. Giving names to the topics is best done by humans, for example by having them tag the items.
Now to some actual examples.
TF-IDF is often a good starting point; however, it also has severe drawbacks. For example, it will not be able to tell that "car" in one text and "truck" in another mean that the two texts probably share a topic. (A small TF-IDF sketch follows after these examples.)
http://websom.hut.fi/websom/ A Kohonen map for automatically clustering data. It learns the topics and then organizes the texts by topics.
Latent Semantic Analysis (http://de.wikipedia.org/wiki/Latent_Semantic_Analysis) can boost TF-IDF by detecting semantic similarity between different words. Also note that this has been patented, so you might not be able to use it.
Once you have a set of topics assigned by users or experts, you can also try almost any kind of machine learning method (for example SVM) to map the TF-IDF data to topics.
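To make the TF-IDF starting point above concrete (as promised), here is a hand-rolled sketch that ranks a few toy texts against a person's interest profile using TF-IDF and cosine similarity; the corpus, the profile string and the smoothing are all made up, and in practice you would use a library (e.g. Lucene/Solr or scikit-learn) instead.

```python
import math
from collections import Counter

docs = {
    "d1": "the red car drove down the road",
    "d2": "a truck drove down the highway",
    "d3": "the stock market fell sharply today",
}
profile = "car and truck news"  # the person's interests, treated as a pseudo-document

def tokenize(text):
    return text.lower().split()

# Document frequency of each term, with simple smoothing to avoid division by zero.
n_docs = len(docs)
df = Counter(t for text in docs.values() for t in set(tokenize(text)))

def tf_idf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log((1 + n_docs) / (1 + df.get(t, 0))) for t in tf}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(x * x for x in a.values())) *
           math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

query = tf_idf(tokenize(profile))
vectors = {d: tf_idf(tokenize(text)) for d, text in docs.items()}

# Rank texts by similarity to the interest profile; note d1 and d2 never
# reinforce each other through "car"/"truck" - exactly the drawback mentioned above.
for d, score in sorted(((d, cosine(query, vectors[d])) for d in vectors),
                       key=lambda x: -x[1]):
    print(d, round(score, 3))
```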
As a search engine engineer, I think this problem is best solved using two techniques in conjunction.
Technique 1: search (TF-IDF or other algorithms)
Use search to create a baseline model for content where you don't have user statistics. There are a number of technologies out there, but I think the Apache Lucene/Solr code base is by far the most mature and stable.
Technique 2: user-based recommenders (k-nearest neighbours or other algorithms)
When you start getting user statistics, use them to enhance the relevance model used by the text analysis system. A fast-growing codebase for solving these kinds of problems is the Apache Mahout project.
Check out Programming Collective Intelligence, a really good overview of various techniques along these lines. Also very readable.
I've been reading Introduction to Algorithms and started to get a few ideas and questions popping up in my head. The one that's baffled me most is how you would approach designing an algorithm to schedule items/messages in a queue that is distributed.
My thoughts have led me to browse Wikipedia on topics such as sorting, message queues, scheduling, and distributed hash tables, to name a few.
The scenario:
Say you wanted a system that queues messages (strings or some serialized objects, for example). A key feature of this system is avoiding any single point of failure. The system has to be distributed across multiple nodes within some cluster and has to even out the workload of each node within the cluster as consistently as possible, to avoid hotspots.
You want to avoid the use of a master/slave design for replication and scaling (no single point of failure). The system totally avoids writing to disk and maintains in-memory data structures.
Since this is meant to be a queue of some sort, the system should be able to use varying scheduling algorithms (FIFO, earliest deadline, round robin, etc.) to determine which message should be returned on the next request, regardless of which node in the cluster the request is made to.
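To illustrate just the "varying scheduling algorithms" part of this scenario (and nothing about distribution, which is the hard bit), here is a minimal single-node sketch with pluggable policies; all class and field names are made up.

```python
import heapq
import time

class FifoPolicy:
    def key(self, msg):
        return msg["enqueued_at"]

class EarliestDeadlinePolicy:
    def key(self, msg):
        return msg["deadline"]

class ScheduledQueue:
    def __init__(self, policy):
        self.policy = policy
        self.heap = []
        self.counter = 0  # tie-breaker so the heap never has to compare dicts

    def put(self, msg):
        msg.setdefault("enqueued_at", time.time())
        self.counter += 1
        heapq.heappush(self.heap, (self.policy.key(msg), self.counter, msg))

    def next(self):
        return heapq.heappop(self.heap)[2] if self.heap else None

q = ScheduledQueue(EarliestDeadlinePolicy())
q.put({"body": "late job", "deadline": 200})
q.put({"body": "urgent job", "deadline": 50})
print(q.next()["body"])  # urgent job
```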
My initial thoughts
I can imagine how this would work on a single machine, but when I start thinking about how you'd distribute something like this, questions come up, like:
How would I hash each message?
How would I know which node a message was sent to?
How would I schedule each item so that I can determine which message and from which node should be returned next?
I started reading about distributed hash tables and how projects like Apache Cassandra use some sort of consistent hashing to distribute data, but then I thought: since the query won't supply a key, I need to know where the next item is and just supply it...
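For the "which node owns which key" part, here is a rough sketch of a consistent-hashing ring in the spirit of what Cassandra and other DHTs do; the node names, hash function and number of virtual nodes are arbitrary choices for illustration, and it does not solve the "query without a key" issue raised above.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each physical node gets several positions ("virtual nodes") on the ring
        # so that load spreads more evenly and rebalancing on node loss is gradual.
        self.ring = sorted((self._hash(f"{node}:{i}"), node)
                           for node in nodes for i in range(vnodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise from the key's position to the first node on the ring.
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for msg_id in ["msg-1", "msg-2", "msg-3", "msg-4"]:
    print(msg_id, "->", ring.node_for(msg_id))
```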
That train of thought led me to read about peer-to-peer protocols and how they approach the synchronization problem across nodes.
So my question is: how would you approach a problem like the one described above, or is this too far-fetched and simply a stupid idea...?
Just an overview, pointers, different approaches, pitfalls and benefits of each. The technologies/concepts/design/theory that may be appropriate. Basically anything that could be of use in understanding how something like this may work.
And if you're wondering, no, I'm not intending to implement anything like this; it just popped into my head while reading (it happens, I get distracted by wild ideas when I read a good book).
UPDATE
Another interesting point that would become an issue is distributed deletes. I know systems like Cassandra have tackled this by implementing hinted handoff, read repair and anti-entropy, and it seems to work well, but are there any other (viable and efficient) means of tackling this?
Overview, as you wanted
There are some popular techniques for distributed algorithms, e.g. using clocks, waves or general purpose routing algorithms.
You can find these in the great distributed-algorithms books Introduction to Distributed Algorithms by Tel and Distributed Algorithms by Lynch.
Reductions
Reductions are particularly useful, since general distributed algorithms can become quite complex. You might be able to reduce your problem to a simpler, more specific case.
If, for instance, you want to avoid having a single point of failure, but a symmetric distributed algorithm is too complex, you can use the standard distributed algorithm of (leader) election and afterwards use a simpler asymmetric algorithm, i.e. one which can make use of a master.
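As an illustration of that election step, here is a simulated sketch in the spirit of the classic ring election (Chang-Roberts style: forward the largest id, and whoever sees their own id come back is the leader); the synchronous in-process simulation and the ids are made up, and a real implementation would exchange messages over the network.

```python
def elect_leader(node_ids):
    n = len(node_ids)
    # Round 0: every node sends its own id to its clockwise neighbour.
    inbox = {i: [] for i in range(n)}
    for i in range(n):
        inbox[(i + 1) % n].append(node_ids[i])

    leader = None
    while leader is None:
        next_inbox = {i: [] for i in range(n)}
        for i in range(n):
            for candidate in inbox[i]:
                if candidate == node_ids[i]:
                    leader = candidate                         # own id returned: elected
                elif candidate > node_ids[i]:
                    next_inbox[(i + 1) % n].append(candidate)  # forward larger ids
                # ids smaller than the node's own are simply dropped
        inbox = next_inbox
    return leader

print(elect_leader([17, 3, 42, 8, 25]))  # 42
```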
Similarly, you can use synchronizers to run an algorithm designed for a synchronous network model on an asynchronous network.
You can use snapshots to be able to analyze offline instead of having to deal with varying online process states.
I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
For each program, A and B: create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
Let Na = # of unique object names encountered in A, similarly for Nb and # of objects in B.
Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries will record a value for each object at each trigger (i.e. for each row, defined next).
For each assignment in A, the simplest approach is to capture the hash value of all of the Na items; of course, one can use LOCF (last observation carried forward) for those items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value will allow one to identify the sequences of values.
Find discrepancies in the objects between A and B based on when the sequences of hash values diverge for any objects with divergent sequences.
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge, though the impact is insignificant if the discrepancies are approximately at the machine tolerance level.
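Here is a rough sketch of how steps 3-5 might look, with floats rounded before hashing so that discrepancies at roughly machine tolerance (the concern just raised) do not change the fingerprint; every name, the tolerance, and the toy data are made up.

```python
import hashlib

def fingerprint(value, ndigits=9):
    # Hash a canonical representation of the object, rounding floats so that
    # machine-precision noise does not produce a different hash.
    def normalise(v):
        if isinstance(v, float):
            return round(v, ndigits)
        if isinstance(v, (list, tuple)):
            return [normalise(x) for x in v]
        if isinstance(v, dict):
            return {k: normalise(x) for k, x in sorted(v.items())}
        return v
    return hashlib.sha1(repr(normalise(value)).encode()).hexdigest()

def record(table, objects):
    # One row per trigger; passing the current bindings each time gives LOCF for free.
    table.append({name: fingerprint(val) for name, val in objects.items()})

def first_divergence(table_a, table_b, name_a, name_b):
    # First trigger index at which the two objects' hash sequences differ, else None.
    for i, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        if row_a.get(name_a) != row_b.get(name_b):
            return i
    return None

table_a, table_b = [], []
record(table_a, {"a001": [1.0, 2.0 + 1e-15]})   # program A, trigger 1
record(table_b, {"b001": [1.0, 2.0]})           # program B, trigger 1
print(first_divergence(table_a, table_b, "a001", "b001"))  # None: within tolerance
```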
First: What is a name for such types of testing methods and concepts? An answer need not necessarily be the method above, but reflects the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing execution traces; it mentions "code comparison", which is related to my interest, though I'm more concerned with the data (i.e. the objects) than with the actual code that produces the objects. I've just skimmed it, but will review it more carefully for methodology. More importantly, it suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area of security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
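As a tiny illustration of that last point, here is a sketch of a monitored value that records a fingerprint every time it is reassigned, so the state history can later be lined up against the other program's; the names are made up, and a real reactive framework would organise this quite differently.

```python
import hashlib

class Monitored:
    # Wraps a value and logs a hash of every new state on assignment.
    def __init__(self, name, log):
        self.name = name
        self.log = log        # shared list of (name, hash) events
        self._value = None

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        digest = hashlib.sha1(repr(new_value).encode()).hexdigest()
        self.log.append((self.name, digest))

log = []
x = Monitored("x", log)
x.value = [1, 2, 3]
x.value = [1, 2, 3, 4]
for name, digest in log:
    print(name, digest[:8])
```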
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.