I am developing a 2-4 player networked game.
At the heart of the model is a data structure that acts like a Google Docs spreadsheet that everybody can edit at any time.
For simplicity, each of the spreadsheet cells can contain only one letter.
Some abilities and requirements:
1. All players can edit any spreadsheet cell at any time (that means there must not be a "locked cell").
2. All network transactions are reliable (but can arrive out of order).
I am having a hard time developing an algorithm for handling this shared spreadsheet-like data structure.
Is anyone familiar with a similar problem who has a solution, or can you suggest a simple way to solve it?
Thank you.
I think you should try to define some criteria for the algorithm you are looking for. You may want a guaranteed response time, or you might prefer absolute data consistency. It seems unlikely that you can achieve both at the same time.
The subject you are talking about is called Operational Transformation: http://en.wikipedia.org/wiki/Operational_transformation
Some open source projects do amazing things in this field, like http://sharejs.org/ or http://etherpad.org/.
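Since each cell only holds a single letter, you may not even need full OT to get convergence. Here is a minimal sketch (mine, not from the links above) of a per-cell last-writer-wins register using Lamport timestamps; all class and method names are made up for illustration:

    # A minimal sketch: per-cell last-writer-wins registers with Lamport timestamps.
    # This works because each cell holds a single letter, so concurrent edits to the
    # same cell only need a deterministic winner.
    class SharedGrid:
        def __init__(self, player_id):
            self.player_id = player_id        # used to break timestamp ties
            self.clock = 0                    # local Lamport clock
            self.cells = {}                   # (row, col) -> (timestamp, player_id, letter)

        def local_edit(self, row, col, letter):
            """Apply a local edit and return the update to broadcast."""
            self.clock += 1
            update = (self.clock, self.player_id, row, col, letter)
            self._apply(update)
            return update

        def receive(self, update):
            """Apply an update from the network; arrival order does not matter."""
            timestamp, player_id, row, col, letter = update
            self.clock = max(self.clock, timestamp)   # keep clocks loosely in sync
            self._apply(update)

        def _apply(self, update):
            timestamp, player_id, row, col, letter = update
            current = self.cells.get((row, col))
            # Higher (timestamp, player_id) wins; re-applying an update is harmless.
            if current is None or (timestamp, player_id) > (current[0], current[1]):
                self.cells[(row, col)] = (timestamp, player_id, letter)

Because updates are idempotent and commutative, every player converges to the same grid no matter what order the updates arrive in, which fits requirement 2. Full OT or a CRDT only becomes necessary if you later need richer operations such as inserting rows or multi-character cell edits.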
TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper [1] and I was amazed by the results. I decided to give it a try, but I'm having trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification, but haven't heard back yet, so I'd really appreciate any insight you guys may have to offer.
The high-level description of the algorithm can be found on page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it's the initialization step. The part I quote indicates that a one-time activation also takes place here. But which item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For example, in the BPTS algorithm I'm using, for each item in the training dataset a new neural network - whose topology depends on the current item (data structure) - is created on the fly and activated. Then the error is backpropagated, the weights are updated and saved, and the temporary neural network is destroyed. (A toy sketch of this per-item forward pass is below.)
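For reference, here is a toy sketch of the per-item forward pass I described for BPTS, assuming each item is a small tree with a fixed maximum number of children and a single shared weight matrix. The sizes and names are illustrative only, not taken from the paper:

    import numpy as np

    ATTR_DIM, OUT_DIM, MAX_CHILDREN = 4, 3, 2            # illustrative sizes
    IN_DIM = ATTR_DIM + MAX_CHILDREN * OUT_DIM
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(OUT_DIM, IN_DIM))    # shared weights, randomly initialized
    b = np.zeros(OUT_DIM)

    def forward(node):
        """Compute a node's output from its attribute vector and its children's outputs."""
        child_outputs = [forward(c) for c in node.get("children", [])]
        # Missing children (including all children of frontier nodes) contribute zeros.
        while len(child_outputs) < MAX_CHILDREN:
            child_outputs.append(np.zeros(OUT_DIM))
        x = np.concatenate([node["attributes"]] + child_outputs)
        return np.tanh(W @ x + b)

    # The root's output is then fed to whatever classifier / loss you use per item:
    tree = {"attributes": np.ones(ATTR_DIM),
            "children": [{"attributes": np.zeros(ATTR_DIM)}]}
    root_output = forward(tree)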
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30) and (34). My understanding is that NT denotes the number of items in the training dataset. But equations (17), (30) and (34) already involve ALL items in the training dataset, so, what's the point of solving them (specifically) NT times?
Yet another thing I couldn't figure out is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it's very unclear to me how it works with their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on with this algorithm, I'd be very interested in hearing it (or rather, reading it). Also, if you are aware of other promising algorithms and / or network architectures (could long short term memory (LSTM) be of use here?) for structured data classification, please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf
What I'm trying to do is find an algorithm that I can implement to generate 'intelligent' suggestions for people, by comparing messages they send to messages sent by their peers.
For example, Person A sends a message to Person B talking about Obj1. If Person C sends a message to Person D about Obj1, the system should notice they are talking about the same thing and may suggest that Person A talk to Person C.
I have implemented collecting the statistics to capture the mentions people have in common, but I do not know which algorithm to use to analyse this.
Any suggestions?
(I hope this makes enough sense)
Take a look at clustering algorithms, and at k-means or k-nearest neighbours for a quick start.
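For instance, here is a rough sketch (assuming scikit-learn is available) that clusters users by how often they mention each object; the data is made up:

    import numpy as np
    from sklearn.cluster import KMeans

    # Rows are users, columns are objects (Obj1, Obj2, ...); values are mention counts.
    mention_counts = np.array([
        [5, 0, 1],   # Person A
        [0, 4, 0],   # Person B
        [6, 1, 0],   # Person C
        [0, 3, 2],   # Person D
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(mention_counts)
    print(kmeans.labels_)   # users sharing a label are candidates to introduce to each other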
How much data have you got? The more the better.
There are lots of approaches to this problem. You may, for example, assume that all users are similar to each other to some degree, and that what you want to do is find the most similar users for each user. A vector space model with cosine similarity will give you quick results.
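A quick sketch of that vector-space idea, where each user is a vector of mention counts and cosine similarity measures how alike two users' interests are (the vectors are made up):

    import math

    def cosine_similarity(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return dot / (norm_u * norm_v)

    person_a = [5, 0, 1]   # mentions of Obj1, Obj2, Obj3
    person_c = [6, 1, 0]
    print(cosine_similarity(person_a, person_c))   # close to 1.0 -> suggest A talks to C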
Give some more information on what you want to achieve.
This is exactly the same problem Twitter is battling with. You might end up with a job there if you crack this ;)
On a serious note, you could use some crude (i.e. heuristic-based) measures to do something like this, but the error rate will be high, as delnan said in the comment.
NLP is a sure bet. Note that NLP also has some error rate, but it's far more accurate than any heuristic you would use. If you are using Python I would suggest this toolkit, which I use now and then: NLP.
For other languages I am sure there are packages which will help you in this regard.
UPDATE1: If you have a way for users to tag their messages (like Stack Overflow does), you could approach this problem without NLP. You could simply take the intersection of the tags of the two messages to see if there is any commonality and suggest some top items for the common tags.
But there are other issues you'll have to deal with: you have to make tags mandatory, and you need to be sure that users are actually entering correct tags, etc. Nevertheless this greatly simplifies your problem.
UPDATE2: As the question has been updated - since you are only interested in some specific keywords/phrases, this simplifies things. Take each message, split it into words, then stem each word. After stemming, intersect this set with your set of keywords; you'll get a set S1. Do the same with the second message to get a set S2. Intersect S1 and S2. If the intersection is non-empty, bingo! Some theme is common to message 1 and message 2; otherwise nothing is. A rough sketch follows below.
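Here is a rough sketch of that stem-and-intersect approach, assuming NLTK is installed for the stemmer; the keyword list is made up:

    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    keywords = {stemmer.stem(k) for k in ["obama", "election", "movie"]}

    def themes(message):
        """Return the stemmed words of a message that match the keyword set."""
        words = re.findall(r"[a-z']+", message.lower())
        return {stemmer.stem(w) for w in words} & keywords

    s1 = themes("Did you watch the elections last night?")
    s2 = themes("The election results surprised everyone")
    common = s1 & s2
    if common:
        print("Common theme(s):", common)   # e.g. {'elect'}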
I am making a website for a side project at school where students enter the classes they need to take, what days they want or don't want classes, and when they can't have or don't want classes. The basics are that there are classes, and each class has many sections at different times with different professors that a student can choose from. With the freshman-level classes, there can be over 30 different sections for each class. I have the classes and sections in a MySQL database and I have been coding in PHP.
So far I have it working, but I want to make it faster. I have been reading about other scheduling problems, but I am looking for specifics to what I am doing. This isn't making schedules from scratch; it is making schedules from the sections that are available and ranking them based on the student's input. Currently, with few possible sections, it runs fast. But when the possible schedules get to about 300,000, it takes around 30 seconds to compare and rank everything. I have been improving it by changing how schedules are generated, but I want it to be faster. I switched from brute-force generation to a tree-based method.
I'm not asking for homework help or for someone to do this for me. I just want to be pointed in the right direction with already existing problems and algorithms that I can learn about.
Remember the eight queens puzzle? I sure hope you do, if not, go and solve it first, then come back to your scheduling task.
You have already moved from brute force to a tree structure. Now it's time for branch and bound. Whatever you mean by "good schedules", 170,000 is too many; you are not pruning your tree enough. I do not think that there could be more than 20-50 really good schedules for each student, unless they take very few classes and are extremely flexible. A rough sketch of the idea follows below.
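A rough branch-and-bound sketch; the data format and scoring are made up, and a real implementation would use the student's preferences as the score:

    # Pick one section per class, prune branches with time conflicts, and bound the
    # best score still reachable so hopeless branches are cut early.

    def conflicts(section, chosen):
        return any(section["slot"] == s["slot"] for s in chosen)

    def best_schedule(classes, chosen=None, score=0, best=None):
        """classes: list of lists of sections; each section is a dict with 'slot' and 'score'."""
        if chosen is None:
            chosen = []
        if best is None:
            best = {"score": float("-inf"), "schedule": None}
        if not classes:                                   # complete schedule reached
            if score > best["score"]:
                best["score"], best["schedule"] = score, list(chosen)
            return best
        # Bound: even if every remaining class got its best section, can we beat best?
        optimistic = score + sum(max(s["score"] for s in c) for c in classes)
        if optimistic <= best["score"]:
            return best                                   # prune this branch
        for section in sorted(classes[0], key=lambda s: -s["score"]):
            if conflicts(section, chosen):                # prune time conflicts
                continue
            chosen.append(section)
            best_schedule(classes[1:], chosen, score + section["score"], best)
            chosen.pop()
        return best

    classes = [
        [{"slot": "MWF9", "score": 5}, {"slot": "TTh10", "score": 3}],
        [{"slot": "MWF9", "score": 4}, {"slot": "MWF11", "score": 2}],
    ]
    print(best_schedule(classes))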
Try metaheuristics such as tabu search or simulated annealing.
Brute force and branch and bound don't scale up enough.
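A hedged sketch of simulated annealing for the same problem: start from any schedule, repeatedly swap one class's section, and accept worse schedules with a probability that shrinks as the temperature cools. Everything here is illustrative; the score function is assumed to reward preferred days/times and heavily penalize time conflicts:

    import math
    import random

    def anneal(classes, score_fn, steps=10000, start_temp=10.0):
        """classes: list of lists of sections; score_fn: schedule -> number (higher is better)."""
        current = [random.choice(sections) for sections in classes]
        current_score = score_fn(current)
        best, best_score = list(current), current_score
        for step in range(steps):
            temp = start_temp * (1 - step / steps) + 1e-9   # cool down over time
            i = random.randrange(len(classes))
            candidate = list(current)
            candidate[i] = random.choice(classes[i])        # swap one class's section
            candidate_score = score_fn(candidate)
            delta = candidate_score - current_score
            # Always accept improvements; accept worse moves with decaying probability.
            if delta >= 0 or random.random() < math.exp(delta / temp):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = list(current), current_score
        return best, best_score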
Take a look at my curriculum course example in drools planner, as defined by ITC2007.
It's probably an advanced form of your use case (not counting GUI/DB).
Have a look at this. It may not be exactly what you want but you can get some design ideas.
I want to know effective algorithms/data structures for identifying the information below in streaming data.
Consider real-time streaming data like Twitter. I am mainly interested in the queries below rather than in storing the actual data.
I need my queries to run on the actual data but not on any of the duplicates.
As I am not interested in storing the complete data, it will be difficult for me to identify duplicate posts. I could hash all the posts and check against the hashes, but I would also like to identify near-duplicate posts. How can I achieve this?
Identify the top k topics being discussed by the users.
I want to identify the top topics being discussed by users. I don't want the top-frequency words as shown by Twitter; instead I want to give some high-level topic name for the most frequent words.
I would like my system to be real-time. I mean, my system should be able to handle any amount of traffic.
I can think of a map-reduce approach, but I am not sure how to handle synchronization issues. For example, duplicate posts can reach different nodes and both of them could store the posts in the index.
In a typical news source, one removes any stop words from the data. In my system I would like to update my stop-word list by identifying the most frequent words across a wide range of topics.
What would be an effective algorithm/data structure to achieve this?
I would like to store the topics over a period of time to retrieve interesting patterns in the data. Say, on Friday evening everyone wants to go to a movie. What would be an efficient way to store this data?
I am thinking of storing it in the Hadoop distributed file system, but over a period of time these indexes become so large that I/O will be my major bottleneck.
Consider multi-lingual data from tweets around the world. How can I identify similar topics being discussed across a geographical area?
There are two problems here. One is identifying the language being used. It could be identified based on the person tweeting, but this information might affect the privacy of the users. Another idea could be running it through a training algorithm. What is the best method currently used for this? The other problem is actually looking up each word in a dictionary and associating it with a common intermediate language like, say, English. How do I take care of word sense disambiguation, i.e. the same word being used in different contexts?
Identify the word boundaries
One possibility is to use some kind of training algorithm, but what is the best approach used? This is somewhat similar to word sense disambiguation, because you will be able to identify word boundaries based on the actual sentence.
I am thinking of developing a prototype and evaluating the system rather than building a concrete implementation. I think it's not possible to scrape the real-time Twitter data, so this approach could be tested on some data freely available online. Any ideas where I can get this data?
Your feedback is appreciated.
Thanks for your time.
-- Bala
There are a couple of different questions buried in here. I can't understand everything you're asking, but here's the big one as I understand it: you want to categorize messages by topic, and you also want to remove duplicates.
Removing duplicates is (relatively) easy. To remove "near" duplicates, you could first remove uninteresting parts from your data. You could start by removing capitalization and punctuation. You could also remove the most common words. Then you could add the resulting message to a Bloom filter. Hashing isn't good enough for Twitter, as the hashed messages wouldn't be much smaller than the full messages; you'd end up with a hash table that doesn't fit in memory. That's why you'd use a Bloom filter instead. It might have to be a very large Bloom filter, but it will still be smaller than the hash table. A small sketch follows below.
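A small sketch of that normalize-then-Bloom-filter idea; the sizes and stop-word list are only illustrative, and a real Twitter-scale filter would be much larger:

    import hashlib

    COMMON_WORDS = {"the", "a", "is", "to", "and", "of", "in", "rt"}

    def normalize(message):
        """Lowercase, strip punctuation, and drop very common words."""
        words = [w.strip(".,!?#@:") for w in message.lower().split()]
        return " ".join(w for w in words if w and w not in COMMON_WORDS)

    class BloomFilter:
        def __init__(self, num_bits=8_000_000, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            # Derive several bit positions from salted SHA-256 digests.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def probably_contains(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    bloom = BloomFilter()
    msg = normalize("RT: The cats are taking over the world!")
    if not bloom.probably_contains(msg):
        bloom.add(msg)          # first time this (near-)duplicate class is seen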
The other part is a difficult categorization problem. You probably do not want to write this part yourself. There are a number of libraries and programs available for categorization, but it might be hard to find one that fits your needs. An example is the Vowpal Wabbit project, which is a fast online algorithm for categorization. However, it only works on one category at a time. For multiple categories, you would have to run multiple copies and train them separately.
Identifying the language sounds less difficult. Don't try to do something smart like "training"; instead, put the most common words from each language in a dictionary. For each message, pick the language whose words appear most frequently.
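A sketch of that dictionary approach; the word lists here are tiny and only illustrative - you would use the few hundred most common words per language:

    COMMON = {
        "english": {"the", "and", "is", "of", "to", "in", "that"},
        "spanish": {"el", "la", "de", "que", "y", "en", "los"},
        "german":  {"der", "die", "und", "ist", "das", "nicht", "ein"},
    }

    def guess_language(message):
        """Pick the language whose common words overlap most with the message."""
        words = set(message.lower().split())
        scores = {lang: len(words & common) for lang, common in COMMON.items()}
        return max(scores, key=scores.get)

    print(guess_language("la vida es un sueño y los sueños sueños son"))   # spanish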
If you want the algorithm to come up with categories on its own, good luck.
I'm not really sure if I'm answering your main question, but you could determine the similarity of two messages by calculating the Levenshtein distance between them. You can think of this as the "edit distance" between two strings (i.e., how many edits would need to be made to one string to convert it into the other).
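For reference, the standard dynamic-programming version:

    def levenshtein(a, b):
        """Number of single-character insertions, deletions, or substitutions
        needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[len(b)]

    print(levenshtein("kitten", "sitting"))   # 3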
Hello, we have created a very similar demo using api.cortical.io functionality.
There you can create semantic fingerprints of each tweet (you could also extract the top keywords, or some similar terms that don't actually need to be part of the tweet).
We have used the fingerprints to filter the twitter stream based on content.
On twistiller.com you can see the result. The public 1% twitter stream is monitored for four different topic areas.
I have a database, consisting of a whole bunch of records (around 600,000) where some of the records have certain fields missing. My goal is to find a way to predict what the missing data values should be (so I can fill them in) based on the existing data.
One option I am looking at is clustering - i.e. representing the records that are complete as points in some space, looking for clusters of points, and then, when given a record with missing values, trying to find out whether there are any clusters it could belong to that are consistent with the existing data values. However this may not be possible because some of the data fields are on a nominal scale (e.g. color) and thus can't be put in order.
Another idea I had is to create some sort of probabilistic model that would predict the data, train it on the existing data, and then use it to extrapolate.
What algorithms are available for doing the above, and is there any freely available software that implements those algorithms? (This software is going to be in C#, by the way.)
This is less of an algorithmic and more of a philosophical and methodological question. There are a few different techniques available to tackle this kind of question. Acock (2005) gives a good introduction to some of the methods. Although it may seem that there is a lot of math/statistics involved (and it may seem like a lot of effort), it's worth thinking about what would happen if you messed it up.
Andrew Gelman's blog is also a good resource, although the search functionality on his blog leaves something to be desired...
Hope this helps.
Acock (2005): http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf
Andrew Gelman's blog: http://www.stat.columbia.edu/~cook/movabletype/mlm/
Dealing with missing values is a methodological question that has to do with the actual meaning of the data.
Several methods you can use (detailed post on my blog):
Ignore the data row. This is usually done when the class label is missing (assuming your data mining goal is classification) or when many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
Use a global constant to fill in missing values, like "unknown", "N/A" or minus infinity. This is used because sometimes it just doesn't make sense to try to predict the missing value. For example, if you have a DB of, say, college candidates and the state of residence is missing for some, filling it in doesn't make much sense.
Use the attribute mean. For example, if the average income of a US family is X, you can use that value to replace missing income values.
Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing DB that, among other things, classifies cars as "luxury" and "low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you factored in the low-budget cars (see the sketch after this list).
Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, or clustering algorithms used to generate input for method #4 (k-means/median, etc.).
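A short sketch of methods #3 and #4 using pandas (the column names and values are made up):

    import pandas as pd

    cars = pd.DataFrame({
        "segment": ["luxury", "luxury", "low budget", "low budget", "luxury"],
        "cost":    [80000,    None,     15000,        12000,        90000],
    })

    # Method #3: fill missing costs with the overall mean cost.
    overall_mean = cars["cost"].fillna(cars["cost"].mean())

    # Method #4: fill missing costs with the mean cost of the same class ("luxury" here).
    class_mean = cars["cost"].fillna(cars.groupby("segment")["cost"].transform("mean"))

    print(class_mean)   # the missing luxury cost becomes 85000 rather than ~49250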
I'd suggest looking into regression and decision trees first (ID3 tree generation) as they're relatively easy and there are plenty of examples on the net.
As for packages, if you can afford it and you're in the Microsoft world, look at SQL Server Analysis Services (SSAS for short), which implements most of the methods mentioned above.
Here are some links to free data mining software packages:
WEKA - http://www.cs.waikato.ac.nz/ml/weka/index.html
ORANGE - http://www.ailab.si/orange
TANAGRA - http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Although not C#, here's a pretty good intro to decision trees and Bayesian learning (using Ruby):
http://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/
http://www.igvita.com/2007/05/23/bayes-classification-in-ruby/
There's also this Ruby library that I find very useful (also for learning purposes):
http://ai4r.rubyforge.org/machineLearning.html
There should be plenty of samples for these algorithms online in any language so I'm sure you'll easily find C# stuff too...
Edited:
Forgot this in my original post. This is definitely a MUST HAVE if you're playing with data mining...
Download the Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (it requires SQL Server Analysis Services - SSAS - which isn't free, but you can download a trial).
This will allow you to easily play and try out the different techniques in Excel before you go and implement this stuff yourself. Then again, since you're in the Microsoft ecosystem, you might even decide to go for an SSAS based solution and count on the SQL Server guys to do it for ya :)
Predicting missing values is generally considered to be part of data cleansing phase which needs to be done before the data is mined or analyzed further. This is quite prominent in real world data.
Please have a look at this algorithm http://arxiv.org/abs/math/0701152
Currently Microsoft SQL Server Analysis Services 2008 also comes with algorithms like these http://technet.microsoft.com/en-us/library/ms175312.aspx which help in predictive modelling of attributes.
cheers