Is there an algorithm for anonymous, changeable, secure voting? - algorithm

I'd like to implement a feedback mechanism in my application--basically, a score. The requirements are:
A total exists, and can be read
A user can add his score to the total
A user cannot add a second score, but can change his original score, updating the total by subtracting the original score and adding the new one.
It is impossible to determine what a given user's vote was
It seems that this borders on (or even overlaps) cryptography theory, but I haven't been able to find anything that would address this. Does anyone have any specific algorithms that would address this? Or even additional search vectors I could use to pursue it?

If there is an anonymous ID, such as a hash of a value that the user supplies, then anyone who can produce something that yields the same hash could modify the corresponding vote.
In this sense, there is still anonymity, because the hash doesn't reveal the source. Instead of listing (userName, vote), list (hashValue, vote). If there is a concern that the hashValue could be tracked across many polls, then apply an additional poll-specific wrapping to the hash, which is not revealed publicly. Or let the user embed (e.g. prepend) that into their string to be hashed, so they are still producing a unique submission.
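A minimal sketch of this idea in Python, assuming an in-memory store; the poll-specific secret (poll_secret) and all other names are illustrative, not from the answer:

    import hashlib

    votes = {}   # hashValue -> vote
    total = 0

    def cast_vote(user_string: str, vote: int, poll_secret: str = "poll-42") -> None:
        # Prepend a poll-specific value so the same user string hashes
        # differently in other polls and cannot be tracked across them.
        global total
        h = hashlib.sha256((poll_secret + user_string).encode()).hexdigest()
        if h in votes:
            total -= votes[h]   # replacing an earlier vote: subtract it first
        votes[h] = vote
        total += vote

Anyone who can reproduce user_string can change the corresponding vote, but the public listing (hashValue, vote) reveals nothing about the source.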

You can never have anonymous voting without the ability to trust that the anonymous individuals will not vote twice. By definition, true anonymity guarantees that you can never detect duplicate voting.
If you instead force the user to identify themselves, you can implement a voting system that prevents duplicate voting and provides anonymity within the context of the vote.
Here is a simple algorithm.
User logs in. The onus is on your system to prevent one user from obtaining multiple user accounts.
User (not anonymous) selects an issue on which to vote.
User (not anonymous) casts a vote.
Your system stores the following:
An indication that the user voted on the selected issue. This prevents duplicate voting.
The value of the user's vote on the selected issue (this is the score you mentioned). This value is stored without reference to the user who cast the vote.
The value of the user's score if they voted on an issue. You probably want this to be a calculated value.
If the user wants to change their vote, they log in, select the issue, then unvote (your system knows they voted because it stored this). At this point they can select the issue again (their vote indication was cleared) and vote.
Note that your system will need to subtract the value of the user's vote from the tally for the issue when they unvote.
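A minimal sketch of this bookkeeping, assuming an in-memory store; class and method names are illustrative. Note that to support unvoting, the system must be able to recover the value of the user's vote somehow; this sketch keeps it in a separate private table, which is the trade-off the scheme implies:

    class IssuePoll:
        def __init__(self):
            self.total = 0          # the public tally (no user references)
            self.voted = set()      # who has voted, to prevent duplicates
            self._values = {}       # private: user -> value, needed for unvoting

        def vote(self, user_id: str, value: int) -> None:
            if user_id in self.voted:
                raise ValueError("already voted; unvote first")
            self.voted.add(user_id)
            self._values[user_id] = value
            self.total += value

        def unvote(self, user_id: str) -> None:
            if user_id not in self.voted:
                raise ValueError("no vote recorded")
            self.total -= self._values.pop(user_id)   # subtract from the tally
            self.voted.remove(user_id)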

You don't give enough information on what a legal vote is, but if it's, say, an integer, then you can just keep a sum and allow multiple votes. This works because changing a vote from A to B has the exact same effect as voting A and then voting (B - A).
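A tiny sketch of that observation:

    total = 0

    def vote(delta: int) -> None:
        global total
        total += delta

    vote(4)        # the user votes 4
    vote(7 - 4)    # the user changes the vote from 4 to 7
    assert total == 7   # same result as a single vote of 7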

Actually, online voting is pretty tricky.
If you want the most extreme approach to voting safety, you may need to consider something like this:
https://docs.google.com/document/d/1SPYFAkVNjqDP4HOt_A_YGFZy-SFXVxHoN1hpLGNFKXI/pub
It is an algorithm that distributes the voting secret among n distinct servers, none of which can break the voting anonymity by itself. All n servers would have to cooperate in order to break anonymity, and if even one of the servers covers its tracks and wipes all of its cryptographic data, the voting secret is lost/hidden forever.
The system can also deal with re-sending of votes, with some limitations inherent to any secure system for online voting:
Online voting security always has one ultimate limitation: it is vulnerable to traffic analysis. For example, if only one person votes on a given day, it can be concluded that any update of the voting result is a result of that person's vote.
A perfectly secure online voting system should be viewed as a one-time vote mixer. It takes a number of votes, buffers them, and when the voting is finally closed, mixes them all in one go. This makes it extremely difficult to associate a vote with a voter, and it can be achieved with pretty solid technology.
However, when we want to update votes, things get much trickier. There is an intrinsic need for synchronization if we want to avoid the possibility of traffic analysis. Ideally, all voters would have to re-send an update at regular intervals (even if their update is actually not an update).

Related

Algorithm for determining interest in app?

Trying to work out an algorithm for grouping my users into distinct profiles based on their activity - e.g. "regular users", "occasional users", "regular poster", "lurker" for someone who doesn't post but does stuff.
For "regular user", I was thinking that the algorithm would have to take into account the total number of users, how often an average user visits the site, and completed actions such as "like", "favorite", "view", or "clicking a link".
I'm not very good with algorithms so looking for some help.
Depending on your application's traffic, you can count every action performed by your users and divide by the user count to get an average. Then, based on that figure, apply the ranks, e.g.:
1. Regular user: within 15% of the average, above or below.
2. Active user: 16% or more above the average.
and so on..
The important thing is to work with percentages, so the thresholds scale with YOUR traffic. You can also set some static requirements, like 3 clicks per day.
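A minimal sketch of this approach, assuming you already track a per-user action count; the thresholds are the illustrative ones above:

    def rank_users(action_counts: dict) -> dict:
        average = sum(action_counts.values()) / len(action_counts)
        ranks = {}
        for user, count in action_counts.items():
            if count >= average * 1.16:
                ranks[user] = "active user"      # >= 16% above average
            elif count >= average * 0.85:
                ranks[user] = "regular user"     # within 15% of average
            else:
                ranks[user] = "occasional user"
        return ranks

    print(rank_users({"alice": 120, "bob": 95, "carol": 20}))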
Your best bet is to first really consider how you would define the terminology. Once you've defined the terminology, you should then be able to rank people based on those terms. For example, a user might be "someone who accesses the site", so a regular user is "someone who accesses the site regularly". The same idea applies to a poster. A lurker doesn't really fit this pattern, as it is pretty much "a user who does not post".
It might therefore be a good idea to define what actions are related to what terminology. A lurker might then be "someone who accesses the site but rarely does much in terms of actions", while a user is "someone who accesses the site and uses some of its features, but has not posted anything".
You then need to determine a metric for defining what regular, occasional, active and other such terms mean. ailvenge gives two good ways this can be done.

"Who to follow" algorithm

I want to give users the ability to view a personalized list of users they might find interesting and might want to follow...
I was thinking of it like this:
- Get all users he is currently following
- Get all users that those followed users follow
- Rank them by total posts made (descending) and by how completely they have filled in their personal information fields
- Show 5 of them on each page load
If the user doesn't follow anyone yet, an informational message will appear instead...
Can this kind of feature be done with this algorithm or is there a better or even easier way to do it?
In your algorithm, I'm wondering why you need to sort users based on number of posts; maybe it has something to do with reputation?
Recommendation is indeed a very large, open topic, and a hot academic research field. If we are working on a practical project, I think it is best to stay simple and focused.
I witnessed the following two kinds of recommendations on a very popular social website. From my experience, the recommendation output is of high quality. Here I'm brainstorming the algorithms behind them. Hope it helps.
Discover persons you might know: Recommend people whose 'following set' intersects with yours. This is based on the "clustering effect" of social networks: the friend of your friend is more likely to be your friend.
Recommend people based on interests: If the users can be celebrities, companies, institutions, press media, etc., then recommendations like the following might be useful: "People following @Linus also follow @Stallman, @LinuxDeveloper, ...". Suppose you've just followed @Linus: to recommend @Stallman and @LinuxDeveloper, first we need to find all users following @Linus, then figure out their common following list, possibly ranked by number of followers. The idea is to recommend users based on interest correlations: we calculate and discover highly correlated users, on the assumption that users' following lists are grouped by their interests.
(I'm also thinking that algorithm 1 will discover persons who share common interests with you, if the users can be celebrities, etc. This might be preferable in some scenarios.)
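A minimal sketch of recommendation 1 (friend-of-friend), assuming a mapping from each user to the set of users they follow; all names are illustrative:

    from collections import Counter

    def recommend(user, following, k=5):
        # Count how many of `user`'s followees also follow each candidate:
        # the more overlap, the stronger the clustering-effect signal.
        counts = Counter()
        for friend in following.get(user, set()):
            for candidate in following.get(friend, set()):
                if candidate != user and candidate not in following[user]:
                    counts[candidate] += 1
        return [name for name, _ in counts.most_common(k)]

    follows = {"me": {"ann", "bob"},
               "ann": {"bob", "carl", "dina"},
               "bob": {"carl", "eve"}}
    print(recommend("me", follows))   # 'carl' ranks first, with two mutual followees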
You're asking a very open-ended question here - how to pick a small number of recommendations out of a large set. So the answer is - you can make it as simple or as complicated as you want it to be! The simplest would be to pick a few at random (and any more complex algorithm had better prove that it produces better results than that.) Your solution of gathering all users who are two hops away, and then ranking by number of posts, is just a bit more complex, and then at the other extreme are the sophisticated algorithms used by the Amazons and Googles of the world. Companies put a lot of effort into building this sort of thing - have you heard of the Netflix Prize?
As I understand it, you want to find users who can offer high-quality information about your topic, and you need an algorithm that returns those users as results. But how do you find them?
Users with many followers are a good choice, but not always: many Twitter users follow others merely out of respect or etiquette.
Users whose tweets are frequently retweeted by other users are a good choice,
as are users who are mentioned many times by other users.
I think that to find these users, we should use link-based analysis such as the HITS or PageRank algorithms.
You may want to consider not including people who already follow the given user; I imagine the user might not be so interested in them, and including them could potentially be problematic. However, you may be very interested in finding out more about the people the user is following.
Are you considering showing the user the reason why these people were recommended to them? For example, saying that you may be interested in what little billy is saying because of his connection to your wife. If so, it may be worth letting users opt out of this in some sense, to avoid angering them.
Other than that, it seems like a pretty good way of recommending users that someone would be interested in. The only other thing I can think of that might also help find people with similar interests is allowing users to tag posts, which would let you find users by similar interests or by what they are posting about.
One other, more problematic, thing you could look into is finding users by similar interest: for example, if person A is following person C, and person B is following person C, then maybe recommend person A to person B. This seems like it could make for some very lengthy queries if you are not careful, though.

Checksum for SSN

I have a project that needs to do validation on the frontend for an American Social Security Number (format ddd-dd-dddd). One suggestion would be to use a hash algorithm, but given the tiny character set used ([0-9]), this would be disastrous. It would be acceptable to validate with some high probability that a number is correct and allow the backend to do a final == check, but I need to do far better than "has nine digits" etc etc.
In my search for better alternatives, I came upon the validation checksums for ISBN numbers and UPC. These look like a great alternative with a high probability of success on the frontend.
Given those constraints, I have three questions:
Is there a way to prove that an algorithm like ISBN13 will work with a different category of data like SSN, or whether it is more or less fit to the purpose from a security perspective? The checksum seems reasonable for my quite large sample of one real SSN, but I'd hate to find out that they aren't generally applicable for some reason.
Is this a solved problem somewhere, so that I can simply use a pre-existing validation scheme to take care of the problem?
Are there any such algorithms that would also easily accommodate validating the last 4 digits of an SSN without giving up too much extra information?
Thanks as always,
Joe
UPDATE:
In response to a question below, a little more detail. I have the customer's SSN as previously entered, stored securely on the backend of the app. What I need to do is verification (to the maximum extent possible) that the customer has entered that same value again on this page. The issue is that I need to prevent the information from being incidentally revealed to the frontend in case some non-authorized person is able to access the page.
That is why an MD5/SHA1 hash is inappropriate: namely that it can be used to derive the complete SSN without much difficulty. A checksum (say, modulo 11) provides nearly no information to the frontend while still allowing a high degree of accuracy for the field validation. However, as stated above I have concerns over its general applicability.
Wikipedia is not the best source for this kind of thing, but given that caveat, http://en.wikipedia.org/wiki/Social_Security_number says
Unlike many similar numbers, no check digit is included.
But before that it mentions some widely used filters:
The SSA publishes the last group number used for each area number. Since group numbers are allocated in a regular (if unusual) pattern, it is possible to identify an unissued SSN that contains an invalid group number. Despite these measures, many fraudulent SSNs cannot easily be detected using only publicly available information; for that, there are many online services that provide SSN validation.
Restating your basic requirements:
A reasonably strong checksum to protect against simple human errors.
"Expected" checksum is sent from server -> client, allowing client-side validation.
Checksum must not reveal too much information about SSN, so as to minimize leakage of sensitive information.
I might propose using a cryptographic hash (SHA-1, etc.), but do not send the complete hash value to the client. For example, send only the lowest 4 bits of the 160-bit hash result[1]. By sending 4 bits of checksum, your chance of detecting a data-entry error is 15/16, meaning that you'll detect mistakes about 94% of the time. The flip side, though, is that you have "leaked" enough info to reduce their SSN to 1/16 of the search space. It's up to you to decide if the convenience of client-side validation is worth this leakage.
By tuning the number of "checksum" bits sent, you can adjust between convenience to the user (i.e. detecting mistakes) and information leakage.
Finally, given your requirements, I suspect this convenience / leakage tradeoff is an inherent problem: Certainly, you could use a more sophisticated crypto challenge / response algorithm (as Nick ODell astutely suggests). However, doing so would require a separate round-trip request-- something you said you were trying to avoid in the first place.
[1] In a good crypto hash function, all output digits are well randomized due to avalanche effect, so the specific digits you choose don't particularly matter-- they're all effectively random.
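A minimal sketch of this scheme, assuming SHA-1 as above; the 4-bit truncation matches the example in the answer:

    import hashlib

    def ssn_checksum(ssn: str, bits: int = 4) -> int:
        # Keep only the lowest `bits` bits of the 160-bit digest; per [1],
        # which bits are used doesn't particularly matter.
        digest = hashlib.sha1(ssn.encode()).digest()
        return digest[-1] & ((1 << bits) - 1)

    # The server sends the expected value; the client recomputes it from the
    # user's input and flags a mismatch without revealing the stored SSN.
    expected = ssn_checksum("123-45-6789")
    entered = ssn_checksum("123-45-6798")   # transposed digits
    print(expected == entered)              # False with probability 15/16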
Simple solution: take the number mod 100001 as your checksum. There is roughly a 1/100,000 chance that you'll accidentally get the checksum right with the wrong number (and it is very resistant to one- or two-digit mistakes canceling out), and about 10,000 possible SSNs map to each checksum value, so you have not revealed the SSN to an attacker.
The only drawback is that those 10,000 other possible SSNs are easy to figure out. If an attacker can get the last 4 digits of the SSN from elsewhere, they can probably figure out the whole SSN. If you are concerned about this, then you should take the user's SSN, add a salt, and hash it, deliberately using an expensive hash algorithm. (You can just iterate a cheaper algorithm, like MD5, a fixed number of times to increase the cost.) Then use only a certain number of bits of the result. The point is that while someone can certainly go through all billion possible SSNs to come up with a limited list of possibilities, it will cost them more to do so, hopefully enough that they don't bother.
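A minimal sketch of both variants; the salt value and iteration count are illustrative:

    import hashlib

    def mod_checksum(ssn: str) -> int:
        # The 9-digit number mod 100001; ~10,000 SSNs share each value.
        return int(ssn.replace("-", "")) % 100001

    def slow_checksum(ssn: str, salt: str = "per-app-secret", bits: int = 16) -> int:
        # Iterate a cheap hash a fixed number of times to raise the cost
        # of brute-forcing all billion possible SSNs.
        data = (salt + ssn).encode()
        for _ in range(100_000):
            data = hashlib.md5(data).digest()
        return int.from_bytes(data[:4], "big") & ((1 << bits) - 1)

    print(mod_checksum("123-45-6789"))
    print(slow_checksum("123-45-6789"))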

What's a good set of heuristics for threading tweets?

Everyone knows, if you want to thread emails, you use Jamie Zawinski's algorithm. But it's a new century, and there's a new messaging service.
What's the best algorithm for threading status updates posted on Twitter?
Things I'd definitely like it to cope with:
The easy part: using in_reply_to_status_id, in_reply_to_user_id and in_reply_to_screen_name. (Incidentally, finding proper documentation of these values would be useful in itself! Such documentation isn't obviously linked to from here, for example.)
Good heuristics for inferring a "reply" relationship from messages that mention a user with the @ convention but aren't explicitly in reply to a particular message. These "mentions" are provided in the "entities" element of statuses now, if you request that. The heuristics might take into account (a) the time between two status updates, (b) whether there are subsequent replies between the two users, etc. (Replies that consist of an old-style retweet with an additional comment, as mentioned by user85509 below, are just an instance of this style of reply.)
Conversations that take place between more than two users.
Working with a set of tweets given to the algorithm, or all tweets on Twitter.
... but perhaps you can think of more.
Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.
The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
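A minimal sketch of this step, assuming each status is a dict with "id" and "in_reply_to_status_id" keys (None when absent), and a hypothetical fetch_status() helper for pulling linked tweets that aren't in the original set:

    def build_reply_forest(statuses, fetch_status):
        by_id = {s["id"]: s for s in statuses}
        children = {}                    # parent id -> list of reply ids
        frontier = list(statuses)
        while frontier:
            s = frontier.pop()
            parent = s.get("in_reply_to_status_id")
            if parent is None:
                continue
            if parent not in by_id:
                # Follow links even through tweets outside the original set.
                by_id[parent] = fetch_status(parent)
                frontier.append(by_id[parent])
            children.setdefault(parent, []).append(s["id"])
        return children                  # the near-certain reply trees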
Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line to match on; this is inevitably going to be very error-prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by mentions in a tweet, and then train a classifier to guess the best option, including a "no reply" option.
To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is: 1
@a @b no it isn't lol RT @c Yes, absolutely. /cc @stephenfry
... you would create a feature vector for the relationship between this update and every update with an earlier date in the timelines of @a, @b, @c, and @stephenfry over the last week (say), plus one between this update and a special "no reply" update. Then you have to fill in the feature vector; you can add to it whatever you would like, but I would at least suggest adding:
The time that elapsed between the two updates - presumably replies are more likely to be to recent updates.
The proportion of the way through the tweet, in words, at which the mention occurs; e.g. if it is the first word, the score is 0, and that is probably more likely to indicate a reply than a mention later in the update.
The number of followers of the mentioned user - celebrities are presumably more likely to be spam-mentioned.
The length of the longest common substring between the updates, which might indicate direct quoting.
Is the mention preceded by "/cc" or other signifiers that indicate that this isn't directly a reply to that person?
The following / followed ratio for the author of the original update.
etc., etc.
The more of these one can come up with, the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.
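A minimal sketch of the feature extraction, assuming each update is a dict with "created_at" (epoch seconds), "text", and "author_followers" fields; the field names are illustrative, and the resulting vectors (with hand-labelled reply / no-reply targets) can be fed to any random forest implementation, Weka as suggested above or scikit-learn's RandomForestClassifier in Python:

    def features(candidate, reply, mention_pos):
        # One vector per (candidate update, possibly-replying update) pair.
        words = reply["text"].split()
        return [
            reply["created_at"] - candidate["created_at"],  # recency of the candidate
            mention_pos / max(len(words) - 1, 1),           # 0.0 if the mention is the first word
            candidate["author_followers"],                  # celebrities attract spam mentions
            float("/cc" in reply["text"]),                  # "/cc" suggests not a direct reply
        ]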
Next one needs a training set. This can be small at first - just enough to get a service that identifies conversations up-and-running. To this basic service, one would have to add a nice interface for correcting mismatched or falsely linked updates, so that users can correct them. Using this data one can build a bigger training set and a more accurate classifier.
1 ... which might be typical of the level of discourse on Twitter ;)
On Twitter, people often write "RT" in front of the message they are replying to.

How to ensure correctness of data gathered via crowdsourcing?

I have a site where users are entering data of some products they buy.
How do I ensure the correctness of data entered via crowdsourcing (enabling users to vote on/edit products) while minimizing the amount of work that needs to be done by an administrator? I'm looking for how-tos, best practices, etc.
What sort of data are you collecting?
You're talking about crowd-sourcing, and thus (I assume) aggregating data across this crowd. As they're talking about products they buy, I suspect you're going to be gathering product attributes and prices.
Some possible approaches: if your users are entering non-numerical data (e.g. colours), just record the mode (the most commonly entered value).
If they're entering numeric data, discard outliers: bin the lowest and highest results and average the rest. (You could do this for prices, say; this is the approach that electronic exchanges use for resolving closing prices out of many trades.)
Depending on your application, you may want to have a historical bias towards the most recent entries.
But this all depends on your application, and how much storage and crunching of data you're prepared to do.
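A minimal sketch of both aggregation rules: the mode for categorical attributes and a trimmed mean for numeric ones:

    from statistics import mean, mode

    def aggregate_categorical(entries):
        return mode(entries)             # most commonly entered value

    def aggregate_numeric(entries, trim=1):
        # Bin the `trim` lowest and highest results, average the rest.
        kept = sorted(entries)[trim:len(entries) - trim]
        return mean(kept)

    print(aggregate_categorical(["red", "blue", "red"]))   # red
    print(aggregate_numeric([9.99, 10.49, 10.59, 24.99]))  # outlier 24.99 is discarded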
Make sure you keep a log of IP addresses with every action made; malicious users or bots would otherwise trample on session data or cookies. Doing this ensures that a single entity cannot skew results or do anything drastic by appearing to be multiple users.
At a high level, data can be gathered from the 'crowd' with an associated correctness value. Looking at SO, an answer or response from someone with 1000+ rep carries more weight than one from a casual user. Look for validation and triangulation: if it's a single voice in the crowd that you're listening to, it's probably not worth that much. If other voices join in, you know you're onto something; again, in SO terms, we all get a chance to upvote questions.
I've recently seen some really good iPhone apps which rely on crowdsourcing for their data, and then validate it by asking other users whether it's correct.
