How to ensure correctness of data gathered via crowdsourcing? - crowdsourcing

I have a site where users are entering data of some products they buy.
How do I ensure correctness of data entered via crowdsourcing (enabling users to vote/edit products) minimizing amount of work that needs to be done by administrator? I'm looking for some how-tos, best practices, etc.

What sort of data are you collecting?
You're talking about crowdsourcing, and thus (I assume) aggregating data across this crowd. As your users are entering products they buy, I suspect you're going to be gathering product attributes and prices.
Some possible approaches. If your users are entering non-numerical data (e.g. colours), just record the mode, i.e. the most commonly entered value.
If they're entering numeric data, discard outliers: bin the lowest and highest results and average the rest. You could do this for prices, say; this is the approach that electronic exchanges use for resolving closing prices out of many trades.
Depending on your application, you may want to have a historical bias towards the most recent entries.
But this all depends on your application, and how much storage and crunching of data you're prepared to do.
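A rough sketch of both ideas in Python (the trim fraction and the example values are illustrative assumptions, not anything from the question):

```python
from collections import Counter

def aggregate_categorical(entries):
    """Return the most commonly submitted value (the mode)."""
    if not entries:
        return None
    return Counter(entries).most_common(1)[0][0]

def aggregate_numeric(entries, trim=0.2):
    """Trimmed mean: drop the lowest/highest `trim` fraction, average the rest."""
    if not entries:
        return None
    values = sorted(entries)
    k = int(len(values) * trim)
    kept = values[k:len(values) - k] or values  # fall back if everything got trimmed
    return sum(kept) / len(kept)

# Colours reported for a product, and prices reported for it:
print(aggregate_categorical(["red", "red", "blue", "red"]))   # -> "red"
print(aggregate_numeric([9.99, 10.49, 10.50, 10.55, 49.00]))  # drops 9.99 and 49.00
```

The trimmed mean is cheap and reasonably robust; if you expect heavier abuse, a median is even harder for a single user to skew.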

Make sure you keep a log of IP addresses with every action made, since malicious users or bots can simply trample on session data or cookies (clearing or spoofing them). Doing this makes it much harder for a single entity to skew results or do anything drastic by appearing to be multiple users.
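As a minimal illustration of that kind of audit log (the table and column names are made up, and a real site would use its existing database rather than a local SQLite file):

```python
import sqlite3, time

db = sqlite3.connect("audit.db")
db.execute("""CREATE TABLE IF NOT EXISTS action_log (
                ts REAL, ip TEXT, user_id TEXT, action TEXT, target TEXT)""")

def log_action(ip, user_id, action, target):
    # Record who did what, from where, and when, for later abuse analysis.
    db.execute("INSERT INTO action_log VALUES (?, ?, ?, ?, ?)",
               (time.time(), ip, user_id, action, target))
    db.commit()

# e.g. find IPs that cast suspiciously many votes in the last hour
suspicious = db.execute("""SELECT ip, COUNT(*) FROM action_log
                           WHERE action = 'vote' AND ts > ?
                           GROUP BY ip HAVING COUNT(*) > 100""",
                        (time.time() - 3600,)).fetchall()
```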

At a high level, data can be gathered from the 'crowd' with an associated correctness value. Looking at SO, an answer or response from someone with 1000+ rep has more weight than one from a casual user. Look for validation and triangulation: if it's a single voice in the crowd that you're listening to, then it's probably not worth that much. If other voices join in, then you know you're onto something; again, in SO terms, we all get a chance to upvote questions.
I've recently seen some really good iPhone apps which rely on crowdsourcing for their data, and then validate it by asking other users if it's correct.
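As a hedged sketch of that weighting idea (the reputation-to-weight formula and the agreement threshold below are invented purely for illustration):

```python
from collections import defaultdict

def weighted_consensus(votes, min_voters=3):
    """votes: list of (reputation, value) pairs. Returns the winning value,
    or None until at least `min_voters` users have backed some value."""
    scores = defaultdict(float)
    counts = defaultdict(int)
    for rep, value in votes:
        weight = 1.0 + rep / 1000.0   # arbitrary: 1000+ rep roughly doubles a vote
        scores[value] += weight
        counts[value] += 1
    best = max(scores, key=scores.get)
    return best if counts[best] >= min_voters else None

votes = [(50, "4.7 inch"), (1200, "4.7 inch"), (10, "5 inch"), (300, "4.7 inch")]
print(weighted_consensus(votes))  # -> "4.7 inch"
```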

Related

Implementing Trending in Elasticsearch

I'm building a project that indexes celebrity-related content across sites (TMZ, People, etc.) because I always thought that it would be funny to "bet" on people (and maybe shows, directors, etc.) like horse racing or the stock market -- only, you know, not with real money -- where the value of the person changes day to day, hour to hour, and even minute to minute if we can figure this out together, Stack Overflow denizens.
I assign traffic values to users based on mentions in social media. I have some scrapers (probably violating some TOSes) and access to Twitter's API to get relative counts for search results for a time, so I have known "numbers" to associate w/ users outside of elasticsearch for periods of time to build the trends. Now to be clear, I am not looking to implement trending based on the number of documents in the system, that actually stays pretty consistent, but I need to rank documents that already exist based on trends.
So that's what I've got: a few hundred thousand articles with pre-determined associations to individual celebrities. Data for on-the-minute associations of a score to those celebrities which are then merged and applied to each article so that each article has a few scores associated (there's some complexity here that does not matter, but the bottom line is that I have 10 or so values that I want to assign to content to sort it when you're on the market page and I want to sort those w/ a function or script score).
So the question: How the heck do I assign these values without making elasticsearch go crazy with re-indexing? I need to use these values to sort dozens of requests per second coming from feeds on the site, but I am running this on a raspberry pi... literally, I've maxed the poor thing out for memory.
We're real write heavy, but if for some reason the celebrity stock market takes off, we're also real read heavy at the same time. I swear I remember a plugin that had metadata associated with content, but I cannot find it.
I've tried enabled=false and index=false, but they still seem to thrash the read times while writing the updates. The best I've gotten to is slowing down the refresh_interval, but that's still pretty expensive and starts to affect the "real-time" nature of the app.
I believe that this is impossible as you've laid it out. Any updates to a field will update _source and fire the full update process.
There are some alternatives that you might consider:
Replication, if another cluster is available
A separate write index on the same cluster, space allowing
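One hedged sketch of the separate-write-index idea: keep the volatile scores in a tiny side index that you rewrite as often as you like, fetch them once per request, and rank the article hits in application code, so the article index itself is never updated. The index names, fields, and local URL are assumptions; this uses plain REST calls via `requests`:

```python
import requests

ES = "http://localhost:9200"  # assumed local node

def latest_scores():
    # Small, frequently rewritten index holding one doc per celebrity.
    r = requests.post(f"{ES}/celebrity_scores/_search",
                      json={"size": 1000, "query": {"match_all": {}}})
    hits = r.json()["hits"]["hits"]
    return {h["_source"]["celebrity_id"]: h["_source"]["score"] for h in hits}

def trending_articles(size=20):
    scores = latest_scores()
    # The articles themselves are only ever read, never updated.
    r = requests.post(f"{ES}/articles/_search",
                      json={"size": 500, "query": {"match_all": {}}})
    articles = [h["_source"] for h in r.json()["hits"]["hits"]]
    # Rank in the application by the summed scores of the mentioned celebrities.
    articles.sort(key=lambda a: sum(scores.get(c, 0.0)
                                    for c in a.get("celebrity_ids", [])),
                  reverse=True)
    return articles[:size]
```

Whether the application-side sort is affordable depends on how many candidate articles each feed request has to consider, but it keeps all the churn out of the big index.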

User Search Strategy Mobile Dev

I want to implement a search feature on my app that re-filters upon each new character entered into the search bar so users can search for other users. This is a fairly common feature on apps, but as a beginner it would seem like a very computationally complex process. It would seem that one of two things happen:
For each new character typed, the frontend queries the backend, which applies filter and returns.
The frontend loads all (or many) possible results beforehand and updates filter on the stored info as new characters are entered.
It would seem that 1) would have time complexity issues, as it makes O(n) queries (where n is the number of characters) per search. This is especially problematic because it's expected that the filtered search results update near instantaneously. Additionally, my average query time is probably slower than most, as I'm using a three-tier architecture (frontend<->server<->graph database).
I don't like 2)--at least in its straightforward form--as the number of possible results can get very large. We can reduce the space complexity of this by querying only for a limited set of user attributes (perhaps only uid and name, and fetching details on the fly if needed), but the point remains.
Things get more interesting if we modify 2) to load only a sample of users (and here we can use data like Location as well as ML/AI to select). The problem with this is that the searching user could always be looking for someone we didn't select. It would be a horrible (even if rare) experience for a user to know their friend was on the app but was unable to find them because our algorithm was only accurate for 99% of searches.
I am sure this is possible--other apps seem to pull it off--so what am I missing?
First, you should avoid querying the server for each character typed. Most of the time the user types a bunch of chars very fast without looking at the suggested results, especially because with only a few chars the results wouldn't be specific enough. Most autocompletion systems adopt both of the following (see the sketch after this list):
query only if the string is at least 2-3 chars long;
query only when the user has paused typing, i.e. roughly 300ms after the last keystroke.
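In a real app this logic lives in the frontend, but here is a minimal sketch of the two rules; `search_backend` is a placeholder for your actual query call, and the 2-char minimum and 300 ms delay are simply the values suggested above:

```python
import threading

MIN_CHARS = 2
DEBOUNCE_SECONDS = 0.3
_pending = None

def search_backend(text):
    print(f"querying server for {text!r}")  # placeholder for the real request

def on_keystroke(text):
    """Call this every time the search box content changes."""
    global _pending
    if _pending is not None:
        _pending.cancel()                 # user is still typing: drop the queued query
    if len(text) < MIN_CHARS:
        return                            # too short to be worth a round trip
    _pending = threading.Timer(DEBOUNCE_SECONDS, search_backend, args=(text,))
    _pending.start()

# Simulated fast typing: only the last prefix survives the debounce window.
for prefix in ("S", "St", "Ste", "Stef"):
    on_keystroke(prefix)
```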
To get all the pertinent results without huge data transfer you could implement progressive data loading: load just enough results to fill the page height, then load more as the user scrolls down. However, if the result count gets very high you should stop retrieving results and ask the user to type a more specific search.
If you want to make your users happy, try to sort the results by relevance. For example, if you know where the users are located you may sort the results by distance: if I live in Italy and I search for "Ste", it is more likely my friend is Stefano who lives in Rome than Steve who lives in NY.

Is there an algorithm for anonymous, changeable, secure voting?

I'd like to implement a feedback mechanism in my application--basically, a score. The requirements are:
A total exists, and can be read
A user can add his score to the total
A user cannot add a second score, but could change his original score, again updating the total by removing (subtracting) the original score, and adding the new one.
It is impossible to determine what a given user's vote was
It seems that this borders on (or even overlaps) cryptography theory, but I haven't been able to find anything that would address this. Does anyone have any specific algorithms that would address this? Or even additional search vectors I could use to pursue it?
If there is an anonymous ID, such as a hash of a value that the user supplies, then anyone who can produce something that yields the same hash could modify the corresponding vote.
In this sense, there is still anonymity, because the hash doesn't reveal the source. Instead of listing (userName, vote), list (hashValue, vote). If there is some concern that tracking the hashValue is traceable across many polls, then encode an additional poll-specific wrapping for the hash, which is not revealed publicly. Or let the user embed (e.g. prepend) that into their string to be hashed, so they are still producing a unique submission.
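A minimal sketch of that scheme (the class name, the poll-specific salt, and the secret phrases are all illustrative):

```python
import hashlib

class AnonymousPoll:
    def __init__(self, poll_salt):
        self.poll_salt = poll_salt   # poll-specific wrapping, not revealed publicly
        self.votes = {}              # hashValue -> score

    def _key(self, secret):
        return hashlib.sha256((self.poll_salt + secret).encode()).hexdigest()

    def cast(self, secret, score):
        """First call records a score; later calls with the same secret replace it."""
        self.votes[self._key(secret)] = score

    def total(self):
        return sum(self.votes.values())

poll = AnonymousPoll(poll_salt="poll-42")
poll.cast("my private phrase", 4)
poll.cast("someone else's phrase", 2)
poll.cast("my private phrase", 5)   # vote change: the old score is replaced
print(poll.total())                 # -> 7
```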
You can never have anonymous voting without the ability to trust that the anonymous individuals will not vote twice. By definition, true anonymity guarantees that you can never detect duplicate voting.
If you instead force the user to identify themself, you can implement a voting system that prevents duplicate voting and provides anonymity within the context of the vote.
Here is a simple algorithm.
User logs in. The onus is on your system to prevent one user from obtaining multiple user accounts.
User (not anonymous) selects an issue on which to vote.
User (not anonymous) casts a vote.
Your system stores the following:
An indication that the user voted on the selected issue. This prevents duplicate voting.
The value of the user's vote on the selected issue (this is the score you mentioned). This value is stored without reference to the user who cast the vote.
The value of the user's score if they voted on an issue; you probably need this to be a calculated value.
If the user wants to change their vote, they login, select the issue, then unvote (your system knows they voted because it stored this). At this point they can choose the issue again (their vote indication was cleared) and vote.
Note that your system will need to subtract the value of the user's vote from the tally for the issue when they unvote.
You don't give enough information on what a legal vote is, but if it's, say, an integer, then you can just keep a sum and allow multiple votes. This works because changing a vote from A to B has the exact same effect as voting A and then voting (B - A).
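A tiny sketch of that trick; splitting it into a server that only stores a running total and a client that remembers its own previous vote is just one way to frame it:

```python
class DeltaVoting:
    """The server stores nothing but the running total."""
    def __init__(self):
        self.total = 0

    def submit(self, delta):
        self.total += delta

class Voter:
    """Client side: remember your own previous vote and send only the difference."""
    def __init__(self, poll):
        self.poll, self.previous = poll, 0

    def vote(self, value):
        self.poll.submit(value - self.previous)
        self.previous = value

poll = DeltaVoting()
alice, bob = Voter(poll), Voter(poll)
alice.vote(3)
bob.vote(5)
alice.vote(4)        # change from 3 to 4: submits +1
print(poll.total)    # -> 9
```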
Actually, online voting is pretty tricky.
If you want the most extreme approach to voting safety, you may need to consider something like this:
https://docs.google.com/document/d/1SPYFAkVNjqDP4HOt_A_YGFZy-SFXVxHoN1hpLGNFKXI/pub
It is an algorithm that distributes the voting secret among n distinct servers, none of which can break the voting anonymity by itself. All n servers would have to cooperate in order to break anonymity, and if even one of the servers covers its tracks and wipes all its cryptographic data away, the voting secret is lost/hidden forever.
The system can also deal with re-sending of votes, with some limitations inherent to any secure system for online voting:
For voting security online there is always an ultimate limitation in that it is vulnerable to traffic analysis. For example, if only one person votes on a given day, it can be concluded that any update of the voting result is the result of that person's vote.
A perfectly secure online voting system should be viewed as a one-time vote mixer: it takes a number of votes, buffers them, and when the voting is finally closed, mixes all of them in one go. This makes it extremely difficult to associate a vote with a voter. This can be achieved with pretty solid technology.
However, when we want to update votes things get much more tricky. There would be an intrinsic need for synchronization if we want to avoid the possibility of traffic analysis. Ideally all voters would have to re-send an update at regular intervals (even if their update is actually not an update).

What is the design & architecture behind facebook's status update mechanism?

I'm planning on creating a social network and I don't think I quite understand how the status update module of Facebook is designed. Hoping I can find some help here. At the algorithmic and data-structure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!
Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends requests to all the leaf servers for your friends and ranks the results. The aggregator knows which servers to send requests to based on the userid of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
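As a toy sketch of that fan-out-and-merge step (the leaf layout, the RPC stub, and the ranking rule are invented to show the shape of the design, not Facebook's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

LEAF_SERVERS = ["leaf-0", "leaf-1", "leaf-2", "leaf-3"]   # assumed cluster layout

def leaf_for(user_id):
    # The aggregator knows which leaf holds a user's items, e.g. by hashing the id.
    return LEAF_SERVERS[hash(user_id) % len(LEAF_SERVERS)]

def fetch_recent_items(leaf, user_id, limit=10):
    # Placeholder for an RPC to the leaf server; returns (timestamp, item) pairs.
    return [(0, f"{user_id}'s latest item, served by {leaf}")]

def build_feed(friend_ids, limit=50):
    # Fan out to every leaf that holds a friend's items, in parallel...
    with ThreadPoolExecutor(max_workers=16) as pool:
        batches = list(pool.map(lambda f: fetch_recent_items(leaf_for(f), f),
                                friend_ids))
    # ...then merge and rank (here: newest first) on the aggregator.
    items = [item for batch in batches for item in batch]
    items.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in items[:limit]]

print(build_feed(["alice", "bob", "carol"]))
```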
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB uses MySQL mostly as a key-value store; JOINing tables is just impossible at their scale. They then put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.

What should I do with an over-bloated select-box/drop-down

All web developers run into this problem when the amount of data in their project grows, and I have yet to see a definitive, intuitive best practice for solving it. When you start a project, you often create forms with <select> tags to help pick related objects for one-to-many relationships.
For instance, I might have a system with Neighbors, and each Neighbor belongs to a Neighborhood. In version 1 of the application I create an edit form that has a drop-down for selecting the neighborhood, which simply lists the 5 possible neighborhoods in my geographically limited application.
In the beginning, this works great. So long as I have maybe 100 records or less, my select box will load quickly and be fairly easy to use. However, let's say my application takes off and goes national. Instead of 5 neighborhoods I have 10,000. Suddenly my little drop-down takes forever to load, and once it loads, it's hard to find your neighborhood in the massive alphabetically sorted list.
Now, in this particular situation, having hierarchical data and letting users drill down using several dynamically generated drop-downs would probably work okay. However, what is the best solution when the objects/records being selected are not hierarchical in nature? In the past, I've done this with a popup containing a search box and a list, but this seems clunky and dated. In today's web 2.0 world, what is a good way to find one object amongst many for one's forms?
I've considered using an Ajaxified search box, but this seems to work best for free text, and falls apart a little when the data to be saved is just a reference to another object or record.
Feel free to cite specific libraries with generic solutions to this problem, or simply share what you have done in your projects in a more general way.
I think an auto-completing text box is a good approach in this case. Here on SO, they also use an auto-completing box for tags where the entry already needs to exist, i.e. not free-text but a selection. (remember that creating new tags requires reputation!)
I personally prefer this anyway, because I can type faster than I can select something with the mouse, but that is programmer's disease I guess :)
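A minimal sketch of the server-side half of that idea (the data, the 2-character minimum, and the result limit are placeholders): the matching happens against names, but the form ultimately saves the selected object's id, so what gets stored is still a reference rather than free text.

```python
NEIGHBORHOODS = [(1, "Brooklyn Heights"), (2, "Back Bay"), (3, "Barrio Logan"),
                 (4, "Capitol Hill"), (5, "Castro")]   # illustrative (id, name) rows

def suggest(prefix, limit=10):
    """Return up to `limit` (id, name) pairs whose name matches the typed prefix."""
    prefix = prefix.strip().lower()
    if len(prefix) < 2:                    # don't bother for 0-1 characters
        return []
    matches = [(nid, name) for nid, name in NEIGHBORHOODS
               if name.lower().startswith(prefix)]
    return matches[:limit]

print(suggest("ba"))   # -> [(2, 'Back Bay'), (3, 'Barrio Logan')]
```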
Auto-complete is usually the best solution in my experience for searches, but only where the user is able to provide text tokens easily, either as part of the object name or taxonomy that contains the object (such as a product category, or postcode).
However this doesn't always work, particularly where 'browse' behavior would be more suitable - to give a real example, I once wrote a page for a community site that allowed a user to send a message to their friends. We used auto-complete there, allowing multiple entries separated by commas.
It works great when you know the names of the people you want to send the message to, but we found during user acceptance that most people didn't really know who was on their friend list and couldn't use the page very well - so we added a list popup with friend icons, and that was more successful.
(this was quite some time ago - everyone just copies Facebook now...)
Different methods of organizing large amounts of data:
Hierarchies
Spatial (geography/geometry)
Tags or facets
Different methods of searching large amounts of data:
Filtering (including autocomplete)
Sorting/paging (alphabetically-sorted data can also be paged by first letter)
Drill-down (assuming the data is organized as above)
Free-text search
Hierarchies are easy to understand and (usually) easy to implement. However, they can be difficult to navigate and lead to ambiguities. Spatial visualization is by far the best option if your data is actually spatial or can be represented that way; unfortunately this applies to less than 1% of the data we normally deal with day-to-day. Tags are great, but - as we see here on SO - can often be misused, misunderstood, or otherwise rendered less effective than expected.
If it's possible for you to reorganize your data in some relatively natural way, then that should always be the first step. Whatever best communicates the natural ordering is usually the best answer.
No matter how you organize the data, you'll eventually need to start providing search capabilities, and unlike organization of data, search methods tend to be orthogonal - you can implement more than one. Filtering and sorting/paging are the easiest, and if an autocomplete textbox or paged list (grid) can achieve the desired result, go for that. If you need to provide the ability to search truly massive amounts of data with no coherent organization, then you'll need to provide a full textual search.
If I could point you to some list of "best practices", I would, but HID is rarely so clear-cut. Use the aforementioned options as a starting point and see where that takes you.

Resources