What is a reliable method to record votes from anonymous users, without allowing duplicates - session

First of all, I searched as best I could and read all SO questions that seem relevant, but nothing specifically answered this. This is not a duplicate, afaik.
Obviously if anonymous voting on a website is allowed, there is no fool proof way to prevent someone voting more than once.
However, I am wondering if someone with experience can aide me in coming up with a reasonably reliable way of tracking absolutely unique visitors and recording votes against those credentials.
Currently I am ensuring that only one vote per item/session combo is allowed, however this is easily circumvented by restarting browser, changing browsers/computers, or clearing your session data.
Recording against IP seems the next reasonable solution but I wonder if this will get false positives too often (multiple people on same LAN behind a NAT will have same external IP, etc).
Is there a middle ground to be had here or some other method/combination I am overlooking?

I'd collect as much data about the session as possible without asking any questions directly (browser, OS, installed plugins, all with versions numbers, IP address etc) and hash it.
Record the hash and increment a counter if you want multiple votes to be allowed. Include a timestamp (daily, hourly etc) in the salt to make votes time sensitive, say 5 votes per day.

The simplest answer is to use a cookie. Obviously it's vulnerable to people clearing their cookies, but anonymous voting is inherently approximate anyway.
In practice, unless the topic being voted on is in some way controversial or inflammatory, people aren't going to have a motive behind rigging the vote anyway.
IP is more 'reliable' but will produce an unacceptably high level of collisions due to NATs.
How about a more unique identifier composed of IP + user-agent (maybe a hash)? That effectively means for each IP, each exact OS/browser version pair gets 1 vote, which is a lot closer to 1 vote per person. Most browsers provide detailed version information in the user-agent -- I'm not sure, but my gut feel is that this would prevent the majority of collisions caused by NATs.
The only place that would still produce lots of collisions is a corporate environment with a standardised network, where everyone is using an identical machine.

The Chinese have to share one IPv4 address with hundreds of others; Hp/Compaq/DEC has almost 50 million addresses. IPv6 doesn't help as everyone get addresses by the billion. A person just is not the same as an IP address, and that notion is becoming ever more false.
There are just no proper ways to do this on the Internet. Persons are simply a concept unknown on the Internet, and any idea to introduce the concept is unlikely to succeed. (Too many governments would not want this to happen, for instance.)
Of course, you can relate the amount of votes per IP to the amounf of repeat page visits from that IP, especially in combination with cookie tracking. This works best if you estimate that number before you start the voting period. If the top 5% popular articles are typically read 10 times from a single IP, it's likely 10 people share that IP and they should get 10 votes. Cookies can be used to prevent them from stealing each others vote, but on the whole they can't skew your poll. (Note: this fails in small communities where a large group of voters come from a small number of IPs, in particular this happens around universities).

If you're not looking at authenticating voters, then you're going to be getting some duplicate votes no matter what you use. I'd use a cookie, and have done with it for the anonymous users.
UserVoice allows both anonymous voting and voting when logged in, but then allows the admin to filter out anonymous votes - a nice solution to this problem.

Anything based on IP addresses isn't an option - the case of NAT has been mentioned, but this seems to only be in the case of home users. There are many larger installations that use NAT - some corporations can have thousands of users pooled behind a single IP address. There are also ISP's that use proxy servers for their users - another case where you can have many thousands of users appear to your application as a single address. Adding unique UA combinations to this won't help, as there isn't enough variation.
A persistent cookie is going to be your best bet - and you'll have to live with the fact that it is easy to game. At least when the cookie is persistent (as opposed to session based) you'll catch the majority of users who run a single browser.
If you really want to rely on the results, you are going to have to add some form of identification in the process (like e-mail validation, which is still gameable).
At the end of the day any internet survey is going to have flaws (like: http://www.time.com/time/arts/article/0,8599,1894028,00.html), and you'll have to live with this.

Use a persistent cookie to allow only one vote per item
and record the IP, if there are more than 100 (1,000? 10,000?) requests in less than X mins then "soft block" the IP
The "soft block": dont show a page saying "your IP has been blocked" but show your "thank you for your vote" page and DONT record the vote in your DB. You even can increase the counter for that IP only. You want to prevent them to know that you are blocking their IP.

Two ideas not mentioned yet are:
Asking for the user's email address and emailing them a verification link
Using a captcha
Obviously the former can be circumvented with disposable email addresses and so on, but gives you an audit trail, and provides a significant hurdle to casual/bot vote-stuffing. A good captcha likewise severely limits vote-stuffing, but with all the usual caveats surrounding their use.

I have the same problem, and here's what I am planning on doing...
Set a persistent cookie. Check the cookie to decide whether a particular vote could be cast.
Additionally store some data about the vote request in the form of a combination of IP address + User Agent. And then use this value to limit the no. of votes to, say, 10 per day.
What is the best way of going about creating this hash (IP + UA String)?

Related

How to correlate the below given scenario for check boxes?

in my script i have a scenario like the page contains multiple check boxes for example 10, as per the user need user selects check boxes for example one user selects 4 check boxes and other user clicks 5 check boxes, so per each it will vary.
so how to correlate those values,
thanking you.
From the website: "Please don’t share your solutions, ask for help, or help others. This is meant to be a challenge."
So you appear to be violating one of the primary rules in this website. I have looked at this challenge and it's really good to gauge someone's knowledge.
However, to address technology generally - in reading your question I get the sense you may be missing certain fundamental knowledge for this kind of thing. Here's some fundamental knowledge. Hopefully my answer will help increase your knowledge. And hopefully you can use this increased general knowledge to address this specific question.
Definitions:
Correlation - you're taking data the SERVER sends to the browser, capturing it and sending it back. Information present on web pages would fit into this category.
Parameterization - you've got a set of values you'd like to put into web forms. This is usually values like names, addresses, etc
Also understand exactly what is happening when you conduct certain actions on your browser. When you "click" a checkbox does that actually send a message to a server? That usually doesn't (though not always) happen. So when you use phrases like 'click a checkbox' that tells me you may not appreciate the fact that performance testing is server focused, not browser focused.
Performance testing isn't intuitive so you need to understand these concepts. If you dedicate time to understanding the concepts I've outlined above you'll have the knowledge to complete the challenge.
Good luck.
What is driving the variation on check boxes being checked? Is it the result of something that comes back from the server, from a previous request? Or is it somewhat random based on whatever the user wants to do at runtime?

validating a poll for user to vote once

Creating a public poll.how do you validate that the user only vote once.I tried using an IP address but some organizations use 1 IP address.
It's no 100% solution, but you can use a browser fingerprint in combination with the IP adress. See this site for some usable and easily gettable browser properties.
Disadvantages: Some people may be left out (especially in large organizations with a very restrictive and thus homogeneous infrastructure), others may vote twice, for example by using different browsers.
If you want a 95% solution you have to require people to sign up with their email adress and proving that they received the email by klicking an embedded link, but depending on how much interest they take in voting, it may scare off a lot of potential voters.
A 100% solution for this problem does not exist, as far as I'm aware of.
Edit: Cookies are another obvious choice if you don't care too much about people gaming the poll system (just writing an auto-voter that ignores the cookies you send it).

Techniques to reduce data harvesting from AJAX/JSON services

I was wondering if anyone had come across any techniques to reduce the chances of data exposed through JSON type services on the server (intended to supply AJAX functions) from being harvested by external agents.
It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source.
Has anybody implemented a clever technique here?
Whatever the method, it should still allow a genuine AJAX function to consume the data.
Note that I'm not really talking about protecting 'sensitive' information here, the odd record leaking out is not a problem. Rather I am thinking about stopping a situation where the whole DB is hoovered up by bots (either in one go, or gradually over time).
Thanks.
First, I would like to clear on this:
It seems to me that the problem is not
so difficult if you had say a Flash
client consuming the data. Then you
could send encrypted data to the
client, which would know how to
decrypt it. The same method seems
impossible with AJAX though, due to
the open nature of the Javascrip
source.
It will be pretty obvious the information is being sent encrypted to the flash client & it won't be that hard for the attacker to find out from your flash compiled program what's being used for this - replicate & get all that data.
If the data does happens to have the value you are thinking, you can count on the above.
If this is public information, embrace that & don't combat it - instead find ways to capitalize on it.
If this is information that you are only exposing to a set of users, make sure you have the corresponding authentication / secure communication. Track usage as others have said, and have measures that act on it,
The first thing to prevent bots from stealing your data is not technological, it's legal. First, make sure you have the right language in your site's Terms of Use that what you're trying to prevent is actually disallowed and defensible from a legal standpoint. Second, make sure you design your technical strategy with legal issues in mind. For example, in the US, if you put data behind an authentication barrier and an attacker steals it, it's likely a violation of the DMCA law. Third, find a lawyer who can advise you on IP and DMCA issues... nice folks on StackOverflow aren't enough. :-)
Now, about the technology:
A reasonable solution is to require that users be authenticated before they can get access to your sensitive Ajax calls. This allows you to simply monitor per-user usage of your Ajax calls and (manually or automatically) cancel the account of any user who makes too many requests in a particular time period. (or too many total requests, if you're trying to defend against a trickle approach).
This approach of course is vulnerable to sophisticated bots who automatically sign up new "users", but with a reasonably good CAPTCHA implementation, it's quite hard to build this kind of bot. (see "circumvention" section at http://en.wikipedia.org/wiki/CAPTCHA)
If you are trying to protect public data (no authentication) then your options are much more limited. As other answers noted, you can try IP-address-based limits (and run afoul of large corporate proxy users) but sophisticated attackers can get around this by distributing the load. There's also likley sophisticated software which watches things like request timing, request patterns, etc. and tries to spot bots. Poker sites, for example, spend a lot of time on this. But don't expect these kinds of systems to be cheap. One easy thing you can do is to mine your web logs (e.g. using Splunk) and find the top N IP addresses hitting your site, and then do a reverse-IP lookup on them. Some will be legitimate corporate or ISP proxies. But if you recognize a compeitor's domain name among the list, you can block their domain or follow up with your lawyers.
In addition to pre-theft defense, you might also want to think about inserting a "honey pot": deliberately fake information that you can track later. This is how, for example, maps manufacturers catch plaigarism: they insert a fake street in their maps and see which other maps show the same fake street. While this doesn't prevent determined folks from sucking out all your data, it does let you find out later who's re-using your data. This can be done by embedding unique text strings in your text output, and then searching for those strings on Google later (assuming your data is re-usable on another public website). If your data is HTML or images, you can include an image which points back to your site, and you can track who is downloading it, and look for patterns you can use to bust the freeloaders.
Note that the javascript encryption approach noted in one of the other answers won't work for non-authenticated sessions-- an attacker can simply download the javascript and run it just like a regular browser would. Moral of the story: public data is essentially indefensible. If you want to keep data protected, put it behind an authentication barrier.
This is obvious, but if your data is publicly searchable by search engines, you'll both need a non-AJAX solution for them (Google won't read your ajax data!) and you'll want to mark those pages NOARCHIVE so your data doesn't show up in Google's cache. You'll also probably want a white list of search engine crawler IP addreses which you allow into your search-engine-crawlable pages (you can work with Google, Bing, Yahoo, etc. to get these), otherwise malicious bots could simply impersonate Google and get your data.
In conclusion, I want to echo #kdgregory above: make sure that the threat is real enough that it's worth the effort required. Many companies overestimate the interest that other people (both legitimate customers and nefarious actors) have in their business. It might be that yours is an oddball case where you have particularly important data, it's particularly valuable to obtain, it must be publicly accessible without authentication, and your legal recourses will be limited if someone steals your data. But all those together is admittedly an unusual case.
P.S. - another way to think about this problem which may or may not apply in your case. Sometimes it's easier to change how your data works which obviates securing it. For example, can you tie your data in some way to a service on your site so that the data isn't very useful unless it's being used in conjunction with your code. Or can you embed advertising in it, so that wherever it's shown you get paid? And so on. I don't know if any of these mitigations apply to your case, but many businesses have found ways to give stuff away for free on the Internet (and encourage rather than prevent wide re-distribution) and still make money, so a hybrid free/pay strategy may (or may not) be possible in your case.
If you have an internal Memcached box, you could consider using a technique where you create an entry for each IP that hits your server with an hour expiration. Then increment that value each time the IP hits your AJAX endpoint. If the value gets over a particular threshold, fry the connection. If the value expires in Memcached, you know it isn't getting "hoovered away".
This isn't a concrete answer with a proof of concept, but maybe a starting point for you. You could create a javascript function that provides encryption/decryption functions. The javascript would need to be built dynamically, and you would include an encryption key that is unique to the session. On the server side, you'd have an encryption service that uses the key from the session to encrypt your JSON before delivering it.
This would at least prevent someone from listening to your web traffic, pulling information out of your database.
I'm with kdgergory though, it sounds like your data is too open.
Some techniques are listed in Further thoughts on hindering screen scraping.
If you use PHP, Bad behavior is a nice tool to help. If you don't use PHP, it can give some ideas on how to filter (see How it works page).
Incredibill's blog is giving nice tips, lists of User-agents/IP ranges to block, etc...
Here are a variety of suggestions:
Issue tokens required for redemption along with each AJAX request. Expire the tokens.
Track how many queries are coming from each client, and throttle excessive usage based on expected normal usage of your site.
Look for patterns in usage such as sequential queries, spikes in requests, or queries that occur faster than a human could conduct.
Check user-agents. Many bots don't completely replicate the user agent info of a browser, and you can eliminate programatic scraping of your data using this method.
Change the front-end component of your website to redirect to a captcha (or some other human verifying mechanism) once a request threshold is exceeded.
Modify your logic so the respsonse data is returned in a few different ways to complicate the code required to parse.
Obsfucate your client-side javascript.
Block IPs of offending clients.
Bots usually doesn't parse Javascript, so your ajax code won't be instantly executed. And if they even do, bots usually doesn't maintain sessions/cookies as well. Knowing that, you could reject the request if it is invoked without a valid session/cookie (which is obviously set on the server side beforehand by the request on the parent page).
This does not protect you from human hazard though. The safest way is to restrict access to users with a login/password. If that is not your intent, well, then you have to live with the fact that it's a public application. You could of course scan logs and maintian blacklists with IP addresses and useragents, but that goes extreme.

How to make sure the visitor is unique

Say you have a pay-site with some online courses. And you want to make sure that one person doesn't just buy access, and then give the username and password to all his friends, so they can do the courses for free.
How would you go about this?
What we've thought of so far:
IP tracking
SMS password for each entry
Max number of runs through each course
Any other suggestions?
It's impossible to get a perfect system to do what you want. You find yourself in a situation where the stronger you make your protection (to defend against cheating customers), the more you annoy all your customers (including the honest ones).
You're going to have to ask yourself at what point the extra protection actually reduces the value of your site to the point that you're losing more honest customers than you're winning customers by converting cheaters into honest (paying) customers.. It might well be that the optimal thing to do is to use cookies, and only take remedial action if you see two concurrent sessions from different IP addresses, since that's fairly likely to be caused by cheating (though not guaranteed; it could be a dual-homed customer).
There's no way you can absolutely, positively guarantee that users are unique - even if you had some way to uniquely identify users, like biometric data (which you don't), you'd still be unable to be certain the the client wasn't just spoofing that information.
The best you can hope to do is make it a hassle for someone to "cheat" the system. IP+SMS would probably do that, although it'd also probably annoy the heck out of your users (at least, the latter part).
Your best bet is probably just to log IPs used for each account - if the number goes above a certain threshold, flag it for inspection, and close the account if it looks like the info is being widely shared.
Associate an IP address with a cookie. Then associate that cookie with the user account and require use of the cookie to login. If the user logs in in with a different IP address then associate that new IP address with the cookie and ask for some sort of verification to authenticate the user.
There is no 100% guarantee at all. Someone can just sit next to the user who bought the access and read the site over his shoulder. Your methods are good (but I personally think that SMS-authorization is a little too much), but I'd suggest maximum personalization of the information you provide, so nobody except the payer can benefit from accessing it.
I'm sure some people would try to use cookies (assuming users don't change computers)

How do you perform address validation? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
The community reviewed whether to reopen this question 11 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
Is it even possible to perform address (physical, not e-mail) validation? It seems like the sheer number of address formats, even in the US alone, would make this a fairly difficult task. On the other hand it seems like a task that would be necessary for several business requirements.
Here's a free and sort of "outside the box" way to do it. Not 100% perfect, but it should reject blatantly non-existent addresses.
Submit the entire address to Google's geocoding web service. This service attempts to return the exact coordinates of the location you feed it, i.e. latitude and longitude.
In my experience if the address is invalid you will get a result of 602 from the service. There's definitely a possibility of false positives or false negatives, but used in conjunction with other consistency checks it could be useful.
(Yahoo's geocoding web service, on the other hand, will return the coordinates of the center of the town if the town exists but the rest of the address is bogus. Potentially useful as long as you pay close attention to the "precision" field in the result).
There are a number of good answers in here but most of them make the assumption that the user wants an "API" solution where they must write code to connect to a 3rd-party service and/or screen scrape the USPS. This is all well and good, but should be factored into the business requirements and costs associated with the implementation and then weighed against the desired benefits.
Depending upon the business requirements and the way that the data is received into the system, a real-time address processing solution may be the best bet. If a real-time solution is required, you will want to consider the license agreement and technical limitations of the Google Maps/Bing/Yahoo APIs. They typically limit the number of calls you can make each day. The USPS web tools API is the same in additional they restrict how/why you can use their system and how you are allowed to use the data thereafter.
At the same time, there are a handful of great service providers that can easily process a static list of addresses. Essentially, you give the service provider a CSV file or Excel file, they clean it up and get it back to you. It's a one-time deal with no long-term commitment or obligation—usually.
Full disclosure: I'm the founder of SmartyStreets. We do address verification for addresses within the United States. We are easily able to CASS certify a list and we also offer a address verification web service API. We have no hidden fees, contracts, or anything. You use our service until you no longer need it and you can walk away. (Unlike cell phone companies that require a contract.)
USPS has an address cleaner online, which someone has screen scraped into a poor man's webservice. However, if you're doing this often enough, it'd be a better idea to apply for a USPS account and call their own webservice.
I will refer you to my blog post - A lesson in address storage, I go into some of the techniques and algorithms used in the process of address validation. My key thought is "Don't be lazy with address storage, it will cause you nothing but headaches in the future!"
Also, there is another StackOverflow question that asks this question entitled How should international geographic addresses be stored in a relational database.
In the course of developing an in-house address verification service at a German company I used to work for I've come across a number of ways to tackle this issue. I'll do my best to sum up my findings below:
Free, Open Source Software
Clearly, the first approach anyone would take is an open-source one (like openstreetmap.org), which is never a bad idea. But whether or not you can really put this to good and reliable use depends very much on how much you need to rely on the results.
Addresses are an incredibly variable thing. Verifying U.S. addresses is not an easy task, but bearable, but once you're going for Europe, especially the U.K. with their extensive Postal Code system, the open-source approach will simply lack data.
Web Services / APIs
Enterprise-Class Software
Money gets it done, obviously. But not every business or developer can spend ~$0.15 per address lookup (that's $150 for 1,000 API requests) - a very expensive business model the vast majority of address validation APIs have implemented.
What I ended up integrating: streetlayer API
Since I was not willing to take on the programmatic approach of verifying address data manually I finally came to the conclusion that I was in need of an API with a price tag that would not make my boss want to fire me and still deliver solid and reliable international verification results.
Long story short, I ended up integrating an API built by apilayer, called "streetlayer API". I was easily convinced by a simple JSON integration, surprisingly accurate validation results and their developer-friendly pricing. Also, 100 requests/month are entirely free.
Hope this helps!
I have used the services of http://www.melissadata.com Their "address object" works very well. Its pricey, yes. But when you consider costs of writing your own solutions, the cost of dirty data in your application, returned mailers - lost sales, and the like - the costs can be justified.
For us-based address data my company has used GeoStan. It has bindings for C and Java (and we created a Perl binding). Note that it is a commercial product and isn't cheap. It is quite fast though (~300 addresses per second) and offers features like CASS certification (USPS bulk mail discount), DPV (Delivery point verification) flagging, and LON/LAT geocoding.
There is a Perl module Geo::PostalAddress, but it uses heuristics and doesn't have the other features mentioned for GeoStan.
Edit: some have mentioned 'doing it yourself', if you do decide to do this, a good source of information to start with is the US Census Tiger Data Set, which contains a lot of information about the US including address information.
As seen on reddit:
$address = urlencode('1600 Pennsylvania Avenue, Washington, DC');
$json = json_decode(file_get_contents("http://where.yahooapis.com/geocode?q=$address&flags=J"));
print_r($json);
Fixaddress.com service is available that provides following services,
1) Address Validation.
2) Address Correction.
3) Address spell correcting.
4) Correct addresses phonetic mistakes.
Fixaddress.com uses USPS and Tiger data as reference data.
For more detail visit below link,
http://www.fixaddress.com/
One area where address lookups have to be performed reliably is for VOIP E911 services. I know companies reliably using the following services for this:
Bandwidth.com 9-1-1 Access API MSAG Address Validation
MSAG = Master Street Address Guide
https://www.bandwidth.com/9-1-1/
SmartyStreet US Street Address API
https://smartystreets.com/docs/cloud/us-street-api
There are companies that provide this service. Service bureaus that deal with mass mailing will scrub an entire mailing list to that it's in the proper format, which results in a discount on postage. The USPS sells databases of address information that can be used to develop custom solutions. They also have lists of approved vendors who provide this kind of software and service.
There are some (but not many) packages that have APIs for hooking address validation into your software.
However, you're right that its a pretty nasty problem.
http://www.usps.com/ncsc/ziplookup/vendorslicensees.htm
As mentioned there are many services out there, if you are looking to truly validate the entire address then I highly recommend going with a Web Service type service to ensure that changes can quickly be recognized by your application.
In addition to the services listed above, webservice.net has this US Address Validation service. http://www.webservicex.net/WCF/ServiceDetails.aspx?SID=24
We have had success with Perfect Address.
Their database has all the US street names and street number ranges. Also acts as a pretty decent parser for free-form address fields, if you are lucky enough to have that kind of data.
Validating it is a valid address is one thing.
But if you're trying to validate a given person lives at a given address, your only almost-guarantee would be a test mail to the address, and even that is not certain if the person is organised or knows somebody at that address.
Otherwise people could just specify an arbitrary random address which they know exists and it would mean nothing to you.
The best you can do for immediate results is request the user send a photographed / scanned copy of the head of their bank statement or some other proof-of-recent-residence, because at least then they have to work harder to forget it, and forging said things show up easily with a basic level of image forensic analysis.
There is no global solution. For any given country it is at best rather tricky.
In the UK, the PostOffice controlls postal addresses, and can provide (at a cost) address information for validation purposes.
Government agencies also keep an extensive list of addresses, and these are centrally collated in the NLPG (National Land and Property Gazetteer).
Actually validating against these lists is very difficult. Most people don't even know exactly how their address as it is held by the PostOffice. Some businesses don't even know what number they are on a particular street.
Your best bet is to approach a company that specialises in this kind of thing.
Yahoo has also a Placemaker API. It is good only for locations but it has an universal id for all world locations.
It look that there is no standard in ISO list.
You could also try SAP's Data Quality solutions which are available in both a server platform is processing a large number of requests or as an embeddable SDK if you wanted to run it in process with your application. We use it in our application and it's very robust and scalable.
NAICS.com is coming out with an API that will add all kinds of key business data including street address. This would happen on the fly as your site's forms are processed. https://www.naics.com/business-intelligence-api/
You can try Pitney Bowes “IdentifyAddress” Api available at - https://identify.pitneybowes.com/
The service analyses and compares the input addresses against the known address databases around the world to output a standardized detail. It corrects addresses, adds missing postal information and formats it using the format preferred by the applicable postal authority. I also uses additional address databases so it can provide enhanced detail, including address quality, type of address, transliteration (such as from Chinese Kanji to Latin characters) and whether an address is validated to the premise/house number, street, or city level of reference information.
You will find a lot of samples and sdk available on the site and i found it extremely easy to integrate.
For US addresses you can require a valid state, and verify that the zip is valid. You could even check that the zip code is in the right state, but beyond that I don't think there are many tests you could run that wouldn't provide a lot of false negatives.
What are you trying to do -- prevent simple mistakes or enforcing some kind of identity check?

Resources