Techniques to reduce data harvesting from AJAX/JSON services - ajax

I was wondering if anyone had come across any techniques to reduce the chances of data exposed through JSON type services on the server (intended to supply AJAX functions) from being harvested by external agents.
It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source.
Has anybody implemented a clever technique here?
Whatever the method, it should still allow a genuine AJAX function to consume the data.
Note that I'm not really talking about protecting 'sensitive' information here, the odd record leaking out is not a problem. Rather I am thinking about stopping a situation where the whole DB is hoovered up by bots (either in one go, or gradually over time).
Thanks.

First, I would like to be clear on this:
"It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source."
It will be pretty obvious that the information is being sent encrypted to the Flash client, and it won't be that hard for an attacker to work out from your compiled Flash program what is being used for this, replicate it, and get all that data.
If the data does happen to have the value you think it has, you can count on the above.
If this is public information, embrace that and don't combat it; instead find ways to capitalize on it.
If this is information that you are only exposing to a set of users, make sure you have the corresponding authentication / secure communication. Track usage as others have said, and have measures that act on it.

The first line of defense against bots stealing your data is not technological, it's legal. First, make sure your site's Terms of Use contain language making what you're trying to prevent actually disallowed and defensible from a legal standpoint. Second, make sure you design your technical strategy with legal issues in mind. For example, in the US, if you put data behind an authentication barrier and an attacker steals it, it's likely a violation of the DMCA. Third, find a lawyer who can advise you on IP and DMCA issues... nice folks on StackOverflow aren't enough. :-)
Now, about the technology:
A reasonable solution is to require that users be authenticated before they can access your sensitive Ajax calls. This lets you monitor per-user usage of those calls and (manually or automatically) cancel the account of any user who makes too many requests in a particular time period (or too many total requests, if you're trying to defend against a trickle approach).
This approach of course is vulnerable to sophisticated bots who automatically sign up new "users", but with a reasonably good CAPTCHA implementation, it's quite hard to build this kind of bot. (see "circumvention" section at http://en.wikipedia.org/wiki/CAPTCHA)
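A minimal sketch of that per-user counter, assuming a Node.js/Express-style server with authentication middleware already in place (the in-memory store, limits, and field names here are illustrative, not from the original answer):

// Hypothetical Express middleware: count authenticated AJAX calls per user
// and reject anyone who exceeds a per-hour threshold.
const WINDOW_MS = 60 * 60 * 1000;   // one hour
const MAX_REQUESTS = 500;           // made-up threshold

const requestCounts = new Map();    // userId -> { count, windowStart }

function ajaxRateLimit(req, res, next) {
  const userId = req.user && req.user.id;       // assumes auth middleware ran first
  if (!userId) return res.status(401).end();

  const now = Date.now();
  let entry = requestCounts.get(userId);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    entry = { count: 0, windowStart: now };
    requestCounts.set(userId, entry);
  }
  entry.count += 1;

  if (entry.count > MAX_REQUESTS) {
    // This is also the place to flag the account for manual review/cancellation.
    return res.status(429).json({ error: 'Too many requests' });
  }
  next();
}

// Usage: app.get('/api/data', ajaxRateLimit, dataHandler);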
If you are trying to protect public data (no authentication) then your options are much more limited. As other answers noted, you can try IP-address-based limits (and run afoul of large corporate proxy users), but sophisticated attackers can get around this by distributing the load. There's also likely sophisticated software which watches things like request timing and request patterns and tries to spot bots. Poker sites, for example, spend a lot of time on this. But don't expect these kinds of systems to be cheap. One easy thing you can do is to mine your web logs (e.g. using Splunk), find the top N IP addresses hitting your site, and do a reverse-IP lookup on them. Some will be legitimate corporate or ISP proxies. But if you recognize a competitor's domain name in the list, you can block their domain or follow up with your lawyers.
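If you want to experiment with the reverse-IP idea without a log-analysis product, a small Node.js sketch like this can do the lookups (the IP list is a placeholder for whatever your logs produce):

// Sketch: reverse-DNS lookup for the top IPs pulled out of your web logs.
const dns = require('dns').promises;

const topIps = ['203.0.113.10', '198.51.100.7'];   // placeholder values

async function identify(ips) {
  for (const ip of ips) {
    try {
      const hostnames = await dns.reverse(ip);
      console.log(ip, '->', hostnames.join(', '));
    } catch (err) {
      console.log(ip, '-> no reverse DNS entry');
    }
  }
}

identify(topIps);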
In addition to pre-theft defense, you might also want to think about inserting a "honey pot": deliberately fake information that you can track later. This is how, for example, map manufacturers catch plagiarism: they insert a fake street in their maps and see which other maps show the same fake street. While this doesn't prevent determined folks from sucking out all your data, it does let you find out later who's re-using it. This can be done by embedding unique text strings in your text output and then searching for those strings on Google later (assuming your data is re-usable on another public website). If your data is HTML or images, you can include an image which points back to your site, track who is downloading it, and look for patterns you can use to bust the freeloaders.
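A minimal sketch of the "unique string" flavour of the honey pot, assuming you build the response server-side in Node.js and can afford one fake record per consumer (the marker format is made up):

// Sketch: append a trackable fake record to the data you serve.
const crypto = require('crypto');

function addHoneypotRecord(records) {
  // A marker that is unique enough to search for on Google later.
  const marker = 'zq' + crypto.randomBytes(6).toString('hex');
  return records.concat([{
    name: 'Fake Street ' + marker,   // fabricated entry mixed in with real data
    marker: marker                   // log/persist this so you can match it later
  }]);
}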
Note that the javascript encryption approach noted in one of the other answers won't work for non-authenticated sessions-- an attacker can simply download the javascript and run it just like a regular browser would. Moral of the story: public data is essentially indefensible. If you want to keep data protected, put it behind an authentication barrier.
This is obvious, but if your data is publicly searchable by search engines, you'll both need a non-AJAX solution for them (Google won't read your ajax data!) and you'll want to mark those pages NOARCHIVE so your data doesn't show up in Google's cache. You'll also probably want a white list of search engine crawler IP addresses which you allow into your search-engine-crawlable pages (you can work with Google, Bing, Yahoo, etc. to get these), otherwise malicious bots could simply impersonate Google and get your data.
In conclusion, I want to echo kdgregory above: make sure that the threat is real enough that it's worth the effort required. Many companies overestimate the interest that other people (both legitimate customers and nefarious actors) have in their business. It might be that yours is an oddball case where you have particularly important data that is particularly valuable to obtain, it must be publicly accessible without authentication, and your legal recourse will be limited if someone steals it. But all of those together is admittedly an unusual case.
P.S. - here's another way to think about this problem which may or may not apply in your case. Sometimes it's easier to change how your data works, which obviates the need to secure it. For example, can you tie your data in some way to a service on your site so that it isn't very useful unless it's being used in conjunction with your code? Or can you embed advertising in it, so that wherever it's shown you get paid? And so on. I don't know if any of these mitigations apply to your case, but many businesses have found ways to give stuff away for free on the Internet (and encourage rather than prevent wide re-distribution) and still make money, so a hybrid free/pay strategy may (or may not) be possible in your case.

If you have an internal Memcached box, you could consider using a technique where you create an entry for each IP that hits your server with an hour expiration. Then increment that value each time the IP hits your AJAX endpoint. If the value gets over a particular threshold, fry the connection. If the value expires in Memcached, you know it isn't getting "hoovered away".
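A rough sketch of that counter, assuming a Node.js endpoint and the node "memcached" client (the client library, key format, and threshold are assumptions; your client's API may differ slightly):

// Per-IP hit counter with a one-hour expiration in Memcached.
const Memcached = require('memcached');
const memcached = new Memcached('localhost:11211');

const ONE_HOUR = 3600;    // seconds (memcached expiration)
const THRESHOLD = 1000;   // made-up hits-per-hour limit

function shouldBlock(ip, callback) {
  const key = 'ajax_hits_' + ip;
  // add() only succeeds if the key doesn't exist yet; either way we then increment.
  memcached.add(key, 0, ONE_HOUR, function () {
    memcached.incr(key, 1, function (err, hits) {
      if (err) return callback(err);
      callback(null, hits !== false && hits > THRESHOLD);
    });
  });
}

// Usage inside the AJAX handler:
// shouldBlock(req.ip, (err, blocked) => blocked ? res.status(429).end() : serveData(res));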

This isn't a concrete answer with a proof of concept, but maybe a starting point for you. You could create JavaScript functions that provide encryption/decryption. The JavaScript would need to be built dynamically, and you would include in it an encryption key that is unique to the session. On the server side, you'd have an encryption service that uses the key from the session to encrypt your JSON before delivering it.
This would at least prevent someone from listening to your web traffic and pulling information out of your database.
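A minimal server-side sketch of that idea, assuming Node.js with Express and express-session (the endpoint names are made up, and the client would need a matching decryption routine using the same dynamically served key):

// 1. A per-session key, exposed to the page via a dynamically built script.
const crypto = require('crypto');

function ensureSessionKey(session) {
  if (!session.ajaxKey) session.ajaxKey = crypto.randomBytes(32).toString('hex');
  return session.ajaxKey;
}

function keyScript(req, res) {
  const key = ensureSessionKey(req.session);
  res.type('application/javascript').send('window.AJAX_KEY = "' + key + '";');
}

// 2. Encrypt the JSON with the session key before delivering it.
function encryptJson(obj, hexKey) {
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv('aes-256-cbc', Buffer.from(hexKey, 'hex'), iv);
  const data = Buffer.concat([cipher.update(JSON.stringify(obj), 'utf8'), cipher.final()]);
  return { iv: iv.toString('hex'), data: data.toString('hex') };
}

Keep in mind this only obscures the payload from casual observation; anyone who runs your JavaScript gets the key too.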
I'm with kdgregory though; it sounds like your data is too open.

Some techniques are listed in Further thoughts on hindering screen scraping.
If you use PHP, Bad Behavior is a nice tool to help. If you don't use PHP, it can still give you some ideas on how to filter (see its How it works page).
Incredibill's blog gives nice tips, lists of user-agents/IP ranges to block, etc...

Here are a variety of suggestions:
Issue tokens that must be redeemed along with each AJAX request, and expire the tokens (see the sketch after this list).
Track how many queries are coming from each client, and throttle excessive usage based on expected normal usage of your site.
Look for patterns in usage such as sequential queries, spikes in requests, or queries that occur faster than a human could conduct.
Check user-agents. Many bots don't completely replicate the user-agent info of a browser, and you can eliminate programmatic scraping of your data using this method.
Change the front-end component of your website to redirect to a captcha (or some other human verifying mechanism) once a request threshold is exceeded.
Modify your logic so the response data is returned in a few different ways to complicate the code required to parse it.
Obfuscate your client-side javascript.
Block IPs of offending clients.
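As a sketch of the first suggestion (expiring tokens), assuming an Express-style server; the token format, lifetime, and parameter name are made up for illustration:

// Issue a short-lived signed token with the page; require it on each AJAX call.
const crypto = require('crypto');

const SECRET = process.env.TOKEN_SECRET || 'change-me';   // keep this server-side
const TOKEN_TTL_MS = 5 * 60 * 1000;                        // made-up 5-minute lifetime

function issueToken() {
  const expires = Date.now() + TOKEN_TTL_MS;
  const sig = crypto.createHmac('sha256', SECRET).update(String(expires)).digest('hex');
  return expires + '.' + sig;                              // embed this in the rendered page
}

function isValidToken(token) {
  const [expires, sig] = String(token || '').split('.');
  if (!expires || !sig || Date.now() > Number(expires)) return false;
  const expected = crypto.createHmac('sha256', SECRET).update(expires).digest('hex');
  if (sig.length !== expected.length) return false;
  return crypto.timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
}

// In the AJAX endpoint: reject calls whose token is missing, forged, or expired.
// if (!isValidToken(req.query.token)) return res.status(403).end();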

Bots usually don't parse Javascript, so your ajax code won't be instantly executed. And even if they do, bots usually don't maintain sessions/cookies either. Knowing that, you could reject the request if it is invoked without a valid session/cookie (which is obviously set on the server side beforehand by the request for the parent page).
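A minimal sketch of that session check, assuming Express with express-session (the flag name and handlers are illustrative):

// The parent page marks the session; the AJAX endpoint refuses to answer otherwise.
function parentPage(req, res) {
  req.session.pageLoaded = Date.now();    // set when the real HTML page is served
  res.send(renderPage());                 // placeholder for your page rendering
}

function ajaxEndpoint(req, res) {
  if (!req.session || !req.session.pageLoaded) {
    return res.status(403).json({ error: 'No valid session' });
  }
  res.json({ data: 'the real payload' });
}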
This does not protect you from the human hazard though. The safest way is to restrict access to users with a login/password. If that is not your intent, well, then you have to live with the fact that it's a public application. You could of course scan logs and maintain blacklists of IP addresses and user-agents, but that is going to extremes.

Related

In a Web API, isn't providing a DELETE ALL too dangerous?

In studying Web API design (regardless of the specific technology), I often come across these two uses of the DELETE verb:
DELETE /SomeResource/123 /* deletes entity with ID 123 */
DELETE /SomeResource/ /* deletes all entities */
I always get the feeling that there's something wrong with providing the latter as an operation in most applications. In rare cases where the resource is trivial, sure, why not just blow the whole collection away without a second thought? And yes, I understand that typically it's the client app's job to present an "Are you sure?" confirmation. But I like to envision my API being driven in a safe way even by some low-level agent like Fiddler.
So is there some mechanism I'm missing, like a way for the server/API to initiate some kind of dialog with the client agent to get confirmation before blowing away 10,000 customer records?
EDIT
Yes, assume that I do want to provide the functionality to remove all entities in a given scenario, but feel compelled to avoid it because of the perceived danger (typos, not paying attention, etc; the things that a UI confirmation dialog is for)
You should NEVER present a delete-all operation in a web API.
While there may be exceptions, it's usually an awful idea because it allows a user to perform an irreversible action in a single call. Worse, think of how awful it would be to have someone with ill intent find this endpoint!
Most libraries such as ASP.NET will require each method to be implemented explicitly before accepting a request (this is handled in ASP.NET during routing). So, if you don't provide DELETE /someresource/ the problem will never come up. This is particularly great, because you can't accidentally blow away data with a typo.
If you really, really want this functionality, you'll want to be 100% sure you have a reason to have it because this is a dangerous, dangerous endpoint to have.
Bottom line? I wouldn't provide it to protect your data.

html form post/get data security

Okay, this question isn't exactly very clear, because I can't write it as a single question.
I have a game that I am designing using JavaScript, and it is basically a multiplayer game.
So say there are two players, darklord and angel:
angel shoots darklord
so darklord loses 1 life.
Now what happens is that I use AJAX to submit the amount of life that darklord loses.
And the request is GET /shootout.php?shooter=darklord&life=-1
so this allows me to store the new life of darklord.
Now the problem is, say angel knows about computers, and he starts requesting /shootout.php?shooter=darklord&life=-3.
Thus darklord loses more life than he should have, so angel has cheated in the game.
Now I want to prevent this kind of request, and I am trying to find a way to hide my requests. I mean, I know I can encrypt the URL. So say I encrypted it such that the request for darklord to lose a life becomes GET /enc.php?e=934ufj30jf, with different values of e for angel to lose a life, or gain a point. However, for this to work I will need to send the data to the client, that is, tell the JavaScript which URL to request.
Now the user can easily read the source of the file in order to find out what the new requests for doing things are.
I have found and thought of many other ways, but they all only limit the amount of cheating or affect the game-play, etc. None of them eliminate the problem completely.
So now my question is: how do I make sure that users don't send data that is not real? How do I stop them from cheating?
I have thought that the best way would be to use server-side scripts to actually calculate the possibility of someone shooting someone else and then match it with the client input, but that will affect execution time by a LOT, so I am trying to find other ways. Some public-key encryption? (Problem: the user can put whatever data they want and then encrypt it.) Tokens? (Problem: the user can put whatever data they want and then attach the current token.)
So, any other ideas?
This isn't about hiding requests, it's about implementing proper access controls. Your example is referred to as an insecure direct object reference in that manipulating values in the querystring relating to direct DB objects causes an unintended outcome (have a look at OWASP Top 10 for .NET developers part 4: Insecure direct object reference).
There are a couple of things you can do, but the most important is implementing proper access controls. You must authenticate the caller of the service and authorise them to perform the requested activity (and this all has to happen on the server). In this case, angel should not be able to perform an action on behalf of darklord.
The other thing you can do is use an indirect object reference map (refer to the link above), which obfuscates the IDs of the player with cryptographically strong, user-specific alternatives. You probably don't need this in addition to the access controls but it does give you more unpredictability.
Finally, think about the flip-side as well - if darklord is able to pass the amount of damage as a parameter, what's to stop him from re-issuing the request manually with "life=-100"? It will depend on the specifics of how the attack action is performed, but you're going to want to avoid people gaming this action too.
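A minimal sketch of that server-side access control, where the shooter's identity comes from the authenticated session rather than from the querystring (function and field names are made up):

// The server trusts the session, not the request parameters, to know who is acting.
function handleShot(req, res) {
  const shooter = req.session && req.session.playerId;   // set at login, server-side
  if (!shooter) return res.status(401).end();

  const target = req.body.target;                        // who was shot at
  // Authorisation: the shooter may only act as themselves, on players in the same game.
  if (!sameGame(shooter, target)) return res.status(403).end();

  applyShot(shooter, target);                            // the server decides the damage
  res.json({ ok: true });
}

// sameGame() and applyShot() are placeholders for your game logic.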
You have to assume that the user is completely in control of the client JavaScript. The only way to make this secure is to do the check on the server side.
You should not send the result of the action; you should send the action itself.
E.g. angel shoots at darklord from point (7,15) at an angle of 36 degrees;
then the server checks whether it is a valid shot and decreases darklord's life.
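A toy sketch of that idea: the client reports only the input (shot origin and angle), and the server decides whether it hits; the geometry and the one-life rule here are made-up simplifications:

// Does a shot fired from (x, y) at angleDegrees hit a circular target?
function shotHits(shot, target) {
  const angle = shot.angleDegrees * Math.PI / 180;
  const dx = target.x - shot.x;
  const dy = target.y - shot.y;
  const along = dx * Math.cos(angle) + dy * Math.sin(angle);             // distance along the shot line
  const off = Math.abs(-dx * Math.sin(angle) + dy * Math.cos(angle));    // distance off the line
  return along > 0 && off < target.radius;
}

// Server-side: only the server ever changes a player's life total.
function resolveShot(shot, targetPlayer) {
  if (shotHits(shot, targetPlayer)) {
    targetPlayer.life -= 1;
  }
}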
There was an excellent answer given on this subject a couple of years ago. It actually refers to Flash rather than JavaScript, but the security concerns and techniques are going to be applicable to this situation too.
What is the best way to stop people hacking the PHP-based highscore table of a Flash game
You should never have the client tell the server what changes to make in player state (eg. remove X amount of health) because the client could always be cheating. Instead only have the client tell the server what input the player has made and then the server determines what happens as a result of that input.
Although this doesn't remove the possibility of cheating by writing a bot that plays the game automatically (and is better at the game than any human player) you at least remove overt cheating of the "I did 10,000 damage, trust me" variety.
Detecting bots is best done by tracking behavioral data and doing data mining to find cheaters. And if there is no behavioral difference between bots and human players, then who cares about the bots.

Cookie-Only Sessions Pros/Cons/Summary

Beaker offers an option to use encrypted cookie-only sessions. These sessions are encrypted in such a way that the user allegedly cannot view or modify the information inside the cookie. The documentation discusses these in little detail, and I am having trouble finding a list of pros/cons regarding these types of sessions.
I can see the benefit in that it allows your servers to be more disposable, allowing for a greater degree of horizontal scalability. Also, there is lightened complexity on the server-side architecture since you don't need to account for storage/management of the sessions.
On the other hand, there is some request overhead due to the fact that all the info needs to be sent on every request. The session values cannot be changed purely server-side, so a request is required in order to modify them. I have concerns about session hijacking, and there is also a size limit.
I would imagine this topic has been covered somewhere in some type of summary. Does anybody know of such a summary? Would anybody have any additional pros/cons to add? Does anybody know of any mainstream sites that use such an approach?

validating a poll for user to vote once

Creating a public poll. How do you validate that a user only votes once? I tried using the IP address, but some organizations share one IP address.
It's not a 100% solution, but you can use a browser fingerprint in combination with the IP address. See this site for some usable and easily obtainable browser properties.
Disadvantages: Some people may be left out (especially in large organizations with a very restrictive and thus homogeneous infrastructure), others may vote twice, for example by using different browsers.
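A crude sketch of combining a few of those browser properties with the IP address into a single vote key (which properties to use, and the hashing, are just one possible choice):

// Server-side (Node.js assumed): derive a vote key from the IP plus properties
// the page's JavaScript posts along with the vote.
const crypto = require('crypto');

function voteKey(req) {
  const fingerprint = [
    req.headers['user-agent'] || '',
    req.body.screen || '',      // e.g. "1920x1080x24" from screen.width/height/colorDepth
    req.body.timezone || '',    // e.g. new Date().getTimezoneOffset()
    req.body.language || ''     // e.g. navigator.language
  ].join('|');
  return crypto.createHash('sha256').update(req.ip + '|' + fingerprint).digest('hex');
}

// Store voteKey(req) with each vote and reject a second vote carrying the same key.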
If you want a 95% solution you have to require people to sign up with their email address and prove that they received the email by clicking an embedded link, but depending on how much interest they take in voting, that may scare off a lot of potential voters.
A 100% solution for this problem does not exist, as far as I'm aware.
Edit: Cookies are another obvious choice if you don't care too much about people gaming the poll system (just writing an auto-voter that ignores the cookies you send it).

JSONP and Cross-Domain queries - How to Update/Manipulate instead of just read

So I'm reading The Art & Science of Javascript, which is a good book, and it has a good section on JSONP. I've been reading all I can about it today, and even looking through every question here on StackOverflow. JSONP is a great idea, but it only seems to resolve the "Same Origin Problem" for getting data, but doesn't address it for changing data.
Did I just miss all the blogs that talked about this, or is JSONP not the solution I was hoping for?
JSONP results in a SCRIPT tag being generated to another server with any parameters that might be required as a GET request. e.g.
<script src="http://myserver.com/getjson?customer=232&callback=jsonp543354" type="text/javascript">
</script>
There is technically nothing to stop this sort of request altering data on the server, e.g. by specifying newName=Tony. Your response could then be whether the update succeeded or not. You will be limited by whatever you can fit on a querystring. If you are going with this approach, add some random element as a parameter so that proxies won't cache it.
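For example, on the client you might build the script tag dynamically and tack on a throwaway parameter so proxies won't cache the "update" request (the endpoint and parameter names here are illustrative):

// Client-side sketch: a JSONP "update" call with a cache-busting parameter.
function jsonpUpdate(customerId, newName, callbackName) {
  var script = document.createElement('script');
  script.src = 'http://myserver.com/updatejson' +
    '?customer=' + encodeURIComponent(customerId) +
    '&newName=' + encodeURIComponent(newName) +
    '&callback=' + callbackName +
    '&_=' + Date.now();                 // throwaway element so proxies won't cache it
  document.head.appendChild(script);
}

// The named callback receives whether the update succeeded:
window.jsonp543354 = function (result) { console.log('update ok?', result.success); };
jsonpUpdate(232, 'Tony', 'jsonp543354');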
Some people may consider that this goes against the way GETs are supposed to work, i.e. they shouldn't cause data to change.
Yes, and honestly I would like to stick to that paradigm. However, I might bend the rule and say that requests which do not alter/deal with CRUCIAL data will be accessible via GET calls... hm...
For instance, I am building a shopping cart system, and I think that allowing the adding/removing/etc of items to/from a cart could very easily be exposed via GETs, since even though you can change data, you cannot do anything critical with it. If someone maliciously added 1,000 flatscreen monitors to your shopping cart, there would be at least one verification step that would NOT be vulnerable to any attacks (a standard ASP.NET page at that point, with verification and all that jazz).
Is this a good/workable solution in anyone's opinion?
