Should the length of a URL string be limited to increase security? - ajax

I am using ColdFusion 8 and jQuery 1.7.2.
I am using CFAJAXPROXY to pass data to a CFC. Doing so creates a JSON array (argument collection) and passes it through the URL. The string can be very long, since quite a bit of data is being passed.
The site that I am working on has existing code that limits the length of any URL query string to 250 characters. This is done in the Application.cfm file by testing the length of the query string. If any query string is greater than 250 characters, the request is aborted. The purpose of this was to ensure that hackers or other malicious code couldn't be passed through the URL string.
Now that we are using the query string to pass JSON arrays in the URL, we discovered that the Ajax request was being aborted quite frequently.
We have many other security practices in place, such as stripping any "<>" tags from code and using CFQUERYPARAM.
My question is whether limiting the length of a URL string for the sake of security is a good idea, or whether it is simply ineffective.
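For reference, the existing check is conceptually something like the following, shown here as a generic Python/WSGI sketch rather than our actual Application.cfm code; the threshold and the status code are illustrative only.

```python
# Hedged sketch of the kind of query-string length check described above.
MAX_QUERY_STRING = 250  # illustrative threshold from the question

class QueryStringLimit:
    def __init__(self, app, limit=MAX_QUERY_STRING):
        self.app = app
        self.limit = limit

    def __call__(self, environ, start_response):
        query = environ.get("QUERY_STRING", "")
        if len(query) > self.limit:
            # Abort the request, as the Application.cfm check does.
            start_response("414 Request-URI Too Long",
                           [("Content-Type", "text/plain")])
            return [b"Query string too long"]
        return self.app(environ, start_response)
```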

There is absolutely no correlation between URI length and security; it is rather a question of:
Limiting the amount of information that you provide to a user agent to a 'need to know' basis. This covers things such as the type of application server you run and its associated conventions, the web server you run and its associated conventions, and the operating system on the host machine. These are essentially things that can be considered vulnerabilities.
Reducing the impact of exploiting those vulnerabilities, i.e. introducing patches, ensuring correct configuration, etc.
As alluded to above, at the web tier this doesn't only cover GETs (your concern), but also POSTs, PUTs, DELETEs and just about any other operation on an HTTP resource.

Moved this into an answer for Evik -
That seems (at best) completely unnecessary if the inputs are being properly sanitized. I'm sure someone clever can quickly defeat a "security by small doorway" defense, assuming that's the only defense.
OWASP has some good, sane guidelines for web security. As far as I've read, limiting the size of the url is not on the list. For more information, see: https://www.owasp.org/index.php/Top_10_2010-Main
I would also like to echo Hereblur's comment that this makes internationalization tricky, or maybe impossible.

I'm not a ColdFusion developer, but I think it's the same with other languages.
I think it helps only a little bit. The problems of malicious code and SQL injection should be handled by your application.
I agree that limiting the length of query string values is safer and makes things more difficult for attackers, but you can't do this with POST data, and it limits some functionality. For example:
A single UTF-8 character may take 9 characters once URL-encoded, which means you can fit only about 27 non-English characters under a 250-character limit.
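To illustrate the point, here is a quick check with Python's standard library: one 3-byte UTF-8 character percent-encodes to 9 characters, so a 250-character limit leaves room for roughly 27 such characters.

```python
# One non-ASCII character expands to 9 characters after UTF-8 + percent encoding.
from urllib.parse import quote

ch = "あ"                      # a single Japanese character, 3 bytes in UTF-8
encoded = quote(ch)            # -> '%E3%81%82'
print(encoded, len(encoded))   # 9 characters after encoding
print(250 // len(encoded))     # about 27 such characters fit under a 250-char limit
```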

The only reason to limit has to do with performance and DoS attacks, not security per se (though a DoS is a security threat in that it brings down your server). Web servers and app servers (including CF) allow you to limit the size of POST data so that your server won't be degraded by very large file uploads. URL data, if substantial, can result in long-running requests as the server struggles to parse, handle, or write it.
So there is some modest risk here related to such things. Back in the NT days, IIS 3 had a number of flaws that were "locked down" by limiting the length of the URL, but those days are long gone. There are far more exploits representing low-hanging fruit that I would look at first before examining this issue too closely, unless of course you feel like you have a specific problem with folks probing you (with long URLs, I mean :).

Related

How to work around 100K character limit for the StanfordNLP server?

I am trying to parse book-length blocks of text with StanfordNLP. The http requests work great, but there is a non-configurable 100,000-character limit on the text length, MAX_CHAR_LENGTH in StanfordCoreNLPServer.java.
For now, I am chopping up the text before I send it to the server, but even if I try to split between sentences and paragraphs, there is some useful coreference information that gets lost between these chunks. Presumably, I could parse chunks with large overlap and link them together, but that seems (1) inelegant and (2) like quite a bit of maintenance.
Is there a better way to configure the server or the requests to either remove the manual chunking or preserve the information across chunks?
BTW, I am POSTing using the python requests module, but I doubt that makes a difference unless a corenlp python wrapper deals with this problem somehow.
You should be able to start the server with the flag -maxCharLength -1 and that'll get rid of the sentence length limit. Note that this is inadvisable in production: arbitrarily large documents can consume arbitrarily large amounts of memory (and time), especially with things like coref.
The list of options to the server should be accessible by calling the server with -help, and are documented in code here.
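For example, a minimal sketch of the whole round trip, assuming the default port 9000 and a standard CoreNLP distribution on the classpath (the input file name and memory/timeout settings are illustrative):

```python
# Start the server without the character cap (shell, from the CoreNLP directory):
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
#        -port 9000 -timeout 60000 -maxCharLength -1
import json
import requests

props = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
    "outputFormat": "json",
}
with open("book.txt", encoding="utf-8") as f:   # hypothetical input file
    text = f.read()

resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data=text.encode("utf-8"),
)
resp.raise_for_status()
doc = resp.json()   # whole-document annotation, no manual chunking
```

As noted above, sending a whole book in one request can consume a lot of memory and time on the server, especially with coref enabled.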

Why do WebDAV implementations not support GETing a folder

RFC 2518 states:
The semantics of GET are unchanged when applied to a collection, since GET is defined as, "retrieve whatever information (in the form of an entity) is identified by the Request-URI" [RFC2068]. GET when applied to a collection may return the contents of an "index.html" resource, a human-readable view of the contents of the collection, or something else altogether. Hence it is possible that the result of a GET on a collection will bear no correlation to the membership of the collection.
As a user of owncloud I often find myself suffering from the low performance of an initial sync of a folder containing lots of small files (see the owncloud bugtracker for others reporting the same issue). After some investigation I came to the conclusion that the culprit is the underlying WebDAV implementation, which yields an index.html for a collection and thus forces the client to issue a GET request for each file. Since each GET incurs significant overhead (on the order of several hundred milliseconds), the whole operation never uses the available bandwidth and is perceived as agonizingly slow.
So what is the reason that widely used WebDAV implementations do not allow a client to download a whole folder at a time? The specification does not explicitly forbid it. Surely this would increase performance, so I guess there must be some technical reason to this limitation.
The specification does not explicitly forbid it.
It does not forbid it, but it does not even remotely suggest that it is something implementations should do. All the examples given are about retrieving a list or index of contents, not the contents themselves.
Moreover, even if a server implementation chooses to support retrieving the contents of a collection, there is no specification for the format of that (how to package the individual files into one download). So such an implementation would be proprietary, and your WebDAV client won't support it anyway.
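For context, this is roughly how a WebDAV client ends up issuing one request per file: it lists the collection with a PROPFIND and then GETs each member individually. A small sketch with Python's requests module (URL and credentials are placeholders):

```python
# PROPFIND with Depth: 1 returns the collection members; the client must then
# fetch each file with its own GET, which is where the per-request overhead adds up.
import requests
import xml.etree.ElementTree as ET

base = "https://example.com/remote.php/webdav/photos/"   # hypothetical collection URL
auth = ("user", "password")                              # hypothetical credentials

resp = requests.request("PROPFIND", base, auth=auth, headers={"Depth": "1"})
resp.raise_for_status()

ns = {"d": "DAV:"}
hrefs = [el.text for el in ET.fromstring(resp.content).findall(".//d:href", ns)]
for href in hrefs[1:]:        # skip the collection itself, then GET each member
    print("would GET", href)
```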

fastest etag algorithm

We want to make use of http caching on our website - in particular content validation.
Because our CMS constructs pages from smaller fragments of content, the last-modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of ETags. Because page construction is based on lots of other page fragments, we think the only real way to provide an accurate ETag is by performing some sort of digest on the content stream itself. This seems a little overcooked, as caching is supposed to ease the load on the servers, but a content digest is obviously CPU intensive.
I'm looking for the fastest algorithm to create a unique ETag that is relevant to the content stream (inode etc. is just a kludge and won't work). An MD5 hash is obviously going to give the best unique result, but is anybody else making use of other algorithms that are faster in a similar situation?
Sorry, forgot the important details: we're using Java Servlets, running in WebSphere 6.1 on Windows 2003.
I forgot to mention that there are also live database feeds (we're a bank and need to make sure interest rates are up to date) that can also change the content, so figuring out when content has changed can be tricky.
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.
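A minimal sketch of that idea, shown in Python for brevity (the Java servlet equivalent would use MessageDigest in the same way; the function names are illustrative):

```python
# Hash each fragment once, when it changes, then derive the page ETag cheaply
# at request time from the stored per-fragment digests.
import hashlib

fragment_digests = {}   # fragment id -> hex digest, updated on write, not per request

def fragment_changed(fragment_id, content: bytes):
    # Pay the hashing cost once, when the fragment (or a DB feed) updates it.
    fragment_digests[fragment_id] = hashlib.md5(content).hexdigest()

def page_etag(fragment_ids):
    # Combining the per-fragment digests is cheap compared to hashing the rendered page.
    combined = "|".join(fragment_digests[f] for f in sorted(fragment_ids))
    return '"' + hashlib.md5(combined.encode()).hexdigest() + '"'
```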

RETS data fetching problem

I am working on one real estate website which is Using RETS service to get the data to my local server.
But I have one little problem here: I can fetch data from RETS, which has about 300,000 (3 lakh) records in the RETS database, but I haven't found a way to fetch all of those records in batches of 50k at a time.
I didn't find any 'LIMIT' keyword in RETS, so how can I fetch 50k records at a time without 'LIMIT'?
Please help me.
RETS is not really much of a standard; it more closely resembles a pseudo-standard. It loosely defines an XML schema that describes real estate listings.
In version 1.x, the "standard" was composed of DTD documents. In 2.x, the "standard" uses XSD documents to describe the listings.
http://www.rets.org/documentation
However, in practice, there is almost no consistency amongst implementers. Having connected to hundreds of "RETS Compliant" service providers, I'm convinced that not one of them is like any other one.
Furthermore, the 2.x "standard" has not changed in 3 years. It's an unmaintained, sloppy attempt at a standard. RETS is often used as a business buzzword by non-technical people. In reality, it's just an arbitrary attempt at modeling real estate listings in XML.
Try asking the specific implementer for their documentation. Often, they don't have any. So, emailing the lead developer has frequently been helpful. Sometimes they'll provide a WSDL which will outline the supported calls. Often, the WSDL doesn't coincide with the actual service, so beware.
As for your specific question, try caching the results. Usually, the use of a limit on a RETS call is a sign of a direct dependency. As requests for your service increase, the load that your service puts on theirs will break (and not be appreciated). Also, if their service goes down (even temporarily), yours will be interrupted as well. Most importantly, it will make the live requests to your pages really, really slow (especially if their system is slow at the time). The listings usually don't change frequently enough to worry about stale data, so caching for up to an hour is pretty acceptable.
Best of luck!
libRets provides support for generating a query with fetch limits:
http://www.crt.realtors.org/projects/rets/librets/documentation/api/classlibrets_1_1_search_request.html
But last I knew, the company Intereality either ignored or outright didn't provide complete compatibility with RETS. The quickest way to know you're dealing with them is that they also made all "System" names for table fields numeric.
If you're lucky, you're using a Rapattoni-backed server, and they do provide spec-compatible servers.
Last point: I can't for the life of me remember its name, but I used to use a free Java-based RETS tool to build valid queries (including offset/limit clauses), and that made it a tad easier to build automated fetchers for a client's batch processing system.
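If the server does honor Limit/Offset on the Search transaction, a batched fetch can be sketched roughly like this. The login handshake is omitted, and the URL, credentials, class name, and DMQL2 query are hypothetical placeholders:

```python
# Hedged sketch of batched fetching via the RETS Search transaction using the
# Limit/Offset arguments. Not every server honors them (see the caveats above).
import requests
from requests.auth import HTTPDigestAuth

session = requests.Session()
session.auth = HTTPDigestAuth("rets_user", "rets_password")     # hypothetical login
search_url = "https://rets.example.com/rets/search"             # hypothetical URL

def fetch_batches(batch_size=50000):
    offset = 1                        # RETS offsets are commonly 1-based
    while True:
        resp = session.get(search_url, params={
            "SearchType": "Property",
            "Class": "Residential",                            # hypothetical class
            "Query": "(ModificationTimestamp=2012-01-01+)",    # hypothetical DMQL2 query
            "QueryType": "DMQL2",
            "Format": "COMPACT-DECODED",
            "Limit": batch_size,
            "Offset": offset,
        })
        resp.raise_for_status()
        yield resp.text               # parse the COMPACT payload here
        if "<MAXROWS" not in resp.text:   # no MAXROWS tag: the server returned the rest
            break
        offset += batch_size
```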
In RETS, if the record count is more than the limit, you can download in batches, or you can strip that limit using a regex while downloading.
The best way to solve the problem is to divide the data into small units to download, while keeping the download limit in mind. As the field to divide on in MLS/IDX, I suggest the modification date and the listing date.

Techniques to reduce data harvesting from AJAX/JSON services

I was wondering if anyone had come across any techniques to reduce the chances of data exposed through JSON type services on the server (intended to supply AJAX functions) from being harvested by external agents.
It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source.
Has anybody implemented a clever technique here?
Whatever the method, it should still allow a genuine AJAX function to consume the data.
Note that I'm not really talking about protecting 'sensitive' information here, the odd record leaking out is not a problem. Rather I am thinking about stopping a situation where the whole DB is hoovered up by bots (either in one go, or gradually over time).
Thanks.
First, I would like to be clear on this:
"It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source."
It will be pretty obvious that the information is being sent encrypted to the Flash client, and it won't be that hard for an attacker to find out from your compiled Flash program what's being used for this, replicate it, and get all that data.
If the data does happen to have the value you are thinking of, you can count on the above.
If this is public information, embrace that & don't combat it - instead find ways to capitalize on it.
If this is information that you are only exposing to a set of users, make sure you have the corresponding authentication / secure communication. Track usage as others have said, and have measures that act on it.
The first thing to prevent bots from stealing your data is not technological, it's legal. First, make sure you have the right language in your site's Terms of Use that what you're trying to prevent is actually disallowed and defensible from a legal standpoint. Second, make sure you design your technical strategy with legal issues in mind. For example, in the US, if you put data behind an authentication barrier and an attacker steals it, it's likely a violation of the DMCA law. Third, find a lawyer who can advise you on IP and DMCA issues... nice folks on StackOverflow aren't enough. :-)
Now, about the technology:
A reasonable solution is to require that users be authenticated before they can get access to your sensitive Ajax calls. This allows you to simply monitor per-user usage of your Ajax calls and (manually or automatically) cancel the account of any user who makes too many requests in a particular time period. (or too many total requests, if you're trying to defend against a trickle approach).
This approach of course is vulnerable to sophisticated bots who automatically sign up new "users", but with a reasonably good CAPTCHA implementation, it's quite hard to build this kind of bot. (see "circumvention" section at http://en.wikipedia.org/wiki/CAPTCHA)
If you are trying to protect public data (no authentication) then your options are much more limited. As other answers noted, you can try IP-address-based limits (and run afoul of large corporate proxy users) but sophisticated attackers can get around this by distributing the load. There's also likely sophisticated software which watches things like request timing, request patterns, etc. and tries to spot bots. Poker sites, for example, spend a lot of time on this. But don't expect these kinds of systems to be cheap. One easy thing you can do is to mine your web logs (e.g. using Splunk) and find the top N IP addresses hitting your site, and then do a reverse-IP lookup on them. Some will be legitimate corporate or ISP proxies. But if you recognize a competitor's domain name among the list, you can block their domain or follow up with your lawyers.
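The log-mining part can be as simple as something like the following sketch; the log path and the top-20 cutoff are just placeholders:

```python
# Count requests per client IP in a combined-format access log and
# reverse-resolve the heaviest hitters.
import socket
from collections import Counter

hits = Counter()
with open("/var/log/apache2/access.log") as log:     # hypothetical log path
    for line in log:
        ip = line.split(" ", 1)[0]                    # first field is the client IP
        hits[ip] += 1

for ip, count in hits.most_common(20):
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        host = "(no reverse DNS)"
    print(f"{count:8d}  {ip:15s}  {host}")
```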
In addition to pre-theft defense, you might also want to think about inserting a "honey pot": deliberately fake information that you can track later. This is how, for example, map manufacturers catch plagiarism: they insert a fake street in their maps and see which other maps show the same fake street. While this doesn't prevent determined folks from sucking out all your data, it does let you find out later who's re-using your data. This can be done by embedding unique text strings in your text output, and then searching for those strings on Google later (assuming your data is re-usable on another public website). If your data is HTML or images, you can include an image which points back to your site, and you can track who is downloading it, and look for patterns you can use to bust the freeloaders.
Note that the javascript encryption approach noted in one of the other answers won't work for non-authenticated sessions-- an attacker can simply download the javascript and run it just like a regular browser would. Moral of the story: public data is essentially indefensible. If you want to keep data protected, put it behind an authentication barrier.
This is obvious, but if your data is publicly searchable by search engines, you'll both need a non-AJAX solution for them (Google won't read your ajax data!) and you'll want to mark those pages NOARCHIVE so your data doesn't show up in Google's cache. You'll also probably want a white list of search engine crawler IP addresses which you allow into your search-engine-crawlable pages (you can work with Google, Bing, Yahoo, etc. to get these), otherwise malicious bots could simply impersonate Google and get your data.
In conclusion, I want to echo #kdgregory above: make sure that the threat is real enough that it's worth the effort required. Many companies overestimate the interest that other people (both legitimate customers and nefarious actors) have in their business. It might be that yours is an oddball case where you have particularly important data, it's particularly valuable to obtain, it must be publicly accessible without authentication, and your legal recourse will be limited if someone steals your data. But all of those together is admittedly an unusual case.
P.S. - another way to think about this problem, which may or may not apply in your case: sometimes it's easier to change how your data works, which obviates securing it. For example, can you tie your data in some way to a service on your site so that the data isn't very useful unless it's being used in conjunction with your code? Or can you embed advertising in it, so that wherever it's shown you get paid? And so on. I don't know if any of these mitigations apply to your case, but many businesses have found ways to give stuff away for free on the Internet (and encourage rather than prevent wide re-distribution) and still make money, so a hybrid free/pay strategy may (or may not) be possible in your case.
If you have an internal Memcached box, you could consider using a technique where you create an entry for each IP that hits your server with an hour expiration. Then increment that value each time the IP hits your AJAX endpoint. If the value gets over a particular threshold, fry the connection. If the value expires in Memcached, you know it isn't getting "hoovered away".
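A rough sketch of that counter using the pymemcache client; the threshold and window values are illustrative:

```python
# Per-IP hit counter in Memcached with a one-hour expiry, as described above.
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))
THRESHOLD = 1000          # max AJAX hits per IP per hour (illustrative)
WINDOW = 3600             # one-hour expiration

def allow_request(ip: str) -> bool:
    key = f"ajax-hits:{ip}"
    # add() is a no-op if the key already exists; it creates the counter
    # with the one-hour expiry the first time this IP is seen.
    mc.add(key, "0", expire=WINDOW)
    count = mc.incr(key, 1)
    return count is not None and count <= THRESHOLD
```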
This isn't a concrete answer with a proof of concept, but maybe a starting point for you. You could create a javascript function that provides encryption/decryption functions. The javascript would need to be built dynamically, and you would include an encryption key that is unique to the session. On the server side, you'd have an encryption service that uses the key from the session to encrypt your JSON before delivering it.
This would at least prevent someone from listening to your web traffic, pulling information out of your database.
I'm with kdgregory though; it sounds like your data is too open.
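For what it's worth, the server side of that idea might look roughly like this Flask/Fernet sketch (purely illustrative; the dynamically generated JavaScript would need a matching decrypt routine, and as noted in the other answer, this only raises the bar for unauthenticated sessions):

```python
# Per-session key stored server-side, used to encrypt the JSON payload before delivery.
import json
from flask import Flask, session, jsonify
from cryptography.fernet import Fernet

app = Flask(__name__)
app.secret_key = "change-me"          # hypothetical Flask session secret

@app.route("/data")
def encrypted_data():
    # Generate a key once per session; the dynamically built JavaScript
    # would embed this same key so the browser can decrypt the response.
    if "ajax_key" not in session:
        session["ajax_key"] = Fernet.generate_key().decode()
    f = Fernet(session["ajax_key"].encode())
    payload = json.dumps({"id": 1, "name": "example"})   # illustrative data
    return jsonify({"blob": f.encrypt(payload.encode()).decode()})
```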
Some techniques are listed in Further thoughts on hindering screen scraping.
If you use PHP, Bad Behavior is a nice tool to help. If you don't use PHP, it can still give some ideas on how to filter (see its How it works page).
Incredibill's blog gives nice tips, lists of user agents/IP ranges to block, etc.
Here are a variety of suggestions:
Issue tokens required for redemption along with each AJAX request, and expire the tokens (see the sketch after this list).
Track how many queries are coming from each client, and throttle excessive usage based on expected normal usage of your site.
Look for patterns in usage such as sequential queries, spikes in requests, or queries that occur faster than a human could conduct.
Check user agents. Many bots don't completely replicate the user agent info of a browser, and you can eliminate programmatic scraping of your data using this method.
Change the front-end component of your website to redirect to a captcha (or some other human verifying mechanism) once a request threshold is exceeded.
Modify your logic so the response data is returned in a few different ways to complicate the code required to parse it.
Obfuscate your client-side javascript.
Block IPs of offending clients.
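A sketch of the expiring-token idea from the first bullet above; the signing secret and TTL are illustrative. The server issues a signed, time-limited token with each page, and the AJAX endpoint refuses requests whose token is missing, tampered with, or expired.

```python
# HMAC-signed, time-limited tokens for AJAX requests.
import hashlib
import hmac
import time

SECRET = b"server-side-secret"        # hypothetical signing secret
TOKEN_TTL = 300                       # tokens expire after 5 minutes (illustrative)

def issue_token() -> str:
    expires = str(int(time.time()) + TOKEN_TTL)
    sig = hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()
    return f"{expires}:{sig}"

def token_is_valid(token: str) -> bool:
    try:
        expires, sig = token.split(":")
    except ValueError:
        return False
    expected = hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expires) > time.time()
```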
Bots usually don't parse Javascript, so your ajax code won't be instantly executed. And even if they do, bots usually don't maintain sessions/cookies either. Knowing that, you could reject the request if it is invoked without a valid session/cookie (which is obviously set on the server side beforehand by the request on the parent page).
This does not protect you from the human hazard, though. The safest way is to restrict access to users with a login/password. If that is not your intent, well, then you have to live with the fact that it's a public application. You could of course scan logs and maintain blacklists of IP addresses and user agents, but that is going to extremes.
