Request size vs Server load - performance

I was writing the spec on how to organize the query parameters that are sent in a HTTP Request, and I came up with the following:
All parameters a prefixed with the entity to which they belong, an example "a.b", which is read "b of entity a", that way each parameter would be clearly mapped to the corresponding entity, but what if there were two different entities that share a query paramater?, to avoid repetition and request size I came up with the following micro format. To have a request wide entity called shared each property of shared will represent a property that is shared among entities, e.g.
POST /app/my/resource HTTP/1.1
a.p = v
b.p = v
c.p = v
d.p = v
Here it is clear that property p is shared among a,b,c and d so this could be sent as
POST /app/my/resource HTTP/1.1
shared.p = a:b:c:d%v
Now, the request is smaller and I'm being a bit more DRY, however this adds an extra burden to the server as it has to parse the string to process the values.
Probably in my example the differences are insignificant and I could chose either, but I'd like to know what do you think about it, what would you prefer, maybe the size of the request does not matter, or maybe the parsing of the string is not such a big deal when the length is short, but what happens when we scale the size of both the request and string which one would be better, what are the tradeoffs?

What you are showing is a compression algorithm. Beware that payloads often are compressed on protocol layer already (HTTP, gzip Content-Type, see HTTP compression examples). Compression algorithms are advanced enough to compress duplicate string-items, so probably you won't win much by a custom compression.
Generally try not to optimize prematurely. First show that you are having a response-time or payload-size issue and then optimize. Your compression algorithm itself is a good idea, but it makes payload more complicated as normal key/value pairs (xxx-form-urlencoded Content-Type). For maintenance reasons head for the simplest design possible.

Just to throw this out there, I think the answer would depend on what platform your back-end servers are running to process the requests. For example, the last time that I checked, the Perl-based mod_perl can parse those strings much faster something like ASP.NET.

Related

How to work around 100K character limit for the StanfordNLP server?

I am trying to parse book-length blocks of text with StanfordNLP. The http requests work great, but there is a non-configurable 100KB limit to the text length, MAX_CHAR_LENGTH in StanfordCoreNLPServer.java.
For now, I am chopping up the text before I send it to the server, but even if I try to split between sentences and paragraphs, there is some useful coreference information that gets lost between these chunks. Presumably, I could parse chunks with large overlap and link them together, but that seems (1) inelegant and (2) like quite a bit of maintenance.
Is there a better way to configure the server or the requests to either remove the manual chunking or preserve the information across chunks?
BTW, I am POSTing using the python requests module, but I doubt that makes a difference unless a corenlp python wrapper deals with this problem somehow.
You should be able to start the server with the flag -maxCharLength -1 and that'll get rid of the sentence length limit. Note that this is inadvisable in production: arbitrarily large documents can consume arbitrarily large amounts of memory (and time), especially with things like coref.
The list of options to the server should be accessible by calling the server with -help, and are documented in code here.

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There's too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDS of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates
You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…

Should the length of a URL string be limited to increase security?

I am using ColdFusion 8 and jQuery 1.7.2.
I am using CFAJAXPROXY to pass data to a CFC. Doing so creates a JSON array (argument collection) and passes it through the URL. The string can be very long, since quite a bit of data is being passed.
The site that I am working has existing code that limits the length of any URL query string to 250 characters. This is done in the application.cfm file by testing the length of the query string. If any query string is great than 250 characters, the request is aborted. The purpose of this was to ensure that hackers or other malicious code wouldn't be passed through the URL string.
Now that we are using the query string to pass JSON arrays in the URL, we discovered that the Ajax request was being aborted quite frequently.
We have many other security practices in place, such as stripping any "<>" tags from code and using CFQUERYPARAM.
My question is whether limiting the length of a URL string for the sake of security a good idea or is simply ineffective?
There is absolutely no correlation between URI length and security rather more a question of:
Limiting the amount of information that you provide to a user agent to a 'Need to know basis'. This covers things such as the type of application server you run and associated conventions, the web server you run and associated conventions and the operating system on the host machine. These are essentially things that can be considered vulnerabilities.
Reducing the impact of exploiting those vulnerabilities i.e introducing patches, ensuring correct configuration etc.
As alluded to above, at the web tier, this doesn't only cover GET's (your concern), but also POST's, PUT's, DELETE's on just about any other operation on a HTTP resource.
Moved this into an answer for Evik -
That seems (at best) completely unnecessary if the inputs are being properly sanitized. I'm sure someone clever can quickly defeat a "security by small doorway" defense, assuming that's the only defense.
OWASP has some good, sane guidelines for web security. As far as I've read, limiting the size of the url is not on the list. For more information, see: https://www.owasp.org/index.php/Top_10_2010-Main
I would also like to echo Hereblur's comment that this makes internationalization tricky, or maybe impossible.
I'm not a ColdFusion developer. But I think it's the same with other language.
I think It's help just a little bit. The problem of malicious code or sql injection should be handle by your application.
I agree that limited length of query string value is safer and add more difficult to hackers. But you cant do this with POST data. and It's limit some functionality. For example,
For one utf-8 character, It may take 9 characters after encoded. that's mean you can put only 27 non-english characters.
The only reason to limit has to do with performance and DOS attack - not security per se (though DOS is a security threat by bringing down your server). Web servers and App servers (including CF) allow you to limit the size of POST data so that your server won't be degraded by very large file uploads. URL data if substantial can result in long running requests as the server struggles to parse or handle or write or whatever.
So there is some modest risk here related to such things. Back in the NT days IIS 3 had a number of flaws that were "locked down" by limiting the length of the URL - but those days are long gone. There are far more exploits representing low hanging fruit that I would look at first before examining this issue too closely - unless of course you feel like you are hanging a specific problem with folks probing you (with long URLs I mean :).

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all Associations all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{assocation_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
Your wrote at the end of your question the following
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think it's your main problem. If you don't solve the main problem you will be able to reduce the size of transferred data to 10 times for example, but you still don't solve the main problem. Let us we think about the question: Why so many data should be sent to the client (to the web browser)?
The data on the client side are needed to display some information to the user. The monitor is not so large to show 30,000 total on one page. No user are able to grasp so much information. So I am sure that you display only small part of the information. In the case you should send only the small part of information which you display.
You don't describe how the guids will be used on the client side. If you need the information during row editing for example. You can transfer the data only when the user start editing. In the case you need transfer the data only for one association.
If you need display the guids directly, then you can't display all the information at once. So you can send the information for one page only. If the user start to scroll or start "next page" button you can send the next portion of data. In the way you can really dramatically reduce the size of transferred data.
If you do have no possibility to redesign the part of application you can implement your original suggestion: by replacing of GUID "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" to the number like 123 you reduce the size of GUID from 34 characters to 3. If you will send additionally array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original size of data 30000*34 = 1020000 (1 MB) to 300*39 + 30000*3 = 11700+90000 = 101700 (100 KB). So you can reduce the size of data in 10 times. The usage of compression of dynamic data on the web server can reduce the size of data additionally.
In any way you should examine why your page is so slowly. If the program works in LAN, then the transferring of even 1MB of data can be quick enough. Probably the page is slowly during placing of the data on the web page. I mean the following. If you modify some element on the page the position of all existing elements have to be recalculated. If you would be work with disconnected DOM objects first and then place the whole portion of data on the page you can improve the performance dramatically. You don't posted in the question which technology you use in you web application so I don't include any examples. If you use jQuery for example I could give some example which clear more what I mean.
The lookup index you propose is nothing else than a "custom" compression scheme. As amdmax stated, this will increase your performance if you have a lot of the same GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg states correctly, that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as good as possible (depending on the algorithm) until decompression.
To get sure, just send/generate the data uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will be better because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
What you will get instead of Guids which are 16 bytes long is int which is 4 bytes long. And you will get a dictionary full of key value pairs that will associate each guid to some int value, right?
It will decrease your transfer time when there're many objects with the same id used. But will spend CPU time before transfer to compress and after transfer to decompress. So what is the amount of data you transfer? Is it mb / gb / tb? And is there any good reason to compress it before sending?
I do not know how dynamic is your data, but I would
on a first call send two directories/dictionaries mapping short ids to long GUIDS, one for your associations and on for your employees e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2:EmpGUID2,...}. These directories may also contain additional information on the Associations and Employees instances; I suspect you do not simply display GUIDs
on subsequent calls just send the index of Employees per Association { 1: [2,4,5], 3:[2,4], ...}, the key being the association short id and the ids in the array value, the short ids of the employees. Given your description building the reverse index: Employee to Associations may give better result size wise (but higher processing)
Then its all down to associative arrays manipulations which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience if you're not TOO far from the goal, performance requirements are often trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are lacking. If it is size, you're starting to be limited if you're required to send the entire association list ever time. Is that truly a requirement? Can you send the entire list once, and then just updates?

fastest etag algorithm

We want to make use of http caching on our website - in particular content validation.
Because our CMS constructs pages from smaller fragments of content, the last modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of etags. Because page construction is based on lots of other page fragments we think the only real way to provide an accurate etag is by performing some sort of digest on the content stream itself. This seems a little over cooked as caching is supposed to ease the load off the servers but a content digest is obviously CPU intensive.
I'm looking for the fastest algorithm to create a unique etag that is relevant to the content stream (inode etc just is a kludge and wont work). An MD5 hash is obviously going to get the best unique result but is anybody else making use of other algorithms that are faster in a similar situation?
Sorry forgot the important details... Using Java Servlets - running in websphere 6.1 on windows 2003.
I forgot to mention that there are also live database feeds (we're a bank and need to make sure interest rates are up to date) that can also change the content. So figuring out when content has changed can be tricky to determine.
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.

Resources