Client-side processing with JavaScript vs server-side with mod_perl - AJAX

I have a Perl script that converts strings to different encodings, like base64, ASCII or hex (both ways). Now I am writing an AJAX front end for it, and my question is: if I want to automate detection of the encoding of the submitted string, is it more efficient to run a regex search on the string with JavaScript before I send it to the server, or is it faster to leave it to the Perl script to figure out what type of string it is?
To clarify, I am asking which of these two is better:
String submitted
Javascript detects the encoding
AJAX submits encoding and the string to perl script
Perl script returns decoded string
or
String submitted
AJAX submits the string to perl script
Perl script detects encoding and returns decoded string
Is there a rule of thumb for where this type of processing should be performed, and which do you think is the better (meaning faster) implementation?

You must validate your data on the server. Period. Otherwise you'll be sailing off into uncharted waters as soon as some two-bit wannabe "hacker" passes you a base64 string along with a tag claiming that your JavaScript thinks it's hex.
Given this, it's up to you whether you want to also detect encoding on the client side. This has some potential benefits, since it allows you to not send data to the server at all if it's encoded in an invalid fashion or to tell the user what encoding was detected and allow them to correct it if it's an ambiguous case (e.g., hex digits are a subset of the base64 character set, so any hex string could potentially be base64). Just remember that, if an encoding gets passed to the server by the client, the server must still sanity-check the received encoding specifier and be prepared to ignore it (or reject the request completely) if it's inappropriate for the corresponding data.
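For illustration, a rough client-side check might look like the sketch below (the function name and the exact character rules are assumptions; adjust them to whatever encodings your Perl script actually accepts). Note how the ambiguous hex/base64 case is reported rather than guessed:
// Sketch only: guess the likely encoding(s) of a submitted string on the client.
// The server must still validate whatever encoding tag it receives.
function guessEncoding(s) {
  var t = s.trim();
  if (/^[0-9A-Fa-f]+$/.test(t) && t.length % 2 === 0) {
    // Hex digits are a subset of the base64 character set, so this is ambiguous.
    return ['hex', 'base64'];
  }
  if (/^[A-Za-z0-9+\/]+={0,2}$/.test(t) && t.length % 4 === 0) {
    return ['base64'];
  }
  if (/^[\x00-\x7F]*$/.test(t)) {
    return ['ascii'];
  }
  return []; // unknown: let the server decide, or ask the user
}
The AJAX call can then submit the string plus the guess, and the Perl side re-checks the guess before decoding.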

This depends on the scale.
If there will be a LOT of client requests to do this, it's definitely "faster" to do it on the client side (e.g. in JS before the AJAX call), since putting it on the server side forces the server to process ALL of those requests, which will compete for the server's CPU resources, whereas on the client side each client only performs its own detection.
If you only anticipate very few concurrent requests, then doing it in Perl is probably marginally faster, since Perl's regex implementation is likely better/faster than JavaScript's (I don't have any stats to back this up, though) and presumably the server has a better CPU.
But I wouldn't expect the server-side margin to be terribly big, considering the whole process shouldn't take long on either side, so I'd advise going with client-side checking, since (as per the first paragraph) it scales better.
If the performance difference between the two really matters to you a lot, you should actually implement both and benchmark under both the average anticipated and the maximum projected client loads.

Related

Is Gzip compressed binary data or uncompressed text safe to transmit over https, or should it be base 64 encoded as the final step before sending it?

My question is in the title; the rest is context to help you understand my confusion. Everything is sent over HTTPS.
My understanding of base 64 encoding is that it is a way of representing binary data as text, such that the text is safe to transmit across networks or the internet because it avoids anything that might be interpreted as a control code by the various possible protocols that might be involved at some point.
Given this understanding, I am confused why everything sent over the internet is not base 64 encoded. When is it safe not to base 64 encode something before sending it? I understand that not everything understands or expects to receive things in base 64, but my question is why doesn't everything expect and work with this if it is the only way to send data without the possibility it could be interpreted as control codes?
I am designing an Android app and server API such that the app can use the API to send data to the server. There are some potentially large SQLite database files the client will be sending to the server (I know this sounds strange, yes it needs to send the entire database files). They are being gzipped prior to uploading. I know there is also a header that can be used to indicate this: Content-Encoding: gzip. Would it be safe to compress the data and send it with this header without base 64 encoding it? If not, why does such a header exist if it is not safe to use? I mean, if you base 64 encode it first and then compress it, you undo the point of base 64 encoding and it is not at that point base 64 encoded. If you compress it first and then base 64 encode it, that header would no longer be valid as it is not in the compressed format at that point. We actually don't want to use the header because we want to save the files in a compressed state, and using the header will cause the server to decompress it prior to our API code running. I'm only asking this to further clarify why I am confused about whether it is safe to send gzip compressed data without base 64 encoding it.
My best guess is that it depends on whether what you are sending is binary data or not. If you are sending binary data, it should be base 64 encoded as the final step before uploading it. But if you are sending text data, you may not need to do this. However, it still seems to my logic that this might depend on the character encoding used. Perhaps some character encodings can result in sending data that could be interpreted as a control code? If this is true, which character encodings are safe to send without base 64 encoding them as the final step prior to sending? If I am correct about this, it implies you should only use that gzip header if you are sending compressed text that has not been base 64 encoded. Does compressing it create the possibility of something that could be interpreted as a control code?
I realize this was rather long, so I will repeat my primary questions (the title) here: Is either Gzip compressed binary data or uncompressed text safe to transmit, or should it be base 64 encoded as the final step before sending it? Okay, I lied, there is one more question involved in this: would gzip-compressed text always be safe to send without base 64 encoding it at the end, no matter which character encoding it had prior to compression?
My understanding of base 64 encoding is that it is a way of representing binary data as text,
Specifically, as text consisting of characters drawn from a 64-character set, plus a couple of additional characters serving special purposes.
such that the text is safe to transmit across networks or the internet because it avoids anything that might be interpreted as a control code by the various possible protocols that might be involved at some point.
That's a bit of an overstatement. For two endpoints to communicate with each other, they need to agree on one protocol. If another protocol becomes involved along the way, then it is the responsibility of the endpoints for that transmission to handle any needed encoding considerations for it.
What bytes and byte combinations can successfully be conveyed is a matter of the protocol in use, and there are plenty that handle binary data just fine.
At one time there was also an issue that some networks were not 8-bit clean, so that bytes with numeric values greater than 127 could not be conveyed across those networks, but that is not a practical concern today.
Given this understanding, I am confused why everything sent over the internet is not base 64 encoded.
Given that the understanding you expressed is seriously flawed, it is not surprising that you are confused.
When is it safe not to base 64 encode something before sending it?
It is not only safe but essential to avoid base 64 encoding when the recipient of the transmission expects something different. The two or more parties to a given transmission must agree about the protocol to be used. That establishes the acceptable parameters of the communication. Although Base 64 is an available option for part or all of a message, it is by no means the only one, nor is it necessarily the best one for binary data, much less for data that are textual to begin with.
I understand that not everything understands or expects to receive things in base 64, but my question is why doesn't everything expect and work with this if it is the only way to send data without the possibility it could be interpreted as control codes?
Because it is not by any means the only way to avoid data being misinterpreted.
They are being gzipped prior to uploading. I know there is also a header that can be used to indicate this: Content-Encoding: gzip. Would it be safe to compress the data and send it with this header without base 64 encoding it?
It would be expected to transfer such data without base-64 encoding it. HTTP(S) handles binary data just fine. The Content-Encoding header tells the recipient how to interpret the message body, and if it specifies a binary content type (such as gzip) then binary data conforming to that content type are what the recipient will expect.
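For example, uploading a gzipped file as a raw binary body, with no base-64 step, might look like the following sketch (Node.js 18+, with a made-up URL; whether you also send Content-Encoding: gzip depends on whether you want the receiving stack to decompress it for you, which, per the question, you don't):
// Sketch: POST a gzip-compressed file as raw binary data over HTTPS, no base64.
const fs = require('fs');
const zlib = require('zlib');

async function uploadCompressed(path, url) {
  const compressed = zlib.gzipSync(fs.readFileSync(path));
  const res = await fetch(url, {
    method: 'POST',
    // Describe the body as gzip data; omitting Content-Encoding means the
    // server stores the bytes exactly as sent, still compressed.
    headers: { 'Content-Type': 'application/gzip' },
    body: compressed,
  });
  return res.status;
}

uploadCompressed('backup.sqlite', 'https://example.com/upload'); // hypothetical endpoint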
My best guess is that it depends on whether what you are sending is binary data or not.
No. These days, for all practical intents and purposes, it depends only on what application-layer protocol you are using for the transmission. If it specifies that some or all of the message is to be base-64 encoded (according to a particular base-64 scheme, as there is more than one) then that's what the sender must do and how the receiver will interpret the message. If the protocol does not specify that, then the sender must not perform base-64 encoding. Some protocols afford the sender the option to make this choice, but those also provide a way for the sender to indicate inside the transmission what choice has been made.
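As a quick illustration of there being more than one base-64 scheme, the standard and URL-safe alphabets produce different text for the same bytes (Node.js 16+ sketch):
// The same three bytes, encoded under two different base-64 alphabets.
const bytes = Buffer.from([0xfb, 0xef, 0xff]);
console.log(bytes.toString('base64'));     // "++//"  -- standard alphabet, uses + and /
console.log(bytes.toString('base64url'));  // "--__"  -- URL-safe variant, uses - and _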
Is either Gzip compressed binary data or uncompressed text safe to transmit, or should it be base 64 encoded as the final step before sending it?
Neither is inherently unsafe to transmit on today's networks. Whether data are base-64 encoded for transmission is a question of agreement between sender and receiver.
Okay, I lied, there is one more question involved in this: would gzip-compressed text always be safe to send without base 64 encoding it at the end, no matter which character encoding it had prior to compression?
The character encoding of the uncompressed text is not a factor in whether a gzipped version can be safely and successfully conveyed. But it probably does matter for the receiver, or anyone to whom they forward that data, to understand the uncompressed text correctly. If you intend to accommodate multiple character encodings then you will want to provide a way to indicate which one applies to each text.

How to work around the 100K character limit for the StanfordNLP server?

I am trying to parse book-length blocks of text with StanfordNLP. The HTTP requests work great, but there is a non-configurable 100K-character limit on the text length, MAX_CHAR_LENGTH in StanfordCoreNLPServer.java.
For now, I am chopping up the text before I send it to the server, but even if I try to split between sentences and paragraphs, there is some useful coreference information that gets lost between these chunks. Presumably, I could parse chunks with large overlap and link them together, but that seems (1) inelegant and (2) like quite a bit of maintenance.
Is there a better way to configure the server or the requests to either remove the manual chunking or preserve the information across chunks?
BTW, I am POSTing using the python requests module, but I doubt that makes a difference unless a corenlp python wrapper deals with this problem somehow.
You should be able to start the server with the flag -maxCharLength -1 and that'll get rid of the length limit. Note that this is inadvisable in production: arbitrarily large documents can consume arbitrarily large amounts of memory (and time), especially with things like coref.
The list of options to the server should be accessible by calling the server with -help, and is documented in the code here.
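For reference, a typical start line with that flag looks something like this (heap size, classpath and port are placeholders for your own setup):
java -mx8g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -maxCharLength -1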

LoadRunner Design Studio misses URL-encoded variations of correlated parameters

We are a shop that uses LoadRunner and VuGen (recording in standard HTTP/HTML web mode), and we have an application that uses long base64 parameters which in some cases are URL encoded (primarily with + turned into %2B) and in some cases are not. Design Studio finds only one or the other (depending on the server response), and correlates requests only in the source encoding, not both.
For example, let's say that there's a value which needs to be passed back into the application. In the original response from the server, which is the source of the correlation, the value is "ABCDEF+012345".
Now, in some cases it is submitted exactly like that, in which case Design Studio successfully correlates the server response and replaces the requests. However, in other cases it's submitted as "ABCDEF%2B012345" (URL encoded to replace + with %2B), in which case Design Studio does NOT correlate the requests which use that variation of the value.
Now, it's not a big deal to manually add a conversion function for a single instance and then search/replace the uses of the parameters that did not get correlated. Unfortunately this is cumbersome, and we have scripts with about 100 parameters which are 200-400 characters long. So not only would it take a lot of time to fix this, but since the values frequently wrap in the editor due to their length, search/replace isn't practical anyway.
Is there any way to have Design Studio correlate parameters when the used value may or may not be encoded? This seems like it should be a pretty common thing to do.
I recommend using the TruClient protocol, which solves all of these correlation issues.
If you insist on using the HTTP protocol, in your case you'd rather not use Design Studio. Capture the parameter yourself using the web_reg_save_param function, placed above the line of code that requests the HTTP page.
This way you can specify the left and right boundaries yourself, rather than letting Design Studio use regexes or other recognition methods that are sometimes problematic:
web_reg_save_param("param1", "LB=textleftofyourparam", "RB=textrightofyourparam", LAST);

Should the length of a URL string be limited to increase security?

I am using ColdFusion 8 and jQuery 1.7.2.
I am using CFAJAXPROXY to pass data to a CFC. Doing so creates a JSON array (argument collection) and passes it through the URL. The string can be very long, since quite a bit of data is being passed.
The site that I am working on has existing code that limits the length of any URL query string to 250 characters. This is done in the application.cfm file by testing the length of the query string; if any query string is greater than 250 characters, the request is aborted. The purpose of this was to ensure that hackers or other malicious code couldn't be passed through the URL string.
Now that we are using the query string to pass JSON arrays in the URL, we discovered that the Ajax request was being aborted quite frequently.
We have many other security practices in place, such as stripping any "<>" tags from code and using CFQUERYPARAM.
My question is whether limiting the length of a URL query string for the sake of security is a good idea, or simply ineffective.
There is absolutely no correlation between URI length and security. It is rather more a question of:
Limiting the amount of information that you provide to a user agent to a 'need to know' basis. This covers things such as the type of application server you run and its associated conventions, the web server you run and its associated conventions, and the operating system on the host machine. These are essentially things that can be considered vulnerabilities.
Reducing the impact of exploiting those vulnerabilities, i.e. introducing patches, ensuring correct configuration, etc.
As alluded to above, at the web tier this doesn't only cover GETs (your concern), but also POSTs, PUTs, DELETEs and just about any other operation on an HTTP resource.
Moved this into an answer for Evik -
That seems (at best) completely unnecessary if the inputs are being properly sanitized. I'm sure someone clever can quickly defeat a "security by small doorway" defense, assuming that's the only defense.
OWASP has some good, sane guidelines for web security. As far as I've read, limiting the size of the url is not on the list. For more information, see: https://www.owasp.org/index.php/Top_10_2010-Main
I would also like to echo Hereblur's comment that this makes internationalization tricky, or maybe impossible.
I'm not a ColdFusion developer, but I think it's the same with other languages.
I think it helps just a little bit. The problems of malicious code or SQL injection should be handled by your application.
I agree that limiting the length of query string values is safer and makes things a bit harder for attackers, but you can't do this with POST data, and it limits some functionality. For example:
A single UTF-8 character may take 9 characters once URL encoded, which means you can fit only about 27 non-English characters under a 250-character limit.
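As a quick illustration (the character is just an example of a 3-byte UTF-8 code point):
encodeURIComponent('漢');  // "%E6%BC%A2" -- one character becomes 9 characters once URL encoded
// so a 250-character limit leaves room for roughly 27 such characters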
The only reason to impose a limit has to do with performance and DoS attacks - not security per se (though DoS is a security threat, in that it can bring down your server). Web servers and app servers (including CF) allow you to limit the size of POST data so that your server won't be degraded by very large file uploads. Substantial URL data can likewise result in long-running requests as the server struggles to parse, handle, or write it.
So there is some modest risk here related to such things. Back in the NT days, IIS 3 had a number of flaws that were "locked down" by limiting the length of the URL - but those days are long gone. There is far more low-hanging fruit among exploits that I would look at first before examining this issue too closely - unless of course you feel you have a specific problem with folks probing you (with long URLs, I mean :).

What data formats can AJAX transfer?

I'm new to AJAX, but as an overview I'd like to know what formats you can upload and download. Is it limited to JSON or XML, or can you even send binary types like MP3 or UTF-8 HTML? And finally, do you have full control over the data, byte for byte, in something like a byte array, or is only a string sent/received?
If we are talking about AJAX, we are talking about JavaScript? And about XMLHttpRequest?
The XMLHttpRequest, which is just an HTTP request, can transfer everything. But there is no byte array in JavaScript, only strings, numbers and such. Everything you get back from an AJAX call is a piece of text (responseText). That might be parsed into XML (which gives you responseXML). Special encodings should be more a matter of the HTTP transport.
The binary stuff is not AJAX-dependent but JavaScript-dependent. There are some weird string encodings for delivering byte data inside JavaScript (especially for images), but they are not a general solution.
HTML is not a problem, and that is the most prominent use case. From this type of request you get an HTML string delivered, and that is added to some node in the DOM via innerHTML, which parses the HTML.
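A minimal sketch of that HTML case (the URL and element id are placeholders):
var xhr = new XMLHttpRequest();
xhr.open('GET', '/fragment.html');  // hypothetical URL returning an HTML snippet
xhr.onload = function () {
  // responseText is just a string; assigning it to innerHTML is what parses it into DOM nodes
  document.getElementById('target').innerHTML = xhr.responseText;
};
xhr.send();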
Since the data is transported via HTTP, you will have to make sure that you use some kind of encoding. One of the most popular is base64 encoding. You can find more information at: http://www.webtoolkit.info/javascript-base64.html
The methodology is to base64-encode the data you would like to send, then base64-decode it at the server (or the client) and use the original data as you intended.
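In current browsers the built-in btoa/atob functions cover the simple case (they only handle byte values 0-255, so anything outside Latin-1 has to be converted to bytes first):
var encoded = btoa('some data to send');  // "c29tZSBkYXRhIHRvIHNlbmQ="
var decoded = atob(encoded);              // back to "some data to send"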
You can transfer any type of data, either strings or bytes.
You can send anything you like; the problem may be how to handle it once you get it ;)
Standard HTML is probably the most common type of AJAX content in use out there - you can choose the character encoding too, although it's always best to stick with one type of encoding.
AJAX simply means you're transferring data asynchronously over HTTP with a JavaScript call. So your script makes a "normal" HTTP request using the XMLHttpRequest object. However, as the name implies, it's really only suited for text-based data formats, since you generally want to perform some action on the client side with the data you got back from the server (not always, though; sometimes people send XMLHttpRequests only to update something on the server).
On a side note, I have never seen an application where sending binary data would have been appropriate anyway.
Most often, people choose to send data over to the server with POST or GET (which are basically HTTP's inherent methods for transferring name-value pairs). For sending more complex data, for example hierarchical structures, the data needs to be encoded somehow. XML documents can be built natively in JavaScript, sent over to the server, and parsed into whatever data types are necessary. But since XML can be a bit of a pain, many devs use JSON-encoded data instead because it's easy to generate and easy to parse.
What the server sends back is equally arbitrary. Usually, you specify a callback function in your JavaScript that handles the incoming data. Again, the popular choices are XML and JSON; they parse easily into a document object or an array structure, respectively. You could also send plain text or some other packaging, but remember that you then have to take care of extracting the usable data from it yourself. Sometimes it can also be beneficial to send actual HTML fragments to the client to update something on the page directly.
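A small sketch of that round trip, sending JSON up and parsing the JSON that comes back (the endpoint and payload are made up):
var xhr = new XMLHttpRequest();
xhr.open('POST', '/api/save');  // hypothetical endpoint
xhr.setRequestHeader('Content-Type', 'application/json');
xhr.onload = function () {
  var result = JSON.parse(xhr.responseText);  // server replied with JSON text
  console.log(result);
};
xhr.send(JSON.stringify({ name: 'example', items: [1, 2, 3] }));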
For starters, I suggest you have a look at jQuery. It's a very lightweight framework that abstracts away a lot of the evil compatibility stuff and lets you write AJAX requests very nicely.
You can move anything that can be sent over HTTP. There are restrictions requiring the call to be made to the same domain the page was loaded from, but not on the content of the transfer. You can do either GET or POST transactions too.
There is a Digg the Blog entry titled DUI.Stream and MXHR that shows off what they call "Multipart XMLHttpRequests." It is alpha code now, but there is a demo that handles images.

Resources