JMeter - Simulate accented UTF8 characters in http POST request - performance

I'm using Apache JMeter 2.8 to carry out some performance testing on one web-based information system.
There are several accented letters used in different requests - like 'ä', 'ö', 'ü' or 'õ'.
When it comes to running test scripts and executing requests, for example 'ä' value in some parameter turns into 'ä'. ('ä' - This is the way jmeter saves such character into a *.jmx file) Content encoding for these http requests is set to UTF-8. When i look at the contents of the project, all characters are displayed correcly. When i run test scripts wrong values are used.
Added later:
I can successfully simulate GET requests with utf8 chars, but still accented characters in my POST requests look like 'ä'. What can be the reason, why jmeter's GET requests' data has proper utf8 encoding and POSTs Windows-1252/ISO-8859-1/cp1252/"ANSI" instead?
Any ideas why this happens? Thanks in advance!

The characters displayed don't only depend on the bytes of the input but what decoding the display is using to interpret them. For example, ä, when encoded as UTF-8, is the bytes 0xC3A4.
Now, what does 0xC3A4 look like when displayed? That depends what decoding is used. Here's some examples:
UTF-8: ä
Windows-1252/ISO-8859-1/cp1252/"ANSI": ä
UTF-16BE: 쎤
UTF-32: �
Mac Os Roman: ä
Windows-1251: Г¤
And so on.

JMeter saves character correctly in JMX, ensure that you opened them with the right encoding (UTF-8).
In JMeter there is this property:
sampleresult.default.encoding=ISO-8859-1
which you can change if this is not the default encoding. But I am not sure it's the issue you are facing.
Check the "Encode?"
Solution is:
set content encoding to UTF-8
Check Encode? in parameters table as your parameters are non ascii ones
if this does not work, it reveals an issue on tested application:
request.setCharacterEncoding("UTF-8") must be called before using parameters if it's a Java Application.
Same concepts exist for PHP and ASP.

The thing got fixed by switching HTTP request implementation field from HttpClient4 to HttpClient3.1 + leaving HTTP Request Content encoding value empty :)
There might be some JMeter bug regarding using HttpClient4.

Related

Issues with using UTF-8 with PHPMailer

I'm using PHPMailer 5 to send plain text emails from forms. It looks like some users are pasting content from word into the textarea fields and the resulting email comes out with lots of non-readable characters (e.g. “).
I've tried adding $mail->CharSet = 'UTF-8'; and that seems to fix the tests I've done (e.g. bullet lists are now coming through properly).
$mail = new PHPMailer;
$mail->CharSet = 'UTF-8';
$mail->ContentType = 'text/plain';
$mail->IsHTML(false);
Are there any security issues or other issues that could come up from setting the character set to UTF-8?
You're doing it right. PHPMailer defaults (as does PHP's internal mail function) to the ISO-8859-1 character set because that can be used in the absence of the mbstring PHP extension which is not available by default - and if you don't have that extension, UTF-8 support won't work. Once you switch to using UTF-8, your entire toolchain must also use UTF-8 - your editors, your database, your database connection. You also need to be wary of functions like strlen and substr, which are not UTF-8-safe because they work in bytes, not chars (which may be more than 1 byte long). Whenever one of those things gets it wrong, you'll see the kind of corruption you have. It's a good exercise to stick in some difficult strings to test with (though see my answer about that) to make sure it comes through unscathed.
Unfortunately, MS Word is one of the best examples of how to do UTF-8 badly; it often riddles the text with unnecessary unusual characters, extra control chars etc, so I would advise doing some heavy filtering on your inputs - editors like CKEditor have built-in filters to help deal with Word's issues. That doesn't have anything to do with PHPMailer, it's a just a common problem with dealing with input that has been touched by Word.
The only thing you're doing wrong is using PHPMailer 5.x; current version is 6.x.

tcl utf-8 characters not displaying properly in ui

Objective : To have multi language characters in the user id in Enovia v6
I am using utf-8 encoding in tcl script and it seems it saves multi language characters properly in the database (after some conversion). But, in ui i literally see the saved information from the database.
While doing the same excercise throuhg Power Web, saved data somehow gets converted back into proper multi language character and displays properly.
Am i missing something while taking tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)

JSON encoding issue with Ruby 1.9 and HTTParty

I've created a WebAPI that returns JSON.
The initial data is as follow (UTF-8 encoded):
#text="Rosenborg har ikke h\xC3\xB8rt hva Steffen"
Then with a .to_json on my object, here is what is sent by the API (I think it is ISO-8859-1 encoding) :
"text":"Rosenborg har ikke h\ufffd\ufffdrt hva Steffen"
I'm using HTTParty on the client side, and that's what I finally get :
"text":"Rosenborg har ikke h��rt hva"
Both WebAPI and client app are using Ruby 1.9.2 and Rails 3.
I'm a bit lost with this encoding issue... I tried to add the utf8 encoding header to my ruby files but it didn't changed anything.
I guess that I'm missing an encoding / decoding part somewhere... anyone has an idea?
Thank you very much !!!
Vincent
In Ruby 1.9, encoding is explicit now. However, Rails may or may not be configured to send the responses in the encoding you expect. You'll have to set the global configuration setting:
Encoding.default_external = "utf-8".
I believe the encoding that Ruby specifies by default for serialization is the platform default. In America on Windows that would be CodePage-1251. Other countries would have an alternate encoding.
Edit: Also see this url if the json is executed against MySQL: https://rails.lighthouseapp.com/projects/8994/tickets/5210-encoding-problem-in-json-format-response
Edit 2: Rails core and its suite of libraries (ActiveRecord, et. al.) will respect the Encoding.default_external configuration setting which encodes all the values it sends. Unfortunately, because encoding is a relatively new concept to Ruby not every 3rd party library has been adjusted for proper encoding. The ones that have may require additional configuration settings for those libraries. This includes MySQL, and the RSolr library you were using.
In all versions of Ruby before the 1.9 series, a string was just an array of bytes. When you've been thinking like that for so long, it's hard to wrap your head around the concept of multiple string encodings. The thing that is even more confusing now is that unlike Java, C#, and other languages that use some form of UTF as the native string format, Ruby allows each string to be encoded differently. In retrospect, that might be a mistake, but at least now they are respecting encoding.
The Encoding.force_encoding method is designed to treat the byte sequence with that new encoding, but does not change any of the underlying data. So it is possible to have invalid byte sequences. There is another method called .encode() that will transform the bytes from one encoding to another and guarantees valid byte sequences. For more information read this:
http://blog.grayproductions.net/articles/ruby_19s_string
Ok, I finally found out what the problem is...
I'm using RSolr to get my data from Solr, and by default encoding for all results is unfortunately 'US-ASCII' as mentioned here (and checked by myself) :
http://groups.google.com/group/rsolr/browse_thread/thread/2d4890fa7737e7ef#
So you need to force encoding as follow :
my_string.force_encoding(Encoding::UTF_8)
There is maybe a nice encoding option to provide to RSolr!

ruby string encoding

So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some other strings I want that really do not work at being converted to latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I viewsource them and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. But it als means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
So, the issue is that ANN only specifies encoding via headers, and Nokogiri doesn't receive the headers from the open() function. So, Nokogiri guesses that the page is latin-encoded, and produces strings that we really can't reverse to get back the original characters from.
You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.

How do you think Google is handling this encoding issue?

I recently came across an encoding issue specific to how Firefox encodes URLs directly entered into the address bar. It basically looks like the default Firefox character encoding for URLs is NOT UTF-8, which is the case with most browsers. Additionally, it looks like they are trying to make some intelligent decisions as to what character encoding to use, based on the content of the URL.
For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:
For the given query string parameter, this is how it's actually encoded in the http request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because notice that the first part of the value is the same as 1, which was iso-8859-1 encoded).
So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.
However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is now encoded to iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we are always expecting a UTF-8 encoded URL ... so when we recieve this parameter our application does not know how to interpret it and it can cause some strange results.
As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:
http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni
So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
Looks like it is using latin-1 unless any characters can't be represented in that encoding, otherwise it is using UTF-8.
If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).
Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.
There are several parts in a url. The domain name is encoded according to the IDN (International Domain Names) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).
The part that you care about comes (usually) from a form. And the encoding of the source page determines the encoding (before the % escaping). The form element in html can also take an encoding attribute which overrides the the page setting.
So it is not the fault of Firefox, the encoding of the referrer page/form is the determining factor. And that is the standard behavior.

Resources