Extended charset characters not recognized and converted to ? mark - UTF-8

I have a string that contains a special character such as "\u2012", i.e. FIGURE DASH. When I try to print it on the console I get a '?' mark instead of the symbol. I have an editor in which I can insert the symbol using alt+numpad, like alt+2012. In the editor I can see the symbol, but when I save it in an XML file and read the value back using nodevalue, I get a '?' mark.
To summarize, I am having trouble reading characters from the Latin Extended-A charset. What I need is that when I insert such symbols and read them back, I get something like &#xXXXX;.
Please help!
TIA :)
Simply put, I have a String inpath = "À"; and I want to get its Unicode value, like &#xXXXX;

The default console encoding in Windows is an MS-DOS code page, and those code pages don't support the character. You can try running chcp 65001 before running the program, but you might also need to change the console font.
You don't need to do anything you wouldn't do with any other character, as long as you use UTF-8 consistently. You aren't doing that in several places. Explicitly specify UTF-8 in your code when saving and reading the file, rather than relying on the platform default encoding.
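For the &#xXXXX; form asked about in the question, a minimal sketch follows. It is written in Ruby rather than the asker's Java, the helper name to_numeric_refs is hypothetical, and it assumes the input has already been decoded correctly (for example, read from the file as UTF-8).

```ruby
# Hypothetical helper: replace every non-ASCII character with a
# hexadecimal character reference such as &#x2012;.
def to_numeric_refs(text)
  text.encode("UTF-8").each_char.map do |ch|
    # ord returns the Unicode code point for a UTF-8 string
    ch.ord > 127 ? format("&#x%04X;", ch.ord) : ch
  end.join
end

puts to_numeric_refs("\u00C0")     # LATIN CAPITAL LETTER A WITH GRAVE
puts to_numeric_refs("a\u2012b")   # FIGURE DASH between two ASCII letters
```

If the input has already been turned into '?' by a lossy conversion, no escaping step can recover it, so the decoding has to be fixed first.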

Related

Setting special characters in QSettings

I'm trying to edit a desktop.ini file using QSettings. I need to set a value on a section that contains { and } which I believe are special characters.
I am currently setting these value using the following code:
desk_ini.beginGroup("{F29F85E0-4FF9-1068-AB91-08002B27B3D9}");
desk_ini.setValue("Prop5", "TestTag");
desk_ini.endGroup();
Unexpectedly, after executing the program, this is what it looks like in the INI file:
[%7BF29F85E0-4FF9-1068-AB91-08002B27B3D9%7D]
Prop5=TestTag1
After some reading, I found these: (Quoting from the QSettings Documentation)
The INI file format has severe restrictions on the syntax of a key. Qt works around this by using % as an escape character in keys.
It seems like QSettings is using % to escape { and }.
Now, I really need it to be "as is" for the desktop.ini to be read properly.
To reiterate, my question: Is there a way to set special characters in QSettings without changing them?

How to get the script's format

I am seeing this in my browser:
\xe18\xe23\xe23\xe21\xe0a\xe32\xe15
I believe it's some valid Thai script? But how do I know its format?
Thanks
It's hard to know if this is the correct answer without more details, but the sample you provided looks like a series of hexadecimal escape sequences.
In JavaScript, \x followed by two hexadecimal digits represents the character with that code point (0x00 to 0xFF, i.e. the Latin-1 range, not just ASCII).
You can check directly what the value is in your browser console:
console.log("\xe18\xe23\xe23\xe21\xe0a\xe32\xe15");
Output is:
á8â3â3â1àaã2á5
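Another possible reading, and this is only an assumption about the sample, is that each \x is followed by three hex digits naming a code point in the Thai block (U+0E00 to U+0E7F). A minimal Ruby sketch of that interpretation (the helper name decode_hex_escapes is hypothetical):

```ruby
# Hypothetical helper: treat each \xHHH (3 or 4 hex digits) as a Unicode
# code point and build a UTF-8 string from the resulting code points.
def decode_hex_escapes(raw)
  raw.scan(/\\x(\h{3,4})/).flatten.map(&:hex).pack("U*")
end

# Single quotes keep the backslashes literal, as in the text seen in the browser.
sample = '\xe18\xe23\xe23\xe21\xe0a\xe32\xe15'
puts decode_hex_escapes(sample)
```

Under that assumption the sample decodes to the code points U+0E18 U+0E23 U+0E23 U+0E21 U+0E0A U+0E32 U+0E15, all of which are Thai letters.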

How to Escape Double Quotes from Ruby Page Object text

In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
(RSpec::Expectations::ExpectationNotMetError)
I'm not quite sure how to escape these - I'm not sure where I could use regexes to escape these odd characters.
Honestly, you are overcomplicating your validation.
I would recommend simplifying what you are trying to do. Start by asking yourself: is the part in quotes a critical part of your validation?
If it is, isolate it with a substring check such as String#include?("EMS").
If it is not, then you are probably doing too much work; only check for exactly what you need in the validation:
String#start_with?("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string as UTF-8 before you pass it to your validation and you will be fine.
Good luck
It's pretty likely that something along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It also seems likely that the quotes were converted to "smart quotes" somewhere. This table compares Windows-1252 to UTF-8:
Code point            Character            UTF-8 bytes
Unicode  Win-1252     Expected  Actual
-------  --------     --------  ------     -----------
U+201C   0x93         “         â€œ        %E2 %80 %9C
U+201D   0x94         ”         â€         %E2 %80 %9D
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
error_str.encode('UTF-8')
Or, if the string is hardcoded somewhere, just replace the string.
For more debugging advice, see: "3 Steps to Fix Encoding Problems in Ruby" and "How to Get From Theyâ€™re to They’re".
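If the page really does contain curly quotes, one option is to normalize them before comparing. A minimal sketch, assuming the page text is a valid string that can be transcoded to UTF-8; normalize_quotes and page_text are hypothetical names, not part of the Page Object gem:

```ruby
# Hypothetical helper: transcode to UTF-8, then map curly double and
# single quotes (U+201C, U+201D, U+2018, U+2019) to their ASCII forms.
def normalize_quotes(text)
  utf8 = text.encode("UTF-8")
  utf8.tr("\u201C\u201D\u2018\u2019", %q(""''))
end

page_text = "Please select a category other than the Default \u201CEMS\u201D before saving."
expected  = 'Please select a category other than the Default "EMS" before saving.'

puts page_text.encoding                       # which encoding Ruby thinks the string has
puts normalize_quotes(page_text) == expected  # compare after normalizing the quotes
```

This only helps once the string is decoded correctly; if the text already shows as â?o, the decoding step earlier in the chain still has to be fixed.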

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an excel document into a .txt file. There are a few characters that I assume mean something to excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors.). I wanted to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could maybe work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters, which means your '?'s and rectangles could actually be unrecognized multi-byte sequences.
If you want to handle the spreadsheet contents properly, I recommend that you first export the data to CSV using (Open|Libre)Office, choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi byte sequences I find this regex to be handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
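If the goal is to drop only the unrecognised characters while keeping ordinary punctuation and non-ASCII letters, a less aggressive sketch is possible. It assumes the text is meant to be UTF-8, and strip_unprintable is a hypothetical name:

```ruby
# Hypothetical helper: remove invalid byte sequences first (String#scrub),
# then remove anything that is neither printable nor whitespace.
def strip_unprintable(line)
  line.scrub("").gsub(/[^[:print:][:space:]]/, "")
end

# \xFF is not valid UTF-8 and \x00 is a control character; both are removed,
# while the tab and the newline are kept.
puts strip_unprintable("Name\x00\tValue\xFF\n").inspect
```

Calling scrub before gsub matters here, because a regex match on a string with invalid bytes raises an ArgumentError.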

Quotation marks turn to question marks

So I have a Ruby script that parses HTML pages and saves the extracted strings into a DB...
but I'm getting weird characters (usually question marks) instead of plain text...
E.g.: ‘SOME TEXT’ instead of 'Some Text'
I've tried HTML entities and CGI::unescape... but to no avail...
I did some googling and set $KCODE = 'u' and require 'jcode'
still not working...
Any suggestions/pointers would be great
Thanks
PS : using mysql 5.1
Your script is storing the UTF-8 byte sequences for Unicode (curly) quotation marks, rather than ASCII quotation marks, in the database.
That's actually good - it shows that the DB itself is working fine, although for best results you should ensure that the table uses a UTF-8 collation (e.g. utf8_general_ci) so that string sorting works properly.
The fact that the output is displayed as "‘" just means that your terminal (and/or web page) output encoding is incorrect.
If it's terminal output, make sure that the LANG environment variable is set to an appropriate UTF-8 locale (e.g. en_US.UTF-8) and that the terminal emulator itself is set the same way.
If it's HTML output, make sure that the page encoding is set to UTF-8 as well, i.e.:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Is the DB that you're storing data in capable of handling Unicode? These symptoms seem to imply that it's not. For Unicode support under MySQL, please see this link.
It seems likely that the quotation marks in question are not the standard ASCII quotation marks but the Unicode ones.
Ruby has an iconv implementation to convert between encoding types. See here for more information.
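Note that Iconv was removed from Ruby's standard library in 2.0; on current versions String#encode covers the same ground. As a rough sketch, text that has already been garbled can sometimes be repaired by reversing the bad conversion. This assumes the damage came from UTF-8 bytes being decoded as Windows-1252, and repair_mojibake is a hypothetical name:

```ruby
# Hypothetical helper: undo "UTF-8 bytes read as Windows-1252" damage.
# Falls back to the original text when the assumption does not hold.
def repair_mojibake(text)
  candidate = text.encode("Windows-1252").force_encoding("UTF-8")
  candidate.valid_encoding? ? candidate : text
rescue EncodingError
  text # contains characters Windows-1252 cannot represent
end

garbled = "\u00E2\u20AC\u02DCSOME TEXT"   # the three characters shown as "‘"
puts repair_mojibake(garbled)             # starts with U+2018 if the repair applies
```

Fixing the encoding at the point where the HTML is read is still preferable to repairing strings afterwards.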

Resources