Quotation marks turn to question marks - ruby

So I have a Ruby script that parses HTML pages and saves the extracted strings into a DB...
but I'm getting weird characters (usually question marks) instead of plain text...
Eg : ‘SOME TEXT’ instead of 'Some Text'
I've tried HTML entities and CGI::unescape... but to no avail.
I did some googling and set $KCODE = 'u' and added require 'jcode',
but it's still not working...
Any suggestions/pointers would be great.
Thanks
PS : using mysql 5.1

Your script is storing the Unicode quotation mark characters (instead of ASCII quotation marks) in the database.
That's actually good - it shows that the DB itself is working fine, although for best results you should ensure that the table uses a UTF-8 collation such as utf8_general_ci or utf8_unicode_ci so that string sorting works properly.
The fact that the output is displayed as "‘" just means that your terminal (and/or web page) output encoding is incorrect.
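One way to check this diagnosis is to look at the bytes instead of the rendered glyphs. The sketch below is illustrative only (the variable names are not from the original script). It shows that a left curly quote is three bytes in UTF-8, and what a viewer that wrongly assumes Windows-1252 would make of those same bytes.

```ruby
# U+2018 LEFT SINGLE QUOTATION MARK, as a UTF-8 string.
quote = "\u2018"

# The three bytes that are actually stored.
hex_bytes = quote.bytes.map { |b| format('%02X', b) }

# Simulate a terminal or page that wrongly assumes Windows-1252:
# relabel the same bytes, then transcode to UTF-8 so they can be printed.
misread = quote.dup.force_encoding('Windows-1252').encode('UTF-8')

puts hex_bytes.join(' ')
puts misread
# What Ruby assumes about the outside world (terminal, files) by default.
puts Encoding.default_external
```

If the bytes in the database look like the first line, the data is fine and only the display side needs fixing.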
If it's terminal output, make sure that ENV['LANG'] is set to an appropriate UTF-8 locale (e.g. en_US.UTF-8) and that the terminal emulator itself is set the same way.
If it's HTML output, make sure that the page encoding is set to UTF-8 as well, i.e.:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Is the DB that you're storing data in capable of handling Unicode? These symptoms seem to imply that it's not. For Unicode support under MySQL, please see this link.
It seems likely that the quotation marks in question are not the standard ASCII quotation marks but the Unicode ones.
Ruby has an iconv implementation to convert between encoding types. See here for more information.
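Note that Iconv was removed from Ruby's standard library in 2.0; String#encode covers the same ground. Here is a minimal sketch, assuming the scraped page was served as Windows-1252 (the sample bytes are made up for illustration):

```ruby
# Bytes as they might come from a page served in Windows-1252:
# 0x91 and 0x92 are that code page's curly single quotes.
raw = "\x91Some Text\x92".b

# Label the bytes with their real encoding, then transcode to UTF-8.
utf8 = raw.force_encoding('Windows-1252').encode('UTF-8')

# Optional: fold curly quotes to ASCII if the DB column cannot hold them.
ascii_quotes = utf8.tr("\u2018\u2019\u201C\u201D", %q(''""))

puts utf8
puts ascii_quotes
```

Folding to ASCII loses information, so it is probably better kept as a fallback for when the database really cannot store UTF-8.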

Related

How to Escape Double Quotes from Ruby Page Object text

In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
(RSpec::Expectations::ExpectationNotMetError)
I'm not quite sure how to escape these - I'm not sure where I could use Regexs and be able to escape these odd characters.
Honestly, you are overcomplicating your validation.
I would recommend simplifying what you are trying to do. Start by asking yourself: is the part in quotes a critical part of your validation?
If it is, isolate it with String#include?, e.g. text.include?("EMS"), where text is the string pulled from the page.
If it is not, then you are probably doing too much work. Only check for exactly what you need in validation:
text.start_with?("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string with utf-8 before you pass it to your validation and you will be fine.
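As a rough sketch of the checks above, assuming the page really contains curly double quotes (U+201C and U+201D), which the garbled output suggests but does not prove. The variable names are illustrative:

```ruby
# What the page object would return once the encoding is right.
actual = "Please select a category other than the Default \u201CEMS\u201D before saving."

# Check only the parts that matter.
starts_ok    = actual.start_with?('Please select a category other than the Default')
mentions_ems = actual.include?('EMS')

# If the whole sentence must match, fold curly quotes to straight ones first.
normalized = actual.tr("\u201C\u201D", '""')
exact_ok   = normalized == 'Please select a category other than the Default "EMS" before saving.'

puts starts_ok, mentions_ems, exact_ok
```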
Good luck
It's pretty likely that something along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It seems pretty likely that the quotes were converted to "smart quotes" somewhere. This table compares Windows-1252 to UTF-8:
Code point               Characters           UTF-8 bytes
Unicode   Windows-1252   Expected   Actual
-------   ------------   --------   ------    -----------
U+201C    0x93           “          â€œ       %E2 %80 %9C
U+201D    0x94           ”          â€        %E2 %80 %9D
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might want to have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
error_str.encode('UTF-8')
Or, if the string is hardcoded somewhere, just replace the string.
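One caveat: encode only helps when the string's encoding label is correct. If the bytes are labelled ASCII-8BIT, or labelled UTF-8 but invalid, they have to be relabelled first. A minimal sketch follows; to_utf8 is a hypothetical helper, and Windows-1252 is only a guess at the real source encoding.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of str.
# assumed_source is a guess about what mislabelled bytes really are.
def to_utf8(str, assumed_source: 'Windows-1252')
  mislabelled = str.encoding == Encoding::ASCII_8BIT ||
                (str.encoding == Encoding::UTF_8 && !str.valid_encoding?)
  if mislabelled
    # The label is wrong or missing: relabel the bytes, then transcode.
    str.dup.force_encoding(assumed_source).encode('UTF-8')
  else
    # The label is trustworthy: a plain transcode is enough.
    str.encode('UTF-8')
  end
end

# Sample input: curly double quotes as raw Windows-1252 bytes.
error_str = "\x93EMS\x94".b
puts error_str.encoding
fixed = to_utf8(error_str)
puts fixed.encoding
puts fixed
```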
For more debugging advice, see: 3 Steps to Fix Encoding Problems in Ruby and How to Get From Theyâ€™re to They’re.

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see CafÃ© showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8 and ruby can't combine them? Do I need to convert or sanitize my strings after parsing them with Nokogiri (into UTF-8)?
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing CafÃ©.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode and was able to pass my strings to a function that translates any problematic characters e.g. the letter a with an acute to the plain letter a. Is there anything similar for ruby? (This isn't necessarily what I'm aiming for though; if I can keep the a with acute then that's preferable, I think.)
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
I hate character encoding.
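The situation described in Edit4 can be reproduced in a few lines. This is a sketch with made-up values: db_value stands in for whatever the database driver returned, labelled ASCII-8BIT even though its bytes are valid UTF-8.

```ruby
scraped  = "Caf\u00E9"      # UTF-8 text from the HTML
db_value = "Caf\u00E9".b    # same bytes, but labelled ASCII-8BIT

# Mixing the two fails as soon as both contain non-ASCII bytes.
error = nil
begin
  [scraped, db_value].join("\n")
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.class

# Relabelling the database value (the bytes are untouched) makes the join work.
joined = [scraped, db_value.dup.force_encoding('UTF-8')].join("\n")
puts joined
```

force_encoding is only safe here because the bytes really are UTF-8; it changes the label, not the data.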
Presumably, CafÃ© is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the CafÃ© that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "CafÃ©"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
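While tracking down the culprit, text that is already damaged can sometimes be repaired by running the mistake backwards. This sketch assumes the damage is exactly one round of "UTF-8 bytes read as Latin-1, then re-encoded as UTF-8"; any character outside Latin-1 would raise Encoding::UndefinedConversionError.

```ruby
# The garbled text seen in the logs: C, a, f, U+00C3, U+00A9.
garbled = "Caf\u00C3\u00A9"

# Step 1: encode back to Latin-1, which recovers the original bytes.
# Step 2: relabel those bytes as the UTF-8 they always were.
repaired = garbled.encode('ISO-8859-1').force_encoding('UTF-8')

puts repaired
puts repaired.valid_encoding?
```

This is a patch for existing data, not a substitute for fixing the encoding at the edge where it goes wrong.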
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

Not sure why the output of my PHP scripts contains random embedded spaces within character strings

I have written several PHP scripts to read the contents of a database and output those contents in an email message. Every once in a while, I will see a SPACE (0x20) character embedded in the output where there shouldn't be any. For example, in one script, I reference a PHP global variable containing exactly "n" non-space characters, and sometimes (not always), when that variable is dumped to an email message, the string will appear with an embedded blank (making the total length of the string "n+1"). Other times, an HTML tag (such as <BR>) will appear as < BR> (note the SPACE before the "B").
Because the behavior of the script is not consistent (some emails are affected, and others aren't), I can't seem to find the problem.
I am enclosing a link to the PHP script that is occasionally embedding a space into the BREAK tag. I have removed the lines that provide specific login information to the databases. Otherwise, everything else is intact. In the code file you can find at the link below, line 281 is the one that contained the BREAK command with the embedded SPACE (as described above). This has happened only once!
http://jem-software.com/temptest.txt
I guess the only other potentially relevant information is that this script file is taken from code entered into a JUMI code block contained within a Joomla! based website.
Edit 1:
Thank you, Riccardo, for your suggestions. Here is some more clarification:
I am not reading an email and parsing the results in order to insert into a database. Just the opposite, I am reading from a database and using the results to create an email. I will check the database to see what character set was used, and explicitly pass the character set to see if that makes a difference.
I don't use Joomla functions to access the database because the database I am referencing is external to the Joomla! environment. It is a pre-existing database created from PHP scripts written several years prior. When my old website was re-written using Joomla, I wanted to "port" the PHP database access code intact, so I installed the JUMI plugin to make this possible.
I will check out the character coding in the database and synchronize it to the character code of the email message.
I don't understand how an issue with character coding would result in the insertion of a SPACE into the hard-coded HTML tag - this tag did not come from any database, but was typed into the email as a literal string.
This is a strange issue, but here are my two cents:
The first is you're not using Joomla functions to access the db and the mail subsystem. While this could work, it's not really nice.
The second is, this smells like a character set / codepage issue.
Here are a few considerations on the character set issue:
I read your code quickly, and I didn't notice anything wrong. But Joomla uses UTF-8, and your queries don't specify it (mysql_set_charset() is missing!) which could be a first issue.
The second is that the emails you read will have different character sets, depending on the senders' settings. Make sure you handle the codepage issues properly: the following is a snippet of a function I use for parsing email:
// In the calling method: fetch the MIME headers for this part, then decode the body.
$mime = imap_fetchmime($this->connection, $this->messageNumber, $partNumber);
return $this->decodeMailBody($data, $mime); // QUOTED_PRINTABLE

function decodeMailBody($string, $mime) {
    $str = quoted_printable_decode($string);
    // Example $mime values:
    //mime: Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8
    //mime: Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252
    // Whatever follows the last "charset=" is taken as the charset name.
    $mimes = explode('charset=', $mime);
    foreach ($mimes as $mimepiece) {
        $charset = $mimepiece;
    }
    $charset = strtolower(trim($charset));
    // Debug output, printed only once $charset has been set.
    echo "<h3>mime: $mime; charset $charset</h3>";
    if ($charset == 'utf-8') {
        return $str;
    } else {
        return iconv($charset, 'UTF-8', $str);
    }
}
Last, make sure you use utf-8 when you insert the mail into the db after parsing it.

Extended charset chars not recognized and converted to ? mark

I have a string containing a special character such as "\u2012", i.e. FIGURE DASH. When I try to print it on the console I get a '?' instead of the symbol. I have an editor in which I can insert the symbol using alt+numpad, like alt+2012. In the editor I can see the symbol, but when I save it in an xml file and read the value back using nodevalue, I get a '?'.
To summarize, I am having trouble reading characters from the Latin Extended-A charset. What I need is that when I insert such symbols and read them back, I get something like &#xXXXX;.
Please help!
TIA :)
Simply put, I have a String inpath = "À"; and I want to get its Unicode value, like &#xXXXX;
The default console encoding in Windows is some MS-DOS code page and they don't support the character. You can try running chcp 65001 before running the program but you might also need to change the console font as well.
You don't need to do anything you wouldn't do with any other character, as long as you use UTF-8. You aren't doing that in many places. You need to explicitly write in your code to save and read the file in UTF-8, and not rely on the platform default encoding.
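For the &#xXXXX; form the asker wants, the conversion itself is small once the string has been read correctly. The question is about Java, but the idea is shown here in Ruby for consistency with the rest of this page. to_numeric_refs is a hypothetical helper, and the four-digit padding is an assumption about the desired format.

```ruby
# Hypothetical helper: replace every non-ASCII character with a
# hexadecimal numeric character reference such as &#x2012;.
def to_numeric_refs(text)
  text.each_char.map { |ch|
    ch.ascii_only? ? ch : format('&#x%04X;', ch.ord)
  }.join
end

puts to_numeric_refs("\u00C0")          # the character from the question
puts to_numeric_refs("a\u2012b")        # FIGURE DASH between ASCII letters
```

This only works if the character survived reading. A '?' that was substituted earlier in the pipeline cannot be turned back into the original code point.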

ISO-8859-1 characters treated as UTF-8 in XSLT attributes

The ¬ character (0xAC in ISO-8859-1) works for normal text if I ensure that ISO-8859-1 is always used as the encoding throughout. However, when using it in attributes it is escaped to: %C2%AC. I understand that it needs to be escaped for urls, but not why it escapes it in the same way as it would for UTF-8, rather than just %AC as I'd expect it to for ISO-8859-1.
Since the escapes are in the output html file the only conclusion is that the xslt processor is the cause.
Example:
input.xml
stylesheet.xslt
makefile
Which for me generates:
output.html
Output was generated using xsltproc, compiled against libxml 20707, libxslt 10126 and libexslt 815. This was on #! Linux (amd64). I have also tried: xmlstarlet tr (also uses libxml), xalan and google chrome (by adding an <?xml-stylesheet ... > tag, see input_ss.xml) with the same result.
Opera doesn't escape it at all, and it allows ¬ to be used literally in the url and attribute.
Is this standard behaviour for xslt or is this a bug in the way the attributes are escaped? And either way, is there a solution other than replacing %C2%AC with %AC bearing in mind it is almost certainly the same for other characters that are valid ISO-8859-1 and invalid in UTF-8.
There are 3 different text-based technologies in use here, XML, HTML and URIs.
All of these have escape mechanisms - that is to say, ways to use text to indicate other text that it is impossible or difficult to indicate in a given context.
The not-sign character ¬ (U+00AC) could be escaped in the first two as &#172; or &#xAC;, perhaps with some leading zeros, in both XML and HTML (&not; would also work in HTML). This escape would be used no matter what encoding the XML or HTML was in, because it relates to the character ¬, not to its set of octets in a given character encoding - indeed, we would generally only use it in the case where there was no such set of octets in the encoding being used.
In this case, this is unnecessary, since the output is in a character encoding in which there is no need to escape it, and so in the source you can see The ¬ character unescaped.
This HTML includes the text of a URI. The encoding of the HTML has nothing to do with this, because the encoding is how we get the text of the HTML from one machine to another, but when the HTML is being parsed to read this URI we're past that point and are dealing with some text at the level of text - that is to say, it doesn't have an encoding any more.
Now, URIs have their own escape mechanisms. This must be used in the case of ¬, as it is not a character allowed in URIs (as opposed to IRIs). Sadly, unlike the escapes in XML and HTML, these escapes are based on octets in a given encoding rather than the code-point of the character itself.
It's easy to see this as a mistake now, but URIs were specified in 1994 and that formalised work going back to 1989/1990, while Unicode 1.0 was released in 1991 and didn't have the ground-breaking 2.0 until 1996, so we have considerably more benefit of hindsight than URIs' inventors did. (HTML had the same problem many years ago, but the format of its encodings made it much easier to fix this without as many backwards-compatibility issues).
So, what encoding should we use for those octets? The original specs left this undefined, but really the only possible choice is UTF-8. It's the only encoding that keeps the escapes commonly used for characters special to URIs in the range 0x20 - 0x7F while also covering all of the UCS.
There's also no way to indicate another choice could be more appropriate. Remember, we're working at the level of text, so your use of ISO-8859-1 is completely irrelevant. Even if we kept track of the encoding while parsing the HTML, the URI is going to be made use of in a way that is nothing to do with the document, so we still couldn't use it. In all, if we have to make use of an octet-based encoding, and we have to keep characters in the ASCII range matching the octets they'd have in ASCII, the only possible basis for the encoding is UTF-8.
For that reason, the escape in any URI for ¬ must always be %C2%AC.
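The difference between the two escapes is simply which bytes get percent-encoded. Here is a small Ruby sketch; percent_escape is a hypothetical helper written for illustration, while URI.encode_www_form_component is the standard library call.

```ruby
require 'uri'

# Hypothetical helper: percent-encode every byte of text after
# transcoding it to the given encoding.
def percent_escape(text, encoding)
  text.encode(encoding).bytes.map { |b| format('%%%02X', b) }.join
end

not_sign = "\u00AC"

puts percent_escape(not_sign, 'UTF-8')        # two octets
puts percent_escape(not_sign, 'ISO-8859-1')   # one octet
# The standard library escapes the UTF-8 octets of a UTF-8 string.
puts URI.encode_www_form_component(not_sign)
```

The first and third lines agree, which matches what the XSLT processors in the question produce.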
There can be some legacy systems that expect URIs to use other encodings, but the solution is to fix the bit that's broken, not the bit that works, so if something expects ¬ to be %AC then catch it close to that by converting %C2%AC close to its use (and if it outputs %AC itself then of course you'll need to fix it to %C2%AC before it hits the outside world).
The XSLT spec says that when serializing URI-valued attributes, all non-ASCII characters are escaped using the %HH-escaping of the UTF-8 octets that represent the character. Although %HH-escaping of other encodings has been used in the past, it is no longer used today. This is quite independent of the encoding of the document itself.

Resources