Testing for extended characters in watir-webdriver - ruby

I need to check for text with extended character set characters in my watir-webdriver scripts.
For example, checking that a link has the following text:
Weiß
I read the text from a CSV file, which when edited looks like the above text.
But when running the test in Firefox I get the following failure.
Wrong values on attribute table after add all save.
<"Wei\247"> expected but was
<"Wei\303\237>.
I tried saving it in the CSV as Wei\303\237 but the expected value then had double backslash characters.
How can I encode this in the CSV so I can check the text value safely cross platform and browser?

I had this problem, and I got around it by writing it in the spreadsheet as something like {S} and gsubbing it when I read the file into Ruby. If you gsub the text when you check the link too then basically you have your own encoding method for special characters. This is a long way around, so I'd be very interested in other answers.
The double backslash is probably because escape sequences like \303 only have meaning inside Ruby source literals; when read from a file they are just a literal backslash followed by digits, which Ruby's output then escapes as \\. Therefore you can't put the escaped bytes in your CSV file and expect them to turn back into the character. I don't really know a way around this. I hear that Ruby Unicode support isn't that great, but is being worked on as of 1.9.x.
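A rough sketch of that placeholder approach (the {S} token, the file name and the watir call are illustrative, not from the original scripts):

require 'csv'

# Illustrative mapping: tokens used in the spreadsheet => real characters.
# "\303\237" is the UTF-8 byte sequence for the sharp s (Ruby 1.8 style).
SPECIALS = { '{S}' => "\303\237" }

def decode_specials(text)
  SPECIALS.inject(text) { |t, (token, char)| t.gsub(token, char) }
end

expected = decode_specials(CSV.read('expected.csv')[0][0])
browser.link(:text => expected).exists?  # compare against the decoded text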

Related

How to use $ (dollar sign) and ^ (caret) in YAML?

I saw a YAML file that includes some signs like $ and ^. For $, I think it tries to get a value from a JSON file, but for ^ I'm not sure.
I tried to search the YAML syntax but cannot find the usage of those signs.
Could anyone point out where that usage comes from? Thanks a lot!
examples:
json: $.A.Documents[*]
input: ^.B.ID
YAML doesn't assign any special meaning to those characters. As far as YAML is concerned, they are simply part of the content.
Of course, the software loading that YAML can do anything with the loaded data – including inspecting the loaded scalars for $ and ^ and implementing some action on them.
While someone might be able to correctly guess which software expects a YAML file like the one you show, it would be vastly easier for you to check the context in which you found that YAML file. This should lead you to the information you seek – i.e., for which software that YAML file has been written. That software's documentation will then describe how those characters are processed.
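You can verify this quickly; here is a minimal Ruby sketch showing that both characters come back as ordinary string content:

require 'yaml'

doc = YAML.load("examples:\n  json: $.A.Documents[*]\n  input: ^.B.ID\n")
p doc['examples']['json']   # => "$.A.Documents[*]"  (just a plain string)
p doc['examples']['input']  # => "^.B.ID"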

How do I find formatting settings for CSV on Mac?

I have a Python program that extracts data from an API, applies transformations, and converts it to a CSV to be used in Tableau. When I view the file in Excel and Google Sheets, it looks fine: no data formatting or read errors, as it is formatted in standard UTF-8.
When I read it in Tableau, it's a different story: the columns lose shape and get parsed incorrectly.
I am thinking it has to do with the fact that my data set is text-heavy and contains punctuation, but I have been able to work with data in this format before without having to do any custom formatting.
It looks like your CSV has multiline fields (which are quoted).
You'll somehow have to tell the Tableau reader/parser to read your data as quoted (and multiline).
Also check the escaping of the quotes (if they are inside a field) - usually this is done with another quote, but could also be with a backslash.
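To illustrate what a conforming parser does with such data, here is a small Ruby sketch (the field contents are made up):

require 'csv'

data = "id,comment\n" +
       "1,\"first line\nsecond line\"\n" +  # quoted field spanning two lines
       "2,\"she said \"\"hi\"\"\"\n"        # quote escaped by doubling it
CSV.parse(data).each { |row| p row }
# => ["id", "comment"]
#    ["1", "first line\nsecond line"]
#    ["2", "she said \"hi\""]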

find reason for automatic encoding detection (UTF-8 vs Windows-1252)

I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything looks fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection of applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
<?php
require 'vendor/autoload.php'; // Composer autoload for neitanod/forceutf8

$before = file_get_contents('export.csv');
$after  = \ForceUTF8\Encoding::toUTF8($before); // force anything non-UTF-8 to UTF-8
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \pL in the regular expression without the u modifier had an unpleasant side effect: PCRE processed the subject byte by byte, so some bytes of multibyte UTF-8 sequences did not count as letters and were stripped, leaving the special characters corrupted. A quick solution is to add the u flag to the regex (see the regex pattern modifiers reference), which makes PCRE treat both pattern and subject as UTF-8, so the output of this preg_replace remains valid UTF-8. See also this answer.
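For reference, the corrected call described above differs only by the added modifier:

$value = preg_replace('/([^\pL0-9 -])+/u', '', $value);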

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encoding and such... but what I want to know is: in any given script handling any given text file, is there some universal library or call I can make to turn non-standard characters into their nearest printable equivalents? I realize there's no all-in-one fix, but this is for an English (U.S. government) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue with English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is just literally a hyphen as I've typed it out here. In the file, though, it's something that looks like a hyphen (an en dash?) but disappears when copying and pasting it, for example into this browser text box.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+ does, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex because it knows how to transliterate accented characters into similar-looking ones, or drop them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Gray's article linked in that answer; it explains the problem and ways to fix it, and ends up recommending Iconv too.
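A small sketch of that route on Ruby 1.8 (the sample string is made up; the exact output depends on the platform's iconv implementation):

require 'iconv'

text = "0-8\xc2\xad23"  # contains a UTF-8 soft hyphen (U+00AD)
clean = Iconv.conv('US-ASCII//TRANSLIT//IGNORE', 'UTF-8', text)
puts clean  # => "0-823" or "0-8-23", depending on how iconv transliterates U+00AD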
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).
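Applied to the example above, keeping space and the plain hyphen as the "expected symbols":

"0-8\xc2\xad23".gsub(/[^a-zA-Z0-9 \-]/, '')  # => "0-823" (the soft-hyphen bytes are stripped)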

Escaping special characters in User Input in IzPack Installer

I have an IzPack installer that takes in a lot of User Inputs and substitutes them in an XML file. This XML file is actually the configuration file for my application.
There is a major problem that I have hit and I can't get past it.
In the input fields (in the installer) the user can enter any text, including special characters like & # % ' etc. These special characters mess up my XML file, as they are not allowed in XML syntax and need to be escaped; for example, & must be written as &amp;.
So far I have been asking the user to do this, i.e. escape the special characters themselves, but that's not working either.
Is there a way to have this done automatically? I really need a solution fast.
I am using IzPack V 4.1
You should use a proper XML API (SAX, DOM) to generate the XML file; this will apply the correct escaping automatically. It may look more complicated at first, but it guarantees that a well-formed, syntactically correct file is written.
Searching for JAXP should give you a proper starting point.
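A minimal DOM sketch of that idea (the element names and the sample input are made up; any JAXP implementation shipped with the JDK will do):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ConfigWriter {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder().newDocument();
        Element root = doc.createElement("config");
        doc.appendChild(root);

        Element password = doc.createElement("password");
        password.setTextContent("p&ss<w0rd");  // raw user input, no manual escaping
        root.appendChild(password);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(new File("config.xml")));
        // config.xml now contains: <password>p&amp;ss&lt;w0rd</password>
    }
}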
