Converting a text file to ASCII in Python

I have tried many ways, but they all seem to either convert only one character or nothing at all.
Please help; any suggestions are appreciated.

Related

How to open a (CSV) file in Oracle and save it in UTF-8 format if it is in another format

Can anyone please advise me on the issue below?
I have an Oracle program which takes a .CSV file as input and processes it. We are now facing an issue: when an extended ASCII character appears in the input file, the letter after that special character is trimmed.
We are using the file utility function Utl_File.Fopen_Nchar() to open the file and Utl_File.Get_Line_Nchar() to read the characters in the file. The program is written so that it should handle multiple languages (Unicode characters) in the input file.
Our analysis found that when the character encoding of the CSV file is UTF-8, the file is processed successfully even when extended ASCII characters as well as Unicode characters are present. But sometimes we receive the file in Windows-1252 (ANSI Latin I) format, which causes the trimming problem for extended ASCII characters.
So is there any way to handle this issue? Can we open a (CSV) file in Oracle and save it in UTF-8 format if it's in another format?
Please let me know if any more info is needed.
Thanks in anticipation.
The problem is that if you don't know which encoding your CSV file is saved in, you cannot determine the right conversion either; you would risk corrupting your CSV file.
What do you mean by "1252 (ANSI - Latin I)"?
Windows-1252 and ISO-8859-1 are not equal; see the difference here: ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode
(Sorry for linking to the German Wikipedia, but the English version does not show such a nice comparison table.)
You could use the fix_latin command-line tool to convert a file from an unknown mixture of ASCII / Latin-1 / CP1252 / UTF-8 into UTF-8:
fix_latin < input.csv > output.csv
The fix_latin utility is a simple Perl script which is shipped with the Encoding::FixLatin module on CPAN.
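If installing a Perl module is not an option, the same one-way re-encoding can be scripted in a few lines. Here is a minimal Ruby sketch, assuming the input really is entirely Windows-1252 (fix_latin is smarter, since it copes with mixed encodings); the file names are placeholders:

# Re-encode a whole CSV from Windows-1252 to UTF-8.
# Bytes that are invalid in the source encoding raise an error
# instead of silently corrupting the output.
input = File.read('input.csv', encoding: 'Windows-1252')
File.write('output.csv', input.encode('UTF-8'))

Once the file is UTF-8, the existing Utl_File.Fopen_Nchar() path should handle it just like the files that already work today.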

Some Chinese and Japanese characters appear garbled in IE/FF

We have a content-rendering site that displays content in multiple languages. The site is built using JSP, and the content is fetched from an Oracle DB. All our pages are UTF-8 compliant.
When displaying zh/jp content, only some of the characters appear garbled (square boxes in IE and diamond question marks in FF). The data in the DB does not contain any garbled characters. Since we don't understand the languages, we don't know which characters are problematic. We would appreciate some pointers to a solution. Could it be that some characters appear invalid to the browsers?
Example in FF:
ネット犯���者 がアプ
脆弱性保護機能 - ネット犯���者 がアプリケーションのセキュリティホール (脆弱性) を突いて、パソコンに脅威を侵入させることを阻止します。
If in doubt about the sanity of a UTF-8 encoding, you can always re-encode it, either with a good text editor or with a specialized tool like iconv:
iconv -f UTF-8 -t UTF-8 yourfile > yourfile2
If your file is indeed invalid, iconv will also give you some information on the problem.
But another way you might want to explore is installing new fonts for Far East languages…
Indeed, without knowing the actual bytes used in your file, it is hard to say why they are replaced with the replacement character � (U+FFFD). You might therefore want to post a hex dump of the parts of your file that do not work.
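To illustrate that last suggestion, here is a small Ruby sketch (the file name is a placeholder) that finds the invalid UTF-8 sequences and prints the surrounding bytes as hex, which is exactly the kind of dump that makes the problem diagnosable:

# Hex-dump the context around every byte sequence that is not
# valid UTF-8 ('content.txt' stands in for the file in question).
data = File.binread('content.txt').force_encoding('UTF-8')

data.each_char.with_index do |char, i|
  next if char.valid_encoding?
  context = data[[i - 4, 0].max, 8]  # eight characters of context
  hex = context.bytes.map { |b| format('%02X', b) }.join(' ')
  puts "invalid sequence near character #{i}: #{hex}"
end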

Testing for extended characters in watir-webdriver

I need to check for text containing extended-character-set characters in my watir-webdriver scripts.
For example, checking that a link has the following text:
Weiß
I read the text from a CSV file, which when edited looks like the above text.
But when running the test in Firefox I get the following failure:
Wrong values on attribute table after add all save.
<"Wei\247"> expected but was
<"Wei\303\237>.
I tried saving it in the CSV as Wei\303\237, but the expected value then had double-backslash characters.
How can I encode this in the CSV so that I can check the text value safely across platforms and browsers?
I had this problem, and I got around it by writing the character in the spreadsheet as something like {S} and gsubbing it when I read the file into Ruby. If you gsub the text when you check the link too, then you basically have your own encoding method for special characters; a sketch follows below. This is a long way around, so I'd be very interested in other answers.
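For what it's worth, a minimal sketch of that placeholder scheme might look like this; the {S} token, the helper name, and the final comparison are all made up for illustration (Ruby 1.9+ string syntax):

# Hypothetical placeholder table: ASCII-safe tokens in the CSV map
# to the real characters at comparison time.
PLACEHOLDERS = { '{S}' => "\u00DF" }.freeze  # {S} stands in for ß

def expand_placeholders(text)
  PLACEHOLDERS.reduce(text) { |s, (token, char)| s.gsub(token, char) }
end

expected = expand_placeholders('Wei{S}')  # => "Weiß"
# browser.link(text: expected).exists?    # compare against the page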
The double backslash is probably there because, when your code reads from the CSV, it escapes the backslashes in the file to preserve the text. Therefore you can't put the Unicode escape in your CSV file directly. I don't really know a way around this. I hear that Ruby's Unicode support isn't that great, but it is being worked on as of 1.9.x.

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encodings and such... but what I want to know is: in any given script handling any given text file, is there some universal library or call I can make to turn non-standard characters into their nearest printable equivalents? I realize there's no "all-in-one" fix, but this is for an English (U.S. gov't) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is literally a hyphen as I've typed it out. In the file, though, it's something that looks like a hyphen (an en dash?), but when copying and pasting it (for example, into this browser text box), it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word processors. Converting them isn't too hard if you know which byte represents the character, but it gets to be a pain when a document contains several different ones, and worse still when other accented characters are thrown into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex because it knows how to convert accented characters to similar-looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Gray's article linked in that answer; it explains the problem and ways to fix it, ending up recommending Iconv too.
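To make that concrete, here is a small Iconv sketch along the same lines (Ruby 1.8 era; the sample bytes are invented, and the exact transliteration results depend on the platform's iconv):

require 'iconv'

# Transliterate Windows-1252 text down to plain ASCII. //TRANSLIT
# maps characters such as the en dash (0x96) to a lookalike ('-'),
# and //IGNORE silently drops anything with no ASCII counterpart.
converter = Iconv.new('US-ASCII//TRANSLIT//IGNORE', 'WINDOWS-1252')
puts converter.iconv("0-8\x9623")  # => "0-8-23" with GNU iconv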
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without more information I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything that is not a letter, number, or expected symbol).

GetPrivateProfileString Bug

I encrypted some text and put it in an INI file. Then I used GetPrivateProfileString() to retrieve the value, but some of the end characters are missing. I suspect a new-line character may be causing it to be incomplete. Writing to the INI file is OK, and when opening the INI file and looking at the sections and keys, everything is in order. It's just the retrieval that causes the bug.
Any help would be appreciated.
Thanks
Eddie
First off, when encrypting strings, make sure they are converted to Base64 before dumping them into the INI file.
Most likely, the encrypted string contains a character that is not handled very well by the INI-related APIs.
WritePrivateProfileStringW writes files in the active ANSI codepage by default; WritePrivateProfileStringA will always write ANSI.
To achieve the best results, follow the directions here and use GetPrivateProfileStringW when reading the data back.
It's more than likely that the encryption is injecting a NULL character into the stream you are writing. GetPrivateProfileString will only read a string up to the first NULL character it finds.
So I agree with Angry Hacker: convert to Base64 or some other friendly, human-readable encoding and you won't have any problems.
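A quick sketch of that Base64 round trip (Ruby for brevity; the ciphertext bytes are invented, and plain file I/O stands in for the Win32 profile APIs):

require 'base64'

# Invented "encrypted" bytes, including the NUL that truncates reads.
ciphertext = "\x00\x9A\xFFsecret".b

# Base64 turns arbitrary bytes into plain ASCII, so nothing in the
# stored value can cut a GetPrivateProfileString-style read short.
encoded = Base64.strict_encode64(ciphertext)
File.write('settings.ini', "[auth]\ntoken=#{encoded}\n")

# Reading back and decoding restores the original bytes exactly.
stored = File.read('settings.ini')[/token=(.*)/, 1]
Base64.strict_decode64(stored) == ciphertext  # => true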
