I assumed a dataset was ISO-8859-1 encoded, while it was actually encoded in UTF-8.
I wrote a Python script in which I decoded the data as ISO-8859-1 and wrote it into a Redshift SQL database.
The messed-up characters were written into the Redshift table as-is; no further decoding happened while writing into the table (I used Python and pandas with the wrong encoding).
Now the data source is not available anymore, but the data in the table contains a lot of messed-up characters.
E.g. 'Hello Günter' -> 'Hello GĂŒnter'
What is the best way to resolve this issue?
Right now I can only think of collecting a complete list of messed-up characters and their translations, but maybe there is a way I have not thought of.
So my questions:
First of all, I would like to know whether information was lost when the wrong decoding happened.
I would also like to know whether there is a way in Redshift to solve such a decoding issue. Finally, I have been searching for a complete list of these character mappings so I do not have to create it myself, but I could not find such a list.
Thank you
EDIT:
I pulled a part of the table and found out that I have to do the following:
"Ð\x97амÑ\x83ж вÑ\x8bÑ\x85оди".encode('iso-8859-1').decode('utf8')
The table has billions of rows; would it be possible to do that in Redshift?
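If the wrong decode really was plain ISO-8859-1 (which assigns a character to every byte value), no information was lost, and the round trip from the EDIT recovers the original text. Below is a minimal Python/pandas sketch of that repair; the DataFrame and column names are placeholders, and the sample value is a Latin-1-garbled form of the 'Günter' example.

import pandas as pd

def fix_mojibake(value):
    # Undo a wrong ISO-8859-1 decode of UTF-8 data: re-encode the garbled
    # string back to its original bytes and decode them as UTF-8. Values
    # that do not round-trip cleanly are returned unchanged.
    if not isinstance(value, str):
        return value
    try:
        return value.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value

# 'GÃ¼nter' is what 'Günter' looks like after UTF-8 bytes are mis-decoded as Latin-1.
df = pd.DataFrame({'name': ['Hello GÃ¼nter']})
df['name'] = df['name'].map(fix_mojibake)
print(df['name'].iloc[0])   # Hello Günter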
Related
I have a legacy system that uses a single EBCDIC-to-ASCII conversion table. The writer probably did not know that there are multiple code pages for ASCII and EBCDIC. There are extended and accented letters that are not converted properly, and I can fix those according to the code pages in use.
I'm asking if anybody knows a single place where I can look at as many code pages as possible to try and figure out which table was used for the conversion. Looking through multiple Wikipedia pages for each code page is too slow and possibly error-prone.
The ICU project has a wide variety of tables for converting various EBCDIC and ASCII versions into Unicode.
There are many tables. If you handle documents from various nations/regions, you have to use multiple tables.
Most European languages may be covered by http://www-01.ibm.com/software/globalization/cp/cp00500.html.
The code table may be found at ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00500.pdf
If you specify which languages your code needs to handle, a more suitable and specific answer may come.
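To narrow down the candidates before digging through the full ICU or IBM tables, a rough sketch like the following can help; it relies on the EBCDIC codecs that ship with Python's standard library (cp037, cp500, cp1140), and the sample bytes are placeholders standing in for data pulled from the legacy system.

# Decode the same legacy bytes with several EBCDIC code pages and eyeball
# which result looks right; the extended/accented letters are where the
# code pages differ, so pick sample bytes from a record that contains them.
CANDIDATE_CODE_PAGES = ["cp037", "cp500", "cp1140"]

def show_candidates(raw: bytes) -> None:
    for cp in CANDIDATE_CODE_PAGES:
        print(f"{cp:8s} -> {raw.decode(cp, errors='replace')!r}")

# Placeholder sample: a short run of bytes copied from the legacy system
# for a record whose correct text is known.
show_candidates(bytes([0xC3, 0x81, 0x99, 0x93]))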
I'm looking for a dataset with all the Mandarin pronunciations of Chinese characters, in bopomofo and/or pinyin. Also, I need open-source datasets that I can copy into my own code bases.
It sounds like you might be looking for the Unihan Database. The Unihan Database is maintained by the Unicode Consortium.
The Unihan database is the repository for the Unicode Consortium’s collective knowledge regarding the CJK Unified Ideographs contained in the Unicode Standard. It contains mapping data to allow conversion to and from other coded character sets and additional information to help implement support for the various languages which use the Han ideographic script.
As an example, here is the data for 爱.
Here is the description of the organization and content of the Unihan Database. Be sure to read that to understand what the data is referring to.
If this is the information you want, you can download the ZIP archive that contains all this data.
The Unihan Database doesn't have Bopomofo (Zhuyin) pronunciations, but it has Pinyin readings. Converting from Pinyin to Zhuyin is simple; there are a lot of online tools that can do it for you.
As for licensing issues, the Unihan Database data files have a liberal copyright notice. So, you shouldn't run into any problems using that data in your own software.
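As a sketch of how the downloaded data might be consumed, assuming the extracted ZIP archive contains Unihan_Readings.txt with tab-separated code point / field / value lines (the layout of recent Unihan releases), the kMandarin readings could be loaded like this:

# Build a {character: pinyin} mapping from the kMandarin field of
# Unihan_Readings.txt (file name and layout assumed as described above).
def load_mandarin_readings(path="Unihan_Readings.txt"):
    readings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            codepoint, field, value = line.rstrip("\n").split("\t", 2)
            if field == "kMandarin":
                readings[chr(int(codepoint[2:], 16))] = value  # "U+7231" -> 爱
    return readings

readings = load_mandarin_readings()
print(readings.get("爱"))   # expected: ài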
This is a bit of a late entry, but I was searching for the same thing last year and ended up compiling my own character/bopomofo database based on a bunch of different data sets. I have put enough work into this thing to thoroughly call it my own, though, so you should check it out! It's part of a Ruby gem I made to sort by bopomofo (I had a system that would not let me change the database collation settings): https://github.com/nallan/a-b-chi
The question mark "?" appears only in the front of the first field of the first row to insert.
For once, I changed the ftp upload file type to text/ascii (rather than binary) and it seemed resolve the problem. But later it came back.
The server OS is aix5.3.
DataStage is 7.5x2.
Oracle is 11g.
I used ue to save the file to utf-8, using unix end mark.
Has anyone got this thing before?
The question mark itself doesn't mean much, as it could be only a "mask" for some special character that is not recognized by the database. You didn't provide many details about your environment, so my opinions here are only a guess; I hope they shed a little light.
How is the text file created? If it's a file created in a Windows environment, you're very likely to get characters like this because of the {CR}{LF} line-break characters.
What is the datatype for the Oracle table?
The CHAR datatype will "fill" every position up to the size of the field; I'd recommend using VARCHAR instead in this case.
If that's not the case, I would edit the file in hex mode and check the ASCII code of this specific character, then use a Trim (if parallel) or Convert (if server) to replace the character.
The convert function would be something like this:
Convert(Char([ascii_char_number]),'',[your_string])
Alternatively you can use the Trim function if your job is a parallel job
Trim([your_string],[ascii_char_number],'L')
The option "L" will remove all leading characters. You might need to adapt this function to suit your needs. If you're not familiar with the TRIM function you can find more details at the datastage online documentation.
The only warning I'd give when doing this, is that you'll be deleting data from your original source of data, so make sure you're not deleting any valid information when manipulating a file like this as this is not a very recommended practice between the ETL gurus out there.
Any questions, give me a shout. Happy to help if I can.
Cheers
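As a quick alternative to a hex editor for the inspection step suggested above, a small Python sketch like this (the file path is a placeholder) prints the leading bytes of the load file so the offending character can be identified:

# Print the first bytes of the load file in hex so the stray leading
# character can be identified before deciding how to strip it.
def dump_leading_bytes(path, count=16):
    with open(path, "rb") as f:
        head = f.read(count)
    for offset, byte in enumerate(head):
        printable = chr(byte) if 32 <= byte < 127 else "."
        print(f"{offset:04d}  0x{byte:02X}  {printable}")

dump_leading_bytes("load_file.txt")   # placeholder path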
I had a similar issue where unprintable characters were being displayed as '?' and DataStage was throwing a warning when processing these records. It was OK for me not to display those unprintable characters, so I used the ICONV function, which converts them into printable ones. There are multiple options; I chose the one that converts them to '.', which worked for me. More details are available on the IBM pages below:
https://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.3.0/com.ibm.swg.im.iis.ds.parjob.dev.doc/topics/r_deeref_String_Functions.html
http://docs.intersystems.com/ens201317/csp/docbook/DocBook.UI.Page.cls?KEY=RVBS_foconv
The conversion I used:
ICONV(column_name,"MCP")
Objective: to have multi-language characters in the user ID in Enovia v6.
I am using UTF-8 encoding in a Tcl script, and it seems to save multi-language characters properly in the database (after some conversion). But in the UI I literally see the information as it was saved in the database.
When doing the same exercise through Power Web, the saved data somehow gets converted back into the proper multi-language characters and displays properly.
Am I missing something in the Tcl approach?
Here is one example to help explain.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use the syntax below:
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
you're taking a sequence of characters and asking for the sequence of bytes (represented in the result as one character per byte) that encodes those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
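The same bytes-versus-characters effect can be reproduced in any language. As a quick Python illustration (not part of the Tcl solution; Latin-1 is used here purely as an example of a wrong single-byte assumption) of why the stored UTF-8 bytes display as "KÃ¡tai-PÃ¡l" under the wrong encoding and come back intact under the right one:

# "Kátai-Pál" encoded as UTF-8 yields a byte sequence; decoding those bytes
# with a wrong single-byte encoding produces the garbled form seen in the
# question, while decoding them as UTF-8 restores the original name.
name = "Kátai-Pál"
raw = name.encode("utf-8")

print(raw.decode("iso-8859-1"))   # KÃ¡tai-PÃ¡l  (wrong assumption)
print(raw.decode("utf-8"))        # Kátai-Pál   (correct)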
I'm getting the error message: invalid byte sequence for encoding "UTF8": 0x9f.
OK, so now I know that somewhere my PHP app is trying to run a query containing that 0x9f byte.
But I have no idea WHERE.
I checked postgresql.conf, but I didn't find anything like "log_on_error". There's only the log_statement parameter, which makes Postgres log either all SELECTs or all queries.
But what I would like to see is this:
ERROR: invalid byte sequence for encoding "UTF8": 0x9f
QUERY: SELECT * FROM blabla WHERE field1='blabla0x9f'
In this case I would be able to see which query caused the error, so I'd know which PHP script to check.
Is this possible with Postgres? My PostgreSQL version is 8.3.9.
You are looking for log_min_error_statement for that.
But "invalid byte sequence" comes in the parser before the text is even parsed into a statement. So there is no way to log that without risking logging it in a weird encoding and making it either useless or dangerous.
But presumable your PHP application detects the error? If not, you are not checking enough return codes there ;)
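The application-side check would look roughly like this in Python with psycopg2, shown only as an analogue since the original app is PHP; the connection string, table, and parameter value are placeholders.

import logging
import psycopg2

# Placeholder connection and query; the point is just to catch the database
# error in application code and log the statement that triggered it, since
# the server cannot usefully log what it failed to parse.
conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

query = "SELECT * FROM blabla WHERE field1 = %s"
params = ("some value from user input",)   # placeholder

try:
    cur.execute(query, params)
except psycopg2.Error:
    logging.exception("Query failed: %s with params %r", query, params)
    conn.rollback()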
Just to be clear, because I don't think Magnus is getting through (though his answer is pretty good): if you read the file as UTF-8 but it is encoded in Latin-1, it doesn't magically get converted to UTF-8. You can only work with the file in the encoding the file is actually in, whether you are submitting it to the DB or re-encoding it into a different encoding. You have three options (that I might use):
1. Convert the file using PHP. Read the docs on recode for more information about this.
2. Change the client_encoding on PostgreSQL, using set client_encoding = encoding. You can find the valid encodings in the docs.
3. Send it to PostgreSQL to be converted; read the docs on PostgreSQL's convert().
Of course, the most correct way is the first one.
Information about the PHP recode function is in the PHP documentation.
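As a rough illustration of option 1 outside PHP, here is a minimal Python sketch (file names are placeholders) that re-reads a Latin-1 file and writes it out as UTF-8 before anything is sent to the database:

# Read the file in the encoding it is actually in (Latin-1 here) and write
# it back out as UTF-8, so every byte sent to PostgreSQL is valid UTF-8.
# This mirrors the idea of converting the file with PHP's recode.
with open("input_latin1.txt", encoding="iso-8859-1") as src, \
     open("output_utf8.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)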