I'm getting the error message: Invalid byte sequence for encoding "UTF8": 0x9f
Ok, now I know that somewhere my PHP app is trying to run a query containing that 0x9f byte.
But I have no idea WHERE.
I checked postgresql.conf but I didn't find anything like "log_on_error". There's only the log_statement parameter, which makes Postgres log either all queries or only certain kinds of statements.
But what I would like to see is this:
ERROR: "Invalid byte sequence for encoding "UTF8": 0x9f
QUERY: SELECT * FROM blabla WHERE field1='blabla0x9f'
In this case I would be able to see which query caused the error, so I'd know which PHP script to check.
Is this possible with Postgres? My PostgreSQL version is 8.3.9.
You are looking for log_min_error_statement for that.
But "invalid byte sequence" comes in the parser before the text is even parsed into a statement. So there is no way to log that without risking logging it in a weird encoding and making it either useless or dangerous.
But presumably your PHP application detects the error? If not, you are not checking enough return codes there ;)
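For reference, here is roughly what the relevant postgresql.conf entry looks like (a sketch only; the comment is mine, and as noted above it will not help for errors raised before the statement text can even be decoded):
# postgresql.conf
log_min_error_statement = error   # log the text of any statement that causes an error or worse
After a configuration reload, failing statements then show up in the server log next to their ERROR line.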
Just to be clear, because I don't think Magnus is getting through (though his answer is pretty good): if you read the file as UTF-8 but it is actually encoded in Latin-1, it doesn't magically get converted to UTF-8. You can only work with the file in the encoding it actually has, whether you are submitting it to the db or re-encoding it into a different encoding. You have three options (that I might use); of course, the most correct way is the first one. A rough PHP sketch of all three follows below:
Convert the file using PHP. Read the docs on recode for more information about this.
Change the client_encoding on PostgreSQL, using set client_encoding = encoding. You can find the valid encodings in the docs.
Send it to PostgreSQL to be converted; read the docs on PostgreSQL's convert().
Information about the PHP function recode
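For illustration, here is a rough PHP sketch of what those three options might look like in practice. It is only a sketch: the connection handle, the input file name and the use of iconv() are my assumptions, and the table/column names are borrowed from the question above.
<?php
// $conn is an assumed pg_connect() handle; data_latin1.txt is an assumed Latin-1 input file.
$raw = file_get_contents('data_latin1.txt');

// Option 1: convert in PHP before building the query
// (iconv() shown here; the recode extension's recode_string() does the same job).
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
pg_query($conn, "INSERT INTO blabla (field1) VALUES ('" . pg_escape_string($utf8) . "')");

// Option 2: keep the data as Latin-1 and tell PostgreSQL so;
// the server then converts it to the database encoding on the way in.
pg_query($conn, "SET client_encoding TO 'LATIN1'");
pg_query($conn, "INSERT INTO blabla (field1) VALUES ('" . pg_escape_string($raw) . "')");

// Option 3 would be PostgreSQL's own conversion functions (convert() / convert_from()),
// which re-encode the bytes on the server side instead.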
I am trying to use wget -m <address> to download the contents of an FTP server. A lot of the content is Icelandic and so contains a bunch of unusual characters that I think are causing issues, as I keep seeing:
Incomplete or invalid multibyte sequence encountered
I have tried adding flags such as --restrict-file-names=nocontrol but to no avail.
I have also tried using lftp, but it doesn't seem to make any difference.
According to the wget manual:
If you specify ‘nocontrol’, then the escaping of the control
characters is also switched off.
So 'nocontrol' is actually more permissive than the default. The garbled characters suggest you are having trouble getting the encoding right, so the 'ascii' mode seems to be the best fit for your use case; an example command follows the quoted excerpt:
The ‘ascii’ mode is used to specify that any bytes whose values are
outside the range of ASCII characters (that is, greater than 127)
shall be escaped. This can be useful when saving filenames whose
encoding does not match the one used locally.
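So the command would look something like this (the URL is just a placeholder):
wget -m --restrict-file-names=ascii ftp://example.com/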
As I have no way to test this, please try it and report back on the result.
I assumed a dataset was ISO-8859-1 encoded, while it was actually encoded in UTF-8.
I wrote a Python script where I decoded the data as ISO-8859-1 and wrote it into a Redshift SQL database.
So I wrote the messed-up characters into the Redshift table; no re-encoding happened while writing into the table (I used Python and pandas with the wrong encoding).
Now the data source is not available anymore, but the data in the table has a lot of messed-up characters.
E.g. 'Hello Günter' -> 'Hello GĂŒnter'
What is the best way to resolve this issue?
Right now I can only think of collecting a complete list of messed-up characters and their translations, but maybe there is a way I have not thought of.
So my questions:
First of all, I would like to know whether information was lost when the wrong decoding happened.
Also, I would like to know whether there might be a way in Redshift to solve such a decoding issue. Finally, I have been searching for a complete list so I do not have to create it myself, but I could not find such a list.
Thank you
EDIT:
I pulled a part of the table and found out I have to do the following:
"Ð\x97амÑ\x83ж вÑ\x8bÑ\x85оди".encode('iso-8859-1').decode('utf8')
The table has billions of rows; would it be possible to do that in Redshift?
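For what it's worth, the round-trip from the EDIT can be applied to a whole column with pandas before loading the corrected data back; this is only a sketch, and the DataFrame and column name below are made up rather than taken from your real schema:
import pandas as pd

def fix_mojibake(value):
    # Reverse the original mistake: recover the raw UTF-8 bytes via Latin-1,
    # then decode them properly. Values that don't survive the round-trip are left alone.
    try:
        return value.encode('iso-8859-1').decode('utf-8')
    except (UnicodeError, AttributeError):
        return value

df = pd.DataFrame({'name': ['Hello GÃ¼nter']})  # hypothetical Latin-1-style mojibake
df['name'] = df['name'].map(fix_mojibake)
print(df['name'][0])  # -> Hello Günter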
I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows), for example, also automatically detects the wrong Windows-1252 encoding when opening the file for the first time, showing garbled text where the special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything looks fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection of applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script to force-convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
<?php
require 'vendor/autoload.php'; // assumes neitanod/forceutf8 is installed via Composer
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression had an unexpected side effect: all the special characters were mangled, as if converted to another encoding. A quick solution is to use the u flag on the regex (see the regex pattern modifiers reference), which makes this preg_replace treat its input and output as UTF-8. See also this answer.
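So the quick fix was essentially just adding the u modifier to the line above:
$value = preg_replace('/([^\pL0-9 -])+/u', '', $value);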
I'm trying to create a piece of code that will download a page from the internet and do some manipulation on it. The page is encoded in iso-8859-1.
I can't find a way to handle this file. I need to search through the file in Hebrew and return the changed file to the user.
I tried to use string.encode, but I still get the wrong encoding.
When printing the response encoding, I get "encoding":{}, as if it were undefined, and this is an example of what it returns:
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd-\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd
It should be Hebrew letters.
When I try final.body.encode('iso-8859-8-i'), I get the error: code converter not found (ASCII-8BIT to iso-8859-8-i).
When you have input where Ruby or the OS has incorrectly assigned an encoding, conversions will not work. That's because Ruby starts from the wrong assumption and tries to preserve the wrong characters when converting.
However, if you know from some other source what the correct encoding is, you can use the force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note this alters the object in place.
E.g.
contents = final.body
contents.force_encoding( 'ISO-8859-8' )
puts contents
At this point (provided it works), you now can make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.
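For example, the follow-up conversion might look like this (still assuming ISO-8859-8 is the right source encoding for the page):
utf8_contents = contents.encode('UTF-8')  # conversion works now that the source encoding is known
puts utf8_contents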
I could not find 'ISO-8859-8-I' on my version of Ruby. I am not sure yet how close 'ISO-8859-8' is to what you need (some Googling suggests that it may be OK for you, if the ...-I encoding is not available).
I am having difficulty outputting data in UTF-8 format. I have a test case set up where the data I am reading from an input file contains a British pound symbol (hex C2 A3). When I write it out on Linux, I get valid UTF-8 (C2 A3). On Windows, I only get hex A3.
I tried using a PrintStream and specifying the character set as "UTF-8". No luck. I tried many other streams with no luck until I finally tried a DataOutputStream. I used the "write()" method which took a byte array as a parameter. I needed to output a string, so I called "myString.getBytes("UTF-8")".
I ended up with code like:
dataOutputStream.write(myString.getBytes("UTF-8"));
This works properly on both systems: Windows 7 and Linux.
I am trying to understand why this worked, and to convince myself my solution is correct. Does it come down to the system locales? Linux defaults to en_US.utf-8, while all I could specify in Windows was just "en_US". So when the output stream attempted to get data from the string, was the string sending its data based upon the locale?
Or are you using a FileOutputStream, where the character encoding matters, or a DataOutputStream, where you write binary? You should do some research too, but please have a look here.
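To illustrate the point about where the encoding gets decided, here is a hedged Java sketch (the file names and sample string are made up): a PrintStream created without an explicit charset uses the platform default charset, which on a typical Windows setup writes the pound sign as the single byte A3, while encoding the string yourself, or using a writer with an explicit charset, pins the output to UTF-8 regardless of locale.
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8OutputDemo {
    public static void main(String[] args) throws IOException {
        String myString = "\u00A3"; // British pound sign

        // Platform-dependent: no charset given, so the default (locale-driven) charset is used.
        // On a windows-1252 system this writes the single byte A3.
        try (PrintStream ps = new PrintStream("default.txt")) {
            ps.print(myString);
        }

        // Explicit: encode the string yourself, as in the question's working solution (writes C2 A3).
        try (OutputStream out = new FileOutputStream("bytes.txt")) {
            out.write(myString.getBytes(StandardCharsets.UTF_8));
        }

        // Also explicit: let a Writer do the encoding with a fixed charset.
        try (Writer w = new OutputStreamWriter(new FileOutputStream("writer.txt"),
                                               StandardCharsets.UTF_8)) {
            w.write(myString);
        }
    }
}
In other words, getBytes("UTF-8") worked because it takes the default charset (and with it the locale) out of the picture entirely.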