Fix a filename that was changed from UTF-8 to ASCII

I recently downloaded a pack of videos that should have Japanese characters as their file names. Instead, whoever uploaded them botched the encoding.
Instead of kana, hiragana, and kanji I get:
002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4
I was wondering if there was a way to fix this short of asking for another upload?
I tried to put the names into a text file and then hex edit that file to change its encoding, but that didn't work.

I would use the chardet library for Python as an aid to guess at the encoding.
>>> import chardet
>>> s='002òÅü¢âyâbâeâBâôâO(âuâïâ}).mp4'
>>> chardet.detect(s.encode('l1'))
{'encoding': 'ISO-8859-5', 'confidence': 0.536359806931924, 'language': 'Russian'}
>>> chardet.detect(s.encode('cp437'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
>>> chardet.detect(s.encode('cp850'))
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
Probably not ISO-8859-1, more likely IBM 437 or 850.
>>> s.encode('cp850').decode('sjis')
'002撫⊃ペッティング(ブルマ).mp4'
>>> s.encode('cp437').decode('sjis')
'002撫○ペッティング(ブルマ).mp4'
Could be either one of these, but I can't read them.
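If one of those round-trips is right, the same trick can be applied to the filenames themselves. A minimal sketch, assuming the files all sit in one folder and that cp850 → Shift-JIS is the correct pair (adjust both the path and the codecs to your case):
import os

folder = r'C:\videos'  # assumed location of the downloaded files

for name in os.listdir(folder):
    try:
        # undo the wrong decoding (cp850), then decode as the original Shift-JIS
        fixed = name.encode('cp850').decode('sjis')
    except (UnicodeEncodeError, UnicodeDecodeError):
        continue  # skip names that don't round-trip cleanly
    if fixed != name:
        os.rename(os.path.join(folder, name), os.path.join(folder, fixed))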

Related

Text editor keeps using a wrong file encoding and replaces certain characters with other codes

When, on Windows 10, I open a certain file in Visual Studio Code and then edit and save it, VS Code seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VS Code is UTF-8.
Non-corrupted string before saving the file: “Diff Clang Compiler Log Files”
Corrupted string after saving the file:
�Diff Clang Compiler Log Files�
So, for example, the double quotation mark character “, which in the original file is represented by the byte string 0xE2 0x80 0x9C, will upon saving be converted into 0xEF 0xBF 0xBD. I do not fully understand what the root cause is, but I have the following assumption:
The original file is saved using the Windows-1252 encoding (I am using a Windows 10 machine with a German keyboard)
VS Code incorrectly interprets the file as UTF-8
Character codes get converted from Windows-1252 into UTF-8 once the file is saved, thus 0xE2 0x80 0x9C becomes 0xEF 0xBF 0xBD.
Is my understanding correct?
Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would really be glad to find a way to avoid corrupting my files in the future :-).
Thank you!
The encoding of the file can be found with the help of the Python magic module:
import magic

FILE_PATH = 'C:\\myPath'

def getFileEncoding(filePath):
    # read the raw bytes and let libmagic guess the text encoding
    with open(filePath, 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    return m.from_buffer(blob)

fileEncoding = getFileEncoding(FILE_PATH)
print(f"File Encoding: {fileEncoding}")

Understanding encoding and decoding in Python

I'm trying to understand how encoding works in Python 2.7, and I can't quite grasp some aspects of it. I've worked with files with different encodings, and so far I was doing okay, until I started to work with a certain API that requires Unicode strings
u'text'
and I was using Normal strings
'text'
which caused a lot of problems.
So I want to know how to go from a Unicode string to a normal string and back, because the data I'm working with is handled as normal strings, and so far the Python shell is the only place where I can get the Unicode ones without issues.
What I've tried is:
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
Now, to get a Unicode string, what I do is:
>>> foobar = unicode(foo, "latin1")
>>> foobar
u'gur\xa3'
But this doesn't work for me, since I'm doing some comparisons in my code like this:
>>> foobar in u"Foo gurú Bar"
False
which fails, even though the original value is the same, because of the encoding.
[Edit]
I'm using Python Shell on Windows 10.
The Windows terminal uses legacy DOS code pages. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use Windows code pages. Python's IDLE will show the Windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary!
So if you want to go from a normal string to Unicode and back, first you have to find out the encoding of your system, which is what Python 2.x uses for normal strings, and then use it to make the proper conversion.
I leave you with an example:
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
>>> foo = "gurú"
>>> bar = u"gurú"
>>> foo
'gur\xa3'
>>> bar
u'gur\xfa'
>>>
>>> foobar = unicode(foo, 'cp850')
>>> foobar
u'gur\xfa'
>>>
>>> foobar in u"Foo gurú Bar"
True
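Going back the other way is just the reverse call with the same codepage (cp850 here is whatever sys.stdout.encoding reported above; yours may differ):
>>> foobar.encode('cp850')
'gur\xa3'
>>> foobar.encode('cp850') == foo
True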

Printing a CP850 encoded string with Ruby (IRB)

I want to open a text file (test.txt) that contains Arabic text (its encoding is CP850), then print its content to STDOUT:
# coding : CP850
STDOUT.set_encoding(Encoding::CP850); # not sure if it's necessary
open('G:/test.txt',?r){|f|
f.read.each_char{|c| print c};
# or puts f.read;
}
gets
but it does not print the Arabic characters; the output is some symbols and random characters.
Using Ruby 2.2.3
Change the encoding of the file to utf-8.
I don't know how this is accomplished in Ruby, but in Python 3 (e.g. under newer Django versions) it's:
open('filename.txt', 'w', encoding='utf-8')
If you're using Python 2 it will be slightly more difficult. If so, it's worth upgrading to 3, just because of its native Unicode support, which makes doing anything with Arabic a lot easier.
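For what it's worth, here is a sketch of the whole conversion in Python 3 (the filenames are placeholders; this does not answer the Ruby side of the question):
# read the CP850-encoded file and write it back out as UTF-8
with open('test.txt', encoding='cp850') as src:
    text = src.read()
with open('test_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)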

Replacing "\xe9" character from a unicode string in Python 3

Using SublimeText 2.0.2 with Python 3.4.2, I get a webpage with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of the page tells me it's encoded in ISO-8859-1
(Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 and then encode it in utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters with their corresponding letters ("é"...)?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. That seems to be my problem.
Edit 2
It is working fine in IDLE (Python 3.4.2) and in the OS X terminal (Python 2.5) but doesn't work in SublimeText 2.0.2 (with Python 3.4.2)... => That seems to be a problem with the SublimeText console (output window) and not with my code.
I'm going to look at the PYTHONIOENCODING environment variable as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
Done. Thanks everyone ;-)
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ascii encodings are printed as such and other bytes are printed with hex escapes. b'\xe9' is the hex escape encoding of the single-byte ISO-8859-1 encoding of é and b'\xc3\xa9' is the hex escape encoding of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
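If you would rather not hard-code ISO-8859-1, you can usually pick the charset up from the response headers instead; a sketch (req is the request object from the question, and latin-1 is an assumed fallback):
import urllib.request

response = urllib.request.urlopen(req)
# use the charset declared in the Content-Type header, if any
charset = response.headers.get_content_charset() or 'ISO-8859-1'
pageuni = response.read().decode(charset)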
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')! I should mention I'm running my code within SublimeText. That seems to be my problem. Any solution?
Don't encode manually; print Unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C (ascii)).
For Windows
If the content can be represented using the console codepage then set PYTHONIOENCODING=your_console_cp envvar e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if it is indeed the encoding that your console uses, run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
You don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly, unless the output is redirected.
Otherwise (to support characters that can't be represented in the console encoding), install win_unicode_console package and either run your script using python3 -mrun your_script.py or put at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses Win32 API such as WriteConsoleW() to print to the console. You still need to configure correct fonts to see arbitrary Unicode text in the console.

ruby `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

Hannibal episodes in tvdb have weird characters in them.
For example:
Œuf
So ruby spits out:
./manifesto.rb:19:in `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from ./manifesto.rb:19:in `to_json'
from ./manifesto.rb:19:in `<main>'
Line 19 is:
puts @tree.to_json
Is there a way to deal with these non-UTF characters? I'd rather not replace them, but convert them? Or ignore them? I don't know; any help appreciated.
The weird part is that the script works fine via cron; running it manually produces the error.
File.open(yml_file, 'w') should be changed to File.open(yml_file, 'wb')
It seems you should use another encoding for the object. You should set the proper codepage for the variable @tree, for instance using ISO-8859-1 instead of ASCII-8BIT, by calling @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is used just for binary files.
To find the current external encoding for ruby, issue:
Encoding.default_external
If sudo solves the problem, the problem was in the default codepage (encoding), so to resolve it you have to set the proper default codepage (encoding) by either of the following:
In Ruby, to change the encoding to UTF-8 or another proper one, do as follows:
Encoding.default_external = Encoding::UTF_8
In bash, grep the current valid setup:
$ sudo env|grep UTF-8
LC_ALL=ru_RU.UTF-8
LANG=ru_RU.UTF-8
Then set them in .bashrc properly in a similar way (not necessarily with the ru_RU locale), such as the following:
export LC_ALL=ru_RU.UTF-8
export LANG=ru_RU.UTF-8
I had the same problems when saving to the database. I'll offer one thing that I use (perhaps, this will help someone).
If you know that sometimes your text has strange characters, then before saving you can encode your text in some other format, and then decode the text again after it is returned from the database.
example:
string = "Œuf"
Before saving, we encode the string:
text_to_save = CGI.escape(string)
(the character "Œ" is encoded as "%C5%92" and the other characters remain the same)
=> "%C5%92uf"
Load from the database and decode:
CGI.unescape("%C5%92uf")
=> "Œuf"
I just suffered through a number of hours trying to fix a similar problem. I'd checked my locales, database encoding, everything I could think of and was still getting ASCII-8BIT encoded data from the database.
Well, it turns out that if you store text in a binary field, it will automatically be returned as ASCII-8BIT encoded text, which makes sense; however, this can (obviously) cause problems in your application.
It can be fixed by changing the column type back to :text in your migrations.
