I'm having a problem trying to read, on Windows, a CSV file that was generated on a Mac.
My question is: how can I convert its encoding to UTF-8, or even ISO-8859-1?
I've already tried iconv with no success.
Inside vim I can see that line breaks in this file are marked with ^M, and that the accented ã shows up as <8b>, Ç as <82>, and so on.
Any ideas?
To convert from encoding a to encoding b, you need to know what encoding a is.
Given that
ã is marked with <8b>, Ç = <82>
encoding a is very likely Mac OS Roman.
So call iconv with macintosh* as the from argument and utf-8 as the to argument.
*Try macroman, x-mac-roman, etc. if macintosh is not found.
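For example: iconv -f macintosh -t utf-8 file.csv > file-utf8.csv. The same conversion as a minimal Python sketch (the file names are placeholders; Python accepts macintosh as an alias for its mac_roman codec):

# Sketch: re-encode a Mac OS Roman CSV as UTF-8.
# "input.csv" and "output.csv" are placeholder names.
with open("input.csv", "r", encoding="mac_roman", newline="") as src:
    text = src.read()

# Classic Mac files use a bare CR (the ^M seen in vim) as the line break;
# normalize CR and CRLF to LF so other tools see proper lines.
text = text.replace("\r\n", "\n").replace("\r", "\n")

with open("output.csv", "w", encoding="utf-8", newline="\n") as dst:
    dst.write(text)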
When I open a certain file in Visual Studio Code on Windows 10 and then edit and save it, VSC seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VSC is UTF-8.
Non-corrupted string before saving the file: “Diff Clang Compiler Log Files”
Corrupted string after saving the file:
�Diff Clang Compiler Log Files�
So, for example, the left double quotation mark character (“), which in the original file is represented by the byte sequence 0xE2 0x80 0x9C, will be converted into 0xEF 0xBF 0xBD when the file is saved. I do not fully understand what the root cause is, but I do have the following assumption:
The original file is saved using the Windows-1252 Encoding (I am using Win 10 machine, German keyboard)
VSC wrongly interprets the file as UTF-8 encoded
Character codes get converted from Windows-1252 into UTF-8 once the file is saved; thus 0xE2 0x80 0x9C becomes 0xEF 0xBF 0xBD.
Is my understanding correct?
Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would really be glad to find a way to avoid corrupting my files in the future :-).
Thank you!
The encoding of the file can be detected with the help of the Python magic module:
import magic

FILE_PATH = 'C:\\myPath'

def getFileEncoding(filePath):
    # Read the raw bytes and let libmagic guess the MIME encoding.
    with open(filePath, 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    return m.from_buffer(blob)

fileEncoding = getFileEncoding(FILE_PATH)
print(f"File Encoding: {fileEncoding}")
I have a file that I'm pretty sure is in a weird encoding. I've successfully converted similar files to UTF-8 before by assuming they were encoded in windows-1255 and using iconv (iconv -f windows-1255 -t utf-8 $file), and this worked.
My current file contains a ß character that is throwing me off: iconv breaks when it hits it (with an "illegal input sequence" error). Is there a different encoding I should be using?
WINDOWS-1255 (= Hebrew) does not have an eszett (ß), so iconv behaves correctly. Other legacy codepages that do have that character, at code point 0xDF:
WINDOWS-1250 = Latin 2 / Central European
WINDOWS-1252 = Latin 1 / Western European
WINDOWS-1254 = Turkish
WINDOWS-1257 = Baltic
WINDOWS-1258 = Vietnamese
Only the document owner knows which codepage is the correct one, if it's one of the WINDOWS-125x at all.
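One way to narrow it down is to decode the raw bytes with each candidate codepage and judge the results by eye. A Python sketch (the file name is a placeholder):

# Sketch: decode the file under each candidate codepage and compare.
# "mystery.txt" stands in for the real file.
raw = open("mystery.txt", "rb").read()
for codec in ("cp1250", "cp1252", "cp1254", "cp1257", "cp1258"):
    try:
        print(codec, "->", raw.decode(codec)[:60])
    except UnicodeDecodeError as err:
        print(codec, "failed:", err)

Since these single-byte codepages assign a character to nearly every byte, most candidates will decode without error; you still have to judge which output makes sense in the document's language.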
I'm trying to write a batch script using the Windows command line to convert some characters, for example:
É to Й
Ö to Ц
Ó to У
Ê to К
Å to Е
Í to Н
à to Г
Ø to Ш
Ù to Щ
Ç to З
with no success. That's because I am using a program that does not support Cyrillic fonts.
And I already have the file with these words, like:
ОБОГРЕВ ЗОНЫ 1
ДАВЛЕНИЕ ЦВЕТА 1
...
and so on...
Is it possible?
I'm guessing that you'd like to convert the character set (alias code page) of a file so you can open and read it.
I'm assuming you are using a Windows computer.
Let's say that your file is russian.txt, and when you open it with Notepad the characters don't make any sense. The russian.txt file's character encoding is most probably ANSI, and its code page is Windows-1251.
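That also fits the mapping in your question: every pair (É to Й, Ç to З, and so on) is exactly what you get when Windows-1251 bytes are displayed as Windows-1252. A Python sketch of the round trip (the garbled string is an assumed example, chosen so that it decodes to the first line quoted in the question):

# Sketch: the Latin gibberish is Windows-1251 bytes read as Windows-1252,
# so encoding it back to bytes and decoding as cp1251 recovers the Cyrillic.
garbled = "ÎÁÎÃÐÅÂ ÇÎÍÛ 1"
fixed = garbled.encode("cp1252").decode("cp1251")
print(fixed)  # -> ОБОГРЕВ ЗОНЫ 1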
Some words about character encoding:
In ANSI one character is one byte long.
Different languages have different code pages: Windows-1251 = Russian, Windows-1252 = Western Languages (English, German, Swedish...), Windows-1253 = Greek ...
In UTF-8, ASCII characters are one byte long and all other characters are two to four bytes long.
In what Windows calls "Unicode" (i.e. UTF-16), most characters are two bytes long.
UTF-8 and UTF-16 don't need code pages.
You can check the encoding by opening the file in Notepad and clicking File, Save As. At the bottom right corner, beside the Save button, you can see the encoding.
With some googling I found a site where you can do the character encoding conversion online. I haven't tested it, but here's the address:
http://i-tools.org/charset
I've made a script (= a small program) which changes the character encoding from any ANSI-and-code-page combination to UTF-8 or Unicode, or vice versa.
Let's say you have an English Windows computer and want to convert russian.txt (ANSI / Windows-1251) to UTF-8.
Here's how:
Open this web page and copy the script on it to the clipboard:
VB6/VBScript change file encoding to ansi
Create a new file named ConvertCharset.vbs in the same folder as russian.txt, say C:\Temp.
Open ConvertCharset.vbs in Notepad (right click + Edit) and paste the script.
Open CMD (Windows key+R, type cmd, hit Enter).
In the CMD window, type (hitting Enter at the end of each line):
cd C:\Temp\
cscript ConvertCharset.vbs /InputCharset:Windows-1251 /OutputCharset:utf-8 /InputFile:russian.txt /OutputFile:russian_utf-8.txt
Now you can open russian_utf-8.txt in Notepad and you'll see the Russian characters displayed correctly.
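If you'd rather skip the VBScript, the same conversion is a few lines of Python (a sketch using the paths from the example above):

# Sketch: the same Windows-1251 -> UTF-8 conversion in Python.
with open(r"C:\Temp\russian.txt", "r", encoding="cp1251") as src:
    text = src.read()
with open(r"C:\Temp\russian_utf-8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)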
More info:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Windows-1251
http://en.wikipedia.org/wiki/UTF-8
VB6/VBScript change file encoding to ansi
I have a question about converting UTF-8 to CP1252 on Ubuntu with PHP or shell.
Background: convert a CSV file from UTF-8 to CP1252 on Ubuntu with PHP or shell, copy the file from Ubuntu to Windows, open the file with Notepad++.
Environment:
Ubuntu 10.04
PHP 5.3
a CSV file with letters (œ, à, ç)
Methods used:
With PHP
iconv("UTF-8", "CP1252", "content of file")
or
mb_convert_encoding("content of file", "CP1252", "UTF-8")
If I check the generated file with
file -i name_of_the_file
It displayed:
name_of_the_file: text/plain; charset=iso-8859-1
I copied this converted file to Windows and opened it with Notepad++; at the bottom right we can see that the encoding is ANSI.
And when I changed the encoding from ANSI to Windows-1252, the special characters were displayed correctly.
With Shell
iconv -f UTF-8 -t CP1252 name_of_the_file
The rest is the same.
Questions:
1. Why does the file command not report CP1252 or ANSI directly, but ISO-8859-1?
2. Why were the special characters displayed correctly when I changed the encoding from ANSI to Windows-1252?
Thank you in advance!
1.
CP1252 and ISO-8859-1 are very similar; quite often a file encoded in one of them looks identical to the same file encoded in the other. See Wikipedia for which characters are in Windows-1252 but not in ISO-8859-1.
The letters à and ç are encoded identically in both encodings. And while ISO-8859-1 doesn't have an œ and CP1252 does, file might have missed it; AFAIK it doesn't analyse the entire file.
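You can see the difference with a quick Python check (a sketch; in CP1252 œ sits at byte 0x9C, a position ISO-8859-1 reserves for a control character):

# Sketch: œ exists in CP1252 but not in ISO-8859-1.
print("œ".encode("cp1252"))                              # b'\x9c'
print("à".encode("cp1252") == "à".encode("iso-8859-1"))  # True: same byte
"œ".encode("iso-8859-1")                                 # raises UnicodeEncodeError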
2.
"ANSI" is a misnomer used for the default non-Unicode encoding in Windows. In case of Western European languages, ANSI means Windows-1252. In case of Central European, it's Windows-1250, in case of Russian it's Windows-1251, and so on. Nothing apart from Windows uses the term "ANSI" to refer to an encoding.
In the ruby file:
p __ENCODING__
#<Encoding:US-ASCII>
In vim:
set encoding?
encoding=utf-8
This is causing me grief (http://stackoverflow.com/questions/14495486/ruby-syntax-error-with-multiple-language-in-hash), which is patched, but I still don't understand why the file shows as ASCII in Ruby and UTF-8 in vim.
As @melpomene commented, :set encoding tells you what encoding is used internally by Vim.
:set fileencoding will tell you what encoding Vim decided to use for your document. The possible values are given by the fileencodings option. ASCII is not part of the default list as it's usually handled transparently by the other encodings listed.
But that part of your question is puzzling me:
but I still don't understand why the file is ASCII
because it looks like you actively want that file to be treated as ASCII by the interpreter.
Anyway, that encoding directive is only used by Ruby: it doesn't mean that the file is actually encoded as ASCII or that Vim is supposed to care about it and treat it in a special way.
In short, whether your file is actually encoded in ASCII or not, Vim doesn't care.
So… what do you want exactly? That vim sets its fileencoding option to ASCII when you open a supposedly ASCII file? That your supposedly ASCII file be converted to another encoding?
edit
With that directive, you explicitly tell Ruby that the file's content must be treated as ASCII, and Ruby says "OK, that's ASCII, if you say so."
This directive doesn't change anything about the actual encoding of the file. It could be UTF-8, Latin-1 or whatever.
Vim doesn't understand that directive.
Vim chooses the encoding it uses for that file according to a number of rules you should read about in :h encoding, :h fileencoding and :h fileencodings.
Vim doesn't treat ASCII in a special "ASCII" way; it just handles it as the subset of UTF-8 that it is.
So, before we go further, please verify:
the encoding of the file with something like $ file /path/to/file
the fileencoding Vim uses for that file with :set fileencoding
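If you want an independent check that the file really is pure ASCII, here is a small Python sketch (the path is a placeholder):

# Sketch: report whether a file is pure ASCII (and therefore also valid UTF-8).
raw = open("/path/to/file", "rb").read()
try:
    raw.decode("ascii")
    print("pure ASCII")
except UnicodeDecodeError as err:
    print(f"not ASCII: byte {raw[err.start]:#04x} at offset {err.start}")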