I'm trying to write a batch script using the Windows command line to convert some characters, for example:
É to Й
Ö to Ц
Ó to У
Ê to К
Å to Е
Í to Н
à to Г
Ø to Ш
Ù to Щ
Ç to З
so far with no success. I need this because I am using a program that does not support a Cyrillic font.
I already have the file with these words, like:
ОБОГРЕВ ЗОНЫ 1
ДАВЛЕНИЕ ЦВЕТА 1
...
and so on...
Is it possible?
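In Python terms, what I'm after would look something like this minimal sketch (file names are placeholders):
# Hypothetical helper: substitute each listed Latin character
# with its Cyrillic counterpart, character for character.
table = str.maketrans('ÉÖÓÊÅÍàØÙÇ', 'ЙЦУКЕНГШЩЗ')

with open('input.txt', encoding='cp1252') as src:    # placeholder names
    text = src.read()
with open('output.txt', 'w', encoding='utf-8') as dst:
    dst.write(text.translate(table))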
I'm guessing that you'd like to convert the character set (also known as the code page) of a file so you can open and read it.
I'm assuming you are using a Windows computer.
Let's say that your file is russian.txt, and when you open it in Notepad the characters don't make any sense. The file's character encoding is most probably ANSI, and its code page is Windows-1251.
Some words about character encoding:
In ANSI one character is one byte long.
Different languages have different code pages: Windows-1251 = Russian, Windows-1252 = Western Languages (English, German, Swedish...), Windows-1253 = Greek ...
In UTF-8, ASCII characters are one byte long and other characters two to four bytes long.
In UTF-16 (what Notepad calls "Unicode"), most characters are two bytes long.
UTF-8 and UTF-16 don't need code pages.
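If you have Python at hand, a quick sketch illustrates those byte lengths (the encoding names are Python's spellings):
print(len('Й'.encode('cp1251')))     # 1 byte in ANSI (Windows-1251)
print(len('Й'.encode('utf-8')))      # 2 bytes in UTF-8
print(len('A'.encode('utf-8')))      # 1 byte: ASCII stays one byte in UTF-8
print(len('Й'.encode('utf-16-le')))  # 2 bytes in UTF-16 (Notepad's "Unicode")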
You can check the encoding by opening the file in Notepad and clicking File, Save As. In the bottom right corner, next to the Save button, you can see the encoding.
With some googling I found a site where you can do the character encoding conversion online. I haven't tested it, but here's the address:
http://i-tools.org/charset
I've made a script (= a small program) which changes the character encoding from any ANSI and code page combination to UTF-8 or Unicode or vice versa.
Let's say you have an English Windows computer and want to convert russian.txt (ANSI / Windows-1251) to UTF-8.
Here's how:
Open this web-page and copy the script in it to the clipboard:
VB6/VBScript change file encoding to ansi
Create a new file named ConvertCharset.vbs in the same folder where russian.txt is, say C:\Temp.
Open ConvertCharset.vbs in Notepad (right-click + Edit) and paste.
Open CMD (Windows key + R, type cmd, Enter).
In the CMD window, type (hit Enter at the end of each line):
cd C:\Temp\
cscript ConvertCharset.vbs /InputCharset:Windows-1251 /OutputCharset:utf-8 /InputFile:russian.txt /OutputFile:russian_utf-8.txt
Now you can open russian_utf-8.txt in Notepad and the Russian characters will display correctly.
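If Python happens to be installed, the same conversion can also be done without the VBScript (a minimal sketch using the same file names as above):
# Re-encode russian.txt from Windows-1251 (ANSI) to UTF-8.
with open('russian.txt', encoding='cp1251') as src:
    text = src.read()
with open('russian_utf-8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)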
More info:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Windows-1251
http://en.wikipedia.org/wiki/UTF-8
VB6/VBScript change file encoding to ansi
Related
When I open a certain file in Visual Studio Code on Windows 10 and then edit and save the file, VS Code seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VS Code is UTF-8.
Non-corrupted string before saving the file: “Diff Clang Compiler Log Files”
Corrupted string after saving the file:
�Diff Clang Compiler Log Files�
So, for example, the double quotation mark character “, which in the original file is represented by the byte sequence 0xE2 0x80 0x9C, will upon saving the file be converted into 0xEF 0xBF 0xBD. I do not fully understand what the root cause is, but I have the following assumption:
The original file is saved using the Windows-1252 encoding (I am using a Windows 10 machine with a German keyboard).
VS Code wrongly interprets the file as UTF-8.
Character codes get converted from Windows-1252 to UTF-8 once the file is saved; thus 0xE2 0x80 0x9C becomes 0xEF 0xBF 0xBD (see the sketch below).
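A small Python sketch of what I suspect happens (the bytes 0x93 and 0x94, the Windows-1252 curly quotes, are my assumption):
raw = b'\x93Diff Clang Compiler Log Files\x94'   # curly quotes in Windows-1252
decoded = raw.decode('utf-8', errors='replace')  # what the editor might do
print(decoded)                                   # �Diff Clang Compiler Log Files�
print(decoded.encode('utf-8')[:3])               # b'\xef\xbf\xbd' = U+FFFD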
Is my understanding correct?
Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would really be glad to find a way to avoid corrupting my files in the future :-).
Thank you!
The encoding of the file can be found with the help of the Python magic module:
import magic

FILE_PATH = 'C:\\myPath'

def getFileEncoding(filePath):
    # Read the raw bytes and let libmagic guess the MIME encoding
    with open(filePath, 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    return m.from_buffer(blob)

fileEncoding = getFileEncoding(FILE_PATH)
print(f"File Encoding: {fileEncoding}")
In my source files I have strings containing non-ASCII characters, like
sCursorFormat = TRANSLATE("Frequency (Hz): %s\nDegree (°): %s");
But when I extract them they vanish like
msgid ""
"Frequency (Hz): %s\n"
"Degree (): %s"
msgstr ""
I have specified the encoding when extracting as
xgettext --from-code=UTF-8
I'm running under MS Windows and the source files are C++ (not that it should matter).
The encoding of your source file is probably not UTF-8 but ANSI, which stands for whatever the encoding for non-Unicode applications is (probably code page 1252). If you opened the file in a hex editor, you would see the byte 0xB0 standing for the degree symbol. This byte is not valid UTF-8; in UTF-8, the degree symbol is represented by the two bytes 0xC2 0xB0. This is why the character vanishes when using --from-code=UTF-8.
The solution for your problem is to use --from-code=windows-1252. Or, better yet, save all source files as UTF-8 and then use --from-code=UTF-8.
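A quick Python check shows the byte difference described above:
print('°'.encode('cp1252'))  # b'\xb0' — a lone continuation byte, invalid as UTF-8
print('°'.encode('utf-8'))   # b'\xc2\xb0' — the valid UTF-8 encoding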
I have a question about converting UTF-8 to CP1252 on Ubuntu with PHP or shell.
Background: converting a CSV file from UTF-8 to CP1252 on Ubuntu with PHP or shell, copying the file from Ubuntu to Windows, and opening it with Notepad++.
Environment :
Ubuntu 10.04
PHP 5.3
a CSV file with letters (œ, à, ç)
Methods used :
With PHP
iconv("UTF-8", "CP1252", "content of file")
or
mb_convert_encoding("content of file", "CP1252", "UTF-8")
If I check the generated file with
file -i name_of_the_file
It displayed :
name_of_the_file: text/plain; charset=iso-8859-1
I copied the converted file to Windows and opened it with Notepad++; in the bottom right corner, the encoding is shown as ANSI.
When I changed the encoding from ANSI to Windows-1252, the special characters displayed correctly.
With Shell
iconv -f UTF-8 -t CP1252 name_of_the_file > converted_file
The rest is the same.
Question :
1. Why did the file command display ISO-8859-1 rather than CP1252 or ANSI?
2. Why were the special characters displayed correctly when I changed the encoding from ANSI to Windows-1252?
Thank you in advance !
1.
CP1252 and ISO-8859-1 are very similar; quite often a file encoded in one of them looks identical to the same file encoded in the other. See Wikipedia for which characters are in Windows-1252 but not in ISO-8859-1.
The letters à and ç are encoded identically in both. While ISO-8859-1 doesn't have œ and CP1252 does, file might have missed that; AFAIK it doesn't analyse the entire file.
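A short Python check makes both points concrete:
print('à'.encode('cp1252') == 'à'.encode('iso-8859-1'))  # True: identical bytes
print('œ'.encode('cp1252'))                              # b'\x9c'
try:
    'œ'.encode('iso-8859-1')
except UnicodeEncodeError:
    print('œ does not exist in ISO-8859-1')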
2.
"ANSI" is a misnomer used for the default non-Unicode encoding in Windows. In case of Western European languages, ANSI means Windows-1252. In case of Central European, it's Windows-1250, in case of Russian it's Windows-1251, and so on. Nothing apart from Windows uses the term "ANSI" to refer to an encoding.
I'm having a problem trying to read on Windows a CSV file generated on a Mac.
My question is how can I convert the encoding to UTF-8 or even ISO-8859-1.
I've already tried iconv with no success.
Inside "vim" I can understand that in this file linebreaks are marked with ^M and the accent ã is marked with <8b>, Ç = <82> and so on.
Any ideas?
To convert from encoding a to encoding b, you need to know what encoding a is.
Given that
ã is marked with <8b>, Ç = <82>
encoding a is very likely Mac OS Roman.
So call iconv with macintosh* as the from argument and utf-8 as the to argument.
*Try macroman, x-mac-roman, etc. if macintosh is not found.
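You can confirm the guess and do the conversion in Python as well (mac_roman is Python's codec name; the file names are placeholders):
print(b'\x8b'.decode('mac_roman'))  # ã — matches what vim shows
print(b'\x82'.decode('mac_roman'))  # Ç

# Once confirmed, re-encode the whole file to UTF-8:
with open('file.csv', encoding='mac_roman') as src:
    text = src.read()
with open('file_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)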
Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist has Asian characters in their name, the output is UTF-8.
+ If it just has ASCII characters, the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is 8-bit ANSI rather than US-ASCII, so the byte for "ä" is not valid UTF-8 and gets skipped.
How can I tell whether or not the output text file from ffmpeg is UTF-8? The application does not have any switches, and I just think it's plain dumb not to always go with UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot in advance, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without a BOM (Byte Order Mark) and choose the best encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system of how to encode Unicode Codepoints. The issue of validity is notin the character itself, it is a question of how has it been encoded...
There are many systems which can encode Unicode Codepoints; UTF-8 is one and UTF16 is another... "ä" is quite legal in the UTF-8 system.. Actually all characters are valid, so long as that character has a Unicode Codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode Codepoint system. Unicode itself is nothing more that a big look-up table. What does the work is teh encoding system; eg. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values is a single byte, just as ASCII does, this means that the data in an ASCII file is identical to a file with the same date but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (ie, 128 characters).
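A two-line Python check demonstrates the subset relationship:
s = 'Hello'
print(s.encode('ascii') == s.encode('utf-8'))  # True: byte-for-byte identical
print('ä'.encode('utf-8'))                     # b'\xc3\xa4' — two bytes, beyond ASCII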
You can check a file for 7-bit ASCII compliance:
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance:
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
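Since isutf8 isn't readily available on Windows, a minimal Python equivalent of the same check may help (a sketch; the script name is made up):
# isutf8.py — exit 0 if the file is valid UTF-8, 1 otherwise.
import sys

with open(sys.argv[1], 'rb') as f:
    data = f.read()
try:
    data.decode('utf-8')
    sys.exit(0)
except UnicodeDecodeError as e:
    print(f'invalid UTF-8 at byte offset {e.start}', file=sys.stderr)
    sys.exit(1)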