how to convert ascii encoding file to utf-8 encoding in perl?

I want to convert a text file with ascii encoding to utf-8 encoding.
So far I have tried this:
open( my $test, ">:encoding(utf-8)", $test_file ) or die("Error: Could not open file!\n");
and then ran the command below, which reports the encoding of the file:
file $test_file
test_file: ASCII text
Please let me know if I am missing something here.

Any file that is in ASCII (i.e. containing only codepoints from 0 to 127) is already in UTF-8. There will be no difference in encoding and, hence, no way for file to identify it as UTF-8.
Differences in encoding only arise for characters with codepoints 128 and above.
It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
(From the Wikipedia article on UTF-8)
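The point can be checked directly by comparing bytes. The following is only a minimal sketch in Ruby (the helper name bytes_in is made up for illustration): it encodes the same text as US-ASCII and as UTF-8 and compares the resulting byte arrays.

```ruby
# Hypothetical helper: the bytes of `text` once encoded as `encoding_name`.
def bytes_in(text, encoding_name)
  text.encode(encoding_name).bytes
end

ascii_only = "abcdef"
accented   = "abcd\u00E9f" # "abcdéf"

# Pure ASCII text produces exactly the same bytes in both encodings.
p bytes_in(ascii_only, "US-ASCII") == bytes_in(ascii_only, "UTF-8")

# Only the non-ASCII character needs more than one byte in UTF-8.
p bytes_in(accented, "UTF-8")
```

Encoding the accented string as US-ASCII would raise Encoding::UndefinedConversionError, since é has no ASCII representation.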

You are doing it correctly.
ASCII is a subset of UTF-8.
            decode          encode
ASCII         ⇒    Unicode    ⇒    UTF-8
----------       ----------       ----------
00               U+0000           00
01               U+0001           01
02               U+0002           02
⋮                ⋮                ⋮
7E               U+007E           7E
7F               U+007F           7F
----------       ----------       ----------
ASCII         ⇐    Unicode    ⇐    UTF-8
            encode          decode
As such, an ASCII file is a UTF-8 file.[1]
When you only use that subset, file identifies the file as being encoded using ASCII.
$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdef"' | file -
/dev/stdin: ASCII text
Going out of that subset causes file to identify the file as text encoded using UTF-8.
$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdéf"' | file -
/dev/stdin: UTF-8 Unicode text
[1] It is also an iso-latin-1 file, an iso-latin-2 file, an iso-latin-3 file, a cp1250 file, a cp1251 file, a cp1252 file, and so on.
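The labelling behaviour shown by file above can be imitated in a few lines. This is a rough sketch in Ruby, not how file itself is implemented; the method name classify and the third label are assumptions made for illustration.

```ruby
# Hypothetical classifier: label raw bytes roughly the way `file` did above.
def classify(bytes)
  data = bytes.dup.force_encoding(Encoding::UTF_8)
  return "ASCII text" if data.ascii_only?             # only bytes 0x00-0x7F
  return "UTF-8 Unicode text" if data.valid_encoding? # well-formed multi-byte sequences
  "neither ASCII nor UTF-8"
end

puts classify("abcdef\n".b)         # only ASCII bytes
puts classify("abcd\xC3\xA9f\n".b)  # é encoded as UTF-8
puts classify("abcd\xE9f\n".b)      # é encoded as ISO-8859-1
```

A file that contains only ASCII bytes can never be reported as anything more specific, because nothing in it distinguishes UTF-8 from ASCII.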

Related

Gettext failed to extract non-ASCII characters

In my source files I have string containing non-ASCII characters like
sCursorFormat = TRANSLATE("Frequency (Hz): %s\nDegree (°): %s");
But when I extract them they vanish like
msgid ""
"Frequency (Hz): %s\n"
"Degree (): %s"
msgstr ""
I have specified the encoding when extracting as
xgettext --from-code=UTF-8
I'm running under MS Windows and the source files are C++ (not that it should matter).

The encoding of your source file is probably not UTF-8 but ANSI, which stands for whatever encoding is configured for non-Unicode applications (probably code page 1252). If you open the file in a hex editor, you will see the byte 0xB0 standing for the degree symbol. On its own this byte is not a valid UTF-8 sequence; in UTF-8 the degree symbol is represented by the two bytes 0xC2 0xB0. This is why the character vanishes when using --from-code=UTF-8.
The solution to your problem is to use --from-code=windows-1252 or, better yet, to save all source files as UTF-8 and then use --from-code=UTF-8.
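The byte-level claim can be verified with a short Ruby sketch. It assumes the source file really was saved as Windows-1252; the helper name cp1252_to_utf8 is hypothetical.

```ruby
# Hypothetical helper: reinterpret raw bytes as Windows-1252 and transcode to UTF-8.
def cp1252_to_utf8(raw)
  raw.dup.force_encoding("Windows-1252").encode("UTF-8")
end

raw = "Degree (\xB0): %s".b # bytes as a Windows-1252 editor would save them

# Read as UTF-8, the lone 0xB0 byte is malformed.
p raw.dup.force_encoding("UTF-8").valid_encoding?

# After transcoding, the degree sign becomes the two bytes C2 B0.
fixed = cp1252_to_utf8(raw)
p fixed.valid_encoding?
puts fixed.bytes.map { |b| format("%02X", b) }.join(" ")
```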

Character replacement batch file

I'm trying to do a batch script using Windows command line to convert some characters for example:
É to Й
Ö to Ц
Ó to У
Ê to К
Å to Е
Í to Н
Ã to Г
Ø to Ш
Ù to Щ
Ç to З
with no success. That's because I am using a program that does not support a Cyrillic font.
And I have already the file with these words, like:
ОБОГРЕВ ЗОНЫ 1
ДАВЛЕНИЕ ЦВЕТА 1
...
and so on...
Is it possible?

I'm guessing that you'd like to convert the character set (alias code page) of a file so you can open and read it.
I'm assuming you are using a Windows computer.
Let's say that your file is russian.txt and when you open it with Notepad, the characters don't make any sense. The russian.txt file's character encoding is most probably ANSI and its code page is Windows-1251.
Some words about character encoding:
In the single-byte ANSI code pages discussed here, one character is one byte long.
Different languages have different code pages: Windows-1251 = Russian, Windows-1252 = Western Languages (English, German, Swedish...), Windows-1253 = Greek ...
In UTF-8, ASCII characters are one byte long and all other characters take two to four bytes (Cyrillic letters take two).
In what Windows calls Unicode (UTF-16), most characters are two bytes long; characters outside the Basic Multilingual Plane take four.
UTF-8 and Unicode don't need code pages.
You can check the encoding by opening the file in notepad and clicking File, Save As. At the right bottom corner beside the Save-button you can see the encoding.
With some googling I found a site where you can do the character encoding conversion online. I haven't tested it, but here's the address:
http://i-tools.org/charset
I've made a script (= a small program) which changes the character encoding from any ANSI and code page combination to UTF-8 or Unicode or vice versa.
Let's say you have an English Windows computer and want to convert russian.txt (ANSI / Windows-1251) to UTF-8.
Here's how:
Open this web-page and copy the script in it to the clipboard:
VB6/VBScript change file encoding to ansi
Create a new file named ConvertCharset.vbs in the same folder as russian.txt, say C:\Temp.
Open the ConvertCharset.vbs in notepad (right click+edit) and paste.
Open CMD (Windows-button+R, cmd, Enter).
In the CMD window type (press Enter at the end of each line):
cd C:\Temp\
cscript ConvertCharset.vbs /InputCharset:Windows-1251 /OutputCharset:utf-8 /InputFile:russian.txt /OutputFile:russian_utf-8.txt
Now you can open russian_utf-8.txt in Notepad and you'll see the Russian characters correctly.
More info:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Windows-1251
http://en.wikipedia.org/wiki/UTF-8
VB6/VBScript change file encoding to ansi
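For comparison, the same kind of conversion can be sketched in Ruby. This is not the VBScript referred to above, and both method names are hypothetical. The second helper assumes that the Latin characters listed in the question are Windows-1251 bytes that were displayed as Windows-1252, which is consistent with every pair in that list.

```ruby
# Hypothetical helper: transcode a whole file, by default Windows-1251 to UTF-8.
def convert_file(input_path, output_path, from: "Windows-1251", to: "UTF-8")
  text = File.binread(input_path).force_encoding(from)
  File.binwrite(output_path, text.encode(to))
end

# Hypothetical helper: text shown as Windows-1252 that was really Windows-1251.
# Raises Encoding::UndefinedConversionError for characters outside Windows-1252.
def relabel_cp1252_as_cp1251(text)
  text.encode("Windows-1252").force_encoding("Windows-1251").encode("UTF-8")
end

# The ten characters from the question, in the same order.
puts relabel_cp1252_as_cp1251("ÉÖÓÊÅÍÃØÙÇ")
```

A call such as convert_file("russian.txt", "russian_utf-8.txt") would mirror the cscript command above, with the file names taken from that example.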

Confusion with ARGF#set_encoding

ARGF.set_encoding says:
If single argument is specified, strings read from ARGF are tagged with the encoding specified.
If two encoding names separated by a colon are given, e.g. "ascii:utf-8", the read string is converted from the first encoding (external encoding) to the second encoding (internal encoding), then tagged with the second encoding.
So I tried the below:
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding('ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
No encoding change is found. So please advise me on the correct approach.

The encodings IBM437, ASCII and UTF-8 all use the same byte sequence for ASCII characters, so you won't see any difference in the String#inspect output. You can, however, check the String#encoding value of the input strings.
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/).map{|s| s.encoding}
In Ruby (1.9 and higher), a String is a byte sequence tagged with an encoding. You can get the encoding from String#encoding.
So the Chinese character "中" can be represented in different ways:
e4 b8 ad # tagged with encoding UTF-8
d6 d0 # tagged with encoding GBK
2d 4e # tagged with encoding UTF-16le
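These three byte sequences can be reproduced with String#encode. A small sketch (the helper name hex_bytes is made up for illustration):

```ruby
# Hypothetical helper: space-separated hex bytes of `text` in `encoding_name`.
def hex_bytes(text, encoding_name)
  text.encode(encoding_name).bytes.map { |byte| format("%02x", byte) }.join(" ")
end

zhong = "\u4E2D" # "中"

%w[UTF-8 GBK UTF-16LE].each do |name|
  puts "#{hex_bytes(zhong, name)} # tagged with encoding #{name}"
end
```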
I always write my scripts in UTF-8, that is, the internal encoding for my script is UTF-8. Sometimes I want to process a text file (e.g. one named "a.txt" with the content "中") encoded in GBK. Then I can set the external encoding and the internal encoding for the IO object and Ruby will do the conversion for me.
ARGF.set_encoding('GBK', 'UTF-8')
str = ARGF.readline
puts str.encoding
# run $ script.rb a.txt
Ruby reads "\xd6\xd0" from "a.txt" and, since I have specified the external encoding as GBK, tags the data with encoding GBK. Because I have also specified the internal encoding as UTF-8, Ruby converts the GBK byte sequence to UTF-8, which results in "\xe4\xb8\xad" tagged with UTF-8. This string now has the same encoding as the other strings in my script, so I can use it with ease.
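The external/internal conversion described here can be tried without ARGF. The sketch below is an illustration under assumptions: it uses File.open with the "r:GBK:UTF-8" mode string instead of ARGF.set_encoding, and a temporary file stands in for "a.txt". The helper name read_gbk_as_utf8 is hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: external encoding GBK, internal encoding UTF-8.
def read_gbk_as_utf8(path)
  File.open(path, "r:GBK:UTF-8") { |file| file.read }
end

Tempfile.create("a.txt") do |tmp|
  tmp.binmode
  tmp.write("\xD6\xD0".b) # the GBK bytes for "中"
  tmp.flush

  str = read_gbk_as_utf8(tmp.path)
  puts str.encoding
  puts str.bytes.map { |byte| format("%02x", byte) }.join(" ")
end
```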
This is useful because many String methods fail when their two String operands have different, incompatible encodings. For example:
# encoding: utf-8
a = "中" # tagged with UTF-8
b = "中".encode('gbk') # tagged with GBK
puts a + b
#=> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and GBK
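One way around this error is to transcode every operand to a common encoding before combining them. A minimal sketch, with a hypothetical helper name concat_as_utf8:

```ruby
# Hypothetical helper: transcode each part to UTF-8, then join.
def concat_as_utf8(*parts)
  parts.map { |part| part.encode(Encoding::UTF_8) }.join
end

a = "\u4E2D"               # "中", tagged with UTF-8
b = "\u4E2D".encode("GBK") # same character, tagged with GBK

begin
  a + b
rescue Encoding::CompatibilityError => e
  puts "raised: #{e.message}"
end

puts concat_as_utf8(a, b)
```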

hex representation of german umlaut to string

I have a String that has non-ASCII characters encoded as "\\'fc" (without quotes), where hex fc is decimal 252, which corresponds to the German ü umlaut.
I managed to find all occurrences and can replace them. But I have not been able to convert the fc to an ü.
"fc".hex.chr
gives me another representation...but if I do
puts "fc".hex.chr
I get nothing back...
Thanks in advance
PS: I'm working on ruby 1.9 and have
# coding: utf-8
at the top of the file.

fc is not the UTF-8 encoding of that character; it is the byte used for ü in iso-8859-1 and windows-1252 (and also the number of its Unicode code point, U+00FC). The UTF-8 encoding for ü is the two-byte sequence c3 bc. Further, a lone FC byte is not a valid UTF-8 sequence.
Since your source file is declared as UTF-8, you should be able to get the literal u-umlaut with: "\xc3\xbc"

Have you tried
puts "fc".hex.chr(Encoding::UTF_8)
Ruby docs:
int.chr
Encoding
UPDATE:
Jason True is right. fc is invalid UTF-8. I have no idea why my example works!
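A likely explanation is that Integer#chr, when given an encoding, treats the integer as a code point rather than as a raw byte, so 0xFC becomes U+00FC and is then encoded as the UTF-8 bytes c3 bc. The sketch below illustrates this. It assumes the escaped bytes are ISO-8859-1, where byte values and code points coincide (Windows-1252 differs in the 0x80 to 0x9F range), and the helper name decode_hex_escapes is made up for illustration.

```ruby
# Without an argument, chr yields the single raw byte 0xFC (binary string).
raw = "fc".hex.chr
p raw.encoding
p raw.bytes

# With an encoding, the integer is taken as code point U+00FC.
umlaut = "fc".hex.chr(Encoding::UTF_8)
p umlaut.bytes

# Hypothetical helper: replace every \'xx escape with the matching character.
def decode_hex_escapes(text)
  text.gsub(/\\'(\h{2})/) { Regexp.last_match(1).hex.chr(Encoding::UTF_8) }
end

puts decode_hex_escapes("M\\'fcnchen")
```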

Determine if a text file without BOM is UTF8 or ASCII

Long story short:
+ I'm using ffmpeg to check the artist name of a MP3 file.
+ If the artist has asian characters in its name the output is UTF8.
+ If it just has ASCII characters the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is in an 8-bit extended ASCII encoding rather than US-ASCII, so the "ä" is not valid UTF8 and is skipped.
How can I tell whether or not the output text file from ffmpeg is UTF8 or not? The application does not have any switches and I just think it's plain dumb not to always go with UTF8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot in advance, guys!

This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without BOM (Byte Order Mark) and choose the best Encoding ...

You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system of how to encode Unicode Codepoints. The issue of validity is not in the character itself, it is a question of how it has been encoded...
There are many systems which can encode Unicode Codepoints; UTF-8 is one and UTF-16 is another... "ä" is quite legal in the UTF-8 system.. Actually all characters are valid, so long as that character has a Unicode Codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode Codepoint system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system; eg. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, this means that the data in an ASCII file is identical to a file with the same data but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (ie, 128 characters).
You can check a file for 7-bit ASCII compliance..
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance..
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
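The same kind of check can be sketched in Ruby using String#valid_encoding? instead of a hand-written pattern. Both method names are hypothetical, and falling back to Windows-1252 is an assumption about what the non-UTF-8 output is likely to be, not something the ffmpeg output itself declares.

```ruby
# Hypothetical helper: 1-based numbers of the lines that are not valid UTF-8.
def invalid_utf8_line_numbers(bytes)
  bytes.b.each_line.with_index(1)
       .reject { |line, _| line.force_encoding(Encoding::UTF_8).valid_encoding? }
       .map { |_, number| number }
end

# Hypothetical helper: keep valid UTF-8 as it is, otherwise decode with a fallback.
# Raises Encoding::UndefinedConversionError for bytes the fallback does not define.
def to_utf8(bytes, fallback: "Windows-1252")
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

sample = "abc\nB\xE4r\nB\xC3\xA4r\n".b # line 2 is Latin-1, line 3 is UTF-8
p invalid_utf8_line_numbers(sample)
puts to_utf8("B\xE4r".b)
puts to_utf8("B\xC3\xA4r".b)
```

Pure ASCII input passes the UTF-8 check as well, so no separate ASCII branch is needed when the goal is simply to end up with UTF-8.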
