Reading special characters in vbs - vbscript

I am having trouble reading special characters in a VBScript. My file-reading code goes like this:
ls_folder = "file path"
Set fso = CreateObject("Scripting.FileSystemObject")
Set fa = fso.OpenTextFile(ls_folder & f1.Name, 1, False)
Do While fa.AtEndOfStream <> True
    Line = fa.ReadLine
    'Code
Loop
If I open the files in Notepad++, it reports the encoding as ANSI. I tried using the fourth parameter of OpenTextFile, but none of its three Tristate values worked for me.
The script doesn't read the "ł" character; when converted to ASCII it gives the value 179.
Is there any other way to declare the encoding besides ADODB.Stream, which allows declaring a Charset?
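For reference, the ADODB.Stream route I would rather avoid looks roughly like this (a sketch; "windows-1250" is a guess based on the 179 value):
Set stm = CreateObject("ADODB.Stream")
stm.Type = 2                       ' adTypeText
stm.Charset = "windows-1250"       ' guessed code page where Asc("ł") = 179
stm.Open
stm.LoadFromFile ls_folder & f1.Name
Do While Not stm.EOS
    Line = stm.ReadText(-2)        ' -2 = adReadLine
    'Code
Loop
stm.Close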

Use the AscW function rather than Asc: AscW is provided for 32-bit platforms that use Unicode characters. It returns the Unicode (wide) character code, thereby avoiding the conversion from Unicode to ANSI.
Note that the "ł" character (Unicode U+0142, i.e. decimal 322) is defined in the following ANSI code pages:
Asc("ł") returns 179 in ANSI 1250 Central Europe and
Asc("ł") returns 249 in ANSI 1257 Baltic.
For proof, open charmap.exe or run my Alt KeyCode Finder script:
==> mycharmap "ł"
Ch Unicode Alt? CP IME Alt Alt0 IME 0405/cs-CZ; CP852; ANSI 1250
ł U+0142 322 …66… Latin Small Letter L With Stroke
136 CP852 cs-CZ 136 0179 (ANSI 1250) Central Europe
136 CP775 et-EE 0249 (ANSI 1257) Baltic
ł
==>
For the sake of completeness, AscB("ł") returns 66…

Add a fourth argument to OpenTextFile to specify the format. After the False value, which says "don't create the file", pass -1 (TristateTrue) to open the file as Unicode.
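For example (a sketch; note that TristateTrue opens the file as UTF-16, so it will not help with an ANSI-encoded file like the one described in the question):
Set fso = CreateObject("Scripting.FileSystemObject")
' iomode 1 = ForReading, create = False, format = -1 (TristateTrue = Unicode)
Set fa = fso.OpenTextFile(ls_folder & f1.Name, 1, False, -1)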

Related

Bash: Remove ALL special characters from end of variable (not just CR or LF)

I have some text files and am getting the first line(s) into a variable using var=$(head -n 1 "$#"); however, the variable contains special characters that I want removed (ASCII 1-31).
Is there a quick way to strip ASCII codes 1-31 from the end of a variable? I've already tried ${var//[^[:ascii:]]/} and var="${var//[$'\t\r\n']}", but I need ASCII 1-31 removed from the end in a simple way (not just CR/LF/Tab/FF/etc.).
There's a character class for control characters – quote from the grep manual:
[:cntrl:]
Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
So, you could do
var=${var//[[:cntrl:]]}
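If the control characters really do only need to be stripped from the end, here is a sketch using bash's extended globs (assumes extglob can be enabled in your shell):
shopt -s extglob                  # enable extended glob patterns
var=${var%%+([[:cntrl:]])}        # drop the longest run of trailing control chars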

How to convert an ASCII-encoded file to UTF-8 encoding in Perl?

I want to convert a text file with ASCII encoding to UTF-8 encoding.
So far I have tried this:
open( my $test, ">:encoding(utf-8)", $test_file ) or die("Error: Could not open file!\n");
and ran the command below, which shows the encoding of the file:
file $test_file
test_file: ASCII text
Please let me know if I am missing something here.
Any file that is in ASCII (i.e. containing only code points from 0 to 127) is already in UTF-8. There will be no difference in encoding and, hence, no way for file to identify it as UTF-8.
Differences in encoding only appear for characters with code points of 128 and above.
It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
(From the Wikipedia article on UTF-8)
You are doing it correctly.
ASCII is a subset of UTF-8.
          decode            encode
ASCII       ⇒     Unicode     ⇒     UTF-8
----------        ----------        ----------
    00            U+0000            00
    01            U+0001            01
    02            U+0002            02
     ⋮               ⋮               ⋮
    7E            U+007E            7E
    7F            U+007F            7F
----------        ----------        ----------
ASCII       ⇐     Unicode     ⇐     UTF-8
          encode            decode
As such, an ASCII file is a UTF-8 file.[1]
When you only use that subset, file identifies the file as being encoded using ASCII.
$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdef"' | file -
/dev/stdin: ASCII text
Going out of that subset causes file to identify the file as text encoded using UTF-8.
$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdéf"' | file -
/dev/stdin: UTF-8 Unicode text
[1] It is also an iso-latin-1 file, an iso-latin-2 file, an iso-latin-3 file, a cp1250 file, a cp1251 file, a cp1252 file, etc., etc., etc.
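If the input really did contain bytes in some single-byte encoding such as cp1252, an explicit re-encode would be needed; a minimal sketch (the file names and the source encoding are assumptions):
open( my $in, "<:encoding(cp1252)", "in.txt" ) or die("Error: Could not open in.txt!\n");
open( my $out, ">:encoding(UTF-8)", "out.txt" ) or die("Error: Could not open out.txt!\n");
print $out $_ while <$in>;   # decode from cp1252, re-encode as UTF-8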

What does '^@' mean in vim?

When I cat a file in bash I get the following:
$ cat /tmp/file
microsoft
When I view the same file in vim I get the following:
^@m^@i^@c^@r^@o^@s^@o^@f^@t^@
How can I identify and remove these "non-printable" characters? What does '^@' mean in Vim?
(Just a piece of background information: the file was created by base 64 decoding and cutting from the pssh header of an mpd file for Microsoft Playready)
What you see is Vim's visual representation of unprintable characters. It is explained at :help 'isprint':
Non-printable characters are displayed with two characters:
  0 -  31    "^@" - "^_"
 32 - 126    always single characters
   127       "^?"
128 - 159    "~@" - "~_"
160 - 254    "| " - "|~"
   255       "~?"
Therefore, ^@ stands for a null byte = 0x00. These (and other non-printable characters) can come from various sources, but in your case it's an ...
encoding issue
If you look closely at your output in Vim, every second byte is a null byte, with the expected characters in between. This is a clear indication that the file uses a multibyte encoding (UTF-16, big endian, without a byte order mark, to be precise) and that Vim did not properly detect it, instead opening the file as latin1 or so (whereas things worked out properly in the terminal).
To fix this, you can either explicitly specify the encoding:
:edit ++enc=utf-16 /tmp/file
Or tweak the 'fileencodings' option, so that Vim can automatically detect this. However, be aware that ambiguities (as in your case) make this prone to fail:
For an empty file or a file with only ASCII characters most encodings
will work and the first entry of 'fileencodings' will be used (except
"ucs-bom", which requires the BOM to be present).
That's why a byte order mark (BOM) is recommended for 16-bit encodings; but that assumes that you have control over the output encoding.
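A 'fileencodings' tweak of that sort might look like this (a sketch; as the quote above warns, putting a BOM-less 16-bit encoding in the list makes detection of plain ASCII files unreliable):
:set fileencodings=ucs-bom,utf-16,utf-8,latin1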
^@ is Vim's representation of a null byte. The ^ indicates a non-printable control character, with the following character indicating which control character it is (its ASCII value is the control code plus 64).
^@ == 0 (NUL)
^A == 1
^B == 2
...
^H == 8
^K == 11
...
^Z == 26
^[ == 27
^\ == 28
^] == 29
^^ == 30
^_ == 31
^? == 127
9 and 10 aren't escaped because they are Tab and Line Feed respectively.
32 to 126 are printable ASCII characters (starting with Space).

Character replacement batch file

I'm trying to write a batch script for the Windows command line to convert some characters, for example:
É to Й
Ö to Ц
Ó to У
Ê to К
Å to Е
Í to Н
à to Г
Ø to Ш
Ù to Щ
Ç to З
with no success. That's because I am using a program that does not support a Cyrillic font.
And I already have the file with these words, like:
ОБОГРЕВ ЗОНЫ 1
ДАВЛЕНИЕ ЦВЕТА 1
...
and so on...
Is it possible?
I'm guessing that you'd like to convert the character set (a.k.a. code page) of a file so you can open and read it.
I'm assuming you are using a Windows computer.
Let's say that your file is russian.txt, and when you open it with Notepad the characters don't make any sense. The character encoding of russian.txt is most probably ANSI, and its code page is Windows-1251.
Some words about character encoding:
In ANSI, one character is one byte long.
Different languages have different code pages: Windows-1251 = Russian, Windows-1252 = Western languages (English, German, Swedish...), Windows-1253 = Greek, ...
In UTF-8, ASCII characters are one byte long and other characters two to four bytes long.
In UTF-16 (what Windows tools call "Unicode"), most characters are two bytes long.
UTF-8 and UTF-16 don't need code pages.
You can check the encoding by opening the file in Notepad and clicking File, Save As; at the bottom right corner, beside the Save button, you can see the encoding.
With some googling I found a site where you can do the character encoding conversion online. I haven't tested it, but here's the address:
http://i-tools.org/charset
I've made a script (= a small program) which changes the character encoding from any ANSI and code page combination to UTF-8 or Unicode, or vice versa.
Let's say you have an English Windows computer and want to convert russian.txt (ANSI / Windows-1251) to UTF-8.
Here's how:
Open this web-page and copy the script in it to the clipboard:
VB6/VBScript change file encoding to ansi
Create a new file named ConvertCharset.vbs in the same folder as russian.txt, say C:\Temp.
Open ConvertCharset.vbs in Notepad (right click → Edit) and paste.
Open CMD (Windows-button+R, cmd, Enter).
In the CMD window, type (hitting the Enter key at the end of each line):
cd C:\Temp\
cscript ConvertCharset.vbs /InputCharset:Windows-1251 /OutputCharset:utf-8 /InputFile:russian.txt /OutputFile:russian_utf-8.txt
Now you can open russian_utf-8.txt in Notepad and you'll see the Russian characters correctly.
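The core of such a conversion script is just an ADODB.Stream read/write pair; a minimal sketch with hard-coded names (the linked script adds the command-line parameters):
Set sIn = CreateObject("ADODB.Stream")
sIn.Type = 2                      ' adTypeText
sIn.Charset = "windows-1251"      ' source code page
sIn.Open
sIn.LoadFromFile "C:\Temp\russian.txt"
text = sIn.ReadText
sIn.Close

Set sOut = CreateObject("ADODB.Stream")
sOut.Type = 2
sOut.Charset = "utf-8"            ' target encoding
sOut.Open
sOut.WriteText text
sOut.SaveToFile "C:\Temp\russian_utf-8.txt", 2   ' 2 = adSaveCreateOverWrite
sOut.Close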
More info:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Windows-1251
http://en.wikipedia.org/wiki/UTF-8
VB6/VBScript change file encoding to ansi

Strange character in text output when piping from tasklist command (Windows 7)

A strange character 'ÿ' appears in the text output (it should have been a space). Why is this, and how can I fix it? It does not happen when the command is executed at the prompt, only when the output is piped to a text file.
Windows 7:
C:\>tasklist > text.txt
outputs:
System 4 Services 0 1ÿ508 K
smss.exe 312 Services 0 1ÿ384 K
csrss.exe 492 Services 0 5ÿ052 K
The "space" you could see in the console window was not the standard space character with the ASCII code of 32 (0x20), but the non-breaking space with the ASCII code of 255 (0xFF) in probably most OEM code pages.
After redirecting the output to a file, you likely opened the file in an editor that by default used a different code page to display the contents, possibly Windows-1252 since the character with the code of 255 is ÿ in Windows-1252.
Andriy was right.
I added
chcp 1252
at the beginning of my batch file, and all the weird characters were correctly translated into spaces in the output file.
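In batch form the fix might look like this (a sketch; code page 1252 assumes a Western European Windows locale):
@echo off
REM Switch the console code page before redirecting, so the thousands
REM separator is no longer written out as the OEM 0xFF byte.
chcp 1252 >nul
tasklist > text.txt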
