When I cat a file in bash I get the following:
$ cat /tmp/file
microsoft
When I view the same file in vim I get the following:
^@m^@i^@c^@r^@o^@s^@o^@f^@t^@
How can I identify and remove these "non-printable" characters? What does '^@' mean in Vim?
(Just a piece of background information: the file was created by base64-decoding and cutting from the pssh header of an MPD file for Microsoft PlayReady.)
What you see is Vim's visual representation of unprintable characters. It is explained at :help 'isprint':
Non-printable characters are displayed with two characters:
  0 - 31     "^@" - "^_"
 32 - 126    always single characters
   127       "^?"
128 - 159    "~@" - "~_"
160 - 254    "| " - "|~"
   255       "~?"
Therefore, ^@ stands for a null byte, 0x00. These (and other non-printable characters) can come from various sources, but in your case it's an ...
encoding issue
If you look closely at your output in Vim, every second byte is a null byte, with the expected characters in between. This is a clear indication that the file uses a multibyte encoding (UTF-16, big endian, without a byte order mark, to be precise), and that Vim did not detect it properly, opening the file as latin1 or similar instead (whereas things happened to work out in the terminal).
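A quick way to confirm this (a sketch, assuming a Unix-like shell with xxd available) is to dump the raw bytes:
# every ASCII character is preceded by a 00 byte in a BOM-less UTF-16BE file
xxd -l 16 /tmp/file
# the output for this example would look roughly like:
# 00000000: 006d 0069 0063 0072 006f 0073 006f 0066  .m.i.c.r.o.s.o.f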
To fix this, you can either explicitly specify the encoding:
:edit ++enc=utf-16 /tmp/file
Or tweak the 'fileencodings' option, so that Vim can automatically detect this. However, be aware that ambiguities (as in your case) make this prone to fail:
For an empty file or a file with only ASCII characters most encodings
will work and the first entry of 'fileencodings' will be used (except
"ucs-bom", which requires the BOM to be present).
That's why a byte order mark (BOM) is recommended for 16-bit encodings; but that assumes that you have control over the output encoding.
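If you would rather produce a clean UTF-8 (or plain ASCII) copy of the file outside Vim, re-encoding is safer than stripping the null bytes by hand. A minimal sketch, assuming iconv is available and the file really is BOM-less UTF-16BE as diagnosed above:
# convert from UTF-16 big endian to UTF-8; adjust -f if the hex dump shows a different layout
iconv -f UTF-16BE -t UTF-8 /tmp/file > /tmp/file.utf8
cat /tmp/file.utf8    # should now print "microsoft" with no stray bytes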
^@ is Vim's representation of a null byte. The ^ indicates a non-printable control character, with the following ASCII character indicating which control character it is.
^@ == 0 (NUL)
^A == 1
^B == 2
...
^H == 8
^K == 11
...
^Z == 26
^[ == 27
^\ == 28
^] == 29
^^ == 30
^_ == 31
^? == 127
9 and 10 aren't escaped because they are Tab and Line Feed respectively.
32 to 126 are printable ASCII characters (starting with Space).
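Outside Vim you can see the same caret notation with cat -v, which is handy for spotting stray NUL bytes; a small sketch (assuming GNU coreutils):
printf 'm\0i\0c\0r\0o\0\n' | cat -v        # prints m^@i^@c^@r^@o^@
printf 'm\0i\0c\0r\0o\0\n' | tr -d '\000'  # brute-force removal of the NUL bytes: micro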
I have some text files and I am getting the first line(s) into a variable using var=$(head -n 1 "$@"); however, the variable contains special characters that I want removed (ASCII 1-31).
Is there a quick way to strip ASCII codes 1-31 from the end of a variable? I've already used ${var//[^[:ascii:]]/} and var="${var//[$'\t\r\n']}", however I need ASCII 1-31 removed from the end in a simple way (not just CR/LF/Tab/FF/etc.).
There's a character class for control characters – quote from the grep manual:
[:cntrl:]
Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
So, you could do
var=${var//[[:cntrl:]]}
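A short usage sketch (the \x01-style bytes and the variable names here are just illustrative); with extglob you can also limit the removal to the end of the string, as the question asks:
var=$'value\x01\x1f\x03'
clean_all=${var//[[:cntrl:]]}       # strips control characters (ASCII 1-31 and 127) anywhere in the string
shopt -s extglob
clean_tail=${var%%+([[:cntrl:]])}   # strips only a trailing run of control characters
printf '%q %q\n' "$clean_all" "$clean_tail"   # -> value value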
I am having trouble reading special characters in a VBS script.
My file-reading code goes like this:
ls_folder = "file path"
Set fso = CreateObject("Scripting.FileSystemObject")
' f1 comes from iterating the folder's Files collection (not shown here)
Set fa = fso.OpenTextFile(ls_folder & f1.Name, 1, False)
Do While fa.AtEndOfStream <> True
    Line = fa.ReadLine
    'Code
Loop
If I open the files with Notepad++ it says the encoding is ANSI. I tried using the 4th parameter of OpenTextFile, but none of its 3 values worked for me.
The script doesn't read the "ł" character; when converted to ASCII it gives the value 179.
Is there any other way to declare the encoding, other than using ADODB.Stream, which allows you to declare a Charset?
Use the AscW function rather than Asc: AscW is provided for 32-bit platforms that use Unicode characters. It returns the Unicode (wide) character code, thereby avoiding the conversion from Unicode to ANSI.
Note that the "ł" character (Unicode U+0142, i.e. decimal 322) is defined in the following ANSI code pages:
Asc("ł") returns 179 in ANSI 1250 Central Europe and
Asc("ł") returns 249 in ANSI 1257 Baltic.
For proof, open charmap.exe or run my Alt KeyCode Finder script:
==> mycharmap "ł"
Ch Unicode Alt? CP IME Alt Alt0 IME 0405/cs-CZ; CP852; ANSI 1250
ł U+0142 322 …66… Latin Small Letter L With Stroke
136 CP852 cs-CZ 136 0179 (ANSI 1250) Central Europe
136 CP775 et-EE 0249 (ANSI 1257) Baltic
ł
==>
For the sake of completeness, AscB("ł") returns 66…
Add another argument to OpenTextFile to specify Unicode: after the False value, which says "don't create the file", pass TristateTrue (-1) as the format argument so the file is opened as Unicode.
I am running Windows 7 and (have to) use Turbo Grep (Borland something) to search in a file.
I have 2 versions of this file, one encoded in UTF-8 and one in ANSI.
If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:
grep -ni "[äöü]" myfile.txt
[-n for line numbers, -i for ignoring cases]
The Turbo GREP version is:
Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax: GREP [-rlcnvidzewoqhu] searchstring file[s] or @filelist
GREP ? for help
Help for this command lists:
Options are one or more option characters preceded by "-", and optionally
followed by "+" (turn option on), or "-" (turn it off). The default is "+".
-r+ Regular expression search -l- File names only
-c- match Count only -n- Line numbers
-v- Non-matching lines only -i- Ignore case
-d- Search subdirectories -z- Verbose
-e Next argument is searchstring -w- Word search
-o- UNIX output format Default set: [0-9A-Z_]
-q- Quiet: supress normal output
-h- Supress display of filename
-u xxx Create a copy of grep named 'xxx' with current options set as default
A regular expression is one or more occurrences of: One or more characters
optionally enclosed in quotes. The following symbols are treated specially:
^ start of line $ end of line
. any character \ quote next character
* match zero or more + match one or more
[aeiou0-9] match a, e, i, o, u, and 0 thru 9 ;
[^aeiou0-9] match anything but a, e, i, o, u, and 0 thru 9
Is there a problem with the encoding of these characters in UTF-8? Might there be a problem with Turbo GREP and UTF-8?
Thanks in advance
Yes, there is a difference: Windows 7 uses UTF-16 little endian, not UTF-8; UTF-8 is used on Unix, Linux and Plan 9, to cite a few operating systems.
As Jon Skeet explains:
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
UTF-16 is more similar to ANSI in this respect, which is why the search works well on the ANSI file.
If you use only ASCII, both encodings are usable, but with special characters such as ä, ö and ü you need to use UTF-16 on Windows and UTF-8 on the other systems.
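If you cannot replace Turbo GREP, one workaround is to convert the UTF-8 file to the ANSI code page before searching. A sketch, assuming iconv is available (e.g. from Cygwin, Git Bash or GnuWin32) and that Windows-1252 is the right target code page:
iconv -f UTF-8 -t WINDOWS-1252 myfile.txt > myfile_ansi.txt
grep -ni "[äöü]" myfile_ansi.txt
Note that the umlauts in the search pattern must also be entered in that ANSI code page, otherwise the byte values still won't match.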
Strange character 'ÿ' in text output (it should have been a space). Why is this, and how can I fix it? It does not happen when the command is executed at the prompt, only when the output is redirected to a text file.
Windows 7
c:\>tasklist > text.txt
outputs:
System 4 Services 0 1ÿ508 K
smss.exe 312 Services 0 1ÿ384 K
csrss.exe 492 Services 0 5ÿ052 K
The "space" you could see in the console window was not the standard space character with the ASCII code of 32 (0x20), but the non-breaking space with the ASCII code of 255 (0xFF) in probably most OEM code pages.
After redirecting the output to a file, you likely opened the file in an editor that by default used a different code page to display the contents, possibly Windows-1252 since the character with the code of 255 is ÿ in Windows-1252.
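You can check this byte-level interpretation yourself; a sketch assuming a Unix-like shell with iconv and xxd (e.g. Git Bash or WSL), and assuming the console used OEM code page 850:
printf '\377' | iconv -f CP850 -t UTF-8 | xxd   # 0xFF in CP850 is U+00A0 (no-break space), i.e. bytes c2 a0
printf '\377' | iconv -f CP1252 -t UTF-8        # the same byte read as Windows-1252 prints ÿ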
Andriy was right.
I added
chcp 1252
at the beginning of my batch file, and all the weird characters were correctly translated into spaces in the output file.
Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist has Asian characters in their name, the output is UTF-8.
+ If it just has ASCII characters, the output is ASCII.
The output does not include any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is ASCII, just not US-ASCII, so the "ä" is not valid UTF-8 and gets skipped.
How can I tell whether the output text file from ffmpeg is UTF-8 or not? The application does not have any switches for this, and I just think it's plain dumb not to always go with UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
Does anyone know of a Windows version?
Thanks a lot in advance, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without a BOM (Byte Order Mark) and choose the best Encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode codepoints. The issue of validity is not in the character itself; it is a question of how it has been encoded...
There are many systems which can encode Unicode codepoints; UTF-8 is one and UTF-16 is another... "ä" is quite legal in the UTF-8 system. Actually, all characters are valid, as long as the character has a Unicode codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode codepoint system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is identical to the data in a file with the same content which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... the two are indistinguishable for data in the ASCII range (i.e., the first 128 characters).
You can check a file for 7-bit ASCII compliance..
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance..
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
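If Perl is not at hand, iconv can serve as a rough UTF-8 validity check, because it fails on malformed sequences. A sketch (out.txt is just a placeholder name; on Windows this assumes iconv from Cygwin, Git Bash or GnuWin32):
# exit status 0 means the file decodes cleanly as UTF-8 (plain ASCII also passes)
if iconv -f UTF-8 -t UTF-8 out.txt > /dev/null 2>&1; then
    echo "out.txt is valid UTF-8"
else
    echo "out.txt is not valid UTF-8 (possibly ANSI/latin1)"
fi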