Strange character 'ÿ' in textoutput (should have been a space). Why is this, how can I fix it? Does not happen when command is executed at prompt. Only when piped to textfile.
Windows 7
c:\tasklist > text.txt
outputs:
System 4 Services 0 1ÿ508 K
smss.exe 312 Services 0 1ÿ384 K
csrss.exe 492 Services 0 5ÿ052 K
The "space" you could see in the console window was not the standard space character with the ASCII code of 32 (0x20), but the non-breaking space with the ASCII code of 255 (0xFF) in probably most OEM code pages.
After redirecting the output to a file, you likely opened the file in an editor that by default used a different code page to display the contents, possibly Windows-1252 since the character with the code of 255 is ÿ in Windows-1252.
Andriy was right.
I added
chcp 1252
at the begin of my batch file, and all weird characters were correctly translated into spaces in the output file.
Related
I'm putting together a script and need to take a file's content as input for setting a variable. I'm using Out-File to produce a text file:
$string | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewLine
Then I am using that file to set a variable in batch:
set /P variablename=<C:\Full\Path\To\file.txt
The content of that file is a unique id string that looks practically like this:
1i32l54bl5b2hlthtl098
When I echo this variable, I get this:
echo %variablename%
■1
When I have tried a different string in the input file, I see that what is being echoed is the ■ character and then the first character in the string. So, if my string was "apfvuu244ty0vh" then it would echo "■a" instead.
Why isn't the variable being set to the content of the file? I'm using the method from this stackoverflow post where the chosen answer says to use this syntax with the set command. Am I doing something wrong? Is there perhaps a problem with using a full path as input to a set variable?
tl;dr:
Use Out-File -Encoding oem to produce files that cmd.exe reads correctly.
This effectively limits you to the 256 characters available in the legacy "ANSI" / OEM code pages, except NUL (0x0). See bottom section if you need full Unicode support.
In Windows PowerShell (but not PowerShell Core), Out-File and its effective alias > default to UTF-16LE character encoding, where most characters are represented as 2-byte sequences; for characters in the ASCII range, the 2nd byte of each sequence is NUL (0x0); additionally, such files start with a BOM that indicates the type of encoding.
By contrast, cmd.exe expects input to use the legacy single-byte OEM encoding (note that starting cmd.exe with /U only controls the encoding of its output).
When cmd.exe (unbeknownst to it) encounters UTF-16LE input:
It interprets the bytes individually as characters (even though characters in UTF-16LE are composed of 2 bytes (typically), or, in rare cases, of 4 (a pair of 2-byte sequences)).
It interprets the 2 bytes that make up the BOM (0xff, 0xfe) as part of the string. With OEM code page 437 (US-English) in effect, 0xff renders like a space, whereas 0xfe renders as ■.
Reading stops once the first NUL (0x0 byte) is encountered, which happens with the 1st character from the ASCII range, which in your sample string is 1.
Therefore, string 1i32l54bl5b2hlthtl098 encoded as UTF-16LE is read as ■1, as you state.
If you need full Unicode support, use UTF-8 encoding:
Use Out-File -Encoding utf8 in PowerShell.
Before reading the file in cmd.exe (in a batch file), run chcp 65001 in order to switch to the UTF-8 code page.
Caveats:
Not all Unicode chars. may render correctly, depending on the font used in the console window.
Legacy applications may malfunction with code page 65001 in effect, especially on older Windows versions.
A possible strategy to avoid problems is to temporarily switch to code page 65001, as needed, and then switch back.
Note that the above only covers communication via files, and only in one direction (PowerShell -> cmd.exe).
To also control the character encoding used for the standard streams (stdin, stdout, stderr), both when sending strings to cmd.exe / external programs and when interpreting strings received from them, see this answer of mine.
When I cat a file in bash I get the following:
$ cat /tmp/file
microsoft
When I view the same file in vim I get the following:
^#m^#i^#c^#r^#o^#s^#o^#f^#t^#
How can I identify and remove these "non-printable" characters. What does '^#' mean in vim??
(Just a piece of background information: the file was created by base 64 decoding and cutting from the pssh header of an mpd file for Microsoft Playready)
What you see is Vim's visual representation of unprintable characters. It is explained at :help 'isprint':
Non-printable characters are displayed with two characters:
0 - 31 "^#" - "^_"
32 - 126 always single characters
127 "^?"
128 - 159 "~#" - "~_"
160 - 254 "| " - "|~"
255 "~?"
Therefore, ^# stands for a null byte = 0x00. These (and other non-printable characters) can come from various sources, but in your case it's an ...
encoding issue
If you clearly observe your output in Vim, every second byte is a null byte; in between are the expected characters. This is a clear indication that the file uses a multibyte encoding (utf-16, big endian, no byte order mark to be precise), and Vim did not properly detect that, and instead opened the file as latin1 or so (whereas things worked out properly in the terminal).
To fix this, you can either explicitly specify the encoding:
:edit ++enc=utf-16 /tmp/file
Or tweak the 'fileencodings' option, so that Vim can automatically detect this. However, be aware that ambiguities (as in your case) make this prone to fail:
For an empty file or a file with only ASCII characters most encodings
will work and the first entry of 'fileencodings' will be used (except
"ucs-bom", which requires the BOM to be present).
That's why a byte order mark (BOM) is recommended for 16-bit encodings; but that assumes that you have control over the output encoding.
^# is Vim's representation of a null byte. The ^ indicates a non-printable control character, with the following ASCII character indicating
which control character it is.
^# == 0 (NUL)
^A == 1
^B == 2
...
^H == 8
^K == 11
...
^Z == 26
^[ == 27
^\ == 28
^] == 29
^^ == 30
^_ == 31
^? == 127
9 and 10 aren't escaped because they are Tab and Line Feed respectively.
32 to 126 are printable ASCII characters (starting with Space).
I've written a program that returns keycodes as integers for DOS
but i don't know how to get it's output as a variable.
Note: I'm using MS-DOS 7 / Windows 98, so i can't use FOR /F or SET /P
Does anyone know how i could do that?
A few solutions are described by Eric Pement here. However, for older versions of cmd the author was forced to use external tools.
For example, program tools like STRINGS by Douglas Boling, allows for following code:
echo Greetings! | STRINGS hi=ASK # puts "Greetings!" into %hi%
Same goes for ASET by Richard Breuer:
echo Greetings! | ASET hi=line # puts "Greetings!" into %hi%
One of alternative pure DOS solutions needs the program output to be redirected to the file (named ANSWER.DAT in example below) and then uses a specially prepared batch file. To cite the aforementioned page:
[I]n the batch file we need to be able to issue the command
set MYVAR={the contents of ANSWER.DAT go here}. This is a difficult task, since MS-DOS doesn't offer an easy way to prepend "set MYVAR=" to a file [...]
Normal DOS text files and batch files end all lines with two consecutive bytes: a carriage return (Ctrl-M, hex 0D, or ASCII 13) and a linefeed (Ctrl-J, hex 0A or ASCII 10). In the batch file, you must be able to embed a Ctrl-J in the middle of a line.
Many text editors have a way to do this: via a Ctrl-P followed by Ctrl-J (DOS EDIT with Win95/98, VDE), via a Ctrl-Q prefix (Emacs, PFE), via direct entry with ALT and the numeric keypad (QEdit, Multi-Edit), or via a designated function key (Boxer). Other editors absolutely will not support this (Notepad, Editpad, EDIT from MS-DOS 6.22 or earlier; VIM can insert a linefeed only in binary mode, but not in its normal text mode).
If you can do it, your batch file might look like this:
#echo off
:: assume that the datafile exists already in ANSWER.DAT
echo set myvar=^J | find "set" >PREFIX.DAT
copy PREFIX.DAT+ANSWER.DAT VARIAB.BAT
call VARIAB.BAT
echo Success! The value of myvar is: [%myvar%].
:: erase temp files ...
for %%f in (PREFIX.DAT ANSWER.DAT VARIAB.BAT) do del %%f >NUL
Where you see the ^J on line 3 above, the linefeed should be embedded at that point. Your editor may display it as a square box with an embedded circle.
I have a VS2013 project with a custom build command. In the command script I set an environment variable, and read it out again in the same script. I can confirm by calling set that setting the variable works. However, depending on the variable name, I can't read it out again.
The following works as expected when run as a batch script:
set AVAR=xxx
set ABLAH=xxx
set BBLAH=xxx
set DEV=xxx
set #ABLAH=xxx
echo %AVAR%
echo %ABLAH%
echo %BBLAH%
echo %DEV%
echo %#ABLAH%
But produces the following output in the project:
1> xxx
1> «LAH
1> »LAH
1> ÞV
1> xxx
In this case, the name AVAR works, but many others don't. Also, variables starting with # seem to work. Any idea what is going on?
I've found the solution. Visual Studio (msbuild) converts %XX escape sequences like in URLs. I only expected it to so in URLs, like browsers do. However, it seems to replace them everywhere.
So when it encounters %ABCDE%, it recognizes %AB and inserts the character « = 0xAB, giving «CDE% to the batch interpreter. But if the code is not a valid hexadecimal number, it silently ignores it, and the interpreter sees the right characters. That's why variable names with # at the beginning always worked.
So the solution is to escape at least all % in front valid hex codes 00-FF, better even all of them, with %25.
An easy solution would be to just edit the corresponding commands in the GUI (via project properties), and not directly in the .vcxproj or .props file. This way, VS inserts the correct escape codes. In my case this was not possible since the commands were defined as user macros (Property Pages: Common Properties/User Macros). My commands span multiple lines, but the user macro editor only supports single lines.
Another thing to watch out for is that it not only replaces percent signs. Other symbols have special meaning and have to be replaced, too. (This goes beyond XML entities, like & -> &.) Here is a list of special characters from MSDN. The characters are: % $ # ' ; ? *. It doesn't seem to be necessary to replace all of them all the time, but if you notice funky behavior then this is a thing to look at. You can try to enter these characters through the GUI and see how and if VS escapes them in the project file.
On other character to note especially is the semicolon. If you define a property with unescaped semicolons, like <MyPaths>DirA;DirB</MyPaths>, msbuild/VS will internally convert them to newlines (well, or it splits the property into a list or something). But it will still show the paths as separated with semicolons in the property pages! Except when you click the dropdown button next to a property and select <Edit...>, then it will show the paths as a list or separated by newlines! This is completely invisible most of the time, except when you set a property not in XML or the GUI, but you are reading the output of a command into a property. In this case the command must output newlines, if you want the effect of a semicolon. Otherwise you don't get multiple paths, but one long path with semicolons in it.
Batch files are usually in North American and Western European countries "ASCII" files using an OEM code page like code page 850 (OEM multilingual Latin I) or code page 437 (OEM US) and not code page Windows-1252 as used usually for single byte encoded text files. The code page to use for a batch file depends on local settings for non Unicode files in console. The code page does not matter if just characters with a code value smaller 128 are used in batch file, i.e. the batch file is a real ASCII file.
Therefore make sure that you edit and save the batch file as ASCII file using the right code page and not as Unicode file using UTF-8, UTF-16 Little Endian or UTF-16 Big Endian. Editor of Visual Studio uses by default UTF-8 encoding for the files. This is the wrong encoding for batch files.
Character « has in table of code page 850 the code value 174 decimal (0xAB). In table of code page 1252 code value 174 is for character ® which is an indication that you want to output in batch file characters encoded in UTF-8 (also code value 174 for character ®) or Windows-1252.
A simple batch code for demonstration stored as ANSI file with code page Windows-1252.
#echo off
cls
echo This batch file was saved as ANSI file using code page Windows-1252.
echo.
echo Registered trademark symbol ® has code value 174 in Windows-1252.
echo.
echo But active code page is not Windows 1252 in console window.
echo.
chcp
echo.
echo Therefore the left guillemet character is output instead of registered
echo trademark symbol as this character has in code page 850 code value 174.
echo.
echo Press any key to continue ...
pause>nul
And batch files are for DOS/Windows and should therefore use carriage return + line-feed as line terminator instead of just line-feed (UNIX) or just carriage return (old MAC).
Some text editors display line terminator type and encoding respectively code page somewhere in status bar at bottom of main application window for active file.
So I have a strange question. I have written a script that re-formats data files. I basically create new files with the right column order, spacing, and such. I then unix2dos these files (the program I am formatting these files for is DIPS for windows, and I assume that the files should be ansi). When I go to open the files in the DIPS Program however an error occurs and the file won't open.
When I create the same kind of data file through the DIPS program and open it in note pad, it matches exactly with the data files I have created with my script.
On the other hand if I open the data files that I have created with my script in Kedit first, save them, and then open them in the DIPS program everything works.
My question is what could saving in Kedit possibly do that unix2dos does not?
(Also if I try using note pad or word pad to save instead of Kedit the file doesn't open in DIPS)
Here is what was created using the diff command in unix
"
1,16c1,16
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
--
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
18c18
--
440c440
--
442c442
-1
-1
"
Any help would be appreciated! Thanks!
Okay! Figured it out.
Simply when you unix2dos your file you do not strip any space characters in between the last letter in a line and the line break character. When saving in Kedit you do strip the spaces between the last letter in a line and the line break character.
In my script I had a poor programing practice in which I was writing a string like this;
echo "This is an example string " >> outfile.txt
The character count is 32, and if you could see the break line character (chr(10)) the line would read;
This is an example string
If you unix2dos outfile.txt the line looks the same as above but with a different break line character. However when you place the file into Kedit and save it, now the character count is 25 and the line looks like this;
This is an example string
This occurs because Kedit does not preserve spaces at the end of a line. It places the return or line break character at the last letter or "non space" character in a line.
So programs that read literal input like DIPS (i'm guessing) or more widely used AutoCAD scripting will have a real problem with extra spaces before the return character. Basically in AutoCAD scripting a space in a line is treated as a return character. So if you have ten extra spaces at the end of a line it's treated the same as ten returns instead of the one you probably intended.
OH and if this helped you out or though it was good please give me a vote up!
unix2dos converts the line-break characters at the end of each line, from unix line breaks (10) to dos line breaks (13, 10)
Kedit could possible change the encoding of the file (like from ansi to UTF-8)
You can change the encoding of a file with the iconv utility (on a linux box)