I'm putting together a script and need to take a file's content as input for setting a variable. I'm using Out-File to produce a text file:
$string | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewLine
Then I am using that file to set a variable in batch:
set /P variablename=<C:\Full\Path\To\file.txt
The content of that file is a unique id string that looks something like this:
1i32l54bl5b2hlthtl098
When I echo this variable, I get this:
echo %variablename%
■1
When I try a different string in the input file, I see that what is echoed is the ■ character followed by the first character of the string. So, if my string were "apfvuu244ty0vh", it would echo "■a" instead.
Why isn't the variable being set to the content of the file? I'm using the method from this Stack Overflow post, where the chosen answer says to use this syntax with the set command. Am I doing something wrong? Is there perhaps a problem with using a full path as input to a set variable?
tl;dr:
Use Out-File -Encoding oem to produce files that cmd.exe reads correctly.
This effectively limits you to the 256 characters available in the legacy "ANSI" / OEM code pages, except NUL (0x0). See bottom section if you need full Unicode support.
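Applied to your command, that could look like this (a minimal sketch reusing the path and sample string from your question; the batch side, set /P variablename=<C:\Full\Path\To\file.txt, stays unchanged):

# Write the id as a single-byte, OEM-encoded file that cmd.exe can read as-is.
$string = '1i32l54bl5b2hlthtl098'
$string | Out-File -FilePath C:\Full\Path\To\file.txt -Encoding oem -NoNewline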
In Windows PowerShell (but not PowerShell Core), Out-File and its effective alias > default to UTF-16LE character encoding, where most characters are represented as 2-byte sequences; for characters in the ASCII range, the 2nd byte of each sequence is NUL (0x0); additionally, such files start with a BOM that indicates the type of encoding.
By contrast, cmd.exe expects input to use the legacy single-byte OEM encoding (note that starting cmd.exe with /U only controls the encoding of its output).
When cmd.exe (unbeknownst to it) encounters UTF-16LE input:
It interprets the bytes individually as characters (even though characters in UTF-16LE are composed of 2 bytes (typically), or, in rare cases, of 4 (a pair of 2-byte sequences)).
It interprets the 2 bytes that make up the BOM (0xff, 0xfe) as part of the string. With OEM code page 437 (US-English) in effect, 0xff renders like a space, whereas 0xfe renders as ■.
Reading stops once the first NUL (0x0 byte) is encountered, which happens with the 1st character from the ASCII range, which in your sample string is 1.
Therefore, string 1i32l54bl5b2hlthtl098 encoded as UTF-16LE is read as ■1, as you state.
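You can see this for yourself by dumping the file's raw bytes; a quick diagnostic sketch (assuming Windows PowerShell 5.x, where Format-Hex is available, and reusing the path from your question):

# Write the sample string with Windows PowerShell's default Out-File encoding (UTF-16LE) ...
'1i32l54bl5b2hlthtl098' | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewline
# ... and dump the raw bytes: FF FE (the BOM), then 31 00 69 00 33 00 ... -
# every ASCII character is followed by a NUL (00) byte.
Format-Hex -Path C:\Full\Path\To\file.txt

cmd.exe reads 0xFF, 0xFE and 0x31 ('1') as three separate characters and then stops at the first 0x00 byte, which is exactly the ■1 (preceded by a space-like character) that you observed.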
If you need full Unicode support, use UTF-8 encoding:
Use Out-File -Encoding utf8 in PowerShell.
Before reading the file in cmd.exe (in a batch file), run chcp 65001 in order to switch to the UTF-8 code page (a combined sketch follows after the caveats below).
Caveats:
Not all Unicode chars. may render correctly, depending on the font used in the console window.
Legacy applications may malfunction with code page 65001 in effect, especially on older Windows versions.
A possible strategy to avoid problems is to temporarily switch to code page 65001, as needed, and then switch back.
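Putting the two steps together, a possible round-trip looks like this (a sketch only: the path is the one from your question, cmd /v:on /c stands in for your batch file so that everything fits into one PowerShell snippet, and delayed expansion (!variablename!) is needed because the chained cmd commands are parsed as a single line):

# PowerShell side: write the value as UTF-8.
# Note: in Windows PowerShell, -Encoding utf8 also writes a BOM; if cmd.exe ends up
# picking the BOM up as part of the value, write the file BOM-less instead
# (e.g. via [System.IO.File]::WriteAllText()).
'1i32l54bl5b2hlthtl098' | Out-File -FilePath C:\Full\Path\To\file.txt -Encoding utf8 -NoNewline

# cmd.exe side: switch to the UTF-8 code page, then read the file into a variable.
cmd /v:on /c 'chcp 65001 >nul & set /P variablename=<C:\Full\Path\To\file.txt & echo !variablename!'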
Note that the above only covers communication via files, and only in one direction (PowerShell -> cmd.exe).
To also control the character encoding used for the standard streams (stdin, stdout, stderr), both when sending strings to cmd.exe / external programs and when interpreting strings received from them, see this answer of mine.
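A commonly used configuration for full UTF-8 support in both directions (shown here only as a hedged sketch, not as a substitute for the linked answer) sets both of the relevant preference points in the PowerShell session:

# Encoding PowerShell uses when piping/sending strings *to* external programs:
$OutputEncoding = [System.Text.Encoding]::UTF8
# Encoding PowerShell uses to decode output received *from* external programs:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8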
Related
In PowerShell (5.1):
When calling an external command (in this case nssm.exe get logstash-service Application), the output is displayed in PowerShell as I would expect (the ASCII string "M:\logstash-7.1.1\bin\logstash.bat"):
PS C:\> M:\nssm-2.24\win64\nssm.exe get logstash-service Application
M:\logstash-7.1.1\bin\logstash.bat
But the following command (which pipes the output into Out-Default) results in:
PS C:\> M:\nssm-2.24\win64\nssm.exe get logstash-service Application | Out-Default
M : \ l o g s t a s h - 7 . 1 . 1 \ b i n \ l o g s t a s h . b a t
(Please note all that "whitespace" separating all characters of the resulting output string)
Also, the following attempt to capture the output (as an ASCII string) into the variable $outCmd results in:
PS C:\> $outCmd = M:\nssm-2.24\win64\nssm.exe get logstash-service Application
PS C:\> $outCmd
M : \ l o g s t a s h - 7 . 1 . 1 \ b i n \ l o g s t a s h . b a t
PS C:\>
Again, please note the separating whitespace between the characters.
Why is there a difference in the output between the first and the latter 2 commands?
Where are the "spaces" (or other kinds of whitespace chars) coming from in the output of the latter 2 commands?
What exactly needs to be done in order to capture the output of that external command as ASCII string "M:\logstash-7.1.1\bin\logstash.bat" (i.e. without the strange spaces in between)?
If the issue is related to ENCODING, please specify what exactly needs to be done/changed.
Yes, the problem is one of character encoding, and the problem often only surfaces when an external program's output is either captured in a variable, sent through the pipeline, or redirected to a file.
Only in these cases does PowerShell get involved and decode the output into .NET strings before any further processing.
This decoding happens based on the encoding stored in the [Console]::OutputEncoding property, so for programs that do not themselves respect this encoding for their output you'll have to set this property to match the actual character encoding used.
Your symptom implies that nssm.exe outputs UTF-16LE-encoded ("Unicode") strings[1], so to capture them properly you'll have to do something like the following:
$orig = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.Encoding]::Unicode
# Store the output lines from nssm.exe in an array of strings.
$output = M:\nssm-2.24\win64\nssm.exe get logstash-service Application
[Console]::OutputEncoding = $orig
The underlying problem is that external programs are expected to use the current console's output code page for their output encoding, which defaults to the system's active legacy OEM code page, as reflected in [Console]::OutputEncoding (and reported by chcp), but some do not, in an attempt to:
either: overcome the limitations of the legacy, single-byte OEM encodings in order to provide full Unicode support (as is the case here, although it is more common to do that with UTF-8 encoding, as the Node.js CLI, node.exe does, for instance)
or: use the more widely used active ANSI legacy code page instead (as python does by default).
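For a program known to emit UTF-8 (such as node.exe, mentioned above), the same capture pattern applies with a different encoding; an illustrative sketch (the node command here is made up purely for demonstration):

$orig = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8   # match the program's actual output encoding
try {
  $output = node -e "console.log('Grüße')"   # hypothetical UTF-8-emitting call
} finally {
  [Console]::OutputEncoding = $orig          # always restore the original setting
}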
See this answer for additional information, which also links to two helper functions:
Invoke-WithEncoding, which wraps capturing output from an external program with a given encoding (see example below), and Debug-NativeInOutput, for diagnosing what encoding a given external program uses.
With function Invoke-WithEncoding from the linked answer defined, you could then call:
$output = Invoke-WithEncoding -Encoding Unicode {
  M:\nssm-2.24\win64\nssm.exe get logstash-service Application
}
[1] The apparent spaces in the output are actually NUL characters (code point 0x0) that stem from the 0x0 high bytes of UTF-16LE code units in the 8-bit range (which includes all ASCII characters and most of Windows-1252): because PowerShell, based on the single-byte OEM code page stored in [Console]::OutputEncoding (e.g., 437 on US-English systems), interprets each byte as a whole character, the 0x0 high bytes of the 2-byte (16-bit) UTF-16LE code units are mistakenly retained as NUL characters, and in the console these characters render like spaces.
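If you want to verify that for yourself, you can inspect the code points of the captured string; a quick check (using the $outCmd variable from the question, assumed to hold a single output line) might look like this:

# Print the code point of each character: with the mis-decoded output,
# every other value is 0 (NUL) rather than 32 (a real space).
[int[]] [char[]] $outCmd | Select-Object -First 10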
I am concatenating files using Windows. I have used the TYPE and the COPY command and I get the same artifact. At the place where my original files are joined in the new file, the character string "ï»¿" (i.e. Decimal: 239 187 191, Hex: EF BB BF) is inserted.
How can I troubleshoot this? Is there an easy explanation you can provide for how to avoid this. And why does this happen?
A very good explanation of why this happens is in Mark Tolonen's answer, so I will not repeat it.
Instead of the legacy TYPE and COPY commands, you can use PowerShell:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command gets the content of all files matching a*.txt in the current folder and concatenates them into the output.txt file using UTF-8.
PowerShell is part of Windows 7 and later.
The extra bytes are a UTF-8 encoding signature. The Unicode byte order mark U+FEFF is encoded in UTF-8 and written to the beginning of the file to indicate the file is encoded in UTF-8. It's not required but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
Today, being ignorant of encodings often causes trouble. You can't simply concatenate two text files of unknown encoding - they may use different encodings.
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenate script written in Python that will convert encodings as well.
# cat.py
import sys

if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    # Open the output file with the target encoding ...
    with open(sys.argv[3], 'w', encoding=sys.argv[2]) as fout:
        for file in sys.argv[4:]:
            # ... and re-encode each input file into it.
            with open(file, 'r', encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: utf-8-sig encoding reads files and removes the BOM from the data if present, so it can be used to read any UTF-8 file with or without a BOM. utf-8-sig encoding writes a BOM at the start of a file, but utf-8 does not.
I have a source csv file which is quite big and in order to be able to work more efficiently with it I decided to split it into smaller file chunks. In order to do that, I execute the following script:
Get-Content C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | %{$i++; $_ | Out-File C:\Users\me\Desktop\Processed\splitfile_$i.csv}
As you can see, these are csv files which contain alphanumeric data. So, I have an issue with strings similar to this one:
Hämeenkatu 33
In the target file it looks like this:
HÃ¤meenkatu 33
I've tried to determine the encoding of the source file and it is UTF-8 (as described here). I am really wondering why it gets so messed up in the target. I've also tried the following to explicitly tell that I want the encoding to be UTF8 but without success:
Get-Content C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | %{$i++; $_ | Out-File -Encoding "UTF8" C:\Users\me\Desktop\Processed\splitfile_$i.csv}
I am using a Windows machine running Windows 10.
Does the input file have a BOM? Try Get-Content -Encoding utf8. Out-File defaults to UTF-16LE, which Windows and PowerShell call "Unicode".
Get-Content -encoding utf8 C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 |
%{$i++; $_ |
Out-File -encoding utf8 C:\Users\me\Desktop\Processed\splitfile_$i.csv}
The output file will have a BOM unless you use PowerShell 6 or 7.
js2010's answer provides an effective solution; let me complement it with background information (a summary of the case at hand is at the bottom):
Fundamentally, PowerShell never preserves the character encoding of a [text] input file on output:
On reading, file content is decoded into .NET strings (which are internally UTF-16 code units):
Files with a BOM for the following encodings are always correctly recognized (identifiers recognized by the -Encoding parameter of PowerShell's cmdlets in parentheses):
UTF-8 (UTF8) - info
UTF-16LE (Unicode) / UTF-16BE (BigEndianUnicode) - info
UTF-32LE (UTF32) / UTF-32BE (BigEndianUTF32) - info
Note the absence of UTF-7, which, however, is rarely used as an encoding in practice.
Without a BOM, a default encoding is assumed:
PowerShell [Core] v6+ commendably assumes UTF-8.
The legacy Windows PowerShell (PowerShell up to v5.1) assumes ANSI encoding, i.e. the code page determined by the legacy system locale; e.g., Windows-1252 on US-English systems.
The -Encoding parameter of file-reading cmdlets allows you to specify the source encoding explicitly, but note that the presence of a (supported) BOM overrides this - see below for what encodings are supported.
On writing, .NET strings are encoded based on a default encoding, unless an encoding is explicitly specified with -Encoding (the .NET strings created on reading carry no information about the encoding of the original input file, so it cannot be preserved):
PowerShell [Core] v6+ commendably uses BOM-less UTF-8.
The legacy Windows PowerShell (PowerShell up to v5.1) regrettably uses various default encodings, depending on the specific cmdlet / operator used.
Notably, Set-Content defaults to ANSI (as for reading), and Out-File / > defaults to UTF-16LE (a small demonstration follows at the end of this list).
See this answer for the full picture.
As noted in js2010's answer, using -Encoding UTF8 in Windows PowerShell invariably creates files with a BOM, which can be problematic for files read by tools on Unix-like platforms / tools with a Unix heritage, which are often not equipped to deal with such a BOM.
See the answers to this question for how to create BOM-less UTF-8 files in Windows PowerShell.
As with reading, the -Encoding parameter of file-writing cmdlets allows you to specify the output encoding explicitly:
Note that in PowerShell [Core] v6+, in addition to its defaulting to BOM-less UTF-8, -Encoding UTF8 too refers to the BOM-less variant (unlike in Windows PowerShell), and there you must use -Encoding UTF8BOM in order to create a file with BOM.
Curiously, as of PowerShell [Core] v7.0, there is no -Encoding value for the system's active ANSI code page, i.e. for Windows PowerShell's default (in Windows PowerShell, -Encoding Default explicitly requests ANSI encoding, but in PowerShell [Core] this refers to BOM-less UTF-8). This problematic omission is discussed in this GitHub issue. By contrast, targeting the active OEM code page with -Encoding OEM still works.
In order to create UTF-32BE files, Windows PowerShell requires identifier BigEndianUtf32; due to a bug in PowerShell [Core] as of v7.0, this identifier isn't supported, but you can use UTF-32BE instead.
Windows PowerShell is limited to those encodings listed in the Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding enumeration, but PowerShell [Core] allows you to pass any of the supported .NET encodings to the -Encoding parameter, either by code-page number (e.g., 1252) or by encoding name (e.g., windows-1252); [Text.Encoding]::GetEncodings().CodePage and [Text.Encoding]::GetEncodings().Name enumerate them in principle, but note that due to lack of .NET Core API support as of v7.0 this enumeration lists only a small subset of the actually supported encodings; running these commands in Windows PowerShell will show them all.
You can create UTF-7 files (UTF7), but they won't have a BOM; even input files that do have one aren't automatically recognized on reading, so specifying -Encoding UTF7 is always necessary for reading UTF-7 files.
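To make the Windows PowerShell discrepancy mentioned above concrete, here is a small demonstration (file names are illustrative; run in Windows PowerShell 5.1):

# The same string, written with the two cmdlets' default encodings:
'héllo' | Set-Content C:\Temp\sc.txt   # ANSI (e.g. Windows-1252): 68 E9 6C 6C 6F 0D 0A
'héllo' | Out-File     C:\Temp\of.txt   # UTF-16LE with BOM:        FF FE 68 00 E9 00 ...
Format-Hex C:\Temp\sc.txt
Format-Hex C:\Temp\of.txt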
In short:
In PowerShell, you have to know an input file's encoding in order to match that encoding on writing, and specify that encoding explicitly via the -Encoding parameter (if it differs from the default).
Get-Content (without -Encoding) provides no information as to what encoding it detected via a BOM or which one it assumed in the absence of a BOM.
If needed, you can perform your own analysis of the opening bytes of a text file to look for a BOM, but note that in the absence of one you'll have to rely on heuristics to infer the encoding - that is, you can make a reasonable guess, but you cannot be certain (see the sketch after this list).
Also note that PowerShell, as of v7, fundamentally lacks support for passing raw byte streams through the pipeline - see this answer.
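If you do want to perform that analysis yourself, a minimal BOM check might look like this sketch (Windows PowerShell syntax; in PowerShell [Core] v6+ use -AsByteStream instead of -Encoding Byte; UTF-32 BOMs are omitted for brevity):

# Read the first few bytes of the file.
$bytes = Get-Content -Encoding Byte -TotalCount 3 C:\Users\me\Desktop\savedDataframe.csv

# Compare them against the well-known BOM signatures.
if ($bytes.Count -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) { 'UTF-8 with BOM' }
elseif ($bytes.Count -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) { 'UTF-16LE' }
elseif ($bytes.Count -ge 2 -and $bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) { 'UTF-16BE' }
else { 'No BOM - the encoding has to be guessed (heuristics)' }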
Your particular case:
Your problem was that your input file was UTF-8-encoded, but didn't have a BOM (which is actually preferable for the widest compatibility).
Since you're using Windows PowerShell, which misinterprets such files as ANSI-encoded, you need to tell it to read the file as UTF-8 with -Encoding Utf8.
As stated, on writing, -Encoding Utf8 inevitably creates a file with a BOM in Windows PowerShell; if that is a concern, use the .NET Framework directly to produce a BOM-less file, as shown in the answers to this question and in the sketch at the end of this answer.
Note that you would have had no problem with your original command in PowerShell [Core] v6+ - it defaults to BOM-less UTF-8 both on reading and writing, across all cmdlets.
This sensible, standardized default alone is a good reason for considering the move to PowerShell v7.0, which aims to be a superior replacement for the legacy Windows PowerShell.
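One common way to write BOM-less UTF-8 from Windows PowerShell is to bypass Out-File and call the .NET API directly; a sketch adapted to the splitting loop from the question (the UTF8Encoding instance constructed with $false explicitly requests no BOM):

$i = 0
Get-Content -Encoding UTF8 C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | ForEach-Object {
  $i++
  # File.WriteAllLines() writes UTF-8 without a BOM when given UTF8Encoding($false);
  # note that it requires a full, absolute path.
  [System.IO.File]::WriteAllLines(
    "C:\Users\me\Desktop\Processed\splitfile_$i.csv",
    [string[]] $_,
    [System.Text.UTF8Encoding]::new($false)
  )
}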
I have a VS2013 project with a custom build command. In the command script I set an environment variable, and read it out again in the same script. I can confirm by calling set that setting the variable works. However, depending on the variable name, I can't read it out again.
The following works as expected when run as a batch script:
set AVAR=xxx
set ABLAH=xxx
set BBLAH=xxx
set DEV=xxx
set #ABLAH=xxx
echo %AVAR%
echo %ABLAH%
echo %BBLAH%
echo %DEV%
echo %#ABLAH%
But produces the following output in the project:
1> xxx
1> «LAH
1> »LAH
1> ÞV
1> xxx
In this case, the name AVAR works, but many others don't. Also, variables starting with # seem to work. Any idea what is going on?
I've found the solution. Visual Studio (MSBuild) converts %XX escape sequences as in URLs. I only expected it to do so in URLs, like browsers do. However, it seems to replace them everywhere.
So when it encounters %ABCDE%, it recognizes %AB and inserts the character « = 0xAB, giving «CDE% to the batch interpreter. But if the code is not a valid hexadecimal number, it silently ignores it, and the interpreter sees the right characters. That's why variable names with # at the beginning always worked.
So the solution is to escape, with %25, at least every % that precedes a valid hex code 00-FF, or better yet all of them.
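If the command you need to paste into the project file is long, a small helper along these lines (a sketch; the variable names are made up) can do the escaping for you:

# Escape every literal % as %25 so MSBuild passes it through unchanged
# to the batch interpreter.
$rawCommand     = "set ABLAH=xxx`r`necho %ABLAH%"
$escapedCommand = $rawCommand -replace '%', '%25'
$escapedCommand   # -> echo %25ABLAH%25, which MSBuild decodes back to echo %ABLAH%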
An easy solution would be to just edit the corresponding commands in the GUI (via project properties), and not directly in the .vcxproj or .props file. This way, VS inserts the correct escape codes. In my case this was not possible since the commands were defined as user macros (Property Pages: Common Properties/User Macros). My commands span multiple lines, but the user macro editor only supports single lines.
Another thing to watch out for is that percent signs are not the only characters affected. Other symbols have special meaning and have to be replaced, too. (This goes beyond XML entities, like &amp; -> &.) Here is a list of special characters from MSDN. The characters are: % $ # ' ; ? *. It doesn't seem to be necessary to replace all of them all the time, but if you notice funky behavior then this is a thing to look at. You can try to enter these characters through the GUI and see how and if VS escapes them in the project file.
One other character to note especially is the semicolon. If you define a property with unescaped semicolons, like <MyPaths>DirA;DirB</MyPaths>, MSBuild/VS will internally convert them to newlines (well, or it splits the property into a list or something). But it will still show the paths as separated by semicolons in the property pages! Except when you click the dropdown button next to a property and select <Edit...>; then it will show the paths as a list or separated by newlines! This is completely invisible most of the time, except when you are not setting a property in XML or in the GUI, but reading the output of a command into a property. In that case the command must output newlines if you want the effect of a semicolon; otherwise you don't get multiple paths, but one long path with semicolons in it.
In North American and Western European countries, batch files are usually "ASCII" files using an OEM code page like code page 850 (OEM Multilingual Latin I) or code page 437 (OEM US), not code page Windows-1252 as is usually used for single-byte encoded text files. The code page to use for a batch file depends on the local settings for non-Unicode files in the console. The code page does not matter if only characters with a code value below 128 are used in the batch file, i.e. if the batch file is a true ASCII file.
Therefore, make sure that you edit and save the batch file as an ASCII file using the right code page, and not as a Unicode file using UTF-8, UTF-16 Little Endian or UTF-16 Big Endian. The Visual Studio editor uses UTF-8 encoding for files by default, which is the wrong encoding for batch files.
The character « has the code value 174 decimal (0xAE) in the table of code page 850. In the table of code page Windows-1252, code value 174 is the character ®, which is an indication that you want to output characters from the batch file that are encoded in UTF-8 or Windows-1252 (Unicode code point 174, U+00AE, is also ®).
A simple batch file for demonstration, stored as an ANSI file with code page Windows-1252:
@echo off
cls
echo This batch file was saved as ANSI file using code page Windows-1252.
echo.
echo Registered trademark symbol ® has code value 174 in Windows-1252.
echo.
echo But active code page is not Windows 1252 in console window.
echo.
chcp
echo.
echo Therefore the left guillemet character is output instead of registered
echo trademark symbol as this character has in code page 850 code value 174.
echo.
echo Press any key to continue ...
pause>nul
And batch files are for DOS/Windows and should therefore use carriage return + line feed as the line terminator, instead of just line feed (UNIX) or just carriage return (old Mac).
Some text editors display the line terminator type and the encoding (or code page) of the active file somewhere in the status bar at the bottom of the main application window.
Strange character 'ÿ' in text output (should have been a space). Why is this, and how can I fix it? It does not happen when the command is executed at the prompt, only when the output is piped to a text file.
Windows 7
C:\>tasklist > text.txt
outputs:
System 4 Services 0 1ÿ508 K
smss.exe 312 Services 0 1ÿ384 K
csrss.exe 492 Services 0 5ÿ052 K
The "space" you could see in the console window was not the standard space character with the ASCII code of 32 (0x20), but the non-breaking space with the ASCII code of 255 (0xFF) in probably most OEM code pages.
After redirecting the output to a file, you likely opened the file in an editor that by default used a different code page to display the contents, possibly Windows-1252 since the character with the code of 255 is ÿ in Windows-1252.
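You can reproduce the two interpretations of that byte directly in PowerShell (a quick sketch; code page 850 is just one common OEM example, and this assumes Windows PowerShell, where the legacy code pages are available by default):

$byte = [byte[]] @(0xFF)
[System.Text.Encoding]::GetEncoding(850).GetString($byte)    # OEM 850      -> non-breaking space
[System.Text.Encoding]::GetEncoding(1252).GetString($byte)   # Windows-1252 -> 'ÿ'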
Andriy was right.
I added
chcp 1252
at the beginning of my batch file, and all the weird characters were correctly translated into spaces in the output file.