Concatenating files in Windows Command Prompt and the string "ï»¿"

I am concatenating files using Windows. I have used the TYPE and the COPY command and I get the same artifact. At the place where my original files are joined in the new file, the character string "ï»¿" (i.e. Decimal: 139 175 168 Hex: 8BAFA8) is inserted.
How can I troubleshoot this? Is there an easy way to avoid it, and why does it happen?

A very good explanation of why this happens is in Mark Tolonen's answer, so I will not repeat it.
Instead of the obsolete TYPE and COPY commands, one should use PowerShell now:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command gets the content of all files matching a*.txt in the current folder and concatenates them into output.txt using UTF-8.
PowerShell is part of Windows 7 and later.

The extra bytes are a UTF-8 encoding signature. The Unicode byte order mark U+FEFF is encoded in UTF-8 and written to the beginning of the file to indicate the file is encoded in UTF-8. It's not required but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
Today being ignorant of encodings often causes trouble. You can't simply concatenate two text files of unknown encoding...they may be different.
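To make the explanation above concrete, here is a minimal Python sketch showing where the three characters come from and why naive byte concatenation leaves a signature in the middle of the output. The helper name strip_bom and the sample byte strings are illustrative assumptions, not part of any tool mentioned here.

```python
import codecs

# The UTF-8 signature is U+FEFF encoded as the three bytes EF BB BF.
assert codecs.BOM_UTF8 == "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"

# Misreading those bytes as Windows-1252 yields the familiar artifact.
artifact = codecs.BOM_UTF8.decode("cp1252")


def strip_bom(data):
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Two made-up "files", each saved as UTF-8 with a BOM.
first = codecs.BOM_UTF8 + b"one\r\n"
second = codecs.BOM_UTF8 + b"two\r\n"

# Plain byte concatenation keeps the second BOM at the join.
naive = first + second

# Stripping the BOM from every file after the first avoids that.
clean = first + strip_bom(second)
```

In this sketch, naive contains the signature twice while clean contains it only once, at the very start.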
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenate script written in Python that will convert encodings as well.
# cat.py
import sys

if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    # Open the output once so every input is appended to the same file.
    with open(sys.argv[3], 'w', encoding=sys.argv[2]) as fout:
        for file in sys.argv[4:]:
            # Decode each input with the input encoding; re-encode on write.
            with open(file, 'r', encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: utf-8-sig encoding reads files and removes the BOM from the data if present, so it can be used to read any UTF-8 file with or without a BOM. utf-8-sig encoding writes a BOM at the start of a file, but utf-8 does not.
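The side note can be checked with a few lines of Python. This is only a small illustration on an in-memory string (the sample text is made up), not a replacement for the script above.

```python
import codecs

text = "h\u00e9llo"  # sample text with one non-ASCII character

with_bom = text.encode("utf-8-sig")   # writes EF BB BF first
without_bom = text.encode("utf-8")    # no signature

# The only difference between the two encodings is the leading BOM.
assert with_bom == codecs.BOM_UTF8 + without_bom

# utf-8-sig decodes both forms to the same text.
from_bom = with_bom.decode("utf-8-sig")
from_plain = without_bom.decode("utf-8-sig")

# Plain utf-8 keeps the BOM as a U+FEFF character in the decoded data.
kept = with_bom.decode("utf-8")
```

This is why reading with utf-8-sig and writing with utf-8 gives BOM-free output regardless of whether the inputs had a signature.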

Related

Text editor keeps using a wrong file encoding and replaces certain characters with other codes

When I open a certain file in Visual Studio Code on Windows 10, then edit and save it, VSC seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VSC is UTF-8.
Non-corrupted string before saving the file:
“Diff Clang Compiler Log Files”
Corrupted string after saving the file:
�Diff Clang Compiler Log Files�
So, for example, the double quotation mark character “, which in the original file is represented by the byte string 0xE2 0x80 0x9C, is converted into 0xEF 0xBF 0xBD upon saving the file. I do not fully understand what the root cause is, but I do have the following assumption:
The original file is saved using the Windows-1252 encoding (I am using a Win 10 machine with a German keyboard)
VSC incorrectly interprets the file as UTF-8
Character codes get converted from Windows-1252 into UTF-8 once the file is saved, thus 0xE2 0x80 0x9C becomes 0xEF 0xBF 0xBD.
Is my understanding correct?
Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would be really glad to find a way to avoid corrupting my files in the future :-).
Thank you!

The encoding of the file can be detected with the help of the python-magic module:
import magic

FILE_PATH = 'C:\\myPath'

def getFileEncoding(filePath):
    # Read raw bytes so that no decoding happens before detection.
    with open(filePath, 'rb') as f:
        blob = f.read()
    m = magic.Magic(mime_encoding=True)
    fileEncoding = m.from_buffer(blob)
    return fileEncoding

fileEncoding = getFileEncoding(FILE_PATH)
print(f"File Encoding: {fileEncoding}")
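If installing an extra module is not an option, a rough standard-library heuristic is to try a strict UTF-8 decode and fall back to Windows-1252 when it fails. The sketch below assumes only those two candidates are possible, which is a simplification, and guess_encoding is a hypothetical name. It also shows one plausible way the � characters can arise: Windows-1252 bytes read as UTF-8 with replacement and then saved again.

```python
def guess_encoding(data):
    """Heuristic, not a guarantee: bytes that are valid UTF-8 are assumed
    to be UTF-8; anything else is assumed to be Windows-1252."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"


sample = "\u201cDiff\u201d"  # the text “Diff” with curly quotes

# The left curly quote is byte 0x93 in Windows-1252
# but the three bytes E2 80 9C in UTF-8.
cp1252_bytes = sample.encode("cp1252")
utf8_bytes = sample.encode("utf-8")

# Reading Windows-1252 bytes as UTF-8 with replacement, then saving,
# turns each invalid byte into U+FFFD, which is EF BF BD in UTF-8.
damaged = cp1252_bytes.decode("utf-8", errors="replace").encode("utf-8")
```

Note that a pure-ASCII file is valid in both encodings, so the heuristic simply reports utf-8 for it; in that case the distinction does not matter.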

Batch variable being set to ■1 instead of intended output

I'm putting together a script and need to take a file's content as input for setting a variable. I'm using Out-File to produce a text file:
$string | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewLine
Then I am using that file to set a variable in batch:
set /P variablename=<C:\Full\Path\To\file.txt
The content of that file is a unique id string that looks practically like this:
1i32l54bl5b2hlthtl098
When I echo this variable, I get this:
echo %variablename%
■1
When I have tried a different string in the input file, I see that what is being echoed is the ■ character and then the first character in the string. So, if my string was "apfvuu244ty0vh" then it would echo "■a" instead.
Why isn't the variable being set to the content of the file? I'm using the method from this stackoverflow post where the chosen answer says to use this syntax with the set command. Am I doing something wrong? Is there perhaps a problem with using a full path as input to a set variable?

tl;dr:
Use Out-File -Encoding oem to produce files that cmd.exe reads correctly.
This effectively limits you to the 256 characters available in the legacy "ANSI" / OEM code pages, except NUL (0x0). See bottom section if you need full Unicode support.
In Windows PowerShell (but not PowerShell Core), Out-File and its effective alias > default to UTF-16LE character encoding, where most characters are represented as 2-byte sequences; for characters in the ASCII range, the 2nd byte of each sequence is NUL (0x0); additionally, such files start with a BOM that indicates the type of encoding.
By contrast, cmd.exe expects input to use the legacy single-byte OEM encoding (note that starting cmd.exe with /U only controls the encoding of its output).
When cmd.exe (unbeknownst to it) encounters UTF-16LE input:
It interprets the bytes individually as characters (even though characters in UTF-16LE are composed of 2 bytes (typically), or, in rare cases, of 4 (a pair of 2-byte sequences)).
It interprets the 2 bytes that make up the BOM (0xff, 0xfe) as part of the string. With OEM code page 437 (US-English) in effect, 0xff renders like a space, whereas 0xfe renders as ■.
Reading stops once the first NUL (0x0 byte) is encountered, which happens with the 1st character from the ASCII range, which in your sample string is 1.
Therefore, string 1i32l54bl5b2hlthtl098 encoded as UTF-16LE is read as ■1, as you state.
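The three steps above can be mimicked in a few lines of Python. This is only a simulation of the described behavior under the stated assumptions (code page 437, reading stops at the first NUL); it does not call cmd.exe, and as_cmd_sees is a hypothetical helper name.

```python
import codecs


def as_cmd_sees(text, oem="cp437"):
    """Simulate a single-byte OEM reader that stops at the first NUL."""
    # What Out-File writes by default in Windows PowerShell: BOM + UTF-16LE.
    raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    # Keep only the bytes before the first 0x00 byte.
    up_to_nul = raw.split(b"\x00", 1)[0]
    # Interpret each remaining byte as one OEM character.
    return up_to_nul.decode(oem)


seen = as_cmd_sees("1i32l54bl5b2hlthtl098")
```

Here seen consists of three characters: a no-break space (from 0xff), ■ (from 0xfe), and the digit 1, which matches the ■1 shown by echo.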
If you need full Unicode support, use UTF-8 encoding:
Use Out-File -Encoding utf8 in PowerShell.
Before reading the file in cmd.exe (in a batch file), run chcp 65001 in order to switch to the UTF-8 code page.
Caveats:
Not all Unicode chars. may render correctly, depending on the font used in the console window.
Legacy applications may malfunction with code page 65001 in effect, especially on older Windows versions.
A possible strategy to avoid problems is to temporarily switch to code page 65001, as needed, and then switch back.
Note that the above only covers communication via files, and only in one direction (PowerShell -> cmd.exe).
To also control the character encoding used for the standard streams (stdin, stdout, stderr), both when sending strings to cmd.exe / external programs and when interpreting strings received from them, see this answer of mine.

Converting from ANSI to UTF-8 using script

I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post, but the iconv command doesn't seem to be working.
[Screenshot: input file contents in Notepad++]
[Screenshot: first CSV file]
[Screenshot: second CSV file]
EDIT: I tried reducing the problematic input CSV file contents to a few lines (similar to the first file), and now it gets converted fine. Is there something wrong with the file contents itself then? How do I check that?

You can use the Python chardet character encoding detector to determine the file's existing encoding.
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether the file contains any junk characters, since they may cause the conversion to fail.
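One way to look for such junk characters is to scan for bytes that have no meaning in the source encoding. The following is a minimal Python sketch under the assumption that the file really is meant to be Windows-1252; the function names and the sample bytes are made up for illustration.

```python
def find_undecodable(data, encoding="cp1252"):
    """Return the byte offsets that cannot be decoded with `encoding`."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest decodes cleanly
        except UnicodeDecodeError as err:
            offsets.append(pos + err.start)
            pos += err.start + 1  # skip the bad byte and keep scanning
    return offsets


def convert_to_utf8(data, encoding="cp1252"):
    """Re-encode bytes as UTF-8; raises UnicodeDecodeError on junk bytes."""
    return data.decode(encoding).encode("utf-8")


sample = b"caf\xe9,ok\x81x\n"  # 0x81 is unassigned in Python's cp1252 codec
bad = find_undecodable(sample)
good = convert_to_utf8(b"caf\xe9\n")
```

For a real file, the bytes would come from open(path, 'rb').read(); the reported offsets can then be inspected in a hex viewer.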

Encode a movie with Unicode filename in Windows using Popen

I want to encode a movie through IO.popen with Ruby (1.9.3) on Windows 7.
If the file name contains only ASCII characters, encoding proceeds normally.
But with a Unicode filename, the script returns a "No such file or directory" error.
As in the following code:
#-*- encoding: utf-8 -*-
command = "ffmpeg -i ü.rm"
IO.popen(command){|pipe|
  pipe.each{|line|
    p line
  }
}
I couldn't find out whether the problem is caused by ffmpeg or by Ruby.
How can I fix this problem?

Windows doesn't use UTF-8 encoding. Ruby sends the byte sequence of the Unicode filename to the file system directly, and of course the file system won't recognize UTF-8 sequences. It seems a newer version of Ruby has fixed this issue. (I'm not sure. I'm using 1.9.2p290 and the problem is still there.)
You need to convert the UTF-8 filename to the encoding your Windows uses.
# coding: utf-8
code_page = "cp#{`chcp`.chomp[/\d+$/]}" # detect code page automatically.
command = "ffmpeg -i ü.rm".encode(code_page)
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
Another way is to save your script with the same encoding Windows uses. And don't forget to update the encoding declaration. For example, I'm using Simplified Chinese Windows and it uses GBK(CP936) as default encoding:
# coding: GBK
# save this file in GBK
command = "ffmpeg -i ü.rm"
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
BTW, by convention, it is suggested to use do...end for multi-line code blocks rather than {...}, unless in special cases.
UPDATE:
The underlying filesystem, NTFS, uses UTF-16 for file name encoding, so 가 is a valid filename character. However, GBK cannot encode 가, and neither can CP932 on your Japanese Windows. So you cannot send that specific filename to cmd.exe, and it is unlikely that you can process that file with IO.popen. For CP932-compatible filenames, the encoding approach provided above works fine. For filenames that are not compatible with CP932, it might be better to rename them to something compatible.
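Whether a given filename fits a particular code page can be checked up front. The sketch below uses Python rather than Ruby, purely as an illustration of the idea; can_encode is a hypothetical helper and the filenames are examples.

```python
def can_encode(name, code_page):
    """Return True if every character of `name` exists in `code_page`."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# ü is representable in GBK, so the re-encoding approach can work for it.
u_in_gbk = can_encode("\u00fc.rm", "gbk")

# The Hangul syllable 가 (U+AC00) is in neither GBK nor CP932.
ga_in_gbk = can_encode("\uac00.rm", "gbk")
ga_in_cp932 = can_encode("\uac00.rm", "cp932")

# UTF-16, which NTFS uses for names, can represent it.
ga_in_utf16 = can_encode("\uac00.rm", "utf-16-le")
```

A check like this tells you in advance which files would need renaming before they can be passed on a command line in the active code page.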

A file saved via Notepad++ is failing diff comparison in the terminal

I saved a text file in both UTF-8 and ASCII using Notepad++ on Windows. The text, which had the same letter representation as the UNIX version, was reported as completely different by diff (e.g. 1,267c1,267). The files actually were different at the binary level (xxd -b test.txt), but vimdiff gave a different result than vim: it showed them to be identical. I am guessing this is because vimdiff renders the text before diffing the files? Why is there such an inconsistency?

If you use the -b option to diff, it will ignore changes in the amount of whitespace, including whitespace at the end of a line, which covers differences in end-of-line characters. If this doesn't take care of the problem, you can do a closer inspection of the individual files with hd (hexdump) or od -c (octal dump, showing ASCII characters).

Check end-of-line characters in the files you compare. It might be that you've saved them with \r\n at the end of each line while the Unix versions were terminated with \n.
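A quick way to check this without a hex dump is to count the line-ending styles in the raw bytes. This is a minimal Python sketch; count_line_endings is a hypothetical helper and the sample bytes are made up.

```python
def count_line_endings(data):
    """Count CRLF, bare LF, and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"crlf": crlf, "lf": lf, "cr": cr}


windows_style = count_line_endings(b"one\r\ntwo\r\n")
unix_style = count_line_endings(b"one\ntwo\n")
mixed = count_line_endings(b"a\r\nb\nc\r\n")
```

For real files, read them in binary mode (open(path, 'rb').read()) so that Python does not translate the line endings before counting.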
