I have some text files with different encodings. Some of them are UTF-8 and some others are windows-1251 encoded. I tried to execute the following recursive script to re-encode them all as UTF-8.
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $content = $_ | Get-Content
    Set-Content -PassThru $_.FullName $content -Encoding UTF8 -Force
}
After that I am unable to use the files in my Java program: the files that were already UTF-8 encoded now have the wrong encoding and I couldn't get the original text back, and the windows-1251-encoded files still produce empty output, just as the originals did. So the script corrupts the files that were already UTF-8 encoded.
I found another solution, iconv, but as far as I can see it needs the current encoding as a parameter:
$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile
Differently encoded files are mixed throughout a folder structure, so the files should stay at their current paths.
System uses Code page 852.
Existing UTF-8 files are without BOM.
In Windows PowerShell you won't be able to use the built-in cmdlets for two reasons:
From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.
Using Set-Content (and similar) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark), which Java and, more generally, Unix-heritage utilities don't understand.
Note: PowerShell Core actually defaults to BOM-less UTF8 and also allows you to pass any available [System.Text.Encoding] instance to the -Encoding parameter, so you could solve your problem with the built-in cmdlets there, while needing direct use of the .NET framework only to construct an encoding instance.
You must therefore use the .NET framework directly:
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $file = $_.FullName
    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    try {
        $content = [IO.File]::ReadAllText($file, [Text.UTF8Encoding]::new($false, $true))
    } catch [System.Text.DecoderFallbackException] {
        # Fall back to Windows-1251
        $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
        $mustReWrite = $true
    }
    # Rewrite as UTF-8 without a BOM (the .NET Framework's default)
    if ($mustReWrite) {
        Write-Verbose "Converting from 1251 to UTF-8: $file"
        [IO.File]::WriteAllText($file, $content)
    } else {
        Write-Verbose "Already UTF-8-encoded: $file"
    }
}
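Note that the Write-Verbose messages above only show up if verbose output is enabled; for an ad-hoc run you could, for example, turn it on for the session:
$VerbosePreference = 'Continue'          # show Write-Verbose output in this session
# ... run the conversion loop above ...
$VerbosePreference = 'SilentlyContinue'  # restore the default afterwards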
Note: As in your own attempt, the above solution reads each file into memory as a whole, but that could be changed.
Note:
If an input file comprises only bytes with ASCII-range characters (7-bit), it is by definition also UTF-8-encoded, because UTF-8 is a superset of ASCII encoding.
It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could be a valid UTF-8 file as well, if the bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules around what bit patterns are allowed where).
Such a file would not contain meaningful Windows-1251 content, however.
There is no reason to implement a fallback strategy for decoding with Windows-1251, however, because there are no technical restrictions on what bit patterns can occur where.
Generally, in the absence of external information (or a BOM), there's no simple and no robust way to infer a file's encoding just from its content (though heuristics can be employed).
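To illustrate the first note above, a quick way to test whether a given file is pure 7-bit ASCII, and therefore already valid UTF-8, might look like this ($file stands for a full file path, as in the loop above):
# A file is pure 7-bit ASCII if no byte has a value above 0x7F.
$bytes = [IO.File]::ReadAllBytes($file)
$isAscii = @($bytes | Where-Object { $_ -gt 0x7F }).Count -eq 0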
We have SMB shares that are used by Windows and Mac clients. We want to move some data to Sharepoint, but need to validate the filenames against characters that are not allowed in Windows. Although Windows users wouldn't be able to create files with illegal characters anyway, Mac users are still able to create files with characters that are illegal in Windows.
The problem is that for files with illegal characters in their names, Windows/Powershell substitutes those characters with codepoints from the Unicode Private Use Area, and the substituted codepoint varies by input character.
$testfolder = "\\server\test\test*dir" # created from a Mac
$item = get-item -path $testfolder
$item.Name # test▯dir
$char = $($item.Name)[4] # ▯
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes($char) # 240:33
$unicode = [System.BitConverter]::ToString($bytes) # F0-21
For a file with name pipe|, the above code produces the output F0-27, so it's not simply a generic "invalid" character.
How can I check filenames for invalid values when I can't actually get the values??
As often happens, in trying to formulate my question as precisely as possible, I came upon a solution. I would still love any other answers for how this could be tackled more elegantly, but since I didn't find any other resources with this information, I'm providing my solution here in hopes it might help others with this same problem.
Invalid Characters Map to Specific Codepoints
Note: I'm extrapolating all of this from observations I've made. I'm happy for someone to comment or provide an alternative answer that is more complete or correct.
There is a certain set of characters that are invalid for Windows file names, but this is a restriction of the OS, NOT the filesystem. This means that it's possible to set a filename on an SMB share that is valid on another OS (e.g. MacOS) but not on Windows. When Windows encounters such a file, the invalid characters are shadowed by a set of proxy unicode codepoints, which allows Windows to interact with the files without renaming them. These codepoints are in the unicode Private Use Area, which covers 0xE000-0xF8FF. Since these codepoints are not mapped to printable characters, Powershell displays them all as ▯ (U+25AF). In my specific use case, I need to run a report of what invalid characters are present in a filename, so this generic character message is not helpful.
Through experimentation, I was able to determine the proxy codepoints for each of the printable restricted characters. I've included them below for reference (note: YMMV on this, I haven't tested it on multiple systems, but I suspect it's consistent between versions).
Character            Unicode
"                    0xF020
*                    0xF021
/                    0xF022
<                    0xF023
>                    0xF024
?                    0xF025
\                    0xF026
|                    0xF027
(trailing space)     0xF028
: is not allowed in filenames on any system I have easy access to, so I wasn't able to test that one.
Testing names in Powershell
Now that we know this, it's pretty simple to tackle in powershell. I created a hashtable with all of the proxy unicode points as keys and the "real" characters as values, which we can then use as a lookup table. I chose to replace the characters in the filename string before testing the name. This makes debugging easier.
#Set up regex for invalid characters
$invalid = [Regex]::new('^\s|[\"\*\:<>?\/\\\|]|\s$')
#Create lookup table for unicode values
$charmap = @{
    [char]0xF020 = '"'
    [char]0xF021 = '*'
    [char]0xF022 = '/'
    [char]0xF023 = '<'
    [char]0xF024 = '>'
    [char]0xF025 = '?'
    [char]0xF026 = '\'
    [char]0xF027 = '|'
    [char]0xF028 = ' '
}
Get-ChildItem -Path "\\path\to\folder" -Recurse | ForEach-Object {
    # Get the filename
    $fixedname = Split-Path -Path $_.FullName -Leaf
    # Iterate through the hashtable and replace all the proxy characters with printable versions
    foreach ($key in $charmap.GetEnumerator()) {
        $fixedname = $fixedname.Replace($key.Name, $key.Value)
    }
    # Build a list of invalid characters to include in report (not shown here)
    $invalidmatches = $invalid.Matches($fixedname)
    if ($invalidmatches.Count -gt 0) {
        $invalidchars = $($invalidmatches | ForEach-Object {
            if ($_.Value -eq ' ') { "Leading or trailing space" } else { $_.Value } }) -join ", "
    }
}
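Purely as a hypothetical illustration of the reporting step the comment above leaves out (not part of the original script), the end of the loop body could emit one object per offending file, for example:
# Hypothetical report step: emit an object for each file that contains invalid characters.
if ($invalidmatches.Count -gt 0) {
    [PSCustomObject]@{
        Path         = $_.FullName
        InvalidChars = $invalidchars
    }
}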
Extending the solution
In theory, you could also extend this to cover other prohibited characters, such as the ASCII control characters. Since these proxy unicode points are in the PUA, and there is no documentation on how this is handled (as far as I know), discovering these associations is down to experimentation. I'm content to stop here, as I have run through all of the characters that are easily put in filenames by users on MacOS systems.
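If you do want to experiment further, a minimal sketch along these lines (the share path is a placeholder) lists every Private Use Area codepoint that appears in a file name, so you can map each one back to whatever character produced it:
# List any Private Use Area codepoints (0xE000-0xF8FF) found in file names.
Get-ChildItem -Path "\\path\to\folder" -Recurse | ForEach-Object {
    foreach ($ch in $_.Name.ToCharArray()) {
        $cp = [int]$ch
        if ($cp -ge 0xE000 -and $cp -le 0xF8FF) {
            '{0}: U+{1:X4}' -f $_.FullName, $cp
        }
    }
}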
I am trying to create a script that converts the encoding of a collection of CSV files (10-20 files) in a directory into UTF-8 encoding. Currently, I am doing this manually by opening each individual file in Notepad++, switching the encoding to UTF-8, and re-saving.
Are there any Windows commands or something else (I have Cygwin installed as well) that I could use to build a script to do this? Ideally, I would like the script to loop through every CSV file in the directory and convert each one to UTF-8.
Thank you in advance for the help!!!
You're not specifying what to convert from, but assuming the input encoding is Windows-1252, try
for file in *.csv; do
    iconv -f windows-1252 -t utf-8 <"$file" >"$file.tmp" &&
    mv "$file.tmp" "$file"
done
This could leave some files unconverted (for example, if the input file contains bytes which are undefined in the source encoding) but will not overwrite the source file in this scenario. (Maybe disable the mv logic until you can see whether it works without errors.)
You can easily do that in PowerShell
Get-Content filename.csv | Set-Content -Encoding utf8 filename-utf8.csv
For your loop, you need to modularize your commands so that you can reference and call them properly. In your case, you need to call BaseName and append the new extension to it. After that, simply using the right variables in the right places in the ForEach loop will make it work.
$a = Get-ChildItem
ForEach ($item in $a) {
    Get-Content $item.FullName | Set-Content -Encoding utf8 "$($item.Basename).csv.utf8"
}
Remember that before PowerShell 6, -Encoding utf8 includes a BOM (byte-order mark): three bytes placed at the beginning of the file in the conversion.
The conversion needs to create an additional file, which you can later move over the original to replace it, for example with Move-Item.
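A rough sketch of that replace-in-place pattern (the *.csv filter is an assumption; in Windows PowerShell the output will still carry a UTF-8 BOM, as noted above):
# Write the converted content to a temporary file, then replace the original.
Get-ChildItem *.csv | ForEach-Object {
    $tmp = "$($_.FullName).tmp"
    Get-Content $_.FullName | Set-Content -Encoding utf8 $tmp
    Move-Item -Force $tmp $_.FullName
}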
I have a source csv file which is quite big and in order to be able to work more efficiently with it I decided to split it into smaller file chunks. In order to do that, I execute the following script:
Get-Content C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | %{$i++; $_ | Out-File C:\Users\me\Desktop\Processed\splitfile_$i.csv}
As you can see, these are csv files which contain alphanumeric data. So, I have an issue with strings similar to this one:
Hämeenkatu 33
In the target file it looks like this:
Hämeenkatu 33
I've tried to determine the encoding of the source file and it is UTF-8 (as described here). I am really wondering why it gets so messed up in the target. I've also tried the following to explicitly specify that I want the encoding to be UTF-8, but without success:
Get-Content C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | %{$i++; $_ | Out-File -Encoding "UTF8" C:\Users\me\Desktop\Processed\splitfile_$i.csv}
I am using a Windows machine running Windows 10.
Does the input file have a BOM? Try Get-Content -Encoding utf8. Out-File defaults to UTF-16LE, or what Windows and PowerShell call "Unicode".
Get-Content -encoding utf8 C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 |
%{$i++; $_ |
Out-File -encoding utf8 C:\Users\me\Desktop\Processed\splitfile_$i.csv}
The output file will have a BOM unless you use PowerShell 6 or 7.
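To answer the BOM question for yourself, one quick check (a sketch, using the path from the question) is to look at the file's first three bytes; EF BB BF indicates a UTF-8 BOM:
# Inspect the first three bytes of the file; EF BB BF = UTF-8 BOM.
$bytes = [IO.File]::ReadAllBytes('C:\Users\me\Desktop\savedDataframe.csv')[0..2]
($bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ' '   # "EF BB BF" if a BOM is present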
js2010's answer provides an effective solution; let me complement it with background information (a summary of the case at hand is at the bottom):
Fundamentally, PowerShell never preserves the character encoding of a [text] input file on output:
On reading, file content is decoded into .NET strings (which are internally UTF-16 code units):
Files with a BOM for the following encodings are always correctly recognized (identifiers recognized by the -Encoding parameter of PowerShell's cmdlets in parentheses):
UTF-8 (UTF8)
UTF-16LE (Unicode) / UTF-16BE (BigEndianUnicode)
UTF-32LE (UTF32) / UTF-32BE (BigEndianUTF32)
Note the absence of UTF-7, which, however, is rarely used as an encoding in practice.
Without a BOM, a default encoding is assumed:
PowerShell [Core] v6+ commendably assumes UTF-8.
The legacy Windows PowerShell (PowerShell up to v5.1) assumes ANSI encoding, i.e., the code page determined by the legacy system locale; e.g., Windows-1252 on US-English systems.
The -Encoding parameter of file-reading cmdlets allows you to specify the source encoding explicitly, but note that the presence of a (supported) BOM overrides this - see below for what encodings are supported.
On writing, .NET strings are encoded based on a default encoding, unless an encoding is explicitly specified with -Encoding (the .NET strings created on reading carry no information about the encoding of the original input file, so it cannot be preserved):
PowerShell [Core] v6+ commendably uses BOM-less UTF-8.
The legacy Windows PowerShell (PowerShell up to v5.1) regrettably uses various default encodings, depending on the specific cmdlet / operator used.
Notably, Set-Content defaults to ANSI (as for reading), and Out-File / > defaults to UTF-16LE.
See this answer for the full picture.
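A quick way to see these Windows PowerShell defaults in action (a sketch; assumes a system whose ANSI code page is Windows-1252 and writes two scratch files in the current directory):
# Compare the bytes that Set-Content and Out-File produce in Windows PowerShell 5.1.
'ä' | Set-Content t1.txt   # ANSI: E4 0D 0A on a Windows-1252 system
'ä' | Out-File t2.txt      # UTF-16LE with BOM: FF FE E4 00 0D 00 0A 00
[IO.File]::ReadAllBytes("$PWD\t1.txt") | ForEach-Object { '{0:X2}' -f $_ }
[IO.File]::ReadAllBytes("$PWD\t2.txt") | ForEach-Object { '{0:X2}' -f $_ }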
As noted in js2010's answer, using -Encoding UTF8 in Windows PowerShell invariably creates files with a BOM, which can be problematic for files read by tools on Unix-like platforms / tools with a Unix heritage, which are often not equipped to deal with such a BOM.
See the answers to this question for how to create BOM-less UTF-8 files in Windows PowerShell.
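One common approach there (a sketch; $path and $lines are placeholders for a full file path and the text to write) is to construct a BOM-less UTF8Encoding instance and pass it to the .NET file APIs:
# Write BOM-less UTF-8 from Windows PowerShell via .NET directly.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false   # $false = do not emit a BOM
[IO.File]::WriteAllLines($path, $lines, $utf8NoBom)        # $path must be a full path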
As with reading, the -Encoding parameter of file-writing cmdlets allows you to specify the output encoding explicitly:
Note that in PowerShell [Core] v6+, in addition to its defaulting to BOM-less UTF-8, -Encoding UTF8 too refers to the BOM-less variant (unlike in Windows PowerShell), and there you must use -Encoding UTF8BOM in order to create a file with BOM.
Curiously, as of PowerShell [Core] v7.0, there is no -Encoding value for the system's active ANSI code page, i.e. for Windows PowerShell's default (in Windows PowerShell, -Encoding Default explicitly requests ANSI encoding, but in PowerShell [Core] this refers to BOM-less UTF-8). This problematic omission is discussed in this GitHub issue. By contrast, targeting the active OEM code page with -Encoding OEM still works.
In order to create UTF-32BE files, Windows PowerShell requires identifier BigEndianUtf32; due to a bug in PowerShell [Core] as of v7.0, this identifier isn't supported, but you can use UTF-32BE instead.
Windows PowerShell is limited to those encodings listed in the Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding enumeration, but PowerShell [Core] allows you to pass any of the supported .NET encodings to the -Encoding parameter, either by code-page number (e.g., 1252) or by encoding name (e.g., windows-1252). [Text.Encoding]::GetEncodings().CodePage and [Text.Encoding]::GetEncodings().Name enumerate them in principle, but note that, due to lack of .NET Core API support as of v7.0, this enumeration lists only a small subset of the actually supported encodings; running these commands in Windows PowerShell will show them all.
You can create UTF-7 files (UTF7), but they won't have a BOM; even input files that do have one aren't automatically recognized on reading, so specifying -Encoding UTF7 is always necessary for reading UTF-7 files.
In short:
In PowerShell, you have to know an input file's encoding in order to match that encoding on writing, and specify that encoding explicitly via the -Encoding parameter (if it differs from the default).
Get-Content (without -Encoding) provides no information as to what encoding it detected via a BOM or which one it assumed in the absence of a BOM.
If needed, you can perform your own analysis of the opening bytes of a text file to look for a BOM, but note that in the absence of one you'll have to rely on heuristics to infer the encoding - that is, you can make a reasonable guess, but you cannot be certain.
Also note that PowerShell, as of v7, fundamentally lacks support for passing raw byte streams through the pipeline - see this answer.
Your particular case:
Your problem was that your input file was UTF-8-encoded, but didn't have a BOM (which is actually preferable for the widest compatibility).
Since you're using Windows PowerShell, which misinterprets such files as ANSI-encoded, you need to tell it to read the file as UTF-8 with -Encoding Utf8.
As stated, on writing, -Encoding Utf8 inevitably creates a file with a BOM in Windows PowerShell; if that is a concern, use the .NET framework directly to produce BOM-less files, as shown in the answers to this question.
Note that you would have had no problem with your original command in PowerShell [Core] v6+ - it defaults to BOM-less UTF-8 both on reading and writing, across all cmdlets.
This sensible, standardized default alone is a good reason for considering the move to PowerShell v7.0, which aims to be a superior replacement for the legacy Windows PowerShell.
This question is related to another one which went the Perl way but ran into many difficulties due to Windows bugs (see Perl or Powershell how to convert from UCS-2 little endian to utf-8 or do inline oneliner search replace regex on UCS-2 file).
I would like the PowerShell equivalent of a simple Perl regex replacement on a little-endian UCS-2 file (UCS-2LE is the same as UTF-16 Little Endian), i.e.:
perl -pi.bak -e 's/search/replace/g;' MyUCS-2LEfile.txt
You will probably need to tell PowerShell that the input file is UCS-2LE and that you want the output file in the same UCS-2LE (Windows CR LF) format as well.
This will output the file after the regex replacement. The output file does not begin with a BOM. This should work for small files; for large files, it may require changes to be speedy.
$fin = 'C:/src/t/revbom-in.txt'
$fout = 'C:/src/t/revbom-out.txt'
if (Test-Path -Path $fout) { Remove-Item -Path $fout }
# Create a file for input
$UCS2LENoBomEncoding = New-Object System.Text.UnicodeEncoding $False, $False
[System.IO.File]::WriteAllLines($fin, "now is the time`r`nwhen was the time", $UCS2LENoBomEncoding)
# Read the file in, replace string, write file out
[System.IO.File]::ReadLines($fin, $UCS2LENoBomEncoding) |
    ForEach-Object {
        [System.IO.File]::AppendAllLines($fout, [string[]]($_ -replace 'the','a'), $UCS2LENoBomEncoding)
    }
HT: #refactorsaurusrex at https://gist.github.com/refactorsaurusrex/9aa6b72f3519dbc71f7d0497df00eeb1 for the [string[]] cast
NB: mklement0 at https://gist.github.com/mklement0/acb868a9f15d9a34b6e88fc874b3851d
NB: If the source file is HTML, please see https://stackoverflow.com/a/1732454/447901
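For files that comfortably fit into memory, a simpler whole-file variant of the same idea (a sketch reusing $fin, $fout and the encoding object defined above; 'search'/'replace' are placeholders) could be:
# Read the whole UCS-2LE file, apply the regex replacement, and write it back out
# with the same BOM-less UCS-2LE encoding.
$text = [System.IO.File]::ReadAllText($fin, $UCS2LENoBomEncoding)
[System.IO.File]::WriteAllText($fout, ($text -replace 'search', 'replace'), $UCS2LENoBomEncoding)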
I have a batch script that prompts a user for some input then outputs a couple of files I'm using in an AIX environment. These files need to be in UNIX format (which I believe is UTF8), but I'm looking for some direction on the SIMPLEST way of doing this.
I'd rather not have to download extra software packages such as Cygwin or GnuWin32. I don't mind coding this if it is possible; my coding options are Batch, PowerShell and VBS. Does anyone know of a way to do this?
Alternatively, could I create the files with Batch and call a PowerShell script to reformat them?
The idea here is that a user would be prompted for some information, and then I output a standard file which is basically the prompt answers for a job in AIX. I'm using Batch initially because I didn't know that I would run into this problem, but I'm kind of leaning towards redoing this in PowerShell, because I found some code on another forum that can do the conversion (below).
foreach ($i in ls -name DIR/*.txt) {
    get-content DIR/$i |
        out-file -encoding utf8 -filepath DIR2/$i
}
Looking for some direction or some input on this.
You can't do this without external tools in batch files.
If all you need is the file encoding, then the snippet you gave should work. If you want to convert the files inline (instead of writing them to another place) you can do
Get-ChildItem *.txt | ForEach-Object { (Get-Content $_) | Out-File -Encoding UTF8 $_ }
(The parentheses around Get-Content are important.) However, this will write the files in UTF-8 with a signature (U+FEFF, a BOM) at the start, which some Unix tools don't accept (even though it's technically legal, if discouraged).
Then there is the problem that line breaks are different between Windows and Unix. Unix uses only U+000A (LF) while Windows uses two characters for that: U+000D U+000A (CR+LF). So ideally you'd convert the line breaks, too. But that gets a little more complex:
Get-ChildItem *.txt | ForEach-Object {
    # get the contents and replace line breaks by U+000A
    # (.FullName is used because the .NET APIs resolve relative paths against
    # their own working directory, which may differ from PowerShell's current location)
    $contents = [IO.File]::ReadAllText($_.FullName) -replace "`r`n?", "`n"
    # create UTF-8 encoding without signature
    $utf8 = New-Object System.Text.UTF8Encoding $false
    # write the text back
    [IO.File]::WriteAllText($_.FullName, $contents, $utf8)
}
Try the overloaded version ReadAllText(String, Encoding) if you are using ANSI characters and not only ASCII ones.
$contents = [IO.File]::ReadAllText($_.FullName, [Text.Encoding]::Default) -replace "`r`n", "`n"
https://msdn.microsoft.com/en-us/library/system.io.file.readalltext(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
ASCII - Gets an encoding for the ASCII (7-bit) character set.
Default - Gets an encoding for the operating system's current ANSI code page.
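For example, to check which ANSI code page Default resolves to on a given machine (output varies with the system locale):
# In Windows PowerShell / .NET Framework, Encoding.Default is the active ANSI code page.
[Text.Encoding]::Default.EncodingName   # e.g. "Western European (Windows)" on a Windows-1252 system
[Text.Encoding]::Default.CodePage       # e.g. 1252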