Detecting invalid (Windows) filenames - windows

We have SMB shares that are used by Windows and Mac clients. We want to move some data to Sharepoint, but need to validate the filenames against characters that are not allowed in Windows. Although Windows users wouldn't be able to create files with illegal characters anyway, Mac users are still able to create files with characters that are illegal in Windows.
The problem is that for files with illegal characters in their names, Windows/PowerShell substitutes those characters with code points from the Unicode Private Use Area. These vary by input character.
$testfolder = "\\server\test\test*dir" # created from a Mac
$item = get-item -path $testfolder
$item.Name # test▯dir (the * is displayed as a placeholder glyph)
$char = $($item.Name)[4] # ▯ (the private-use character standing in for *)
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes($char) # 240:33
$unicode = [System.BitConverter]::toString($bytes) # F0-21
For a file with name pipe|, the above code produces the output F0-27, so it's not simply a generic "invalid" character.
How can I check filenames for invalid values when I can't actually get the values?

As often happens, in trying to formulate my question as precisely as possible, I came upon a solution. I would still love any other answers for how this could be tackled more elegantly, but since I didn't find any other resources with this information, I'm providing my solution here in hopes it might help others with this same problem.
Invalid Characters Map to Specific Codepoints
Note: I'm extrapolating all of this from observations I've made. I'm happy for someone to comment or provide an alternative answer that is more complete or correct.
There is a certain set of characters that are invalid for Windows file names, but this is a restriction of the OS, NOT the filesystem. This means that it's possible to set a filename on an SMB share that is valid on another OS (e.g. macOS) but not on Windows. When Windows encounters such a file, the invalid characters are shadowed by a set of proxy Unicode code points, which allows Windows to interact with the files without renaming them. These code points are in the Unicode Private Use Area, which covers 0xE000-0xF8FF. Since these code points are not mapped to printable characters, PowerShell displays them all as ▯ (U+25AF). In my specific use case, I need to run a report of which invalid characters are present in a filename, so this generic placeholder character is not helpful.
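To see which proxy code points a name actually contains (rather than the generic ▯), one option is to scan for characters in the Private Use Area range. This is a minimal Python sketch, not part of my original solution; the function name private_use_chars and the sample name are made up for illustration.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # Private Use Area range quoted above

def private_use_chars(name):
    """Return (index, 'U+XXXX') for every Private Use Area character in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if PUA_START <= ord(ch) <= PUA_END
    ]

# Hypothetical name, as Windows might report "test*dir" created from a Mac.
print(private_use_chars("test\uf021dir"))
```

A report can then list the code points per file even before you know which character each one stands for.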
Through experimentation, I was able to determine the proxy codepoints for each of the printable restricted characters. I've included them below for reference (note: YMMV on this, I haven't tested it on multiple systems, but I suspect it's consistent between versions).
Character          Unicode
"                  0xF020
*                  0xF021
/                  0xF022
<                  0xF023
>                  0xF024
?                  0xF025
\                  0xF026
|                  0xF027
(trailing space)   0xF028
: is not allowed in filenames on any system I have easy access to, so I wasn't able to test that one.
Testing names in Powershell
Now that we know this, it's pretty simple to tackle in powershell. I created a hashtable with all of the proxy unicode points as keys and the "real" characters as values, which we can then use as a lookup table. I chose to replace the characters in the filename string before testing the name. This makes debugging easier.
# Set up regex for invalid characters
$invalid = [Regex]::new('^\s|[\"\*\:<>?\/\\\|]|\s$')
# Create lookup table for Unicode values
$charmap = @{
    [char]0xF020 = '"'
    [char]0xF021 = '*'
    [char]0xF022 = '/'
    [char]0xF023 = '<'
    [char]0xF024 = '>'
    [char]0xF025 = '?'
    [char]0xF026 = '\'
    [char]0xF027 = '|'
    [char]0xF028 = ' '
}
Get-ChildItem -Path "\\path\to\folder" -Recurse | ForEach-Object {
    # Get the filename
    $fixedname = Split-Path -Path $_.FullName -Leaf
    # Iterate through the hashtable and replace all the proxy characters with printable versions
    foreach ($key in $charmap.GetEnumerator()) {
        $fixedname = $fixedname.Replace($key.Name, $key.Value)
    }
    # Build a list of invalid characters to include in report (not shown here)
    $invalidmatches = $invalid.Matches($fixedname)
    if ($invalidmatches.Count -gt 0) {
        $invalidchars = $($invalidmatches | ForEach-Object {
            if ($_.Value -eq ' ') { "Leading or trailing space" } else { $_.Value }
        }) -join ", "
    }
}
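For anyone doing the same check outside PowerShell, here is a rough Python sketch of the same lookup-then-test idea. The code points are the ones I observed above, not an official mapping, and the names CHARMAP and invalid_chars are my own for this example.

```python
import re

# Proxy code points observed above; treat this as an assumption, not a spec.
CHARMAP = {
    "\uf020": '"', "\uf021": "*", "\uf022": "/", "\uf023": "<",
    "\uf024": ">", "\uf025": "?", "\uf026": "\\", "\uf027": "|",
    "\uf028": " ",
}
TRANSLATION = str.maketrans(CHARMAP)
# Same idea as the PowerShell regex: leading/trailing whitespace or a reserved character.
INVALID = re.compile(r'^\s|["*:<>?/\\|]|\s$')

def invalid_chars(name):
    """Swap proxy characters for the real ones, then list what is invalid."""
    fixed = name.translate(TRANSLATION)
    return [
        "leading or trailing space" if m.group() == " " else m.group()
        for m in INVALID.finditer(fixed)
    ]

print(invalid_chars("test\uf021dir"))
```

Translating first and matching second keeps the report readable, because the matches are the real characters rather than private-use placeholders.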
Extending the solution
In theory, you could also extend this to cover other prohibited characters, such as the ASCII control characters. Since these proxy Unicode code points are in the PUA, and there is no documentation on how this is handled (as far as I know), discovering these associations is down to experimentation. I'm content to stop here, as I have run through all of the characters that are easily put in filenames by users on macOS systems.

Related

Adding/Removing hosts file entry using powershell corrupts the file

I'm trying to add or remove a specific entry in the Windows hosts file using PowerShell, but when I do this, it works for some time, and after a while it gets edited again (when Windows reads it, I guess), and it becomes corrupted (displays Chinese characters).
I've tried using parts of some code I found here.
It allows me to edit the file properly and the entry is effective, until it gets corrupted.
I'm doing this to add the entry:
If ((Get-Content "$($env:windir)\system32\Drivers\etc\hosts" ) -notcontains "111.111.111.111 example.com")
{ac -Encoding UTF8 "$($env:windir)\system32\Drivers\etc\hosts" "111.111.111.111 example.com" }
Here is what the file looks like after it gets corrupted:
Thanks for your help.
Solved:
Remove -Encoding UTF8
Because as it states in the comment of the hosts file, "The IP address and the host name should be separated by at least one space.", trying to find a string with exactly one space character in between could return false.
I think it would be better to use Regex for this as it allows matching on more than one space character to separate the IP from the host name.
However, this does require the usage of [Regex]::Escape() on both parts of the entry as they contain regex special characters (the dot).
Something like this:
$hostsFile = "$($env:windir)\system32\Drivers\etc\hosts"
$hostsEntry = '111.111.111.111 example.com'
# split the entry into separate variables
$ipAddress, $hostName = $hostsEntry -split '\s+',2
# prepare the regex
$re = '(?m)^{0}[ ]+{1}' -f [Regex]::Escape($ipAddress), [Regex]::Escape($hostName)
If ((Get-Content $hostsFile -Raw) -notmatch $re) {
    Add-Content -Path $hostsFile -Value $hostsEntry
}
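To show the matching logic on its own, here is a small Python sketch of the same check. It goes slightly beyond the regex above in two ways that are my own assumptions: it also accepts tabs as separators, and it requires the host name to end at whitespace or end of line. The function name has_hosts_entry is made up for this example.

```python
import re

def has_hosts_entry(hosts_text, ip_address, host_name):
    """True if some line starts with ip_address, separators, then host_name."""
    pattern = r"^{}[ \t]+{}(?=\s|$)".format(
        re.escape(ip_address),  # escape the dots so they match literally
        re.escape(host_name),
    )
    return re.search(pattern, hosts_text, flags=re.MULTILINE) is not None

sample = "# comment\n127.0.0.1 localhost\n111.111.111.111    example.com\n"
print(has_hosts_entry(sample, "111.111.111.111", "example.com"))
```

Without the escaping, the dots would match any character, so an unrelated line could be mistaken for the entry.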

Unable to change encoding of text files in Windows

I have some text files with different encodings. Some of them are UTF-8 and some others are windows-1251 encoded. I tried to execute following recursive script to encode it all to UTF-8.
Get-ChildItem *.nfo -Recurse | ForEach-Object {
$content = $_ | Get-Content
Set-Content -PassThru $_.Fullname $content -Encoding UTF8 -Force}
After that I am unable to use the files in my Java program: the files that were already UTF-8 now have the wrong encoding as well, and I couldn't get the original text back. For the windows-1251 encoded files I get empty output, just as with the original files. So the script corrupts files that were already UTF-8 encoded.
I found another solution, iconv, but as I see it needs current encoding as parameter.
$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile
Differently encoded files are mixed in a folder structure, so files should stay on same path.
System uses Code page 852.
Existing UTF-8 files are without BOM.
In Windows PowerShell you won't be able to use the built-in cmdlets for two reasons:
From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.
Using Set-Content (and similar) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark), which Java and, more generally, Unix-heritage utilities don't understand.
Note: PowerShell Core actually defaults to BOM-less UTF8 and also allows you to pass any available [System.Text.Encoding] instance to the -Encoding parameter, so you could solve your problem with the built-in cmdlets there, while needing direct use of the .NET framework only to construct an encoding instance.
You must therefore use the .NET framework directly:
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $file = $_.FullName
    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    try {
        # Discard the text; this read only validates the bytes.
        $null = [IO.File]::ReadAllText($file, [Text.Utf8Encoding]::new($false, $true))
    } catch [System.Text.DecoderFallbackException] {
        # Fall back to Windows-1251
        $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
        $mustReWrite = $true
    }
    # Rewrite as UTF-8 without BOM (the .NET framework's default)
    if ($mustReWrite) {
        Write-Verbose "Converting from 1251 to UTF-8: $file"
        [IO.File]::WriteAllText($file, $content)
    } else {
        Write-Verbose "Already UTF-8-encoded: $file"
    }
}
Note: As in your own attempt, the above solution reads each file into memory as a whole, but that could be changed.
Note:
If an input file comprises only bytes with ASCII-range characters (7-bit), it is by definition also UTF-8-encoded, because UTF-8 is a superset of ASCII encoding.
It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could be a valid UTF-8 file as well, if the bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules around what bit patterns are allowed where).
Such a file would not contain meaningful Windows-1251 content, however.
There is no reason to implement a fallback strategy for decoding with Windows-1251, because there are no technical restrictions on what bit patterns can occur where.
Generally, in the absence of external information (or a BOM), there's no simple and no robust way to infer a file's encoding just from its content (though heuristics can be employed).
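The notes above can be illustrated with a short Python sketch of the same try-UTF-8-first heuristic. This is only an illustration of the decoding logic, not a replacement for the PowerShell solution; the function name decode_utf8_or_1251 is made up, and errors="replace" is my own precaution so that a byte value Python's cp1251 codec leaves unassigned cannot raise.

```python
def decode_utf8_or_1251(data):
    """Return (text, encoding_used), trying strict UTF-8 before Windows-1251."""
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: anything that is not valid UTF-8 is Windows-1251.
        return data.decode("cp1251", errors="replace"), "cp1251"

# Pure ASCII counts as UTF-8; Cyrillic text saved as Windows-1251 does not.
print(decode_utf8_or_1251(b"plain ascii")[1])
print(decode_utf8_or_1251("Привет".encode("cp1251"))[1])
```

As with any such heuristic, a file that happens to be valid as both encodings will be reported as UTF-8.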

How to list files given path with poorly escaped Windows separator

I'm attempting to do this:
Dir["c:\temp\*.*"]
but that is failing. I understand why, but I seem to lack the Ruby prowess to work around it.
I am given the path in a variable and otherwise have no control over it. Nor do I know the contents ahead of time.
Is there a way to make Dir function with double quoted strings that are poorly escaped? Alternatively, how does one take a variable with the apparent contents
"c:\temp\*.*"
and convert it into
'c:/temp/*.*'
This problem at the core seems to be how to potentially escape a string that should have been escaped but now is not.
The end result is I am not able to use the given string to do this as conceptually simple as puts() or Dir[].
If given 'c:\temp\*.*' then I have no problem. I can fix that:
foo = 'c:\temp\*.*'.gsub('\\', '/')
If given "c:\\temp\\*.*" then I have no problem. I can fix that:
foo = "c:\\temp\\*.*".gsub("\\", "/")
However, I am passed neither of those, but rather "c:\temp\*.*". This string contains a TAB and a second undefined escape. It is this that I can't fix in a general way.
Even if I knew the contents ahead of time I am stumped on how to properly escape and transform this. I should add that I am not a ruby programmer at the moment so maybe there is some simple method to deal with this that I am not aware of.
I tried a bunch of stuff like:
"c:\temp\*.*".gsub("\t", "/t")
which gets me part of the way, but since the actual contents of the string are not known to me ahead of time this is a little wonky. Further, if the escape character is not valid as in \* then I am also in a jam. So this also fails:
"c:\temp\*.*".gsub("\t", "/t").gsub("\*", "/*")
Is there a way to make Dir function with double quoted strings that are poorly escaped?
No.
Garbage in, garbage out. There is no Rumpelstiltskin routine that returns gold when given trash.
Ruby auto-converts forward-slashes in filenames/paths to reverse-slashes when running on Windows. Simply make it a habit of using forward, *nix-style, slashes and you'll be fine.
From the IO documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
I don't have "c:\\temp" I have "c:\temp" as input
In a properly defined Windows path you should see:
'c:' + '\temp' + '\*.*' # => "c:\\temp\\*.*"
Note that the single-quotes are treating "\t" as an escaped-escape + "t". Your source for the variable is creating the string improperly by using double-quotes:
'c:' + "\temp" + "\*.*" # => "c:\temp*.*"
If you have "\t", you have a TAB character. It's possible to change it to an escaped-T using:
"c:\temp" # => "c:\temp"
"c:\temp"[2] # => "\t"
"c:\temp"[2].ord # => 9
'\t' # => "\\t"
"c:\temp".sub("\t", '\t') # => "c:\\temp"
The next problem is what to do when you have a String containing "*" to convert it to "\*". There's no way to search for "\*" because that's the same as "*" as seen above:
"\*.*" # => "*.*"
But, since "*.*" is a fairly specific "anything" wildcard, maybe simply searching for and replacing that pattern would work:
"c:\temp\*.*".gsub('*.*', '\\*.*') # => "c:\temp\\*.*"
or:
"c:\temp\*.*".gsub('*.*', '/*.*') # => "c:\temp/*.*"
Back to dealing with "\t" and putting it all together... I'd start with:
"c:\temp\*.*".gsub("\t", '\t').gsub('*.*', '/*.*') # => "c:\\temp/*.*"
"c:\temp\*.*".gsub("\t", '/t').gsub('*.*', '/*.*') # => "c:/temp/*.*"
You'll have to figure out what to do if you have something like:
c:/dir/file*.*
where they mean they want all files starting with file. Since you're seeing ambiguous inputs it seems the input routine needs to be more rigorous to not allow reversed-slashes.
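To make the repair steps above concrete on the already-mangled value, here is a best-effort sketch. It is written in Python rather than Ruby and only illustrates the idea: the function name repair_path and the table of control characters are my own assumptions, and the "*.*" step is the same guess as above, so ambiguous inputs still cannot be recovered reliably.

```python
import re

# Control characters a double-quoted literal may have produced, mapped back to
# the letter that followed the backslash (an illustrative, incomplete subset).
CONTROL_TO_LETTER = {"\a": "a", "\b": "b", "\t": "t", "\n": "n",
                     "\v": "v", "\f": "f", "\r": "r"}

def repair_path(mangled):
    """Best-effort conversion of a mangled Windows path to forward slashes."""
    out = []
    for ch in mangled:
        if ch in CONTROL_TO_LETTER:
            out.append("/" + CONTROL_TO_LETTER[ch])  # TAB -> "/t", and so on
        elif ch == "\\":
            out.append("/")                           # surviving backslashes
        else:
            out.append(ch)
    # Guess that a bare "*.*" lost its separator, as in the gsub above.
    return re.sub(r"(?<!/)\*\.\*", "/*.*", "".join(out))

# The value actually received: "c:", a TAB character, then "emp*.*".
print(repair_path("c:" + chr(9) + "emp*.*"))
```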

How to bulk substitute _ by / in Windows filename for all files in a folder?

The command I'm trying is:
Get-ChildItem | Rename-Item -NewName { $_.Name -replace '_','/' }
But apparently we can't substitute by / for file names in Windows. The error is:
Cannot rename the specified target, because it represents a path or device name.
As others have already pointed out, what you want simply isn't possible in Windows. Forward slashes are reserved characters that are not allowed in file and folder names.
Naming Conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
[…]
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
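If you want to check proposed names before attempting a rename, the reserved-character list above is easy to test against. A minimal sketch in Python (the function name reserved_chars_in is made up, and it covers only the nine characters listed above, not other naming rules):

```python
RESERVED = '<>:"/\\|?*'  # the nine reserved characters listed above

def reserved_chars_in(name):
    """Return the reserved characters that occur in a proposed file name."""
    return sorted({ch for ch in name if ch in RESERVED})

# The rename from the question would produce a name containing "/".
proposed = "report_2020.txt".replace("_", "/")
print(proposed, reserved_chars_in(proposed))
```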

Batch renaming of files with international chars on Windows XP

I have a whole bunch of files with filenames using our lovely Swedish letters å, ä and ö.
For various reasons I now need to convert these to an [a-zA-Z] range. Just removing anything outside this range is fairly easy. The thing that's causing me trouble is that I'd like to replace å with a, ö with o and so on.
This is charset troubles at their worst.
I have a set of test files:
files\Copy of New Text Documen åäö t.txt
files\fofo.txt
files\New Text Document.txt
files\worstcase åäöÅÄÖéÉ.txt
I'm basing my script on this line, piping its results into various commands
for %%X in (files\*.txt) do (echo %%X)
The weird thing is that if I print the results of this (the plain for-loop that is) into a file I get this output:
files\Copy of New Text Documen †„” t.txt
files\fofo.txt
files\New Text Document.txt
files\worstcase †„”Ž™‚.txt
So something weird is happening to my filenames before they even reach the other tools (I've been trying to do this using a sed port for Windows from something called GnuWin32 but no luck so far) and doing the replace on these characters doesn't help either.
How would you solve this problem? I'm open to any type of tools, commandline or otherwise…
EDIT: This is a one time problem, so I'm looking for a quick 'n ugly fix
You can use this code (Python)
Rename international files
# -*- coding: cp1252 -*-
import os, shutil

base_dir = "g:\\awk\\"  # Base directory (includes subdirectories)
char_table_1 = "áéíóúñ"
char_table_2 = "aeioun"

adirs = os.walk(base_dir)
for adir in adirs:
    dir = adir[0] + "\\"  # Directory
    # print("\nDir : " + dir)
    for file in adir[2]:  # List of files
        if os.access(dir + file, os.R_OK):
            file2 = file
            # Replace each accented character with its plain counterpart
            for i in range(0, len(char_table_1)):
                file2 = file2.replace(char_table_1[i], char_table_2[i])
            if file2 != file:
                # Different, rename
                print(dir + file, " => ", file2)
                shutil.move(dir + file, dir + file2)
###
You have to change your encoding and your char tables (I tested this script with Spanish files and works fine). You can comment the "move" line to check if it's working ok, and remove the comment later to do the renaming.
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
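The garbled names in the question look like what you get when text written in the console's OEM code page is read back as the Windows "ANSI" code page. A quick Python sketch of that round trip, assuming code pages 850 and 1252 (typical for a Swedish system, but an assumption on my part; the function name is made up):

```python
def as_seen_in_ansi_editor(name, oem="cp850", ansi="cp1252"):
    """Encode with the console (OEM) code page, then misread it as ANSI."""
    return name.encode(oem).decode(ansi, errors="replace")

print(as_seen_in_ansi_editor("åäö"))  # †„”
```

If that matches what you see, the file names themselves are fine and only the redirected output is being interpreted with the wrong code page.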
Others have proposed using a real programming language. That's fine, especially if you have a language you are very comfortable with. My friend on the C# team says that C# 3.0 (with Linq) is well-suited to whipping up quick, small programs like this. He has stopped writing batch files most of the time.
Personally, I would choose PowerShell. This problem can be solved right on the command line, and in a single line. I'll
EDIT: it's not one line, but it's not a lot of code, either. Also, it looks like StackOverflow doesn't like the syntax "$_.Name", and renders the _ as &#95.
$mapping = @{
    "å" = "a"
    "ä" = "a"
    "ö" = "o"
}
Get-ChildItem -Recurse . *.txt | ForEach-Object {
    $newname = $_.Name
    foreach ($l in $mapping.Keys) {
        $newname = $newname.Replace( $l, $mapping[$l] )
        $newname = $newname.Replace( $l.ToUpper(), $mapping[$l].ToUpper() )
    }
    Rename-Item -WhatIf $_.FullName $newname # remove the -WhatIf when you're ready to do it for real.
}
I would write this in C++, C#, or Java -- environments where I know for certain that you can get the Unicode characters out of a path properly. It's always uncertain with command-line tools, especially out of Cygwin.
Then the code is a simple find/replace or regex/replace. If you can name a language it would be easy to write the code.
I'd write a vbscript (WSH) to scan the directories, then send the filenames to a function that breaks up the filenames into their individual letters, then does a SELECT CASE on the Swedish ones and replaces them with the ones you want. Or, instead of doing that the function could just drop it thru a bunch of REPLACE() functions, reassigning the output to the input string. At the end it then renames the file with the new value.
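The replace-per-letter idea described above can be sketched in a few lines with a translation table. This is an illustration in Python rather than VBScript; the table contents and the names ascii_name and plan_renames are assumptions for the example, and the second function only reports what would be renamed.

```python
import os

# Replacement table; extend it with whatever letters your files contain.
TABLE = str.maketrans({"å": "a", "ä": "a", "ö": "o",
                       "Å": "A", "Ä": "A", "Ö": "O",
                       "é": "e", "É": "E"})

def ascii_name(name):
    """Replace each mapped letter with its plain counterpart."""
    return name.translate(TABLE)

def plan_renames(folder):
    """Yield (old_path, new_path) pairs without touching the disk."""
    for root, _dirs, files in os.walk(folder):
        for file in files:
            new = ascii_name(file)
            if new != file:
                yield os.path.join(root, file), os.path.join(root, new)

print(ascii_name("worstcase åäöÅÄÖéÉ.txt"))
```

Printing the pairs from plan_renames("files") first gives you a dry run; pass each pair to os.rename once the list looks right.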

Resources