Batch renaming of files with international chars on Windows XP

I have a whole bunch of files with filenames using our lovely Swedish letters å, ä and ö.
For various reasons I now need to convert these to an [a-zA-Z] range. Just removing anything outside this range is fairly easy. The thing that's causing me trouble is that I'd like to replace å with a, ö with o and so on.
This is charset trouble at its worst.
I have a set of test files:
files\Copy of New Text Documen åäö t.txt
files\fofo.txt
files\New Text Document.txt
files\worstcase åäöÅÄÖéÉ.txt
I'm basing my script on this line, piping its results into various commands:
for %%X in (files\*.txt) do (echo %%X)
The weird thing is that if I print the results of this (the plain for-loop, that is) to a file I get this output:
files\Copy of New Text Documen †„” t.txt
files\fofo.txt
files\New Text Document.txt
files\worstcase †„”Ž™‚.txt
So something weird is happening to my filenames before they even reach the other tools (I've been trying to do this with a sed port for Windows from something called GnuWin32, but no luck so far), and doing the replace on these characters doesn't help either.
How would you solve this problem? I'm open to any type of tool, command-line or otherwise…
EDIT: This is a one time problem, so I'm looking for a quick 'n ugly fix

You can use this code (Python)
Rename international files
# -*- coding: cp1252 -*-
import os, shutil

base_dir = "g:\\awk\\" # Base directory (includes subdirectories)
char_table_1 = "áéíóúñ"
char_table_2 = "aeioun"

adirs = os.walk(base_dir)
for adir in adirs:
    dir = adir[0] + "\\" # Directory
    # print "\nDir : " + dir
    for file in adir[2]: # List of files
        if os.access(dir + file, os.R_OK):
            file2 = file
            for i in range(0, len(char_table_1)):
                file2 = file2.replace(char_table_1[i], char_table_2[i])
            if file2 != file:
                # Different, rename
                print dir + file, " => ", file2
                shutil.move(dir + file, dir + file2)
You have to change your encoding and your char tables (I tested this script with Spanish files and it works fine; for the Swedish case in the question the tables would presumably be "åäöÅÄÖéÉ" and "aaoAAOeE"). You can comment out the "move" line to check that it's working, and remove the comment later to do the actual renaming.

You might have more luck in cmd.exe if you opened it in Unicode mode. Use "cmd /U". The mangling you're seeing is an OEM/ANSI code-page mismatch: cmd.exe writes redirected output in the OEM code page (CP437/CP850), where å, ä and ö are the bytes 0x86, 0x84 and 0x94, and an ANSI (CP1252) editor renders those same bytes as †, „ and ”.
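A quick PowerShell sketch to confirm the theory (assuming a CP850 console and a CP1252 editor; a CP437 console maps these three bytes identically):
$oem = [Text.Encoding]::GetEncoding(850)    # the console's OEM code page
$ansi = [Text.Encoding]::GetEncoding(1252)  # the editor's ANSI code page
$ansi.GetString($oem.GetBytes("åäö"))       # prints †„” - the exact garbage from the question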
Others have proposed using a real programming language. That's fine, especially if you have a language you are very comfortable with. My friend on the C# team says that C# 3.0 (with Linq) is well-suited to whipping up quick, small programs like this. He has stopped writing batch files most of the time.
Personally, I would choose PowerShell. This problem can be solved right on the command line, and in a single line.
EDIT: it's not one line, but it's not a lot of code, either. Also, it looks like StackOverflow doesn't like the syntax "$_.Name", and renders the _ as &#95.
$mapping = @{
    "å" = "a"
    "ä" = "a"
    "ö" = "o"
}

Get-ChildItem -Recurse . *.txt | Foreach-Object {
    $newname = $_.Name
    foreach ($l in $mapping.Keys) {
        $newname = $newname.Replace( $l, $mapping[$l] )
        $newname = $newname.Replace( $l.ToUpper(), $mapping[$l].ToUpper() )
    }
    Rename-Item -WhatIf $_.FullName $newname # remove the -WhatIf when you're ready to do it for real.
}
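Note that the worstcase file in the question also contains é and É, so the mapping presumably needs one more entry; the ToUpper() line above then takes care of É as well:
$mapping = @{
    "å" = "a"
    "ä" = "a"
    "ö" = "o"
    "é" = "e"   # É is handled by the ToUpper() replacement above
}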

I would write this in C++, C#, or Java -- environments where I know for certain that you can get the Unicode characters out of a path properly. It's always uncertain with command-line tools, especially out of Cygwin.
Then the code is a simple find/replace or regex/replace. If you can name a language it would be easy to write the code.

I'd write a VBScript (WSH) to scan the directories, then send each filename to a function that breaks the name into its individual letters, does a SELECT CASE on the Swedish ones, and replaces them with the ones you want. Alternatively, the function could just run the name through a chain of REPLACE() calls, reassigning the output to the input string each time. At the end it renames the file with the new value.

Related

Detecting invalid (Windows) filenames

We have SMB shares that are used by Windows and Mac clients. We want to move some data to Sharepoint, but need to validate the filenames against characters that are not allowed in Windows. Although Windows users wouldn't be able to create files with illegal characters anyway, Mac users are still able to create files with characters that are illegal in Windows.
The problem is that for files with illegal characters in their names, Windows/PowerShell substitutes those characters with private-use-area Unicode codepoints, which vary by input character.
$testfolder = "\\server\test\test*dir" # created from a Mac
$item = get-item -path $testfolder
$item.Name # testdir
$char = $($item.Name)[4] # 
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes($char) # 240:33
$unicode = [System.BitConverter]::toString($bytes) # F0-21
For a file with name pipe|, the above code produces the output F0-27, so it's not simply a generic "invalid" character.
How can I check filenames for invalid values when I can't actually get at the values?
As often happens, in trying to formulate my question as precisely as possible, I came upon a solution. I would still love any other answers for how this could be tackled more elegantly, but since I didn't find any other resources with this information, I'm providing my solution here in hopes it might help others with this same problem.
Invalid Characters Map to Specific Codepoints
Note: I'm extrapolating all of this from observations I've made. I'm happy for someone to comment or provide an alternative answer that is more complete or correct.
There is a certain set of characters that are invalid in Windows file names, but this is a restriction of the OS, NOT the filesystem. This means that it's possible to set a filename on an SMB share that is valid on another OS (e.g. MacOS) but not on Windows. When Windows encounters such a file, the invalid characters are shadowed by a set of proxy Unicode codepoints, which allows Windows to interact with the files without renaming them. These codepoints are in the Unicode Private Use Area, which covers 0xE000-0xF8FF. Since these codepoints are not mapped to printable characters, PowerShell displays them all as ▯ (U+25AF). In my specific use case, I need to report which invalid characters are present in a filename, so this generic placeholder is not helpful.
Through experimentation, I was able to determine the proxy codepoints for each of the printable restricted characters. I've included them below for reference (note: YMMV on this, I haven't tested it on multiple systems, but I suspect it's consistent between versions).
Character           Unicode
"                   0xF020
*                   0xF021
/                   0xF022
<                   0xF023
>                   0xF024
?                   0xF025
\                   0xF026
|                   0xF027
(trailing space)    0xF028
: is not allowed in filenames on any system I have easy access to, so I wasn't able to test that one.
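For what it's worth, here is a rough sketch that generalizes the byte probing above: it walks a tree and reports every filename character that falls in the Private Use Area (the path is the placeholder share from earlier):
Get-ChildItem -Path "\\server\test" -Recurse | ForEach-Object {
    foreach ($char in $_.Name.ToCharArray()) {
        $codepoint = [int]$char
        # anything in the PUA (0xE000-0xF8FF) is a proxy for an invalid character
        if ($codepoint -ge 0xE000 -and $codepoint -le 0xF8FF) {
            "{0}: U+{1:X4}" -f $_.Name, $codepoint
        }
    }
}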
Testing names in PowerShell
Now that we know this, it's pretty simple to tackle in PowerShell. I created a hashtable with all of the proxy codepoints as keys and the "real" characters as values, which we can then use as a lookup table. I chose to replace the characters in the filename string before testing the name, which makes debugging easier.
#Set up regex for invalid characters
$invalid = [Regex]::new('^\s|[\"\*\:<>?\/\\\|]|\s$')

#Create lookup table for unicode values
$charmap = @{
    [char]0xF020 = '"'
    [char]0xF021 = '*'
    [char]0xF022 = '/'
    [char]0xF023 = '<'
    [char]0xF024 = '>'
    [char]0xF025 = '?'
    [char]0xF026 = '\'
    [char]0xF027 = '|'
    [char]0xF028 = ' '
}

Get-ChildItem -Path "\\path\to\folder" -Recurse | Foreach-Object {
    # Get the filename
    $fixedname = Split-Path -Path $_.FullName -Leaf
    #Iterate through the hashtable and replace all the proxy characters with printable versions
    foreach ($key in $charmap.GetEnumerator()) {
        $fixedname = $fixedname.Replace($key.Name, $key.Value)
    }
    #Build a list of invalid characters to include in report (not shown here)
    $invalidmatches = $invalid.Matches($fixedname)
    if ($invalidmatches.Count -gt 0) {
        $invalidchars = $($invalidmatches | Foreach-Object {
            if ($_.Value -eq ' ') {"Leading or trailing space"} else {$_.Value}
        }) -join ", "
    }
}
Extending the solution
In theory, you could also extend this to cover other prohibited characters, such as the ASCII control characters. Since these proxy codepoints are in the PUA, and (as far as I know) there is no documentation on how the mapping is handled, discovering the associations comes down to experimentation. I'm content to stop here, as I have covered all of the characters that users on MacOS systems can easily put into filenames.

One-line program to delete files with few header lines

This is the next part of my earlier question perl one-liner to keep only desired lines. Here I have many *.fa files in a folder.
Suppose for three files: 1.fa, 2.fa, 3.fa
The contents of them are as follows:
1.fa
>djhnk_9
abfgdddcfdafaf
ygdugidg
>kjvk.80
jdsfkdbfdkfadf
>jnck_q2
fdgsdfjghsjhsfddf
>7ytiu98
ihdlfwdfjdlfl]ol
2.fa
>cj76
dkjfhkdjcfhdjk
>67q32
nscvsdkvklsflplsad
>kbvbk
cbjfdikjbfadkjfbka
3.fa
>1290.5
mnzmnvjbsdjb
The lines that start with a > are the headers and the rest are the feature lines.
I want to delete those files that have 3 or fewer header lines. Here, file 2.fa and file 3.fa should be deleted.
As I am working on a Windows system, I would prefer a one-line Perl script, something like:
for %%F in ("*.fa") do perl ...
Is there a one-line program for that?
Use a program. "One-liners" are inscrutable, non-portable, and very hard to debug.
This does as you ask. I hope it's clear that I have commented out the unlink call for testing purposes: it would be a pain to regenerate the *.fa files each time.
You will probably want to change '[0-9].fa' to just '*.fa'; I had other files in my own directory that I didn't want to be considered.
use strict;
use warnings 'all';

while ( my $file = glob '[0-9].fa' ) {
    open my $fh, '<', $file or die qq{Unable to open "$file": $!};
    my $headers = grep /^>/, <$fh>;
    #unlink $file if $headers <= 3;
    print qq{deleting "$file"\n} if $headers <= 3;
}
output
deleting "2.fa"
deleting "3.fa"
Next time, please try to write some code by yourself to solve the problem, and only come to ask for help afterwards. You will learn more if you do, and we won't feel like you're just asking us to write your code.
The problem is very simple though, so here's a solution.
Note that this solution should be considered a quick fix. Borodin suggested a cleaner, easier to understand and more portable way to do this here.
I would suggest doing it with Perl, like this:
perl -nE "$count{$ARGV}++ if /^>/; END { unlink grep { $count{$_} <= 3 } keys %count }" *.fa
(For the record, I'm using double quotes (") as the string delimiters since you are on Windows; if anyone wishes to use this on a Unix system, just change the double quotes to single quotes (').)
Explanations:
-n surrounds the code with while(<>){...}, which reads the input files line by line.
With $count{$ARGV}++ if /^>/ we count the number of headers in each file: $ARGV holds the name of the file being read, and /^>/ is true only if the line starts with >, i.e. it's a header line.
Finally (the END { .. } part), we delete (with the function unlink) the files that have 3 headers or fewer: keys %count gives all the file names, and grep { $count{$_} <= 3 } retains only those with 3 or fewer header lines, which are the ones to delete.

Convert file from Windows to UNIX through Powershell or Batch

I have a batch script that prompts a user for some input then outputs a couple of files I'm using in an AIX environment. These files need to be in UNIX format (which I believe is UTF8), but I'm looking for some direction on the SIMPLEST way of doing this.
I'd rather not have to download extra software packages such as Cygwin or GnuWin32. I don't mind coding this if it is possible; my options are Batch, PowerShell and VBS. Does anyone know of a way to do this?
Alternatively could I create the files with Batch and call a Powershell script to reform these?
The idea here is that a user is prompted for some information, and I then output a standard file that basically holds prompt answers for a job in AIX. I used Batch initially because I didn't know I would run into this problem, but I'm leaning towards redoing this in PowerShell, because I found some code on another forum that can do the conversion (below).
foreach ($i in ls -name DIR/*.txt) {
    get-content DIR/$i | out-file -encoding utf8 -filepath DIR2/$i
}
Looking for some direction or some input on this.
You can't do this without external tools in batch files.
If all you need is the file encoding, then the snippet you gave should work. If you want to convert the files inline (instead of writing them to another place) you can do
Get-ChildItem *.txt | ForEach-Object { (Get-Content $_) | Out-File -Encoding UTF8 $_ }
(The parentheses around Get-Content are important.) However, this will write the files in UTF-8 with a signature (U+FEFF) at the start, which some Unix tools don't accept (it's technically legal, but its use is discouraged).
Then there is the problem that line breaks are different between Windows and Unix. Unix uses only U+000A (LF) while Windows uses two characters for that: U+000D U+000A (CR+LF). So ideally you'd convert the line breaks, too. But that gets a little more complex:
Get-ChildItem *.txt | ForEach-Object {
    # get the contents and replace line breaks by U+000A
    # (FullName, because $_ alone stringifies to just the file name)
    $contents = [IO.File]::ReadAllText($_.FullName) -replace "`r`n?", "`n"
    # create UTF-8 encoding without signature
    $utf8 = New-Object System.Text.UTF8Encoding $false
    # write the text back
    [IO.File]::WriteAllText($_.FullName, $contents, $utf8)
}
Try the overloaded version ReadAllText(String, Encoding) if you are using ANSI characters and not only ASCII ones.
$contents = [IO.File]::ReadAllText($_, [Text.Encoding]::Default) -replace "`r`n", "`n"
https://msdn.microsoft.com/en-us/library/system.io.file.readalltext(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
ASCII - Gets an encoding for the ASCII (7-bit) character set.
Default - Gets an encoding for the operating system's current ANSI code page.
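To sanity-check the conversion afterwards, one idea is to inspect the raw bytes of an output file (out.txt is a hypothetical name here): there should be no UTF-8 signature (EF BB BF) at the start and no CR (0x0D) bytes left:
$bytes = [IO.File]::ReadAllBytes("out.txt")
# $true if the file does not start with the UTF-8 signature EF BB BF
($bytes[0..2] -join " ") -ne "239 187 191"
# $true if no carriage returns remain, i.e. all line breaks are bare LF
$bytes -notcontains 13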

an app or a batch file script to remove special characters from text

I love this online tool http://textmechanic.co/ but it lacks another important feature: deleting special characters such as %, [, ), *, ?, ', etc. (everything except _, -, and .) from a large quantity of text.
I am looking for an online tool or a small windows utility or a batch script that can do this.
I think sed is the easiest choice here. You can download a Windows port of it here. Furthermore, nearly every text editor can do this (but most won't cope well with files in the multi-GiB range).
With sed you'd probably want something like this:
sed "s/[^a-zA-Z0-9_.-]//g" file.txt
Likewise, if you have a semi-recent Windows (e.g. Windows 7), PowerShell comes preinstalled with it. The following one-liner will do it for you:
Get-Content file.txt | foreach { $_ -replace '[^\w\d_.-]' } | Out-File -Encoding UTF8 file.new.txt
This can easily be adapted to multiple files as well. You may even be able to write the output back into the original file, since I think Get-Content yields an array, not an enumerator (i.e. the pipeline does not operate on the file while you are still reading it). Very large files pose a similar problem because of that, though.
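If that's right, an in-place variant would look something like this; the parentheses force the whole file into memory before the pipeline starts writing back to it:
(Get-Content file.txt) | ForEach-Object { $_ -replace '[^\w\d_.-]' } | Out-File -Encoding UTF8 file.txt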
You can do regex with any tool/language that supports it. Here's a Ruby for Windows command
C:\work>ruby -ne "print $_.gsub(/[%)?\[\]*]/, '')" file

Windows Command to detect and remove text in a file

I have an ASCII file and in there somewhere is the line:
BEGIN
and later on the line:
END
I'd like to be able to remove those two lines and everything in between from a command line call in windows. This needs to be completely automated.
EDIT: See sed in Vista - how to delete all symbols between? for details on how to use sed to do this (cygwin has sed).
EDIT: I am finding that sed could work, but when I pipe the output to a file, the carriage returns are removed. How can I keep them? I am using this sed regex:
/^GlobalSection(TeamFoundationVersionControl) = preSolution$/,/^EndGlobalSection$/{
    /^GlobalSection(TeamFoundationVersionControl) = preSolution$/!{
        /^EndGlobalSection$/!d
    }
}
.. where the start line is 'GlobalSection(TeamFoundationVersionControl) = preSolution' and the end line is 'EndGlobalSection'. I'd like to delete those two lines as well.
EDIT: I am now using something simpler for sed:
/^GlobalSection(TeamFoundationVersionControl) = preSolution$/,/^EndGlobalSection$/d
The line feeds are still an issue, though.
Alternatively, what I use these days for such tasks is a scripting language that plays nicely with Windows, like Ruby or Python. Ruby is easy to install on Windows and makes problems like this child's play.
Here's a script, which you could run like this:
cutBeginEnd.rb myFileName.txt
sourcefile = File.open(ARGV[0], "r+") # open read-write so we can overwrite in place
# Get the string and do a multiline replace
fileString = sourcefile.read()
slicedString = fileString.gsub(/BEGIN.*?END\n/m, "") # non-greedy, in case there are several BEGIN/END blocks
# Overwrite the file
sourcefile.pos = 0
sourcefile.print slicedString
sourcefile.truncate(sourcefile.pos)
This does a pretty good job, allows for a lot of flexibility, and is possibly more readable than sed.
Here is a 1-line Perl command that does what you want (just type it from the Command Prompt window):
perl -i.bak -ne "print unless /^BEGIN\r?\n/ .. /^END\r?\n/" myfile.txt
Carriage returns and line feeds will be preserved properly. The original version of myfile.txt will be saved as myfile.txt.bak.
If you don't have Perl installed, get ActivePerl.
Here's how to delete the entire GlobalSection(TeamFoundationVersionControl) = preSolution section using a C# regular expression:
// Create a regex to match against an entire GlobalSection(TeamFoundationVersionControl) section so that it can be removed (including preceding and trailing whitespace).
// The symbols *, +, and ? are greedy by default and will match everything until the LAST occurrence of EndGlobalSection, so we must use their non-greedy counterparts, *?, +?, and ??.
// Example of string to match against: " GlobalSection(TeamFoundationVersionControl) ...... EndGlobalSection "
Regex _regex = new Regex(@"(?i:\s*?GlobalSection\(TeamFoundationVersionControl\)(?:.|\n)*?EndGlobalSection\s*?)", RegexOptions.Compiled);
