How to recursively convert all filenames in folder subtree from UTF-8 to ASCII in Linux - bash

I'm quite new to bash scripting, and I would like to recursively convert all the filenames in a folder from UTF-8 encoding to ASCII (which is a very portable encoding).
I think that iconv command would be of some use:
iconv -f utf8 -t ascii ...
But I'm not sure how to use it exactly.
Ideally the bash script should print some hint about its progress, such as the name of the file it just converted.
Thank you very much.

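find /my/path -type f > utf8list
# //TRANSLIT (a GNU iconv extension) approximates characters that have no ASCII equivalent
iconv -f utf8 -t ascii//TRANSLIT utf8list > asciilist
i=1
# read line by line so that names containing spaces stay intact
while IFS= read -r file; do
newname=$(sed -n "${i}p" asciilist)
#mv "$file" "$newname"
echo "mv $file $newname"
i=$((i + 1))
done < utf8list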
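A minimal sketch of a recursive variant that renames one path component at a time and prints each rename as it goes. The helper names ascii_name and rename_tree are hypothetical, the //TRANSLIT suffix assumes GNU iconv (its output depends on the locale), and hidden files are skipped because the glob does not match them.

```sh
# ascii_name and rename_tree are hypothetical helper names.

# Print an ASCII approximation of one name, or the name unchanged if iconv fails.
ascii_name() {
    if ascii_out=$(printf '%s' "$1" | iconv -f UTF-8 -t ASCII//TRANSLIT 2>/dev/null); then
        printf '%s\n' "$ascii_out"
    else
        printf '%s\n' "$1"
    fi
}

# Rename everything below directory $1. The body runs in a subshell,
# so the loop variables of each recursion level stay separate.
rename_tree() (
    for path in "$1"/*; do
        # Skip the literal pattern that is left when the glob matches nothing.
        [ -e "$path" ] || [ -L "$path" ] || continue
        # Descend first, so children are renamed before their parent directory.
        if [ -d "$path" ] && [ ! -L "$path" ]; then
            rename_tree "$path"
        fi
        base=${path##*/}
        new=$(ascii_name "$base")
        # Ignore empty results, results containing a slash, and unchanged names.
        case $new in
            ''|*/*|"$base") continue ;;
        esac
        # Never overwrite an existing entry.
        if [ ! -e "$1/$new" ]; then
            mv -- "$path" "$1/$new" && printf 'renamed: %s -> %s\n' "$path" "$1/$new"
        fi
    done
)
```

Calling rename_tree /my/path would then process the whole subtree. Trying it on a copy of the data first seems prudent, since two different names can transliterate to the same ASCII name, in which case the second one is left untouched.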

Related

Substitute special character

I have a special character in my .txt file.
I want to substitute that special character ý with |
and rename the file to .mnt from .txt.
Here is my code: it renames the file to .mnt, but does not substitute the special character
#!/bin/sh
for i in `ls *.txt 2>/dev/null`;
do
filename=`echo "$i" | cut -d'.' -f1`
sed -i 's/\ý/\|/g' $i
mv $i ${filename}.mnt
done
How to do that?
Example:
BEGIN_RUN_SQLýDELETE FROM PRC_DEAL_TRIG WHERE DEAL_ID = '1:2:1212'
You have multiple problems in your code. Don't use ls in scripts and quote your variables. You should probably use $(command substitution) rather than the legacy `command substitution` syntax.
If your task is to replace ý in the file's contents -- not in its name -- sed -i is not wrong, but superfluous; just write the updated contents to the new location and delete the old file.
#!/bin/sh
for i in *.txt
do
filename=$(echo "$i" | cut -d'.' -f1)
sed 's/ý/|/g' "$i" >"${filename}.mnt" && rm "$i"
done
If your system is configured for UTF-8, the character ý can be represented either by the byte sequence \xc3 \xbd (representing U+00FD) or by the decomposed sequence \x79 \xcc \x81 (U+0079 + U+0301) - you might find that the file contains one representation, while your terminal prefers another. The only way to really be sure is to examine the hex bytes in the file and on your terminal. It is also entirely possible that your terminal is not capable of displaying the contents of the file exactly. Try
bash$ printf 'ý' | xxd
00000000: c3bd
bash$ head -c 16 file | xxd
00000000: 4245 4749 4e5f 5255 4e5f 5351 4cff 4445 BEGIN_RUN_SQL.DE
If (as here) you find that they are different (the latter outputs the single byte \xff between "BEGIN_RUN_SQL" and "DE") then the trivial approach won't work. Your sed may or may not support passing in literal hex sequences to say exactly what to substitute; or perhaps try e.g. Perl if not.
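As one possible way around this, the single byte can be replaced without relying on sed hex escapes at all. The sketch below swaps in tr with an octal escape instead of sed; replace_byte_ff is a hypothetical helper name, and the 0xFF value is taken from the hex dump above.

```sh
# replace_byte_ff is a hypothetical helper name.
# Usage: replace_byte_ff SOURCE DESTINATION
# Copies SOURCE to DESTINATION, replacing every 0xFF byte with "|".
replace_byte_ff() {
    # LC_ALL=C makes tr operate on single bytes instead of multibyte characters.
    # \377 is the octal spelling of 0xFF.
    LC_ALL=C tr '\377' '|' < "$1" > "$2"
}
```

In the loop from the answer, this would stand in for the sed call, for example replace_byte_ff "$i" "${filename}.mnt" && rm "$i".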

Bash: how to get the complete substring of a match in a string?

I have a TXT file, which is shipped from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file but QString supports only UTF-8 (I want to avoid working with QByteArray). I've been struggling to find a way to do that in Qt so I decided to write a small script that does the conversion for me. I have no problem writing it for exactly my case but I would like to make it more general - for all ISO-8859 encodings.
So far I have the following:
#!/usr/bin/env bash
output=$(file -i $1)
# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
# Retrieve actual encoding
encoding=...
# run iconv to convert
iconv -f $encoding $1 -t UTF-8 -o $1
else
echo "Text file not encoded in ISO-8859"
fi
The part that I'm struggling with is how to get the complete substring that has been successfully matched in the grep command.
Let's say I have the file helloworld.txt and it's encoded in ISO-8859-15. In this case
$~: ./fixEncodingToUtf8 helloworld.txt
stations.txt: text/plain; charset=iso-8859-15
will be the output in the terminal. Internally the grep finds the iso-8859 (since I use the -i flag it processes the input in a case-insensitive way). At this point the script needs to "extract" the whole substring namely not just iso-8859 but iso-8859-15 and store it inside the encoding variable to use it later with iconv (which is case insensitive (phew!) when it comes to the name of the encodings).
NOTE: The script above can be extended even further by simply retrieving the value that follows charset and using it for the encoding. However this has one huge flaw - what if the input file has an encoding that has a larger character set than UTF-8 (simple example: UTF-16 and UTF-32)?
Or using bash features like below
$ str="stations.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15
To save in variable
$ myvar="${str#*=}"
You can use cut or awk to get at this:
awk:
encoding=$(echo "$output" | awk -F"=" '{print $2}')
cut:
encoding=$(echo "$output" | cut -d"=" -f2)
I think you could just feed this over to your iconv command directly and reduce your script to:
iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"
Well, in this case it is rather pointless…
$ file --brief --mime-encoding "$1"
iso-8859-15
file manual
-b, --brief
Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
Like -i, but print only the specified element(s).
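Putting the pieces together, the whole script could look roughly like the sketch below. The function names detect_encoding, convert_in_place and fix_encoding are hypothetical. The conversion goes through a temporary file, on the assumption that letting iconv read and write the same file at once is unsafe.

```sh
# detect_encoding, convert_in_place and fix_encoding are hypothetical helper names.

# Print the charset that file(1) guesses, e.g. "iso-8859-15".
detect_encoding() {
    file --brief --mime-encoding -- "$1"
}

# Usage: convert_in_place ENCODING FILE
# The body runs in a subshell, so its variables do not leak to the caller.
convert_in_place() (
    tmpfile=$(mktemp) || exit 1
    if iconv -f "$1" -t UTF-8 < "$2" > "$tmpfile"; then
        # Overwrite the contents only, which keeps the file's permissions.
        cat "$tmpfile" > "$2"
        rm -f "$tmpfile"
    else
        rm -f "$tmpfile"
        exit 1
    fi
)

# Convert FILE only when file(1) reports some ISO-8859 variant.
fix_encoding() {
    case $(detect_encoding "$1") in
        iso-8859-*) convert_in_place "$(detect_encoding "$1")" "$1" ;;
        *) echo "Text file not encoded in ISO-8859" >&2; return 1 ;;
    esac
}
```

With these definitions, fix_encoding helloworld.txt would convert the file only if the detected charset starts with iso-8859-. Keep in mind that the detection in file is a heuristic, so its guess is not guaranteed to be the encoding the file was actually written in.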

Replacing Foreign Characters with English Equivalents in filenames with UNIX Bash Script

I'm trying to use sed to process a list of filenames and replace every foreign character in the file name with an English equivalent. E.g.
málaga.txt -> malaga.txt
My script is the following:
for f in *.txt
do
newf=$(echo $f | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv $f $newf
done
This currently has no effect on the filenames. However, if I use the same expression to process a text file, e.g.
cat blah.txt | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/'
it works perfectly: all foreign characters are substituted with their English equivalents. Any help would be greatly appreciated. This is on Mac OS X in a UNIX shell.
This should do it:
for f in *.txt; do
newf=$(echo "$f" | iconv -f utf-8-mac -t utf-8 | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv "$f" "$newf"
done
iconv -f utf-8-mac -t utf-8 converts the text from utf-8-mac to utf-8, which resolves the precomposed/decomposed problem discussed in the comments by @PavelGurkov and @ninjalj.
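To make the precomposed/decomposed difference visible, here is a small sketch that dumps the bytes of both forms of á. The helper name hex_bytes is hypothetical. The sed list contains the precomposed form, while filenames on the Mac typically come back in the decomposed form, so the two never match byte for byte.

```sh
# hex_bytes is a hypothetical helper: print the bytes of a string as hex digits.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

precomposed=$(printf '\303\241')   # U+00E1, a single code point
decomposed=$(printf 'a\314\201')   # U+0061 followed by combining U+0301

hex_bytes "$precomposed"   # prints c3a1
hex_bytes "$decomposed"    # prints 61cc81
```

Running hex_bytes on one of the real filenames shows which of the two forms the file system handed back.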

How to search and replace text in an xml file with SED?

I have to convert a list of xml files in a folder from UTF-16 to UTF-8, remove the BOM, and then replace the keyword inside the file from UTF-16 to UTF-8.
I'm using cygwin to run a bash shell script to accomplish this, but I've never worked with SED before today and I need help!
I found a SED one liner for removing the BOM, now I need another for replacing the text from UTF-16 to UTF-8 in the xml header.
This is what I have so far:
#!/bin/bash
mkdir -p outUTF8
#Convert files to unix format.
find -exec dos2unix {} \;
#Use a for loop to convert all the xml files.
for f in `ls -1 *.xml`; do
sed -i -e '1s/^\xEF\xBB\xBF//' FILE
iconv -f utf-16 -t utf-8 $f > outUTF8/$f
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
echo $f
done
However, this line:
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
is hanging the script. Any ideas as to the proper format for this?
Try something like this -
for filename in *.xml; do
sed -i".bak" -e '1s/^\xEF\xBB\xBF//' "$filename"
iconv -f utf-16 -t utf-8 "$filename" > outUTF8/"$filename"
sed -i 's/UTF-16/UTF-8/g' outUTF8/"$filename"
done
The first sed will make a backup of your original files with the extension .bak. Then iconv converts each file and saves it under the newly created directory with the same filename. Lastly, the second sed edits the converted file in place to replace UTF-16 with UTF-8 in the text.
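The same conversion can also be written as a single pipeline per file, which avoids editing anything in place. This is only a sketch: xml_utf16_to_utf8 is a hypothetical helper name, and it assumes the encoding declaration sits on the first line of the file.

```sh
# xml_utf16_to_utf8 is a hypothetical helper name.
# Usage: xml_utf16_to_utf8 SOURCE DESTINATION
xml_utf16_to_utf8() {
    # Decoding from UTF-16 consumes the byte order mark, so the UTF-8 output
    # starts directly with the XML declaration. The sed command touches
    # line 1 only, so later mentions of UTF-16 in the document are left alone.
    iconv -f UTF-16 -t UTF-8 < "$1" | sed '1s/UTF-16/UTF-8/' > "$2"
}
```

Inside the loop this would be called as xml_utf16_to_utf8 "$filename" outUTF8/"$filename". The original files are never modified, so no .bak copies are needed.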
Two things:
How big is your $f file? If it's really big, it may just take a long time to complete.
Oops, I see you have an echo $f at the bottom of your loop. Move it before the sed command so you can see whether there are any spaces in the filenames.
2a :-) Or just change all references to $f to "$f" to protect against spaces.
I hope this helps.

sed command to fix filenames in a directory

I ran a script which generated about 10k files in a directory. I just discovered that there is a bug in the script which causes some filenames to contain a carriage return (presumably a '\n' character).
I want to run a sed command to remove the carriage return from the filenames.
Does anyone know which params to pass to sed to clean up the filenames in the manner described?
I am running Linux (Ubuntu)
I don't know how sed would do this, but this Python script should do the trick.
This isn't sed, but I find Python a lot easier to use when doing things like these:
#!/usr/bin/env python
import os

# Strip carriage returns and newlines from every filename in the current directory
for file in os.listdir('.'):
    new_name = file.replace('\r', '').replace('\n', '')
    os.rename(file, new_name)
    print('Processed ' + new_name)
It strips any occurrences of both \r and \n from all of the filenames in a given directory.
To run it, save it somewhere, cd into your target directory (with the files to be processed), and run python /path/to/the/file.py.
Also, if you plan on doing more batch renaming, consider Métamorphose. It's a really nice and powerful GUI for this stuff. And, it's free!
Good luck!
Actually, try this: cd into the directory, type in python, and then just paste this in:
exec("import os\nfor file in os.listdir('.'):\n os.rename(file, file.replace('\\r', '').replace('\\n', ''))\n print('Processed ' + file.replace('\\r', '').replace('\\n', ''))")
It's a one-line version of the previous script, and you don't have to save it.
Version 2, with space replacement powers:
#!/usr/bin/env python
import os

# Also replace spaces with underscores
for file in os.listdir('.'):
    new_name = file.replace('\r', '').replace('\n', '').replace(' ', '_')
    os.rename(file, new_name)
    print('Processed ' + new_name)
And here's the one-liner:
exec("import os\nfor file in os.listdir('.'):\n os.rename(file, file.replace('\\r', '').replace('\\n', '').replace(' ', '_'))\n print('Processed ' + file.replace('\\r', '').replace('\\n', '').replace(' ', '_'))")
If there are no spaces in your filenames, you can do:
for f in *$'\n'; do mv "$f" $f; done
It won't work if the newlines are embedded, but it will work for trailing newlines.
If you must use sed:
for f in *$'\n'; do mv "$f" "$(echo "$f" | sed '/^$/d')"; done
Using the rename Perl script:
rename 's/\n//g' *$'\n'
or the util-linux-ng utility:
rename $'\n' '' *$'\n'
If the character is a return instead of a newline, change the \n or ^$ to \r in any places they appear above.
The reason you aren't getting any pure-sed answers is that fundamentally sed edits file contents, not file names; thus the answers that use sed all do something like echo the filename into a pipe (pseudo file), edit that with sed, then use mv to turn that back into a filename.
Since sed is out, here's a pure-bash version to add to the Perl, Python, etc scripts you have so far:
killpattern=$'[\r\n]' # remove both carriage returns and linefeeds
for f in *; do
if [[ "$f" == *$killpattern* ]]; then
mv "$f" "${f//$killpattern/}"
fi
done
...but since ${var//pattern/replacement} isn't available in plain sh (along with [[...]]), here's a version using sh-only syntax, and tr to do the character replacement:
for f in *; do
new="$(printf %s "$f" | tr -d "\r\n")"
if [ "$f" != "$new" ]; then
mv "$f" "$new"
fi
done
EDIT: If you really want it with sed, take a look at this:
http://www.linuxquestions.org/questions/programming-9/merge-lines-in-a-file-using-sed-191121/
Something along these lines should work similarly to the Perl version below (the echo makes it a dry run that only prints the mv commands):
for i in *; do echo mv "$i" "$(echo "$i" | sed ':a;N;s/\n//;ta')"; done
With Perl, try something along these lines:
for i in *; do mv "$i" "$(echo "$i" | perl -pe 's/\n//g')"; done
This will rename all files in the current folder by removing all newline characters from them. If you need to go recursive, you can use find instead - be aware of the escaping in that case, though.
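For the recursive case, one way to sidestep the escaping problem is to let find hand the paths to a small inline script as arguments, so they are never parsed as text. A minimal sketch, with strip_newlines_tree as a hypothetical helper name:

```sh
# strip_newlines_tree is a hypothetical helper name.
# Removes carriage returns and newlines from every name below directory $1.
strip_newlines_tree() {
    # -depth lists children before their parent directory, so a renamed
    # directory never invalidates a path that is still waiting to be processed.
    find "$1" -depth -exec sh -c '
        for path do
            dir=${path%/*}
            base=${path##*/}
            new=$(printf "%s" "$base" | tr -d "\r\n")
            # Rename only when the name changes and the target is free.
            if [ -n "$new" ] && [ "$new" != "$base" ] && [ ! -e "$dir/$new" ]; then
                mv -- "$path" "$dir/$new"
            fi
        done
    ' sh {} +
}
```

Because each path arrives as a separate argument via {} +, names with spaces, quotes or newlines reach mv unchanged. Running strip_newlines_tree . would process the current directory and everything below it.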
In fact there is a way to use sed:
carr=$'\n' # the character to remove (a newline)
for i in * # every file in the current dir
do
if [[ $i == *"$carr"* ]] # filenames containing that character
then
mv "$i" "$(printf '%s' "$i" | sed ':a;N;$!ba;s/\n//g')" # move!
fi
done
This actually works.

Resources