Replacing Foreign Characters with English Equivalents in filenames with UNIX Bash Script - bash

I'm trying to use sed to process a list of filenames and replace every foreign character in the file name with an English equivalent, e.g.
málaga.txt -> malaga.txt
My script is the following:
for f in *.txt
do
newf=$(echo $f | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv $f $newf
done
This currently has no effect on the filenames. However, if I use the same sed expression to process a text file, e.g.
cat blah.txt | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/'
it works perfectly: all foreign characters are substituted with their English equivalents. Any help would be greatly appreciated. This is on Mac OS X in a UNIX shell.

This should do it:
for f in *.txt; do
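newf=$(printf '%s\n' "$f" | iconv -f utf-8-mac -t utf-8 | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
# Quote the names so spaces survive, and skip files whose name did not change
[ "$f" = "$newf" ] || mv -- "$f" "$newf"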
done
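iconv -f utf-8-mac -t utf-8 converts the text from utf-8-mac to utf-8, which resolves the precomposed/decomposed problem discussed in the comments by @PavelGurkov and @ninjalj. On macOS, filenames typically come back in decomposed form (a base letter followed by a combining accent), so the precomposed characters in the sed list never match until the name is converted.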
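To see the precomposed/decomposed difference, it can help to compare the raw bytes of the two forms. The sketch below is only an illustration: hex_bytes is a hypothetical helper, not part of the original answer, and the characters are written as octal escapes so the result does not depend on how an editor or terminal normalizes the text.

```shell
# hex_bytes: hypothetical helper that prints stdin as one continuous hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Precomposed "á" (U+00E1) is two bytes in UTF-8.
precomposed=$(printf '\303\241' | hex_bytes)

# Decomposed "á" is "a" (U+0061) followed by a combining acute accent (U+0301).
decomposed=$(printf 'a\314\201' | hex_bytes)

printf 'precomposed: %s\ndecomposed:  %s\n' "$precomposed" "$decomposed"
```

Running something like `printf '%s' "$f" | hex_bytes` on a real filename shows which of the two forms the filesystem handed back.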

Related

How to replace characters within a substring with sed or awk?

I need to replace special characters from some file names (and only file names) in an HTML document. I know how to replace special characters in the whole text with tr or sed, I know how to replace the file name with another given string with sed (e.g. 's,src="\([^"]*\)",src="newprefixtofilename_\1"'), but I am not sure sed can in some way match characters inside what I get in \1?
If sed is not able to do this, how can I do it, e.g. with awk? It is probably possible to isolate the " delimited strings that are prefixed with src= and do a gsub on these only. I can assume that src= appears only in tags (so no "real" HTML parsing) and that there is only one string to match per file line.
Example input line:
<img src="spécial.png"> Spécial
<img src="piètre.png"> Some text including "piètre"
Desired output with [éî] replaced by [ei] only in filenames:
<img src="special.png"> Spécial
<img src="pietre.png"> Some text including "piètre"
You can't do this with sed directly (I don't know about awk, though). First create a secondary file in which every non-ASCII character is transliterated to its ASCII equivalent, then diff the two files and replace the differences.
I strongly suggest trying it on test data first.
# Transliterate non-ASCII characters to ASCII
$ iconv -f utf-8 -t ascii//translit files.html > tmp.txt
# Create arrays (IFS if files have spaces, otherwise redundant)
$ IFS=$'\n'
$ FROM=($(diff files.html tmp.txt | grep '^<.*<img' | sed -r 's/.*src="([^"]*)".*/\1/'))
$ TO=($(diff files.html tmp.txt | grep '^>.*<img' | sed -r 's/.*src="([^"]*)".*/\1/'))
# Rename files (mv spécial.png special.png)
$ for ((i=0; i < ${#FROM[@]}; i++)); do mv "${FROM[$i]}" "${TO[$i]}"; done
# Change html src attributes
$ for ((i=0; i < ${#FROM[@]}; i++)); do sed -i "s/${FROM[$i]}/${TO[$i]}/" files.html; done
# End result
$ cat files.html
<img src="special.png"> Spécial
<img src="pietre.png"> Some text including "piètre"
Stating the requirement: replace special characters (é->e, î->i) only inside src="..." tokens.
Assuming the HTML files are reasonably formatted (specifically, each full img tag is on one line), this can be achieved by replacing each special character with its own 's' command.
The first expression handles é->e, the second î->i:
sed -e 's,src="\([^"]*\)é\([^"]*"\),src="\1e\2,g' \
-e 's,src="\([^"]*\)î\([^"]*"\),src="\1i\2,g'
The above solution will not handle a src that contains the same special character more than once (e.g., src="xîzîtîFi.png"). If this is an issue, and a small fixed number of repeats is acceptable (three in the example below), repeat each expression:
# é->e
sed -e 's,src="\([^"]*\)é\([^"]*"\),src="\1e\2,g' \
-e 's,src="\([^"]*\)é\([^"]*"\),src="\1e\2,g' \
-e 's,src="\([^"]*\)é\([^"]*"\),src="\1e\2,g' \
-e 's,src="\([^"]*\)î\([^"]*"\),src="\1i\2,g' \
-e 's,src="\([^"]*\)î\([^"]*"\),src="\1i\2,g' \
-e 's,src="\([^"]*\)î\([^"]*"\),src="\1i\2,g'
I'm sure it is possible to use labels and branching to perform the above substitution more elegantly and handle an unlimited number of special characters.
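One possible shape for the label/branch idea is sketched below. It assumes, like the answer above, that each src="..." value sits on a single line. The function name fix_src_accents is made up for this example. Each loop replaces one occurrence and then uses `t` to jump back to its label for as long as a substitution succeeded.

```shell
# fix_src_accents: hypothetical filter; reads HTML on stdin, writes to stdout.
# Only characters between src=" and the closing quote are touched.
fix_src_accents() {
    sed -e ':a' -e 's,\(src="[^"]*\)é\([^"]*"\),\1e\2,' -e 'ta' \
        -e ':b' -e 's,\(src="[^"]*\)î\([^"]*"\),\1i\2,' -e 'tb'
}
```

For example, `fix_src_accents < files.html > files.fixed.html` would rewrite every é and î inside src values, however often they repeat, and leave the same characters elsewhere on the line alone. Further characters need one more label and `s`/`t` pair each.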
Renaming files
The renaming part can use sed's transliterate command (y). Something like:
for file in FILELIST ; do
new_name=$(printf '%s\n' "$file" | sed -e 'y/éî/ei/')
# Only rename when the name actually changed
if [ "$file" != "$new_name" ] ; then
mv -- "$file" "$new_name"
fi
done

Substitute special character

I have a special character in my .txt file.
I want to substitute that special character ý with |
and rename the file to .mnt from .txt.
Here is my code: it renames the file to .mnt, but does not substitute the special character
#!/bin/sh
for i in `ls *.txt 2>/dev/null`;
do
filename=`echo "$i" | cut -d'.' -f1`
sed -i 's/\ý/\|/g' $i
mv $i ${filename}.mnt
done
How to do that?
Example:
BEGIN_RUN_SQLýDELETE FROM PRC_DEAL_TRIG WHERE DEAL_ID = '1:2:1212'
You have multiple problems in your code. Don't use ls in scripts and quote your variables. You should probably use $(command substitution) rather than the legacy `command substitution` syntax.
If your task is to replace ý in the file's contents -- not in its name -- sed -i is not wrong, but superfluous; just write the updated contents to the new location and delete the old file.
#!/bin/sh
for i in *.txt
do
filename=$(echo "$i" | cut -d'.' -f1)
sed 's/ý/|/g' "$i" >"${filename}.mnt" && rm "$i"
done
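One detail in the loop above may be worth tightening: cut -d'.' -f1 keeps only the text before the first dot, so a name such as report.2024.txt would end up as report.mnt. A minimal sketch of an alternative, using suffix-removing parameter expansion, follows; mnt_name is a hypothetical helper name, not something from the original answer.

```shell
# mnt_name: hypothetical helper; maps NAME.txt to NAME.mnt.
# ${1%.txt} strips only a trailing ".txt", so earlier dots are preserved.
mnt_name() {
    printf '%s\n' "${1%.txt}.mnt"
}
```

Inside the loop this could be used as `sed 's/ý/|/g' "$i" >"$(mnt_name "$i")" && rm "$i"`.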
If your system is configured for UTF-8, the character ý is representable with either the byte sequence \xc3 \xbd (representing U+00FD) or the decomposed sequence \x79 \xcc \x81 (U+0079 + U+0301); you might find that the file contains one representation, while your terminal prefers another. The only way to really be sure is to examine the hex bytes in the file and on your terminal. It is also entirely possible that your terminal is not capable of displaying the contents of the file exactly. Try
bash$ printf 'ý' | xxd
00000000: c3bd
bash$ head -c 16 file | xxd
00000000: 4245 4749 4e5f 5255 4e5f 5351 4cff 4445 BEGIN_RUN_SQL.DE
If (as here) you find that they are different (the latter outputs the single byte \xff between "BEGIN_RUN_SQL" and "DE") then the trivial approach won't work. Your sed may or may not support passing in literal hex sequences to say exactly what to substitute; or perhaps try e.g. Perl if not.
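If the separator really is the single byte \xff, as in the hex dump above, one approach that avoids hex escapes in sed is tr with an octal escape in the C locale. This is a sketch under that assumption; replace_ff is a hypothetical name.

```shell
# replace_ff: hypothetical filter; turns every 0xFF byte on stdin into "|".
# LC_ALL=C makes tr work on single bytes rather than multibyte characters.
replace_ff() {
    LC_ALL=C tr '\377' '|'
}
```

It could replace the sed call in the loop, for example `replace_ff <"$i" >"${filename}.mnt" && rm "$i"`.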

Bash: how to get the complete substring of a match in a string?

I have a TXT file, which is shipped from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file but QString supports only UTF-8 (I want to avoid working with QByteArray). I've been struggling to find a way to do that in Qt, so I decided to write a small script that does the conversion for me. I have no problem writing it for exactly my case, but I would like to make it more general: for all ISO-8859 encodings.
So far I have the following:
#!/usr/bin/env bash
output=$(file -i $1)
# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
# Retrieve actual encoding
encoding=...
# run iconv to convert
iconv -f $encoding $1 -t UTF-8 -o $1
else
echo "Text file not encoded in ISO-8859"
fi
The part that I'm struggling with is how to get the complete substring that has been successfully matched in the grep command.
Let's say I have the file helloworld.txt and it's encoded in ISO-8859-15. In this case
$~: ./fixEncodingToUtf8 helloworld.txt
stations.txt: text/plain; charset=iso-8859-15
will be the output in the terminal. Internally the grep finds the iso-8859 (since I use the -i flag it processes the input in a case-insensitive way). At this point the script needs to "extract" the whole substring namely not just iso-8859 but iso-8859-15 and store it inside the encoding variable to use it later with iconv (which is case insensitive (phew!) when it comes to the name of the encodings).
NOTE: The script above can be extended even further by simply retrieving the value that follows charset and using it for the encoding. However this has one huge flaw - what if the input file has an encoding that has a larger character set than UTF-8 (simple example: UTF-16 and UTF-32)?
You can also use bash parameter expansion, as below:
$ str="stations.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15
To save it in a variable:
$ myvar="${str#*=}"
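The expansion above can be wrapped in a small function that also reports when no charset is present. This is only a sketch; charset_from_mime is a hypothetical name, and it expects a line in the format that file -i prints.

```shell
# charset_from_mime: hypothetical helper.
# Prints whatever follows the last "charset=" in its argument,
# or returns 1 if the argument contains no charset at all.
charset_from_mime() {
    case $1 in
        *charset=*) printf '%s\n' "${1##*charset=}" ;;
        *) return 1 ;;
    esac
}
```

In the script from the question this might be used as `encoding=$(charset_from_mime "$output") || exit 1`.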
You can use cut or awk to get at this:
awk:
encoding=$(echo "$output" | awk -F"=" '{print $2}')
cut:
encoding=$(echo "$output" | cut -d"=" -f2)
I think you could just feed this to your iconv command directly and reduce your script to:
iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"
Well, in this case it is rather pointless…
$ file --brief --mime-encoding "$1"
iso-8859-15
file manual
-b, --brief
Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
Like -i, but print only the specified element(s).

How to recursively convert all filenames in folder subtree from UTF-8 to ASCII in Linux

I'm quite new to bash scripting, and I would like to recursively convert all the filenames in a folder from UTF-8 encoding to ASCII (which is a very portable encoding).
I think that iconv command would be of some use:
iconv -f utf8 -t ascii ...
But I'm not sure how to use it exactly.
Ideally the bash script should print some hint about its progress, like the name of the file it just converted.
Thank you very much.
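find /my/path -type f > utf8list
# Convert the whole list to ASCII; //TRANSLIT asks iconv to approximate
# characters that have no ASCII equivalent (support varies by implementation)
iconv -f utf-8 -t ascii//TRANSLIT utf8list > asciilist
# Read both lists in lockstep so names containing spaces stay intact
# (names containing newlines are not handled)
while IFS= read -r file <&3 && IFS= read -r newname <&4; do
#mv -- "$file" "$newname"
echo "mv $file $newname"
done 3<utf8list 4<asciilist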

How to search and replace text in an xml file with SED?

I have to convert a list of xml files in a folder from UTF-16 to UTF-8, remove the BOM, and then replace the keyword inside the file from UTF-16 to UTF-8.
I'm using cygwin to run a bash shell script to accomplish this, but I've never worked with SED before today and I need help!
I found a SED one liner for removing the BOM, now I need another for replacing the text from UTF-16 to UTF-8 in the xml header.
This is what I have so far:
#!/bin/bash
mkdir -p outUTF8
#Convert files to unix format.
find -exec dos2unix {} \;
#Use a for loop to convert all the xml files.
for f in `ls -1 *.xml`; do
sed -i -e '1s/^\xEF\xBB\xBF//' FILE
iconv -f utf-16 -t utf-8 $f > outUTF8/$f
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
echo $f
done
However, this line:
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
is hanging the script. Any ideas as to the proper format for this?
Try something like this -
for filename in *.xml; do
sed -i".bak" -e '1s/^\xEF\xBB\xBF//' "$filename"
iconv -f utf-16 -t utf-8 "$filename" > outUTF8/"$filename"
sed -i 's/UTF-16/UTF-8/g' outUTF8/"$filename"
done
The first sed makes a backup of each original file with the extension .bak. iconv then converts the file and saves it under the same filename in the outUTF8 directory. Lastly, the second sed edits the converted file in place, replacing UTF-16 with UTF-8 in the XML header.
Two things:
1. How big is your $f file? If it's really big, it may just take a long time to complete.
2. Oops, I see you have an echo $f at the bottom of your loop. Move it before the sed command so you can see whether there are any spaces in the filenames.
2a. Or just change all references to $f to "$f" to protect against spaces.
I hope this helps.
