Remove Unicode characters from text files - sed, other Bash/shell methods - bash

How do I remove Unicode characters from a bunch of text files in the terminal?
I've tried this, but it didn't work:
sed 'g/\u'U+200E'//' -i *.txt
I need to remove these Unicode characters from the text files:
U+0091 - sort of weird "control" space
U+0092 - same sort of weird "control" space
U+00A0 - non-breaking space
U+200E - left to right mark

Strip all non-ASCII characters from file.txt (both commands print to standard output):
$ iconv -c -f utf-8 -t ascii file.txt
$ strings file.txt
Options:
-c # discard unconvertible characters
-f # from ENCODING
-t # to ENCODING
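To clean a batch of files in place rather than printing to standard output, a minimal sketch (the .tmp suffix is just an illustrative choice):
for f in *.txt; do
iconv -c -f utf-8 -t ascii "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done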

If you want to remove only particular characters and you have Python, you can:
CHARS=$(python -c 'print u"\u0091\u0092\u00a0\u200E".encode("utf8")')
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt
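That snippet uses Python 2 syntax; if only Python 3 is available, a roughly equivalent sketch (assuming a UTF-8 locale so the characters survive the command substitution) would be:
CHARS=$(python3 -c 'print("\u0091\u0092\u00a0\u200E")')
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt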

If the files are UTF-8 encoded, you can match those characters by their UTF-8 byte sequences with sed:
sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'
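Those are the raw UTF-8 byte sequences of the four characters, so this needs GNU sed (for the \x escapes and \| alternation). Applied in place to a batch of files, a sketch could look like:
LC_ALL=C sed -i 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g' *.txt
(LC_ALL=C forces byte-wise matching, which is what a byte-level pattern like this wants.)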

Use iconv:
iconv -f utf8 -t ascii//TRANSLIT < /tmp/utf8_input.txt > /tmp/ascii_output.txt
This will transliterate characters like "Š" into "S" (the most similar-looking ASCII equivalents).
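For example (illustrative only - the exact transliterations depend on the iconv implementation and the current locale):
$ echo 'Škoda' | iconv -f utf8 -t ascii//TRANSLIT
Skoda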

Convert Swift files from UTF-8 to ASCII:
for file in *.swift; do
iconv -f utf-8 -t ascii "$file" > "$file".tmp
mv -f "$file".tmp "$file"
done
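Note that a plain ascii target makes iconv fail on the first character it can't convert, and the unconditional mv -f would then overwrite the original with a truncated file. A slightly more defensive sketch - using //TRANSLIT here is an assumption about the iconv build; the -c flag described earlier would work too:
for file in *.swift; do
iconv -f utf-8 -t ascii//TRANSLIT "$file" > "$file.tmp" && mv -f "$file.tmp" "$file"
done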

Bash: how to get the complete substring of a match in a string?

I have a TXT file, which is shipped from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file, but QString supports only UTF-8 (I want to avoid working with QByteArray). I've been struggling to find a way to do that in Qt, so I decided to write a small script that does the conversion for me. I have no problem writing it for exactly my case, but I would like to make it more general - for all ISO-8859 encodings.
So far I have the following:
#!/usr/bin/env bash
output=$(file -i $1)
# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
# Retrieve actual encoding
encoding=...
# run iconv to convert
iconv -f $encoding $1 -t UTF-8 -o $1
else
echo "Text file not encoded in ISO-8859"
fi
The part that I'm struggling with is how to get the complete substring that has been successfully matched by the grep command.
Let's say I have the file helloworld.txt and it's encoded in ISO-8859-15. In this case
$~: ./fixEncodingToUtf8 helloworld.txt
helloworld.txt: text/plain; charset=iso-8859-15
will be the output in the terminal. Internally, grep finds "iso-8859" (since I use the -i flag, it matches case-insensitively). At this point the script needs to "extract" the whole matched substring - not just iso-8859 but iso-8859-15 - and store it in the encoding variable to use later with iconv (which, phew, is case-insensitive when it comes to encoding names).
NOTE: The script above can be extended even further by simply retrieving the value that follows charset= and using it for the encoding. However, this has one huge flaw - what if the input file has an encoding with a larger character set than UTF-8 (simple examples: UTF-16 and UTF-32)?
Or use bash features, like below:
$ str="stations.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15
To save it in a variable:
$ myvar="${str#*=}"
You can use cut or awk to get at this:
awk:
encoding=$(echo "$output" | awk -F"=" '{print $2}')
cut:
encoding=$(echo "$output" | cut -d"=" -f2)
I think you could just feed this over to your iconv command directly and reduce your script to:
iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"
Well, in this case parsing the whole output is rather pointless…
$ file --brief --mime-encoding "$1"
iso-8859-15
From the file manual:
-b, --brief
Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
Like -i, but print only the specified element(s).
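Putting that together, the original script can be reduced to a sketch like this; it converts into a temporary file first, since having iconv read and write the same file is risky (the .tmp name is just illustrative):
#!/usr/bin/env bash
encoding=$(file --brief --mime-encoding "$1")
if [[ "$encoding" == iso-8859* ]]; then
# run iconv to convert
iconv -f "$encoding" -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
else
echo "Text file not encoded in ISO-8859"
fi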

Replacing Foreign Characters with English Equivalents in filenames with UNIX Bash Script

I'm trying to use sed to process a list of filenames and replace every foreign character in the file name with an English equivalent. E.g.
málaga.txt -> malaga.txt
My script is the following:
for f in *.txt
do
newf=$(echo $f | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv $f $newf
done
This currently has no effect on the filenames. However if I use the same regex to process a text file. E.g.
cat blah.txt | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/'
It works perfectly - all foreign characters are substituted with their English equivalents. Any help would be greatly appreciated. This is on Mac OS X in a UNIX shell.
This should do it:
for f in *.txt; do
newf=$(echo "$f" | iconv -f utf-8-mac -t utf-8 | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv "$f" "$newf"
done
iconv -f utf-8-mac -t utf-8 converts the text from utf-8-mac to utf-8, which resolves the precomposed/decomposed problem discussed in the comments by @PavelGurkov and @ninjalj.
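If you'd rather not enumerate every accented character by hand, an alternative sketch is to let iconv do the transliteration - this assumes an iconv with //TRANSLIT support (as in the libiconv shipped on macOS), and the exact results vary by implementation, so test it before trusting it with real files:
for f in *.txt; do
newf=$(echo "$f" | iconv -f utf-8-mac -t ascii//TRANSLIT)
[ "$f" != "$newf" ] && mv "$f" "$newf"
done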

Command line text find/replace for ^M (\r) and ^K (\v)

I'm trying to write a shell script that (among other things) will replace Windows line endings (^M) and vertical tabs (^K) with newlines. Sed looks like the tool to use, but I can't quite get it. I can't see why this won't work:
$ sed -i 's/^K/\n/g' article_filemakerExport.xml
sed: 1: "article_filemakerExport ...": command a expects \ followed by text
Note: I'm working on a mac.
With the Windows line ending, you want to remove the ^M (or \r or carriage return), but you want to replace the ^K with newline, it would seem.
The command I'd use is tr, twice.
tr -d '\r' < article_filemakerExport.xml | tr '\13' '\12' > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$
Given that one operation is delete and the other substitute, I don't think you can combine those into a single tr invocation. You can use cp tmp.$$ article_filemakerExport.xml; rm -f tmp.$$ if you're worried about links, etc.
You could also use dos2unix to convert the CRLF to NL line endings instead of tr.
Note that tr is a pure filter; it only reads standard input and only writes to standard output. It does not read or write files directly.
Actually, I need to replace both of these with a newline.
That's easier: a single invocation of tr will do the job:
tr '\13\15' '\12\12' < article_filemakerExport.xml > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$
Or, if you prefer:
tr '\13\r' '\n\n' < article_filemakerExport.xml > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$
Control-K is the vertical tab, and tr also understands the \v escape for it (see the table below).
(Added the && and || rm -f tmp.$$ commands at the suggestion of Ed Morton.)
Partial list of control characters
C Oct Dec Hex Unicode Name
\a 07 7 07 U+0007 BELL
\b 10 8 08 U+0008 BACKSPACE
\t 11 9 09 U+0009 HORIZONTAL TABULATION
\n 12 10 0A U+000A LINE FEED
\v 13 11 0B U+000B VERTICAL TABULATION
\f 14 12 0C U+000C FORM FEED
\r 15 13 0D U+000D CARRIAGE RETURN
You can find a complete set of these control characters at the Unicode site (http://www.unicode.org/charts/PDF/U0000.pdf). No doubt there are many other possible places to look too.
dos2unix <article_filemakerExport.xml | tr '\013\015' '\n\n'
A BSD (OS X) sed solution, assisted by ANSI C-quoted bash strings:
sed -i "" $'s/\r$/\\\n/g; s/\v/\\\n/g' article_filemakerExport.xml
Note:
BSD sed - unlike GNU sed - requires an argument with the -i option; so, to indicate that no backup file should be created, an empty string ("") must be passed - see below for how that explains the error you got.
The command replaces \r\n with \n\n rather than \n, which is what I understand you want (to get just \n, simply make the 2nd substitution string empty; to replace \r even when not followed directly by \n, remove the $ after \r).
Here's a proof of concept with sample input:
$ sed $'s/\r$/\\\n/g; s/\v/\\\n/g' <<<$'one\vtwo\r\nthree\nfour'
one
two
three
four
(All line breaks in the output above are \n.)
An ANSI C-quoted string ($'...') is needed to compensate for the lack of support for escape sequences in BSD sed: the shell creates desired control characters ($'\v' creates a vertical tab (^K; $'\13' would work too), $'\r' the CR (^M), $'\n' the newline) and passes the resulting literals to sed.
\\\n results in a literal \ followed by a literal newline - BSD sed requires literal newlines in the replacement string to be \-escaped (and doesn't support the escape code \n).
As for why your command didn't work:
Note: It looks like your problems stem at least in part from assuming that BSD sed works the same as GNU sed, which, unfortunately, is not the case: there are many subtle and not so subtle differences - see https://stackoverflow.com/a/24276470/45375
The missing argument for the -i option caused sed to interpret your program as the -i argument, and your filename as the program. Since your filename starts with a, sed saw the a (append text) command, and choked on the rest of the filename (because it's not a valid a command).
Even fixing the missing -i option argument wouldn't have made the command work, for the reasons listed above (in short: no support for control-character escape sequences), and also because of the attempt to represent a vertical tab as the literal string ^K (in GNU sed you could have used \v directly).
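For comparison, on a platform with GNU sed the ANSI C-quoting isn't needed, since GNU sed understands the \r, \v and \n escapes itself - a GNU-sed-only sketch of the same \r\n-to-\n\n behaviour:
sed -i 's/\r$/\n/; s/\v/\n/g' article_filemakerExport.xml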

bash-replacing a number with unicode character using sed

So I have this output generated from printf
011010
Now I want to pipe it through sed and replace the 0's and 1's with Unicode characters, so I get Unicode characters printed instead of the binary (011010).
I can do this by copy-pasting the characters themselves, but I want to use code point values instead, like the ones found in the Unicode table:
Position: 0x2701
Decimal: 9985
Symbol: ✁
How do I use the above values with sed to generate the character?
With bash (since v4.2) or zsh, the simple solution is to use the $'...' syntax, which understands C escapes including \u escapes:
$ echo 011010 | sed $'s/1/\u2701/g'
0✁✁0✁0
If you have GNU sed, you can use escape sequences in the s// command. GNU sed, unfortunately, does not understand \u Unicode escapes, but it does understand \x hex escapes. However, to get it to decode them, you need to make sure that it sees the backslashes. Then you can do the translation in UTF-8, assuming you know the UTF-8 sequence corresponding to the Unicode codepoint:
$ # Quote the argument
$ echo 011010 | sed 's/1/\xE2\x9C\x81/g'
0✁✁0✁0
$ # Or escape the backslashes
$ echo 011010 | sed s/1/\\xE2\\x9C\\x81/g
0✁✁0✁0
$ # This doesn't work because the \ is removed by bash before sed sees it
$ echo 011010 | sed s/1/\xE2\x9C\x81/g
0xE2x9Cx81xE2x9Cx810xE2x9Cx810
$ # So that was the same as: sed s/1/xE2x9Cx81/g
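Another option, if you'd rather not depend on sed's escape handling at all, is to let printf build the character and hand sed the literal bytes (the octal escapes 342 234 201 are the same UTF-8 bytes E2 9C 81 written in octal):
$ echo 011010 | sed "s/1/$(printf '\342\234\201')/g"
0✁✁0✁0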

How to search and replace text in an xml file with SED?

I have to convert a list of xml files in a folder from UTF-16 to UTF-8, remove the BOM, and then replace the keyword UTF-16 with UTF-8 inside each file.
I'm using Cygwin to run a bash shell script to accomplish this, but I've never worked with SED before today and I need help!
I found a SED one-liner for removing the BOM; now I need another for replacing the text UTF-16 with UTF-8 in the xml header.
This is what I have so far:
#!/bin/bash
mkdir -p outUTF8
#Convert files to unix format.
find -exec dos2unix {} \;
#Use a for loop to convert all the xml files.
for f in `ls -1 *.xml`; do
sed -i -e '1s/^\xEF\xBB\xBF//' FILE
iconv -f utf-16 -t utf-8 $f > outUTF8/$f
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
echo $f
done
However, this line:
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
is hanging the script. Any ideas as to the proper format for this?
Try something like this -
for filename in *.xml; do
sed -i".bak" -e '1s/^\xEF\xBB\xBF//' "$filename"
iconv -f utf-16 -t utf-8 "$filename" > outUTF8/"$filename"
sed -i 's/UTF-16/UTF-8/g' outUTF8/"$filename"
done
The first sed will make a backup of your original files with an extension .bak. Then iconv will convert each file and save it under the newly created directory with the same filename. Lastly, the in-place sed will replace UTF-16 with UTF-8 in the converted file.
Two things:
1. How big is your $f file? If it's really big, it may just take a long time to complete.
2. Oops, I see you have an echo $f at the bottom of your loop. Move it before the sed command so you can see if there are any spaces in the filenames.
2a. :-) Or just change all references to $f to "$f" to protect against spaces.
I hope this helps.
