Bash: how to get the complete substring of a match in a string?

I have a TXT file, which is shipped from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file, but QString supports only UTF-8 (I want to avoid working with QByteArray). I've been struggling to find a way to do that in Qt, so I decided to write a small script that does the conversion for me. I have no problem writing it for exactly my case, but I would like to make it more general, for all ISO-8859 encodings.
So far I have the following:
#!/usr/bin/env bash
output=$(file -i "$1")
# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
    # Retrieve actual encoding
    encoding=...
    # run iconv to convert
    iconv -f "$encoding" "$1" -t UTF-8 -o "$1"
else
    echo "Text file not encoded in ISO-8859"
fi
The part that I'm struggling with is how to get the complete substring that was successfully matched by the grep command.
Let's say I have the file helloworld.txt and it's encoded in ISO-8859-15. In this case
$~: ./fixEncodingToUtf8 helloworld.txt
helloworld.txt: text/plain; charset=iso-8859-15
will be the output in the terminal. Internally the grep finds the iso-8859 part (since I use the -i flag it processes the input case-insensitively). At this point the script needs to "extract" the whole substring, namely not just iso-8859 but iso-8859-15, and store it inside the encoding variable to use later with iconv (which, phew, is case-insensitive when it comes to the names of the encodings).
NOTE: The script above could be extended even further by simply retrieving the value that follows charset= and using it as the encoding. However, this has one huge flaw: what if the input file has an encoding with a larger character set than UTF-8 (simple example: UTF-16 or UTF-32)?
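One approach (a sketch, untested; it assumes a grep that supports the -o and -E flags, which both GNU and BSD grep do): -o prints only the part of the line that actually matched, so an extended pattern can pull out the complete encoding name in one go:
# Extract the full, case-insensitively matched substring, e.g. "iso-8859-15"
encoding=$(echo "$output" | grep -oiE 'iso-8859-[0-9]+')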

Or using bash features like below
$ str="helloworld.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15
To save in variable
$ myvar="${str#*=}"
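Applied to the script above, using the asker's variable names (a sketch):
encoding="${output#*=}"   # strips everything up to and including the first '='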

You can use cut or awk to get at this:
awk:
encoding=$(echo "$output" | awk -F"=" '{print $2}')
cut:
encoding=$(echo "$output" | cut -d"=" -f2)
I think you could just feed this over to your iconv command directly and reduce your script to:
iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"

Well, in this case it is rather pointless…
$ file --brief --mime-encoding "$1"
iso-8859-15
From the file manual:
-b, --brief
Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
Like -i, but print only the specified element(s).
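Putting it together, the whole script could then shrink to something like the following untested sketch; it writes to a temporary file because feeding iconv's output straight back into its own input file is unsafe:
#!/usr/bin/env bash
encoding=$(file --brief --mime-encoding "$1")
if [[ $encoding == iso-8859* ]]; then
    iconv -f "$encoding" -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
else
    echo "Text file not encoded in ISO-8859"
fi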

Related

Bash 'cut' command for Mac

I want to split each line on the delimiter ":" and keep the first field. The input file is in the following format:
data1:data2
data11:data22
...
I have a Linux command:
cat merged.txt | cut -f1 -d ":" > output.txt
On a Mac terminal it gives an error:
cut: stdin: Illegal byte sequence
What is the correct way to do this in a Mac terminal?
Your input file (merged.txt) probably contains bytes or byte sequences that are not valid in your current locale. For example, your locale might specify UTF-8 character encoding, but the file might be in some other encoding and cannot be parsed as valid UTF-8. If this is the problem, you can work around it by telling cut to assume the "C" locale, which basically tells it to process the input as a stream of bytes without paying attention to encoding.
BTW, cat file | is what's commonly referred to as a Useless Use of Cat (UUOC); you can just use a standard input redirect < file instead, which is cleaner and more efficient. Thus, my version of your command would be:
LC_ALL=C cut -f1 -d ":" < merged.txt > output.txt
Note that since the LC_ALL=C assignment is a prefix to the cut command, it only applies to that one command and won't mess up other operations that should assume UTF-8 (or whatever your normal locale is).
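If you want to confirm that invalid UTF-8 really is the culprit, a quick check (a sketch, assuming an iconv binary is available) is to run the file through iconv and see whether it complains:
# Exits non-zero and prints an error if merged.txt is not valid UTF-8
iconv -f UTF-8 -t UTF-8 merged.txt > /dev/null && echo "valid UTF-8"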
Your cut command works for me on my Mac; you can try awk for the same result:
awk -F: '{print $1}' merged.txt
data1
data11

Substitute special character

I have a special character in my .txt file.
I want to substitute that special character ý with |
and rename the file to .mnt from .txt.
Here is my code: it renames the file to .mnt, but does not substitute the special character.
#!/bin/sh
for i in `ls *.txt 2>/dev/null`;
do
filename=`echo "$i" | cut -d'.' -f1`
sed -i 's/\ý/\|/g' $i
mv $i ${filename}.mnt
done
How to do that?
Example:
BEGIN_RUN_SQLýDELETE FROM PRC_DEAL_TRIG WHERE DEAL_ID = '1:2:1212'
You have multiple problems in your code. Don't use ls in scripts, and quote your variables. You should probably also use $(command substitution) rather than the legacy `command substitution` syntax.
If your task is to replace ý in the file's contents -- not in its name -- sed -i is not wrong, but superfluous; just write the updated contents to the new location and delete the old file.
#!/bin/sh
for i in *.txt
do
filename=$(echo "$i" | cut -d'.' -f1)
sed 's/ý/|/g' "$i" >"${filename}.mnt" && rm "$i"
done
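As a side note, the echo piped into cut can be replaced with a plain parameter expansion; a small sketch, which behaves the same for these names and is also safe for filenames containing extra dots:
filename=${i%.txt}    # strip the .txt suffix without spawning echo and cut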
If your system is configured for UTF-8, the character ý is representable with either the byte sequence \xc3 \xbd (the precomposed U+00FD) or the decomposed sequence \x79 \xcc \x81 (U+0079 + U+0301); you might find that the file contains one representation, while your terminal prefers another. The only way to really be sure is to examine the hex bytes in the file and on your terminal. It is also entirely possible that your terminal is not capable of displaying the contents of the file exactly. Try
bash$ printf 'ý' | xxd
00000000: c3bd
bash$ head -c 16 file | xxd
00000000: 4245 4749 4e5f 5255 4e5f 5351 4cff 4445 BEGIN_RUN_SQL.DE
If (as here) you find that they are different (the latter outputs the single byte \xff between "BEGIN_RUN_SQL" and "DE") then the trivial approach won't work. Your sed may or may not support passing in literal hex sequences to say exactly what to substitute; or perhaps try e.g. Perl if not.
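For example (a sketch, untested against your actual file, where file.txt stands for your input): GNU sed understands \xHH escapes, and Perl certainly does, so if the file really contains the single byte \xff you could try:
# GNU sed: \xHH names the exact byte to replace
sed -i 's/\xff/|/g' file.txt
# or with Perl, which also understands \xHH escapes
perl -pi -e 's/\xff/|/g' file.txt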

How to make 'grep' use text and regex at same time?

I have a bash script where I'm using grep to find text in a file. The search-text is stored in a variable.
found=$(grep "^$line$" "$file")
I need grep to use the regex while not interpreting the variable $line as regex. If, for example, $line contains a character which is a regex operator, like [, an error is triggered:
grep: Unmatched [
Is it somehow possible to make grep not interpret the content of $line as regex?
You can use the -F flag of grep to make it interpret the patterns as fixed strings instead of regular expressions.
In addition, if you want the patterns to match entire lines (as implied by your ^$line$ pattern), you can combine it with the -x flag. So the command in your post can be written as:
found=$(grep -Fx "$line" "$file")
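For instance, given a file with a line that is exactly [that, a hypothetical session would look like:
$ line='[that'
$ grep -Fx "$line" file
[that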
Another way to tell grep that the characters provided are ordinary characters is to escape them properly, for example:
$ cat file
this
[that
not[
$ line=\\[ # escaped to mark [ as an ordinary character
$ grep "$line" file
[that
not[
$ line=[is] # [ is part of a proper regex
$ grep "$line" file
this
So the key is to escape the regex chars:
$ line=[
$ echo "${line/[/\\[}"
\[
$ grep "^${line/[/\\[}" file
[that
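The substitution above only handles [; to backslash-escape every BRE metacharacter in $line at once, one common idiom (a sketch, assuming a POSIX sed) is:
# Escape the BRE metacharacters . [ \ * ^ $ before handing $line to grep
escaped=$(printf '%s' "$line" | sed 's/[.[\*^$]/\\&/g')
grep "^$escaped$" "$file"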
Or you could use awk:
$ line=[that
$ awk -v s="$line" '$0==s' file
[that

Replacing Foreign Characters with English Equivalents in filenames with UNIX Bash Script

I'm trying to use sed to process a list of filenames and replace every foreign character in the file name with an English equivalent, e.g.
málaga.txt -> malaga.txt
My script is the following:
for f in *.txt
do
newf=$(echo $f | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv $f $newf
done
This currently has no effect on the filenames. However, if I use the same expression to process a text file, e.g.
cat blah.txt | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/'
it works perfectly: all foreign characters are substituted with their English equivalents. Any help would be greatly appreciated. This is on Mac OS X in a UNIX shell.
This should do it:
for f in *.txt; do
    newf=$(echo "$f" | iconv -f utf-8-mac -t utf-8 | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
    mv "$f" "$newf"
done
iconv -f utf-8-mac -t utf-8 converts the text from utf-8-mac to utf-8, which resolves the precomposed/decomposed problem discussed in the comments by @PavelGurkov and @ninjalj.
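To see the two representations for yourself, a sketch (it assumes the iconv shipped on macOS, which knows the UTF-8-MAC encoding):
$ printf 'á' | xxd                                 # precomposed: U+00E1
00000000: c3a1
$ printf 'á' | iconv -f utf-8 -t utf-8-mac | xxd   # decomposed: U+0061 U+0301
00000000: 61cc 81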

How to search and replace text in an xml file with SED?

I have to convert a list of xml files in a folder from UTF-16 to UTF-8, remove the BOM, and then replace the keyword inside the file from UTF-16 to UTF-8.
I'm using cygwin to run a bash shell script to accomplish this, but I've never worked with SED before today and I need help!
I found a SED one liner for removing the BOM, now I need another for replacing the text from UTF-16 to UTF-8 in the xml header.
This is what I have so far:
#!/bin/bash
mkdir -p outUTF8
#Convert files to unix format.
find -exec dos2unix {} \;
#Use a for loop to convert all the xml files.
for f in `ls -1 *.xml`; do
sed -i -e '1s/^\xEF\xBB\xBF//' FILE
iconv -f utf-16 -t utf-8 $f > outUTF8/$f
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
echo $f
done
However, this line:
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
is hanging the script. Any ideas as to the proper format for this?
Try something like this -
for filename in *.xml; do
sed -i".bak" -e '1s/^\xEF\xBB\xBF//' "$filename"
iconv -f utf-16 -t utf-8 "$filename" > outUTF8/"$filename"
sed -i 's/UTF-16/UTF-8/g' outUTF8/"$filename"
done
The first sed makes a backup of each original file with the extension .bak. Then iconv converts the file and saves it under the newly created directory with the same filename. Lastly, the second sed makes an in-place change in the converted copy to replace UTF-16 with UTF-8.
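An alternative single-pass sketch (untested; it assumes GNU sed for the \xHH escapes): converting first, then stripping the BOM and rewriting the declaration in one pipeline, leaves the originals untouched:
mkdir -p outUTF8
for filename in *.xml; do
    # the UTF-16 BOM becomes the UTF-8 BOM EF BB BF after conversion
    iconv -f utf-16 -t utf-8 "$filename" |
        sed '1s/^\xEF\xBB\xBF//; s/UTF-16/UTF-8/g' > outUTF8/"$filename"
done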
Two things:
1. How big is your $f file? If it's really big, it may just take a long time to complete.
2. Oops, I see you have an echo $f at the bottom of your loop. Move it before the sed command so you can see whether there are any spaces in the filenames.
2a. :-) Or just change all references to $f to "$f" to protect against spaces.
I hope this helps.
