How to convert files in Unix using iconv? - bash

I'm new to Bash scripting. I have a requirement to convert multiple input files in UTF-8 encoding to ISO 8859-1.
I am using the below command, which is working fine for the conversion part:
cd ${DIR_INPUT}/
for f in *.txt; do iconv -f UTF-8 -t ISO-8859-1 $f > ${DIR_LIST}/$f; done
However, when I don't have any text files in my input directory ($DIR_INPUT), it still creates an empty .txt file in my output directory ($DIR_LIST).
How can I prevent this from happening?

The empty file *.txt is being created in your output directory because, by default, bash expands an unmatched glob to the literal pattern string that you supplied. You can change this behaviour in a number of ways, but what you're probably looking for is shopt -s nullglob. Observe:
$ for i in a*; do echo "$i"; done
a*
$ shopt -s nullglob
$ for i in a*; do echo "$i"; done
$
You can find documentation about this in the bash man page under Pathname Expansion.
In your case, I'd probably rewrite this in this way:
shopt -s nullglob
for f in "$DIR_INPUT"/*.txt; do
    iconv -f UTF-8 -t ISO-8859-1 "$f" > "${DIR_LIST}/${f##*/}"
done
This avoids the need for the initial cd, and uses parameter expansion to strip off the path portion of $f for the output redirection. With nullglob set, the loop body is simply never entered when no .txt files exist, so no empty output file is created.
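For illustration, ${f##*/} removes the longest prefix matching the pattern */, i.e. everything up to and including the last slash (hypothetical path shown):
$ f=/some/input/dir/data.txt
$ echo "${f##*/}"
data.txt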

As @ghoti pointed out, in the absence of files matching the wildcard expression a*, the expression itself becomes the result of pathname expansion. By default (when the nullglob option is unset), a* is expanded to, literally, a*.
You can set the nullglob option, of course. But be aware that all subsequent pathname expansions will then be affected, unless you unset the option after the loop.
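If you only want nullglob for the duration of the loop, one sketch is to capture the option's prior state with shopt -p and replay it afterwards:
restore=$(shopt -p nullglob)    # prints the shopt command that recreates the current state
shopt -s nullglob
for f in "$DIR_INPUT"/*.txt; do
    : # ... process "$f" ...
done
$restore                        # re-enables or re-disables nullglob, whichever it was before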
I would rather use the find command, which has a clear interface (and, in my opinion, leaves less room for surprising implicit expansions than Bash globbing). E.g.:
cmd='iconv --verbose -f UTF-8 -t ISO-8859-1 "$0" > "$1"/$(basename "$0")'
find "${DIR_INPUT}/" \
    -mindepth 1 \
    -maxdepth 1 \
    -type f \
    -name '*.txt' \
    -exec sh -c "$cmd" {} "${DIR_LIST}" \;
In the example above, $0 and $1 are positional arguments for the file path and ${DIR_LIST} respectively. The command is invoked via standard shell (sh) because of the need to refer to the file path {} twice. Although most modern implementations of find may handle multiple occurrences of {} correctly, the POSIX specification states:
If more than one argument containing the two characters "{}" is present, the behavior is unspecified.
As in the for loop, the -name pattern *.txt is evaluated as true if the basename of the current pathname matches the operand (*.txt) using the pattern matching notation. But, unlike the for loop, filename expansion does not apply, as this is a matching operation, not an expansion.
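A sketch of a stylistic variant: pass a dummy name as $0 so that the file path lands in $1 and any error messages from sh are attributed to that label rather than to a file path (the label convert-txt is arbitrary):
find "${DIR_INPUT}/" -mindepth 1 -maxdepth 1 -type f -name '*.txt' \
    -exec sh -c 'iconv -f UTF-8 -t ISO-8859-1 "$1" > "$2"/$(basename "$1")' convert-txt {} "${DIR_LIST}" \;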

Related

Use glob patterns stored in a file for searching - bash

How can I use glob patterns that are stored in a text file as input for a file search?
I want to search all subdirectories for files that match glob patterns stored in a text file.
The file with glob patterns looks like:
/Dongle/src/*.c
/App/files/**/*.xml
...
How can I use that file as input for the find command in bash?
What I tried so far is:
modelFile=$1
root="Project/"
regexes=$(cat $modelFile)
outFile="out.txt"
for re in $regexes; do
    find $root -type f -regex $re > $outFile
done
But it does not match any files. It works if I use it like:
(...)
for re in $regexes; do
find $root -type f -regex "/App/files/**/*.xml" > $outFile
done
for re in $regexes; do
find $root -type f -regex "/Dongle/src/*.c " >> $outFile
done
I don't necessarily have to use find. Every other bash command would work as well.
The output should be every file that matches the glob patterns.
These are not regexes, these are glob patterns.
Furthermore, find is not really needed to produce the matches: the patterns themselves can be resolved by the shell directly, provided your shell supports the globstar and nullglob extensions.
shopt -s globstar nullglob
while IFS= read -r glob; do
    for path in $glob; do    # $glob deliberately unquoted so the shell expands it
        echo "$path"
    done
done < "$modelFile" > "$outFile"
Some other important issues with your original code:
Redirection with > to the same file inside the loop overwrites the file on every iteration; redirect the whole loop instead, or append with >> (see the sketch below)
Always enclose variables used as command-line arguments in double quotes
Process input line by line using a while read loop instead of a for loop
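A toy demonstration of the redirection point, with hypothetical patterns a, b and c:
# Truncates out.txt on every iteration; only the last find's output survives:
for re in a b c; do find "$root" -type f -name "$re" > out.txt; done
# Redirect the loop as a whole (or use >>) so every iteration's output is kept:
for re in a b c; do find "$root" -type f -name "$re"; done > out.txt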

How to deal with `*` expansion when there are no files

I am making a shell script that allows you to select a file from a directory using YAD. I am doing this:
list='';
exc='!'
for f in "$SHOTS_NOT_CONVERTED_DIR"/*;do
    f=`basename $f`
    list="${list}${exc}${f}"
done
The problem is that if there are no files in that directory, I end up with a selection with *.
What's the easiest, most elegant way to make this work in Bash?
The goal is to have an empty list if there are no files there.
* expansion is called a glob expression. The bash manual calls it filename expansion.
You need to set the nullglob option. Doing so gives you an empty result if the glob expression does not find files:
shopt -s nullglob
list='';
exc='!'
for f in "$SHOTS_NOT_CONVERTED_DIR"/*;do
# Btw, use $() instead of ``
f=$(basename "$f")
list="${list}${exc}${f}"
done
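An array-based sketch of the same idea (bash arrays assumed) avoids spawning a basename subshell for every file:
shopt -s nullglob
files=( "$SHOTS_NOT_CONVERTED_DIR"/* )   # empty array when the directory has no files
files=( "${files[@]##*/}" )              # strip the directory prefix from every element
list=''
for f in "${files[@]}"; do
    list+="!${f}"
done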

How can I make this code shorter and more correct? (searching and copying files)

This code recursively searches a folder and copies every file modified within the last DAYSAGO days.
#!/bin/bash
directory=~/somefolder
DAYSAGO=8
for ((a=0; a <= DAYSAGO ; a++))
do
    find $directory -mtime $a -type f | while read file;
    do
        cp "$file" -t ~/The\ other\ folder/
    done
done
Try the following:
#!/usr/bin/env bash
directory=~/'somefolder'
DAYSAGO=8
find "$directory" -mtime -$(( DAYSAGO + 1 )) -type f -exec cp -t ~/'The other folder'/ {} +
Using - to prefix the -mtime argument applies less-than logic to the argument value. All find tests that take numeric arguments support this logic (and its counterpart, +, for more-than logic). Tip of the hat to miracle173.
Since the desired logic is <= $DAYSAGO, 1 is added using an arithmetic expansion ($(( ... ))), to achieve the desired logic (needless to say, $DAYSAGO could be redefined with less-than logic in mind, to 9, so as to make the arithmetic expansion unnecessary).
Using -exec with the + terminator invokes the specified command with (typically) all matching filenames at once, which is much more efficient than piping to a shell loop.
{} is the placeholder for the list of matching filenames, and note that with + it must be the last argument before the + terminator (by contrast, with the invoke-once-for-each-matching-file terminator \;, the {} can be placed anywhere).
Note that the command above therefore only works with cp implementations that support the -t option, which allows placing the target directory first, notably, GNU cp (BSD/OSX cp and the POSIX specification, by contrast, do NOT support -t).
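If you are stuck with a cp that lacks -t, a sketch of a portable workaround is a small sh wrapper that puts the target directory last (folder name as in the question):
find "$directory" -mtime -$(( DAYSAGO + 1 )) -type f \
    -exec sh -c 'cp "$@" "$HOME/The other folder/"' sh {} +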
Also note the changes in quoting:
directory=~/'somefolder': single-quoting literal somefolder - while not strictly necessary in this particular case - ensures that the enclosed name works even if it contains embedded spaces or other shell metacharacters.
Note, however, that the ~/ part must remain unquoted for the ~ to expand to the current user's home dir.
"$directory": double-quoting the variable reference ensures that its value is not interpreted further by the shell, making it safe to use paths with embedded whitespace and other shell metacharacters.
~/'The other folder'/ provides a more legible alternative to ~/The\ other\ folder/ (and is also easier to type), demonstrating the same mix of unquoted and quoted parts as above.
You don't need the while loop at all. Using it as you are exposes you to problems with some corner cases like filenames containing newlines and other whitespace. Just use the -exec primary.
find "$directory" -mtime "$a" -type f -exec cp {} -t ~/The\ other\ folder/ \;
UPDATE: use mklement0's answer, though; it's more efficient.

Remove all files except files with certain extension

This removes all files that end with .a or .b
$ ls *.a
a.a b.a c.a
$ ls *.b
a.b b.b c.b
$ rm *.a *.b
How would I do the opposite and remove all files that end with *.* except the ones that end with *.a and *.b?
The linked answer has useful info, though the question is somewhat ambiguous and the answers use differing interpretations.
The simplest approach in your case is probably (a streamlined version of https://stackoverflow.com/a/10448940/45375):
(GLOBIGNORE='*.a:*.b'; rm *.*)
Note the use of a subshell, (...), to localize setting the GLOBIGNORE variable.
The patterns assigned to GLOBIGNORE must be :-separated.
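A quick illustration with some hypothetical files:
$ ls
a.a  b.b  c.txt  d.doc
$ (GLOBIGNORE='*.a:*.b'; echo *.*)
c.txt d.doc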
The appeal of this approach is that you can use a single subshell without changing global state.
By contrast, getting away with a single subshell with shopt -s extglob requires a bit of trickery:
(shopt -s extglob; glob='*.!(a|b)'; echo $glob)
Note the mandatory use of an intermediate variable, without which the command would break (because a literal glob would be expanded BEFORE executing the commands, at which point the extended globbing syntax is not yet recognized).
Caveat: Using GLOBIGNORE has an unexpected side effect (bug?):
If GLOBIGNORE is set - to whatever value - pathname expansion of * and *.* behaves as if shell option dotglob were in effect - even if it isn't.
In other words: If GLOBIGNORE is set, hidden files not explicitly exempted by a pattern in GLOBIGNORE are always matched by * and *.*.
dotglob is OFF by default, causing * NOT to include hidden files (if GLOBIGNORE is not set, which is true by default).
If you also wanted to exclude hidden files while using GLOBIGNORE, add the following pattern: .*; applied to the question, you'd get:
(GLOBIGNORE='*.a:*.b:.*'; rm *.*)
By contrast, using extended globbing after turning on the extglob shell option DOES respect the dotglob option.
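The side effect is easy to demonstrate with a hypothetical hidden file:
$ ls -A
.hidden.txt  a.a  c.txt
$ echo *.*
a.a c.txt
$ (GLOBIGNORE='*.a'; echo *.*)
.hidden.txt c.txt
With GLOBIGNORE set, .hidden.txt suddenly matches *.* even though dotglob is off.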
You can enable extended glob in bash:
shopt -s extglob
Then you can use:
rm *.!(a|b)
To remove all files that end with *.* except the ones that end with *.a OR *.b
Update: (Thanks to @mklement0) Here is a way to localize setting extglob (without altering the global state) by doing this in a subshell and using an intermediate variable:
(shopt -s extglob; glob='*.!(a|b)'; rm $glob)
There are some shells that are capable of this (I think?); bash, however, is not by default. If you are running bash on Cygwin, you can do this:
rm $(ls -1 | grep -v '\.a$' | grep -v '\.b$')
ls -1 (that's a one) lists all files in the current directory, one per line.
grep -v '\.a$' filters out all names that end in .a
grep -v '\.b$' filters out all names that end in .b
Be aware that parsing ls output like this breaks on filenames containing whitespace, so treat it as an interactive convenience rather than script material.
Sometimes it's better to not insist on solving a problem a certain way. And for the general problem of "acting on certain files to be determined in some tricky way", find is probably the best all-around tool you'll find.
find . -maxdepth 1 -type f ! -name \*.[ab] -delete
Omit the -maxdepth 1 if you want to recurse into subdirectories.

how do I use the grep --include option for multiple file types?

When I want to grep all the html files in some directory, I do the following
grep --include="*.html" pattern -R /some/path
which works well. The problem is how to grep all the html,htm,php files in some directory?
From this Use grep --exclude/--include syntax to not grep through certain files, it seems that I can do the following
grep --include="*.{html,php,htm}" pattern -R /some/path
But sadly, it would not work for me.
FYI, my grep version is 2.5.1.
You can use multiple --include flags. This works for me:
grep -r --include=*.html --include=*.php --include=*.htm "pattern" /some/path/
However, you can do as Deruijter suggested. This works for me:
grep -r --include=*.{html,php,htm} "pattern" /some/path/
Don't forget that you can use find and xargs for this sort of thing too; pairing -print0 with xargs -0 keeps filenames containing whitespace intact:
find /some/path/ \( -name "*.htm*" -o -name "*.php" \) -print0 | xargs -0 grep "pattern"
tl;dr
# Works in bash, ksh, and zsh.
grep -R '--include=*.'{html,php,htm} pattern /some/path
Using {html,php,htm} can only work as a brace expansion, which is a nonstandard (not POSIX-compliant) feature of bash, ksh, and zsh.
In other words: do not try to use it in a script that targets /bin/sh - use explicit multiple --include arguments in that case.
grep itself does not understand {...} notation.
For a brace expansion to be recognized, it must be an unquoted (part of a) token on the command line.
A brace expansion expands to multiple arguments, so in the case at hand grep ends up seeing multiple --include=... options, just as if you had passed them individually.
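You can see this for yourself by substituting printf for grep:
$ printf '%s\n' '--include=*.'{html,php,htm}
--include=*.html
--include=*.php
--include=*.htm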
The results of a brace expansion are subject to globbing (filename expansion), which has pitfalls:
Each resulting argument could further be expanded to matching filenames if it happens to contain unquoted globbing metacharacters such as *.
While this is unlikely with tokens such as --include=*.html (e.g., you'd have to have a file literally named something like --include=foo.html for something to match), it is worth keeping in mind in general.
If the nullglob shell option happens to be turned on (shopt -s nullglob) and globbing matches nothing, the argument will be discarded.
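For instance, assuming no file in the current directory matches the pattern:
$ shopt -s nullglob
$ printf '[%s]\n' --include=*.html
[]
The entire option silently vanished before printf ever saw it.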
Therefore, for a fully robust solution, use the following:
grep -R '--include=*.'{html,php,htm} pattern /some/path
'--include=*.' is treated as a literal, due to being single-quoted; this prevents inadvertent interpretation of * as a globbing character.
{html,php,htm}, the (of necessity) unquoted brace expansion[1], expands to 3 arguments which, because {...} directly follows the '...' token, each include that token.
Therefore, after quote removal by the shell, the following 3 literal arguments are ultimately passed to grep:
--include=*.html
--include=*.php
--include=*.htm
[1] More accurately, only the syntax-relevant parts of the brace expansion must be unquoted; the list elements may still be individually quoted, and must be if they contain globbing metacharacters that could result in unwanted globbing after the brace expansion. While not necessary in this case, the above could be written as
'--include=*.'{'html','php','htm'}
Try removing the double quotes
grep --include=*.{html,php,htm} pattern -R /some/path
Is this not working?
grep pattern /some/path/*.{html,php,htm}
It works for the same purpose, but without the --include option, and it works on grep 2.5.1 as well.
Conversely, grep -v -E ".*\.(html|htm|php)" filters out lines matching those extensions.
Try this.
-r will do a recursive search.
-s will suppress file not found errors.
-n will show you the line number of the file where the pattern is found.
grep "pattern" <path> -r -s -n --include=*.{c,cpp,C,h}
Use grep with the find command. Note that the -name tests must be grouped with \( ... \); otherwise the implicit AND binds -type f and -exec to the last -name only:
find /some/path -type f \( -name '*.html' -o -name '*.htm' -o -name '*.php' \) \
    -exec grep PATTERN {} +
You can use -regex and -regextype options too.
