bash "map" equivalent: run command on each file [duplicate] - bash

I often have a command that processes one file, and I want to run it on every file in a directory. Is there any built-in way to do this?
For example, say I have a program data which outputs an important number about a file:
./data foo
137
./data bar
42
I want to run it on every file in the directory in some manner like this:
map data `ls *`
ls * | map data
to yield output like this:
foo: 137
bar: 42

If you are just trying to execute your data program on a bunch of files, the easiest/least complicated way is to use -exec in find.
Say you wanted to execute data on all txt files in the current directory (and subdirectories). This is all you'd need:
find . -name "*.txt" -exec data {} \;
If you wanted to restrict it to the current directory, you could do this:
find . -maxdepth 1 -name "*.txt" -exec data {} \;
There are lots of options with find.
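For example (a sketch, assuming GNU find and that data is somewhere on your PATH), you can filter by modification time, or hand find many files per invocation if data accepts more than one argument:
# only .txt files changed within the last day
find . -name "*.txt" -mtime -1 -exec data {} \;
# batch as many files as possible into each run of data
find . -name "*.txt" -exec data {} +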

If you just want to run a command on every file you can do this:
for i in *; do data "$i"; done
If you also wish to display the filename that it is currently working on then you could use this:
for i in *; do echo -n "$i: "; data "$i"; done

It looks like you want xargs:
find . -maxdepth 1 | xargs -d'\n' data
To print each command first, it gets a little more complex:
find . -maxdepth 1 | xargs -d'\n' -I {} bash -c "echo {}; data {}"
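If you only need to see what is being run, xargs also has a -t flag that prints each command line to stderr before executing it, which can be a simpler sketch than the bash -c form above:
find . -maxdepth 1 | xargs -d '\n' -t -n 1 data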

You should avoid parsing ls:
find . -maxdepth 1 | while read -r file; do do_something_with "$file"; done
or
while read -r file; do do_something_with "$file"; done < <(find . -maxdepth 1)
The latter doesn't run the while loop in a subshell, so variables set inside it remain visible after the loop.
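A quick way to see the difference (a sketch): a variable updated inside the piped loop is lost when the pipe ends, while the process-substitution form keeps it.
count=0
find . -maxdepth 1 | while read -r file; do count=$((count+1)); done
echo "$count"    # still 0: the loop ran in a subshell
count=0
while read -r file; do count=$((count+1)); done < <(find . -maxdepth 1)
echo "$count"    # the number of entries found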

The common methods are:
ls * | while read file; do data "$file"; done
for file in *; do data "$file"; done
The first can run into problems if you have whitespace in filenames, since the list comes from parsing ls; the glob-based for loop handles whitespace correctly because pathname expansion is not subject to word splitting. If you do have to work from a command's output rather than a glob, you can limit the damage by running it in a subshell with IFS restricted to a newline:
( IFS=$'\n'; for file in $(ls); do data "$file"; done )
You can easily wrap the first one up in a script:
#!/bin/bash
# map.bash
while read -r file; do
    "$1" "$file"
done
which can be executed as you requested - just be careful never to accidentally execute anything dumb with it. The benefit of using a looping construct is that you can easily place multiple commands inside it as part of a one-liner, unlike xargs where you'll have to place them in an executable script for it to run.
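For example, to get exactly the "name: value" output from the question with more than one command in the loop body (a sketch, assuming ./data prints one number per file):
for file in *; do
    printf '%s: ' "$file"
    ./data "$file"
done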
Of course, you can also just use the utility xargs:
find -maxdepth 0 * | xargs -n 1 data
Note that you should make sure indicators are turned off (ls --indicator-style=none) if you normally use them, or the @ appended to symlinks will turn them into nonexistent filenames.

GNU Parallel specializes in making these kind of mappings:
parallel data ::: *
It will run one job on each CPU core in parallel.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, the straightforward way to parallelize is to run 8 jobs on each CPU; GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
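To get output in the question's "foo: 137" shape, GNU Parallel's --tag option prefixes each job's output with its argument (tab-separated rather than colon-separated). A sketch, assuming the program is ./data in the current directory:
parallel --tag ./data ::: *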
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Since you specifically asked about this in terms of "map", I thought I'd share this function I have in my personal shell library:
# map_lines: evaluate a command for each line of input
map_lines()
{
    while IFS= read -r line; do
        "$1" "$line"
    done
}
I use this in the manner you asked for:
$ ls | map_lines ./data
I named it map_lines instead of map as I assumed some day I may implement a map_args where you would use it like this:
$ map_args ./data *
That function would look like this:
map_args()
{
    cmd="$1"; shift
    for arg; do
        "$cmd" "$arg"
    done
}

Try this:
for i in *; do echo "${i}:" "$(data "$i")"; done

You can create a shell script like so:
#!/bin/bash
cd /path/to/your/dir
for file in *; do
    ./data "$file"
done
That loops through every file in /path/to/your/dir and runs your "data" script on it. Be sure to chmod the above script so that it is executable.

You could also use PRLL.

ls doesn't handle blanks, linefeeds and other funky stuff in filenames and should be avoided where possible.
find is only needed if you want to descend into subdirectories, or if you want to make use of its other tests (-mtime, -size, you name it).
But many commands handle multiple files themselves, so they often don't need a loop at all. Instead of:
for d in * ; do du -s "$d"; done
just write:
du -s *
md5sum e*
identify *jpg
grep bash ../*.sh

I have just written this script specifically to address the same need.
http://gist.github.com/kindaro/4ba601d19f09331750bd
It uses find to build the set of files to map over, which allows finer selection of the input files but also leaves more room for mistakes.
I designed two modes of operation: the first mode runs a command with "source file" and "target file" arguments, while the second mode feeds a source file's contents to a command on stdin and writes the command's stdout into a target file.
We may further consider adding support for parallel execution, and maybe limiting the set of custom find arguments to the few most necessary ones; I am not really sure if that's the right thing to do.
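As a rough illustration of the second mode only (hypothetical names, not the actual code from the gist):
cmd=./data                        # the command to map over the files (assumption)
find . -type f -print0 |
while IFS= read -r -d '' src; do
    "$cmd" < "$src" > "${src}.out"   # stdin from the source, stdout to a target file
done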

Related

Rename files in bash based on content inside

I have a directory which has 70000 xml files in it. Each file has a tag which looks something like this, for the sake of simplicity:
<ns2:apple>, <ns2:orange>, <ns2:grapes>, <ns2:melon>. Each file has only one fruit tag, i.e. there cannot be both apple and orange in the same file.
I would like to rename every file (add "1_" before the beginning of each filename) which has one of <ns2:apple>, <ns2:orange>, or <ns2:melon> inside of it.
I can find such files with egrep:
egrep -r '<ns2:apple>|<ns2:orange>|<ns2:melon>'
So how would it look as a bash script, which I can then use as a cron job?
P.S. Sorry I don't have any bash script draft; I have very little experience with it and time is of the essence right now.
This may be done with this script:
#!/bin/sh
find /path/to/directory/with/xml -type f | while read -r f; do
    grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f" && mv "$f" "${f%/*}/1_${f##*/}"
done
But it will rescan the directory every time it runs and prepend another 1_ to every file that contains one of your tags, which means a lot of excess IO and, after a few runs, names like 1_1_1_1_file.xml.
You should probably think more about the design, e.g. move processed files into one of two directories depending on whether they contain the tags:
#!/bin/sh
# create output dirs
mkdir -p /path/to/directory/with/xml/with_tags/ /path/to/directory/with/xml/without_tags/
find /path/to/directory/with/xml -maxdepth 1 -mindepth 1 -type f | while read -r f; do
    if grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f"; then
        mv "$f" /path/to/directory/with/xml/with_tags/
    else
        mv "$f" /path/to/directory/with/xml/without_tags/
    fi
done
Run this command as a dry run first, then remove --dry-run to actually rename the files:
grep -Pl '(<ns2:apple>|<ns2:orange>|<ns2:melon>)' *.xml | xargs rename --dry-run 's/^/1_/'
The command-line utility rename comes in many flavors, and most of them should work for this task. I used rename version 1.601 by Aristotle Pagaltzis. To install rename, simply download its Perl script and place it somewhere in your $PATH, or install rename using conda, like so:
conda install rename
Here, grep uses the following options:
-P : Use Perl regexes.
-l : Suppress normal output; instead print the name of each input file from which output would normally have been printed.
SEE ALSO:
grep manual

Bash: remove first line of file, create new file with prefix in new dir

I have a bunch of files in a directory, old_dir. I want to:
remove the first line of each file (e.g. using "sed '1d'")
save the output as a new file with a prefix, new_, added to the original filename (e.g. using "{,new_}old_filename")
add these files to a different directory, new_dir, overwriting any conflicting filenames
How do I do this with a Bash script? Having trouble putting the pieces together.
#!/usr/bin/env bash
old_dir="/path/to/somewhere"
new_dir="/path/to/somewhere_else"
prefix="new_"
if [ ! -d "$old_dir" -o ! -d "$new_dir" ]; then
    echo "ERROR: We're missing a directory. Aborting." >&2
    exit 1
fi
for file in "$old_dir"/*; do
    tail -n +2 "$file" > "$new_dir"/"${prefix}${file##*/}"
done
The important parts of this are:
The for loop, which allows you to work on each $file.
tail -n +2, which outputs the file starting from its second line, i.e. removes the first line. If your tail does not support this, you can get the same result with sed -e 1d.
${file##*/} which is functionally equivalent to basename "$file" but without spawning a child.
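For example, both of these print report.txt, but only the second forks a child process:
file=/path/to/somewhere/report.txt
echo "${file##*/}"
echo "$(basename "$file")"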
Really, none of this is bash-specific. You could run this in /bin/sh in most operating systems.
Note that the code above is intended to explain a process. Once you understand that process, you may be able to come up with faster, shorter strategies for achieving the same thing. For example:
find "$old_dir" -depth 1 -type f -exec sh -c "tail +2 \"{}\" > \"$new_dir/$prefix\$(basename {})\"" \;
Note: I haven't tested this. If you plan to use either of these solutions, do make sure you understand them before you try, so that you don't clobber your data by accident.

find folders and cd into them

I wanted to write a short script with the following structure:
find the right folders
cd into them
replace an item
So my problem is that I get the right folders from find, but I don't know how to run the action for every line find gives me. I tried it with a for loop like this:
for item in $(find command)
do magic for item
done
but the problem is that this command will print the relative pathnames, and if there is a space within my path it will split the path at this point.
I hope you understood my problem and can give me a hint.
You can run commands with -exec option of find directly:
find . -name some_name -exec your_command {} \;
One way to do it is:
find command -print0 |
while IFS= read -r -d '' item ; do
... "$item" ...
done
-print0 and read ... -d '' cause the NUL character to be used to separate paths, and ensure that the code works for all paths, including ones that contain spaces and newlines. Setting IFS to empty and using the -r option to read prevents the paths from being modified by read.
Note that the while loop runs in a subshell, so variables set within it will not be visible after the loop completes. If that is a problem, one way to solve it is to use process substitution instead of a pipe:
while IFS= ...
...
done < <(find command -print0)
Another option, if you have got Bash 4.2 or later, is to use the lastpipe option (shopt -s lastpipe) to cause the last command in pipelines to be run in the current shell.
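A sketch of the lastpipe variant (it only takes effect when job control is off, which is the normal case in non-interactive scripts):
#!/bin/bash
shopt -s lastpipe
count=0
find . -name '*pattern*' -type d -print0 |
while IFS= read -r -d '' item; do
    count=$((count+1))
done
echo "$count"   # visible here: the loop ran in the current shell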
If the pattern you want to find is simple enough and you have bash 4 you may not need find. In that case, you could use globstar instead for recursive globbing:
#!/bin/bash
shopt -s globstar
for directory in **/*pattern*/; do
    (
        cd "$directory"
        do stuff
    )
done
The parentheses make each operation happen in a subshell. That may have performance cost, but usually doesn't, and means you don't have to remember to cd back each time.
If globstar isn't an option (because your find instructions are not a simple pattern, or because you don't have a shell that supports it) you can use find in a similar way:
find . -whatever -exec bash -c 'cd "$1" && do stuff' _ {} \;
You could use + instead of ; to pass multiple arguments to bash each time, but doing one directory per shell (which is what ; would do) has similar benefits and costs to using the subshell expression above.
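A sketch of the + form, which hands a batch of directories to each bash instance and loops over them inside it:
find . -whatever -exec bash -c 'for d; do (cd "$d" && do stuff); done' _ {} +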

Find recursive/xargs/cp/awk/sed/single quote in single quote together in a one-liner

I am trying to create a shell one-liner to find all JPEGs in a directory recursively, then copy them all out to an external directory, renaming them according to their date and time and appending a random integer to avoid overwrites between images with the same timestamp.
First Attempt:
find /storage/sdcard0/tencent/MicroMsg/ -type f -iname '*.jpg' -print0 | xargs -0 sh -c 'for filename; do echo "$filename" && cp "$filename" $(echo /storage/primary/legacy/image3/$(stat $filename |awk '/Mod/ {print $2"_"$3}'|sed s/:/-/g)_$RANDOM.jpg);done' fnord
Among other things, the above doesn't work because the single quotes around the awk script fall inside the sh -c single quotes.
The second attempt should do the same thing without sh -c, but gives me this error on stat:
stat: can't stat '': No such file or directory
/system/bin/sh: file: not found
Second Attempt:
find /storage/sdcard0/tencent/MicroMsg/ -type f -iname '*.jpg' -print0 | xargs -0 file cp "$file" $(echo /storage/primary/legacy/image3/$(stat "$file" | awk '/Mod/ {print $2"_"$3}'|sed s/:/-/g)_$RANDOM.jpg)
I think the problem with the second attempt may be too many subshells?
Can anyone help me know where I'm going wrong here?
On another note: if anyone knows how to preserve the actual modified date/time stamps when copying a file, I would love to throw that in here.
Thank you Thank you
Were it my problem, I'd create a script — call it filecopy.sh — like this:
TARGET="/storage/primary/legacy/image3"
for file in "$@"
do
    basetime=$(date +'%Y-%m-%d.%H-%M-%S' -d @$(stat -c '%Y' "$file"))
    cp "$file" "$TARGET/$basetime.$RANDOM.jpg"
done
The basetime line runs stat to get the modification time of the file in seconds since The Epoch, then uses that with date to format the time as a modified ISO 8601 format (using - in place of :, and . in place of T). This is then used to create the target file name, along with a semi-random number.
Then the find command becomes simply:
SOURCE="/storage/sdcard0/tencent/MicroMsg"
find "$SOURCE" -type f -iname '*.jpg' -exec /path/to/filecopy.sh {} +
Personally, I'd not bother to try making it work without a separate shell script. It could be done, but it would not be trivial:
SOURCE="/storage/sdcard0/tencent/MicroMsg"
find "$SOURCE" -type f -iname '*.jpg' -exec bash -c \
'TARGET="/storage/primary/legacy/image3"
for file in "$#"
do
basetime=$(date +%Y-%m-%d.%H-%M-%S -d #$(stat -c %Y "$file"))
cp "$file" "$TARGET/$basetime.$RANDOM.jpg"
done' command {} +
I've taken some liberties in that by removing the single quotes that I used in the main shell script. They were optional, but I'd use them automatically under normal circumstances.
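On the aside about preserving the original modification times: cp -p keeps mode, ownership and timestamps, so either version's copy line could become (a sketch):
cp -p "$file" "$TARGET/$basetime.$RANDOM.jpg"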
If you have GNU Parallel > version 20140722 you can run:
find . | parallel 'cp {} ../destdir/{= $a = int(10000*rand); $_ = `date -r "$_" +%FT%T"$a"`; chomp; =}'
It will work on file names containing ' and space, but fail on file names containing ".
All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizable:
Run the same program on many files
Run the same program for every line in a file
Run the same program for every block in a file

List the contents of all subdirectories of the current directory - parse ls vs. globbing

In the book "Beginning Portable Shell Scripting" by Peter Seebach there is an example how to list the contents of all subdirectories of the current directory:
#!/bin/sh
/bin/ls | while read file
do
if test -d "$file"; then
( cd "$file" && ls )
fi
done
I learned that parsing ls is bad and globbing should be preferred. Do you think the author chose parsing because there is a portability issue?
I would do:
#!/bin/sh
for file in *
do
if test -d "$file"; then
( cd "$file" && ls )
fi
done
Thanks,
Somebody
Neither solution is robust against weird filenames, nor do they handle directories whose names begin with ".". I would write this using find, e.g.:
find . -maxdepth 1 -type d -exec ls '{}' ';'
but first I'd question what output is actually required, either for a person to eyeball or a further script to digest.
You'll probably be able to do with a single find what would otherwise cost a lot of process forks in a for/while ... do ... done loop.
Globbing is much preferred over parsing ls since it will handle filenames that include spaces and other characters.
For the specific case in your question, you may be able to use nothing more than:
ls */
