Bash descriptors in sub shell - bash

The snippet below was intended to let me listen to and select a few songs from a directory:
exec 3<&1
find /some/directory -name '*.mp3' -print0 | xargs -0 bash -c '
for i; do
mplayer -ss 10 "$i" 1<&3
read -p "Select? (y/n)" -n 1 choice 1<&3
if [ "$choice" = "y" -o "$choice" = "Y" ]; then
echo "$i" > /tmp/selected_songs.txt
fi
done
'
exec 3<&-
The intention was to have both mplayer and the shell's read accept input from the keyboard, but it isn't working. To that end, I thought FD 3 would point to keyboard input for both the find and xargs processes. It would then be passed on to the shell that xargs execs, where mplayer and read are executed; but that isn't what happens.
What's going wrong here?

First of all, I think you're making this harder than it has to be. If you have bash 4, use globstar.
shopt -s globstar
for i in /some/directory/**/*.mp3; do
mplayer -ss 10 "$i"
read -p "Select? (y/n)" -n 1 choice
case "$choice" in
[yY]) echo "$i" >> /tmp/selected_songs.txt;;
esac
done
Even if you don't, you can do this using find … -exec bash -c 'yourscript' _ {} + instead of using xargs.
(I also changed your > to >> because I assumed you didn't want to truncate the file at each pass.)
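As a rough sketch of the find … -exec form mentioned above (the function name list_mp3 and the printf placeholder are illustrative, not from the original question): the loop body receives the matched files as positional parameters, and because no pipe is involved, standard input stays connected to whatever it was before, normally the terminal.

```bash
#!/usr/bin/env bash
# list_mp3 DIR: hypothetical helper; prints each *.mp3 found under DIR.
# In a real script the printf would be replaced by the mplayer/read logic.
list_mp3() {
  find "$1" -name '*.mp3' -exec bash -c '
    for f; do                          # iterates over "$@", the files passed by find
      printf "candidate: %s\n" "$f"
    done
  ' _ {} +                             # "_" fills $0; "{} +" appends many files per call
}
```

Calling list_mp3 /some/directory would print one "candidate:" line per match.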
As for understanding the problem, several things are going wrong here, but I'll point out a few important ones:
Standard Input is FD 0, but exec 3<&1 duplicates 1 as 3 and opens it for reading.
You seem to be trying to change where keyboard input is going. That's tricky, because doing that is similar to sending an EOF to an interactive shell. Most shells will close when they encounter that. Instead, consider changing where xargs gets its input, and leave the keyboard alone. (BSD xargs has the -o option that is relevant. Check your manpage.)
You are doing the same redirect for both mplayer and read without a timeout. If you're looping and reading, how do you know where you are in the loop when you provide input?
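To make the descriptor point concrete, here is a minimal sketch (the function ask_per_item and the item names are made up for illustration). It saves the current standard input as FD 3 with exec 3<&0, so that a loop fed by a pipe can still read answers from the original input:

```bash
#!/usr/bin/env bash
# ask_per_item: hypothetical demo. Items arrive through a pipe; answers
# come from whatever stdin was when the function started (the keyboard,
# in interactive use).
ask_per_item() {
  exec 3<&0                            # duplicate the current stdin (FD 0) as FD 3
  printf '%s\n' song_a song_b |
    while IFS= read -r item; do        # this read consumes the pipe
      IFS= read -r answer <&3          # this read consumes the saved stdin
      echo "$item:$answer"
    done
  exec 3<&-                            # close FD 3 again
}
```

Run interactively, it waits for one typed line per item. Duplicating FD 1 instead of FD 0 would save the output side, which is why the original snippet did not behave as expected.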

Silly me! It's the FD 0 -- not 1 -- that should be redirected. A version that works as intended below:
exec 3<&0
find /media/jeenu/USB/ -type f -print0 | xargs -0 bash -c '
for i; do
mplayer -ss 10 "$i"
read -p "Select? (y/n/q)" -n 1 choice
case "$choice" in
[yY]) echo "$i" >> /tmp/selected_songs.txt;;
[qQ]) break;;
esac
done 0<&3
'
exec 3<&-

If you want a subshell with bash then you can do this:
find /media/jeenu/USB/ -type f -print0 | xargs -0 bash <<EOF
#This is a bash subshell
#put your code here
EOF

Related

Speed up shell script/Performance enhancement of shell script

Is there a way to speed up the shell script below? It takes a good 40 minutes to update about 150,000 files every day. Given the volume of files to create and update, this may be acceptable. However, if there is a much more efficient way to write this, or to rewrite the logic entirely, I'm open to it.
#!/bin/bash
DATA_FILE_SOURCE="<path_to_source_data/${1}"
DATA_FILE_DEST="<path_to_dest>"
for fname in $(ls -1 "${DATA_FILE_SOURCE}")
do
for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
do
FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
then
echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
else
if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
then
sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
"${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
fi
fi
done
done
Some parts of your published script are still unclear, such as the sed command. Nevertheless, I rewrote it with saner practices and far fewer external calls, which should really speed it up.
#!/usr/bin/env sh
DATA_FILE_SOURCE="<path_to_source_data/$1"
DATA_FILE_DEST="<path_to_dest>"
for fname in "$DATA_FILE_SOURCE/"*; do
while IFS=, read -r a b content || [ "$a" ]; do
destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
if grep -Fxq "$content" "$destfile"; then
sed -i "/$1/d" "$destfile"
fi
printf '%s\n' "$content" >>"$destfile"
done < "$fname"
done
Make it parallel (as much as you can).
#!/bin/bash
set -e -o pipefail
declare -ir MAX_PARALLELISM=20 # pick a limit
declare -i pid
declare -a pids
# ...
for fname in "${DATA_FILE_SOURCE}/"*; do
if ((${#pids[@]} >= MAX_PARALLELISM)); then
wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
unset 'pids[pid]'
fi
while IFS= read -r line; do
FILE_TO_WRITE_TO="..."
# ...
done < "${fname}" & # forking here
pids[$!]="${fname}"
done
for pid in "${!pids[@]}"; do
wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):
#!/bin/bash
set -e -o pipefail
declare -ir MAX_PARALLELISM=20 # pick a limit
declare -i pid
declare -a pids
do_something_and_maybe_fail() {
sleep $((RANDOM % 10))
return $((RANDOM % 2 * 5))
}
for fname in some_name_{a..f}{0..5}.txt; do # 36 items
if ((${#pids[@]} >= MAX_PARALLELISM)); then
wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
unset 'pids[pid]'
fi
do_something_and_maybe_fail & # forking here
pids[$!]="${fname}"
echo "${#pids[@]} running" 1>&2
done
for pid in "${!pids[@]}"; do
wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
done
Strictly avoid external processes (such as awk, grep and cut) when processing one-liners for each line. fork()ing is extremely inefficient in comparison to:
Running one single awk / grep / cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop.
Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes.
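As an illustration of the second point, the following sketch splits one CSV line using only bash builtins, with no awk or cut process per line. The function name split_line and the sample field values are assumptions chosen for the example; the .daily.csv naming follows the question's script.

```bash
#!/usr/bin/env bash
# split_line LINE: hypothetical helper. Prints the destination file name
# (field1.field2.daily.csv) and then the remaining fields, without forking.
split_line() {
  local line=$1 a b rest
  IFS=, read -r a b rest <<<"$line"    # "rest" keeps fields 3..n, commas included
  printf '%s.%s.daily.csv\n%s\n' "$a" "$b" "$rest"
}
```

For the input AAPL,NYSE,2020-01-01,1.5 this prints AAPL.NYSE.daily.csv followed by 2020-01-01,1.5.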
Off-topic side notes:
ls -1 is unnecessary. First, ls won’t write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / “no match” cases.)
Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.)
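A small sketch of both side notes together, assuming a hypothetical helper named count_lines: it uses a glob with nullglob instead of ls, and a redirected while read loop instead of cat. The function body runs in a subshell so that nullglob does not leak into the caller.

```bash
#!/usr/bin/env bash
# count_lines DIR: hypothetical helper; counts lines across all files in DIR.
count_lines() (
  shopt -s nullglob                    # an empty directory yields zero iterations
  n=0
  for f in "$1"/*; do
    # "|| [ -n "$line" ]" also counts a last line that lacks a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
      n=$((n + 1))
    done < "$f"
  done
  echo "$n"
)
```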

xargs output buffering -P parallel

I have a bash function that I call in parallel using xargs -P, like so:
echo ${list} | xargs -n 1 -P 24 -I# bash -l -c 'myAwesomeShellFunction #'
Everything works fine, but the output is interleaved for obvious reasons (no buffering).
I'm trying to figure out a way to buffer the output effectively. I was thinking I could use awk, but I'm not good enough to write such a script, and I can't find anything worthwhile on Google. Can someone help me write this "output buffer" in sed or awk? Nothing fancy: just accumulate the output and emit it after the process terminates. I don't care about the order in which the shell functions execute; I just need their output buffered. Something like:
echo ${list} | xargs -n 1 -P 24 -I# bash -l -c 'myAwesomeShellFunction # | sed -u ""'
P.S. I tried to use stdbuf as per
https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe but it did not work. I specified line buffering for stdout and stderr, but the output is still not buffered per process:
echo ${list} | xargs -n 1 -P 24 -I# stdbuf -i0 -oL -eL bash -l -c 'myAwesomeShellFunction #'
Here's my first attempt, this only captures first line of output:
$ bash -c "echo stuff;sleep 3; echo more stuff" | awk '{while (( getline line) > 0 )print "got ",$line;}'
$ got stuff
This isn't quite atomic if your output is longer than a page (4kb typically), but for most cases it'll do:
xargs -P 24 bash -c 'for arg; do printf "%s\n" "$(myAwesomeShellFunction "$arg")"; done' _
The magic here is the command substitution: $(...) creates a subshell (a fork()ed-off copy of your shell), runs the code ... in it, and then reads that in to be substituted into the relevant position in the outer script.
Note that we don't need -n 1 (if you're dealing with a large number of arguments -- for a small number it may improve parallelization), since we're iterating over as many arguments as each of your 24 parallel bash instances is passed.
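Here is a self-contained sketch of the command-substitution approach. The function noisy_job is a made-up stand-in for myAwesomeShellFunction, and run_buffered is an illustrative wrapper; -n 1 is kept only so that the four sample arguments actually run in parallel. Each job's two lines should come out adjacent because the whole captured output is printed in one go, subject to the size caveat mentioned above.

```bash
#!/usr/bin/env bash
# noisy_job ID: hypothetical job that prints two lines with a pause between.
noisy_job() {
  echo "start $1"
  sleep 0.1
  echo "end $1"
}
export -f noisy_job                    # make the function visible to child bash processes

# run_buffered ARGS...: runs noisy_job once per argument, 4 at a time.
run_buffered() {
  printf '%s\n' "$@" |
    xargs -P 4 -n 1 bash -c 'printf "%s\n" "$(noisy_job "$1")"' _
}
```

The order of the jobs is not deterministic; only the grouping of each job's lines is.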
If you want to make it truly atomic, you can do that with a lockfile:
# generate a lockfile, arrange for it to be deleted when this shell exits
lockfile=$(mktemp -t lock.XXXXXX); export lockfile
trap 'rm -f "$lockfile"' 0
xargs -P 24 bash -c '
for arg; do
{
output=$(myAwesomeShellFunction "$arg")
flock -x 99
printf "%s\n" "$output"
} 99>"$lockfile"
done
' _

What is the equivalent to xargs -r under OsX

Is there any equivalent under OS X to xargs -r under Linux? I'm trying to find a way to interrupt a pipe if there's no data.
For instance imagine you do the following:
touch test
cat test | xargs -r echo "content: "
That doesn't yield any result because xargs interrupts the pipe.
Is there either some hidden xargs option or something else to achieve the same result under OSX?
The POSIX standard for xargs mandates that the command be executed once, even if there are no arguments. This is a nuisance, which is why GNU xargs has the -r option. Unfortunately, neither BSD (MacOS X) nor the other mainstream Unix versions (AIX, HP-UX, Solaris) support it.
If it is crucial to you, obtain and install GNU xargs somewhere that your environment will find it, without affecting the system (so don't replace /usr/bin/xargs unless you're a braver man than I am — but /usr/local/bin/xargs might be OK, or $HOME/bin/xargs, or …).
You can use test or [:
if [ -s test ] ; then cat test | xargs echo content: ; fi
There is no standard way to determine if the xargs you are running is GNU or not. I set $gnuxargs to either "true" or "false" and then have a function that replaces xargs and does the right thing.
On Linux, FreeBSD and MacOS this script works for me. The POSIX standard for xargs mandates that the command be executed once, even if there are no arguments. FreeBSD and MacOS X violate this rule, thus don't need "-r". GNU finds it annoying, and adds -r. This script does the right thing and can be enhanced if you find a version of Unix that does it some other way.
#!/bin/bash
gnuxargs=$(xargs --version 2>&1 |grep -s GNU >/dev/null && echo true || echo false)
function portable_xargs_r() {
if $gnuxargs ; then
cat - | xargs -r "$@"
else
cat - | xargs "$@"
fi
}
echo 'this' > foo
echo '=== Expect one line'
portable_xargs_r <foo echo "content: "
echo '=== DONE.'
cat </dev/null > foo
echo '=== Expect zero lines'
portable_xargs_r <foo echo "content: "
echo '=== DONE.'
Here's a quick and dirty xargs-r using a temporary file.
#!/bin/sh
t=$(mktemp -t xargsrXXXXXXXXX) || exit
trap 'rm -f "$t"' EXIT HUP INT TERM
cat >"$t"
test -s "$t" || exit
exec xargs "$@" <"$t"
With POSIX xargs¹, to avoid running the-command when the input is empty, you could use moreutils's ifne (for if not empty):
... | ifne xargs ... the-command ...
Or use a sh wrapper that checks the number of arguments:
... | xargs ... sh -c '[ "$#" -eq 0 ] || exec the-command ... "$@"' sh
¹ though one can hardly use xargs POSIXly, as it doesn't support -0, has unspecified behaviour when the input is non-text (as with filenames, which on most systems are not guaranteed to be text except in the POSIX locale), parses its input in a very arcane and locale-dependent way, and doesn't give any guarantee if any word is more than 255 bytes long!
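A concrete instance of the sh wrapper, with echo standing in for the-command (the function name echo_unless_empty is only illustrative). The wrapper checks "$#" and returns without running anything when xargs passes no arguments:

```bash
#!/usr/bin/env bash
# echo_unless_empty: hypothetical demo; reads words on stdin and echoes them
# prefixed with "content:", or prints nothing at all for empty input.
echo_unless_empty() {
  xargs sh -c '[ "$#" -eq 0 ] || exec echo "content:" "$@"' sh
}
```

On an xargs that skips the command for empty input, the check is never reached; on one that runs it once, the check suppresses the output. Either way the result is the same.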
You could make sure that the input always has at least one line. This may not always be possible, but you'd be surprised how many creative ways this can be done.
A typical use case looks like:
find . -print0 | xargs -r -0 grep PATTERN
Some versions of xargs do not have an -r flag. In that case, you can supply /dev/null as the first filename so that grep is never handed an empty list of filenames. Since the pattern will never be found in /dev/null, this won't affect the output:
find . -print0 | xargs -0 grep PATTERN /dev/null
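The /dev/null trick can be wrapped as follows; grep_tree is a hypothetical name and the argument order is my own choice. One side effect worth knowing: because grep always sees at least two file operands when a real file is present, it prefixes every match with the file name.

```bash
#!/usr/bin/env bash
# grep_tree PATTERN DIR: hypothetical helper; searches all regular files
# under DIR. /dev/null guarantees grep never runs without a file operand,
# so it cannot fall back to reading standard input.
grep_tree() {
  find "$2" -type f -print0 | xargs -0 grep -- "$1" /dev/null
}
```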
You can test if the stream has any content:
cat test | {
  if IFS= read -r tmp; then              # try reading anything
    { printf "%s\n" "$tmp"; cat; } |     # re-emit the first line, then the rest of the input
      xargs echo "content: "             # and send it all to xargs
  fi                                     # otherwise just do nothing
}
# or with a function
# even TODO: add the check of `portable_xargs_r` in the other answer and call `xargs -r` when available.
xargs_r() {
if IFS= read -r tmp; then
{ printf "%s\n" "$tmp"; cat; } | xargs "$@"
fi
}
cat test | xargs_r echo "content: "
This method runs the check inside the pipe inside the subshell, so it effectively can be used in a complicated pipe setup.

Shell Scripting: Using xargs to execute parallel instances of a shell function

I'm trying to use xargs in a shell script to run parallel instances of a function I've defined in the same script. The function times the fetching of a page, and so it's important that the pages are actually fetched concurrently in parallel processes, and not in background processes (if my understanding of this is wrong and there's negligible difference between the two, just let me know).
The function is:
function time_a_url ()
{
oneurltime=$($time_command -p wget -p $1 -O /dev/null 2>&1 1>/dev/null | grep real | cut -d" " -f2)
echo "Fetching $1 took $oneurltime seconds."
}
How does one do this with an xargs pipe in a form that can take number of times to run time_a_url in parallel as an argument? And yes, I know about GNU parallel, I just don't have the privilege to install software where I'm writing this.
Here's a demo of how you might be able to get your function to work:
$ f() { echo "[$@]"; }
$ export -f f
$ echo -e "b 1\nc 2\nd 3 4" | xargs -P 0 -n 1 -I{} bash -c f\ \{\}
[b 1]
[d 3 4]
[c 2]
The keys to making this work are to export the function so the bash that xargs spawns will see it and to escape the space between the function name and the escaped braces. You should be able to adapt this to work in your situation. You'll need to adjust the arguments for -P and -n (or remove them) to suit your needs.
You can probably get rid of the grep and cut. If you're using the Bash builtin time, you can specify an output format using the TIMEFORMAT variable. If you're using GNU /usr/bin/time, you can use the --format argument. Either of these will allow you to drop the -p also.
You can replace this part of your wget command: 2>&1 1>/dev/null with -q. In any case, you have those reversed. The correct order would be >/dev/null 2>&1.
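The effect of redirection order is easy to check with a small experiment. The function names below are made up for the demo. Redirections are processed left to right, so 2>&1 >/dev/null points stderr at the original stdout before stdout is discarded, while >/dev/null 2>&1 discards both.

```bash
#!/usr/bin/env bash
# noisy: hypothetical command that writes one line to each stream.
noisy() {
  echo "to stdout"
  echo "to stderr" >&2
}

# stderr goes where stdout pointed originally; stdout is then discarded.
stderr_only() { noisy 2>&1 >/dev/null; }

# stdout is discarded first; stderr then follows it to /dev/null.
silent() { noisy >/dev/null 2>&1; }
```

Capturing stderr_only in a command substitution or pipe yields only the stderr line, which may be exactly what is wanted when parsing the output of time.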
On Mac OS X:
xargs: max. processes must be >0 (for: xargs -P [>0])
f() { echo "[$@]"; }
export -f f
echo -e "b 1\nc 2\nd 3 4" | sed 's/ /\\ /g' | xargs -P 10 -n 1 -I{} bash -c f\ \{\}
echo -e "b 1\nc 2\nd 3 4" | xargs -P 10 -I '{}' bash -c 'f "$@"' arg0 '{}'
If you install GNU Parallel on another system, you will see the functionality is in a single file (called parallel).
You should be able to simply copy that file to your own ~/bin.

Using xargs to assign stdin to a variable

All that I really want to do is make sure everything in a pipeline succeeded and assign the last stdin to a variable. Consider the following dumbed down scenario:
x=`exit 1|cat`
When I run declare -a, I see this:
declare -a PIPESTATUS='([0]="0")'
I need some way to notice the exit 1, so I converted it to this:
exit 1|cat|xargs -I {} x={}
And declare -a gave me:
declare -a PIPESTATUS='([0]="1" [1]="0" [2]="0")'
That is what I wanted, so I tried to see what would happen if the exit 1 didn't happen:
echo 1|cat|xargs -I {} x={}
But it fails with:
xargs: x={}: No such file or directory
Is there any way to have xargs assign {} to x? What about other methods of having PIPESTATUS work and assigning the stdin to a variable?
Note: these examples are dumbed down. I'm not really doing an exit 1, echo 1 or a cat, but used these commands to simplify so we can focus on my particular issue.
When you use backticks (or the preferred $()) you're running those commands in a subshell. The PIPESTATUS you're getting is for the assignment rather than the piped commands in the subshell.
When you use xargs, it knows nothing about the shell so it can't make variable assignments.
Try set -o pipefail then you can get the status from $?.
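A minimal sketch of the pipefail suggestion, using a hypothetical function pipeline_status. The body runs in a subshell so the option does not leak, and the assignment is deliberately not declared with local, since local x=$(...) would replace the status with that of local itself.

```bash
#!/usr/bin/env bash
# pipeline_status CMD...: hypothetical demo; runs "CMD | cat" inside a
# command substitution and reports both the captured output and the status.
pipeline_status() (
  set -o pipefail                      # a failing stage now fails the whole pipeline
  x=$("$@" | cat)                      # $? of a plain assignment is that of $(...)
  echo "status=$? x=$x"
)
```

For example, pipeline_status sh -c 'echo hi; exit 3' reports status=3 x=hi, whereas without pipefail the status would be that of cat.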
xargs is run in a child process, as are all the commands you call, so they can't affect the environment of your shell.
You might be able to do something with named pipes (mkfifo), or possibly with bash's read builtin.
EDIT:
Maybe just redirect the output to a file, then you can use PIPESTATUS:
command1 | command2 | command3 >/tmp/tmpfile
## Examine PIPESTATUS
X=$(cat /tmp/tmpfile)
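A slightly fuller sketch of the temporary-file idea follows. It swaps the fixed /tmp/tmpfile for mktemp, which is my substitution rather than part of the answer, and the pipeline stages are placeholders. The important detail is copying PIPESTATUS immediately, because the very next command overwrites it.

```bash
#!/usr/bin/env bash
# capture_and_check: hypothetical demo; the first stage prints a line and
# exits with status 2, the second stage is a plain cat.
capture_and_check() {
  local tmp x statuses
  tmp=$(mktemp) || return
  sh -c 'echo payload; exit 2' | cat >"$tmp"
  statuses=("${PIPESTATUS[@]}")        # copy right away; later commands reset it
  x=$(<"$tmp")                         # read the captured output back
  rm -f "$tmp"
  echo "x=$x statuses=${statuses[*]}"
}
```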
How about ...
read x <<<"$(echo 1)"
read x < <(echo 1)
echo "$x"
Why not just populate a new array?
IFS=$'\n' read -r -d '' -a result < <(echo a | cat | cat; echo "PIPESTATUS='${PIPESTATUS[*]}'" )
IFS=$'\n' read -r -d '' -a result < <(echo a | exit 1 | cat; echo "PIPESTATUS='${PIPESTATUS[*]}'" )
echo "${#result[@]}"
echo "${result[@]}"
echo "${result[0]}"
echo "${result[1]}"
There are already a few helpful solutions. It turns out that I actually had an example that matches the question as framed above; close-enough anyway.
Consider this:
XX=$(ls -l *.cpp | wc -l | xargs -I{} echo {})
echo $XX
3
This means I had 3 .cpp files in my working directory. Now $XX is 3, and I can make use of that result in my script. It is contrived, because I don't actually need xargs in this example. It works, though.
In the example from the question ...
x=`exit 1|cat`
I don't think that will give you what was specified. exit will quit the subshell before the cat gets a mention. On that note, I might start with something like
declare -a PIPESTATUS='([0]="0")'
x=$?
x now has the status from the last command.
Assign each line of input to an array, e.g. all python files in a directory
declare -a pyFiles=($(ls -l *.py | awk '{print $9}'))
where $9 is the ninth field in ls -l, corresponding to the filename
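Parsing ls -l breaks on file names that contain whitespace, since the unquoted expansion is split into words. As an alternative sketch (the function name list_py is made up), a glob fills the array directly and keeps each name intact:

```bash
#!/usr/bin/env bash
# list_py DIR: hypothetical helper; prints the number of *.py files in DIR,
# then one name per line. Runs in a subshell so cd and nullglob stay local.
list_py() (
  shopt -s nullglob                    # no match gives an empty array
  cd "$1" || exit
  pyFiles=(*.py)
  printf '%s\n' "${#pyFiles[@]}" "${pyFiles[@]}"
)
```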
