BASH Script - Safe limits for string from command output

Good day,
I am writing a relatively simple BASH script that performs an SVN UP command, captures the console output, then does some post processing on the text.
For example:
#!/bin/bash
# A script to alter SVN logs a bit
# Update and get output
echo "Waiting for update command to complete..."
TEST_TEXT=$(svn up --set-depth infinity)
echo "Done"
# Count number of lines in output and report it
NUM_LINES=$(echo $TEST_TEXT | grep -c '.*')
echo "Number of lines in output log: $NUM_LINES"
# Print out only lines containing Makefile
echo $TEST_TEXT | grep Makefile
This works as expected (ie: as commented in the code above), but I am concerned about what would happen if I ran this on a very large repository. Is there a limit on the maximum buffer size BASH can use to hold the output of a console command?
I have looked for similar questions, but found nothing quite like what I'm searching for. I've read up on how certain scripts need to use xargs when intermediate argument lists get too long, and I'm wondering if something similar applies here with respect to capturing console output.
eg:
# Might fail if we have a LOT of results
find -iname *.cpp | rm
# Shouldn't fail, regardless of number of results
find -iname *.cpp | xargs rm
Thank you.

Using
var=$(hexdump /dev/urandom | tee out)
bash didn't complain; I killed it at a bit over 1G and 23.5M lines. You don't need to worry as long as your output fits in your system's memory.
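If you are curious how much you are actually capturing, a quick hedged check using the question's own svn command (bash's ${#var} expansion gives the length of the string in characters):
TEST_TEXT=$(svn up --set-depth infinity)
echo "captured ${#TEST_TEXT} characters in TEST_TEXT"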

I see no reason not to use a temporary file here.
tmp_file=$(mktemp XXXXX)
svn up --set-depth=infinity > "$tmp_file"
echo "Done"
# Count number of lines in output and report it
NUM_LINES=$(wc -l < "$tmp_file")
echo "Number of lines in output log: $NUM_LINES"
# Print out only lines containing Makefile
grep Makefile "$tmp_file"
rm "$tmp_file"
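If the script might die between creating and removing the temporary file, a variant using trap (a sketch along the same lines, not a drop-in replacement) cleans up automatically on exit:
tmp_file=$(mktemp) || exit 1
trap 'rm -f "$tmp_file"' EXIT       # remove the temp file however the script ends
svn up --set-depth infinity > "$tmp_file"
echo "Done"
NUM_LINES=$(wc -l < "$tmp_file")
echo "Number of lines in output log: $NUM_LINES"
grep Makefile "$tmp_file"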

Related

How to delay `redirection operator` of BASH `>`

First I create 3 files:
$ touch alpha bravo carlos
Then I want to save the list to a file:
$ ls > info.txt
However, the resulting info.txt always contains itself in the listing:
$ cat info.txt
alpha
bravo
carlos
info.txt
It looks like the redirection operator creates my info.txt first.
In that case, my question is: how can I save my list of files before info.txt is created?
The main question is about the redirection operator: why does it act first, and how can I delay it so that my task completes first? Please use the example above in your answer.
When you redirect a command's output to a file, the shell opens a file handle to the destination file, then runs the command in a child process whose standard output is connected to this file handle. There is no way to change this order, but you can redirect to a file in a different directory if you don't want the ls output to include the new file.
ls >/tmp/info.txt
mv /tmp/info.txt ./
In a production script, you should make sure that the file name is unique and unpredictable.
t=$(mktemp -t lstemp.XXXXXXXXXX) || exit
trap 'rm -f "$t"' INT HUP
ls >"$t"
mv "$t" ./info.txt
Alternatively, capture the output into a variable, and then write that variable to a file.
files=$(ls)
echo "$files" >info.txt
As an aside, probably don't use ls in scripts. If you want a list of files in the current directory
printf '%s\n' *
does that.
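To see for yourself that the shell really does create the redirection target before the command runs, here is a tiny experiment (run it in a scratch directory; the file name is arbitrary):
ls -l demo.txt > demo.txt
cat demo.txt        # shows a zero-byte demo.txt: the file already existed when ls ran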
One simple approach is to save your command output to a variable, like this:
ls_output="$(ls)"
and then write the value of that variable to the file, using any of these commands:
printf '%s\n' "$ls_output" > info.txt
cat <<< "$ls_output" > info.txt
echo "$ls_output" > info.txt
Some caveats with this approach:
Bash variables can't contain null bytes. If the output of the command includes a null byte, that byte and everything after it will be discarded.
In the specific case of ls, though, this shouldn't be an issue, because the output of ls should never contain a null byte.
$(...) removes trailing newlines. The above compensates for this by adding a newline while creating info.txt, but if the command output ends with multiple newlines, then the above will effectively collapse them into a single newline.
In the specific case of ls, this could happen if a filename ends with a newline — very unusual, and unlikely to be intentional, but nonetheless possible.
Since the above adds a newline while creating info.txt, it will put a newline there even if the command output doesn't end with a newline.
In the specific case of ls, this shouldn't be an issue, because the output of ls should always end with a newline.
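A quick illustration of the trailing-newline behaviour described above (od is used only to make the invisible characters visible):
out=$(printf 'alpha\n\n\n')      # three trailing newlines in the raw output
printf '%s' "$out" | od -c       # shows only: a l p h a -- all trailing newlines were stripped
printf '%s\n' "$out" > info.txt  # writing it back adds exactly one newline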
If you want to avoid the above issues, another approach is to save your command output to a temporary file in a different directory, and then move it to the right place; for example:
tmpfile="$(mktemp)"
ls > "$tmpfile"
mv -- "$tmpfile" info.txt
. . . which obviously has different caveats (e.g., it requires access to write to a different directory), but should work on most systems.
One way to do what you want is to exclude the info.txt file from the ls output.
If you can rename the list file to .info.txt then it's as simple as:
ls >.info.txt
ls doesn't list files whose names start with . by default.
If you can't rename the list file but you've got GNU ls then you can use:
ls --ignore=info.txt >info.txt
Failing that, you can use:
ls | grep -v '^info\.txt$' >info.txt
All of the above options have the advantage that you can safely run them after the list file has been created.
Another general approach is to capture the output of ls with one command and save it to the list file with a second command. As others have pointed out, temporary files and shell variables are two specific ways to capture the output. Another way, if you've got the moreutils package installed, is to use the sponge utility:
ls | sponge info.txt
Finally, note that you may not be able to reliably extract the list of files from info.txt if it contains plain ls output. See ParsingLs - Greg's Wiki for more information.

how to untar certain files from an archive and grep in parallel in bash

We've got an extensive number of tarballs, and in each tarball I need to search for a particular pattern, but only in some files whose names are known beforehand.
As disk access is slow and there are quite a few cores and plenty of memory available on this system, we aim to minimise disk writes and go through memory as much as possible.
echo "a.txt" > file_subset_in_tar.txt
echo "b.txt" >> file_subset_in_tar.txt
echo "c.txt" >> file_subset_in_tar.txt
tarball_name="tarball.tgz";
pattern="mypattern"
echo "pattern: $pattern"
(parallel -j-2 tar xf $tarball_name -O ::: `cat file_subset_in_tar.txt` | grep -ac "$pattern")
This works just fine when run directly in a bash terminal. However, when I paste it into a script with a shebang at the top, it just prints zero.
If I change $pattern to a hard-coded string, it runs fine. It feels like there is something wrong with the pipe sequencing or something similar. So, ideally, an update to the attempt above, or another solution that satisfies the disk/memory requirements mentioned, would be much appreciated.
I believe your parallel command is constructed incorrectly. You can run the pipeline of commands like the following:
parallel -j -2 "tar xf $tarball_name -O {} | grep -ac $pattern" :::: file_subset_in_tar.txt
Also note that the backticks and the use of cat are unnecessary; parameters can be fed to parallel from a file using ::::.
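Putting that correction back into a script, a sketch of the whole thing (assuming GNU parallel and the same file and variable names as the question) might look like:
#!/bin/bash
tarball_name="tarball.tgz"
pattern="mypattern"
echo "pattern: $pattern"
# each job extracts one member to stdout and counts matches within it
parallel -j -2 "tar xf $tarball_name -O {} | grep -ac $pattern" :::: file_subset_in_tar.txt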

Can Unix shell be used to report completion status in some manner?

I have seen some ideas for progress bars around SO and externally for specific commands (such as cat). However, my question seems to deviate slightly from the standard...
Currently, I am using the capability of the find command in shell, such as the follow example:
find . -name file -exec cmd "{}" \;
Where "cmd" is generally a zipping capability or removal tool to free up disk space.
When "." is very large, this can take minutes, and I would like some ability to report "status".
Is there some way to have some type of progress bar, percentage completion, or even print periods (i.e., Working....) until completed? If at all possible, I would like to avoid increasing the duration of this execution by adding another find. Is it possible?
Thanks in advance.
Clearly, you can only have a progress meter or percent completion if you know how long the command will take to run, or if it can tell you that it has finished x tasks out of y.
Here's a simple way to show an indicator while something is working:
#!/bin/sh
echo "launching: $#"
spinner() {
while true; do
for char in \| / - \\; do
printf "\r%s" "$char"
sleep 1
done
done
}
# start the spinner
spinner &
spinner_pid=$!
# launch the command
"$#"
# shut off the spinner
kill $spinner_pid
echo ""
So, you'd do (assuming the script is named "progress_indicator")
find . -name file -exec progress_indicator cmd "{}" \;
The trick with find is that you add two -print clauses, one at the start, and
one at the end. You then use awk (or perl) to update and print a line counter for each
unique line. In this example I tell awk to print to stderr.
Any duplicate lines must be the result of the conditions we specified, so we treat them specially.
In this example, we just print that line:
find . -print -name aa\* -print |
awk '$0 == last {
         print "" > "/dev/fd/2"
         print
         next
     }
     {
         printf "\r%d", n++ > "/dev/fd/2"
         last=$0
     }'
It's best to leave find to just report pathnames, and do further processing from awk,
or just add another pipeline. (Because the counters are printed to stderr, those will not
interfere.)
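For instance, a rough sketch of that "another pipeline" idea (gzip here is just a stand-in for the real per-file work, and file names are assumed not to contain newlines):
find . -type f -name '*.log' -print |
awk '{ printf "\r%d", NR > "/dev/fd/2"; print }' |   # running count on stderr, paths on stdout
while IFS= read -r path; do
    gzip -- "$path"
done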
If you have the dialog utility installed, you can easily make a nice rolling display:
find . -type f -name glob -exec echo {} \; -exec cmd {} \; |
dialog --progressbox "Files being processed..." 12 $((COLUMNS*3/2))
The arguments to --progressbox are the box's title (optional, can't look like a number); the height in text rows and the width in text columns. dialog has a bunch of options to customize the presentation; the above is just to get you started.
dialog also has a progress bar, otherwise known as a "gauge", but as @glennjackman points out in his answer, you need to know how much work there is to do in order to show progress. One way to do this would be to collect the entire output of the find command, count the number of files in it, and then run the desired task from the accumulated output. However, that means waiting until the find command finishes in order to start doing the work, which might not be desirable.
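If waiting for the find to finish is acceptable, a much simpler two-pass sketch is possible (it assumes dialog is available and bash 4+ for mapfile; cmd stands for whatever per-file work you need, and the find expression is a placeholder):
mapfile -t files < <(find . -type f -name 'glob')
total=${#files[@]}
(( total > 0 )) || exit 0               # nothing to do, avoid dividing by zero
i=0
for f in "${files[@]}"; do
    cmd "$f" > /dev/null                # the real work; keep its stdout off the gauge's stdin
    ((++i))
    echo $(( i * 100 / total ))         # dialog --gauge reads percentages from stdin
done | dialog --gauge "Processing $total files..." 7 70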
Just because it was an interesting challenge, I came up with the following solution, which is possibly over-engineered because it tries to work around all the shell gotchas I could think of (and even so, it probably misses some). It consists of two shell files:
#!/bin/bash
# File: run.sh
# Usage: run.sh root-directory find-tests
#
# Fix the following path as required
PROCESS="$HOME/bin/process.sh"
TD=$(mktemp --tmpdir -d gauge.XXXXXXXX)
find "$@" -print0 |
    tee >(awk -vRS='\0' 'END{print NR > "'"$TD/_total"'"}';
          ln -s "$TD/_total" "$TD/total") |
    { xargs -0 -n50 "$PROCESS" "$TD"; printf "XXX\n100\nDone\nXXX\n"; } |
    dialog --gauge "Starting..." 7 70
rm -fR "$TD"
#!/bin/bash
# File: process.sh
TD="$1"; shift
TOTAL=
if [[ -f $TD/count ]]; then COUNT=$(cat "$TD/count"); else COUNT=0; fi
for file in "$@"; do
    if [[ -z $TOTAL && -f $TD/total ]]; then TOTAL=$(cat "$TD/total"); fi
    printf "XXX\n%d\nProcessing file\n%q\nXXX\n" \
        $((COUNT*100/${TOTAL:-100})) "$file"
    #
    # do whatever you want to do with $file
    #
    ((++COUNT))
done
echo $COUNT > "$TD/count"
Some notes:
There are a lot of GNU extensions scattered in the above. I haven't made a complete list, but it certainly includes the %q printf format (which could just be %s), the flags used to NUL-terminate the filename list, and the --tmpdir flag to mktemp.
run.sh uses tee to simultaneously count the number of files found (with awk) and to start processing the files.
The -n50 argument to xargs causes it to wait only for the first 50 files to avoid delaying startup if find spends a lot of time not finding the first files; it might not be necessary.
The -vRS='\0' argument to awk causes it to use a NUL as the record separator, to match the -print0 action of find (and the -0 option to xargs); all of this is only necessary if file paths could contain a newline.
awk writes the count to _total, and then we symlink _total to total to avoid a really unlikely race condition where total is read before it is completely written. Creating the symlink is atomic, so doing it this way guarantees that total either doesn't exist or is completely written.
It might have been better to count the total size of the files rather than just the number of files, particularly if the processing work is related to file size (compression, for example). That would be a reasonably simple modification. It would also be tempting to use the parallel execution feature of xargs, but that would require a bit more work to coordinate the count of processed files between the parallel processes.
If you're using a managed environment which doesn't have dialog, the simplest solution is to just run the above script using ssh from an environment which does have dialog. Remove | dialog --gauge "Starting..." 7 70 from run.sh, and put it in your ssh invocation instead: ssh user@host /path/to/run.sh root-dir find-tests | dialog --gauge "Starting..." 7 70

Shell script takes a list of commands as input, tries to execute them, and fails

I am, like many non-engineers or non-mathematicians who try writing algorithms, an intuitive. My exact psychological typology makes it quite difficult for me to learn anything serious like computers or math. Generally, I prefer audio, because I can engage my imagination more effectively in the learning process.
That said, I am trying to write a shell script that will help me master Linux. To that end, I copied and pasted a list of Linux commands from the O'Reilly website's index to the book Python In a Nutshell. I doubt they'll mind, and I thank them for providing it. These live in the textfile `massivelistoflinuxcommands', which is not included in full below in order to save space...
OK, now comes the fun part. How do I get this script to work?
#/bin/sh
read -d 'massivelistoflinuxcommands' commands <<EOF
accept
bison
bzcmp
bzdiff
bzgrep
bzip2
bzless
bzmore
c++
lastb
lastlog
strace
strfile
zmore
znew
EOF
for i in $commands
do
$i --help | less | cat > masterlinuxnow
text2wave masterlinuxnow -o ml.wav
done
It really helps when you include error messages or specific ways that something deviates from expected behavior.
However, your problem is here:
read -d 'massivelistoflinuxcommands' commands <<EOF
It should be:
read -d '' commands <<EOF
The delimiter argument to read is effectively only its first character: read stops at the first character of input that matches it. Here the delimiter string starts with "m", so read stops at the "m" in "bzcmp", and the variable ends up holding only "accept bison bzc".
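A small demonstration of that behaviour, with the word list shortened (the variable name partial is just for illustration):
read -d 'massivelistoflinuxcommands' partial <<EOF
accept
bison
bzcmp
bzdiff
EOF
echo $partial    # prints: accept bison bzc -- reading stopped at the first 'm'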
Also, I have no idea what this is supposed to do:
$i --help | less | cat > masterlinuxnow
but it probably should be:
$i --help > masterlinuxnow
However, you should be able to pipe directly into text2wave and skip creating an intermediate file:
$i --help | text2wave -o ml.wav
Also, you may want to prevent each file from overwriting the previous one:
$i --help | text2wave -o ml-$i.wav
That will create files named like "ml-accept.wav" and "ml-bison.wav".
I would point out that if you're learning Linux commands, you should prioritize them by frequency of use and/or applicability to a beginner. For example, you probably won't be using bison right away.
The first problem here is that not every command has a --help option! In fact the very first command, accept, has no such option. A better approach might be executing man on each command, since a manual page is more likely to exist for each of the commands. Thus change:
$i --help | less | cat > masterlinuxnow
to
man $i >> masterlinuxnow
Note that it is essential you use the append output operator ">>" instead of the create output operator ">" in this loop. Using the create output operator would recreate the file "masterlinuxnow" on each iteration, so it would end up containing only the output of the last "man $i" processed.
You also need to worry about whether the command exists on your version of Linux (many commands are not included in the standard distribution or may have different names). Thus you probably want something more like the following, where the -n in the head command should be replaced by the number of lines you want; for example, if you want only the first 2 lines of the --help output, you would use head -2:
if [ $(which $i) ]
then
    $i --help | head -n >> masterlinuxnow
fi
and instead of the read command, simply define the variable commands like so:
commands="
bison
bzcmp
bzdiff
bzgrep
bzip2
bzless
bzmore
c++
lastb
lastlog
strace
strfile
zmore
znew
"
Putting this all together, the following script works quite nicely:
commands="
bison
bzcmp
bzdiff
bzgrep
bzip2
bzless
bzmore
c++
lastb
lastlog
strace
strfile
zmore
znew
"
for i in $commands
do
    if [ $(which $i) ]
    then
        $i --help | head -1 >> masterlinuxnow 2>/dev/null
    fi
done
You're going to learn to use Linux by listening to help descriptions? I really think that's a bad idea.
Those help commands usually list every obscure option to a command, including many that you will never use-- especially as a beginner.
A guided tutorial or book would be much better. It would only present the commands and options that will be most useful. For example, that list of commands you gave has many that I don't know-- and I've been using Linux/Unix extensively for 10 years.

shell scripting: search/replace & check file exist

I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file exists]
add to whatsleft list
end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob    # allow extended glob syntax, for matching the filenames
LC_COLLATE=C        # use a sort order comm is happy with
IFS=$'\n'           # so filenames can have spaces but not newlines
                    # (newlines don't work so well with comm anyhow;
                    # shame it doesn't have an option for null-separated
                    # input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
    $(comm -23 --nocheck-order \
        <(printf "%s\n" "${files_todo[@]%.xml}") \
        <(printf "%s\n" "${files_done[@]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]%.xml}"; do printf "%s\n" "${f}.txt"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing ls, which is considered very poor practice.
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+).xml ]] ; then
    out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
    : # ...handle here the fact that you have a noncompliant name...
fi
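Hypothetically, wiring that check into a job loop might look like the following sketch (the directory layout comes from the question; invoking jobserve once per file, rather than once with the whole list, is an assumption):
shopt -s nullglob    # so the loop body is skipped when nothing matches
for in_name in data/*/special/*.xml; do
    if [[ $in_name =~ data/([^/]+)/special/([^/]+).xml ]]; then
        out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
        # only queue inputs whose output does not exist yet
        [[ -e $out_name ]] || jobserve E "$in_name"
    fi
done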
The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f ${file%.xml}.txt ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing it, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
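A rough sketch of that "separate directories" idea (the directory names and the way E is invoked are assumptions, and the claim via mv is only atomic when the directories are on the same filesystem):
for f in to-be-done/*.xml; do
    name=$(basename "$f")
    if mv -- "$f" in-progress/ 2>/dev/null; then     # atomic claim: only one worker can win the rename
        E "in-progress/$name" && mv -- "in-progress/$name" done/
    fi
done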
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
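For example, with two toy lists, only the items that appear in exactly one of them survive:
{ printf '%s\n' a b c; printf '%s\n' b c d; } | sort | uniq -u    # prints: a and d, one per line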
I am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name. (Or you could do this check in your E perl script.)
if [ -f "$file" ];then
newname="...."
fi
...
jobserve E .... > $newname
If it's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=`sort $TMPA $TMPB | uniq -u | sed 's/$/.xml/'`;
rm $TMPA $TMPB;
