Executing Linux commands in separate processes and capturing their results in variables - bash

I'm trying to create a Bash script that calculates the MD5 checksums of big files in separate processes. I learned that one should use & for that purpose.
At the same time, I wanted to capture the results of the checksums in different variables and write them to a file in order to read them afterwards.
So I wrote the following script, "test_base.sh", executed it using the command "sh ./test_base.sh", and the results were sent to the file "test.txt", which was empty.
My OS is Lubuntu 22.04 LTS.
Why is "test.txt" empty?
Code of the "test_base.sh":
#!/bin/bash
md51=`md5sum -b ./source/test1.mp4|cut -b 1-32` &
md52=`md5sum -b ./source/test2.mp4|cut -b 1-32` &
wait
echo "md51=$md51">./test.txt
echo "md52=$md52">>./test.txt
Result of "test.txt":
md51=
md52=

Updated Answer
If you really, really want to avoid intermediate files, you can use GNU Parallel as suggested by @Socowi in the comments. So, if you run this:
parallel -k md5sum {} ::: test1.mp4 test2.mp4
you will get something like this, where -k keeps the output in order regardless of which one finishes first:
d5494cafb551b56424d83889086bd128 test1.mp4
3955a4ddb985de2c99f3d7f7bc5235f8 test2.mp4
Now, if you transpose the linefeed into a space, like this:
parallel -k md5sum {} ::: test1.mp4 test2.mp4 | tr '\n' ' '
You will get:
d5494cafb551b56424d83889086bd128 test1.mp4 3955a4ddb985de2c99f3d7f7bc5235f8 test2.mp4
You can then read this into bash variables, using _ for the interspersed parts you aren't interested in:
read mp51 _ mp52 _ < <(parallel -k md5sum {} ::: test1.mp4 test2.mp4 | tr '\n' ' ')
echo $mp51, $mp52
d5494cafb551b56424d83889086bd128,3955a4ddb985de2c99f3d7f7bc5235f8
Yes, this will fail if there are spaces or linefeeds in your filenames. If required, you can build a successively more complicated command to deal with cases your question doesn't mention, but then you rather miss the salient point of what I am suggesting.
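If spaces in the filenames are a realistic concern, one sketch that sidesteps the tr flattening (my own variation, not part of the answer above) is to read one hash per line, relying on the hash always being the first field of md5sum's output:
# Sketch only: collect one hash per line so a space in a filename
# cannot shift the field positions (embedded newlines would still break it).
hashes=()
while read -r hash _; do
    hashes+=("$hash")
done < <(parallel -k md5sum {} ::: test1.mp4 test2.mp4)
md51=${hashes[0]}
md52=${hashes[1]}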
Original Answer
bash doesn't really have the concept of awaiting the result of a promise, and an assignment that you put in the background, like md51=`...` &, runs in its own subshell, so the variable never makes it back into the parent script (which is why test.txt comes out empty). So you could go with something like:
md5sum test1.mp4 > md51.txt &
md5sum test2.mp4 > md52.txt &
wait # for both
md51=$(awk '{print $1}' md51.txt)
md52=$(awk '{print $1}' md52.txt)
rm md5?.txt
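If leaving md51.txt and md52.txt lying around (or clobbering existing files with those names) is a worry, a variation of the same idea, sketched here with mktemp, keeps the scratch files anonymous and reads the hashes without awk:
tmp1=$(mktemp) tmp2=$(mktemp)
md5sum test1.mp4 > "$tmp1" &
md5sum test2.mp4 > "$tmp2" &
wait                        # for both background jobs
read -r md51 _ < "$tmp1"    # the hash is the first field of md5sum's output
read -r md52 _ < "$tmp2"
rm -f "$tmp1" "$tmp2"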

Related

How to get the highest numbered link from curl result?

I have created a small program consisting of a couple of shell scripts that work together. It is almost finished and everything seems to work fine, except for one thing that I'm not really sure how to do, and which I need in order to finish this project.
There seem to be many routes that can be taken, but I just can't get there.
I have some curl results with lots of unused data, including different links, and among all that data there is a bunch of similar links.
I only need to get (into a variable) the link with the highest number (without the always-same text).
The links are all similar, and have this structure:
<a href="https://always/same/link/unique-name_17.html">always same text</a>
<a href="https://always/same/link/unique-name_18.html">always same text</a>
<a href="https://always/same/link/unique-name_19.html">always same text</a>
I was thinking about something like this:
content="$(curl -s "$url/$param")"
# linksArray = all links from $content whose href section contains "always same text"  (pseudocode)
declare highestnumber
for file in "${linksArray[@]}"
do
    href=${file##*/}
    fullname=${href%.html}
    OIFS="$IFS"
    IFS='_'
    read -a nameparts <<< "${fullname}"
    IFS="$OIFS"
    if (( ${nameparts[1]} > highestnumber ))
    then
        highestnumber=${nameparts[1]}
    fi
done
echo "${nameparts[0]}_${highestnumber}.html"
The result I'm after:
https://always/same/link/unique-name_19.html
This was just my guess; any working code that can be run from a bash script is OK.
Thanks.
Update
I found this nice program (Xidel); it is easily installed with:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
It looks awesome, but I'm not really sure how to tweak it to my needs.
Based on that link and the answer below, I guess a possible solution would be:
use xidel, or use sed -n 's/.*href="\([^"]*\).*/\1/p' file as suggested in that link, but then tweak it to get the links with their HTML tags, like:
<a href="https://always/same/link/same-name_17.html">always same text</a>
then filter out everything that doesn't end with ">always same text</a>",
and then use the grep/sort as mentioned below.
Continuing from the comment, you can use grep, sort and tail to isolate the highest-numbered link from your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in the file dat/links.txt for the purpose of the example), you can easily isolate the highest-numbered one in a variable:
Example List
$ cat dat/links.txt
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is all one command, split over two lines by the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)
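One caveat: plain sort orders the links lexically, which works here because every number has the same width. If the numbers can vary in width (say _9 versus _10), GNU sort's version sort (-V) is one way to keep the highest number last; a sketch, assuming the same dat/links.txt:
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort -V | tail -n1); \
echo "myvar : '$myvar'"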
Why not use Xidel with xquery to sort the links and return the last?
xidel -q links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml
The --input-format parameter makes sure you don't need any html tags at the start and end of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).

How to remove extra spaces in od output when reading from /dev/random

I'm making a script to auto-generate a url with a random number at a specific location. This will be for calling a JSON API for a random endpoint. The end goal is to generate something like this:
curl -s http://api.openbeerdatabase.com/v1/beers/<RAND_INT>.json | jq '.'
where <RAND_INT> is a randomly-generated number. I can create this random number with the following command:
$ od -An -N2 -i /dev/random
          126
I do not know why the 10 extra spaces are in the output. When I chain the above commands together to generate the URL, I get this:
$ echo http://api.openbeerdatabase.com/v1/beers/`od -An -N2 -i /dev/random`.json
http://api.openbeerdatabase.com/v1/beers/ 43250.json
As you see, there is a single extra space in the generated URL. How do I avoid this?
I've also tried subshelling the rand_int command $(od -An -N2 -i /dev/random) but that produces the same thing. I've thought about piping the commands together, but I don't know how to capture the output of the rand_int command in a variable to be used in the URL.
As the comments show, there's more than one way to do this. Here's what I would do:
(( n = $( od -An -N2 -i /dev/urandom ) ))
echo http://api.openbeerdatabase.com/v1/beers/${n}.json
Or, to put it in one line:
echo http://api.openbeerdatabase.com/v1/beers/$(( $( od -An -N2 -i /dev/urandom ) )).json
Or, just use ${RANDOM} instead, since bash provides it, although its values top out at 32767, which might be one reason you preferred your od-based method.
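For completeness, another sketch is to keep the value as a plain string and simply strip the padding that od adds, for example by deleting the spaces with tr:
n=$(od -An -N2 -i /dev/urandom | tr -d ' ')
echo "http://api.openbeerdatabase.com/v1/beers/${n}.json"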

awk: Output to different processes

I have an awk script which splits a big file into several files by some condition. Then I run another script over each file in parallel.
awk -f script.awk -v DEST_FOLDER=tmp input.file
find tmp/ -name "*.part" | xargs -P $ALLOWED_CPUS --replace --verbose /bin/bash -c "./process.sh {}"
The question is: is there any way to run ./process.sh:
before the first script is done, because process.sh processes the file line by line (a single line is too long to be passed to xargs directly);
each new file has a header (added in script.awk) that should be processed before the rest of the file;
the number of parallel processes must be limited;
GNU Parallel and inotifywait are not an option;
assume the destination folder is empty and the file names are unknown.
The purpose of the optimization is to avoid waiting until awk is done while some files are already complete and ready to be processed.
Once you have created a file, you can pass the filename to a process' or script's input:
awk '{print name_of_created_file | "./process.sh &"}'
The & sends process.sh to the background, so that the invocations can run in parallel. However, this is a gawk extension and not POSIX; check the manual.
You basically give the answer yourself: GNU Parallel + inotifywait will work.
Since you are not allowed to use inotifywait, you can make your own substitute for inotifywait. If you are allowed to write your own script, you are also allowed to run GNU Parallel (as that is just a script).
So something like this:
awk -f script.awk -v DEST_FOLDER=tmp input.file &
sleep 1
record file sizes of files in tmp
while tmp is not empty do
for files in tmp:
if file size is unchanged: print file
record new file size
sleep 1
done | parallel 'process {}; rm {}'
It is assumed that awk will produce some output within one second. If it takes longer, adjust the sleeps accordingly.
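For what it's worth, here is a rough, runnable sketch of that pseudocode. It assumes bash 4+ (associative arrays) and GNU stat, and reuses ./process.sh, the *.part naming and $ALLOWED_CPUS from the question; a file is treated as finished once its size is unchanged between two polls:
awk -f script.awk -v DEST_FOLDER=tmp input.file &
awk_pid=$!

{
    declare -A last_size    # size of each file at the previous poll
    declare -A handed_off   # files already handed to parallel
    while :; do
        for f in tmp/*.part; do
            [ -e "$f" ] || continue
            [ -n "${handed_off[$f]}" ] && continue
            size=$(stat -c %s "$f")
            if [ "${last_size[$f]}" = "$size" ]; then
                printf '%s\n' "$f"      # size stable for one poll: assume awk is done with it
                handed_off[$f]=1
            else
                last_size[$f]=$size
            fi
        done
        # stop once awk has exited and every remaining file has been handed off
        if ! kill -0 "$awk_pid" 2>/dev/null; then
            pending=0
            for f in tmp/*.part; do
                [ -e "$f" ] && [ -z "${handed_off[$f]}" ] && pending=1
            done
            [ "$pending" -eq 0 ] && break
        fi
        sleep 1
    done
} | parallel -j "$ALLOWED_CPUS" './process.sh {} && rm {}'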

how to split a file into smaller files (one file per line) [split doesn't work]

I'm trying to split a very large file into one new file per line.
Why? It's going to be input for Mahout, but there are too many lines and not enough suffixes for split.
Is there a way to do this in bash?
Increase Your Suffix Length with Split
If you insist on using split, then you have to increase your suffix length. For example, assuming you have 10,000 lines in your file:
split --suffix-length=5 --lines=1 foo.txt
If you really want to go nuts with this approach, you can even set the suffix length dynamically with the wc command and some shell arithmetic. For example:
file='foo.txt'
split \
--suffix-length=$(( $(wc --chars < <(wc --lines < "$file")) - 1 )) \
--lines=1 \
"$file"
Use Xargs Instead
However, the above is really just a kludge anyway. A more correct solution would be to use xargs from the GNU findutils package to invoke some command once per line. For example:
xargs --max-lines=1 --arg-file=foo.txt your_command
This will pass one line at a time to your command. This is a much more flexible approach and will dramatically reduce your disk I/O.
split --lines=1 --suffix-length=5 input.txt output.
This will use 5 characters per suffix, which is enough for 26^5 = 11,881,376 files. If you really have more than that, increase --suffix-length.
Here's another way to do something for each line:
while IFS= read -r line; do
do_something_with "$line"
done < big.file
GNU Parallel can do this:
cat big.file | parallel --pipe -N1 'cat > {#}'
But if Mahout can read from stdin then you can avoid the temporary files:
cat big.file | parallel --pipe -N1 mahout --input-file -
Learn more about GNU Parallel https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

appending file contents as parameter for unix shell command

I'm looking for a unix shell command to append the contents of a file as the parameters of another shell command. For example:
command << commandArguments.txt
xargs was built specifically for this:
cat commandArguments.txt | xargs mycommand
If you have multiple lines in the file, you can use xargs -L1 -P10 to run ten copies of your command at a time, in parallel.
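For example (just a sketch, reusing the mycommand placeholder from above), that per-line parallel form could look like:
# one invocation of mycommand per input line, with at most 10 running at once
xargs -L1 -P10 mycommand < commandArguments.txt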
xargs takes its standard in and formats it as positional parameters for a shell command. It was originally meant to deal with short command line limits, but it is useful for other purposes as well.
For example, within the last minute I've used it to connect to 10 servers in parallel and check their uptimes:
echo server{1..10} | tr ' ' '\n' | xargs -n 1 -P 50 -I ^ ssh ^ uptime
Some interesting aspects of this command pipeline:
The names of the servers to connect to were taken from the incoming pipe
The tr is needed to put each name on its own line. This is because xargs expects line-delimited input
The -n option controls how many incoming lines are used per command invocation. -n 1 says make a new ssh process for each incoming line.
By default, the parameters are appended to the end of the command. With -I, one can specify a token (^) that will be replaced with the argument instead.
The -P controls how many child processes to run concurrently, greatly widening the space of interesting possibilities.
command `cat commandArguments.txt`
Using backticks inserts the output of the enclosed command directly into the outer command line.
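To make that concrete, a small illustration (ls is just a stand-in for whatever command you are actually running):
# if commandArguments.txt contains "-l -a", this runs: ls -l -a
ls `cat commandArguments.txt`
# the same thing with the generally preferred $(...) syntax
ls $(cat commandArguments.txt)
Note that the substitution is left unquoted on purpose: the shell's word splitting is what turns the file contents into separate parameters; quoting it would pass everything as a single argument.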
