Automate command lines - bash

I am executing the following command lines, and now I need to put them in a script that I can call, passing file1, file2, and file3 as arguments.
sort file1.csv > file1.csv.sorted
sort file2.csv > file2.csv.sorted
diff --speed-large-files \
file1.csv.sorted \
file2.csv.sorted \
> file3.difftmp
rm file1.csv.sorted
rm file2.csv.sorted
I have tried to create a bash script, but the following eval was not working:
s="diff --speed-large-files $file1.csv.sorted $file2.csv.sorted > $file3"
eval s
I do not necessarily need to create a bash script, but I need to automate this process so that other processes could call it and pass arguments.

Don't use eval here; a simple function will suffice:
filediff() {
sort "$1".csv > "$1".csv.sorted
sort "$2".csv > "$2".csv.sorted
diff --speed-large-files "$1".csv.sorted "$2".csv.sorted > "$3".difftmp
rm "$1".csv.sorted
rm "$2".csv.sorted
}
As Tom Fenech suggested, you can also use process substitution and avoid creating temporary files.

As you're using bash, you can take advantage of process substitution:
#!/bin/bash
diff --speed-large-files <(sort "$1") <(sort "$2")
You can pass the two file names to the script as arguments. This avoids the creation of temporary files and the need for manual cleanup.
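A minimal sketch of how the process-substitution approach could be wrapped to match the three-argument call from the question. The function name `sorted_diff` and the decision to treat diff's exit status 1 (differences found) as success are assumptions, not part of the answers above. The `--speed-large-files` option from the question is left out for portability and can be added back with GNU diff.

```shell
#!/usr/bin/env bash
# sorted_diff FILE1 FILE2 OUTFILE  (hypothetical helper name)
# Sort both inputs on the fly and write their diff to OUTFILE.
sorted_diff() {
    if (( $# != 3 )); then
        echo "usage: sorted_diff file1 file2 outfile" >&2
        return 2
    fi
    local status=0
    # <(...) gives diff a readable path for each sorted stream,
    # so no temporary files need to be cleaned up afterwards.
    diff <(sort -- "$1") <(sort -- "$2") > "$3" || status=$?
    # diff exits with 1 when the inputs differ; only values above 1 are errors.
    (( status <= 1 ))
}
```

Called as `sorted_diff file1.csv file2.csv file3.difftmp`, this leaves the diff in the third file and returns non-zero only if diff itself reports trouble.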

I think you could just do
k=$(s)
So k is a variable where the result returned from command s is stored. If you want to use a string as s command, do $("cmd")
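To make the command-substitution idea concrete, here is a small sketch. The array name `cmd` and the sample command are illustrative only. Storing the command as an array, one word per element, is an alternative to building a string for eval; it is an assumption of this sketch rather than something the answers above prescribe.

```shell
#!/usr/bin/env bash
# Keep the command as an array: each element is exactly one argument.
cmd=(printf '%s\n' "hello world")

# "${cmd[@]}" runs the command; $(...) captures its standard output
# (trailing newlines are stripped from the captured text).
k=$("${cmd[@]}")
echo "captured: $k"
```

Note that `eval s` in the question evaluates the literal word `s`, not the contents of the variable `$s`. Redirections such as `> "$file3"` cannot be stored in the array; they are written where the command is run.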

Related

Automate bash command for multiple files

I have a directory with multiple files
file1_1.txt
file1_2.txt
file2_1.txt
file2_2.txt
...
And I need to run a command structured like this
command [args] file1 file2
So I was wondering if there was a way to call the command just once on all the files, instead of having to call it separately on each pair of files.
Use find and xargs, with sort, since the order appears meaningful in your case:
find . -name 'file?_?.txt' | sort | xargs -n2 command [args]
If your command can take multiple pairs of files on the command line then it should be sufficient to run
command ... *_[12].txt
The files in expanded glob patterns (such as *_[12].txt) are automatically sorted so the files will be paired correctly.
If the command can only take one pair of files then it will need to be run multiple times to process all of the files. One way to do this automatically is:
for file1 in *_1.txt; do
file2=${file1%_1.txt}_2.txt
[[ -f $file2 ]] && echo command "$file1" "$file2"
done
You'll need to replace echo command with the correct command name and arguments.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${file1%_1.txt}.
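As a quick illustration of the suffix-removal expansion used in the loop above, here is a minimal sketch. The function name `partner_of` is hypothetical and exists only for this example.

```shell
#!/usr/bin/env bash
# partner_of NAME: print the expected "_2.txt" partner of a "_1.txt" file name.
partner_of() {
    local file1=$1
    # ${file1%_1.txt} removes the shortest trailing match of "_1.txt".
    printf '%s\n' "${file1%_1.txt}_2.txt"
}

partner_of file1_1.txt      # prints file1_2.txt
partner_of sample_1_1.txt   # prints sample_1_2.txt
```

Only the final `_1.txt` is removed, so names that contain `_1` earlier in the string are left intact.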
#!/bin/bash
cmd (){
# Copy the positional parameters into an array, one file name per element.
arr=("$@")
for ((i=0; i<${#arr[@]}; i+=2))
do
n=$((i+1))
firstFile="${arr[$i]}"
secondFile="${arr[$n]}"
echo "pair -- ${firstFile} ${secondFile}"
done
}
cmd file*_[12].txt
This prints:
pair -- file1_1.txt file1_2.txt
pair -- file2_1.txt file2_2.txt

Bash while loop from txt file?

I am trying to write a script to organize a pair of lists I have and process them with other programs, but I am a bit stuck.
I want to process every line of a list in a text file: first create a folder for each line in the list, then process it with the different scripts I have.
My problem is that if the list I give to the script has 3-4 elements, it works great and each element gets its own directory, but if I pass a list with 1000+ lines, the script processes only a few elements through the scripts.
EDIT: the processing consists of about 30-35 scripts in different languages: Python, Bash, and Go.
Any suggestions?
cat $STORES+NEW.txt | while read NEWSTORES
do
cd $STORES && mkdir $NEWSTORES && cd $NEWSTORES && mkdir .Files
python3 checkstatus.py -n $NEWSTORES
checkemployes $NEWSTORES -status
storemanagers -s $NEWSTORES -o $NEWSTORES+managers.txt
curl -s https://redacted.com/store?=$NEWSTORES | grep -vE "<|^[\*]*[\.]*$NEWSTORES" | sort -u | awk 'NF' > $NEWSTORES+site.txt
..
..
..
..
..
..
cd ../..
done
I'm not supposed to post an answer yet, but I mistakenly posted what should have been a comment as an answer. Anyway, here are a few things I can suggest:
Avoid unnecessary use of cat.
Open your input file using another FD to prevent commands that read input inside the loop from eating the input: while IFS= read -ru 3 NEWSTORES; do ...; done 3< "$STORES+NEW.txt" or { while IFS= read -ru "$FD" NEWSTORES; do ...; done; } {FD}< "$STORES+NEW.txt". Also see https://stackoverflow.com/a/28837793/445221.
Not completely related, but don't use a while loop in a pipeline, since it will execute in a subshell. In the future, if you try to alter a variable and expect it to be saved outside the loop, it won't be. You can use lastpipe to avoid this, but it's unnecessary most of the time.
Place your variable expansions inside double quotes to prevent unwanted word splitting and filename expansion.
Use -r option unless you want backslashes to escape characters.
Specify IFS= before read to prevent stripping of leading and trailing spaces.
Using readarray or mapfile makes it more convenient: readarray -t ALL_STORES_DATA < "$STORES+NEW.txt"; for NEWSTORES in "${ALL_STORES_DATA[@]}"; do ...; done
Use lowercase characters on your variables when you don't use them in a global manner to avoid conflict with bash's variables.
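The effect of a command inside the loop eating the input can be reproduced with a small sketch. The helper names `consume_stdin`, `count_via_stdin`, and `count_via_fd3` are hypothetical; `consume_stdin` is only a stand-in for any tool that reads standard input.

```shell
#!/usr/bin/env bash
# Stand-in for a tool that reads everything available on standard input.
consume_stdin() { cat > /dev/null; }

# The list is the loop's stdin, so consume_stdin swallows the remaining lines.
count_via_stdin() {
    local n=0 line
    while IFS= read -r line; do
        consume_stdin
        n=$((n + 1))
    done < "$1"
    echo "$n"
}

# The list is opened on file descriptor 3 and read with -u 3,
# so commands in the loop body cannot reach it through stdin.
count_via_fd3() {
    local n=0 line
    while IFS= read -r -u 3 line; do
        # stdin is pointed at /dev/null only so this demo cannot block on a terminal
        consume_stdin < /dev/null
        n=$((n + 1))
    done 3< "$1"
    echo "$n"
}
```

With a three-line list, `count_via_stdin` reports 1 iteration while `count_via_fd3` reports 3, which matches the symptom of only a few elements being processed.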

How to delay `redirection operator` of BASH `>`

First I create 3 files:
$ touch alpha bravo carlos
Then I want to save the list to a file:
$ ls > info.txt
However, info.txt itself always ends up in the list:
$ cat info.txt
alpha
bravo
carlos
info.txt
It looks like the redirection operator creates my info.txt first.
In this case, my question is: how can I save my list of files before info.txt is created?
The main question is about the redirection operator. Why does it act first, and how to delay it so I complete my task first? Using the example above to answer it.
When you redirect a command's output to a file, the shell opens a file handle to the destination file, then runs the command in a child process whose standard output is connected to this file handle. There is no way to change this order, but you can redirect to a file in a different directory if you don't want the ls output to include the new file.
ls >/tmp/info.txt
mv /tmp/info.txt ./
In a production script, you should make sure that the file name is unique and unpredictable.
t=$(mktemp -t lstemp.XXXXXXXXXX) || exit
trap 'rm -f "$t"' INT HUP
ls >"$t"
mv "$t" ./info.txt
Alternatively, capture the output into a variable, and then write that variable to a file.
files=$(ls)
echo "$files" >info.txt
As an aside, probably don't use ls in scripts. If you want a list of files in the current directory
printf '%s\n' *
does that.
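A minimal sketch of the glob-based listing, collecting the names into an array before the output file is opened. The function name `list_files_to` is hypothetical, and the use of `nullglob` to handle an empty directory is an assumption added for this example.

```shell
#!/usr/bin/env bash
# list_files_to OUTFILE: write the names in the current directory to OUTFILE.
# The body runs in a subshell, so the shopt change does not leak to the caller.
list_files_to() (
    outfile=$1
    shopt -s nullglob      # an empty directory yields an empty array
    names=(*)              # names are collected here, before OUTFILE is opened
    if (( ${#names[@]} )); then
        printf '%s\n' "${names[@]}" > "$outfile"
    else
        : > "$outfile"     # nothing to list: create an empty file
    fi
)
```

If the output file already exists from an earlier run, it is matched by the glob like any other file, so this only helps on the first run.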
One simple approach is to save your command output to a variable, like this:
ls_output="$(ls)"
and then write the value of that variable to the file, using any of these commands:
printf '%s\n' "$ls_output" > info.txt
cat <<< "$ls_output" > info.txt
echo "$ls_output" > info.txt
Some caveats with this approach:
Bash variables can't contain null bytes. If the output of the command includes a null byte, that byte and everything after it will be discarded.
In the specific case of ls, though, this shouldn't be an issue, because the output of ls should never contain a null byte.
$(...) removes trailing newlines. The above compensates for this by adding a newline while creating info.txt, but if the command output ends with multiple newlines, then the above will effectively collapse them into a single newline.
In the specific case of ls, this could happen if a filename ends with a newline — very unusual, and unlikely to be intentional, but nonetheless possible.
Since the above adds a newline while creating info.txt, it will put a newline there even if the command output doesn't end with a newline.
In the specific case of ls, this shouldn't be an issue, because the output of ls should always end with a newline.
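The trailing-newline caveat can be seen directly in a short sketch. The sentinel-character workaround in the second half is an assumption of this example, not something proposed in the answer.

```shell
#!/usr/bin/env bash
# Command substitution drops every trailing newline from the captured output.
captured=$(printf 'alpha\nbravo\n\n\n')
printf '%s\n' "$captured" | wc -l    # counts 2 lines, not 4

# Workaround sketch: emit a sentinel character after the real output
# so the newlines are no longer trailing, then strip the sentinel.
exact=$(printf 'alpha\nbravo\n\n\n'; printf x)
exact=${exact%x}
printf '%s' "$exact" | wc -l         # counts 4 lines
```

The second form preserves the output byte for byte, apart from NUL bytes, which a shell variable cannot hold.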
If you want to avoid the above issues, another approach is to save your command output to a temporary file in a different directory, and then move it to the right place; for example:
tmpfile="$(mktemp)"
ls > "$tmpfile"
mv -- "$tmpfile" info.txt
. . . which obviously has different caveats (e.g., it requires access to write to a different directory), but should work on most systems.
One way to do what you want is to exclude the info.txt file from the ls output.
If you can rename the list file to .info.txt then it's as simple as:
ls >.info.txt
ls doesn't list files whose names start with . by default.
If you can't rename the list file but you've got GNU ls then you can use:
ls --ignore=info.txt >info.txt
Failing that, you can use:
ls | grep -v '^info\.txt$' >info.txt
All of the above options have the advantage that you can safely run them after the list file has been created.
Another general approach is to capture the output of ls with one command and save it to the list file with a second command. As others have pointed out, temporary files and shell variables are two specific ways to capture the output. Another way, if you've got the moreutils package installed, is to use the sponge utility:
ls | sponge info.txt
Finally, note that you may not be able to reliably extract the list of files from info.txt if it contains plain ls output. See ParsingLs - Greg's Wiki for more information.

read multiple files in bash

I have two .txt files that I want to read line by line simultaneously in a .sh script. Both .txt files have the same number of lines. Inside the loop I want to use the sed command to change the full_sample_name and sample_name in another file.
I know how this works if you just read one file, but I cannot get it work for two files.
#! /bin/bash
FULL_SAMPLE="file1.txt"
SAMPLE="file2.txt"
while read ... && ...
do
sed -e "s/\<full_sample_name\>/$FULL_SAMPLE/g" -e "s/\<sample_name\>/$SAMPLE/g" pipeline.sh > $SAMPLE.sh
done < ...?
Charles provided a very good answer.
You could use paste to join the lines of the files with some delimiter (that shouldn't appear in the files):
paste -d ":" file1.txt file2.txt | while IFS=":" read -r full samp; do
do_stuff_with "$full" and "$samp"
done
#!/bin/bash
full_sample_file="file1.txt"
sample_file="file2.txt"
while read -r -u 3 full_sample_name && read -r -u 4 sample_name; do
sed -e "s/\<full_sample_name\>/$full_sample_name/g" \
-e "s/\<sample_name\>/$sample_name/g" \
pipeline.sh >"$sample_name.sh"
done 3<"$full_sample_file" 4<"$sample_file" # automatically closed on loop exit
In this case, I'm assigning file descriptor 3 to file1.txt and file descriptor 4 to file2.txt.
By the way, with bash 4.1 or newer, you no longer need to choose file descriptor numbers manually; the shell can allocate them for you:
# opening explicitly, since even if opened on the loop, these need
# to be explicitly closed.
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
: do stuff here with "$full_sample_name" and "$sample_name"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
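A self-contained sketch of the same pattern, using automatically allocated descriptors inside a function. The function name `zip_lines` and the tab-separated output format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# zip_lines FILE_A FILE_B: print corresponding lines of two files side by side.
zip_lines() {
    local fd_a fd_b a b
    [[ -r $1 && -r $2 ]] || { echo "zip_lines: cannot read inputs" >&2; return 1; }
    # bash picks free descriptor numbers (10 or higher) and stores them in the variables.
    exec {fd_a}< "$1" {fd_b}< "$2"
    while IFS= read -r -u "$fd_a" a && IFS= read -r -u "$fd_b" b; do
        printf '%s\t%s\n' "$a" "$b"
    done
    # Descriptors opened with exec stay open until they are closed explicitly.
    exec {fd_a}<&- {fd_b}<&-
}
```

The loop stops as soon as either file runs out of lines, so with inputs of unequal length the extra lines are silently skipped.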
One more note: You could make this a bit more efficient (and also more correct, if your sample_name and full_sample_name values aren't guaranteed to evaluate to themselves when interpreted as regular expressions, if your input file contains no literal NULs [which, as a shell script, it shouldn't], and if the arrow brackets are intended to be literal rather than word-boundary regex characters) by not using sed at all, but just reading the input to be converted into a shell variable, and doing the replacements there!
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
IFS= read -r -d '' input_file <pipeline.sh
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
output=${input_file//'<full_sample_name>'/${full_sample_name}}
output=${output//'<sample_name>'/${sample_name}}
printf '%s' "$output" >"${sample_name}.sh"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
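The substitution step on its own can be tried with a small sketch. The function name `render` and the sample template are hypothetical; the replacement is double-quoted here so that its contents are taken literally.

```shell
#!/usr/bin/env bash
# render TEMPLATE FULL_NAME NAME: fill both placeholders in a template string.
render() {
    local text=$1
    # The quoted pattern is matched literally; the quoted replacement is inserted as-is.
    text=${text//'<full_sample_name>'/"$2"}
    text=${text//'<sample_name>'/"$3"}
    printf '%s' "$text"
}

render 'align <full_sample_name> > <sample_name>.bam' 'S1_L001_R1' 'S1'
# prints: align S1_L001_R1 > S1.bam
```

Because no regular expression is involved, characters such as `/` or `.` in the sample names need no escaping.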
With GNU Parallel it will look like this:
#! /bin/bash
do_sed() {
sed -e "s/\<full_sample_name\>/$1/g" -e "s/\<sample_name\>/$2/g" pipeline.sh > "$2".sh
}
export -f do_sed
parallel --xapply do_sed {1} {2} :::: file1.txt file2.txt
The added benefit is that you get it run in parallel. Depending on your storage system this may speed up the processing: On a raid6 I have seen a 6x speedup by running 10 jobs in parallel. YMMV, so the only way to know for sure is to test and measure.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

bash script to execute a program sequentially

I have a problem with a bash script I am trying to use. I have a directory with thousands of files and I want to run a command sequentially using each file. However, each file is paired with another, e.g. File1.sam, File1.gz, File2.sam, File2.gz, etc., and the command I am executing requires that I use both files from a pair as arguments. I have been using something similar to the command below when only a single argument was required, and I thought (wrongly) that I could simply extend it like below.
shopt -s nullglob
for myfile1 in *.sam && for myfile2 in *.gz
do
./bwa samse -r "@RG\tID:$myfile1\tLB:$myfile1\tSM:$myfile1\tPL:ILLUMINA" lope_V1.2.fasta $myfile1 $myfile2 > $myfile1.sam2 2>$myfile1.log
done
Anyone know how I can modify this or point me in the direction of another way of doing it?
Why not generate the second filename, e.g. replace .sam with .gz
for myfile1 in *.sam ; do
myfile2="${myfile1%.sam}.gz"
[ -e "$myfile2" ] || continue
./bwa samse -r "@RG\tID:$myfile1\tLB:$myfile1\tSM:$myfile1\tPL:ILLUMINA" lope_V1.2.fasta "$myfile1" "$myfile2" > "$myfile1.sam2" 2>"$myfile1.log"
done
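Before launching a long run, it can help to do a dry run that only reports the pairs. The sketch below assumes the File1.sam / File1.gz naming from the question; the function name `list_pairs` is hypothetical and no real tool is invoked.

```shell
#!/usr/bin/env bash
# list_pairs: print "SAM<TAB>GZ" for every .sam file in the current directory
# that has a matching .gz file; report missing partners on stderr.
# The body runs in a subshell so nullglob does not leak to the caller.
list_pairs() (
    shopt -s nullglob
    for sam in *.sam; do
        gz=${sam%.sam}.gz          # swap the .sam suffix for .gz
        if [[ -e $gz ]]; then
            printf '%s\t%s\n' "$sam" "$gz"
        else
            printf 'no partner for %s\n' "$sam" >&2
        fi
    done
)
```

Once the printed pairs look right, the `printf` line can be replaced by the actual command.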
shopt -s nullglob
for myfile1 in *.sam
do
myfile2=$(echo "$myfile1" | sed 's/\.sam$/.gz/')
./bwa samse -r "@RG\tID:$myfile1\tLB:$myfile1\tSM:$myfile1\tPL:ILLUMINA" lope_V1.2.fasta "$myfile1" "$myfile2" > "$myfile1.sam2" 2>"$myfile1.log"
done
Iterate only over files with one of the extensions (for instance *.gz) and use, for instance, sed to get the matching .sam file.
Something like this:
for myfile2 in *.gz
do
myfile1=$(echo "$myfile2" | sed -e 's/\.gz$/.sam/')
./bwa samse -r "@RG\tID:$myfile1\tLB:$myfile1\tSM:$myfile1\tPL:ILLUMINA" lope_V1.2.fasta "$myfile1" "$myfile2" > "$myfile1.sam2" 2>"$myfile1.log"
done
Change your for loop using one of the file extensions and calculate the other file name. For example:
for p in a b c; do touch $p.1 $p.2; done
for f in *.1; do g=${f%.*}.2; echo $f $g; done
This displays:
a.1 a.2
b.1 b.2
c.1 c.2
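When file names contain more than one dot, the choice between `%` and `%%` matters. A short sketch with a made-up file name:

```shell
#!/usr/bin/env bash
f=sample.v2.sam
echo "${f%.*}.gz"     # sample.v2.gz  (shortest matching suffix removed)
echo "${f%%.*}.gz"    # sample.gz     (longest matching suffix removed)
```

For names with a single dot, such as a.1, both forms give the same result.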

Resources