read multiple files in bash - bash

I have two .txt files that I want to read line per line simultaneously in .sh script. Both .txt files have the same number of lines. Inside the loop I want to use the sed-command to change the full_sample_name and sample_name in another file.
I know how this works if you just read one file, but I cannot get it work for two files.
#! /bin/bash
FULL_SAMPLE="file1.txt"
SAMPLE="file2.txt"
while read ... && ...
do
sed -e "s/\<full_sample_name\>/$FULL_SAMPLE/g" -e "s/\<sample_name\>/$SAMPLE/g" pipeline.sh > $SAMPLE.sh
done < ...?

Charles provided a very good answer.
You could use paste to join the lines of the files with some delimiter (that shouldn't appear in the files):
paste -d ":" file1.txt file2.txt | while IFS=":" read -r full samp; do
do_stuff_with "$full" and "$samp"
done

#!/bin/bash
full_sample_file="file1.txt"
sample_file="file2.txt"
while read -r -u 3 full_sample_name && read -r -u 4 sample_name; do
sed -e "s/\<full_sample_name\>/$full_sample_name/g" \
-e "s/\<sample_name\>/$sample_name/g" \
pipeline.sh >"$sample_name.sh"
done 3<"$full_sample_file" 4<"$sample_file" # automatically closed on loop exit
In this case, I'm assigning file descriptor 3 to file1.txt and file descriptor 4 to file2.txt.
By the way, with bash 4.1 or newer, you no longer need to handle file descriptors manually:
# opening explicitly, since even if opened on the loop, these need
# to be explicitly closed.
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
: do stuff here with "$full_sample_name" and "$sample_name"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
One more note: You could make this a bit more efficient (and also more correct, if your sample_name and full_sample_name values aren't guaranteed to evaluate to themselves when interpreted as regular expressions, if your input file contains no literal NULs [which, as a shell script, it shouldn't], and if the arrow brackets are intended to be literal rather than word-boundary regex characters) by not using sed at all, but just reading the input to be converted into a shell variable, and doing the replacements there!
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
IFS= read -r -d '' input_file <pipeline.sh
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
output=${input_file//'<full_sample_name>'/${full_sample_name}}
output=${output//'<sample_name>'/${sample_name}}
printf '%s' "$output" >"${sample_name}.sh"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-

With GNU Parallel it will look like this:
#! /bin/bash
do_sed() {
sed -e "s/\<full_sample_name\>/$1/g" -e "s/\<sample_name\>/$2/g" pipeline.sh > "$2".sh
}
export -f do_sed
parallel --xapply do_sed {1} {2} :::: file1.txt file2.txt
The added benefit is that you get it run in parallel. Depending on your storage system this may speed up the processing: On a raid6 I have seen a 6x speedup by running 10 jobs in parallel. YMMV, so the only way to know for sure is to test and measure.
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

Bash while loop from txt file?

i trying to make a script to organize a pair of list i have, and process with other programs, but im a little bit stuck now.
I want from a List in Txt process every line first creating a folder to each line in the list and then process due to different scripts i have.
But my problem is is the list i give to the script is like 3-4 elements works great and create there own directory, but if i put a list with +1000 lines, then my script process only a few elements thru the scripts.
EDIT: the process are like 30-35 scripts, different language python,bash,python and golang
Any suggestions?
cat $STORES+NEW.txt | while read NEWSTORES
do
cd $STORES && mkdir $NEWSTORES && cd $NEWSTORES && mkdir .Files
python3 checkstatus.py -n $NEWSTORES
checkemployes $NEWSTORES -status
storemanagers -s $NEWSTORES -o $NEWSTORES+managers.txt
curl -s https://redacted.com/store?=$NEWSTORES | grep -vE "<|^[\*]*[\.]*$NEWSTORES" | sort -u | awk 'NF' > $NEWSTORES+site.txt
..
..
..
..
..
..
cd ../..
done
I'm not supposed to give an answer yet but I mistakenly answered my what should be a comment reply. Anyway here a few things I can suggest:
Avoid unnecessary use of cat.
Open your input file using another FD to prevent commands that read input inside the loop from eating the input: IFS= read -ru 3 NEWSTORES; do ...; done 3< "$STORES+NEW.txt" or { IFS= read -ru "$FD" NEWSTORES; do ...; done; } {FD}< "$STORES+NEW.txt". Also see https://stackoverflow.com/a/28837793/445221.
Not completely related but don't use while loop in a pipeline since it will execute in a subshell. In the future if you try to alter a variable and expect it to be saved outside the loop, it won't. You can use lastpipe to avoid it but it's unnecessary most of the time.
Place your variable expansions around double quotes to prevent unwanted word splitting and filename expansion.
Use -r option unless you want backslashes to escape characters.
Specify IFS= before read to prevent stripping of leading and trailing spaces.
Using readarray or mapfile makes it more convenient: readarray -t ALL_STORES_DATA < "$STORES+NEW.txt"; for NEWSTORES IN "${ALL_STORES_DATA[#]}"; do ...; done
Use lowercase characters on your variables when you don't use them in a global manner to avoid conflict with bash's variables.

Bash script using gzip and bcftools running out of memory with large files

This bash script is meant to be part of a pipeline that processes zipped .vcf file that contain genomes from multiple patients (which means the files are huge even when zipped, like 3-5GB).
My problem is that I keep running out of memory when running this script. It is being run in a GCP high mem VM.
I am hoping there is a way to optimize the memory usage so that this doesn't fail. I looked into it but found nothing.
#!/bin/bash
for filename in ./*.vcf.gz; do
[ -e "$filename" ] || continue
name=${filename##*/}
base=${name%.vcf.gz}
bcftools query -l "$filename" >> ${base}_list.txt
for line in `cat ${base}_list.txt`; do
bcftools view -s "$line" "$filename" -o ${line}.vcf.gz
gzip ${line}.vcf
done
done
If you run out of memory when using bcftools query/view or gzip look for options in the manual that might reduce the memory footprint. In case of gzip you might also switch to an alternative implementation. You could even consider switching the compression algorithm altogether (zstd is pretty good).
However, I have a feeling that the problem could be for line in `cat ${base}_list.txt`;. The whole file ..._list.txt is loaded into memory before the loop even starts. Also, reading lines that way has all kinds of problems, like splitting lines at whitespace, expanding globs like * and so on. Use this instead:
while read -r line; do
bcftools view -s "$line" "$filename" -o "$line.vcf.gz"
gzip "$line.vcf"
done < "${base}_list.txt"
By the way: Are you sure you want bcftools query -l "$filename" >> ${base}_list.txt to append. The file ${base}_list.txt will keep growing each time the script is executed. Consider overwriting the file using > instead of >>.
However, in that case you might not need the file at all as you could use this instead:
bcftools query -l "$filename" |
while read -r line; do
bcftools view -s "$line" "$filename" -o "$line.vcf.gz"
gzip "$line.vcf"
done
You can try to use split on each file (into a constant size) and then gzip the file splits.
https://man7.org/linux/man-pages/man1/split.1.html

how to untar certain files from an archive and grep in parallel in bash

We've got extensive amount of tarballs and in each tarball I need to search for a particular pattern only in some files which names are known before hand.
As the disk access is slower and there is quite a few cores and plenty of memory available on this system, we aim minimising the disk writes and going through the memory as much as possible.
echo "a.txt" > file_subset_in_tar.txt
echo "b.txt" >> file_subset_in_tar.txt
echo "c.txt" >> file_subset_in_tar.txt
tarball_name="tarball.tgz";
pattern="mypattern"
echo "pattern: $pattern"
(parallel -j-2 tar xf $tarball_name -O ::: `cat file_subset_in_tar.txt` | grep -ac "$pattern")
This works just fine on the bash terminal directly. However, when I paste this in a script with bash bang on the top, it just prints zero.
If I change the $pattern to a hard coded string, it runs ok. It feels like there is something wrong with the pipe sequencing or something similar. So, ideally an update to the attempt above or another solution which satisfies the mentioned disk/memory use requirements would be much appreciated.
I believe your parallel command is constructed incorrectly. You can run the pipeline of commands like the following:
parallel -j -2 "tar xf $tarball_name -O {} | grep -ac $pattern" :::: file_subset_in_tar.txt
Also note that the backticks and use of cat is unnecessary, parameters can be fed to parallel from a file using ::::.

Running Shell Script in parallel for each line in a file

I have a delimited (|) input file (TableInfo.txt) that has data as shown below
dbName1|Table1
dbName1|Table2
dbName2|Table3
dbName2|Table4
...
I have a shell script (LoadTables.sh) that parses each line and calls a executable passing args from the line like dbName, TableName. This process reads data from a SQL Server and loads it into HDFS.
while IFS= read -r line;do
fields=($(printf "%s" "$line"|cut -d'|' --output-delimiter=' ' -f1-))
query=$(< ../sqoop/"${fields[1]}".sql)
sh ../ProcessName "${fields[0]}" "${fields[1]}" "$query"
done < ../TableInfo.txt
Right now my process is running in sequential for each line in the file and its time consuming based on the number of entries in the file.
Is there any way I can execute the process in parallel? I have heard about using xargs/GNU parallel/ampersand and wait options. I am not familiar on how to construct and use it. Any help is appreciated.
Note:I don't have GNU parallel installed on the Linux machine. So xargs is the only option as I have heard some cons on using ampersand and wait option.
Put an & on the end of any line you want to move to the background. Replacing the silly (buggy) array-splitting method used in your code with read's own field-splitting, this looks something like:
while IFS='|' read -r db table; do
../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")" &
done < ../TableInfo.txt
...FYI, re: what I meant about "buggy" --
fields=( $(foo) )
...performs not only string-splitting but also globbing on the output of foo; thus, a * in the output is replaced with a list of filenames in the current directory; a name such as foo[bar] can be replaced with files named foob, fooa or foor; the globfail shell option can cause such an expansion to result in a failure, the nullglob shell option can cause it to result in an empty result; etc.
If you have GNU xargs, consider the following:
# assuming you have "nproc" to get the number of CPUs; otherwise, hardcode
xargs -P "$(nproc)" -d $'\n' -n 1 bash -c '
db=${1%|*}; table=${1##*|}
query=$(<"../sqoop/${table}.sql")
exec ../ProcessName "$db" "$table" "$query"
' _ < ../TableInfo.txt

Shell script takes a list of commands as input, tries to execute them, and fails

I am, like many non-engineers or non-mathematicians who try writing algorithms, an intuitive. My exact psychological typology makes it quite difficult for me to learn anything serious like computers or math. Generally, I prefer audio, because I can engage my imagination more effectively in the learning process.
That said, I am trying to write a shell script that will help me master Linux. To that end, I copied and pasted a list of Linux commands from the O'Reilly website's index to the book Python In a Nutshell. I doubt they'll mind, and I thank them for providing it. These are the textfile `massivelistoflinuxcommands,' not included fully below in order to save space...
OK, now comes the fun part. How do I get this script to work?
#/bin/sh
read -d 'massivelistoflinuxcommands' commands <<EOF
accept
bison
bzcmp
bzdiff
bzgrep
bzip2
bzless
bzmore
c++
lastb
lastlog
strace
strfile
zmore
znew
EOF
for i in $commands
do
$i --help | less | cat > masterlinuxnow
text2wave masterlinuxnow -o ml.wav
done
It really helps when you include error messages or specific ways that something deviates from expected behavior.
However, your problem is here:
read -d 'massivelistoflinuxcommands' commands <<EOF
It should be:
read -d '' commands <<EOF
The delimiter to read causes it to stop at the first character it finds that matches the first character in the string, so it stops at "bzc" because the next character is "m" which matches the "m" at the beginning of "massive..."
Also, I have no idea what this is supposed to do:
$i --help | less | cat > masterlinuxnow
but it probably should be:
$i --help > masterlinuxnow
However, you should be able to pipe directly into text2wave and skip creating an intermediate file:
$i --help | text2wave -o ml.wav
Also, you may want to prevent each file from overwriting the previous one:
$i --help | text2wave -o ml-$i.wav
That will create files named like "ml-accept.wav" and "ml-bison.wav".
I would point out that if you're learning Linux commands, you should prioritize them by frequency of use and/or applicability to a beginner. For example, you probably won't be using bison right away`.
The first problem here is that not every command has a --help option!! In fact the very first command, accept, has no such option! A better approach might be executing man on each command since a manual page is more likely to exist for each of the commands. Thus change;
$i --help | less | cat > masterlinuxnow
to
man $i >> masterlinuxnow
note that it is essential you use the append output operator ">>" instead of the create output operator ">" in this loop. Using the create output operator will recreate the file "masterlinuxnow" on each iteration thus containing only the output of the last "man $i" processed.
you also need to worry about whether the command exists on your version of linux (many commands are not included in the standard distribution or may have different names). Thus you probably want something more like this where the -n in the head command should be replace by the number of lines you want, so if you want only the first 2 lines of the --help output you would replace -n with -2:
if [ $(which $i) ]
then
$i --help | head -n >> masterlinuxnow
fi
and instead of the read command, simply define the variable commands like so:
commands="
bison
bzcmp
bzdiff
bzgrep
bzip2
bzless
bzmore
c++
lastb
lastlog
strace
strfile
zmore
znew
"
Putting this all together, the following script works quite nicely:
commands="
bison
bzcmp
bzdiff
bzgrep
bzip2
bzless
bzmore
c++
lastb
lastlog
strace
strfile
zmore
znew
"
for i in $commands
do
if [ $(which $i) ]
then
$i --help | head -1 >> masterlinuxnow 2>/dev/null
fi
done
You're going to learn to use Linux by listening to help descriptions? I really think that's a bad idea.
Those help commands usually list every obscure option to a command, including many that you will never use-- especially as a beginner.
A guided tutorial or book would be much better. It would only present the commands and options that will be most useful. For example, that list of commands you gave has many that I don't know-- and I've been using Linux/Unix extensively for 10 years.

Resources