The code I have looks at the last active device used in the sequence and then continues it. If there is a gap in the sequence that is not currently being used I would like to fill it. How can I build that into the code?
Script works as expected as written for next in sequence. I'm not sure where to begin with adding a feature to fill in the gaps.
Input:
bash script WABEL8499IPM 3
Script:
# use $HOME: a quoted ~ is not expanded by the shell
SRCFILE="$HOME/Desktop/deviceinfo.csv"
LOGDIR="$HOME/Desktop"
LOGFILE="$LOGDIR/DeviceNames.csv"
# base name, such as "WABEL8499IPM"
device_name=$1
# quantity, such as "2"
quantityNum=$2
# the largest in sequence, such as "WABEL8499IPM108"
max_sequence_name=$(cat $SRCFILE | grep -o -e "$device_name[0-9]*" | sort --reverse | head -n 1)
# extract the last 3digit number (such as "108") from max_sequence_name
max_sequence_num=$(echo $max_sequence_name | rev | cut -c 1-3 | rev)
# create a sequence of files starting from "WABEL8499IPM101" if there is not any "WABEL8499IPM".
if [ -z "$max_sequence_name" ]; then
max_sequence_name=$device_name
max_sequence_num=100
fi
# Fill In Sequence If Any Spots are Available If 101, 102, 104,
# 105, 106, 107 and 108 are used I want to output 103 (to fill in),
# 109 and 110 (to continue sequence).
# create new sequence_name
# such as ["WABEL8499IPM109", "WABEL8499IPM110"]
array_new_sequence_name=()
for i in $(seq 1 $quantityNum); do
cnum=$((max_sequence_num + i))
array_new_sequence_name+=($(echo $device_name$cnum))
done
#CODE FOR CREATING OUTPUT FILE HERE
#for fn in ${array_new_sequence_name[@]}; do touch $fn; done;
# write log
for sqn in ${array_new_sequence_name[@]};
do
echo $sqn >> $LOGFILE
done
Actual result as written:
#OUTPUT FROM WABEL8499IPM, 3
#IF WABEL8499IPM101,102,104,105 ARE USED THEN OUTPUT IS THIS:
WABEL8499IPM106
WABEL8499IPM107
WABEL8499IPM108
Desired/Expected Result:
#OUTPUT FROM WABEL8499IPM, 3
#IF WABEL8499IPM101,102,104,105 ARE USED THEN OUTPUT IS THIS:
WABEL8499IPM103
WABEL8499IPM106
WABEL8499IPM107
Basically in my current script I'm making an API call to see what is currently enrolled into an MDM and then looking at the highest number in the sequence and outputting the next number in the sequence. The goal is to fill in the sequence if there are any gaps where the sequence isn't completed.
This might work for you:
# create a test source file
$ cat > src_file <<-EOF
foo WABEL8499IPM102 bar WABEL8499IPM108 foo bar
WABEL8499IPM106
foo WABEL8499IPM104
foo bar
WABEL8499IPM105 WAbel8499IPM110 bar
EOF
# the actual code
$ cat script
#!/usr/bin/env bash
pre=$1
num=$2
f=src_file
s=101
((num==0)) && exit
grep -oP "$pre\K[0-9]+" "$f" | sort -n > tmp
comm -13 tmp <(seq $s $((s+num+$(wc -l < tmp)))) | awk -v n=$num -v p="$pre" '{print p $0}NR>=n{exit}'
# execute the script
$ ./script WABEL8499IPM 5
WABEL8499IPM101
WABEL8499IPM103
WABEL8499IPM107
WABEL8499IPM109
WABEL8499IPM110
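The same idea can also be sketched without a temporary file. This is only a rough sketch under a few assumptions: next_free_names is a made-up helper name, the numbers after the prefix have no leading zeros, and the sequence starts at 101 as in the question.

```bash
#!/usr/bin/env bash
# Sketch only: next_free_names is a hypothetical helper, not part of the scripts above.
# Assumes numbers after the prefix have no leading zeros and start at 101.
next_free_names() {
    local prefix="$1"   # base name, e.g. WABEL8499IPM
    local count="$2"    # how many names to print
    local file="$3"     # file listing the names already in use
    local n="${4:-101}" # first number of the sequence
    local used
    # Collect every name that is already taken.
    used=$(grep -oE "${prefix}[0-9]+" "$file" | sort -u)
    while [ "$count" -gt 0 ]; do
        # Print the candidate only if it is not in the used list.
        if ! printf '%s\n' "$used" | grep -qxF "${prefix}${n}"; then
            printf '%s%s\n' "$prefix" "$n"
            count=$((count - 1))
        fi
        n=$((n + 1))
    done
}

# Demo with the numbers from the question: 101, 102, 104 and 105 are used.
demo=$(mktemp)
printf '%s\n' WABEL8499IPM101 WABEL8499IPM102 WABEL8499IPM104 WABEL8499IPM105 > "$demo"
next_free_names WABEL8499IPM 3 "$demo"   # prints 103, 106 and 107 with the prefix
rm -f "$demo"
```

It runs one grep per candidate number, which is fine for a handful of devices but slower than the comm approach for long lists.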
I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The problem is that I don't know how to tell the while loop which file to read and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But in neither case was I able to specify which file would be the input to the loop, so that parallel would understand "take each of those xa* files and feed them to individual loops in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also made previously, outside this loop. Its number of lines can vary from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of each sequence it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, take sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has the accessions of all the sequences that it matched. The loop finds the taxonomies for all those sequences and calculates the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about 12 h on one processor to process all 45000 DNA sequence accessions. Therefore, I would like to split input_file and run it on all the processors I have (14), because that would reduce the time spent on it.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
As an alternative I threw together a quick test.
#!/usr/bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
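Applied to the files in the question, the same hash lookup might look roughly like this. It is only a sketch: lookup_taxonomy is a made-up name, and it assumes the taxonomy file is tab-separated with the accession in column 1 and the taxonomy in column 2, as the cut -f 2 in the question suggests.

```bash
#!/usr/bin/env bash
# Sketch only: lookup_taxonomy is a hypothetical helper.
# Assumes a tab-separated taxonomy file: accession <TAB> taxonomy.
lookup_taxonomy() {
    local taxonomy="$1"     # e.g. taxonomy.file.tsv
    local accessions="$2"   # e.g. input_file, one accession per line
    awk -F '\t' '
        FNR == NR { tax[$1] = $2; next }      # first file: remember accession -> taxonomy
        $1 in tax { print $1 "\t" tax[$1] }   # second file: print accessions that are known
    ' "$taxonomy" "$accessions"
}
```

Called as lookup_taxonomy taxonomy.file.tsv input_file, it reads each file once instead of running one grep per accession.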
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=$(mktemp -d)
cd "$dir" || return
split -n l/"$chunks" "$file"
for i in *; do
process "$i" &
done
# wait for all background jobs before deleting the chunks
wait
rm -rf "$dir"
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This should do the job for you. I am not familiar with parallel, so instead I use native bash background processes (&):
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
: # placeholder; bash does not allow an empty loop body
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[@]}"
do loop "${i}" &
done
wait
I am currently building a bash script for class, and I am trying to use the grep command to grab the values from a simple calculator program and store them in the variables I assign, but I keep receiving a syntax error message when I try to run the script. Any advice on how to fix it? my script looks like this:
#!/bin/bash
addanwser=$(grep -o "num1 + num2" Lab9 -a 5 2)
echo "addanwser"
subanwser=$(grep -o "num1 - num2" Lab9 -s 10 15)
echo "subanwser"
multianwser=$(grep -o "num1 * num2" Lab9 -m 3 10)
echo "multianwser"
divanwser=$(grep -o "num1 / num2" Lab9 -d 100 4)
echo "divanwser"
modanwser=$(grep -o "num1 % num2" Lab9 -r 300 7)
echo "modawser"`
You want to grep the output of a command.
grep searches either a file or standard input, so any of these equivalent forms works:
grep X file # 1. from a file
... things ... | grep X # 2. from stdin
grep X <<< "content" # 3. using here-strings
For this case, you want to use the last one, so that you execute the program and its output feeds grep directly:
grep <something> <<< "$(Lab9 -s 10 15)"
Which is the same as saying:
Lab9 -s 10 15 | grep <something>
So that grep will act on the output of your program. Since I don't know how Lab9 works, let's use a simple example with seq, that returns numbers from 5 to 15:
$ grep 5 <<< "$(seq 5 15)"
5
15
grep is usually used for finding matching lines of a text file. To actually grab a part of the matched line other tools such as awk are used.
Assuming the output looks like "num1 + num2 = 54" (i.e. fields are separated by space), this should do your job:
addanwser=$(Lab9 -a 5 2 | awk '{print $NF}')
echo "$addanwser"
Make sure you don't miss the '$' sign before addanwser when echo'ing it.
$NF selects the last field. You may select nth field using $n.
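As a quick check of the idea, here is a small sketch. The Lab9 function below is only a hypothetical stand-in, since the real program and its output format are not shown; it assumes -a means "add" and that the result is the last word on the line.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real Lab9 program (its actual output is not shown).
Lab9() {
    # Assumes -a means "add" and prints something like "num1 + num2 = 7".
    if [ "$1" = "-a" ]; then
        echo "num1 + num2 = $(( $2 + $3 ))"
    fi
}

# Run the program, keep only the last field of its output.
addanwser=$(Lab9 -a 5 2 | awk '{print $NF}')
echo "$addanwser"   # prints 7
```

With the real program, drop the function and keep the last two lines.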
I am trying to take the first number from each file.dat, which has the form:
5.01 1 56.413481000 -0.00063400 0.00095770
5.01 2 61.193808800 0.00102170 0.00078280
5.01 3 65.974136600 -0.00108170 0.00102620
5.01 4 70.754464300 0.00082490 0.00103630
and then use this number (5.01) as the title of a .png file.
I use a bash script and I know the command line=$(head -n 1 $f), as found in a question here, but this gives me the whole first line of the file $f.
In this case the spaces in the line are kept too, and the .png file title becomes:
plot 5.01 1 56.413481000 -0.00063400 0.00095770.png
Is there some way to take only 5.01 and get a trimmed title for the plot?
Thanks to all.
I'd probably just do it with perl:
VAL=$( echo "$line" | perl -pe 's/^[^\d]+//g;s/[^\d\.].*$//' )
Something like that anyway.
It should remove:
anything that isn't a digit from the start of the line;
everything from the first character that is neither a digit nor a . to the end of the line.
Or with grep:
grep -o "[0-9]*\.[0-9]*" file.dat | head -1
Edit:
Testing without the head -1 for a oneline input:
echo " 5.01 2 61.193808800 0.00102170 0.00078280" | grep -o "[0-9]*\.[0-9]*"
5.01
61.193808800
0.00102170
0.00078280
Using head -1 will return the first match on the first line.
When you know the match will be on the first line, you can skip files with an incorrect first line (and avoid grepping through complete files) by reading only that line.
Make a two-headed monster:
head -1 file.dat | grep -o "[0-9]*\.[0-9]*" | head -1
To extract the first field, assuming they are tab separated:
val=$(head -n 1 $f | cut -f 1)
or, if they are space separated instead:
val=$(head -n 1 $f | cut -f 1 -d ' ')
OR you can avoid calling any extra processes and keep all data manipulation in the bash shell with
while read -r realNum restOfLine ; do
break
done < $f
echo $realNum
This grabs the first "word" and puts the remaining into "restOfLine".
The break ensures that you only read the first line of the file.
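Since only one line is needed, the loop can probably be dropped altogether. A minimal sketch, where first_field is just an illustrative name and the fields are assumed to be separated by spaces or tabs:

```bash
#!/usr/bin/env bash
# Sketch only: first_field is a hypothetical helper name.
first_field() {
    local first rest
    # A single read consumes only the first line; leading blanks are skipped.
    read -r first rest < "$1"
    printf '%s\n' "$first"
}
```

Used as title=$(first_field "$f"), the plot name would then be "plot $title.png".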
IHTH
For example, I run
sh mycode Manu gg44
and I need to get a file named Manu
with this content:
gg44
192.168.1.2.(second line) (this number I explain below)
(in the directory DIR=/h/Manu/HOME/hosts there is already file Alex
cat Alex
ff55
198.162.1.1.(second line))
So mycode creates a file named Manu with gg44 as the first line and generates an IP for the second line.
BUT to generate the IP it has to compare with the IP in the Alex file. So the second line of Manu has to be 198.162.1.2. If there is more than one file in the directory, then we have to check the second lines of all the files and generate the IP according to them.
[CODE]
DIR=/h/Manu/HOME/hosts #this is a directory where i have my files (structure of the files above)
for j in $1 $2 #$1 is Manu; $2 is gg44
do
if [ -d $DIR ] #checking if directory exists (it exists already)
then #if it exists
for i in $* # for every file in this directory do operation
do
sort /h/ManuHOME/hosts/* | tail -2 | head -1 # get second line of every file
IFS="." read A B C D # divide number in second line into 4 parts (our number 192.168.1.1. for example)
if [ "$D" != 255 ] #compare D (which is 1 in our example: if its less than 255)
then
D=` expr $D + 1 ` #then increment it by 1
else
C=` expr $C + 1 ` #otherwise increment C and make D=0
D=0
fi
echo "$2 "\n" $A.$B.$C.$D." >/h/Manu/HOME/hosts/$1
done done #get $2 (which is gg44 in example as a first line and get ABCD as a second line)[/CODE]
As a result it creates a file named Manu with the correct first line, but the second line is totally wrong. It gives me ...1.
Also error message
sort: open failed: /h/u15/c2/00/c2rsaldi/HOME/hosts/yu: No such file or directory
yu n ...1.
#!/bin/bash
dir=/h/Manu/HOME/hosts
filename=$dir/$1
firstline=$2
# find the max IP address from all current files:
maxIP=$(awk 'FNR==2' $dir/* | cut -d. -f4 | sort -nr | head -1)
ip=198.162.1.$(( maxIP + 1 ))
cat > $filename <<END
$firstline
$ip
END
I'll leave it up to you to decide what to do when you get more than 255 files...
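For what it's worth, one way to handle the rollover could look like the sketch below, which mirrors the "increment C and reset D" idea from the question. next_ip is a made-up helper name; it assumes a plain dotted quad without a trailing dot and does not handle the third octet going past 255.

```bash
#!/usr/bin/env bash
# Sketch only: next_ip is a hypothetical helper.
# Assumes input like 198.162.1.1 (no trailing dot); third-octet overflow is not handled.
next_ip() {
    local ip="$1"
    local d="${ip##*.}"      # last octet
    local rest="${ip%.*}"    # first three octets
    local c="${rest##*.}"    # third octet
    local ab="${rest%.*}"    # first two octets
    if [ "$d" -lt 255 ]; then
        d=$((d + 1))
    else
        # last octet is full: bump the third octet and start again at 0
        c=$((c + 1))
        d=0
    fi
    printf '%s.%s.%s\n' "$ab" "$c" "$d"
}

next_ip 198.162.1.1     # prints 198.162.1.2
next_ip 198.162.1.255   # prints 198.162.2.0
```

To use it in the script above, the whole second line of the newest file would have to be passed in instead of only its fourth field.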