Bash Grep Takes 3 Days To Run. Any Way to Enhance It? - bash

I have the script below and would like suggestions on how to speed it up.
cd /home/output/
cat R*op.txt > R.total.op.txt
awk '{if( (length($8)>9) || ($8 ~ /^AAA/) ) {print $0}}' R.total.op.txt > temp && mv temp R.total.op.txt
cat S*op.txt > S.total.op.txt
awk '{if( (length($8)>9) || ($8 ~ /^AAA/) ) {print $0}}' S.total.op.txt > temp && mv temp S.total.op.txt
cat R.total.op.txt S.total.op.txt | awk '{print $4}' | sort -k1,1 | awk '!x[$1]++' > genes.txt
rm *total.op.txt
head genes.txt
cd /home/output/
for j in R1_with-genename R2_with-genename S1_with-genename S2_with-genename
do
**for i in `cat genes.txt`; do cat $j'.op.txt' | grep -w $i >> $j'_'$i'_gene.txt'**;done
done
ls -m1 *gene.txt | wc -l
find . -size 0 -delete
ls -m1 *gene.txt | wc -l
rm genes.txt
cd /home/output/
for i in `ls *gene.txt`
do
paste <(awk '{print $4"\t"$8"\t"$9"\t"$13}' $i | awk '!x[$1]++' | awk '{print $1}') <(awk '{print $4"\t"$8"\t"$9}' $i | awk '{if( (length($2)>9) || ($2 ~ /^AAA/) ) {print $0}}' | sort -k2,2 | awk '{ sum += $3 } END { if (NR > 0) print sum / NR }') <(awk '{print $4"\t"$8"\t"$9}' $i| awk '{if( (length($2)>9) || ($2 ~ /^AAA/) ) {print $0}}' | sort -k2,2 | wc -l) <(awk '{print $4"\t"$8"\t"$9"\t"$13}' $i | awk '{if( (length($2)>9) || ($2 ~ /^AAA/) ) {print $0}}' | sort -k2,2 | grep -v ":::" | wc -l) > $i'_stats.txt'
done
rm *gene.txt
cd /home/output/
for j in R1_with-genename R2_with-genename S1_with-genename S2_with-genename
do
cat $j*stats.txt > $j'.final.txt'
done
rm *stats.txt
cd /home/output/
for i in `ls *final.txt`
do
sed "1iGene_Name\tMean1\tCalculated\tbases" $i > temp && mv temp $i
done
head *final.txt
The very first for loop (marked with asterisks), the one that reads genes.txt, is the grep loop that takes 3 days to finish. Can someone please advise on how to speed up this command, and whether the entire script can be made into a single command? Thanks in advance.

Try replacing the nested loops with a single awk.
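awk 'FNR == NR {words[$0] = "\\<" $0 "\\>"; next}  # first file: gene names as whole-word patterns (\< and \> are GNU awk operators)
{ base = FILENAME; sub(/\.op\.txt$/, "", base)      # strip the suffix so output names match the original loop
for (i in words) if ($0 ~ words[i]) {
fn = base "_" i "_gene.txt";
print >> fn;
close(fn);
} }' genes.txt {R,S}{1,2}_with-genename.op.txt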
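
The loop over every gene for every line is still the expensive part. If a gene only has to match the fourth column (an assumption: the question builds genes.txt from $4, but grep -w matches the word anywhere on the line), a hash lookup can replace that inner loop. A minimal sketch; the function name split_by_gene and the demo data are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: split each input file by the gene name in column 4, in one pass.
# Assumes the gene of interest is always in field 4 (not confirmed by the question).
split_by_gene() {
  # $1 = file with one gene per line, remaining args = files named <base>.op.txt
  local genes=$1; shift
  awk '
    FNR == NR { want[$1]; next }        # first file: remember the gene names
    ($4 in want) {                      # other files: constant-time lookup per line
      base = FILENAME
      sub(/\.op\.txt$/, "", base)       # R1_with-genename.op.txt -> R1_with-genename
      fn = base "_" $4 "_gene.txt"
      print >> fn
      close(fn)                         # keep the number of open files small
    }
  ' "$genes" "$@"
}

# Tiny demo with made-up data in a scratch directory
demo=$(mktemp -d)
printf 'geneA\ngeneB\n' > "$demo/genes.txt"
printf 'a b c geneA e f g AAAX 1\na b c geneC e f g AAAX 2\na b c geneB e f g AAAX 3\n' \
  > "$demo/R1_with-genename.op.txt"
( cd "$demo" && split_by_gene genes.txt R1_with-genename.op.txt )
ls "$demo"
```

Lines whose gene is not in genes.txt produce no file at all, so the later find . -size 0 -delete step would have nothing to clean up.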

I suggest creating a sed script:
# name script
SEDSCRIPT=split.sed
# Make sure it is empty
echo "" > ${SEDSCRIPT}
# Loop through all the words in genes.txt and
# create sed command that will write that line to a file
for word in `cat genes.txt`; do
echo "/${word}/w ${word}.txt" >> ${SEDSCRIPT}
done
basenames="R1_with-genename R2_with-genename S1_with-genename S2_with-genename"
# Loop over input files (left unquoted on purpose so the list splits into words)
for name in ${basenames}; do
# Run sed script against file
sed -n -f ${SEDSCRIPT} ${name}.op.txt
# Move the temporary files created by sed to their permanent names
for word in `cat genes.txt`; do
mv ${word}.txt ${name}_${word}_gene.txt
done
done

Related

How to grep first match and second match(ignore first match) with awk or sed or grep?

> root# ps -ef | grep [j]ava | awk '{print $2,$9}'
> 45134 -Dapex=APEC
> 45135 -Dapex=JAAA
> 45136 -Dapex=APEC
I need the PID from the first APEC line as First_PID, the PID from the third line (the second APEC) as Second_PID, and the remaining PID (JAAA) as Third_PID.
I've tried awk, but it does not give the expected result.
> First_PID =ps -ef | grep [j]ava | awk '{print $2,$9}'|awk '{if ($0 == "[^0-9]" || $1 == "APEC:") {print $0; exit;}}'
Expected result should look like this.
> First_PID=45134
> Second_PID=45136
> Third_PID=45135
With your shown samples and attempts please try following awk code. Written and tested in GNU awk.
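ps -ef | grep '[j]ava' |
awk '
{
# Collect the "PID option" pair of every line into one string
val = (val ? val OFS : "") $2 OFS $9
}
END {
# Three-argument match() is a GNU awk extension
if (match(val, /([0-9]+) -Dapex=APEC ([0-9]+) -Dapex=JAAA ([0-9]+) -Dapex=APEC/, arr))
printf "First_PID=%s\nSecond_PID=%s\nThird_PID=%s\n", arr[1], arr[3], arr[2]
}
'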
How about this:
$ input=("1 APEC" "2 JAAA" "3 APEC")
$ printf '%s\n' "${input[@]}" | grep APEC | sed -n '2p'
3 APEC
Explanation:
input=(...) - input data in an array, for testing
printf '%s\n' "${input[@]}" - print input array, one element per line
grep APEC - keep lines containing APEC only
sed -n - run sed without automatic print
sed -n '2p' - print only the second line
If you just want the APECs first...
ps -ef |
awk '/java[ ].* -Dapex=APEC/{print $2" "$9; next; }
/java[ ]/{non[NR]=$2" "$9}
END{ for (rec in non) print non[rec] }'
If possible, use an array instead of those ordinally named vars.
mapfile -t pids < <( ps -ef | awk '/java[ ].* -Dapex=APEC/{print $2; next; }
/java[ ]/{non[NR]=$2} END{ for (rec in non) print non[rec] }' )
After reading everyone's ideas, I ended up with this very simple solution.
FIRST_PID=$(ps -ef | grep APEC | grep -v grep | awk '{print $2}'| sed -n '1p')
SECOND_PID=$(ps -ef | grep APEC | grep -v grep | awk '{print $2}'| sed -n '2p')
JAWS_PID=$(ps -ef | grep JAAA | grep -v grep | awk '{print $2}')
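
The three assignments above each run ps separately, so the process list could change between calls. A sketch that parses one snapshot instead; the sample variable stands in for the output of ps -ef | grep '[j]ava' | awk '{print $2,$9}' and is made-up test data taken from the question, and mapfile assumes bash 4 or newer:

```shell
#!/usr/bin/env bash
# Stand-in for: ps -ef | grep '[j]ava' | awk '{print $2,$9}'
sample='45134 -Dapex=APEC
45135 -Dapex=JAAA
45136 -Dapex=APEC'

# All APEC PIDs, in the order they appear in the snapshot
mapfile -t apec_pids < <(awk '$2 == "-Dapex=APEC" { print $1 }' <<< "$sample")
# The first JAAA PID
jaaa_pid=$(awk '$2 == "-Dapex=JAAA" { print $1; exit }' <<< "$sample")

FIRST_PID=${apec_pids[0]}
SECOND_PID=${apec_pids[1]}
THIRD_PID=$jaaa_pid
printf 'First_PID=%s\nSecond_PID=%s\nThird_PID=%s\n' "$FIRST_PID" "$SECOND_PID" "$THIRD_PID"
```

To use it on live data, replace the here-strings with a single captured snapshot, for example snapshot=$(ps -ef | grep '[j]ava' | awk '{print $2,$9}').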

Bash script not returning any values

#!/bin/bash
for i in $(ls $1); do
echo -n $i | sed 's/.dat//g'
grep '<Overall>' $i | sed 's/<Overall>//g'
awk 'BEGIN{sum=0} {sum+=$1} END{print sum/NR}'
sed -re 's/([0-9]+\.[0-9]{2})[0-9]+/\1/g'
echo 1
done | sort -nrk2
This script should return the average overall rating. I cannot find the mistake, since I'm not getting any output.
untested:
gawk -F'>' '
BEGINFILE {sum = n = 0}
$1 == "<Overall" {sum += $2; n++}
ENDFILE {print FILENAME, (n == 0) ? "n/a" : sum/n}
' *.dat
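
In the question's loop, the awk and the second sed are given no file and are not connected to grep by a pipe, so they wait on standard input, which would explain the missing output. BEGINFILE and ENDFILE are GNU awk features; where gawk is not available, a per-file loop is one alternative. A sketch, assuming each rating line looks like <Overall>4 (the function name overall_avg and the demo files are made up):

```shell
#!/usr/bin/env bash
# Print "<name> <average>" for each .dat file, highest average first.
overall_avg() {
  local f
  for f in "$@"; do
    awk -F'>' -v name="${f%.dat}" '
      $1 == "<Overall" { sum += $2; n++ }   # value follows the closing ">"
      END {
        if (n) printf "%s %.2f\n", name, sum / n
        else   print name, "n/a"
      }
    ' "$f"
  done | sort -k2,2nr
}

# Demo with two made-up rating files
ratings_dir=$(mktemp -d)
printf '<Overall>4\n<Other>1\n<Overall>5\n' > "$ratings_dir/h1.dat"
printf '<Overall>2\n' > "$ratings_dir/h2.dat"
( cd "$ratings_dir" && overall_avg h2.dat h1.dat )
```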

awk append in CSV file

How can I use awk to append 000 to the timestamp column shown below? I tried the following command,
head -n 10000001 ratings.csv | tail -n +2 | awk '{print $1 "000"}' >> ratings_1.csv
but the output is not what I expected.
$ cat ratings.csv |wc -l
20000264
$ head ratings.csv
userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940
My expected output should look like
1,2,3.5,1112486027000
awk '{ if (NR > 1) { $1 = $1 "000" } print }'
Maybe a faster version that wouldn't run the if on every line would be:
awk 'BEGIN { getline; print } { print $0 "000" }'
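
Both one-liners append to the whole line, which works only because the timestamp is the last column. A sketch that targets the fourth field explicitly and drops the header, as the question's tail -n +2 does. It also strips a trailing carriage return, on the assumption (not stated in the question) that Windows line endings might be why the output looked wrong. The function name append_millis is made up:

```shell
#!/usr/bin/env bash
# Append "000" to column 4 of a comma-separated file, skipping the header line.
append_millis() {
  awk 'BEGIN { FS = OFS = "," }
       NR > 1 {
         sub(/\r$/, "")        # drop a trailing CR if the file has CRLF endings
         $4 = $4 "000"         # seconds -> milliseconds, as text
         print
       }' "$@"
}

# Demo on the first rows from the question
printf 'userId,movieId,rating,timestamp\n1,2,3.5,1112486027\n1,29,3.5,1112484676\n' | append_millis
```

With a file name, the call would look like append_millis ratings.csv > ratings_1.csv.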

bash: how to redirect the result in to another file

Now I have this code, which can show the result on my terminal
cat temp | sort -n | uniq -c | awk '{ print $2, $1 }'
How can I redirect this into another file?
I tried echo temp | sort -n | uniq -c | awk '{ print $2, $1 }' > temp2, but it does not work.
Thanks
echo temp | sort -n | uniq -c | awk '{ print $2, $1 }' > temp2
You used echo where cat was needed:
cat temp | sort -n | uniq -c | awk '{ print $2, $1 }' > temp2
Also you don't need to use cat:
sort -n temp | uniq -c | awk '{ print $2, $1 }' > temp2
Any command that displays results to your terminal can be redirected to a file by adding to the end of the command a redirect: > out.txt
cat temp | sort -n | uniq -c | awk '{ print $2, $1 }' > temp2
Your second attempt (echo temp ...) simply sent the string "temp" to the sort command, which sent it to the uniq command, and so forth. echo temp does not read the file "temp": it prints the literal string "temp" and has nothing to do with the file of that name.
[root@www ~]# echo THIS IS FILE CONTENTS > temp
[root@www ~]# cat temp
THIS IS FILE CONTENTS
[root@www ~]# echo temp
temp
[root@www ~]# cat temp > temp2
[root@www ~]# cat temp2
THIS IS FILE CONTENTS
[root@www ~]#
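
The same pipeline from the question can be tried end to end on a small made-up file. A sketch using a scratch directory, so no existing temp or temp2 is touched:

```shell
#!/usr/bin/env bash
scratch=$(mktemp -d)

# Made-up input: one number per line, with repeats
printf '3\n1\n3\n2\n3\n' > "$scratch/temp"

# sort reads the file directly; > sends the final output to temp2
sort -n "$scratch/temp" | uniq -c | awk '{ print $2, $1 }' > "$scratch/temp2"

# Nothing was printed by the pipeline itself; the result is in the file
cat "$scratch/temp2"
```

Using >> instead of > would append to temp2 rather than overwrite it.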

Join conditions in awk

How can I combine this command
top -b -n 5 -d.2 | grep "Cpu" | awk 'NR==3{ print($2)}'
into a single awk command (merging grep and awk into one)?
I have tried this but with no success:
top -b -n 5 -d.2 | awk '{if( $1 == "Cpu(s):" && NR==3 ){ print($2)} }'
or
top -b -n 5 -d.2 | awk '{$1 ~ /Cpu/ && (NR==3) { print($2)}}'
awk '/Cpu/ {x++; if(x==3) { print $2}}'
Note: you can add exit for short-circuiting.
top -b -n 5 -d.2 | awk '/Cpu/ { if (++cnt==3) print $2 }'
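
The attempts in the question test NR==3, but NR counts every input line, whereas the grep version counted only lines that matched Cpu. That is why a separate counter is needed. A sketch with exit added, run on made-up lines rather than live top output (the function name third_cpu_field is hypothetical):

```shell
#!/usr/bin/env bash
# Print field 2 of the third line containing "Cpu", then stop reading.
third_cpu_field() {
  awk '/Cpu/ && ++cnt == 3 { print $2; exit }'
}

# Made-up input that mimics a few lines of top -b output
printf 'Cpu(s): 1.0 us\nMem: 5 total\nCpu(s): 2.0 us\nCpu(s): 3.0 us\nCpu(s): 4.0 us\n' |
  third_cpu_field
```

With real data the call would be top -b -n 5 -d.2 | third_cpu_field. Because && short-circuits, cnt is incremented only on lines that match Cpu.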
