How to grep and execute command for every multiline match - bash

Is there a way to process a multiline grep output with one command per match?
I've got something like
<fulldata>
<value>1</value>
<value>2</value>
</fulldata>
<fulldata>
<value>2</value>
<value>3</value>
</fulldata>
and want to compute the mean, the standard deviation, and a few other things for each data element on its own.
In this case, I want to execute
function printStatistics {
    mean1=$(awk -F ';' '{print $1}' $1 | awk '{sum += $1; square += $1^2} END {print sum / NR}')
    deviation1=$(awk -F ';' '{print $1}' $1 | awk '{sum += $1; square += $1^2} END {print sqrt(square / NR - (sum/NR)^2)}')
    size=$(cat $1 | wc -l)
    echo $mean1 $deviation1 $size
}
with the expected result (for the sample data), ideally separated by newlines:
1,5 0,7 2
2,5 0,7 2
Running
cat add.xml | grep "<fulldata" -A 2001 | while read line ; do echo "Line: $line" ; done
as suggested in How to grep and execute a command (for every match) results in one entry for each line, but I want one entry for each <fulldata> block (so that I can run the awk processing on each block later).
Is this possible with grep, or is this a use case where another language would be more appropriate?

It is bad practice to parse HTML/XML with grep, because it's not reliable. If you are using Mac OS X, you can use a preinstalled CLI tool called xmllint to select specific elements. On Linux, you can install it through the standard package manager.
There is also xgrep, and probably others that I don't know about.
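For example, a minimal sketch using xmllint's --xpath option, assuming add.xml is well-formed XML (i.e. the <fulldata> blocks sit under a single root element):
# count the <fulldata> blocks, then print the values of each block on one line
n=$(xmllint --xpath 'count(//fulldata)' add.xml)
for ((i = 1; i <= n; i++)); do
    xmllint --xpath "string(//fulldata[$i])" add.xml | xargs    # xargs squeezes the values onto one line
done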

awk to the rescue!
$ awk -v RS='\n?</?fulldata>\n' -F'\n' '
!(NR%2){gsub("</?value>","");
s=ss=0;
for(i=1;i<=NF;i++) {s+=$i; ss+=$i^2}
printf "%.1f %.1f %d\n", s/NF, sqrt((ss-s^2/NF)/(NF-1)), NF} ' file
1.5 0.7 2
2.5 0.7 2
For the sample standard deviation as computed here, you need to guard against the single-observation (NF==1) case.
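For example, a sketch of the same program with that guard added (printing 0 for the deviation of a single observation; you might prefer "NA"):
$ awk -v RS='\n?</?fulldata>\n' -F'\n' '
    !(NR%2){gsub("</?value>","");
            s=ss=0;
            for(i=1;i<=NF;i++) {s+=$i; ss+=$i^2}
            sd = (NF>1) ? sqrt((ss-s^2/NF)/(NF-1)) : 0   # guard: a single value has no sample deviation
            printf "%.1f %.1f %d\n", s/NF, sd, NF}' file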

Complex xmlstarlet + awk solution:
xmlstarlet ed -u "//fulldata/value" -x "concat(.,',')" add.xml \
| xmlstarlet sel -B -t -v "//fulldata" -n \
| awk -F, '{ n=NF-1; sum=sq=0; for(i=1;i<=n;i++) { sum+=$i; sq+=$i^2 }
printf "%.1f\n%.1f\n%d\n", sum/n, sqrt((sq-sum^2/n)/(n-1)), n }'
The output:
1.5
0.7
2
2.5
0.7
2

Related

Random line using sed

I want to select a random line with sed. I know shuf -n and sort -R | head -n do the job, but for shuf you have to install coreutils, and the sort solution isn't optimal on large data.
Here is what I tested:
echo "$var" | shuf -n1
which gives the desired result, but I'm concerned about portability; that's why I want to try it with sed.
var="Hi
i am a student
learning scripts"
output:
i am a student
output:
Hi
It must be random.
It depends greatly on what you want your pseudo-random probability distribution to look like. (Don't try for random; be content with pseudo-random. If you do manage to generate a truly random value, go collect your Nobel Prize.) If you just want a uniform distribution (e.g., each line has equal probability of being selected), then you'll need to know a priori how many lines are in the file. Getting that distribution is not quite as easy as letting the earlier lines in the file be slightly more likely to be selected, but since counting the lines first is easy, we'll do the uniform version. Assuming that the number of lines is less than 32769, you can simply do:
N=$(wc -l < input-file)
sed -n -e $((RANDOM % N + 1))p input-file
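If your input can be longer than 32768 lines, $RANDOM alone cannot reach every line; a hedged workaround along the same two-pass lines is to let awk pick the line number instead:
N=$(wc -l < input-file)
sed -n "$(awk -v n="$N" 'BEGIN { srand(); print int(rand() * n) + 1 }')p" input-file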
-- edit --
After thinking about it for a bit, I realize you don't need to know the number of lines, so you don't need to read the data twice. I haven't done a rigorous analysis, but I believe that the following gives a uniform distribution:
awk 'BEGIN{srand()} rand() < 1/NR { out=$0 } END { print out }' input-file
-- edit --
Ed Morton suggests in the comments that we should be able to invoke rand() only once. That seems like it ought to work, but doesn't seem to. Curious:
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed); r=rand()} r < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 205
2 64
3 37
4 21
5 9
6 9
7 9
8 46
real 0m1.862s
user 0m0.689s
sys 0m0.907s
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed)} rand() < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 55
2 60
3 37
4 50
5 57
6 45
7 50
8 46
real 0m1.924s
user 0m0.710s
sys 0m0.932s
var="Hi
i am a student
learning scripts"
mapfile -t array <<< "$var" # create array from $var
echo "${array[RANDOM % ${#array[@]}]}"
echo "${array[RANDOM % ${#array[@]}]}"
Output (e.g.):
learning scripts
i am a student
See: help mapfile
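The same idea works when the lines come from a file rather than a variable (a minimal sketch; input-file is just a placeholder name):
mapfile -t array < input-file          # one array element per line
echo "${array[RANDOM % ${#array[@]}]}" # print a pseudo-randomly chosen line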
This seems to be the best solution for large input files:
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file
as it uses standard UNIX tools, it's not restricted to files that are 32,769 lines long or less, it doesn't have any bias towards either end of the input, it'll produce different output even if called twice in 1 second, and it exits immediately after the target line is printed rather than continuing to the end of the input.
Update:
Having said the above, I have no explanation for why a script that calls rand() once per line and reads every line of input is about twice as fast as a script that calls rand() once and exits at the first matching line:
$ seq 100000 > file
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file;
done > o3
real 1m0.712s
user 0m8.062s
sys 0m9.340s
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" 'BEGIN{srand(seed)} rand() < 1/NR{ out=$0 } END { print out}' file;
done > o4
real 0m29.950s
user 0m9.918s
sys 0m2.501s
They both produced very similar types of output:
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o3 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
498 500 1 2
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o4 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
490 500 1 3
Final Update:
Turns out it was calling wc that (unexpectedly to me at least!) was taking most of the time. Here's the improvement when we take it out of the loop:
$ time { max=$(wc -l < file); for i in $(seq 500); do awk -v seed="$RANDOM" -v max="$max" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file; done } > o3
real 0m24.556s
user 0m5.044s
sys 0m1.565s
so the solution where we call wc up front and rand() once is faster than calling rand() for every line as expected.
On the bash shell, first initialize the seed to the line count cubed (or any value of your choice):
$ i=; while read a; do let i++; done <<< "$var"; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1; echo "$var" | sed -En "$l p"
If you move your data to a file (varfile):
$ echo "$var" > varfile
$ i=; while read a; do let i++; done < varfile; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1; sed -En "$l p" varfile
Put the last command inside a loop, e.g. for ((c=0; c<9; c++)) { ...; }, to draw several lines.
Using GNU sed and bash; no wc or awk:
f=input-file
sed -n $((RANDOM%($(sed = $f | sed '2~2d' | sed -n '$p')) + 1))p $f
Note: The three seds in $(...) are an inefficient way to fake wc -l < $f. Maybe there's a better way -- using only sed of course.
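One candidate: sed -n '$=' prints the line number of the last line, i.e. the line count, so the whole thing shrinks to (still sed-only):
f=input-file
sed -n "$((RANDOM % $(sed -n '$=' "$f") + 1))p" "$f"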
Using shuf:
$ echo "$var" | shuf -n 1
Output:
Hi

How to process a string using awk in a shell script

I am very new to shell scripting and have so many tasks to do with it. I am trying to learn as fast as possible, but sometimes shell scripting makes a task look very easy and at other times it just toys with me. I am facing a similar situation now.
I have a command which gives me an output like this.
File Dependents
----------------------------------------------------------------------------
<File> is a requisite of <Dependents>
Path: /usr/lib/obj
Java 1.0.0.0 analysis 0.0.0.2
runtime 1.2.0.0
client 1.2.0.0
framework 6.1.9.100
sguide 1.9.10.0
sysmgt 6.1.9.100
dsm 6.1.9.200
Path: /etc/obj
Java 1.0.0.0 analysis 1.2.0.2
runtime 2.0.0.0
client3 6.1.9.0
sysmgt 6.1.9.0
dsm2 6.1.9.0
Now I want to get the list of dependencies into an array for further processing. This is what I am able to do so far:
<command> | cut -f1 | grep '[a-z]' | grep -v File | grep -v : | awk '{ print $1}'
output is:
Java<<< I want this to be analysis
runtime
client
framework
sguide
sysmgt
dsm
Java<<< want this to be analysis
runtime
client3
sysmgt
dsm2
I have to capture these two lists in two separate arrays.
Can someone please help me achieve this output in an elegant way? I don't want to butcher this code with a brute-force method involving lots of conditions and comparisons.
awk to the rescue!
$ arr1=$(command ... | awk -v c=1 '!NF{f=0} f && s==c{print $1} /Java/{f=1; s++; if(s==c) print $(NF-1)}')
$ arr2=$(command ... | awk -v c=2 '!NF{f=0} f && s==c{print $1} /Java/{f=1; s++; if(s==c) print $(NF-1)}')
$ echo $arr1
analysis runtime client framework sguide sysmgt dsm
$ echo $arr2
analysis runtime client3 sysmgt dsm2
It is probably better to run the command once and split the results into two arrays (a sketch of that follows the explanation below).
Explanation
awk -v c=1 sets the awk variable c to 1 (the group instance number)
!NF{f=0} if there are no fields (an empty line), reset f
f && s==c{print $1} if f is set and the counter equals c, print the first field
/Java/{f=1; s++; if(s==c) print $(NF-1)} when the line matches Java, set f, increment the counter, and if the counter matches c, print the penultimate field.
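A sketch of the single-run variant mentioned above (assuming bash 4+ for mapfile, and that a blank line separates the two Path groups, as the !NF reset already assumes):
# run the command only once; awk tags every dependency with its group number,
# then the tagged output is split into two bash arrays
out=$(command ... | awk '!NF{f=0}
                         f{print s, $1}
                         /Java/{f=1; s++; print s, $(NF-1)}')
mapfile -t arr1 < <(awk '$1==1{print $2}' <<< "$out")
mapfile -t arr2 < <(awk '$1==2{print $2}' <<< "$out")
echo "${arr1[@]}"   # analysis runtime client framework sguide sysmgt dsm
echo "${arr2[@]}"   # analysis runtime client3 sysmgt dsm2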
You can fix your solution by first removing the Java <version> part:
command | sed 's/Java [^ ]*//' | cut -f1 | grep '[a-z]' | grep -v File | grep -v : | awk '{ print $1}'
When you use awk, you might as well use its full strength. Just say you want to print the second-to-last field of any line containing a number:
command | awk '/[0-9]/ { print $(NF-1) }'
This is better than trying to use sed (do you have tabs or spaces?)
command | sed -n '/[0-9].[0-9]/ s/^.* \([^ ]*\) .*/\1/p'
A fun solution is using rev to reverse each line. That way cut can find the second field from the end.
command | grep '[0-9].[0-9]' | rev | cut -d " " -f2 | rev
For people who only read the last line, I will repeat the awk solution:
command | awk '/[0-9]/ { print $(NF-1) }'

bash awk: print the 1st column and the 3rd column with everything after it

I am working on the following bash script:
# contents of dbfake file
1 100% file 1
2 99% file name 2
3 100% file name 3
#!/bin/bash
# cat out data
cat dbfake |
# select lines containing 100%
grep 100% |
# print the first and third columns
awk '{print $1, $3}' |
# echo out id and file name and log
xargs -rI % sh -c '{ echo %; echo "%" >> "fake.log"; }'
exit 0
This script works ok, but how do I print everything in column $3 and then all columns after?
You can use cut instead of awk in this case:
cut -f1,3- -d ' '
awk '{ $2 = ""; print }' # remove col 2
If you don't mind a little whitespace:
awk '{ $2="" }1'
But to avoid the UUOC (useless use of cat) and the grep:
< dbfake awk '/100%/ { $2="" }1' | ...
If you'd like to trim that whitespace:
< dbfake awk '/100%/ { $2=""; sub(FS "+", FS) }1' | ...
For fun, here's another way using GNU sed:
< dbfake sed -r '/100%/s/^(\S+)\s+\S+(.*)/\1\2/' | ...
All you need is:
awk 'sub(/.*100% /,"")' dbfake | tee "fake.log"
Others responded in various ways, but I want to point out that using xargs to multiplex output is a rather bad idea.
Instead, why don't you:
awk '$2=="100%" { sub("100%[[:space:]]*",""); print; print >>"fake.log"}' dbfake
That's all. You don't need grep, you don't need multiple pipes, and you definitely don't need to fork a shell for every line you output.
You could do awk '...; print }' | tee fake.log, but there is not much point in forking tee if awk can handle it as well.
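For instance, with the three-line dbfake sample from the question, this should print (and append to fake.log):
$ awk '$2=="100%" { sub("100%[[:space:]]*",""); print; print >> "fake.log" }' dbfake
1 file 1
3 file name 3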

how to extract columns from a text file with bash

I have a text file like this.
res ABS sum
SER A 1 161.15 138.3
CYS A 2 66.65 49.6
PRO A 3 21.48 15.8
ALA A 4 77.68 72.0
ILE A 5 15.70 9.0
HIS A 6 10.88 5.9
I would like to extract the names in the first column (res) based on the values of the last column (sum). I have to print the res names for sum > 25 and for sum < 25. How can I get output like this?
This should do it:
awk 'BEGIN{FS=OFS=" "}{if($5 != 25) print $1}' bla.txt
While you can do this with a while read loop in bash, it's easier, and most likely faster, to use awk
awk '$5 != 25 { print $1 }'
Note that your logic "print res names if sum > 25 and sum < 25" is the same as "print if sum != 25".
Consider using awk. It's a simple tool for processing columns of text (and much more). Here's a simple awk tutorial which will give you an overview. If you want to use it within a bash script, then this tutorial should help.
Run this on the command line to give you an idea of how you could do it:
> echo "SER A 1 161.15 138.3" | awk '{ if($5 > 25) print $1}'
> SER
> echo "SER A 1 161.15 138.3" | awk '{ if($5 > 140) print $1}'
>
while read line
do
    v=($line)
    sum=${v[4]}
    ((${sum/.*/} >= 25)) && echo ${v[0]}
done < file
You need to skip the first line (the header).
Since bash doesn't handle floating-point values, the sum is truncated to an integer, so the test has to use >= 25 and will also print a sum of exactly 25, which isn't strictly bigger than 25.
This can be handled by calling bc for the arithmetic.
tail -n +2 ser.dat | while read line
do
    v=($line)
    sum=${v[4]}
    (( $(echo "$sum > 25" | bc) )) && echo "${v[0]}"
done
What about the good old cut?
:)
Say you would like to have the second column:
cat your_file.txt | sed 's/  */ /g' | cut -d" " -f2
What is sed doing in this command? It squeezes runs of spaces into a single space, because cut expects columns to be separated by a single delimiter character (or fields of fixed length; see the documentation).
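If you actually need both groups (sum > 25 and sum < 25) rather than a single filter, here is a small awk sketch along the same lines (assuming the sum is always the last column, the first line is the header, and file stands for your input):
$ awk 'NR > 1 { print ($NF > 25 ? "above 25:" : "below 25:"), $1 }' file
above 25: SER
above 25: CYS
below 25: PRO
above 25: ALA
below 25: ILE
below 25: HIS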

Bash command to sum a column of numbers [duplicate]

This question already has answers here:
Shell command to sum integers, one per line?
(45 answers)
Closed 7 years ago.
I want a bash command that I can pipe into that will sum a column of numbers. I just want a quick one liner that will do something essentially like this:
cat FileWithColumnOfNumbers.txt | sum
Using existing file:
paste -sd+ infile | bc
Using stdin:
<cmd> | paste -sd+ | bc
Edit:
With some paste implementations you need to be more explicit when reading from stdin:
<cmd> | paste -sd+ - | bc
Options used:
-s (serial) - merges all the lines into a single line
-d - use a non-default delimiter (the character + in this case)
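To see what bc actually receives, it can help to print the intermediate expression first (a quick demo with seq standing in for the real data):
$ seq 1 5 | paste -sd+ -
1+2+3+4+5
$ seq 1 5 | paste -sd+ - | bc
15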
I like the chosen answer. However, it tends to be slower than awk since 2 tools are needed to do the job.
$ wc -l file
49999998 file
$ time paste -sd+ file | bc
1448700364
real 1m36.960s
user 1m24.515s
sys 0m1.772s
$ time awk '{s+=$1}END{print s}' file
1448700364
real 0m45.476s
user 0m40.756s
sys 0m0.287s
The following command will add up the first field of every line:
awk '{s+=$1} END {print s}' filename
Does two lines count?
awk '{ sum += $1; }
END { print sum; }' "$@"
You can then use it without the superfluous 'cat':
sum < FileWithColumnOfNumbers.txt
sum FileWithColumnOfNumbers.txt
FWIW: on MacOS X, you can do it with a one-liner:
awk '{ sum += $1; } END { print sum; }' "$@"
[a followup to ghostdog74's comments]
bash-2.03$ uname -sr
SunOS 5.8
bash-2.03$ perl -le 'print for 1..49999998' > infile
bash-2.03$ wc -l infile
49999998 infile
bash-2.03$ time paste -sd+ infile | bc
bundling space exceeded on line 1, teletype
Broken Pipe
real 0m0.062s
user 0m0.010s
sys 0m0.010s
bash-2.03$ time nawk '{s+=$1}END{print s}' infile
1249999925000001
real 2m0.042s
user 1m59.220s
sys 0m0.590s
bash-2.03$ time /usr/xpg4/bin/awk '{s+=$1}END{print s}' infile
1249999925000001
real 2m27.260s
user 2m26.230s
sys 0m0.660s
bash-2.03$ time perl -nle'
$s += $_; END { print $s }
' infile
1.249999925e+15
real 1m34.663s
user 1m33.710s
sys 0m0.650s
You can use bc (calculator). Assuming your file with #s is called "n":
$ cat n
1
2
3
$ (cat n | tr "\012" "+" ; echo "0") | bc
6
The tr changes all newlines to "+"; then we append 0 after the last plus, then we pipe the expression (1+2+3+0) to the calculator
Or, if you are OK with using awk or perl, here's a Perl one-liner:
$ perl -nle '$sum += $_ } END { print $sum' n
6
while read -r num; do ((sum += num)); done < inputfile; echo $sum
Use a for loop to iterate over your file …
sum=0; for x in `cat <your-file>`; do let sum+=x; done; echo $sum
If you have ruby installed
cat FileWithColumnOfNumbers.txt | xargs ruby -e "puts ARGV.map(&:to_i).inject(&:+)"
[root@pentest3r ~]# (find / -xdev -size +1024M) | (while read a; do aa=$(du -sm "$a" | cut -f1); o=$(( o + aa )); done; echo "$o")
