Simplify an awk "nth column sum" - bash
Could you help me simplify:
awk 'BEGIN{FS=OFS=","}{rank=1/((1/$6)+(1/$10)+(1/$14)+(1/$18)+(1/$22));print $0,rank}' test.csv
I know the for loop should be:
for(i=6; i<=NF; i+=4)
but I don't know how to express that repeating column pattern as a loop in awk. I'm also not sure how awk handles division by zero.
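For what it's worth, a literal division by zero is a fatal run-time error (the message below is GNU awk's; other awks word it differently), while an empty field like the ones in my last row simply evaluates to 0 in numeric context:
$ awk 'BEGIN { print 1/0 }'
awk: cmd. line:1: fatal: division by zero attempted
$ echo 'a,,b' | awk -F, '{ print $2 + 0 }'
0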
Sample data:
04/12/10 01:15,1291425300,279,41,6,24,71,39,12,1,356,25,4,29,32,10,1,1,170,27,16,8
21/05/14 16:45,1400690700,147,28,80,13,99,7,121,11,107,19,132,12,119,24,40,10,154,25,161,20
09/10/07 09:45,1191923100,152,56,201,35,115,47,157,29,149,47,119,19,131,40,30,11,216,136,213,64
08/06/07 00:30,1181262600,133,47,268,41,93,26,282,40,151,30,249,39,160,46,191,45,164,64,216,42
13/11/09 06:15,1258092900,1043,1462,1163,1456,789,1111,930,1143,954,1460,1366,1469,831,891,728,954,1092,1316,1381,1492
10/03/98 19:30,889558200,789,1240,1176,1262,,,,,,,,,,,,,162,271,1006,283
Sample output:
04/12/10 01:15,1291425300,279,41,6,24,71,39,12,1,356,25,4,29,32,10,1,1,170,27,16,8,0.454308093994778
21/05/14 16:45,1400690700,147,28,80,13,99,7,121,11,107,19,132,12,119,24,40,10,154,25,161,20,2.49273678094131
09/10/07 09:45,1191923100,152,56,201,35,115,47,157,29,149,47,119,19,131,40,30,11,216,136,213,64,4.50004789527607
08/06/07 00:30,1181262600,133,47,268,41,93,26,282,40,151,30,249,39,160,46,191,45,164,64,216,42,8.2601610016789
13/11/09 06:15,1258092900,1043,1462,1163,1456,789,1111,930,1143,954,1460,1366,1469,831,891,728,954,1092,1316,1381,1492,252.467979545275
10/03/98 19:30,889558200,789,1240,1176,1262,,,,,,,,,,,,,162,271,1006,283,#DIV/0!
Like this, using a ternary to guard against empty or zero fields:
BEGIN{FS=OFS=","}{rank=0;for(i=6;i<=22;i+=4)rank+=($i ? 1/$i : 0);print $0,rank}
(Note this prints the sum of the reciprocals and silently skips empty/zero fields; to match the sample output you would still invert the sum with 1/rank, and flag the zero case as #DIV/0! instead of skipping it. The fuller version below does both.)
$ awk '
BEGIN { FS=OFS="," }
{
    for(i=6;i<=NF;i+=4)    # every 4th column, starting at $6
        if($i+0==0) {      # empty or zero divisor
            rank="#DIV/0!" # set rank to something static
            break          # break from for
        }
        else
            rank+=1/$i     # sum the reciprocals of every 4th column
    if(rank!="#DIV/0!")
        rank=1/rank        # invert the sum to get the rank of your formula
                           # (safe here: rank > 0 whenever no zero divisor was seen)
    print $0,rank          # output
    rank=0                 # reset for the next record
}' file
Output (matches the sample output above, within awk's default %.6g print precision):
04/12/10 01:15,1291425300,279,41,6,24,71,39,12,1,356,25,4,29,32,10,1,1,170,27,16,8,0.454308
21/05/14 16:45,1400690700,147,28,80,13,99,7,121,11,107,19,132,12,119,24,40,10,154,25,161,20,2.49274
09/10/07 09:45,1191923100,152,56,201,35,115,47,157,29,149,47,119,19,131,40,30,11,216,136,213,64,4.50005
08/06/07 00:30,1181262600,133,47,268,41,93,26,282,40,151,30,249,39,160,46,191,45,164,64,216,42,8.26016
13/11/09 06:15,1258092900,1043,1462,1163,1456,789,1111,930,1143,954,1460,1366,1469,831,891,728,954,1092,1316,1381,1492,252.468
10/03/98 19:30,889558200,789,1240,1176,1262,,,,,,,,,,,,,162,271,1006,283,#DIV/0!
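If the start column or the stride ever changes, they can be passed in from the command line instead of being hard-coded; a minimal parameterized sketch of the same logic (start and step are hypothetical variable names, not anything awk-specific):
$ awk -v start=6 -v step=4 '
BEGIN { FS=OFS="," }
{
    rank=0; div0=0
    for(i=start; i<=NF; i+=step)      # every step-th column, from $start on
        if($i+0==0) { div0=1; break } # empty or zero field: flag and stop
        else rank+=1/$i
    # rank==0 also guards the degenerate case of a row with fewer than start fields
    print $0, (div0 || rank==0 ? "#DIV/0!" : 1/rank)
}' file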
Related
bash script to read values inside every file and compare them
I want to plot some data from a spray simulation. There is a variable called vaporpenetrationlength, which describes the distance from the injector to the position where the mass fraction is 0.1%. The simulation creates many folders, one for each time step. Inside each of those folders there is one file which contains the mass fraction and the distance. I want to create a script which goes through all the time step folders, searches inside this one file, and prints out the distance where the 0.1% was measured and in which time step it was. I found a script, but I don't understand it because I just started to learn shell scripting. Could someone please help me step by step in building such a script? I am interested in learning it, and therefore I want to understand every line of the code. Thanks in advance :)
This little script outputs Time, Length and Mass, tab-separated, based on the value of the "mass fraction":

printf '%s\t%s\t%s\n' 'Time' 'Length' 'Mass'
awk '
    BEGIN { FS = OFS = "\t" }
    FNR == 1 {
        n = split(FILENAME,path,"/")
        time = sprintf("%0.7f",path[n-1])
    }
    NF != 2 { next }
    0.001 <= $2 && $2 < 0.00101 { print time,$1,$2 }
' postProcessing/singleGraphVapPen/*/*

remark: In fact, printing the header could be done within the awk program, but doing it with a separate printf command allows you to post-process the output of awk (for example if you need to sort the times and/or lengths and/or masses).

notes:
FNR == 1 is true for the first line of each input file. In the corresponding block, I extract the time value from the directory name.
NF != 2 { next } filters out the gnuplot commands that are at the beginning of the input files. In words, this statement means "if the number of (tab-delimited) fields in the line isn't 2, then skip".
0.001 <= $2 && $2 < 0.00101 selects the lines based on the value of their second field, which is referred to as yheptane in your script. I don't know the margin of error of your "0.1% of mass fraction", so I chose convenient conditions for the sample output below.

With the sample data, the output will be:

Time       Length     Mass
0.0001500  0.0895768  0.00100839
0.0002000  0.102057   0.00100301
0.0002000  0.0877939  0.00100832
0.0003500  0.0827694  0.00100114
0.0009000  0.0657509  0.00100015
0.0015000  0.0501911  0.00100016
0.0016500  0.0469495  0.00100594
0.0018000  0.0436538  0.00100853
0.0021500  0.0369005  0.00100809
0.0023000  0.100328   0.00100751

As an aside, here's a script for replacing your original code:

#!/bin/bash

set -- postProcessing/singleGraphVapPen/*/*

if ! [ -f VapPen.txt ]
then
    {
        printf '%s\t%s\n' 'Time [s]' 'VapPen [m]'
        awk '
            BEGIN { FS = OFS = "\t" }
            FNR == 1 {
                if (NR > 1)
                    print time,vappen
                vappen = 0
                n = split(FILENAME,path,"/")
                time = sprintf("%0.7f",path[n-1])
            }
            NF != 2 { next }
            $2 >= 0.001 { vappen = $1 }
            END {
                if (NR)
                    print time,vappen
            }
        ' "$@" |
        sort -n -k1,1
    } > VapPen.txt
fi

gnuplot -e '
    set title "Verdunstungspenetration";
    set xlabel "Zeit [s]";
    set ylabel "Verdunstungspenetrationslänge [m]";
    set grid;
    plot "VapPen.txt" using 1:2 with linespoints title "Vapor penetration 0,1% mass";
    pause -1 "Hit return to continue";
'

With the provided data, it reduces the execution time from several minutes to 0.15 s on my computer.
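A quick illustration of why FNR == 1 fires once per input file while NR keeps counting across files, and hence why the NR > 1 guard skips the very first file (f1 and f2 are hypothetical test files):
$ printf 'a\nb\n' > f1; printf 'c\n' > f2
$ awk 'FNR == 1 { print "first line of", FILENAME, "at NR=" NR }' f1 f2
first line of f1 at NR=1
first line of f2 at NR=3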
Adding constant values using awk
I have a requirement to add a constant value to the 4th column if its value is less than 240000. The constant value is 010000. I have written a command but it does not give any output. Below are the sample data and the script. Please help me with this. Thanks in advance.

Command:

awk '{ if($4 -lt 240000) $4= $4+010000; }' Test.txt

Sample data:

1039,1018,20180915,000000,0,0,A
1039,1018,20180915,010000,0,0,A
1039,1018,20180915,020000,0,0,A
1039,1018,20180915,030000,0,0,A
1039,1018,20180915,240000,0,0,A
1039,1018,20180915,050000,0,0,A
1039,1018,20180915,060000,0,0,A
1039,1018,20180915,070000,1,0,A
1039,1018,20180915,080000,0,1,A
1039,1018,20180915,090000,2,0,A
1039,1018,20180915,241000,0,0,A
1039,1018,20180915,240500,0,0,A
Your original command produces no output simply because it never prints anything: the action modifies $4 but there is no print statement. Also, -lt is shell test syntax; awk's less-than is plain <.

$ awk '
BEGIN { FS=OFS="," }                  # input and output field separators
{
    if($4<240000)                     # if comparison
        $4=sprintf("%06d",$4+10000)   # I assume 10000, not 010000; also zero-padded to 6 chars
        # $4+=10000                   # use this instead if zero-padding is not required
    print                             # output
}' file

Output:

1039,1018,20180915,010000,0,0,A
1039,1018,20180915,020000,0,0,A
1039,1018,20180915,030000,0,0,A
1039,1018,20180915,040000,0,0,A
1039,1018,20180915,240000,0,0,A
1039,1018,20180915,060000,0,0,A
1039,1018,20180915,070000,0,0,A
1039,1018,20180915,080000,1,0,A
1039,1018,20180915,090000,0,1,A
1039,1018,20180915,100000,2,0,A
1039,1018,20180915,241000,0,0,A
1039,1018,20180915,240500,0,0,A

Why $4+10000 and not $4+010000: a numeric constant with a leading zero is an octal constant in awk source (in GNU awk, at least), so awk 'BEGIN{ print 010000+0 }' outputs 4096, the decimal value of octal 010000.
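Since the data itself is zero-padded, the leading-zero pitfall is worth pinning down; a quick check (GNU awk behaviour shown, and an assumption for other awks, which may treat source constants as plain decimal):
$ gawk 'BEGIN { print 010000 }'          # leading zero in source code: octal constant
4096
$ echo 010000 | gawk '{ print $1 + 0 }'  # leading zero in input data: converted as decimal
10000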
Bash script - How to loop through rows in a CSV file
I am working with a huge CSV file (filename.csv) that contains a single column. From column 1, I want to read the current row and compare it with the value of the previous row. If it is greater than or equal, continue comparing; if the value of the current cell is smaller than the previous row, divide the value of the current cell by the value of the previous cell, print the result, and exit. For example, from the following data I want my bash script to divide 327 by 340, print 0.961765 to the console, and exit:

338
338
339
340
327
301
299
284
284
283
283
283
282
282
282
283

I tried it with the following awk and it works perfectly fine:

awk '$1 < val {print $1/val; exit} {val=$1}' filename.csv

However, since I want to include around 7 conditional statements (if-elses), I wanted to do it with a somewhat cleaner bash script. I am not that used to awk, to be honest, and that's why I prefer bash. Here is my approach:

#!/bin/bash

FileName="filename.csv"

# Test when to stop looping
STOP=1

# to find the number of columns
NumCol=`sed 's/[^,]//g' $FileName | wc -c`; let "NumCol+=1"

# Loop until the current cell is less than the count+1
while [ "$STOP" -lt "$NumCol" ]; do
    cat $FileName | cut -d, -f$STOP
    let "STOP+=1"
done

How can we loop through the values and add conditional statements?

PS: the criteria for my if-else statement are: if the value ($1/val) is >=0.85 and <=0.9, print A; else if it is >=0.7 and <=0.8, print B; if it is >=0.5 and <=0.6, print C; otherwise print D.
Here's one in GNU awk using switch, because I haven't used it in a while:

awk '
$1<p {
    s=sprintf("%.1f",$1/p)
    switch(s) {
    case "0.9":      # if comparing to values ranged [0.9-1.0[ use /0.9/
        print "A"    # ... in which case (no pun) you don't need sprintf
        break
    case "0.8":
        print "B"
        break
    case "0.7":
        print "C"
        break
    default:
        print "D"
    }
    exit
}
{ p=$1 }' file
D

Other awks, using if:

awk '
$1<p {
    # s=sprintf("%.1f",$1/p)   # s is not rounded anymore
    s=$1/p
    # if(s==0.9)               # if you want rounding,
    #     print "A"            # uncomment and edit all ifs to resemble
    if(s~/0.9/)
        print "A"
    else if(s~/0.8/)
        print "B"
    else if(s~/0.7/)
        print "C"
    else
        print "D"
    exit
}
{ p=$1 }' file
A

(The unrounded variant prints A here, not D: the string 0.961765 matches the regex /0.9/.)
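Worth spelling out why the switch version prints D on the asker's data: 327/340 is about 0.9618, and sprintf("%.1f", ...) rounds that up to "1.0", which matches no case:
$ awk 'BEGIN { printf "%.1f\n", 327/340 }'
1.0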
This is an alternative approach, based on the previous input data, for comparing $1/val with the fixed numbers 0.9, 0.7 and 0.6. This solution will not work with ranges like ($1/val) >= 0.85 and <= 0.9, as clarified later.

awk 'BEGIN{crit[0.9]="A";crit[0.7]="B";crit[0.6]="C"} \
$1 < val{ss=substr($1/val,1,3);if(ss in crit) {print crit[ss]} else {print "D"};exit}{val=$1}' file
A

This technique is based on checking whether the truncated value of $1/val belongs to a predefined array loaded with the corresponding messages. Let me expand the code for better understanding:

awk '
BEGIN{crit[0.9]="A";crit[0.7]="B";crit[0.6]="C"}  # define the criteria array: your criteria values are the keys, the messages to print are the values
$1 < val{
    ss=substr($1/val,1,3)   # gets the first three chars of the result $1/val
    if(ss in crit) {        # checks whether those three chars are a key of the crit array declared in BEGIN
        print crit[ss]      # if so, print its value
    } else {
        print "D"           # if not, print D (note the quotes: a bare D would print an empty, uninitialized variable)
    }
    exit
}
{val=$1}' file

Using substr we get the first three chars of the result $1/val: for $1/val = 0.961765, substr($1/val,1,3) returns 0.9. If you want to make comparisons based on two decimals like 0.96, change the substr to substr($1/val,1,4). In that case you need to provide the corresponding entries in the crit array, i.e. crit[0.96]="A".
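The key contrast with the switch answer above is truncation versus rounding: substr keeps the leading characters while sprintf("%.1f",...) rounds, so on the asker's own ratio the two disagree, which is why this answer prints A and the switch one prints D:
$ awk 'BEGIN { r = 327/340; print substr(r, 1, 3), sprintf("%.1f", r) }'
0.9 1.0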
Splitting of Big File into Smaller Chunks in Shell Scripting
I need to split a bigger file into smaller chunks based on the last occurrence of a pattern, using a shell script. For example, with this Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):

NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/>

For pattern 1 = "00003", the output file sample_00003.txt must contain:

NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>

For pattern 2 = "00112", the output file sample_00112.txt must contain:

EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>

I used

awk -F'|' -v pattern="00003" '$3~pattern' big_file > smallfile

and grep commands, but it was very time consuming since the file is 300+ MB in size.
Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching. It processes the lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1:

ndx=0; fromRow=1
for val in '00003' '00112' '|'; do   # 2 sample values to match, plus dummy value
    chunkFile="smallfile$(( ++ndx ))"
    fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
        NR < fromRow { next }
        { if ($3 != val) { if (p) { print NR; exit } } else { p=1 } }
        { print > outFile }
    ' big_file)
done

Note that the dummy value | ensures that any remaining rows after the last true value to match are saved to a chunk file too.

Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:

awk -F'|' -v vals='00003|00112' '
    BEGIN { split(vals, val); outFile = "smallfile" ++ndx }
    {
        if ($3 != val[ndx]) {
            if (p) { p=0; close(outFile); outFile = "smallfile" ++ndx }
        } else {
            p=1
        }
        print > outFile
    }
' big_file
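If the chunks should be named after the pattern itself (sample_00003.txt, as the question shows), the single-pass version only needs its file naming changed; a sketch, assuming the values in vals appear in the file in the given order:

awk -F'|' -v vals='00003|00112' '
BEGIN { n = split(vals, val); ndx = 1; outFile = "sample_" val[ndx] ".txt" }
{
    if ($3 != val[ndx]) {
        if (p) {                  # we have just passed the last occurrence of val[ndx]
            p = 0
            close(outFile)
            if (++ndx > n) exit   # no more patterns to collect
            outFile = "sample_" val[ndx] ".txt"
        }
    } else
        p = 1
    print > outFile
}' big_file

On Sample.txt this writes rows 1-5 to sample_00003.txt and rows 6-9 to sample_00112.txt, matching the expected outputs; the trailing SOUTHWEST row is dropped because of the exit.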
You can try with Perl:

perl -ne '/00003/ && print' big_file > small_file

and compare its timing with the other solutions...

EDIT

Limiting my answer to the tools you didn't try already... you can also use:

sed -n '/00003/p' big_file > small_file

But I tend to believe perl will be faster. Again... I'd suggest you measure the elapsed time of the different solutions on your own.
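To take that advice and actually measure, a quick harness (an awk string-match variant via index() is added for comparison, since it avoids regex cost; timings will of course vary by machine):
$ time perl -ne '/00003/ && print' big_file > /dev/null
$ time sed -n '/00003/p' big_file > /dev/null
$ time awk 'index($0, "00003")' big_file > /dev/null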
Find nth row using AWK and assign them to a variable
Okay, I have two files: one is a baseline and the other is a generated report. I have to validate that a specific string in both files matches. It is not just a single word; see the example below:

.
.
name os ksd
56633223223
some text..................
some text..................

My search criterion here is to find a unique number such as "56633223223" and retrieve 1 line above and 3 lines below it. I can do that on both the base file and the report, and then compare whether they match. On the whole I need a shell script for this.

Since the strings above and below are unique but the line count varies, I put them in a file called "actlist":

56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.

Now from "Rcount" below I get how many iterations to perform, and in each iteration I have to get the ith row and see if the word count is 3; if it is, then take those values into variables and use them. I'm stuck at the step below: which command should be used? I'm thinking of AWK, but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:

xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
    record=_________________    # (awk command to retrieve the ith (1st) record of $xxxxx)
    wcount=_________________    # (awk command to count the number of words in $record)
    (( i=i+1 ))
done

Note: the record and wcount values are later printed to a log file.
Sounds like you're looking for something like this:

#!/bin/bash
while read -r word1 word2 word3 junk; do
    if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
        echo "all good"
    else
        echo "error"
    fi
done < /root/shravan/actlist

This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests check that read hasn't assigned an empty value to each variable. The -z check ensures that there are only three columns, so $junk is empty.
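A quick run against a small test file (hypothetical contents; assume the script is saved as check.sh and its redirection path is pointed at this local actlist):
$ cat actlist
56633223223 1 5
56633223224 1
56633223225 1 3 9
$ ./check.sh
all good
error
error
The second line fails because word3 is empty; the third fails because a fourth column lands in $junk.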
I PROMISE you, you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:

awk '
NR==FNR {
    for (i=1;i<=NF;i++)
        words[$i]
    next
}
{
    for (word in words)
        if ($0 ~ word)
            print FILENAME, word
}
' file1 file2 file3

or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops calling awk to pull out strings to save in shell variables to pass to other shell commands, etc.

So far you're asking us to help you implement part of what you think is the solution to your problem, and we're struggling because, on its own, what you're asking for doesn't make sense as part of any reasonable solution, so it's hard to suggest anything sensible. If you tell us what you are trying to do AS A WHOLE, with sample input and expected output for your whole process, then we can help you.

Since we don't seem to be getting anywhere, let's take a stab at the kind of solution I think you might want and go from there. Look at these 2 files, "old" and "new", side by side (line numbers added by the cat -n):

$ paste old new | cat -n
     1  a            b
     2  b            56633223223
     3  56633223223  c
     4  c            d
     5  d            h
     6  e            56633223225
     7  f            i
     8  g            Z
     9  h            k
    10  56633223225  l
    11  i
    12  j
    13  k
    14  l

Now let's take this "actlist":

$ cat actlist
56633223223 1 2
56633223225 1 3

and run this awk command on all 3 of the above files (yes, I know it could be briefer and more efficient, but I'm favoring simplicity and clarity for now):

$ cat tst.awk
ARGIND==1 {
    numPre[$1] = $2
    numSuc[$1] = $3
}
ARGIND==2 {
    oldLine[FNR] = $0
    if ($0 in numPre) {
        oldHitFnr[$0] = FNR
    }
}
ARGIND==3 {
    newLine[FNR] = $0
    if ($0 in numPre) {
        newHitFnr[$0] = FNR
    }
}
END {
    for (str in numPre) {
        if ( str in oldHitFnr ) {
            if ( str in newHitFnr ) {
                for (i=-numPre[str]; i<=numSuc[str]; i++) {
                    oldFnr = oldHitFnr[str] + i
                    newFnr = newHitFnr[str] + i
                    if (oldLine[oldFnr] != newLine[newFnr]) {
                        print str, "mismatch at old line", oldFnr, "new line", newFnr
                        print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
                    }
                }
            }
            else {
                print str, "is present in old file but not new file"
            }
        }
        else if (str in newHitFnr) {
            print str, "is present in new file but not old file"
        }
    }
}

$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
        j vs Z

It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new", and the actlist file said the 2 files had to be common from one line before until 3 lines after that pattern. Is that what you're trying to do?

The above uses GNU awk for ARGIND, but the workaround for other awks is trivial.
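For completeness, here is the usual portable stand-in for gawk's ARGIND that the closing remark alludes to; a sketch, assuming no input file is empty (FNR==1 never fires for an empty file):

FNR == 1 { argind++ }                               # increments once per input file, emulating ARGIND
argind == 1 { numPre[$1] = $2; numSuc[$1] = $3 }    # then use argind wherever the script used ARGIND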
Use the below code:

awk '{
    if (NF == 3) {
        word1=$1; word2=$2; word3=$3
        print "Words are:" word1, word2, word3
    } else {
        print "Line", NR, "is having", NF, "Words"
    }
}' filename.txt
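On a hypothetical three-line input it behaves like this (note that "Words are:" concatenates directly with the first word, since the format string has no trailing space):
$ cat file.txt
alpha beta gamma
just two
one two three four
$ awk '{ if (NF == 3) print "Words are:" $1, $2, $3; else print "Line", NR, "is having", NF, "Words" }' file.txt
Words are:alpha beta gamma
Line 2 is having 2 Words
Line 3 is having 4 Words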
I have given the solution as per the requirement:

awk '{                          # awk reads the file line by line
    if (NF == 3) {              # check whether the current line has exactly 3 fields; NF is the number of fields in the current line
        word1=$1                # if so, assign the 1st field to the word1 variable
        word2=$2                # the 2nd field to word2
        word3=$3                # and the 3rd field to word3
        print word1, word2, word3   # print all 3 fields
    }
}' filename.txt >> output.txt   # the 3 fields are appended to a file that can be used for further processing

This is as per the requirement; there are many other ways of doing it, but it was asked using awk.