Awk printing out smallest and highest number, in a time format - bash

I'm fairly new to linux/bash shell and I'm really having trouble printing two values (the highest and lowest) from a particular column in a text file. The file is formatted like this:
Geoff Audi 2:22:35.227
Bob Mercedes 1:24:22.338
Derek Jaguar 1:19:77.693
Dave Ferrari 1:08:22.921
As you can see the final column is a timing, I'm trying to use awk to print out the highest and lowest timing in the column. I'm really stumped, I've tried:
awk '{print sort -n < $NF}' timings.txt
However that didn't even seem to sort anything, I just received an output of:
1
0
1
0
...
Repeating over and over, it went on for longer but I didn't want a massive line of it when you get the point after the first couple iterations.
My desired output would be:
Min: 1:08:22.921
Max: 2:22:35.227

After question clarifications: if the time field always has the same number of digits in the same place, e.g. h:mm:ss.sss, the solution can be drastically simplified. Namely, we no longer need to convert the time to seconds to compare it; we can do a simple string (lexicographical) comparison:
$ awk 'NR==1 {m=M=$3} {$3<m&&m=$3; $3>M&&M=$3} END {printf("min: %s\nmax: %s",m,M)}' file
min: 1:08:22.921
max: 2:22:35.227
The logic is the same as in the (previous) script below, just using a simpler string-only based comparison for ordering values (determining min/max). We can do that since we know all timings will conform to the same format, and if a < b (for example "1:22:33" < "1:23:00") we know a is "smaller" than b. (If values are not consistently formatted, then by using the lexicographical comparison alone, we can't order them, e.g. "12:00:00" < "3:00:00".)
So, on the first value read (first record, NR==1), we set the initial min/max value to the timing read (in the 3rd field). For each record we test if the current value is smaller than the current min, and if it is, we set the new min. Similarly for the max. We use short-circuiting instead of if to make the expressions shorter ($3<m && m=$3 is equivalent to if ($3<m) m=$3). In the END we simply print the result.
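To see why the string comparison is only safe for fixed-width timings, here is a small sketch. The helper name cmp_times is made up for illustration; it reports how awk orders two timings as strings and as total seconds.

```shell
# cmp_times A B: report how awk orders two h:mm:ss timings
# as strings versus as total seconds (hypothetical helper, for illustration).
cmp_times() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # Appending "" forces a string comparison.
        str = ((a "") < (b "")) ? "a<b" : "a>=b"
        split(a, x, ":"); split(b, y, ":")
        sa = x[3] + 60 * (x[2] + 60 * x[1])
        sb = y[3] + 60 * (y[2] + 60 * y[1])
        num = (sa < sb) ? "a<b" : "a>=b"
        print "string: " str ", seconds: " num
    }'
}

cmp_times 1:22:33 1:23:00    # same width: both orderings agree
cmp_times 12:00:00 3:00:00   # different width: the string ordering is wrong
```

The second call prints "string: a<b, seconds: a>=b", which is the "12:00:00" < "3:00:00" case mentioned above.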
Here's a general awk solution that accepts time strings with variable number of digits for hours/minutes/seconds per record:
$ awk '{split($3,t,":"); s=t[3]+60*(t[2]+60*t[1]); if (s<min||NR==1) {min=s;min_t=$3}; if (s>max||NR==1) {max=s;max_t=$3}} END{print "min:",min_t; print "max:",max_t}' file
min: 1:22:35.227
max: 10:22:35.228
Or, in a more readable form:
#!/usr/bin/awk -f
{
split($3, t, ":")
s = t[3] + 60 * (t[2] + 60 * t[1])
if (s < min || NR == 1) {
min = s
min_t = $3
}
if (s > max || NR == 1) {
max = s
max_t = $3
}
}
END {
print "min:", min_t
print "max:", max_t
}
For each line, we convert the time components (hours, minutes, seconds) from the third field to seconds which we can later simply compare as numbers. As we iterate, we track the current min val and max val, printing them in the END. Initial values for min and max are taken from the first line (NR==1).
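As a worked example of the conversion, 1:08:22.921 becomes 22.921 + 60 * (8 + 60 * 1) = 4102.921 seconds. A minimal sketch that prints this for each timing (the function name to_seconds is an assumption, not part of the answer):

```shell
# to_seconds: read h:mm:ss.sss timings (one per line) and print each with its
# value in seconds. The function name is illustrative, not from the answer.
to_seconds() {
    awk '{
        split($1, t, ":")
        s = t[3] + 60 * (t[2] + 60 * t[1])
        # printf with %.3f avoids the default 6-significant-digit output format.
        printf "%s = %.3f s\n", $1, s
    }'
}

printf '%s\n' 1:08:22.921 2:22:35.227 | to_seconds
```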

Given your statements that the time field is actually a duration and the hours component is always a single digit, this is all you need:
$ awk 'NR==1{min=max=$3} {min=(min<$3?min:$3); max=(max>$3?max:$3)} END{print "Min:", min ORS "Max:", max}' file
Min: 1:08:22.921
Max: 2:22:35.227

You don't want to run sort inside of awk (even with the proper syntax).
Try this:
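sort -k3,3 timings.txt | sed -n '1p; $p'
where
sort orders the lines on the 3rd column as strings, which works because every timing has the same h:mm:ss.sss layout (a numeric sort, -k3,3n, would stop reading at the first colon and compare only the hour digit)
the sed will print the first and last line
If the file starts with a header line, remove it first with sed 1d timings.txt | sort ...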

Related

Awk substring doesnt yield expected result

I've a file whose content is below:
C2:0301,353458082243570,353458082243580,0;
C2:0301,353458082462440,353458082462450,0;
C2:0301,353458082069130,353458082069140,0;
C2:0301,353458082246230,353458082246240,0;
C2:0301,353458082559320,353458082559330,0;
C2:0301,353458080153530,353458080153540,0;
C2:0301,353458082462670,353458082462680,0;
C2:0301,353458081943950,353458081943960,0;
C2:0301,353458081719070,353458081719080,0;
C2:0301,353458081392470,353458081392490,0;
Field 2 and Field 3 (considering , as the separator) contain 15-digit IMEI number ranges, not individual IMEI numbers. The usual format of an IMEI is 8-digits(TAC)+6-digits(Serial number)+0(padded). The 6 digits(Serial number) part of the IMEI defines the start and end of the range, everything else remaining the same. So in order to find the individual IMEIs in the ranges (which is exactly what I want), I need a unary increment loop from the 6 digits(Serial number) of the starting IMEI number in Field-2 to the 6 digits(Serial number) of the ending IMEI number in Field-3. I am using the below AWK script:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
It gives me the below result:
353458082243570,0
353458082243580,0
353458082462440,0
353458082462450,0
353458082069130,0
353458082069140,0
353458082246230,0
353458082246240,0
353458082559320,0
353458082559330,0
353458080153530,0
353458082462670,0
353458082462680,0
353458081943950,0
353458081943960,0
353458081719070,0
353458081719080,0
353458081392470,0
353458081392480,0
353458081392490,0
The above is as expected except for the below line in the result:
353458080153530,0
The result is actually from the below line in the input file:
C2:0301,353458080153530,353458080153540,0;
But the expected output for the above line in input file is:
353458080153530,0
353458080153540,0
I need to know what's going wrong in my script.
The problem with your script is that you start with two string variables, v and t (typed as strings because they are the result of a string operation, substr()), and then convert one of them to a number with v++, which strips the leading zeros. The comparison v <= t is then a string comparison, because a string (t) compared to a number, string or numeric string is always compared as a string. Yes, you can add zero to each of the variables to force a numeric comparison, but IMHO this is more like what you're really trying to do:
$ cat tst.awk
BEGIN { FS=","; re="(.{8})(.{6})(.*)" }
{
match($2,re,beg)
match($3,re,end)
for (i=beg[2]; i<=end[2]; i++) {
printf "%s%06d%s\n", end[1], i, end[3]
}
}
$ gawk -f tst.awk file
353458082243570
353458082243580
353458082462440
353458082462450
353458082069130
353458082069140
353458082246230
353458082246240
353458082559320
353458082559330
353458080153530
353458080153540
353458082462670
353458082462680
353458081943950
353458081943960
353458081719070
353458081719080
353458081392470
353458081392480
353458081392490
and when done with appropriate variables like that no conversion is necessary. Note also that with the above you don't need to repeatedly state the same or relative numbers to extract the part of the strings you care about, you just state the number of characters to skip (8) and the number to select (6) once. The above uses GNU awk for the 3rd arg to match().
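If GNU awk is not available, roughly the same idea can be sketched with POSIX awk only: substr() extracts the pieces and +0 makes the loop bounds numeric, so the leading zero is only put back by %06d at print time. The function name expand_imei and the variable names are assumptions for this example.

```shell
# expand_imei: expand each "prefix,start,end,flag;" record into individual
# 15-digit IMEIs. Uses only POSIX awk features (substr, +0, %06d).
# The function name is hypothetical.
expand_imei() {
    awk -F, '{
        tac  = substr($2, 1, 8)        # first 8 digits, shared by the range
        pad  = substr($2, 15)          # trailing padding digit
        from = substr($2, 9, 6) + 0    # +0 forces a number, so the
        to   = substr($3, 9, 6) + 0    # comparison below is numeric
        for (i = from; i <= to; i++)
            printf "%s%06d%s\n", tac, i, pad
    }'
}

echo 'C2:0301,353458080153530,353458080153540,0;' | expand_imei
```

For the problem line this prints 353458080153530 and 353458080153540.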
The problem was in the while(v <= t) part of the script. I believe that with leading 0s the comparison was not happening properly, so I ensured that both values are cast to numbers for the comparison in the while loop. The AWK documentation says you can convert a value to a number by using value+0. So my while(v <= t) in the awk script needed to change to while(v+0 <= t+0). So the below AWK script:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
was changed to :
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v+0 <= t+0) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
That single change got me the expected value for the failure case. For example, this line in my input file:
C2:0301,353458080153530,353458080153540,0;
Now gives me individual IMEIs as :
353458080153530,0
353458080153540,0
Use an if statement that checks for leading zeros in variable v setting y accordingly:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) { if (substr(v,1,1)=="0") { v++;y="0"v } else { v++;y=v } ;printf %s%0"6"s%s,%s\n", substr($3,1,8),y,substr($3,15,2),$4;v=y } }' TEMP.OUT.merge_range_part1_21
Make sure that the while condition is contained in braces and also that v is incremented WITHIN the if conditions.
Set v=y at the end of the statement to allow this to work on additional increments.

finding maximum from partial string

I have a list where the first 8 digits are the date in format yyyymmdd. The next 4 digits are part of the timestamp. I want to select only those numbers which have the maximum timestamp for each day.
20160905092900
20160905212900
20160906092900
20160906213000
20160907093000
20160907213000
20160908093000
20160908213000
20160910093000
20160910213100
20160911093100
20160911213100
20160912093100
That means that from the above list the output should be the list below.
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
$ sort -r file | awk '!seen[substr($0,1,8)]++' | sort
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
If the file's already sorted you can use tac instead of sort -r (and a second tac instead of the final sort).
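For input that is already sorted, a single pass is also possible without reversing the file. This is only a sketch under that assumption; the function name last_per_day is invented here. It holds each line back and prints it once the date prefix changes.

```shell
# last_per_day: for input already sorted ascending, print the last line seen
# for each yyyymmdd prefix. The function name is made up for this example.
last_per_day() {
    awk '{
        day = substr($0, 1, 8)
        # A new day starts: the previously held line was the maximum of its day.
        if (NR > 1 && day != prev_day) print held
        prev_day = day
        held = $0
    }
    END { if (NR > 0) print held }'
}

printf '%s\n' 20160905092900 20160905212900 20160906092900 | last_per_day
```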
You can use awk:
awk '{
dt = substr($0, 1, 8)
ts = substr($0, 9, 12)
}
ts > max[dt] {
max[dt] = ts
rec[dt] = $0
}
END {
for (i in rec)
print rec[i]
}' file
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
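We are using an associative array max that uses the first 8 characters as the key and the remaining characters (the time of day) as the value; substr($0, 9, 12) asks for up to 12 characters from position 9, and only 6 are left on these lines. This array stores the max timestamp value for a given date. Another array rec stores the full line for a date whenever we encounter a timestamp greater than the value stored in max. Note that for (i in rec) visits keys in an unspecified order, so pipe the output through sort if the order matters.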

finding and sorting datapoints in bash and awk

First of all, let me clarify that I am unfortunately still quite inexperienced in programming, so I really need some help.
What I have:
I have a data file containing 3 columns: $1=(Energy1), $2=(Energy2), $3=(intensity of their frequency in combination).
If I plot these data e.g. in gnuplot by doing spl "datafile.dat" u 1:2:3 I obtain a surface plot with my 2D-spectrum.
What I want:
Now, I would like to select only certain data points, for which my ($1-$2)=5.7 give this specific value, thus obtaining a line spectrum along a diagonal, with all possible combinations of $1 and $2 yielding this value.
The new data-file should then contain the $1-value and the intensity (stored in $3) corresponding to the selected line, which contained the correct values of $1 and $2 yielding 5.7.
I have tried to do this in bash using awk, but unfortunately until now I failed. PLEASE help me!!! Thank you very much in advance.
Maybe I don't understand all the issues, or maybe you are having a floating-equal problem as others have noted, but why doesn't a simple filter through the data work?:
awk -v s=5.7 -v e=.01 '{d=$1-$2-s}d<e&&d>-e{print $1,$3}'
Tack on a sort if you want/need:
| sort -n
Or, is it possible that your data is too sparse, and you're looking for some value interpolation solution?
You do not need awk for this, gnuplot can do it.
admissible(x,y,value,epsilon)=(abs(x-y-value)<epsilon)
plot 'datafile.dat' using (admissible($1,$2,5.7,1e-5)?$1:1/0):3 with points
Function admissible is tested for each line of data file, if it returns true then the point ($1,$3) is plotted, else the x-coordinate is set to undefined (1/0) and thus the point is not plotted. The only shortcoming is that you cannot use the lines style with this, since lines will be interrupted by non-admissible datapoints.
If you want to compare every $1 against every $2, you need to take 2 passes through the file, once to collect all the $1,$3 pairs, the next to do all the comparisons:
awk -v diff=5.7 '
NR == FNR {
# this is the first trip through
val[$1] = $3
next
}
{
for (v1 in val) {
if ( (v1 - $2) == diff ) {
print v1, val[v1]
}
}
}
' file file # yes, give the same filename twice.
To address @Baruchel's comment about floating point precision, try this:
awk -v diff=5.7 -v epsilon=0.0001 '
NR == FNR {val[$1] = $3; next}
{
for (v1 in val) {
delta = v1 - $2 - diff
if (-epsilon <= delta && delta <= epsilon)
print v1, val[v1]
}
}
' file file
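To illustrate why the exact == test can miss rows, here is a small sketch with made-up sample values (10.3 and 4.6 are not from the question's data). The helper name match_diff is hypothetical; it reports the result of both tests for each input row.

```shell
# match_diff: for each row, report whether $1 - $2 equals diff exactly
# and whether it equals diff to within epsilon. Illustrative helper.
match_diff() {
    awk -v diff=5.7 -v epsilon=0.0001 '{
        delta = $1 - $2 - diff
        exact    = ($1 - $2 == diff) ? "yes" : "no"
        tolerant = (delta >= -epsilon && delta <= epsilon) ? "yes" : "no"
        print $1, $2, "exact=" exact, "tolerant=" tolerant
    }'
}

printf '%s\n' '10.3 4.6' '8.0 2.0' | match_diff
```

With IEEE double arithmetic the first row gives exact=no but tolerant=yes, because 10.3 - 4.6 lands one rounding step away from 5.7.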

How to Write a Unix Shell to Sum the Values in a Row Against Each Unique Column (e.g., how to calculate total votes for each distinct candidate)

In its basic form, I am given a text file with state vote results from the 2012 Presidential Election and I need to write a one line shell script in Unix to determine which candidate won. The file has various fields, one of which is CandidateName and the other is TotalVotes. Each record in the file is the results from one precinct within the state, thus there are many records for any given CandidateName, so what I'd like to be able to do is sort the data according to CandidateName and then ultimately sum the TotalVotes for each unique CandidateName (so the sum starts at a unique CandidateName and ends before the next unique CandidateName).
No need for sorting with awk and its associative arrays. For convenience, the data file format can be:
precinct1:candidate name1:732
precinct1:candidate2 name:1435
precinct2:candidate name1:9920
precinct2:candidate2 name:1238
Thus you need to create totals of field 3 based on field 2 with : as the delimiter.
awk -F: '{sum[$2] += $3} END { for (name in sum) { print name " = " sum[name] } }' data.file
Some versions of awk can sort internally; others can't. I'd use the sort program to process the results:
sort -t= -k2nb
(field separator is the = sign; the sort is on field 2, which is a numeric field, possibly with leading blanks).
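Put together, the two steps could look like the sketch below. It assumes the colon-separated format shown above; the function name winner is made up, and the sort is reversed (r) with head keeping only the top line so that the candidate with the most votes is printed.

```shell
# winner: sum field 3 per candidate (field 2, ":"-separated), sort the totals
# in descending numeric order and keep the top line. Hypothetical helper name.
winner() {
    awk -F: '{ sum[$2] += $3 }
             END { for (name in sum) print name " = " sum[name] }' |
        sort -t= -k2,2nr |
        head -n 1
}

printf '%s\n' \
    'precinct1:candidate name1:732' \
    'precinct1:candidate2 name:1435' \
    'precinct2:candidate name1:9920' \
    'precinct2:candidate2 name:1238' | winner
```

For the sample data this prints candidate name1 = 10652.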
Not quite one line, but will work
$ cat votes.txt
Colorado Obama 50
Colorado Romney 20
Colorado Gingrich 30
Florida Obama 60
Florida Romney 20
Florida Gingrich 30
script
while read loc can num
do
if ! [ ${!can} ]
then
cans+=($can)
fi
(( $can += num ))
done < votes.txt
for can in ${cans[*]}
do
echo $can ${!can}
done
output
Obama 110
Romney 40
Gingrich 60

Compare fields in two files taking closest match

I am trying to resolve locations in lat and long in one file to a couple of named fields in another file.
I have one file that is like this..
f1--f2--f3--------f4-------- f5---
R 20175155 41273951N078593973W 18012
R 20175156 41274168N078593975W 18000
R 20175157 41274387N078593976W 17999
R 20175158 41274603N078593977W 18024
R 20175159 41274823N078593978W 18087
Each character is in a specific place so I need to define fields based on characters.
f1 char 18-21; f2 char 22 - 25; f3 char 26-35; f4 char 36-45; f5 char 62-66.
I have another much larger csv file that has fields 11, 12, and 13 to correspond to f3, f4, f5.
awk -F',' '{print $11, $12, $13}'
41.46703821 -078.98476926 519.21
41.46763555 -078.98477791 524.13
41.46824123 -078.98479015 526.67
41.46884129 -078.98480615 528.66
41.46943371 -078.98478482 530.50
I need to find the closest match to file 1 field 3 && 4 in file 2 field 11 && 12;
When the closest match is found I need to insert field 1, 2, 3, 4, 5 from file 1 into file 2 field 16, 17, 18, 19, 20.
As you can see the format is slightly different. File 1 breaks down like this..
File 1
f3-------f4--------
DDMMSSdd DDDMMSSdd
41273951N078593973W
File 2
f11-------- f12---------
DD dddddddd DDD dddddddd
41.46703821 -078.98476926
N means f3 is a positive number, W means f4 is a negative number.
I changed file 1 with sed, ridiculous one liner that works great.. (better way???)
cat $file1 |sed 's/.\{17\}//' |sed 's/\(.\{4\}\)\(.\{4\}\)\(.\{9\}\)\(.\)\(.\{9\}\)\(.\)\(.\{16\}\)\(.\{5\}\)/\1,\2,\3,\4,\5,\6,\8/'|sed 's/\(.\{10\}\)\(.\{3\}\)\(.\{2\}\)\(.\{2\}\)\(.\{2\}\)\(.\{3\}\)\(.\{3\}\)\(.\{2\}\)\(.*\)/\1\2,\3,\4.\5\6\7,\8\9/'|sed 's/\(.\{31\}\)\(.\{2\}\)\(.*\)/\1,\2.\3/'
2017,5155, 41,27,39.51,N,078,59,39.73,W,18012
2017,5156, 41,27,41.68,N,078,59,39.75,W,18000
2017,5157, 41,27,43.87,N,078,59,39.76,W,17999
2017,5158, 41,27,46.03,N,078,59,39.77,W,18024
2017,5159, 41,27,48.23,N,078,59,39.78,W,18087
Now I have to convert the formats.. (RESOLVED this (see below)--problem -- The numbers are rounded off too far. I need to have at least six decimal places.)
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 2) printf ($i","); else if (i == 3&&$6 == "S") printf("-"$3+($4/60)+($5/3600)","); else if (i == 3&&$6 == "N") printf($3+($4/60)+($5/3600)","); else if (i == 7&&$10 == "W") printf("-"$7+($8/60)+($9/3600)","); else if (i == 7&&$10 == "E") printf($7+($8/60)+($9/3600)","); if (i == 11) printf ($i"\n")}}'
2017,5155,41.461,-78.9944,18012
2017,5156,41.4616,-78.9944,18000
2017,5157,41.4622,-78.9944,17999
2017,5158,41.4628,-78.9944,18024
2017,5159,41.4634,-78.9944,18087
That's where I'm at.
RESOLVED THIS
*I need to get the number format to have at least 6 decimal places from this formula.*
printf($3+($4/60)+($5/3600))
Added "%.8f"
printf("%.8f", $3+($4/60)+($5/3600))
Next issue will be to match the fields file 1 f3 and f4 to the closest match in file 2 f11 and f12.
Any ideas?
Then I will need to calculate the distance between the fields.
In Excel the formula would be like this..
=ATAN2(COS(lat1)*SIN(lat2)-SIN(lat1)*COS(lat2)*COS(lon2-lon1), SIN(lon2-lon1)*COS(lat2))
What could I use for that calculation?
*UPDATE---
I am looking at a short distance for the matching locations. I was thinking about applying something simple like Pythagoras’ theorem for the nearest match. Maybe even use less decimal places. It's got to be many times faster.
maybe something like this..*
x = (lon2-lon1) * Math.cos((lat1+lat2)/2);
y = (lat2-lat1);
d = Math.sqrt(x*x + y*y) * R;
Then I could do the heavy calculations required for greater accuracy after the final file is updated.
Thanks
You can't do the distance calculation after you perform the closest match: closest is defined by comparison of the distance values. Awk can evaluate the formula that you want (looks like great-circle distance?). Take a look at this chapter to see what you need.
The big problem is finding the nearest match. Write an awk script that takes a single line of file 1 and outputs the lines in file 2 with an extra column. That column is the calculation of the distance between the pair of points according to your distance formula. If you sort that file numerically (sort -n) then your closest match is at the top. Then you need a script that loops over each line in file 1, calls your awk script, uses head -n1 to pull out the closest match and then output it in the format that you want.
This is all possible in bash and awk, but it would be a much simpler script in Python. Depends on which you prefer.
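As a rough sketch of the inner step in awk: the helper below takes one point and scans candidate lines for the nearest one. Several things here are assumptions: the function name nearest, the input layout (latitude and longitude in decimal degrees as the first two whitespace-separated fields), and R = 6371 km as the Earth radius. It uses the short-distance approximation from the question rather than a great-circle formula, and it keeps a running minimum instead of sorting and taking head -n1.

```shell
# nearest LAT LON: read candidate lines "lat lon ..." (decimal degrees) on
# stdin and print the one closest to LAT LON, with the approximate distance
# in km appended. Uses the equirectangular approximation from the question;
# R = 6371 km is an assumed mean Earth radius. Hypothetical helper.
nearest() {
    awk -v lat1="$1" -v lon1="$2" 'BEGIN { rad = atan2(0, -1) / 180; R = 6371 }
    {
        # Convert degree differences to radians before applying the formula.
        x = ($2 - lon1) * rad * cos(($1 + lat1) / 2 * rad)
        y = ($1 - lat1) * rad
        d = sqrt(x * x + y * y) * R
        if (NR == 1 || d < best) { best = d; line = $0 }
    }
    END { if (NR > 0) printf "%s %.3f\n", line, best }'
}

printf '%s\n' \
    '41.46703821 -078.98476926 519.21' \
    '41.46763555 -078.98477791 524.13' \
    '41.46824123 -078.98479015 526.67' | nearest 41.4616 -78.9944
```

Calling this once per line of file 1 rescans file 2 every time, which is simple but slow for a large file 2.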

Resources