Awk script else statement not working - shell

I am writing an awk script inside a shell script to display all logins of the current month and the number of logins on each day of the month so far. Eventually I will have to determine the day on which the number of logins has been greatest, but first I would like to figure out how to write the else branch of the if statement in the code below:
#!/bin/bash
month=$(date +"%b")
day=$(date +"%e")
last | awk -v m=$month -v d=$day 'BEGIN{
max=0; c=0; maxd=day}
($5 ~ m) {
if ($6 =d) {print; c++; print c}
else {printf("Else statement test")}
}
'
So far it works fine without the line containing the else statement, but it seems like it won't recognize the else no matter what I add to it.
With
$5 ~ m
I check whether the current line is from the current month, and then
$6 =d
checks if it's still the same day. If so, the counter increases and I print the current number of daily logins. In the else branch I would like to save the value of the counter to an associative array and set the counter back to zero when I encounter a new day (when $6 no longer equals d).
I've tried to add these operations to the else statement, but when I run the script it won't recognize the "else". It will only print logins of today (Apr. 23) and count the lines, but won't execute the else part. Why isn't it working?
Edit:
I figured out that awk's comparison operators are the same as in most other languages and corrected the condition, but it still won't print all lines from the current day.

Your if condition is not correct:
if ($6 =d)
That is an assignment, not a comparison. Since d is already an awk variable (passed in with -v), it should be:
if ($6 == d)
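For context, here is a rough sketch of how the corrected comparison could fit the larger goal described in the question (counting logins per day in an associative array). The array name counts and the reset logic are illustrative, not from the original post, and the sketch is untested:
#!/bin/bash
month=$(date +"%b")
day=$(date +"%e")
last | awk -v m=$month -v d=$day '
$5 ~ m {
    if ($6 == d) {
        c++                  # same day: keep counting
    } else {
        counts[d] = c        # new day encountered: save the finished count (illustrative array)
        c = 1                # the current line already belongs to the new day
        d = $6
    }
}
END {
    counts[d] = c            # flush the day that was still being counted
    for (k in counts) print k, counts[k]
}'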

This should work better:
$ last |
awk -v d="$(date +'%b')" '$5==d{day[$6]++}
END {for(k in day) print d, k, day[k]}' |
sort -k2n
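The question also asks for the day with the greatest number of logins; a hedged extension of the END block above could track the maximum while printing (untested sketch, shown without the final sort):
last | awk -v d="$(date +'%b')" '$5==d{day[$6]++}
END {
    for (k in day) {
        print d, k, day[k]
        if (day[k] > max) { max = day[k]; maxday = k }   # remember the busiest day
    }
    print "most logins:", d, maxday, "(" max ")"
}'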

Related

How is $0 used in awk, and how does it work?

read n
awk '
BEGIN { sum = 0 }
{ if ($0 % 2 == 0) sum += $0 }
END { print sum }'
Here I add up even numbers. The first input value is a count of how many numbers follow, and then come the numbers I want to check and, if even, add up. For example:
3
6
7
8
output is: 14
Here 3 is the count, followed by the numbers I want to check. The code executes and the output is correct, but I wanted to know how $0 skipped the count value (the 3) and summed only the remaining numbers.
Please update your question to be meaningful: There is no relationship between $0 and the Unix operating system, as choroba already pointed out in his comment. You obviously want to know the meaning of $0 in the awk programming language. From the awk man-page in the section about Fields:
$0 is the whole record, including leading and trailing whitespace.
You're reading the count with the shell's read, which consumes that first line before awk ever sees it, so the script never actually uses it. A rewrite that handles the count inside awk can be:
$ awk 'NR==1 {n=$1; next}        # read the first value (the count) and skip to the next line
!($1%2)     {sum+=$1}            # add up even numbers
NR>n        {print sum; exit}' file   # done once the number of lines passes the counter
In awk, $0 corresponds to the whole record (here, the line), and $i to field i for i=1,2,3,...
An even number is one whose remainder is 0 when divided by 2, which is what !($1%2) tests. NR is the current record (line) number.
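A quick throwaway illustration of $0, the numbered fields and NR (not from the original post):
$ printf 'a b c\nd e f\n' | awk '{print "NR=" NR, "$0=[" $0 "]", "$1=" $1, "$2=" $2}'
NR=1 $0=[a b c] $1=a $2=b
NR=2 $0=[d e f] $1=d $2=e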

Detecting semi-duplicate records in Bash/AWK

Right now I have a script that rifles through tabulated data for cross-referencing record by record (using AWK). But I've run into a problem. AWK is great for line-by-line comparisons to run through formatted data, but I also want to detect semi-duplicate records. Unfortunately, uniq will not work by itself as the record is not 100% carbon-copy.
This is an ordered list, sorted by the second and third columns. What I want to detect is identical values in columns 3, 6 and 7.
Here's an example:
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth
JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth
The second number is different while the other information is exactly the same, so uniq will not find it solo.
Is there something in AWK that lets me reference the previous line? I already have this code block from AWK going line by line. (EDIT: the awk statement was an older version that was terrible.)
awk '{printf "%s", $0; if($6 != $7 && $9 != "Void" && $5 == "N") {printf "****\n"} else {printf "\n"}}' /tmp/verbout.txt
Is there something in AWK that lets me reference the previous line?
No, but there's nothing stopping you from explicitly saving certain info from the last line and using that later:
{
    if (last3 != $3 || last6 != $6 || last7 != $7) {
        print
    } else {
        # handle duplicate here
    }
    last3 = $3
    last6 = $6
    last7 = $7
}
The lastN variables all (effectively) default to an empty string at the start; then we just compare each line against them and print the line if any of those fields differ.
Then we store the fields from the current line for comparison with the next.
That is, of course, assuming duplicates should only be detected if they're consecutive. If you want to remove duplicates when order doesn't matter, you can sort on those fields first.
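For example, a hedged sketch for the order-doesn't-matter case (column numbers and the /tmp/verbout.txt name taken from the question; this keeps one representative line per key):
sort -k3,3 -k6,6 -k7,7 /tmp/verbout.txt |
awk '{ key = $3 FS $6 FS $7 }
     key != prev { print }     # first line of each group of identical keys
     { prev = key }'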
If order needs to be maintained, you can use an associative array to store the fact that the key has been seen before, something like:
{
    seenkey = $3 " " $6 " " $7
    if (seen[seenkey] == 0) {
        print
        seen[seenkey] = 1
    } else {
        # handle duplicate here
    }
}
One way of doing this with awk is:
$ awk '{print $0, (a[$3,$6,$7]++ ? "duplicate" : "")}' file
This will mark the duplicate records; note that you don't need to sort the file.
If you want to print just the unique records, the idiomatic way is:
$ awk '!a[$3,$6,$7]++' file
Again, sorting is not required.
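If /tmp/verbout.txt contained just the two sample lines from the question, only the first would survive, since both share the same 3rd, 6th and 7th fields:
$ awk '!a[$3,$6,$7]++' /tmp/verbout.txt
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth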

Eliminate useless repeats of values from CSV for line charting

Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way, using bash or awk, to tidy it up and remove all the useless zeros? By useless I mean: this data will be used for line charts in web pages, and reading the entire CSV file in the browser via JavaScript/jQuery etc. is very slow, so it would be more efficient to eliminate the redundant values before uploading the file. However, if I simply remove all the zero rows, the lines more or less jump from peak to peak instead of showing a real line from zero up to some larger value and back to zero, followed by a gap until the next value greater than zero.
As you can see, there are 3 groups in the data. Any time the same value appears 3 times in a row, for example for GRP1, I'd like to remove the middle (2nd) one. In reality this should also work for values greater than zero: if the same value were found every 10 seconds, say 10 times in a row, it would be good to leave both ends in place and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk prior to reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or, another expected result, this time using 0 instead of 3 as the repeated consecutive value, for GRP2:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
@karakfa's answer gets me close, but after applying the awk to a single group and then eliminating some duplicates that also showed up for some reason, I still end up with portions similar to this. I like it, but it still ends up with:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
Would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
That's one ill-placed question but I'll take a crack at the title, if you don't mind:
$ awk -F, '{
    if ($3 OFS $4 OFS $6 in first)
        last[$3 OFS $4 OFS $6] = $0
    else
        first[$3 OFS $4 OFS $6] = $0
}
END {
    for (i in first) {
        print first[i]
        if (i in last)
            print last[i]
    }
}' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically it keeps the first and last (if one exists) occurrence of each unique combination of the 3rd, 4th and 6th fields.
Edit: In the new light of the word consecutive, how about this awful hack:
$ awk -F, '
(p != $3 OFS $4 OFS $6) {
    if (NR > 1 && lp < (NR-1))
        print q
    print $0
    lp = NR
}
{
    p = $3 OFS $4 OFS $6
    q = $0
}' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and output for the second data:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 keeps only those lines which don't have 0 as their last field value.
awk to the rescue!
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}
pl {print pl}
{pt=$4;p=$NF}1' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
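If the goal is specifically to keep only the first and last line of every run of consecutive identical values, here is a hedged, untested sketch; it assumes the file has already been reduced to a single group (e.g. with grep GRP1), since the raw file interleaves the groups:
awk -F, '
{ key = $3 FS $4 FS $6 }              # group name + value
key != prev {                         # a new run starts here
    if (held != "") print held        # emit the last line of the previous run
    print                             # emit the first line of the new run
    held = ""
    prev = key
    next
}
{ held = $0 }                         # inside a run: remember the latest line
END { if (held != "") print held }    # keep the closing line of the final run
' file
Run against the GRP1-only sample above, this should yield the three expected lines (00:00:01, 00:00:51 and 00:01:01), and against the DTE excerpt the four lines listed as the desired result.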

How to pass a bash variable to one of the two regular expressions in awk

I have a file where the first two fields are 'Month' and 'Year', like the following:
April 2016 100 200 300
May 2016 150 250 300
June 2016 200 250 400
Such data is stored for about 30 months. I need to get output starting from April of a given year through March of the next year (12 months). When I use the following awk code in the terminal I get the correct answer:
awk '/March/ && /2016/ {for(i=1; i<=12; i++){getline;print}}' file
The first pattern will always be the same, 'March'; however, the second pattern will depend on user input. The user may ask for 2015, 2017 or any other year.
I do not understand exactly how the above code works, but more importantly I am unable to pass the user input for the year to awk and get the correct result.
I have tried the following:
F_year=2016
awk -v f_year="$F_year" '/March/ && /$1 ~f_year/ {
for (i=1; i<=12; i++) {
getline;
print
}
}' file
I would appreciate it if someone could give me the solution with some explanation.
OP code:
$ awk -v f_year="$F_year" '
/March/ && /$1 ~f_year/ { # removing the latter /.../ would work, but... (1)
for(i=1; i<=12; i++) { # (2)
getline # getline is a bit controversial... (3)
print
}
}' file
Modified:
$ awk -v f_year="$F_year" '
(/March/ && $2==f_year && c=12) || --c > 0 { # (1) == is better
# (2) awk is a big record loop, use it
print # (3) with decreasing counter var c
}' file
Above is somewhat untested, as your data samples did not fully allow it, but 2 months including April seemed to work (/April/ ... && c=2). Also, you could remove the whole {print} block, since printing the record is awk's default action.
You can use sed:
sed -n '/April 2016/,+11 p' file
Or
month="April"
year="2016"
sed -n "/${month} ${year}/,+11 p" file
awk -v year="$F_year" '$1=="April" && $2==year{f=1} f{if (++c==13) exit; print}' file
Untested, of course, since you didn't provide sample input/output we could test against. Don't use getline until you've read and fully understood everything discussed in http://awk.freeshell.org/AllAboutGetline.
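A hedged usage sketch, wiring in the user-supplied year (F_year is the shell variable from the question; the read prompt is illustrative):
read -p "Enter the starting year: " F_year
awk -v year="$F_year" '$1=="April" && $2==year{f=1} f{if (++c==13) exit; print}' file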

Bash find last entry before timestamp

I have a .csv file that is formatted thus;
myfile.csv
Date,Timestamp,Data1,Data2,Data3,Data4,Data5,Data6
20130730,22:08:51.244,APPLES,Spain,67p,blah,blah
20130730,22:08:51.244,PEARS,Spain,32p,blah,blah
20130730,22:08:51.708,APPLES,France,102p,blah,blah
20130730,22:10:62.108,APPLES,Spain,67p,blah,blah
20130730,22:10:68.244,APPLES,Spain,67p,blah,blah
I wish to feed in a timestamp which most likely will NOT match up perfectly to the millisecond with those in the file, and find the preceding line that matches a particular grep search.
so e.g. something like;
cat myfile.csv | grep 'Spain' | grep 'APPLES' | grep -B1 "22:09"
should return
20130730,22:08:51.244,APPLES,Spain,67p,blah,blah
But thus far I can only get it to work with exact timestamps in the grep. Is there a way to get it to treat these as a time series? (I am guessing that's what the issue is here - it's trying pure pattern matching and not unreasonably failing to find one)
I also have a fancy solution using awk:
awk -F ',' -v mytime="2013 07 30 22 09 00" '
BEGIN {tlimit=mktime(mytime); lastline=""}
{
    l_y=substr($1,1,4); l_m=substr($1,5,2); l_d=substr($1,7,2);
    split($2,l_hms,":"); l_hms[3]=int(l_hms[3]);
    line_time=mktime(sprintf("%d %d %d %d %d %d", l_y, l_m, l_d, l_hms[1], l_hms[2], l_hms[3]));
    if (line_time>tlimit) exit; lastline=$0;
}
END {if (lastline=="") print $0; else print lastline}' myfile.csv
It works by building a timestamp for each line with awk's time function mktime (a gawk extension). I also make the assumption that $1 is the date.
On the first line, you have to provide the timestamp of the time limit you want (here I chose 2013 07 30 22 09 00). You have to write it according to the format used by mktime: YYYY MM DD hh mm ss. The awk program begins by building the timestamp of your time limit. Then, for each line, it extracts year, month and day from $1 (line 4), then the exact time from $2 (line 5). As mktime takes only whole seconds, I truncate the seconds (you can round instead with int(l_hms[3]+0.5)). Here you can do whatever you want to approximate the timestamp, such as discarding the seconds. On line 6, I build the timestamp from the six date fields I have extracted. Finally, on line 7, I compare timestamps and jump to the END block once your time limit is reached. Since you want the preceding line, I keep each line in the variable lastline. In END, I print lastline; if the time limit was already reached on the very first line, I print that first line instead.
This solution works well on your sample file, and works for any date you supply. You only have to supply the date limit in the correct format!
EDIT
I realize that mktime is not necessary. Under the assumption that $1 is the date written as YYYYMMDD, you can compare the date as a number and then the time (extracted with split and rebuilt as a number, as in the other answers). In that case, you can supply the time limit in whatever format you want and recover the proper date and time limits in the BEGIN block.
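A hedged sketch of that idea (the dlimit/tlimit names are mine; it assumes $1 is YYYYMMDD and the file is in ascending order, as in the sample):
awk -F, -v dlimit="20130730" -v tlimit="22:09" '
$1 > dlimit || ($1 == dlimit && $2 >= tlimit) { exit }   # past the limit: stop reading
{ lastline = $0 }                                        # otherwise remember the line
END { if (lastline != "") print lastline }
' myfile.csv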
You could have an awk that keeps in memory the last line it saw whose timestamp is lower than the one you feed it, and prints that last match at the end (assuming the lines are in ascending order).
For example:
awk -v FS=',' -v thetime="22:09" '($2 < thetime) { before=$0 ; } END { print before ; }' myfile.csv
This happens to work because you feed it a string that, lexicographically, doesn't need to be the complete length (i.e. 22:09:00.000) to be compared.
The same, but on several lines for readability:
awk -v FS=',' -v thetime="22:09" '
($2 < thetime) { before=$0 ; }
END { print before ; }' myfile.csv
Now, if I understand your complete requirements, you need to find, among lines matching a country and a type of product, the last line before a timestamp? Then:
awk -v FS=',' -v thetime="${timestamp}" -v country="${thecountry}" -v product="${theproduct}" '
( $4 == country ) && ( $3 == product ) && ( $2 < thetime ) { before=$0 ; }
END { print before ; }' myfile.csv
should work for you... (feed it 22:09, Spain and APPLES, and it returns the expected "20130730,22:08:51.244,APPLES,Spain,67p,blah,blah" line)
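For instance, plugging in the values from the question (a usage sketch of the command above):
thecountry="Spain"; theproduct="APPLES"; timestamp="22:09"
awk -v FS=',' -v thetime="${timestamp}" -v country="${thecountry}" -v product="${theproduct}" '
    ( $4 == country ) && ( $3 == product ) && ( $2 < thetime ) { before=$0 ; }
    END { print before ; }' myfile.csv
which should print: 20130730,22:08:51.244,APPLES,Spain,67p,blah,blah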
And if your file spans several days (to address Bentoy13's concern):
awk -v FS=',' -v theday="${theday}" -v thetime="${timestamp}" -v thecountry="${thecountry}" -v theproduct="${theproduct}" '
( $4 == thecountry ) && ( $3 == theproduct ) && (($1<theday)||(($1==theday)&&($2<thetime))) { before=$0 ; }
END { print before ; }' myfile.csv
That last one also works if the first column changes (i.e., if the file spans several days), but you also need to feed it theday.
You could use awk instead of your grep like this:
awk -v FS=',' -v Hour=22 -v Min=9 '{split($2, a, "[:]"); if ((3600*a[1] + 60*a[2] + a[3] - 3600*Hour - 60*Min)^2 < 100) print $0}' file
and basically change the 100 to whatever tolerance you want.
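Since the difference is squared, 100 corresponds to a window of roughly ±10 seconds; a hedged variant that takes the tolerance in seconds as a parameter (tol is an illustrative name) might look like:
awk -v FS=',' -v Hour=22 -v Min=9 -v tol=10 '
{
    split($2, a, ":")
    diff = 3600*a[1] + 60*a[2] + a[3] - (3600*Hour + 60*Min)
    if (diff^2 < tol^2) print   # same test as above, with the tolerance parametrized
}' file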
