Bash find last entry before timestamp

I have a .csv file that is formatted thus:
myfile.csv
**Date,Timestamp,Data1,Data2,Data3,Data4,Data5,Data6**
20130730,22:08:51.244,APPLES,Spain,67p,blah,blah
20130730,22:08:51.244,PEARS,Spain,32p,blah,blah
20130730,22:08:51.708,APPLES,France,102p,blah,blah
20130730,22:10:52.108,APPLES,Spain,67p,blah,blah
20130730,22:10:58.244,APPLES,Spain,67p,blah,blah
I wish to feed in a timestamp which most likely will NOT match up perfectly to the millisecond with those in the file, and find the preceding line that matches a particular grep search.
so e.g. something like:
cat myfile.csv | grep 'Spain' | grep 'APPLES' | grep -B1 "22:09"
should return
20130730,22:08:51.244,APPLES,Spain,67p,blah,blah
But thus far I can only get it to work with exact timestamps in the grep. Is there a way to get it to treat these as a time series? (I am guessing that's what the issue is here - it's trying pure pattern matching and not unreasonably failing to find one)

I also have a fancy solution using awk:
awk -F ',' -v mytime="2013 07 30 22 09 00" '
BEGIN {tlimit=mktime(mytime); lastline=""}
{
l_y=substr($1,1,4); l_m=substr($1,5,2); l_d=substr($1,7,2);
split($2,l_hms,":"); l_hms[3]=int(l_hms[3]);
line_time=mktime(sprintf("%d %d %d %d %d %d", l_y, l_m, l_d, l_hms[1], l_hms[2], l_hms[3]));
if (line_time>tlimit) exit; lastline=$0;
}
END{if (lastline=="") print $0; else print lastline;}' myfile.csv
It works by building a timestamp from each line with awk's time function mktime. I also make the assumption that $1 is the date.
On the first line, you provide the timestamp of the time limit you want (here I chose 2013 07 30 22 09 00). You have to write it in the format mktime expects: YYYY MM DD hh mm ss. The awk statement begins by building the timestamp of your time limit. Then, for each line, it extracts year, month and day from $1 (line 4 of the script), then the exact hour from $2 (line 5). As mktime takes only whole seconds, I truncate the seconds (you could round instead with int(l_hms[3]+0.5)). Here you can do everything you want to approximate the timestamp, like discarding the seconds. On line 6, I build the timestamp from the six date fields extracted so far. Finally, on line 7, I compare timestamps and exit once the time limit is reached. As you want the preceding line, I store each line in the variable lastline. On exit, I print lastline; if the time limit is reached on the very first line, I print that first line.
This solution works well on your sample file, and works for any date you supply. You only have to supply the date limit in the correct format!
EDIT
I realize that mktime is not necessary. If the assumption holds that $1 is the date written as YYYYMMDD, you can compare the date as a number and then the time (extracted with split and rebuilt as a number, as in other answers). In that case, you can supply the time limit in whatever format you want and recover the proper date and time limits in the BEGIN block.
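A minimal sketch of that mktime-free idea (assuming the limit arrives as separate date and time strings; dlimit and tlimit are made-up names):
awk -F ',' -v dlimit="20130730" -v tlimit="22:09" '
($1 > dlimit) || ($1 == dlimit && $2 >= tlimit) { exit }
{ lastline = $0 }
END { if (lastline != "") print lastline }' myfile.csv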

you could have an awk script that keeps in memory the last line it saw whose timestamp is lower than the one you feed it, and prints that last match at the end (assuming the lines are in ascending order)
ex:
awk -v FS=',' -v thetime="22:09" '($2 < thetime) { before=$0 ; } END { print before ; }' myfile.csv
This happens to work because you feed it a string that, lexicographically, doesn't need to have the complete width (i.e. 22:09:00.000) to be compared.
The same, but on several lines for readability:
awk -v FS=',' -v thetime="22:09" '
($2 < thetime) { before=$0 ; }
END { print before ; }' myfile.csv
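To see why the truncated string works, note that awk compares these values character by character as plain strings; a quick sanity check with made-up values:
awk 'BEGIN { print ("22:08:51.244" < "22:09"), ("22:10:01.000" < "22:09") }'
prints 1 0.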
Now if I understand your complete requirements: you need to find, among the lines matching a country and a type of product, the last line before a timestamp? Then:
awk -v FS=',' -v thetime="${timestamp}" -v country="${thecountry}" -v product="${theproduct}" '
( $4 == country ) && ( $3 == product ) && ( $2 < thetime ) { before=$0 ; }
END { print before ; }' myfile.csv
should work for you... (feed it 22:09, Spain and APPLES, and it returns the expected "20130730,22:08:51.244,APPLES,Spain,67p,blah,blah" line)
And if your file spans several days (to address Bentoy13's concern),
awk -v FS=',' -v theday="${theday}" -v thetime="${timestamp}" -v thecountry="${thecountry}" -v theproduct="${theproduct}" '
( $4 == thecountry ) && ( $3 == theproduct ) && (($1<theday)||(($1==theday)&&($2<thetime))) { before=$0 ; }
END { print before ; }' myfile.csv
That last one also works if the first column changes (i.e., if the data spans several days), but you need to feed it theday as well.
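For reference, the shell variables the last snippet expects could be set like this (values made up to match the sample data):
theday="20130730" ; timestamp="22:09" ; thecountry="Spain" ; theproduct="APPLES"
which would make it print the 20130730,22:08:51.244,APPLES,Spain,67p,blah,blah line.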

You could use awk instead of your grep like this:
awk -v FS=',' -v Hour=22 -v Min=9 '{split($2, a, "[:]"); if ((3600*a[1] + 60*a[2] + a[3] - 3600*Hour - 60*Min)^2 < 100) print $0}' file
and change the 100 to whatever tolerance you want; note it is the square of the allowed difference in seconds, so 100 means within 10 seconds.
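A variant that makes the tolerance explicit as a variable might read (tol, in seconds, is a made-up name):
awk -v FS=',' -v Hour=22 -v Min=9 -v tol=10 '{
split($2, a, ":")
diff = 3600*a[1] + 60*a[2] + a[3] - 3600*Hour - 60*Min
if (diff^2 < tol^2) print
}' myfile.csv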

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the 2nd and 3rd lines' first columns are expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. Then, if the 2nd element of arr is greater than 15, the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= starts, and substr, to get the 2 characters after ETA= (4 is used because ETA= is 4 characters long and index gives the start position). I add +0 to convert the result to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
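If the two-digit assumption is too strict, a small variant (an untested sketch) can lean on awk's numeric coercion, which reads however many digits follow ETA=:
awk 'index($0,"ETA=") { if (substr($0, index($0,"ETA=")+4)+0 > 15) print $1 }' file.txt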
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a field separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line on whitespace to obtain the first column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one, if any). Adding +0 forces a numeric conversion, which ignores any non-numeric text after the number at the beginning of the field. So, for example, on the first line we are actually checking whether 12:00, team=xyz,user1=tom,dom=dby.com converted to a number, which is 12, is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

How to filter output lines from a bash command, based on dates at the start of the line?

I am getting the following lines as the output of some bash pipe
output
20200604_tsv
20200605_tsv
20200606_tsv
20200706_tsv
I have a date in YYYYMMDD format in a variable
filter_date="20200605"
I want to apply a date comparison to the output lines, i.e. pick only lines whose first part (before '_') is less than or equal to filter_date.
i.e. Expected output
20200604_tsv
20200605_tsv
How to achieve this filtering in bash pipe?
I have tried the following (lexicographically matching the string) but am not able to filter and get the original names.
BASH_CMD_THAT_OUTPUT_LINES | sort | awk '{name = ($1); print name <= "20200605*"}'
## Answer
1
0
0
0
Could you please try the following, written and tested with the shown samples in GNU awk.
awk -v filter_date="20200605" '
BEGIN{
  FS=OFS="_"
  filter=mktime(substr(filter_date,1,4)" "substr(filter_date,5,2)" "substr(filter_date,7,2)" 00 00 00")
}
{
  curr_dat=mktime(substr($1,1,4)" "substr($1,5,2)" "substr($1,7,2)" 00 00 00")
}
filter<curr_dat{ exit }
1
' Input_file
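With the shown samples and filter_date="20200605", this prints:
20200604_tsv
20200605_tsv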
Explanation: Adding detailed explanation for the above.
awk -v filter_date="20200605" '  ##Start the awk program, creating the variable filter_date, the date set by OP up to which lines should be kept.
BEGIN{  ##Start the BEGIN section of this program.
  FS=OFS="_"  ##Set the field separator and output field separator to _.
  filter=mktime(substr(filter_date,1,4)" "substr(filter_date,5,2)" "substr(filter_date,7,2)" 00 00 00")  ##Create the filter variable: mktime plus substr turn the filter date into epoch time.
}
{
  curr_dat=mktime(substr($1,1,4)" "substr($1,5,2)" "substr($1,7,2)" 00 00 00")  ##Create the curr_dat variable: mktime plus substr turn the current line's date into epoch time.
}
filter<curr_dat{ exit }  ##If the filter date is less than the current line's date, exit the program.
1  ##1 prints the current line, which happens while the current date is less than or equal to the filter date.
' Input_file  ##Mention the Input_file name here.
Awk can convert strings to numbers very easily by stripping what is redundant; e.g. the string 123_foo is converted to 123 if you add 0 to it. So the following operation does what you request:
command | awk '($0+0 <= 20200605)'
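A quick way to convince yourself of that coercion:
awk 'BEGIN { print "20200604_tsv" + 0 }'
which prints 20200604.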
This method works excellently if you have a sortable date format like YYYYMMDD. If you have a different format such as YYYYDDMM, you have to first convert the format, e.g.
command | awk '{d=substr($0,1,4)substr($0,7,2)substr($0,5,2)}(d+0 <= 20200605)'
Remark that in the last solution the comparison number is still written as YYYYMMDD (the order d is rebuilt in), not YYYYDDMM.
I have found a simple way to match lexicographically.
The following is test data and an answer simulation
## 1. Test data
cat > /tmp/tmp_test_data <<EOF
20200605_tsv
20200607_tsv
20200604_tsv
20200718_tsv
20200606_tsv
EOF
## 2. Threshold date
check_date="20200605"
## 3. Sort, Filter and output
cat /tmp/tmp_test_data \
| sort \
| awk -v check_d=${check_date} '{
name = $1;
dt = substr(name, 1, 8);
if (dt <= check_d)
{print name}
}'
Bash only:
while read -r line
do
[[ $line =~ ^[0-9]{8} ]] && [ ${line::8} -le 20200605 ] && echo "$line"
done < file # actually command | while ...

Awk script else statement not working

I am writing an awk script inside of a shell script to display all logins of the current month, and the number of logins on each day of the month so far. I will have to determine the day on which the number of logins has been the greatest, but first I would like to figure out how to write the else statement of the if statement in the code below:
#!/bin/bash
month=$(date +"%b")
day=$(date +"%e")
last | awk -v m=$month -v d=$day 'BEGIN{
max=0; c=0; maxd=day}
($5 ~ m) {
if ($6 =d) {print; c++; print c}
else {printf("Else statement test")}
}
'
So far it works fine without the line containing the else statement, but it seems like it won't recognize the else no matter what I add to it.
With
$5 ~ m
I check whether the current line is of the current month, and then
$6 =d
checks if it's still the same day. If so, then the counter increases, and I print the current number of daily logins. I would like to save the value of the counter to an associative array in the else statement and set the counter variable back to zero too when I encounter a new day (when $6 no longer equals d).
I've tried to add these operations to the else statement, but when I run the script, it won't recognize the else. It will only print logins of today (Apr. 23) and count the lines, but won't execute the else part. Why isn't it working?
Edit:
I figured that comparison expressions are identical to the ones from most languages and I've corrected it, but it won't print all lines from current day.
Your if condition is not correct:
if ($6 =d)
That is an assignment, not a comparison. It should be:
if ($6 == d)
(d is already an awk variable set with -v, so it must not be written as a quoted shell expansion inside the single-quoted script.)
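With the comparison fixed, the original if/else block would read:
if ($6 == d) {print; c++; print c}
else {printf("Else statement test\n")}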
This should work better:
$ last |
awk -v d="$(date +'%b')" '$5==d{day[$6]++}
END {for(k in day) print d, k, day[k]}' |
sort -k2n
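To also get the day with the most logins (the asker's stated follow-up goal), a hedged extension of the same idea, keeping the field positions used above (last's output format varies between systems), might be:
last | awk -v m="$(date +'%b')" '
$5==m { day[$6]++ }
END { for (k in day) if (day[k] > max) { max=day[k]; maxd=k }
print m, maxd, max }'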

How to pass a bash variable to one of the two reg expressions in awk

I have a file where the first two fields are 'Month' and 'Year'. like as follows:
April 2016 100 200 300
May 2016 150 250 300
June 2016 200 250 400
Such data is stored for about 30 months. I need to get an output starting from April of any year to March of next year (12 months). When I use following awk code on terminal I get the correct answer.
awk '/March/ && /2016/ {for(i=1; i<=12; i++){getline;print}}' file
The first pattern will always be the same 'March', however the second pattern will depend upon user input. User may ask for 2015 or 2017 or any other.
I do not understand exactly how the above code works but more importantly I am unable to pass the user input for the year to awk and get the correct result.
I have tried the following:
F_year=2016
awk -v f_year="$F_year" '/March/ && /$1 ~f_year/ {
for (i=1; i<=12; i++) {
getline;
print
}
}' file
I would appreciate it if someone could give me the solution with some explanation.
OP code:
$ awk -v f_year="$F_year" '
/March/ && /$1 ~f_year/ { # removing the latter /.../ would work, but... (1)
for(i=1; i<=12; i++) { # (2)
getline # getline is a bit controversial... (3)
print
}
}' file
Modified:
$ awk -v f_year="$F_year" '
(/March/ && $2==f_year && (c=12)) || --c > 0 { # (1) == is better
# (2) awk is one big record loop, use it
print # (3) with a decreasing counter var c
}' file
The above is somewhat untested as your data samples did not fully allow it, but 2 months including April seemed to work (/April/ ... && (c=2)). Also, you could remove the whole {print} block, as printing the record is awk's default action.
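Spelled out, that two-month check would be (a sketch; the year value is made up to match the sample data):
awk -v f_year="2016" '(/April/ && $2==f_year && (c=2)) || --c > 0' file
which prints the April 2016 and May 2016 lines.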
You can use GNU sed (the /regex/,+N address form is a GNU extension):
sed -n '/April 2016/,+11 p' file
Or
month="April"
year="2016"
sed -n "/${month} ${year}/,+11 p" file
awk -v year="$F_year" '$1=="April" && $2==year{f=1} f{if (++c==13) exit; print}' file
Untested of course since you didn't provide sample input/output we could test against. Don't use getline until you've read and fully understand everything discussed in http://awk.freeshell.org/AllAboutGetline.

unix shell script to get nth business day

Referencing the solution posted on this unix.com thread for getting the Nth business day of the month, I tried to get the 16th business day of the month using the following code, but it doesn't work.
currCal=`/usr/bin/cal`
BUSINESS_DAYS=`echo $($currCal|nawk 'NR>2 {print substr($0,4,14)}' |tr "\n" " ")`
The error when executing this is:
nawk: syntax error at source line 1 context is
NR>2 {print >>> substr(test. <<< sh,4,14)}
nawk: illegal statement at source line 1
I'm guessing it takes $0 as the script name, causing the syntax error. Please help.
There seem to be a few issues with what you have above.
First, I agree with @John1024 that in order to get the nawk error you've posted, you must actually be running:
BUSINESS_DAYS=`echo $($currCal|nawk "NR>2 {print substr($0,4,14)}" |tr "\n" " ")`
with double quotes around the nawk script.
Furthermore, once you resolve the nawk error, you're going to run into issues with how you are using currCal. You get the actual output of the cal command into the currCal variable, but then use the variable's value (that is, the output of cal) as a command before the |, rather than echoing it into the pipe or something similar.
This brings up the additional question of why you're using echo on the result of a subshell command (the $() part) within another command substitution (the outer backticks).
Finally, the two lines you show above only get a list of the business days in the current month into the BUSINESS_DAYS variable. They do not output/save the 16th such day.
Taking all of the above into consideration (and also changing to use the $() subshell syntax consistently), you might want one of the following invocations:
If you really need to cache the current month's calendar and want to pull multiple days:
currCal="$(/usr/bin/cal)"
BUSINESS_DAYS="$(echo "${currCal}" | \
nawk 'NR>2 {print substr($0,4,14)}' | \
tr "\n" " ")"
DAY=16
DAYTH_DAY="$(echo "${BUSINESS_DAYS}" | nawk -v "day=${DAY}" '{ print $day }')
If this is just a one-and-done:
DAY=16
DAYTH_DAY="$(/usr/bin/cal | \
nawk 'NR>2 {print substr($0,4,14)}' | \
tr "\n" " " | \
nawk -v "day=${DAY}" '{ print $day }')"
One more note: the processing here can be simplified if done entirely in awk(/nawk), but I wanted to stick to the basic framework you had already chosen.
Update per the request in the comment:
A pure POSIX awk version:
DAY=16
DAYTH="$(cal | awk -v "day=${DAY}" '
(NR < 3) { next ; }
/^.[0-9 ]/ { $1="" ; }
/^ / || (NF == 7) { $NF="" ; }
{ hold=hold $0 ; }
END { split(hold,arr," ") ; print arr[day] ; }')"
Yes, simplified is a matter of opinion, and I'm sure someone can make this more concise. Explanation of how this works:
Skip the header of the cal output:
(NR < 3) { next ; }
For weeks that have a date on the Sunday, trim the date of that Sunday:
/^.[0-9 ]/ { $1="" ; }
For weeks that start after Sunday (first week of a month) or weeks that have a full seven days, trim the date of Saturday for that week:
/^ / || (NF == 7) { $NF="" ; }
Once the lines only have the dates of weekdays, curry them into hold:
{ hold=hold $0 ; }
At the end, split hold on spaces so we can grab the Nth day:
END { split(hold,arr," ") ; print arr[day] ; }
No awk, just software tools:
set -- $(cal -h | rev | cut --complement -b-5,20- | rev | tail -n +3) ; \
shift 15 ; echo $1
Output:
22
The idea: the rev/cut/rev combination strips the weekend columns from each line, tail -n +3 drops the header lines, set -- loads the remaining dates as positional parameters, and shift 15 leaves the 16th business day in $1.
The output of cal is tricky to parse because:
It's right justified.
It's space delimited.
One or two digit dates mean two or one delimiting spaces.
There are more leading spaces for the first days of a month.
Parsing won't quite work without the -h option (which turns off 'today' highlighting).
