gawk - suppress output of matched lines - bash

I'm running into an issue where gawk prints unwanted output. I want to find lines in a file that match an expression, test to see if the information in the line matches a certain condition, and then print the line if it does. I'm getting the output that I want, but gawk is also printing every line that matches the expression rather than just the lines that meet the condition.
I'm trying to search through files containing dates and times for certain actions to be executed. I want to show only lines that contain times in the future. The dates are formatted like so:
text... 2016-01-22 10:03:41 more text...
I tried using sed to print all lines starting from the first one that had the current hour, but there is no guarantee that the file contains a line with that hour (nor that the lines all share any particular year, month, day, etc.), so I needed something more robust. I decided to try converting the times into seconds since the epoch and comparing that to the current systime. If the conversion produces a number greater than systime, I want to print that line.
Right now it seems like gawk's mktime() function is the key to this. Unfortunately, it requires input in the following format:
yyyy mm dd hh mm ss
I'm currently searching a test file (called timecomp) for a regular expression matching the date format.
Edit: the test file only contains a date and time on each line, no other text.
I used sed to replace the date separators (i.e. /, -, and :) with a space, and then piped the output to a gawk script called stime using the following statement:
sed -e 's/[-://_]/ /g' timecomp | gawk -f stime
Here is the script
# stime
BEGIN { tsec=systime(); } /.*20[1-9][0-9] [0-1][1-9] [0-3][0-9] [0-2][0-9][0-6][0-9] [0-6][0-9]/ {
if (tsec < mktime($0))
print "\t" $0 # the tab is just to differentiate the desired output from the other lines that are being printed.
} $1
Right now this is getting the basic information that I want, but it is also printing every line that matches the original expression, rather than just the lines containing a time in the future. Sample output:
2016 01 22 13 23 20
2016 01 22 14 56 57
2016 01 22 15 46 46
2016 01 22 16 32 30
2016 01 22 18 56 23
2016 01 22 18 56 23
2016 01 22 22 22 28
2016 01 22 22 22 28
2016 01 22 23 41 06
2016 01 22 23 41 06
2016 01 22 20 32 33
How can I print only the lines in the future?
Note: I'm doing this on a Mac, but I want it to be portable to Linux because I'm ultimately making this for some tasks I have to do at work.
I'd like to accomplish this in one script rather than requiring the sed statement to reformat the dates, but I'm running into other issues that probably require a different question, so I'm sticking to this for now.
Any help would be greatly appreciated! Thanks!
Answered: I had a $1 at the last line of my script, and that was the cause of the additional output.
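For anyone hitting the same thing: in awk, a rule that consists of a pattern with no action gets the default action of printing the record, so the trailing $1 printed every line whose first field was non-empty and non-zero. A minimal sketch with made-up input (the function name demo and the sample lines are only for illustration):

```shell
# A rule with a pattern but no action prints the record by default.
demo() {
  printf '%s\n' 'keep 1' 'skip 2' |
    awk '/keep/ { print "matched: " $0 }   # explicit action
         $1                                # bare pattern: implies { print }'
}
demo
```

Both input lines come out of the bare $1 rule, and the first one is additionally printed by the explicit action, which is the doubled output seen in the question.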

Instead of awk, this is an (almost) pure Bash solution:
#!/bin/bash
# Regex for time string
re='[0-9]{4}-[0-9]{2}-[0-9]{2} ([0-9]{2}:){2}[0-9]{2}'
# Current time, in seconds since epoch
now=$(date +%s)
while IFS= read -r line; do
# Match time string
[[ $line =~ $re ]] || continue # skip lines that contain no time string
time_string="${BASH_REMATCH[0]}"
# Convert time string to seconds since epoch
time_secs=$(date -d "$time_string" +%s)
# If time is in the future, print line
if (( time_secs > now )); then
echo "$line"
fi
done < <(grep 'pattern' "$1")
This takes advantage of the Coreutils date formatting to convert a date to seconds since epoch for easy comparison of two dates:
$ date
Fri, Jan 22, 2016 11:23:59 PM
$ date +%s
1453523046
And the -d argument to take a string as input:
$ date -d '2016-01-22 10:03:41' +%s
1453475021
The script does the following:
Filter the input file with grep (for lines containing a generic pattern, but could be anything)
Loop over lines containing pattern
Match the line with a regex that matches the date/time string yyyy-mm-dd hh:mm:ss and extract the match
Convert the time string to seconds since epoch
Compare that value to the time in $now, which is the current date/time in seconds since epoch
If the time from the logfile is in the future, print the line
For an example input file like this one
text 2016-01-22 10:03:41 with time in the past
more text 2016-01-22 10:03:41 matching pattern but in the past
other text 2017-01-22 10:03:41 in the future matching pattern
some text 2017-01-23 10:03:41 in the future but not matching
blahblah 2022-02-22 22:22:22 pattern and also in the future
the result is
$ date
Fri, Jan 22, 2016 11:36:54 PM
$ ./future_time logfile
other text 2017-01-22 10:03:41 in the future matching pattern
blahblah 2022-02-22 22:22:22 pattern and also in the future

This is what I have working now. It works for a few different date formats and on the actual files that have more than just the date and time. The default format that it works for is yyyy/mm/dd, but it takes an argument to specify a mm/dd/yyyy format if needed.
BEGIN { tsec=systime(); dtstr=""; dt[1]="" } /.*[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
cur=$0
if ( fm=="mdy" ) {
match($0,/[0-1][0-9][-_\/][0-3][0-9][-_\/]20[1-9][0-9]/) # mm dd yyyy; [0-1][0-9] so that month 10 matches too
section=substr($0,RSTART,RLENGTH)
split(section, dt, "[-_/]")
dtstr=dt[3] " " dt[1] " " dt[2]
gsub(/[0-1][0-9][-_\/][0-3][0-9][-_\/]20[1-9][0-9]/, dtstr, cur)
}
gsub(/[-_:\/,]/, " ", cur)
match(cur,/20[1-9][0-9] [0-1][0-9] [0-3][0-9][[:space:]]*[0-2][0-9] [0-6][0-9] [0-6][0-9]/)
arr=mktime(substr(cur,RSTART,RLENGTH))
if ( tsec < arr)
print $0
}
I'll be adding more format options as I find more formats, but this works for all the different files I've tested so far. If they have a mm/dd/yyyy format, you call it with:
gawk -f stime fm=mdy filename
I plan on adding an option to specify the time window that you want to see, but this is an excellent start. Thank you guys again, this is going to drastically simplify a few tasks at work ( I basically have to retrieve a great deal of data, often under time pressure depending on the situation ).

Related

Batch change accessed and modified date, with date from another file's content?

I'm migrating old notes from a SQL database based note taking app to separate text files.
I've managed to export the notes and date codes as separate text files.
The files are ordered like this:
$ ls -1
Note0001.txt
Note0001-date.txt
Note0002.txt
Note0002-date.txt
Note0003.txt
Note0003-date.txt
The contents of the date files looks like this:
$ cat Note0001-date.txt
388766121.742373
$ cat Note0002-date.txt
274605766.273638
$ cat Note0003-date.txt
384996285.436197
The dates are seconds since the epoch 2001-01-01. See other question about the format: What type of date format is this? And how to convert it?.
How do I batch change the accessed and modified date of the notes files, NoteNNNN.txt, to the date in the contents of respective date file, NoteNNNN-date.txt?
How to convert the date to UTC+1? Preferably with consideration of DST (daylight saving time).
I am trying to convert the dates with the method described in this question:
https://unix.stackexchange.com/questions/2987/
But it outputs an error message in bash 3.2.57 (macOS):
$ date -d '2001-01-01 UTC+1 + 388766121 seconds'
usage: date [-jnRu] [-d dst] [-r seconds] [-t west] [-v[+|-]val[ymwdHMS]] ...
[-f fmt date | [[[mm]dd]HH]MM[[cc]yy][.ss]] [+format]
I am new to working with the dates and timestamps in the terminal.
Iterate over each file pair, read the timestamp, shift it so that it is something Unix tools can understand, then touch the files. I.e., big problems are composed of small problems.
# find all files named .txt but not -date.txt
find . -name '*.txt' '!' -name '*-date.txt' |
# remove the .txt suffix
sed 's/\.txt$//' |
{
# the reference point of files content
start=$(date -d "2001-01-01" +%s) # will not work with BSD date
# I guess just precompute the value (978303600 is 2001-01-01 00:00:00 at UTC+1;
# midnight UTC would be 978307200):
start=978303600
# for each file
while IFS= read -r f; do
# get the timestamp
diff=$(<"$f"-date.txt)
# increment the timestamp to seconds since epoch
ref=$(<<<"scale=6; $start + $diff" bc)
# TODO: use a tool to convert the timestamp since epoch to a BSD touch
# compatible format, i.e. to ccyy-mm-ddTHH:MM:SS[.frac][Z]
ref=(TODO "$ref")
# change access and modification times of .txt file
touch -d "@$ref" "$f".txt
done
}
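To make the TODO above a bit more concrete, here is a hedged sketch of the two missing pieces. It assumes the reference point is midnight UTC (offset 978307200), drops the fractional seconds, and uses touch -t instead of touch -d; the helper names to_unix and fmt_for_touch are made up for illustration. The fallback to date -r for BSD/macOS is an assumption based on the usage message shown in the question.

```shell
# Offset between the Unix epoch and 2001-01-01 00:00:00 UTC, in seconds.
REF_OFFSET=978307200

# Convert "seconds since 2001-01-01 UTC" (fraction dropped) to Unix seconds.
to_unix() {
  echo $(( REF_OFFSET + ${1%.*} ))
}

# Format Unix seconds as CCYYMMDDhhmm.SS (local time) for `touch -t`;
# tries GNU `date -d @N` first, then BSD/macOS `date -r N`.
fmt_for_touch() {
  date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S
}

to_unix 388766121.742373    # prints 1367073321
# Intended use inside the loop (not run here):
#   touch -t "$(fmt_for_touch "$(to_unix "$diff")")" "$f".txt
```

Both date and touch -t work in the local timezone here, so the DST handling is left to the system's timezone database.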
Assuming your OS local timezone is what you want for your output, and you have a version of awk that supports the GNU awk time functions, you could use the
following script. Regarding DST, note how mktime() treats its optional daylight-savings flag:
If the DST daylight-savings flag is positive, the time is assumed to
be daylight savings time; if zero, the time is assumed to be standard
time; and if negative (the default), mktime() attempts to determine
whether daylight savings time is in effect for the specified time.
file tst.awk:
BEGIN {
epoch = mktime("2001 01 01 00 00 00")
}
FNR==1 {
close(out)
out = substr(FILENAME, 1, length(FILENAME)-9) ".txt"
}
{
# caution: out is NoteNNNN.txt, so this replaces the note's contents with the formatted date
print strftime("%F %T %Z", epoch+$0) > out
}
Usage:
awk -f tst.awk *-date.txt
Example
Here is an example with the script, without the I/O part, just converting the datetimes.
test file:
> cat file
388766121.742373
274605766.273638
384996285.436197
script tst.awk:
BEGIN { epoch = mktime("2001 01 01 00 00 00") }
{ print strftime("%F %T %Z", epoch+$0) }
Output:
> awk -f tst.awk file
2013-04-27 15:35:21 EEST
2009-09-14 08:22:46 EEST
2013-03-14 23:24:45 EET
The timezone of my box is used by default (EET). To print in a different timezone, set the TZ environment variable accordingly. DST is also taken into account by default: notice that some dates are printed as EEST (summer time).

Print line if column 2 is greater than column 2 on the next line

I have a file with multiple lines that all have a date in the second column. I'm looking for a command that prints the whole line if the date is greater than the date on the next line.
When this is no longer the case I want it to stop, don't print anything else.
I'm a rookie so if you could explain your answer that would be great.
I'm trying to use awk (answer can be any command)
awk '$2 > ?nextline?$2 {print}' file
I couldn't find how to check next line or how to stop after the first time the greater than command isn't true.
Input:
Jan 20 text1
Jan 15 text2
Jan 15 text3
Jan 3 text4
Jan 27 text5
Jan 17 text6
(more lines...)
Wanted output:
Jan 20 text1
Jan 15 text2
Jan 15 text3
Jan 3 text4
An awk version:
awk 'f && $2>f {exit} 1; {f=$2}' file
Jan 20 text1
Jan 15 text2
Jan 15 text3
Jan 3 text4
f && $2>f {exit} Test whether f is set and the second field is larger than f. If yes, exit the program.
1; is always true, so print the line.
{f=$2} set f to second field.
It could even be shortened some more:
awk 'f&&$2>f{exit}f=$2' file
f=$2 By setting it as pattern it will be true and print at the same time as f is set to $2
This shorter version will not print blank lines in the data (or lines whose second field is 0), since f=$2 is then false; the first version prints them.
Generally, you have to reverse your requirement. Instead of printing out the current line if the next line does something, because sed, awk, etc., can't see the future, you need to stop printing if the current line does something compared to the previous line since awk and friends can see the past by storing it in a variable.
So a simple way, since you said language doesn't matter, is to do this:
perl -palE 'BEGIN {$h = 999} last if $F[1] > $h; $h = $F[1]' < file
What we are doing here is passing perl -p "loop through input and print line at the end of the loop", -a "auto-split the line on spaces into the @F array", -l "auto-handle line endings" (not strictly required here, but just a good habit most of the time), and -E "execute the code from the next parameter with the current version of perl specified" (-e would suffice here, but, again, habit). The code we pass in starts off by setting $h (highest allowed at this point) to something out of range (I'm assuming no number will be 999+ since you say they're days of a month), uses last to terminate the loop if the current day is higher than the highest allowed, and sets that high point to the current value if we get past the if. Perl then automatically prints out the current line and loops to the next line.
The key point is that we only look at the current line and track in a variable the relevant history so that we don't need to look into the future.
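The same "remember the previous line" idea can be written in awk with an explicit NR > 1 check instead of relying on the truthiness of the stored value. This is only a sketch under the same assumption as the answers above, namely that comparing the day number in column 2 is enough (the month is ignored); the function name until_increase is made up for illustration.

```shell
# Print lines until the day in column 2 increases compared to the previous line.
until_increase() {
  awk 'NR > 1 && $2 + 0 > prev { exit }   # first increase: stop reading
       { print; prev = $2 + 0 }           # otherwise print and remember the day'
}

printf '%s\n' 'Jan 20 text1' 'Jan 15 text2' 'Jan 15 text3' \
  'Jan 3 text4' 'Jan 27 text5' 'Jan 17 text6' | until_increase
```

Adding 0 forces a numeric comparison, and testing NR > 1 means a stored day of 0 cannot be mistaken for "nothing seen yet".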

How to get last 10 minutes of logs from remote host

I'm trying to get the last x minutes of logs from /var/log/maillog from a remote host (I'm using this script within icinga2) but having no luck.
I have tried a few combinations of awk, sed, and grep but none have seemed to work. I thought it was an issue with double quotes vs single quotes but I played around with them and nothing helped.
host=$1
LOG_FILE=/var/log/maillog
hour_segment=$(ssh -o 'StrictHostKeyChecking=no' myUser@${host} 2>/dev/null "sed -n "/^$(date --date='10 minutes ago' '+%b %_d %H:%M')/,\$p" ${LOG_FILE}")
echo "${hour_segment}"
When running the script with bash -x, I get the following output:
bash -x ./myScript.sh host.domain
+ host=host.domain
+ readonly STATE_OK=0
+ STATE_OK=0
+ readonly STATE_WARN=1
+ STATE_WARN=1
+ LOG_FILE=/var/log/maillog
+++ date '--date=10 minutes ago' '+%b %_d %H:%M'
++ ssh -o StrictHostKeyChecking=no myUser@host.domain 'sed -n /^Jan' 8 '12:56/,$p /var/log/maillog'
+ hour_segment=
+ echo ''
Maillog log file output. I'd like $hour_segment to look like the below output also so I can apply filters to it:
head -n 5 /var/log/maillog
Jan 6 04:03:36 hostname imapd: Disconnected, ip=[ip_address], time=5
Jan 6 04:03:36 hostname postfix/smtpd[9501]: warning: unknown[ip_address]: SASL LOGIN authentication failed: authentication failure
Jan 6 04:03:37 hostname imapd: Disconnected, ip=[ip_address], time=5
Jan 6 04:03:37 hostname postfix/smtpd[7812]: warning: unknown[ip_address]: SASL LOGIN authentication failed: authentication failure
Jan 6 04:03:37 hostname postfix/smtpd[7812]: disconnect from unknown[ip_address]
Using GNU awk's time functions:
$ awk '
BEGIN {
m["Jan"]=1 # convert month abbreviations to numbers
# fill in the rest of the months
m["Dec"]=12
nowy=strftime("%Y") # assume current year, deal with Dec/Jan below
nowm=strftime("%b") # get the month, see above comment
nows=strftime("%s") # current epoch time
}
{ # below we form the datespec for mktime
dt=(nowm=="Jan" && $1=="Dec"?nowy-1:nowy) " " m[$1] " " $2 " " gensub(/:/," ","g",$3)
if(mktime(dt)>=nows-600) # if timestamp is less than 600 secs away
print # print it
}' file
Current year is assumed. If it's January and log has Dec we subtract one year from mktime's datespec: (nowm=="Jan" && $1=="Dec"?nowy-1:nowy). Datespec: Jan 6 04:03:37 -> 2019 1 6 04 03 37 and for comparison in epoch form: 1546740217.
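The "fill in the rest" part of the BEGIN block can also be written with split() instead of twelve assignments. A small sketch; the wrapper function month_num is hypothetical and only there so the snippet runs on its own, and returning 0 for an unknown name is my own choice.

```shell
# Map a three-letter English month abbreviation to its number (0 if unknown).
month_num() {
  awk -v name="$1" 'BEGIN {
    n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= n; i++) m[names[i]] = i   # m["Jan"]=1 ... m["Dec"]=12
    print ((name in m) ? m[name] : 0)
  }'
}

month_num Dec   # prints 12
```

Only the split() line and the for loop would go into the BEGIN block of the script above; the rest is scaffolding.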
Edit: As no one implemented my specs in the comments, I'll do it myself. tac outputs the file in reverse, and awk prints records while they are within the given time frame (now-t or later) and exits once it meets a date outside of it:
$ tac file | awk -v t=600 ' # time in seconds go here
BEGIN {
m["Jan"]=1
# add more months
m["Dec"]=12
nowy=strftime("%Y")
nowm=strftime("%b")
nows=strftime("%s")
} {
dt=(nowm=="Jan" && $1=="Dec"?nowy-1:nowy) " " m[$1] " " $2 " " gensub(/:/," ","g",$3)
if(mktime(dt)<nows-t) # this changed some
exit
else
print
}'
Coming up with a robust solution that will work 100% bulletproof is very hard since we are missing the most crucial information, the year.
Imagine you want the last 10 minutes of available data on March 01 2020 at 00:05:00. This is a bit annoying since February 29 2020 exists. But in 2019, it does not.
I present here an ugly solution that only looks at the third field (the time) and I will make the following assumptions:
The log-file is sorted by time
There is at least one log every single day!
Under these conditions we can keep track of a sliding window starting from the first available time.
If you save the following in a file extractLastLog.awk
{ t=substr($3,1,2)*3600 + substr($3,4,2)*60 + substr($3,7,2) + offset}
(t < to) { t+=86400; offset+=86400 }
{ to = t }
(NR==1) { startTime = t; startIndex = NR }
{ a[NR]=$0; b[NR]=t }
{ while ( startTime+timeSpan*60 <= t ) {
delete a[startIndex]
delete b[startIndex]
startIndex++; startTime=b[startIndex]
}
}
END { for(i=startIndex; i<=NR; ++i) print a[i] }
then you can extract the last 23 minutes in the following way:
awk -f extractLastLog.awk -v timeSpan=23 logfile.log
The second condition I gave (There is at least one log every single day!) is needed not to have messed up results. In the above code, I compute the time fairly simple, HH*3600 + MM*60 + SS + offset. But I make the statement that if the current time is smaller than the previous time, it implies we are on a different day hence we update the offset with 86400 seconds. So if you have two entries like:
Jan 09 12:01:02 xxx
Jan 10 12:01:01 xxx
it will work, but this
Jan 09 12:01:00 xxx
Jan 10 12:01:01 xxx
will not work. It will not realize the day changed. Other cases that will fail are:
Jan 08 12:01:02 xxx
Jan 10 12:01:01 xxx
as it does not know that it jumped two days. Corrections for this are not easy due to the months (all thanks to leap years).
As I said, it's ugly, but might work.
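On the quoting problem in the question itself: the bash -x trace shows the inner double quotes closing the outer ones, so the sed script is split into several words ('sed -n /^Jan', 8, '12:56/,$p /var/log/maillog'). One way around that is to build the remote command as a single string, with single quotes around the sed script. A sketch, assuming neither the timestamp nor the path contains single quotes and the timestamp contains no slashes; the function name build_remote_cmd is made up for illustration.

```shell
# Build the command line to run on the remote host as ONE string.
# $1: timestamp prefix to start from, $2: log file path.
build_remote_cmd() {
  printf "sed -n '/^%s/,\$p' '%s'\n" "$1" "$2"
}

build_remote_cmd 'Jan  6 04:03' /var/log/maillog
# Intended use (not run here; GNU date syntax as in the question):
#   stamp=$(date --date='10 minutes ago' '+%b %_d %H:%M')
#   ssh "myUser@${host}" "$(build_remote_cmd "$stamp" "$LOG_FILE")"
```

As with the original sed approach, nothing is printed if no log line starts with that exact minute, which is the limitation the awk answers avoid.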

Bash script assistance with renaming file using existing parts of filename

I'm looking for help with a bash script to do some renaming of files for me. I don't know much about bash scripting, and what I have read is overwhelming. It's a lot to know/understand for the limited applications I will probably have.
In Dropbox, my media files are named something like:
Photo Jul 04, 5 49 44 PM.jpg
Video Jun 22, 11 21 00 AM.mov
I'd like them to be renamed in the following format: 2015-07-04 1749.ext
Some difficulties:
The script has to determine if AM or PM to put in the correct 24-hour format
The year is not specified; it is safe to assume the current year
The date, minute and second have a leading zero, but the hour does not; therefore the position after the hour is not absolute
Any assistance would be appreciated. FWIW, I'm running MacOS.
Mac OSX
This uses awk to reformat the date string:
for f in *.*
do
new=$(echo "$f" | awk -F'[ .]' '
BEGIN {
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",month)
for (i in month) {
nums[month[i]]=i
}
}
$(NF-1)=="PM" && $4<12 {$4+=12;} # 12 PM stays 12
$(NF-1)=="AM" && $4==12 {$4=0;} # 12 AM is hour 0
{printf "%s 2015-%02i-%02i %02i%02i.%s",$1,nums[$2],$3,$4,$5,$8;}
')
mv "$f" "$new"
done
After the above was run, the files are now named:
$ ls -1 *.*
Photo 2015-07-04 1749.jpg
Video 2015-06-22 1121.mov
The above was tested on GNU awk but I don't believe that I have used any GNU-specific features.
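The AM/PM handling is the easiest part to get wrong, since 12 AM is hour 0 and 12 PM stays 12. A minimal sketch of just that conversion in plain shell arithmetic; the function name to_24h is made up for illustration, and it assumes the hour is in the range 1 to 12.

```shell
# Convert a 12-hour clock hour plus AM/PM to a zero-padded 24-hour value.
# $1: hour 1..12 (an optional leading zero is fine), $2: AM or PM
to_24h() {
  h=$(( ${1#0} % 12 ))              # 12 -> 0, everything else unchanged
  if [ "$2" = PM ]; then
    h=$(( h + 12 ))                 # PM adds 12, so 12 PM -> 12 and 5 PM -> 17
  fi
  printf '%02d\n' "$h"
}

to_24h 5 PM     # prints 17
to_24h 12 AM    # prints 00
```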
GNU/Linux
GNU date has a handy feature for interpreting human-style date strings:
for f in *.*
do
prefix=${f%% *}
ext=${f##*.}
datestr=$(date -d "$(echo "$f" | sed 's/[^ ]* //; s/[.].*//; s/ /:/3; s/ /:/3; s/,//')" '+%F %H%M')
mv "$f" "$prefix $datestr.$ext"
done
Here is an example of the script in operation:
$ ls -1 *.*
Photo Jul 04, 5 49 44 PM.jpg
Video Jun 22, 11 21 00 AM.mov
$ bash script
$ ls -1 *.*
Photo 2015-07-04 1749.jpg
Video 2015-06-22 1121.mov
While not a simple parse and reformat for date, it isn't that difficult. The bash string tools of parameter expansion/substring removal are all you need to parse the pieces of the date into a format that date can use to output a new date string in the format for use in a filename. (see String Manipulation ) date -d is used to generate a new date string based on the contents of the original filename.
Note: the following presumes the dropbox filenames are in the format you have specified (it doesn't care what the first part of the name or the extension is, as long as the rest matches that format). Here is an example of properly isolating the pieces of the filename needed to generate a date in the format specified.
Further, all spaces have been removed from the filename. While you originally showed a space between the day and hours, I will not provide an example of poor practice by inserting a space in a filename. As such, the spaces have been replaced with '_' and '-':
#!/bin/bash
# Photo Jul 04, 5 49 44 PM.jpg
# Video Jun 22, 11 21 00 AM.mov
# fn="Photo Jul 04, 5 49 44 PM.jpg"
fn="Video Jun 22, 11 21 00 AM.mov"
ext=${fn##*.} # determine extension
prefix=${fn%% *} # determine prefix (Photo or Video)
datestr=${fn%.${ext}} # remove extension from filename
datestr=${datestr#${prefix} } # remove prefix from datestr
day=${datestr%%,*} # isolate Month and date in day
ampm=${datestr##* } # isolate AM/PM in ampm
datestr=${datestr% ${ampm}} # remove ampm from datestr
timestr=${datestr##*, } # isolate time in timestr
timestr=$(tr ' ' ':' <<<"$timestr") # translate spaces to ':' using herestring
cmb="$day $timestr $ampm" # create combined date/proper format
## create date/time string for filename
datetm=$(date -d "$cmb" '+%Y%m%d-%H%M')
newfn="${prefix}_${datetm}.${ext}"
## example moving of file to new name
# (assumes you handle the path correctly)
printf "mv '%s' %s\n" "$fn" "$newfn"
# mv "$fn" "$newfn" # uncomment to actually use
exit 0
Example/Output
$ bash dateinfname.sh
mv 'Video Jun 22, 11 21 00 AM.mov' Video_20150622-1121.mov

Running shell commands within AWK

I'm trying to work on a logfile, and I need to be able to specify the range of dates. So far (before any processing), I'm converting a date/time string to timestamp using date --date "monday" +%s.
Now, I want to be able to iterate over each line in a file, but check if the date (in a human readable format) is within the allowed range. To do this, I'd like to do something like the following:
echo `awk '{if(`date --date "$3 $4 $5 $6 $7" +%s` > $START && `date --date "" +%s` <= $END){/*processing code here*/}}' myfile`
I don't even know if that's possible... I've tried a lot of variations, plus I couldn't find anything understandable/usable online.
Thanks
Update:
Example of myfile is as follows. It's logging IPs and access times:
123.80.114.20 Sun May 01 11:52:28 GMT 2011
144.124.67.139 Sun May 01 16:11:31 GMT 2011
178.221.138.12 Mon May 02 08:59:23 GMT 2011
Given what you have to do, it's really not that hard AND it is much more efficient to do your date processing by converting to strings and comparing.
Here's a partial solution that uses associative arrays to convert the month value to a number. Then you rely on the %02d format specifier to ensure 2 digits. You can reformat the dateTime value with '.', etc or leave the colons in the hr:min:sec if you really need the human readability.
The YYYYMMDD format is a big help in these sort of problems, as LT, GT, EQ all work without any further formatting.
echo "178.221.138.12 Mon May 02 08:59:23 GMT 2011" \
| awk -v StartTime=20110105235959 'BEGIN {
mons["Jan"]=1 ; mons["Feb"]=2; mons["Mar"]=3
mons["Apr"]=4 ; mons["May"]=5; mons["Jun"]=6
mons["Jul"]=7 ; mons["Aug"]=8; mons["Sep"]=9
mons["Oct"]=10 ; mons["Nov"]=11; mons["Dec"]=12
}
{
# 178.221.138.12 Mon May 02 08:59:23 GMT 2011
printf("dateTime=%04d%02d%02d%02d%02d%02d\n",
$NF, mons[$3], $4, substr($5,1,2), substr($5,4,2), substr($5,7,2) )
} '
The -v StartTime is illustrative of how to pass in your start time value (and the matching format).
I hope this helps.
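To round off the partial solution, here is a sketch of the actual range filter built on the same YYYYMMDDhhmmss key. The function name in_range and the variable names lo and hi are made up for illustration; the bounds are exclusive at the start and inclusive at the end, as in the question's pseudocode.

```shell
# Keep lines whose timestamp key lies in (lo, hi]; keys are YYYYMMDDhhmmss strings.
in_range() {
  awk -v lo="$1" -v hi="$2" '
    BEGIN {
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= n; i++) mons[names[i]] = i
    }
    {
      # fields: IP, weekday, month, day, hh:mm:ss, zone, year
      key = sprintf("%04d%02d%02d%02d%02d%02d", $NF, mons[$3], $4,
                    substr($5, 1, 2), substr($5, 4, 2), substr($5, 7, 2))
      # appending "" forces a string (lexicographic) comparison
      if ((key "") > (lo "") && (key "") <= (hi "")) print
    }'
}

# Example call: in_range 20110501120000 20110502235959 < myfile
```

Because every key has the same width and runs from most to least significant unit, plain string comparison orders the keys chronologically.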
Here's an alternative approach using awk's built-in mktime() function. I've never bothered with the month parsing until now - thanks to shelter for that part (see accepted answer). It always feels time to switch language around that point.
#!/bin/bash
# input format:
#(1 2 3 4 5 6 7)
#123.80.114.20 Sun May 01 11:52:28 GMT 2011
awk -v startTime=1304252691 -v endTime=1306000000 '
BEGIN {
mons["Jan"]=1 ; mons["Feb"]=2; mons["Mar"]=3
mons["Apr"]=4 ; mons["May"]=5; mons["Jun"]=6
mons["Jul"]=7 ; mons["Aug"]=8; mons["Sep"]=9
mons["Oct"]=10 ; mons["Nov"]=11; mons["Dec"]=12;
}
{
hmsSpaced=$5; gsub(":"," ",hmsSpaced);
timeInSec=mktime($7" "mons[$3]" "$4" "hmsSpaced);
if (timeInSec > startTime && timeInSec <= endTime) print $0
}' myfile
(I've chosen example time thresholds to select only the last two log lines.)
Note that if the mktime() function were a bit smarter this whole thing would reduce to:
awk -v startTime=1304252691 -v endTime=1306000000 '{t=mktime($7" "$3" "$4" "$5); if (t > startTime && t <= endTime) print $0}' myfile
I'm not sure of the format of the data you're parsing, but I do know that you can't use the backticks within single quotes. You'll have to use double quotes. If there are too many quotes being nested, and it's confusing you, you can also just save the output of your date command to a variable beforehand.
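As for the title question: awk does not expand backticks, but it can run a shell command itself and read the output with the cmd | getline form. A minimal sketch; the wrapper name run_in_awk is made up, and the command is passed in with -v, which means backslash escapes in it would be interpreted by awk.

```shell
# Run a shell command from inside awk and print the first line of its output.
run_in_awk() {
  awk -v cmd="$1" 'BEGIN {
    if ((cmd | getline result) > 0)   # read one line from the command
      print result
    close(cmd)                        # close the pipe so cmd can be run again
  }'
}

run_in_awk 'echo hello'   # prints hello
```

Doing this for every log line starts one process per line, which is why the answers above convert the dates inside awk instead of calling date.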

Resources