I need a little help with a Unix shell script that uses awk.
I have a file like the one below:
139341 8.61248 python_dev ntoma2 r 07/17/2017 07:27:43 gpuml#acepd1641.udp.finco.com 1
139342 8.61248 python_val ntoma2 r 07/17/2017 07:27:48 gpuml#acepd1611.udp.finco.com 1
139652 8.61248 python_dev ntoma2 r 07/17/2017 10:55:57 gpuml#acepd1671.udp.finco.com 1
The file is space separated. I need the 1st and 4th columns, which are the job id and the user name (ntoma2 in this case), for rows whose 6th column (a date in mm/dd/yyyy format) is more than 7 days old. In other words, compare the 6th column with the current date and keep only the rows older than 7 days.
I have the command below to get the job id and user name of jobs older than 7 days:
cat filename.txt | awk -v dt="$(date "--date=$(date) -7 day" +%m/%d/%Y)" -F" " '/qw/{ if($6<dt) print $4,":",$1 }' >> ./longRunningJob.$$
I also have another command that looks up email ids from the user name (the 4th column above):
/ccore/pbis/bin/enum-members "adsusers" | grep ^UNIX -B3 | grep <User-Name> -B2 | grep UPN | awk '{print $2}'
I need to combine the above 2 commands and send a report to every user, like below:
echo "Hello <User Name>, There is a long running job which is of job-id: <job-id> more than 7days, so please kill the job or let us know if we can help. Thank you!" | mailx -s "Long Running Job"
NOTE: if a user name is repeated, all of that user's jobs should go in one email.
I am not sure how I can combine these 2 and send the email to the user. Can someone please help me?
Thank you in advance!!
Vasu
You can certainly do this in awk -- easier in gawk because of date support.
Just to give you an outline of how to do this, I wrote this in Ruby:
$ cat file
139341 8.61248 python_dev ntoma2 r 07/10/2017 07:27:43 gpuml#acepd1641.udp.finco.com 1
139342 8.61248 python_val ntoma2 r 07/09/2017 07:27:48 gpuml#acepd1611.udp.finco.com 1
139652 8.61248 python_dev ntoma2 r 07/17/2017 10:55:57 gpuml#acepd1671.udp.finco.com 1
$ ruby -lane 'BEGIN{ require "date"
jobs=Hash.new { |h,k| h[k]=[] }
users=Hash.new()
pn=7.0
}
t=DateTime.parse("%s %s" % [$F[5].split("/").rotate(-1).join("-"), $F[6]])
ti_days=(DateTime.now-t).to_f
ts="%d days, %d hours, %d minutes and %d seconds" % [60,60,24]
.reduce([ti_days*86400]) { |m,o| m.unshift(m.shift.divmod(o)).flatten }
users[$F[3]]=$F[7]
jobs[$F[3]] << "Job: %s has been running %s" % [$F[0], ts] if (DateTime.now-t).to_f > pn
END{
jobs.map { |id, v|
w1,w2=["is a","job"]
w1,w2=["are","jobs"] if v.length>1
s="Hello #{id}, There #{w1} long running #{w2} running more than the policy of #{pn.to_i} days. Please kill the #{w2} or let us know if we can help. Thank you!\n\t" << v.join("\n\t")
puts "#{users[id]} \n#{s}"
# s is the formatted email address and body. You take it from here...
}
}
' file
gpuml#acepd1671.udp.finco.com
Hello ntoma2, There are long running jobs running more than the policy of 7 days. Please kill the jobs or let us know if we can help. Thank you!
Job: 139341 has been running 11 days, 9 hours, 28 minutes and 44 seconds
Job: 139342 has been running 12 days, 9 hours, 28 minutes and 39 seconds
I got a solution, but there is a bug in it. Here it is:
#!/bin/bash
{ qstat -u \*; /ccore/pbis/bin/enum-members "adsusers"; } | awk -v dt=$(date "--date=$(date) -7 day" +%m/%d/%Y) '
/^User obj/ {
F2 = 1
FS = ":"
T1 = T2 = ""
next
}
!F2 {
if (NR < 3) next
if ($5 ~ "qw" && $6 < dt) JID[$4] = $1 "," JID[$4]
next
}
/^UPN/ {T1 = $2
}
/^Display/ {T2 = $2
}
/^Alias/ {gsub (/ /, _, $2)
EM[$2] = T1
DN[$2] = T2
}
END {for (j in JID) {print "echo -e \"Hello " DN[j] " \\n \\nJob(s) with job id(s): " JID[j] " executing more than last 7 days, hence request you to take action, else job(s) will be killed in another 1 day \\n \\n Thank you.\" | mailx -s \"Long running job for user: " DN[j] " (" j ") and Job ID(s): " JID[j] "\" " EM[j]
}
}
' | sh
The bug in the above code is that the date comparison in the if condition (shown below) does not work as expected. I am really not sure how to compare $6 with the variable dt (both are in mm/dd/yyyy format). I think I should use mktime() or something similar. Can someone please help?
if ($5 ~ "qw" && $6 < dt)
Thank you!!
Vasu
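On the date comparison itself: mm/dd/yyyy strings do not sort chronologically (as strings, "12/31/2016" compares greater than "07/17/2017"), so $6 < dt only behaves while both dates fall in the same year. Reordering the fields to yyyymmdd before comparing, or parsing them into real dates, avoids that. Here is a minimal sketch of the parsing approach in Python, assuming the column layout of the sample file. The name old_jobs_by_user and the fixed `now` value are made up for illustration, and the state column is not filtered here:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def old_jobs_by_user(lines, now, max_age_days=7):
    """Group job ids by user name for jobs started more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    jobs = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # header or short line
        try:
            # columns 6 and 7 hold the date (mm/dd/yyyy) and the time
            started = datetime.strptime(fields[5] + " " + fields[6],
                                        "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # not a job line
        if started < cutoff:
            jobs[fields[3]].append(fields[0])  # user name -> job ids
    return dict(jobs)


sample = [
    "139341 8.61248 python_dev ntoma2 r 07/10/2017 07:27:43 gpuml#acepd1641.udp.finco.com 1",
    "139342 8.61248 python_val ntoma2 r 07/09/2017 07:27:48 gpuml#acepd1611.udp.finco.com 1",
    "139652 8.61248 python_dev ntoma2 r 07/17/2017 10:55:57 gpuml#acepd1671.udp.finco.com 1",
]
# Hypothetical "current" time, fixed so the example is repeatable
print(old_jobs_by_user(sample, now=datetime(2017, 7, 20, 12, 0, 0)))
```

Because the result is keyed by user name, every user ends up with one list of job ids, which fits the "one email per user" requirement.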
I am having a tough time getting the time difference, and the size in a uniform unit such as MB, GB or TB, for the entire backup duration of a client.
Below is my command:
mminfo -v -q "group=testgroup1,savetime>=02/17/2022,savetime<=02/18/2022" -r \
"savetime,level,totalsize,volume,vmname,client,sscreate(20),sscomp(20)" -xc,
-q is for query
-r is to retrieve params from query
group contains the clients
savetime is day of backup
level is level of backup
totalsize is size of backup in bytes
volume is the name of volume where data is stored
vmname gives the name of vm
client gives the name of the client; vmname and client are one and the same, depending on the type of backup.
sscreate(20) gives start time of backup
sscomp(20) gives end time of backup
The output of my command looks like this:
17/02/22,incr,4853101080,volume.001,,testclient1,17/02/22 20:27:18,17/02/22 20:40:45
17/02/22,incr,404305556,volume.001,,testclient1,17/02/22 20:27:15,17/02/22 20:27:34
17/02/22,incr,645786660,volume.001,,testclient1,17/02/22 20:27:17,17/02/22 20:27:30
17/02/22,incr,4,volume.001,,testclient1,17/02/22 20:27:45,17/02/22 20:27:47
17/02/22,incr,4,volume.001,,testclient1,17/02/22 20:27:16,17/02/22 20:27:19
17/02/22,incr,4,volume.001,,testclient1,17/02/22 20:27:46,17/02/22 20:27:48
17/02/22,incr,4,volume.001,,testclient1,17/02/22 20:28:05,17/02/22 20:28:08
17/02/22,incr,4,volume.002,,testclient1,17/02/22 20:27:48,17/02/22 20:27:51
17/02/22,incr,6085356,volume.002,,testclient1,17/02/22 20:42:26,17/02/22 20:42:51
17/02/22,incr,53328,volume.004,,testclient1,17/02/22 20:43:13,17/02/22 20:43:22
17/02/22,incr,4,volume.004,,testclient1,17/02/22 20:27:34,17/02/22 20:27:37
Each line of the output is one entry for a drive/mount point of a backup. I am able to sum up the size of the data backed up per client per day, but I am unable to work out the logic for calculating the backup time difference for a client on one particular day. Can someone help me with this?
The logic I used (awk is where I started filtering the data):
printf "show name\n p type:nsr group\n" |
nsradmin -i - |
grep -v ^$ |
cut -d: -f2- |
cut -d\; -f1 |
sort -u |
perl -pe 's/\ //' |
while read grp;do
mminfo -q "group=$grp,savetime>02/12/2022 16:00,savetime<02/13/2022 16:00,level=full" \
-r "savetime,level,totalsize,volume,vmname,client,sscreate(20),sscomp(20)" \
-xc, 2>/dev/null |
awk -F, 'BEGIN{OFS=FS}{if($5=="")$5=$6;else $5=$5;print}' |
awk -F, '$5 != "vm_name" {a[$5","$2]+=$3;b[$5","$2]++;OFS=FS}
END{for (v in a) print v,b[v],a[v]}'|
while read j;do echo $grp,$j;
done
done
The output from the code looks like this:
testgroup1,testclient1,full,11,65959975044
I suggest folding all the post-processing logic into a single gawk (the standard Linux awk) script.
script.awk
function timestamp(dateStr) {
formatedStr = gensub(/([[:digit:]]{2})\/([[:digit:]]{2})\/([[:digit:]]{2}) ([[:digit:]]{2}):([[:digit:]]{2}):([[:digit:]]{2})/,
"20\\3 \\2 \\1 \\4 \\5 \\6", 1, dateStr);
# from: DD/MM/YY HH:MM:SS
# to: YYYY MM DD HH MM SS
return mktime(formatedStr);
}
BEGIN {
minStartTime = 9999999999999999999999;
}
$5 == "" {
$5 = $6;
}
{
accumulatedDailyStorage[$5","$2] += $3;
accumulatedDailyTime[$5","$2] += (timestamp($8) - timestamp($7));
startTime = timestamp($7);
minStartTime = (startTime < minStartTime) ? startTime : minStartTime ;
endTime = timestamp($8);
maxEndTime = (endTime > maxEndTime) ? endTime : maxEndTime ;
accumulatedDailyCount[$5","$2]++;
}
END {
for (clientName in accumulatedDailyStorage) {
print clientName, accumulatedDailyStorage[clientName], accumulatedDailyCount[clientName], accumulatedDailyTime[clientName], (maxEndTime - minStartTime);
}
}
Output:
awk -F, -f script.awk input.1.txt
testclient1,incr 5909332004 11 889 6666
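For comparison, the same aggregation can be sketched in Python, this time tracking the first start and last end per client rather than globally, and with a helper for the uniform size unit asked about in the question. This is only a sketch under the assumption that every row has the eight comma-separated fields shown in the sample output; summarize and human are hypothetical names, and only three of the sample rows are used:

```python
import csv
from datetime import datetime

FMT = "%d/%m/%y %H:%M:%S"  # layout of sscreate(20)/sscomp(20) in the sample


def summarize(rows):
    """Map (client, level) -> [bytes, entries, summed seconds, first start, last end]."""
    stats = {}
    for rec in csv.reader(rows):
        _, level, size, _, vmname, client, start, end = rec
        key = (vmname or client, level)  # fall back to client when vmname is empty
        t0 = datetime.strptime(start, FMT)
        t1 = datetime.strptime(end, FMT)
        s = stats.setdefault(key, [0, 0, 0, t0, t1])
        s[0] += int(size)
        s[1] += 1
        s[2] += int((t1 - t0).total_seconds())
        s[3] = min(s[3], t0)
        s[4] = max(s[4], t1)
    return stats


def human(nbytes):
    """Format a byte count with a binary unit, capped at TB."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if nbytes < 1024 or unit == "TB":
            return f"{nbytes:.2f} {unit}"
        nbytes /= 1024


rows = [
    "17/02/22,incr,4853101080,volume.001,,testclient1,17/02/22 20:27:18,17/02/22 20:40:45",
    "17/02/22,incr,404305556,volume.001,,testclient1,17/02/22 20:27:15,17/02/22 20:27:34",
    "17/02/22,incr,53328,volume.004,,testclient1,17/02/22 20:43:13,17/02/22 20:43:22",
]
for (client, level), s in summarize(rows).items():
    elapsed = int((s[4] - s[3]).total_seconds())  # wall-clock span of the backup
    print(client, level, human(s[0]), s[1], s[2], elapsed)
```

The summed seconds count overlapping save sets more than once, so the wall-clock span (last end minus first start) is usually the figure wanted for "how long did the backup of this client take".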
I have my command below and I want to have the result on one line with delimiters. My command:
Array=("GET" "POST" "OPTIONS" "HEAD")
echo $(date "+%Y-%m-%d %H:%M")
for i in "${Array[@]}"
do
cat /home/log/myfile_log | grep "$(date "+%d/%b/%Y:%H")"| awk -v last5=$(date --date="-5 min" "+%M") -F':' '$3>=last5 && $3<last5+5{print}' | egrep -a "$i" | wc -l
done
The result is:
2019-01-01 13:27
1651
5760
0
0
I want to have the result below:
2019-01-01 13:27,1651,5760,0,0
It looks (to me) like the overall objective is to scan /home/log/myfile_log for entries that have occurred within the last 5 minutes and which match one of the 4 entries in ${Array[@]}, keeping count of the matches along the way and finally printing the current date and the counts to a single line of output.
I've opted for a complete rewrite that uses awk's abilities of pattern matching, keeping counts and generating a single line of output:
date1=$(date "+%Y-%m-%d %H:%M") # current date
date5=$(date --date="-5 min" "+%M") # date from 5 minutes ago
awk -v d1="${date1}" -v d5="${date5}" -F":" '
BEGIN { keep=0 # init some variables
g=0
p=0
o=0
h=0
}
$3>=d5 && $3<d5+5 { keep=1 } # do we keep processing this line?
!keep { next } # if not then skip to next line
/GET/ { g++ } # increment our counters
/POST/ { p++ }
/OPTIONS/ { o++ }
/HEAD/ { h++ }
{ keep=0 } # reset keep flag for next line
# print results to single line of output
END { printf "%s,%s,%s,%s,%s\n", d1, g, p, o, h }
' <(grep "$(date '+%d/%b/%Y:%H')" /home/log/myfile_log)
NOTE: The OP may need to revisit the <(grep "$(date ...)" /home/log/myfile_log) to handle timestamp periods that span hours, days, months and years, e.g. 14:59 - 15:04, 12/31/2019 23:59 - 01/01/2020 00:04, etc.
Yeah, it's a bit verbose but a bit easier to understand; the OP can rewrite/reduce it as they see fit.
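One way to handle windows that span an hour or day boundary is to parse the full timestamp instead of comparing the minute field. Below is a rough Python sketch of that idea. It assumes Apache-style lines with a bracketed timestamp followed by the quoted request, assumes the log timestamps are in the same time zone as `now`, and counts the request method only (the original greps for the word anywhere in the line). count_methods is a hypothetical name:

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# [dd/Mon/yyyy:HH:MM:SS +zzzz] "METHOD ...
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\] "([A-Z]+) ')
METHODS = ("GET", "POST", "OPTIONS", "HEAD")


def count_methods(lines, now, minutes=5):
    """Return 'date,GET,POST,OPTIONS,HEAD' for lines logged in the last `minutes`."""
    since = now - timedelta(minutes=minutes)
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
        if since <= stamp <= now and m.group(2) in METHODS:
            counts[m.group(2)] += 1
    fields = [now.strftime("%Y-%m-%d %H:%M")] + [str(counts[k]) for k in METHODS]
    return ",".join(fields)


# Made-up log lines that straddle the top of the hour
log = [
    '10.0.0.1 - - [01/Jan/2019:13:58:10 +0000] "GET / HTTP/1.1" 200 1',
    '10.0.0.2 - - [01/Jan/2019:14:01:00 +0000] "POST /x HTTP/1.1" 200 1',
    '10.0.0.3 - - [01/Jan/2019:13:50:00 +0000] "GET / HTTP/1.1" 200 1',
]
print(count_methods(log, now=datetime(2019, 1, 1, 14, 2, 0)))
```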
My server is having unusually high CPU usage, and I can see Apache is using way too much memory.
I have a feeling, I'm being DOS'd by a single IP - maybe you can help me find the attacker?
I've used the following line, to find the 10 most "active" IPs:
cat access.log | awk '{print $1}' |sort |uniq -c |sort -n |tail
The top 5 IPs have about 200 times as many requests to the server, as the "average" user. However, I can't find out if these 5 are just very frequent visitors, or they are attacking the servers.
Is there a way to restrict the above search to a time interval, e.g. the last two hours OR between 10-12 today?
Cheers!
UPDATED 23 OCT 2011 - The commands I needed:
Get entries within last X hours [Here two hours]
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date) print Date FS $4}' access.log
Get most active IPs within the last X hours [Here two hours]
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date) print $1}' access.log | sort |uniq -c |sort -n | tail
Get entries within relative timespan
awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print Date FS Date2 FS $4}' access.log
Get entries within absolute timespan
awk -vDate=`date -d '13:20' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'13:30' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print $0}' access.log
Get most active IPs within absolute timespan
awk -vDate=`date -d '13:20' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'13:30' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print $1}' access.log | sort |uniq -c |sort -n | tail
Yes, there are multiple ways to do this. Here is how I would go about it. For starters, there is no need to pipe the output of cat; just open the log file with awk.
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print Date, $0}' access_log
Assuming your log looks like mine (they're configurable), then the date is stored in field 4 and is bracketed. What I am doing above is finding everything within the last 2 hours. Note the -d'now-2 hours', or translated literally, now minus 2 hours, which for me looks something like this: [10/Oct/2011:08:55:23
So what I am doing is storing the formatted value of two hours ago and comparing it against field four. The conditional expression should be straightforward. I am then printing the Date, followed by the Output Field Separator (OFS, a space in this case), followed by the whole line $0. You could use your previous expression and just print $1 (the IP addresses):
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print $1}' access_log | sort |uniq -c |sort -n | tail
If you wanted to use a range, specify two date variables and construct your expression appropriately.
So if you wanted to find something between 2-4hrs ago, your expression might look something like this:
awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date && $4 < Date2 {print Date, Date2, $4}' access_log
Here is a question I answered regarding dates in bash you might find helpful.
Print date for the monday of the current week (in bash)
Introduction
The accepted answer from matchew is wrong, as Antoine's comment points out, because awk will do alphanumeric comparisons. So if your logfile lists events across the end of one month and the beginning of the next:
[27/Feb/2023:00:00:00
[28/Feb/2023:00:00:00
[01/Mar/2023:00:00:00
awk will consider:
[01/Mar/2023:00:00:00 < [27/Feb/2023:00:00:00 < [28/Feb/2023:00:00:00
Which is wrong! You have to parse the date strings and compare them as dates!!
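The ordering problem is easy to reproduce. A small Python illustration, using the three timestamps above (parse is just a helper name chosen here, and the month abbreviation is assumed to be in the default C/English locale):

```python
from datetime import datetime

stamps = [
    "[27/Feb/2023:00:00:00",
    "[28/Feb/2023:00:00:00",
    "[01/Mar/2023:00:00:00",
]


def parse(stamp):
    """Turn a bracketed Apache timestamp into a datetime object."""
    return datetime.strptime(stamp, "[%d/%b/%Y:%H:%M:%S")


# Plain string ordering puts 01/Mar first, as awk does
print(sorted(stamps))
# Ordering by the parsed date keeps the chronological order
print(sorted(stamps, key=parse))
```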
For this, you can use libraries, depending on the language you use.
I will present two different ways here: one using perl with the Date::Parse library, and another (quicker) one using bash with GNU date.
As this is a common perl task, and because this is not exactly the same as "extract last 10 minutes from logfile" (which is about a span of time up to the end of the logfile), and because I needed it, I (quickly) wrote this:
#!/usr/bin/perl -ws
# This script parses logfiles for a specific period of time
use Date::Parse;

sub usage {
    printf "Usage: %s -s=<start time> [-e=<end time>] <logfile>\n", $0;
    die $_[0] if $_[0];
    exit 0;
}

usage "No start time submitted" unless $s;
my $startim = str2time($s) or die "Can't parse start time '$s'\n";
# Default the end time to now when -e is not given
my $endtim = $e ? str2time($e) : time();

usage "Logfile not submitted" unless $ARGV[0];
open my $in, "<", $ARGV[0] or usage "Can't open '$ARGV[0]' for reading";
$_ = <$in>;
exit unless $_; # empty file
# Determining regular expression, depending on log format
my $logre = qr{^(\S{3}\s+\d{1,2}\s+(\d{2}:){2}\d+)};
$logre = qr{^[^\[]*\[(\d+/\S+/(\d+:){3}\d+\s\+\d+)\]} unless /$logre/;
seek $in, 0, 0; # rewind so that the first line is filtered too
while (<$in>) {
    /$logre/ && do {
        my $ltim = str2time($1);
        print if $endtim >= $ltim && $ltim >= $startim;
    };
};
This could be used like:
./timelapsinlog.pl -s=09:18 -e=09:24 /path/to/logfile
for printing logs between 09h18 and 09h24.
./timelapsinlog.pl -s='2017/01/23 09:18:12' /path/to/logfile
for printing from January 23rd, 9h18'12" up to now.
In order to reduce the perl code, I've used the -s switch to permit auto-assignment of variables from the command line: -s=09:18 will populate a variable $s which will contain 09:18. Take care not to miss the equal sign =, and use no spaces!
Nota: This holds two different kinds of regex for two different log standards. If you require parsing of a different date/time format, either post your own regex or post a sample of a formatted date from your logfile.
^(\S{3}\s+\d{1,2}\s+(\d{2}:){2}\d+) # ^Jan 1 01:23:45
^[^\[]*\[(\d+/\S+/(\d+:){3}\d+\s\+\d+)\] # ^... [01/Jan/2017:01:23:45 +0000]
Quicker** bash version:
In answer to Gilles Quénot's comment, I tried to create a bash version.
As this version seems quicker than the perl version, I post it here:
#!/bin/bash
prog=${0##*/}
usage() {
cat <<EOUsage
Usage: $prog <start date> <end date> <logfile>
All arguments are required. End date could be 'now'.
EOUsage
}
die() {
echo >&2 "ERROR $prog: $*"
exit 1
}
(($#==3))|| { usage; die 'Wrong number of arguments.';}
[[ -f $3 ]] || die "File not found."
# Conversion of argument to EPOCHSECONDS by asking `date` for the two conversions
{
read -r start
read -r end
} < <(
date -f - +%s <<<"$1"$'\n'"$2"
)
# Determining which kind of log format, between "apache logs" and "system logs":
read -r oline <"$3" # read one log line
if [[ $oline =~ ^[^\ ]{3}\ +[0-9]{1,2}\ +([0-9]{2}:){2}[0-9]+ ]]; then
# Look like syslog format
sedcmd='s/^\([^ ]\{3\} \+[0-9]\{1,2\} \+\([0-9]\{2\}:\)\{2\}[0-9]\+\).*/\1/'
elif [[ $oline =~ ^[^\[]+\[[0-9]+/[^\ ]+/([0-9]+:){3}[0-9]+\ \+[0-9]+\] ]]; then
# Look like apache logs
sedcmd='s/^[0-9.]\+ \+[^ ]\+ \+[^ ]\+ \[\([^]]\+\)\].*$/\1/;s/:/ /;y|/|-|'
else
die 'Log format not recognized'
fi
# Print lines beginning with `1<tabulation>`
sed -ne s/^1\\o11//p <(
# paste `bc` tests with log file
paste <(
# bc will do comparison against EPOCHSECONDS returned by date and $start - $end
bc < <(
# Create a bc function for testing against $start - $end.
cat <<EOInitBc
define void f(x) {
if ((x>$start) && (x<$end)) { 1;return ;};
0;}
EOInitBc
# Run sed to extract date strings from logfile, then
# run date to convert string to EPOCHSECONDS
sed "$sedcmd" <"$3" |
date -f - +'f(%s)'
)
) "$3"
)
Explanation
The script runs sed to extract date strings from the logfile.
Pass the date strings to date -f - +%s to convert all strings to EPOCH (Unix timestamp) in one run.
Run bc for the tests: print 1 if start < date < end, or else print 0.
Run paste to merge the bc output with the logfile.
Finally run sed to find lines that match 1<tab>, replace the match with nothing, then print.
So this script forks 5 subprocesses to do dedicated things with specialised tools, but doesn't run a shell loop over each line of the logfile!
** Note:
Of course, this is quicker on my host because I run on a multicore processor, where the tasks run in parallel!!
Conclusion:
This is not a program! This is an aggregation script!
If you consider bash not as a programming language, but as a super language or a tools aggregator, you can harness the full power of all your tools!!
If someone encounters the error awk: invalid -v option, here's a script to get the most active IPs in a predefined time range:
cat <FILE_NAME> | awk '$4 >= "[04/Jul/2017:07:00:00" && $4 < "[04/Jul/2017:08:00:00"' | awk '{print $1}' | sort -n | uniq -c | sort -nr | head -20
A very quick and readable way to do it in Python. This seems to be faster than the bash version. (The computed time is displayed using an internal module which has been stripped from this code.)
./ext_lines.py -v -s 'Feb 12 00:23:00' -e 'Feb 15 00:23:00' -i /var/log/syslog.1
Total time : 445 ms 187 musec
Time per line : 7 musec 58 ns
Number of lines : 63,072
Number of extracted lines : 29,265
I can't compare this code with the daemon.log file used by others... But, here is my config
Operating System: Kubuntu 22.10
KDE Plasma Version: 5.25.5
KDE Frameworks Version: 5.98.0
Qt Version: 5.15.6
Kernel Version: 6.2.0-060200rc8-generic (64-bit)
Graphics Platform: X11
Processors: 16 × AMD Ryzen 7 5700U with Radeon Graphics
Memory: 14.9 GiB of RAM
The essential code could fit in just one line (dts = ...), but to make it more readable it is split into three. It's not only rather fast, it's also very compact :-)
from argparse import ArgumentParser, FileType
from datetime import datetime
from os.path import basename
from sys import argv
from time import mktime, strptime

__version__ = '1.0.0' # Workaround (internal use)
now = datetime.now
progname = basename(argv[0])
# A syslog timestamp such as 'Feb 12 00:23:00' is 15 characters long
TS_LEN = 15

parser = ArgumentParser(description = 'Is Python strptime faster than sed and Perl ?',
                        prog = progname)
parser.add_argument('--version',
                    dest = 'version',
                    action = 'version',
                    version = '{} : {}'.format(progname,
                                               str(__version__)))
parser.add_argument('-i',
                    '--input',
                    dest = 'infile',
                    default = '/var/log/syslog.1',
                    type = FileType('r',
                                    encoding = 'UTF-8'),
                    help = 'Input file (stdin not yet supported)')
parser.add_argument('-f',
                    '--format',
                    dest = 'fmt',
                    default = '%b %d %H:%M:%S',
                    help = 'Date input format')
parser.add_argument('-s',
                    '--start',
                    dest = 'start',
                    default = None,
                    help = 'Starting date : >=')
parser.add_argument('-e',
                    '--end',
                    dest = 'end',
                    default = None,
                    help = 'Ending date : <=')
parser.add_argument('-v',
                    dest = 'verbose',
                    action = 'store_true',
                    default = False,
                    help = 'Verbose mode')
args = parser.parse_args()
verbose = args.verbose
start = args.start
end = args.end
infile = args.infile
fmt = args.fmt

############### Start code ################
lines = tuple(infile)
# Use default values if start or end are undefined
if not start :
    start = lines[0][:TS_LEN]
if not end :
    end = lines[-1][:TS_LEN]
# Convert start and end to timestamp
start = mktime(strptime(start,
                        fmt))
end = mktime(strptime(end,
                      fmt))
# Extract matching lines
t1 = now()
dts = [(x, line) for x, line in [(mktime(strptime(line[:TS_LEN],
                                                  fmt)),
                                  line) for line in lines] if start <= x <= end]
t2 = now()
# Print stats (elapsed time converted from timedelta to milliseconds)
if verbose :
    total_time = 'Total time'
    time_p_line = 'Time per line'
    n_lines = 'Number of lines'
    n_ext_lines = 'Number of extracted lines'
    elapsed_ms = (t2 - t1).total_seconds() * 1000
    print(f'{total_time:<25} : {elapsed_ms} ms')
    print(f'{time_p_line:<25} : {elapsed_ms / len(lines)} ms')
    print(f'{n_lines:<25} : {len(lines):,}')
    print(f'{n_ext_lines:<25} : {len(dts):,}')
# Print extracted lines
print(''.join([x[1] for x in dts]))
To parse the access.log precisely in a specified range, in this case only the last 10 minutes (based on EPOCH, i.e. the number of seconds since 1970/01/01):
Input file:
172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
172.16.0.4 - - [17/Feb/2023:17:25:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
172.16.0.5 - - [17/Feb/2023:17:15:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
Perl's oneliner:
With the reliable Time::Piece time parser, using strptime() to parse the date and strftime() to format the new one. This module is in core (installed by default), which is not the case for the less reliable Date::Parse.
$ perl -MTime::Piece -sne '
BEGIN{
my $t = localtime;
our $now = $t->epoch;
our $monthsRe = join "|", $t->mon_list;
}
m!\[(\d{2}/(?:$monthsRe)/\d{4}:\d{2}:\d{2}:\d{2})\s!;
my $d = Time::Piece->strptime("$1", "%d/%b/%Y:%H:%M:%S");
my $old = $d->strftime("%s");
my $diff = (($now - $old) + $gap);
if ($diff > $min and $diff < $max) {print}
' -- -gap=$({ echo -n "0"; date "+%:::z*3600"; } | bc) \
-min=0 \
-max=600 access.log
Explanation of the arguments: the -gap, -min and -max switches
-gap: $((7*3600)), i.e. 25200 seconds, is the gap from UTC: +7 hours in seconds in my current case 🇹🇭 (Thai TZ) ¹. It is written as { echo -n "0"; date "+%:::z*3600"; } | bc if you have GNU date. If not, use another way to set the gap.
-min: the minimum age, in seconds, of the log lines to print
-max: the maximum age, in seconds, of the log lines to print
To know the gap from UTC, take a look at:
¹
$ LANG=C date
Fri Feb 17 15:50:13 +07 2023
The +07 is the gap.
This way, you can filter to an exact range of seconds with this snippet.
Sample output
172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
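As an aside, when the log line carries its own UTC offset (the +0200 above), the offset can be parsed together with the timestamp, so no separate gap has to be supplied. A hedged Python sketch of that variation, not the Perl method above: recent is a hypothetical helper, `now` is fixed here so the example is repeatable, and a real run would pass datetime.now(timezone.utc):

```python
from datetime import datetime, timezone


def recent(lines, now, min_age=0, max_age=600):
    """Keep lines whose bracketed timestamp is between min_age and max_age seconds old."""
    kept = []
    for line in lines:
        start = line.find("[")
        end = line.find("]", start)
        if start < 0 or end < 0:
            continue
        try:
            # %z consumes the +0200 offset, giving a timezone-aware datetime
            stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue
        age = (now - stamp).total_seconds()
        if min_age < age < max_age:
            kept.append(line)
    return kept


log = [
    '172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123',
    '172.16.0.4 - - [17/Feb/2023:17:25:41 +0200] "GET / HTTP/1.1" 200 123',
    '172.16.0.5 - - [17/Feb/2023:17:15:41 +0200] "GET / HTTP/1.1" 200 123',
]
# 15:50:13 UTC corresponds to 17:50:13 +0200
now = datetime(2023, 2, 17, 15, 50, 13, tzinfo=timezone.utc)
for line in recent(log, now):
    print(line)
```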
My server is having unusually high CPU usage, and I can see Apache is using way too much memory.
I have a feeling, I'm being DOS'd by a single IP - maybe you can help me find the attacker?
I've used the following line, to find the 10 most "active" IPs:
cat access.log | awk '{print $1}' |sort |uniq -c |sort -n |tail
The top 5 IPs have about 200 times as many requests to the server, as the "average" user. However, I can't find out if these 5 are just very frequent visitors, or they are attacking the servers.
Is there are way, to specify the above search to a time interval, eg. the last two hours OR between 10-12 today?
Cheers!
UPDATED 23 OCT 2011 - The commands I needed:
Get entries within last X hours [Here two hours]
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date) print Date FS $4}' access.log
Get most active IPs within the last X hours [Here two hours]
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date) print $1}' access.log | sort |uniq -c |sort -n | tail
Get entries within relative timespan
awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print Date FS Date2 FS $4}' access.log
Get entries within absolute timespan
awk -vDate=`date -d '13:20' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'13:30' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print $0}' access.log
Get most active IPs within absolute timespan
awk -vDate=`date -d '13:20' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'13:30' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print $1}' access.log | sort |uniq -c |sort -n | tail
yes, there are multiple ways to do this. Here is how I would go about this. For starters, no need to pipe the output of cat, just open the log file with awk.
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print Date, $0}' access_log
assuming your log looks like mine (they're configurable) than the date is stored in field 4. and is bracketed. What I am doing above is finding everything within the last 2 hours. Note the -d'now-2 hours' or translated literally now minus 2 hours which for me looks something like this: [10/Oct/2011:08:55:23
So what I am doing is storing the formatted value of two hours ago and comparing against field four. The conditional expression should be straight forward.I am then printing the Date, followed by the Output Field Separator (OFS -- or space in this case) followed by the whole line $0. You could use your previous expression and just print $1 (the ip addresses)
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print $1}' | sort |uniq -c |sort -n | tail
If you wanted to use a range specify two date variables and construct your expression appropriately.
so if you wanted do find something between 2-4hrs ago your expression might looks something like this
awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date && $4 < Date2 {print Date, Date2, $4} access_log'
Here is a question I answered regarding dates in bash you might find helpful.
Print date for the monday of the current week (in bash)
Introduction
As accepted answer from matchew is wrong, regarding Antoine's comment: Because awk will do alphanumeric comparisons. So if you logfile list events across the end and begin of two months:
[27/Feb/2023:00:00:00
[28/Feb/2023:00:00:00
[01/Mar/2023:00:00:00
awk will consider:
[01/Mar/2023:00:00:00 < [27/Feb/2023:00:00:00 < [28/Feb/2023:00:00:00
Wich is wrong! You have to compare date stings!!
For this, you could use libraries. Conforming to the language
you use.
I will present here two different way, one using perl with Date::Parse library, and another (quicker), using bash with GNU/date.
As this is a common perl task
And because this is not exactly same than extract last 10 minutes from logfile where it's about a bunch of time upto the end of logfile.
And because I've needed them, I (quickly) wrote this:
#!/usr/bin/perl -ws
# This script parse logfiles for a specific period of time
sub usage {
printf "Usage: %s -s=<start time> [-e=<end time>] <logfile>\n";
die $_[0] if $_[0];
exit 0;
}
use Date::Parse;
usage "No start time submited" unless $s;
my $startim=str2time($s) or die;
my $endtim=str2time($e) if $e;
$endtim=time() unless $e;
usage "Logfile not submited" unless $ARGV[0];
open my $in, "<" . $ARGV[0] or usage "Can't open '$ARGV[0]' for reading";
$_=<$in>;
exit unless $_; # empty file
# Determining regular expression, depending on log format
my $logre=qr{^(\S{3}\s+\d{1,2}\s+(\d{2}:){2}\d+)};
$logre=qr{^[^\[]*\[(\d+/\S+/(\d+:){3}\d+\s\+\d+)\]} unless /$logre/;
while (<$in>) {
/$logre/ && do {
my $ltim=str2time($1);
print if $endtim >= $ltim && $ltim >= $startim;
};
};
This could be used like:
./timelapsinlog.pl -s=09:18 -e=09:24 /path/to/logfile
for printing logs between 09h18 and 09h24.
./timelapsinlog.pl -s='2017/01/23 09:18:12' /path/to/logfile
for printing from january 23th, 9h18'12" upto now.
In order to reduce perl code, I've used -s switch to permit auto-assignement of variables from commandline: -s=09:18 will populate a variable $s wich will contain 09:18. Care to not miss the equal sign = and no spaces!
Nota: This hold two diffent kind of regex for two different log standard. If you require different date/time format parsing, either post your own regex or post a sample of formatted date from your logfile
^(\S{3}\s+\d{1,2}\s+(\d{2}:){2}\d+) # ^Jan 1 01:23:45
^[^\[]*\[(\d+/\S+/(\d+:){3}\d+\s\+\d+)\] # ^... [01/Jan/2017:01:23:45 +0000]
Quicker** bash version:
Answering to Gilles Quénot's comment, I've tried to create a bash version.
As this version seem quicker than perl version, I post them here:
#!/bin/bash
prog=${0##*/}
usage() {
cat <<EOUsage
Usage: $prog <start date> <end date> <logfile>
Each argument are required. End date could by `now`.
EOUsage
}
die() {
echo >&2 "ERROR $prog: $*"
exit 1
}
(($#==3))|| { usage; die 'Wrong number of arguments.';}
[[ -f $3 ]] || die "File not found."
# Conversion of argument to EPOCHSECONDS by asking `date` for the two conversions
{
read -r start
read -r end
} < <(
date -f - +%s <<<"$1"$'\n'"$2"
)
# Determing wich kind of log format, between "apache logs" and "system logs":
read -r oline <"$3" # read one log line
if [[ $oline =~ ^[^\ ]{3}\ +[0-9]{1,2}\ +([0-9]{2}:){2}[0-9]+ ]]; then
# Look like syslog format
sedcmd='s/^\([^ ]\{3\} \+[0-9]\{1,2\} \+\([0-9]\{2\}:\)\{2\}[0-9]\+\).*/\1/'
elif [[ $oline =~ ^[^\[]+\[[0-9]+/[^\ ]+/([0-9]+:){3}[0-9]+\ \+[0-9]+\] ]]; then
# Look like apache logs
sedcmd='s/^[0-9.]\+ \+[^ ]\+ \+[^ ]\+ \[\([^]]\+\)\].*$/\1/;s/:/ /;y|/|-|'
else
die 'Log format not recognized'
fi
# Print lines begining by `1<tabulation>`
sed -ne s/^1\\o11//p <(
# paste `bc` tests with log file
paste <(
# bc will do comparison against EPOCHSECONDS returned by date and $start - $end
bc < <(
# Create a bc function for testing against $start - $end.
cat <<EOInitBc
define void f(x) {
if ((x>$start) && (x<$end)) { 1;return ;};
0;}
EOInitBc
# Run sed to extract date strings from logfile, then
# run date to convert string to EPOCHSECONDS
sed "$sedcmd" <"$3" |
date -f - +'f(%s)'
)
) "$3"
)
Explanation
Script run sed to extract date strings from logfile
Pass date strings to date -f - +%s to convert in one run all strings to EPOCH (Unix Timestamp).
Run bc for the tests: print 1 if min > date > max or else print 0.
Run paste to merge bc output with logfile.
Finally run sed to find lines that match 1<tab> then replace match with nothing, then print.
So this script will fork 5 subprocess to do dedicated things by specialised tools, but won't do shell loop against each lines of logfile!
** Note:
Of course, this is quicker on my host because I run on a multicore processor, each task run parallelized!!
Conclusion:
This is not a program! This is an aggregation script!
If you consider bash not as a programming language, but as a super language or a tools aggregator, you could take the full power of all your tools!!
If someone encounters with the awk: invalid -v option, here's a script to get the most active IPs in a predefined time range:
cat <FILE_NAME> | awk '$4 >= "[04/Jul/2017:07:00:00" && $4 < "[04/Jul/2017:08:00:00"' | awk '{print $1}' | sort -n | uniq -c | sort -nr | head -20
Very quick and readable way to do it in Python. This seems to be faster than the bash version. (Computed time is displayed using an internal module which has been striped from this code)
./ext_lines.py -v -s 'Feb 12 00:23:00' -e 'Feb 15 00:23:00' -i /var/log/syslog.1
Total time : 445 ms 187 musec
Time per line : 7 musec 58 ns
Number of lines : 63,072
Number of extracted lines : 29,265
I can't compare this code with the daemon.log file used by others... But, here is my config
Operating System: Kubuntu 22.10
KDE Plasma Version: 5.25.5
KDE Frameworks Version: 5.98.0
Qt Version: 5.15.6
Kernel Version: 6.2.0-060200rc8-generic (64-bit)
Graphics Platform: X11
Processors: 16 × AMD Ryzen 7 5700U with Radeon Graphics
Memory: 14.9 GiB of RAM
The essential code could fit in just one line (dts = ...), but to make it more readable it's being "splited" in three. It's not only rather fast, it's also very compact :-)
from argparse import ArgumentParser, FileType
from datetime import datetime
from os.path import basename
from sys import argv, float_info
from time import mktime, localtime, strptime

__version__ = '1.0.0' # Workaround (internal use)
now = datetime.now
progname = basename(argv[0])

parser = ArgumentParser(description = 'Is Python strptime faster than sed and Perl ?',
                        prog = progname)
parser.add_argument('--version',
                    dest = 'version',
                    action = 'version',
                    version = '{} : {}'.format(progname,
                                               str(__version__)))
parser.add_argument('-i',
                    '--input',
                    dest = 'infile',
                    default = '/var/log/syslog.1',
                    type = FileType('r',
                                    encoding = 'UTF-8'),
                    help = 'Input file (stdin not yet supported)')
parser.add_argument('-f',
                    '--format',
                    dest = 'fmt',
                    default = '%b %d %H:%M:%S',
                    help = 'Date input format')
parser.add_argument('-s',
                    '--start',
                    dest = 'start',
                    default = None,
                    help = 'Starting date : >=')
parser.add_argument('-e',
                    '--end',
                    dest = 'end',
                    default = None,
                    help = 'Ending date : <=')
parser.add_argument('-v',
                    dest = 'verbose',
                    action = 'store_true',
                    default = False,
                    help = 'Verbose mode')

args = parser.parse_args()
verbose = args.verbose
start = args.start
end = args.end
infile = args.infile
fmt = args.fmt
############### Start code ################
lines = tuple(infile)

# Use default values if start or end are undefined
# (a syslog timestamp such as 'Feb 12 00:23:00' is 15 characters long)
if not start:
    start = lines[0][:15]
if not end:
    end = lines[-1][:15]

# Convert start and end to timestamps
# (the default format has no year, so strptime assumes 1900 everywhere)
start = mktime(strptime(start, fmt))
end = mktime(strptime(end, fmt))

# Extract matching lines
t1 = now()
dts = [(x, line) for x, line in [(mktime(strptime(line[:15], fmt)),
                                  line) for line in lines]
       if start <= x <= end]
t2 = now()

# Print stats
if verbose:
    total_time = 'Total time'
    time_p_line = 'Time per line'
    n_lines = 'Number of lines'
    n_ext_lines = 'Number of extracted lines'
    # t2 - t1 is a timedelta: convert it to milliseconds before printing
    elapsed_ms = (t2 - t1).total_seconds() * 1000
    print(f'{total_time:<25} : {elapsed_ms:.3f} ms')
    print(f'{time_p_line:<25} : {elapsed_ms / len(lines):.6f} ms')
    print(f'{n_lines:<25} : {len(lines):,}')
    print(f'{n_ext_lines:<25} : {len(dts):,}')

# Print extracted lines
print(''.join([x[1] for x in dts]))
To parse access.log precisely within a specified range, in this case only the last 10 minutes (based on the EPOCH, i.e. the number of seconds since 1970/01/01):
Input file:
172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
172.16.0.4 - - [17/Feb/2023:17:25:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
172.16.0.5 - - [17/Feb/2023:17:15:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
Perl one-liner:
With the reliable Time::Piece time parser, using strptime() to parse the date and strftime() to format the new one. This module is part of the Perl core (installed by default), which is not the case for the less reliable Date::Parse.
$ perl -MTime::Piece -sne '
    BEGIN{
        my $t = localtime;
        our $now = $t->epoch;
        our $monthsRe = join "|", $t->mon_list;
    }
    m!\[(\d{2}/(?:$monthsRe)/\d{4}:\d{2}:\d{2}:\d{2})\s!;
    my $d = Time::Piece->strptime("$1", "%d/%b/%Y:%H:%M:%S");
    my $old = $d->strftime("%s");
    my $diff = (($now - $old) + $gap);
    if ($diff > $min and $diff < $max) {print}
' -- -gap=$({ echo -n "0"; date "+%:::z*3600"; } | bc) \
     -min=0 \
     -max=600 access.log
Explanation of the -gap, -min and -max switches:
-gap: the gap with UTC in seconds. In my current case 🇹🇭 (Thai TZ) ¹ that is +7 hours, so $((7*3600)) aka 25200 seconds. It is written as { echo -n "0"; date "+%:::z*3600"; } | bc, which works if you have GNU date. If not, use another way to set the gap.
-min: the minimum age, in seconds, of the log lines to print
-max: the maximum age, in seconds, of the log lines to print
¹ To find your gap from UTC, look at the output of date:
$ LANG=C date
Fri Feb 17 15:50:13 +07 2023
The +07 is the gap.
This way, you can filter on a range that is exact to the second with this snippet.
Sample output
172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
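A rough Python sketch of the same last-10-minutes filter, for readers who prefer it. This is a different technique from the Perl one-liner above: it parses the +0200 offset stored in each log line with %z, so no separate -gap value is needed. The name recent_lines and its parameters are assumptions made for this sketch.

```python
import re
from datetime import datetime, timezone

# Captures e.g. "17/Feb/2023:17:48:41 +0200" from a combined-format log line.
STAMP_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def recent_lines(lines, now=None, min_age=0, max_age=600):
    """Yield the lines whose age in seconds satisfies min_age < age < max_age.

    recent_lines is a hypothetical helper; now must be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    for line in lines:
        m = STAMP_RE.search(line)
        if not m:
            continue  # no timestamp found, skip the line
        # %z parses the "+0200" offset, giving a timezone-aware datetime.
        stamp = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        age = (now - stamp).total_seconds()
        if min_age < age < max_age:
            yield line
```

For example, recent_lines(open("access.log")) would yield the entries from the last 10 minutes, with min_age and max_age playing the role of -min and -max.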
I am writing a script in BASH that needs to check through log files for ERROR entries. I plan to run this as an hourly cron job, so I want it to return only the ERROR type entries that occurred within the last hour (all server times are GMT). I establish the following variables:
# Log file directory
LOGPATH="/path/to/logs/"
# Current date and time
CURDATE=`date +%Y-%m-%d`
CURTIME=`date +%H:%M:%S`
# Old date and time
OLDDATE=`date +%Y-%m-%d -d "1 hour ago"`
OLDTIME=`date +%H:%M:%S -d "1 hour ago"`
All log files adhere to the file name format ktYEAR-MONTH-DAY.root.log.txt, where YEAR/MONTH/DAY are replaced with the date on which the entries were recorded. So for instance, today's log file would be kt2011-08-15.root.log.txt. An example entry of the contents is:
2011-08-15 | 19:30:02 | ERROR | 18333 | 337 | n/a | dms | default | error | XMLRPC Lucene - addDocument - Reason: Failed to parse XML-RPC request: An invalid XML character (Unicode: 0xb) was found in the element content of the document.
The columns of interest are the 1st, 2nd, 3rd (the value may be "INFO", "DEBUG", etc., but I am only interested when "ERROR" is the value) and the last column, which is the body of the log message.
What I am trying to accomplish is having this BASH script parse through the file(s) that have entries spanning the last hour of activity (as defined in the 1st and 2nd columns), and if the 3rd column contains the string "ERROR", then display the right-most column's contents. My confusion comes when trying to determine how to parse through the log file(s) based on $CURTIME and $OLDTIME, made worse when midnight comes and I then have to search through the previous day's log file. I would prefer not to do a blanket grep-style search through all the log files, as the quantity and size can be excessive, but if that's how it has to be done, then so be it.
awk -F ' \\| ' -v "d=$(date -d "1 hour ago" -u +%Y-%m-%d#%H:%M:%S)" '$3 == "ERROR" && $1"#"$2 > d'
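The awk one-liner works because a "YYYY-MM-DD#HH:MM:SS" string sorts in chronological order. A small Python sketch of the same comparison, assuming the 10-column layout of the sample entry and using a made-up function name (recent_errors); unlike the awk command, it returns only the message column, which is what the question asks to display:

```python
from datetime import datetime, timedelta, timezone


def recent_errors(lines, now=None):
    """Return the message bodies of ERROR entries from the last hour (UTC).

    recent_errors is a hypothetical helper; the 10-column " | " layout is
    assumed from the sample entry in the question.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=1)).strftime("%Y-%m-%d#%H:%M:%S")
    bodies = []
    for line in lines:
        # At most 9 splits, so " | " inside the message body is preserved.
        fields = line.rstrip("\n").split(" | ", 9)
        if len(fields) < 10:
            continue  # not a complete log entry
        # Date and time in this format compare correctly as plain strings.
        if fields[2] == "ERROR" and f"{fields[0]}#{fields[1]}" > cutoff:
            bodies.append(fields[9])
    return bodies
```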
This is as simple as doing string comparison in awk. When you pass midnight, simply add the $OLDDATE file to the search:
if [ "$CURDATE" != "$OLDDATE" ]; then
    cat "kt$OLDDATE.root.log.txt" "kt$CURDATE.root.log.txt"
else
    cat "kt$CURDATE.root.log.txt"
fi | awk -F "|" -v olddate=$OLDDATE -v oldtime=$OLDTIME -v curdate=$CURDATE 'BEGIN{olddate=olddate " "; curdate = curdate " "; oldtime = " " oldtime " "}
    $1 == olddate && $2 >= oldtime && $3 == " ERROR "{print $0}
    $1 > olddate && $3 == " ERROR "{print $0}'
Can be combined with glenn's solution to be much shorter.
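The midnight handling boils down to picking one or two file names. A minimal Python sketch of that selection, assuming the ktYEAR-MONTH-DAY.root.log.txt naming from the question; the function name log_files_for_last_hour is made up, and the default directory is the placeholder used in the question:

```python
from datetime import datetime, timedelta, timezone


def log_files_for_last_hour(now=None, logpath="/path/to/logs/"):
    """Return the log file names that can hold entries from the last hour.

    log_files_for_last_hour is a hypothetical helper; server time is
    assumed to be GMT, as stated in the question.
    """
    now = now or datetime.now(timezone.utc)
    old = now - timedelta(hours=1)
    names = []
    # The set holds two dates only when the one-hour window crosses midnight.
    for day in sorted({old.date(), now.date()}):
        names.append(f"{logpath}kt{day:%Y-%m-%d}.root.log.txt")
    return names
```

The resulting list can then be read in order and filtered for ERROR entries, so the previous day's file is only opened during the first hour after midnight.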