I have been gathering data for the last 20 days using a bash script that runs every 5 minutes. I started the script with no idea how I was going to output the data. I have since found a rather cool js graph that reads from a CSV.
The only issue is that my date is currently in the format:
Fri Nov 6 07:52:02
and for the CSV I need it to be
2015-11-06 07:52:02
So I need to cat my results, grep for the date, and convert it.
The cat/grep for the date is:
cat speeds.txt | grep 2015 | awk '{print $1" "$2" "$3" "$4}'
Any brainwaves on how I can switch this around either using bash or php?
Thanks
PS - Starting the checks again using date +%Y%m%d" "%H:%M:%S is sadly not an option :(
Assuming all of your lines contain dates:
$ cat file
Fri Nov 6 07:52:02
...
$ awk 'BEGIN {
    months["Jan"] = 1;
    months["Feb"] = 2;
    months["Mar"] = 3;
    months["Apr"] = 4;
    months["May"] = 5;
    months["Jun"] = 6;
    months["Jul"] = 7;
    months["Aug"] = 8;
    months["Sep"] = 9;
    months["Oct"] = 10;
    months["Nov"] = 11;
    months["Dec"] = 12;
}
{
    month = months[$2];
    printf("%s-%02d-%02d %s\n", 2015, month, $3, $4);
}' file > out
$ cat out
2015-11-06 07:52:02
...
If you only need to modify some of the lines, you can tweak the awk script a little, e.g. match every line containing 2015:
...
# Match every line containing 2015
/2015/ {
    month = months[$2];
    printf("%s-%02d-%02d %s\n", 2015, month, $3, $4);
    # Use 'next' to keep the final print below from also firing for these lines,
    # like 'continue' in a while loop
    next;
}
# This bare '1' prints all other lines as well;
# same as writing { print $0 }
1
You can convert the date to epoch time and back in a bash script:
date -d 'Fri Nov 6 07:52:02' +%s;
1446776522
date -d @1446776522 +"%Y-%m-%d %T"
2015-11-06 07:52:02
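Wrapped around the grep from the question, this becomes (a sketch assuming GNU date and input lines shaped like the sample; the @ prefix tells date the value is an epoch timestamp, and GNU date assumes the current year when parsing, which was 2015 at the time):
grep 2015 speeds.txt | awk '{print $1" "$2" "$3" "$4}' | while read -r d; do
    date -d "@$(date -d "$d" +%s)" +"%Y-%m-%d %T"
done
The round trip through %s just mirrors the answer; a single date -d "$d" +"%Y-%m-%d %T" would do the same in one step.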
Since you didn't provide the input, I'll assume you have a file called speeds.txt that contains:
Fri Oct 31 07:52:02 3
Fri Nov 1 08:12:04 4
Fri Nov 2 07:43:22 5
(the 3, 4, and 5 above are just to show that you could have other data in the row, but are not necessary).
Using this command:
cat speeds.txt | cut -d ' ' -f2,3,4 | while read line ; do date -d"$line" "+2015-%m-%d %H:%M:%S" ; done;
You get the output:
2015-10-31 07:52:02
2015-11-01 08:12:04
2015-11-02 07:43:22
Related
Hey guys, I have been working on a problem where I have a CSV file and have to filter out specific months based on input from the user.
Record Format
firstName,lastName,YYYYMMDD
But the thing is, the input is a month name (a string) while the month in the file is numeric.
For example
> cat guest.csv
Micheal,Scofield,20000312
Lincon,Burrows,19981009
Sara,Tancredi,20040923
Walter,White,20051024
Barney,Stinson,20041230
Ted,Mosbey,20031126
Eric,Forman,20070430
Jake,Peralta,20030808
Amy,Santiago,19990405
Colt,Bennett,19990906
> ./list.sh March guest.csv
Micheal,Scofield,20000312
Oneliner:
MONTH=March; REGEX=`date -d "1 ${MONTH} 2022" +%m..$`; grep $REGEX guest.csv
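To see what the one-liner builds: for MONTH=March, date prints 03, so REGEX becomes 03..$, i.e. the month digits followed by the two day digits at the end of each record:
> date -d "1 March 2022" +%m..$
03..$
> grep "03..$" guest.csv
Micheal,Scofield,20000312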
Awk can easily translate month names to numbers and do the filtering.
awk -v month="March" -F , '
BEGIN { split("January February March April May June July August September October November December", mon, " ");
for(i=1; i<=12; i++) mm[i] = mon[i] }
mm[0 + substr($3, 5, 2)] == month' guest.csv
The BEGIN block sets up a pair of arrays (mon from split, then mm built from it) so the main script can look up a month name by its number. Change -v month="April" to search for a different month.
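For instance, against the sample guest.csv above, running the same script with -v month="April" should print the two April rows (assuming the data shown):
> awk -v month="April" -F , '...same script as above...' guest.csv
Eric,Forman,20070430
Amy,Santiago,19990405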
If you want to wrap this in a shell script, you can easily parse out the arguments into variables:
#!/bin/sh
monthname=$1
shift
awk -v month="$monthname" -F , '
BEGIN { split("January February March April May June July August September October November December", mon, " ");
for(i=1; i<=12; i++) mm[i] = mon[i] }
mm[0 + substr($3, 5, 2)] == month' "$@"
I need to extract all the text between two dates, given in the following manner (the format for the markers below is: Month Day Hour):
start_marker: "Jul 3 2"
end_marker: "Jul 3 7"
from a log file that has data in the following example format
<unneeded text>
Fri Jul 3 2:51:54:780 2020
<needed text>
<needed text>
<needed text>
Fri Jul 3 5:51:54:780 2020
<needed text>
<needed text>
Fri Jul 3 7:51:54:780 2020
<unneeded text>
I am trying the below script but it returns a blank log_collector file
start_month="Jul"
start_date="3"
start_hour="2"
end_month="Jul"
end_date="3"
end_hour="7"
start_marker="$start_month $start_date $start_hour"
end_marker="$end_month $end_date $end_hour"
sed -n '/"$start_marker"/,/"$end_marker"/p' logfile >> "log_collector"
cat log_collector
Use double quotes when using variables with sed, otherwise sed won't expand your variables; right now your script is executed exactly as written, as the trace of your example shows:
+ start_month=Jul
+ start_date=3
+ start_hour=2
+ end_month=Jul
+ end_date=3
+ end_hour=7
+ start_marker='Jul 3 2'
+ end_marker='Jul 3 7'
+ sed -n '/"$start_marker"/,/"$end_marker"/p' logfile
+ cat log_collector
...empty file
Instead try:
sed -n "/${start_marker}/,/${end_marker}/p" logfile >> "log_collector"
Result:
+ variables...
+ sed -n '/Jul 3 2/,/Jul 3 7/p' logfile
+ cat log_collector
Fri Jul 3 2:51:54:780 2020
text...
And your script will now expand the variables as you want.
But I really don't see the point of using start_* and end_* variables when you use *_marker for the same values; maybe it was just a bad/confusing example :)
Hint: Launch your script with 'bash -x' or add 'set -x' and you will see how the script is actually executed.
Edit: I see Bill Jetzer was faster in the comments; still, see the examples above.
FWIW I'd use a flag (inRange below) instead of a range (which rules out sed, since it doesn't have variables) and only check for the date/time markers on lines that look like your date/time lines (hence the longish regexp below):
$ cat tst.awk
BEGIN { FS = "[[:space:]:]+" }
/^([[:upper:]][[:lower:]]{2} +){2}[0-9]{1,2} +([0-9]{1,2}:){3}[0-9]{3} +[0-9]{4} *$/ {
marker = $2" "$3" "$4
}
marker == start_marker { inRange = 1 }
inRange { print }
marker == end_marker { inRange = 0 }
$ awk -v start_marker='Jul 3 2' -v end_marker='Jul 3 7' -f tst.awk file
Fri Jul 3 2:51:54:780 2020
<needed text>
<needed text>
<needed text>
Fri Jul 3 5:51:54:780 2020
<needed text>
<needed text>
Fri Jul 3 7:51:54:780 2020
See Is a /start/,/end/ range expression ever useful in awk? for why I wouldn't use a range expression (/start/,/end/).
I'm trying to get the last x minutes of logs from /var/log/maillog from a remote host (I'm using this script within icinga2) but having no luck.
I have tried a few combinations of awk, sed, and grep but none have seemed to work. I thought it was an issue with double quotes vs single quotes but I played around with them and nothing helped.
host=$1
LOG_FILE=/var/log/maillog
hour_segment=$(ssh -o 'StrictHostKeyChecking=no' myUser@${host} 2>/dev/null "sed -n "/^$(date --date='10 minutes ago' '+%b %_d %H:%M')/,\$p" ${LOG_FILE}")
echo "${hour_segment}"
When running the script with bash -x, I get the following output:
bash -x ./myScript.sh host.domain
+ host=host.domain
+ readonly STATE_OK=0
+ STATE_OK=0
+ readonly STATE_WARN=1
+ STATE_WARN=1
+ LOG_FILE=/var/log/maillog
+++ date '--date=10 minutes ago' '+%b %_d %H:%M'
++ ssh -o StrictHostKeyChecking=no myUser@host.domain 'sed -n /^Jan' 8 '12:56/,$p /var/log/maillog'
+ hour_segment=
+ echo ''
Maillog log file output: I'd like $hour_segment to look like the output below as well, so I can apply filters to it:
head -n 5 /var/log/maillog
Jan 6 04:03:36 hostname imapd: Disconnected, ip=[ip_address], time=5
Jan 6 04:03:36 hostname postfix/smtpd[9501]: warning: unknown[ip_address]: SASL LOGIN authentication failed: authentication failure
Jan 6 04:03:37 hostname imapd: Disconnected, ip=[ip_address], time=5
Jan 6 04:03:37 hostname postfix/smtpd[7812]: warning: unknown[ip_address]: SASL LOGIN authentication failed: authentication failure
Jan 6 04:03:37 hostname postfix/smtpd[7812]: disconnect from unknown[ip_address]
Using GNU awk's time functions:
$ awk '
BEGIN {
    m["Jan"]=1            # convert month abbreviations to numbers
    # ... fill in the rest of the months ...
    m["Dec"]=12
    nowy=strftime("%Y")   # assume current year, deal with Dec/Jan below
    nowm=strftime("%b")   # get the month, see above comment
    nows=strftime("%s")   # current epoch time
}
{   # below we build the datespec for mktime
    dt=(nowm=="Jan" && $1=="Dec"?nowy-1:nowy) " " m[$1] " " $2 " " gensub(/:/," ","g",$3)
    if(mktime(dt)>=nows-600)  # if timestamp is at most 600 secs old
        print                 # print it
}' file
Current year is assumed. If it's January and log has Dec we subtract one year from mktime's datespec: (nowm=="Jan" && $1=="Dec"?nowy-1:nowy). Datespec: Jan 6 04:03:37 -> 2019 1 6 04 03 37 and for comparison in epoch form: 1546740217.
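You can check the conversion in isolation (GNU awk; note that mktime interprets the datespec in the local timezone, so the exact epoch value may differ on your machine):
$ gawk 'BEGIN { print mktime("2019 1 6 04 03 37") }'
1546740217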
Edit: As no one implemented my specs from the comments, I'll do it myself. tac outputs the file in reverse, and the awk prints records while they are within the given time frame (between now-t and now, or in the future) and exits once it meets a date outside of it:
$ tac file | awk -v t=600 ' # time frame in seconds goes here
BEGIN {
    m["Jan"]=1
    # ... add more months ...
    m["Dec"]=12
    nowy=strftime("%Y")
    nowm=strftime("%b")
    nows=strftime("%s")
} {
    dt=(nowm=="Jan" && $1=="Dec"?nowy-1:nowy) " " m[$1] " " $2 " " gensub(/:/," ","g",$3)
    if(mktime(dt)<nows-t)  # comparison flipped vs. the first version
        exit
    else
        print
}'
Coming up with a robust solution that is 100% bulletproof is very hard, since we are missing the most crucial information: the year.
Imagine you want the last 10 minutes of available data on March 01 2020 at 00:05:00. This is a bit annoying since February 29 2020 exists. But in 2019, it does not.
I present here an ugly solution that only looks at the third field (the time) and I will make the following assumptions:
The log-file is sorted by time
There is at least one log every single day!
Under these conditions we can keep track of a sliding window starting from the first available time.
If you save the following in a file extractLastLog.awk
{ t=substr($3,1,2)*3600 + substr($3,4,2)*60 + substr($3,7,2) + offset}
(t < to) { t+=86400; offset+=86400 }
{ to = t }
(NR==1) { startTime = t; startIndex = NR }
{ a[NR]=$0; b[NR]=t }
{ while ( startTime+timeSpan*60 <= t ) {
delete a[startIndex]
delete b[startIndex]
startIndex++; startTime=b[startIndex]
}
}
END { for(i=startIndex; i<=NR; ++i) print a[i] }
then you can extract the last 23 minutes in the following way:
awk -f extractLastLog.awk -v timeSpan=23 logfile.log
The second condition I gave (there is at least one log every single day!) is needed to avoid messed-up results. In the above code, I compute the time fairly simply as HH*3600 + MM*60 + SS + offset. But I assert that if the current time is smaller than the previous time, we are on a different day, hence we update the offset by 86400 seconds. So if you have two entries like:
Jan 09 12:01:02 xxx
Jan 10 12:01:01 xxx
it will work, but this
Jan 09 12:01:00 xxx
Jan 10 12:01:01 xxx
will not work. It will not realize the day changed. Other cases that will fail are:
Jan 08 12:01:02 xxx
Jan 10 12:01:01 xxx
as it does not know that it jumped two days. Corrections for this are not easy due to the months (all thanks to leap years).
As I said, it's ugly, but might work.
I need to split a large syslog file that runs from October 2015 to February 2016 into separate files by month. Due to background log retention, the format of these logs is similar to:
Oct 21 08:00:00 - Log info
Nov 16 08:00:00 - Log Info
Dec 25 08:00:00 - Log Info
Jan 11 08:00:00 - Log Info
Feb 16 08:00:00 - Log Info
This large file is the result of an initial zgrep search across a large amount of log files split by day. Example being, user activity on a network across multiple services such as Windows/Firewall/Physical access logs.
For a previous request, I used the following:
gawk 'BEGIN{
    m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",mth,"|")
}
{
    for(i=1;i<=m;i++){ if (mth[i]==$1){ month = i } }
    tt="2015 "month" "$2" 00 00 00"
    date=strftime("%Y%m",mktime(tt))
    print $0 > FILENAME"."date".txt"
}
' logfile
Output file examples (note: sometimes I add "%d" to get the day as well, but not this time):
Test.201503.txt
Test.201504.txt
Test.201505.txt
Test.201506.txt
This script, however, hardcodes 2015 into the output log file name. What I attempted, and failed to do, was a script that numbers the months 1-12 and sets 2015 as variable (a) and 2016 as variable (b). Going through the months in order 10, 11, 12, 1, 2, once it reaches a month number smaller than the previous one (1 < 12), it would know to use 2016 instead of 2015. Odd request, I know, but any ideas would at least help me get into the right mindset.
You could use date to parse the date and time. E.g.
#!/bin/bash
while IFS=- read -r time info; do
mon=$(date --date "$time" +%m | sed 's/^0//')
if (( mon < 10 )); then
year=2016
else
year=2015
fi
echo $time - $info >> Test.$year$(printf "%02d" $mon).txt
done
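A possible invocation, assuming the script is saved as split.sh (a name I'm making up here) and the log is fed on stdin:
chmod +x split.sh
./split.sh < logfile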
Here is a gawk solution based on your script and your observation in the question. The idea is to detect a new year when the number of the month suddenly gets smaller, eg from 12 to 1. (Of course that will not work if the log has Jan 2015 directly followed by Jan 2016.)
script.awk
BEGIN { START_YEAR= 2015
# configure months and a mapping month -> nr, e.g. "Feb" |-> "02"
split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",monthNames,"|")
for( nr in monthNames) { month2Nr[ monthNames[ nr ] ] = sprintf("%02d", nr ) }
yearCounter=0
}
{
currMonth = month2Nr[ $1 ]
# detect a jump to the next year by a reset in the month number
if( prevMonth > currMonth) { yearCounter++ }
newFilename = sprintf("%s.%d%s.txt", FILENAME, (START_YEAR + yearCounter), currMonth)
prevMonth = currMonth
print $0 > newFilename
}
Use it like this: awk -f script.awk logfile
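With the five sample lines above and an input file literally named logfile, the month numbers run 10, 11, 12, 01, 02, so the year counter bumps once at the rollover and you get:
logfile.201510.txt
logfile.201511.txt
logfile.201512.txt
logfile.201601.txt
logfile.201602.txt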
I have an Apache access.log file, which is around 35 GB in size. Grepping through it is no longer an option without a great deal of waiting.
I wanted to split it into many small files, using the date as the splitting criterion.
Date is in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do it using only bash scripting, standard text manipulation programs (grep, awk, sed, and the like), piping and redirection?
Input file name is access.log. I'd like output files to have format such as access.apache.15_Oct_2011.log (that would do the trick, although not nice when sorting.)
One way using awk:
awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = m[array[2]]
    print > FILENAME"-"year"_"month".txt"
}' incendiary.ws-2010
This will output files like:
incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt
Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.
Original inspiration: "How to split existing apache logfile by month?"
Pure bash, making one pass through the access log:
while read -r; do
[[ $REPLY =~ \[(..)/(...)/(....): ]]
d=${BASH_REMATCH[1]}
m=${BASH_REMATCH[2]}
y=${BASH_REMATCH[3]}
#printf -v fname "access.apache.%s_%s_%s.log" ${BASH_REMATCH[@]:1:3}
printf -v fname "access.apache.%s_%s_%s.log" $y $m $d
echo "$REPLY" >> $fname
done < access.log
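For the sample timestamp from the question, [15/Oct/2011:12:02:02 +0000], this appends to a file named:
access.apache.2011_Oct_15.log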
Here is an awk version that outputs lexically sortable log files.
Some efficiency enhancements: all done in one pass, only generate fname when it is not the same as before, close fname when switching to a new file (otherwise you might run out of file descriptors).
awk -F"[]/:[]" '
BEGIN {
m2n["Jan"] = 1; m2n["Feb"] = 2; m2n["Mar"] = 3; m2n["Apr"] = 4;
m2n["May"] = 5; m2n["Jun"] = 6; m2n["Jul"] = 7; m2n["Aug"] = 8;
m2n["Sep"] = 9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
if($4 != pyear || $3 != pmonth || $2 != pday) {
pyear = $4
pmonth = $3
pday = $2
if(fname != "")
close(fname)
fname = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
}
print > fname
}' access-log
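For the same sample timestamp, this produces lexically sortable names such as:
access_2011_10_15.log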
Perl came to the rescue:
cat access.log | perl -n -e'm#\[(\d{1,2})/(\w{3})/(\d{4}):#; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;'
Well, it's not exactly a "standard" manipulation program, but it's made for text manipulation nevertheless.
I've also changed order of arguments in file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.
I combined Theodore's and Thor's solutions to use Thor's efficiency improvement and daily files, but retain the original support for IPv6 addresses in combined format file.
awk '
BEGIN {
m2n["Jan"] = 1; m2n["Feb"] = 2; m2n["Mar"] = 3; m2n["Apr"] = 4;
m2n["May"] = 5; m2n["Jun"] = 6; m2n["Jul"] = 7; m2n["Aug"] = 8;
m2n["Sep"] = 9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
split($4, a, "[]/:[]")
if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
pyear = a[4]
pmonth = a[3]
pday = a[2]
if(fname != "")
close(fname)
fname = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
}
print >> fname
}'
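And with the sample timestamp, this variant writes dash-separated daily files named like:
access_2011-10-15.log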
Kind of ugly, that's bash for you:
for year in 2010 2011 2012; do
for month in jan feb mar apr may jun jul aug sep oct nov dec; do
for day in 1 2 3 4 5 6 7 8 9 10 ... 31 ; do
cat access.log | grep -i $day/$month/$year > $day-$month-$year.log
done
done
done
I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.
#!/usr/bin/awk -f
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])
    current = year "-" month
    if (last != current)
        print current   # progress: echo each new month once as it is reached
    last = current
    print >> FILENAME "-" year "-" month ".txt"
}
Also I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.
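A macOS run might then look like this (assuming the script above is saved as split.awk, a name I'm choosing here):
brew install gawk
gawk -f split.awk access.log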