Split access.log file by dates using command line tools - bash

I have an Apache access.log file, which is around 35GB in size. Grepping through it is no longer an option without a long wait.
I want to split it into many smaller files, using the date as the splitting criterion.
The date is in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do it using only bash scripting, standard text manipulation programs (grep, awk, sed, and the like), piping and redirection?
The input file name is access.log. I'd like the output files to have a format such as access.apache.15_Oct_2011.log (that would do the trick, although it's not nice when sorting).

One way using awk:
awk 'BEGIN {
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
for (a = 1; a <= 12; a++)
m[months[a]] = sprintf("%02d", a)
}
{
split($4,array,"[:/]")
year = array[3]
month = m[array[2]]
print > (FILENAME "-" year "_" month ".txt")
}' incendiary.ws-2010
This will output files like:
incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt
Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.
Original inspiration: "How to split existing apache logfile by month?"

Pure bash, making one pass through the access log:
while IFS= read -r; do
  [[ $REPLY =~ \[(..)/(...)/(....): ]]
  d=${BASH_REMATCH[1]}
  m=${BASH_REMATCH[2]}
  y=${BASH_REMATCH[3]}
  #printf -v fname "access.apache.%s_%s_%s.log" "${BASH_REMATCH[@]:1:3}"
  printf -v fname "access.apache.%s_%s_%s.log" "$y" "$m" "$d"
  printf '%s\n' "$REPLY" >> "$fname"
done < access.log

Here is an awk version that outputs lexically sortable log files.
Some efficiency enhancements: all done in one pass, only generate fname when it is not the same as before, close fname when switching to a new file (otherwise you might run out of file descriptors).
awk -F"[]/:[]" '
BEGIN {
m2n["Jan"] = 1; m2n["Feb"] = 2; m2n["Mar"] = 3; m2n["Apr"] = 4;
m2n["May"] = 5; m2n["Jun"] = 6; m2n["Jul"] = 7; m2n["Aug"] = 8;
m2n["Sep"] = 9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
if($4 != pyear || $3 != pmonth || $2 != pday) {
pyear = $4
pmonth = $3
pday = $2
if(fname != "")
close(fname)
fname = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
}
print > fname
}' access-log

Perl came to the rescue:
cat access.log | perl -n -e'm#\[(\d{1,2})/(\w{3})/(\d{4}):#; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;'
Well, it's not exactly a "standard" text manipulation program, but it's made for text manipulation nevertheless.
I've also changed the order of the date parts in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.

I combined Theodore's and Thor's solutions to use Thor's efficiency improvement and daily files, while retaining the original support for IPv6 addresses in combined-format log files.
awk '
BEGIN {
m2n["Jan"] = 1; m2n["Feb"] = 2; m2n["Mar"] = 3; m2n["Apr"] = 4;
m2n["May"] = 5; m2n["Jun"] = 6; m2n["Jul"] = 7; m2n["Aug"] = 8;
m2n["Sep"] = 9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
split($4, a, "[]/:[]")
if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
pyear = a[4]
pmonth = a[3]
pday = a[2]
if(fname != "")
close(fname)
fname = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
}
print >> fname
}'
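As a rough sketch, the same single-pass idea can be wrapped in a shell function that takes the log file as an argument. The function name split_log_by_day and the optional output prefix are my own assumptions, not part of the answers above. It also splits the timestamp on "[:/]" and strips the leading bracket, which is a substitution for the "[]/:[]" separator used above.

```shell
# split_log_by_day LOGFILE [PREFIX]
# Hypothetical helper: writes PREFIXyyyy-mm-dd.log files into the current directory.
# PREFIX defaults to "access_". Output is appended, so remove old files before re-running.
split_log_by_day() {
    awk -v prefix="${2:-access_}" '
    BEGIN {
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= n; i++)
            m2n[names[i]] = i
    }
    {
        # $4 looks like "[15/Oct/2011:12:02:02"; IPv6 client addresses stay in $1
        split($4, a, "[:/]")
        day = substr(a[1], 2)              # drop the leading "["
        key = a[3] a[2] day
        if (key != prev) {                 # only rebuild the name when the date changes
            if (fname != "")
                close(fname)               # avoid running out of file descriptors
            fname = sprintf("%s%04d-%02d-%02d.log", prefix, a[3], m2n[a[2]], day)
            prev = key
        }
        print >> fname
    }' "$1"
}
```

Called as split_log_by_day access.log, this should produce files such as access_2011-10-15.log.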

Kind of ugly, that's bash for you:
for year in 2010 2011 2012; do
  for month in jan feb mar apr may jun jul aug sep oct nov dec; do
    for day in $(seq -w 1 31); do
      # Anchor on "[" and ":" so that day 01 does not also match 11, 21 or 31
      grep -i "\[$day/$month/$year:" access.log > "$day-$month-$year.log"
    done
  done
done

I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.
#!/usr/bin/awk -f
BEGIN {
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
for (a = 1; a <= 12; a++)
m[months[a]] = a
}
{
split($4, array, "[:/]")
year = array[3]
month = sprintf("%02d", m[array[2]])
current = year "-" month
if (last != current)
print current
last = current
print >> (FILENAME "-" year "-" month ".txt")
}
Also I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.

Related

Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" "$2"
then
return 1
else
echo "$1" >> "$2"
fi
Example usage
append 'sometext\nthat spans\n\tmutliple lines' ~/textfile.txt
I am on macOS, by the way, which has caused problems with some of the solutions I've seen posted elsewhere, as they are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
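For completeness, here is a minimal sketch of that snippet wrapped in a full function. The name append_if_missing is an assumption (the question calls its function append), and treating a missing file as empty is my own choice rather than something the answer specifies.

```shell
# append_if_missing TEXT FILE
# Appends TEXT (which may span several lines) to FILE unless FILE already contains it.
# Returns 1 when the text was already present, 0 when it was appended.
append_if_missing() {
    local text=$1 file=$2 contents=
    if [ -f "$file" ]; then
        # read exits non-zero at end of file when no NUL is found; that is expected here
        IFS= read -r -d '' contents < "$file" || true
    fi
    if [[ $contents == *"$text"* ]]; then
        return 1
    fi
    printf '%s\n' "$text" >> "$file"
}
```

Note that '\n' inside single quotes is a literal backslash and n, not a newline. To pass real newlines and tabs, use ANSI-C quoting, for example append_if_missing $'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt.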
Using awk:
awk '
BEGIN {
n = 0 # length of pattern in lines
m = 0 # number of matching lines
}
NR == FNR {
pat[n++] = $0
next
}
{
if ($0 == pat[m])
m++
else if (m > 0 && $0 == pat[0])
m = 1
else
m = 0
}
m == n {
exit
}
END {
if (m < n) {
for (i = 0; i < n; i++)
print pat[i] >>FILENAME
}
}
' - "$2" <<EOF
$1
EOF
If necessary, one would need to properly escape any metacharacters inside FS | OFS:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

Change date format - bash or php

I have been gathering data for the last 20 days using a bash script that runs every 5 minutes. I started the script with no idea how I was going to output the data. I have since found a rather cool js graph that reads from a CSV.
Only issue is my date is currently in the format of:
Fri Nov 6 07:52:02
and for the CSV I need it to be
2015-11-06 07:52:02
So I need to cat my results grep-ing for the date and convert it.
The cat/grep for the date is:
cat speeds.txt | grep 2015 | awk '{print $1" "$2" "$3" "$4}'
Any brainwaves on how I can switch this around either using bash or php?
Thanks
PS - Starting the checks again using date +%Y%m%d" "%H:%M:%S is sadly not an option :(
Assuming all of your lines contain dates:
$ cat file
Fri Nov 6 07:52:02
...
$ awk 'BEGIN {
months["Jan"] = 1;
months["Feb"] = 2;
months["Mar"] = 3;
months["Apr"] = 4;
months["May"] = 5;
months["Jun"] = 6;
months["Jul"] = 7;
months["Aug"] = 8;
months["Sep"] = 9;
months["Oct"] = 10;
months["Nov"] = 11;
months["Dec"] = 12;
}
{
month = months[$2];
printf("%s-%02d-%02d %s\n", 2015, month, $3, $4);
}' file > out
$ cat out
2015-11-06 07:52:02
...
If you only need to modify some of the lines, you can tweak the awk script a little, e.g. to match every line containing 2015:
...
# Match every line containing 2015
/2015/ {
month = months[$2];
printf("%s-%02d-%02d %s\n", 2015, month, $3, $4);
# Use next to prevent the other print from happening for these lines
# (like 'continue' in a while loop)
next;
};
# This '1' will print all other lines as well:
# Same as writing { print $0 }
1
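If spelling out twelve assignments feels heavy, the month number can also be derived from the position of the month name in a packed string. The sketch below assumes the same input layout as above (weekday, month, day, time); the function name reformat_stamps and passing the year as an argument are my own additions.

```shell
# reformat_stamps YEAR [FILE...]
# Turns "Fri Nov 6 07:52:02" into "YEAR-11-06 07:52:02"; other lines pass through unchanged.
reformat_stamps() {
    local year=$1
    shift
    awk -v year="$year" '
    {
        # Jan starts at position 1, Feb at 4, ..., Dec at 34
        pos = index("JanFebMarAprMayJunJulAugSepOctNovDec", $2)
        if (length($2) != 3 || pos % 3 != 1) {   # $2 is not a month name
            print
            next
        }
        printf "%s-%02d-%02d %s\n", year, (pos + 2) / 3, $3, $4
    }' "$@"
}
```

For example, reformat_stamps 2015 speeds.txt > out, or pipe the lines in on stdin.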
You can convert the date to epoch time and back again with date in a bash script (this needs GNU date):
date -d 'Fri Nov 6 07:52:02' +%s
1446776522
date -d @1446776522 +"%Y-%m-%d %T"
2015-11-06 07:52:02
Since you didn't provide the input, I'll assume you have a file called speeds.txt that contains:
Fri Oct 31 07:52:02 3
Fri Nov 1 08:12:04 4
Fri Nov 2 07:43:22 5
(the 3, 4, and 5 above are just to show that you could have other data in the row, but are not necessary).
Using this command:
cat speeds.txt | cut -d ' ' -f2,3,4 | while read line ; do date -d"$line" "+2015-%m-%d %H:%M:%S" ; done;
You get the output:
2015-10-31 07:52:02
2015-11-01 08:12:04
2015-11-02 07:43:22

awk assign field string to variable not working

Just wondering why this is not working?
this is my awk code, converting "hh:mm:ss" format to seconds
a.awk
BEGIN {
    FS=":";
}

{
    retval = 0;
    for (i = 1; i <= NF; i++) {
        retval += $i * 60 ** (NF-i);
    }
    print $retval;
}
input.txt
59:22:40
$ cat input.txt | awk -f a.awk
//<empty>
$
However, when I try it on the command line, it works:
$ echo "00:59:30" | awk 'BEGIN { FS=":" } { retval = 0; for (i = 1; i <= NF; i++) { retval += $i * 60 ** (NF-i); } print retval;}'
3570
What's wrong with a.awk?
Update, just for clarification:
$ awk --version
GNU Awk 4.0.1
Copyright (C) 1989, 1991-2012 Free Software Foundation.
Since your question has already been answered by the other 2 posts, here's something cute you can do with date to accomplish the same conversion from hh:mm:ss to time in seconds:
# GNU date
string_time="01:01:01"
string_time_in_seconds=$(date -u -d "1970-01-01 ${string_time}" +"%s")
echo ${string_time_in_seconds}
3661
That for loop is cute, but this seems more direct and easier to understand.
BEGIN {
FS=":";
}
{
retval = 0;
in_hours = $1
in_minutes = $2;
in_seconds = $3;
retval = (in_hours * 3600) + (in_minutes * 60) + in_seconds
print retval;
}
I think the problem with your loop is in the exponentiation. My version, at least, doesn't support any ** operator. This might work better for you. Also, be careful with your dollar signs. You need them for fields; you don't need them for variables.
for (i = 1; i <= NF; i++) {
retval += $i * (60^(NF-i));
}
It was a typo:
a.awk
BEGIN {
    FS=":";
}

{
    retval = 0;
    for (i = 1; i <= NF; i++) {
        retval += $i * 60 ** (NF-i);
    }
    print retval; # <<<< notice here: no "$" before retval
}
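To see why print $retval produced an empty line: in awk, a "$" in front of a variable selects the field whose number is the variable's value, so $retval asks for field number 213760, which does not exist. The sketch below shows the difference and then a version of the loop that uses ^, the exponentiation operator that POSIX awk defines. The function name hms_to_seconds is hypothetical.

```shell
# "$n" means "field number n", so this prints 2 and then 22.
echo "59:22:40" | awk -F: '{ n = 2; print n; print $n }'

# hms_to_seconds: reads hh:mm:ss lines on stdin and prints the total in seconds.
hms_to_seconds() {
    awk -F: '{
        total = 0
        for (i = 1; i <= NF; i++)
            total += $i * 60 ^ (NF - i)   # $i is field i; ^ binds tighter than *
        print total                        # no "$": print the variable itself
    }'
}
```

With this, echo "00:59:30" | hms_to_seconds prints 3570.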
Or, using bash only:
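IFS=: read -r -a a < input.txt
# 10# forces base 10, so values such as 08 and 09 are not rejected as invalid octal
((retval = 10#${a[0]}*3600 + 10#${a[1]}*60 + 10#${a[2]}))
echo "$retval"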

How to generate a sequence of dates given starting and ending dates using AWK or BASH scripts?

I have a data set in the following format.
The first and second fields denote the start and end dates (M/D/YYYY) of a study.
How can one expand the data into the desired output format, taking leap years into account, using AWK or BASH scripts?
Your help is very much appreciated.
Input
7/2/2009 7/7/2009
2/28/1996 3/3/1996
12/30/2001 1/4/2002
Desired Output
7/7/2009
7/6/2009
7/5/2009
7/4/2009
7/3/2009
7/2/2009
3/3/1996
3/2/1996
3/1/1996
2/29/1996
2/28/1996
1/4/2002
1/3/2002
1/2/2002
1/1/2002
12/31/2001
12/30/2001
It can be done nicely with bash and GNU date:
for i in `seq 1 5`;
do
date -d "2017-12-01 $i days" +%Y-%m-%d;
done;
or with pipes:
seq 1 5 | xargs -I {} date -d "2017-12-01 {} days" +%Y-%m-%d
If you have gawk:
#!/usr/bin/gawk -f
{
split($1,s,"/")
split($2,e,"/")
st=mktime(s[3] " " s[1] " " s[2] " 0 0 0")
et=mktime(e[3] " " e[1] " " e[2] " 0 0 0")
for (i=et;i>=st;i-=60*60*24) print strftime("%m/%d/%Y",i)
}
Demonstration:
./daterange.awk inputfile
Output:
07/07/2009
07/06/2009
07/05/2009
07/04/2009
07/03/2009
07/02/2009
03/03/1996
03/02/1996
03/01/1996
02/29/1996
02/28/1996
01/04/2002
01/03/2002
01/02/2002
01/01/2002
12/31/2001
12/30/2001
Edit:
The script above suffers from a naive assumption about the length of days. It's a minor nit, but it could produce unexpected results under some circumstances. At least one other answer here also has that problem. Presumably, the date command with subtracting (or adding) a number of days doesn't have this issue.
Some answers require you to know the number of days in advance.
Here's another method which hopefully addresses those concerns:
while read -r d1 d2
do
t1=$(date -d "$d1 12:00 PM" +%s)
t2=$(date -d "$d2 12:00 PM" +%s)
if ((t2 > t1)) # swap times/dates if needed
then
temp_t=$t1; temp_d=$d1
t1=$t2; d1=$d2
t2=$temp_t; d2=$temp_d
fi
t3=$t1
days=0
while ((t3 > t2))
do
read -r -u 3 d3 t3 3<<< "$(date -d "$d1 12:00 PM - $days days" '+%m/%d/%Y %s')"
((++days))
echo "$d3"
done
done < inputfile
You can do this in the shell without awk, assuming you have GNU date (which is needed for the date -d @nnn form, and for the ability to strip leading zeros from single-digit days and months):
while read -r start end ; do
for d in $(seq $(date +%s -d "$end") -86400 $(date +%s -d "$start")) ; do
date +%-m/%-d/%Y -d @$d
done
done
If you are in a locale that observes daylight saving time, this can go wrong when a daylight saving switch falls within the requested range. Use -u to force UTC, which always has exactly 86400 seconds per day. Like this:
while read -r start end ; do
for d in $(seq $(date -u +%s -d "$end") -86400 $(date -u +%s -d "$start")) ; do
date -u +%-m/%-d/%Y -d @$d
done
done
Just feed this your input on stdin.
The output for your data is:
7/7/2009
7/6/2009
7/5/2009
7/4/2009
7/3/2009
7/2/2009
3/3/1996
3/2/1996
3/1/1996
2/29/1996
2/28/1996
1/4/2002
1/3/2002
1/2/2002
1/1/2002
12/31/2001
12/30/2001
Another option is to use dateseq from dateutils (http://www.fresse.org/dateutils/#dateseq). -i changes the input format and -f changes the output format. -1 must be specified as an increment when the first date is later than the second date.
$ dateseq -i %m/%d/%Y -f %m/%d/%Y 7/7/2009 -1 7/2/2009
07/07/2009
07/06/2009
07/05/2009
07/04/2009
07/03/2009
07/02/2009
$ dateseq 2017-04-01 2017-04-05
2017-04-01
2017-04-02
2017-04-03
2017-04-04
2017-04-05
I prefer ISO 8601 format dates - here is a solution using them.
You can adapt it easily enough to American format if you wish.
AWK Script
BEGIN {
days[ 1] = 31; days[ 2] = 28; days[ 3] = 31;
days[ 4] = 30; days[ 5] = 31; days[ 6] = 30;
days[ 7] = 31; days[ 8] = 31; days[ 9] = 30;
days[10] = 31; days[11] = 30; days[12] = 31;
}
function leap(y){
return ((y %4) == 0 && (y % 100 != 0 || y % 400 == 0));
}
function last(m, l, d){
d = days[m] + (m == 2) * l;
return d;
}
function prev_day(date, y, m, d){
y = substr(date, 1, 4)
m = substr(date, 6, 2)
d = substr(date, 9, 2)
#print d "/" m "/" y
if (d+0 == 1 && m+0 == 1){
d = 31; m = 12; y--;
}
else if (d+0 == 1){
m--; d = last(m, leap(y));
}
else
d--
return sprintf("%04d-%02d-%02d", y, m, d);
}
{
d1 = $1; d2 = $2;
print d2;
while (d2 != d1){
d2 = prev_day(d2);
print d2;
}
}
Call this file: dates.awk
Data
2009-07-02 2009-07-07
1996-02-28 1996-03-03
2001-12-30 2002-01-04
Call this file: dates.txt
Results
Command executed:
awk -f dates.awk dates.txt
Output:
2009-07-07
2009-07-06
2009-07-05
2009-07-04
2009-07-03
2009-07-02
1996-03-03
1996-03-02
1996-03-01
1996-02-29
1996-02-28
2002-01-04
2002-01-03
2002-01-02
2002-01-01
2001-12-31
2001-12-30
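One possible adaptation to the M/D/YYYY format from the question is sketched below, using only POSIX awk features. The names expand_us_dates and days_in are hypothetical. Unlike the script above, it compares the dates numerically instead of testing for string equality, so it prints nothing (rather than looping forever) when the first date is later than the second.

```shell
# expand_us_dates [FILE...]
# Each input line holds "M/D/YYYY M/D/YYYY" (start, end); prints end down to start.
expand_us_dates() {
    awk '
    function leap(y) { return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) }
    function days_in(y, m) { return (m == 2 && leap(y)) ? 29 : days[m] }
    BEGIN { split("31 28 31 30 31 30 31 31 30 31 30 31", days, " ") }
    {
        split($1, s, "/"); split($2, e, "/")
        sm = s[1] + 0; sd = s[2] + 0; sy = s[3] + 0
        m  = e[1] + 0; d  = e[2] + 0; y  = e[3] + 0
        # Walk backwards from the end date while it is not before the start date
        while (y > sy || (y == sy && (m > sm || (m == sm && d >= sd)))) {
            print m "/" d "/" y
            if (--d < 1) {                 # stepped past the first of the month
                if (--m < 1) {             # stepped past January
                    m = 12
                    y--
                }
                d = days_in(y, m)
            }
        }
    }' "$@"
}
```

Run it as expand_us_dates inputfile, or feed the date pairs on stdin.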
You can convert the dates to Unix timestamps and then generate a sequence over them. You can even get nanosecond granularity if you want (with '%N' in date).
The following example prints times from 2020-11-07 00:00:00 to 2020-11-07 01:00:00 at intervals of 5 minutes:
# seconds elapsed since 1970-01-01 00:00:00 UTC, for a timestamp given in UTC
# change TZ to interpret the time in your own timezone, e.g. TZ="Asia/Kolkata"
start_time=$(date -u -d 'TZ="UTC" 2020-11-07 00:00:00' '+%s')
end_time=$(date -u -d 'TZ="UTC" 2020-11-07 01:00:00' '+%s')
# 60 seconds * 5 (i.e. 5 minutes)
# change the interval to suit your needs, or use 1 to show every second
interval=$((60 * 5))
# generate the sequence and convert each value back to a timestamp in UTC
# again, change TZ to show the time in your own timezone
seq ${start_time} ${interval} ${end_time} |
xargs -I{} date -u -d 'TZ="UTC" @'{} '+%F %T'

Extracting multiple parts of a string using bash

I have a caret delimited (key=value) input and would like to extract multiple tokens of interest from it.
For example: Given the following input
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"
1=A00^35=D^150=1^33=1
1=B000^35=D^150=2^33=2
I would like the following output
35=D^150=1^
35=D^150=2^
I have tried the following
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"|egrep -o "35=[^/^]*\^|150=[^/^]*\^"
35=D^
150=1^
35=D^
150=2^
My problem is that egrep returns each match on a separate line. Is it possible to get one line of output for one line of input? Please note that due to the constraints of the larger script, I cannot simply do a blind replace of all the \n characters in the output.
Thank you for any suggestions. This script is for bash 3.2.25. Any egrep alternatives are welcome. Please note that the tokens of interest (35 and 150) may change, and I am already generating the egrep pattern in the script, so a one-liner (if possible) would be great.
You have two options. Option 1 is to change the field separator (IFS) and use set --. Word splitting only applies to expansions, so the string has to be in a variable:
line='1=A00^35=D^150=1^33=1'
OFS=$IFS
IFS="^ "
set -- $line # No quotes here!!
IFS="$OFS"
Now you have your values in $1, $2, etc.
Or you can use an array:
tmp=$(echo "1=A00^35=D^150=1^33=1" | sed -e 's:\([0-9]\+\)=: [\1]=:g' -e 's:\^ : :g')
eval value=($tmp)
echo "35=${value[35]}^150=${value[150]}"
To get rid of the newline, you can just echo it again:
$ echo $(echo "1=A00^35=D^150=1^33=1"|egrep -o "35=[^/^]*\^|150=[^/^]*\^")
35=D^ 150=1^
If that's not satisfactory (I think it may give you one line for the whole input file), you can use awk:
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=35,150 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf "%s%s=%s", sep, arr[1], arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
35=D^150=1^
35=d^
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=1,33 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf "%s%s=%s", sep, arr[1], arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
1=A00^33=1^
1=a00^33=11^
This one allows you to use a single awk script and all you need to do is to provide a comma-separated list of keys to print out.
And here's the one-liner version :-)
echo '1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLST=1,33 -F^ '{s="";split(LST,k,",");for(i=1;i<=NF;i++){for(j in k){split($i,arr,"=");if(arr[1]==k[j]){printf "%s%s=%s",s,arr[1],arr[2];s="^";}}}if(s!=""){print s;}}'
Given a file 'in' containing your strings (this relies on the tokens of interest always being fields 2 and 3):
$ for i in $(cut -d^ -f2,3 < in);do echo $i^;done
35=D^150=1^
35=D^150=2^
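Since the question targets bash 3.2 and the keys may change, here is a sketch that selects fields by key in pure bash, without egrep or awk. The function name extract_keys and the comma-separated key list are assumptions for illustration; it only uses features that bash 3.2 already has (read -a, here-strings, +=).

```shell
# extract_keys KEYS   (KEYS is a comma-separated list such as 35,150)
# Reads caret-delimited key=value lines on stdin and prints, one output line per
# input line, the fields whose key is in KEYS, each followed by "^".
extract_keys() {
    local keys=",$1," line field out
    local -a fields
    while IFS= read -r line || [ -n "$line" ]; do
        out=
        IFS='^' read -r -a fields <<< "$line"
        for field in "${fields[@]}"; do
            case "$keys" in
                *",${field%%=*},"*) out+="$field^" ;;   # key is the part before the first "="
            esac
        done
        if [ -n "$out" ]; then
            printf '%s\n' "$out"
        fi
    done
    return 0
}
```

For the sample input, echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2" | extract_keys 35,150 should print one line per input line. Lines with no matching key produce no output.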
