filter log file data from a certain time range - bash

I want to write a script that asks the user for the first and last date and time of the interval to filter our log data by, and I need some help.
I don't know exactly how to find the data in that range, since I can't do it with a single regex.
my log file looks like this:
108.162.221.147 - - [04/Aug/2016:18:59:59 +0200] "GET / HTTP/1.1" 200 10254 "-"...
141.101.99.235 - - [04/Aug/2016:19:00:00 +0200] "GET / HTTP/1.1" 200 10255 ...
108.162.242.219 - - [04/Aug/2016:19:00:00 +0200] "GET / HTTP/1.1" 200 10255...
185.63.252.237 - - [04/Aug/2016:19:00:00 +0200] "CONNECT...
108.162.221.147 - - [04/Aug/2016:19:00:00 +0200] "GET /?...
185.63.252.237 - - [04/Aug/2016:19:00:01 +0200] "CONNECT....
etc...
my script:
#!/bin/bash
echo "enter the log file name "
read fname
echo "enter the start date and time "
read startdate
echo "enter the end fate and time "
read enddate
result=$(some code for filtering rows from this range)
echo "$result" > 'log_results'
echo "results written into /root/log_results file"
I tried using
sed -n "/"$startdate"/,/"$enddate"/p" "fname"
but it didn't work: the slashes in the date format break the /start/,/end/ addresses, and a plain regex doesn't work either, as it only matches those two exact dates in the log (maybe I've been writing it wrong).
How do I do this?

Usually it's best to use some kind of dedicated log parsing software for this kind of task, so that you don't have to do what you're trying to do. It's also decidedly not a job for regular expressions. However, if you must do this with text processing tools such as grep, I would suggest a two-phase approach:
Generate a list of every timestamp you want to find.
Use grep -F to find all lines in your log that contain one of those timestamps.
For example, if you only wanted to find the middle five lines of your file (the ones with the timestamp [04/Aug/2016:19:00:00 +0200]), that would make step 1 very simple (as you are generating a single-item list, with just one timestamp in it).
echo '[04/Aug/2016:19:00:00 +0200]' > interesting_times
Then find all the lines with that timestamp:
grep -F -f interesting_times logfile
You could generate a shorter list by reducing the precision of the timestamp. For example to find two entire hours of log data:
echo '[04/Aug/2016:19' > interesting_times
echo '[04/Aug/2016:20' >> interesting_times
I leave it to you to determine how to generate the list of interesting times, but seriously look into purpose-built log parsing software.
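For completeness, here is one rough way to do step 1 programmatically between the two user-supplied times (a sketch, not part of the original answer; it assumes GNU date and minute-level start/end values such as "2016-08-04 18:59"):
# sketch only: assumes GNU date; start/end values and their format are placeholders
start="2016-08-04 18:59"
end="2016-08-04 19:05"
t=$(date -d "$start" +%s)
stop=$(date -d "$end" +%s)
: > interesting_times                                          # truncate the list file
while [ "$t" -le "$stop" ]; do
    date -d "@$t" +'[%d/%b/%Y:%H:%M' >> interesting_times      # e.g. [04/Aug/2016:18:59
    t=$((t + 60))                                              # advance one minute
done
grep -F -f interesting_times logfile
Each loop iteration appends one minute-level prefix, and the final grep -F pulls every matching line, exactly as in step 2 above.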

Related

How do I test the speed between my site and a proxy server?

I'm getting complaints from employees in the field that our site is slow. When I check it -- the speed is acceptable. They are all going through a proxy server that is not controlled by me.
I'd like to run a continuous ping to the proxy server, but I haven't found anything to do that.
How do I check the speed from my site to a proxy server?
You can set up a cron job to ping a site of your choice, at the frequency you choose. Here I ping google.com every 15 minutes. I can adjust the number of pings with the -c count flag and the time between pings with -i interval. The interval is in seconds; I can use shorter values if required, for example 0.5.
I then pipe to tail -n 1 to keep only the last line, which holds the results. At this stage my output is as follows:
rtt min/avg/max/mdev = 12.771/17.448/23.203/4.022 ms
We then use awk to only take the 4th field and use tr to replace the slashes with commas. Finally we store the result in a CSV file.
Here is the whole line in crontab:
*/15 * * * * ping -c 5 -i 1 google.com | tail -n 1 | awk '{ print $4 }' | tr "/" "," >> /home/john/pingLog.csv
It is important to run this as root. To do so we edit the crontab using sudo:
sudo crontab -e
The end result is a comma separated file that you can open in Excel or equivalent, or process as you wish.
As noted in the ping output the 4 figures are min/avg/max/mdev.
Here is a version for Windows. The result is not as refined as in the Linux version, but we still get the essentials. You could put it in a .bat file and run it as a scheduled task, or put the command directly in the scheduled task.
ping google.com | findstr Minimum >> TotalPings.txt
Which adds a line like the following every time it is run (this example output comes from a French-locale system, where "Moyenne" is the average):
Minimum = 23ms, Maximum = 23ms, Moyenne = 23ms
You can change the server pinged to suit your needs.
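If you go the scheduled-task route, something along these lines might register it (a sketch; the 15-minute interval, task name, and script path are placeholders):
schtasks /Create /SC MINUTE /MO 15 /TN PingLog /TR "C:\scripts\pinglog.bat"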

Can't get the filename from Apache logs

I have an ownCloud server that I set up myself.
I need to log the files downloaded by users.
I made a bash script which greps the Apache logs and puts the result into a file.
Example of a line in the file:
/var/log/httpd/ssl_access_log-20200621-46.63.46.133 - - [18/Jun/2020:13:07:33 +0000] "GET /ocs/v2.php/apps/files_sharing/api/v1/shares?format=json&path=%2FHJC%2FMaster-Schedule%20Draft%20for%20SOP%20of%20HJC%20(10.10.2019).xlsx&shared_with_me=true HTTP/1.1" 200 108
How can I get the file name "Master-Schedule Draft for SOP of HJC (10.10.2019).xlsx"?
OK, I finally found a solution using sed:
sed 's#+# #g;s#%#\\x#g' <my-log-file> | xargs -0 printf "%b" > <result-file>
It decodes the URL, so all that remains to be done is to get the 'path' value.
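From there, one way to pull out just the file name might be (a sketch; it assumes the decoded output from the command above was written to result-file):
grep -o 'path=[^&]*' result-file | sed 's/^path=//' | awk -F'/' '{ print $NF }'
grep -o keeps only the path=... part up to the next &, sed strips the key, and awk prints the last /-separated component, i.e. the file name.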

How to get access_log summary by goaccess starting from certain date?

Currently I keep 6 weeks of Apache access_log files. If I generate an access summary at month end:
cat /var/log/httpd/access_log* | goaccess --output-format=csv
the summary will include some access data from previous month.
How can I skip logs of previous month and summarise from first day of month?
P.S. The date format is %d/%b/%Y.
You can trade the Useless Use of cat for a useful grep.
grep -h "$(date +'[0-3][0-9]/%b/%Y')" /var/log/httpd/access_log* |
goaccess --output-format=csv
(The -h flag keeps grep from prefixing each line with the file name, which would confuse goaccess.)
If the logs are by date, it would be a lot more economical to skip the logs which you know are too old or too new, i.e. modify the wildcard argument so you only match the files you really want (or run something like find -mtime -30 to at least narrow the set to a few files).
(The cat is useless because, if goaccess is at all correctly written, it should be able to handle
goaccess --output-format=csv /var/log/httpd/access_log*
just fine.)
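Combining both suggestions, something like this might work (a sketch; it assumes the rotated logs have modification times within roughly the last month):
find /var/log/httpd -name 'access_log*' -mtime -31 -exec cat {} + |
  grep "$(date +'[0-3][0-9]/%b/%Y')" |
  goaccess --output-format=csv
find narrows the set of files first, and grep then keeps only the current month's lines before they reach goaccess.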

creating a script which finds two alternating patterns

So my issue is that I need to make a script which finds a pattern where Time to live and User-Agent occur in that order, and then increments a count (or grabs whatever data I want; it will likely evolve from there).
For example:
Time to live: 64
Some other data: ________
...
User-Agent: Mozilla/Chrome/IE:Windows/Unix/Mac
So basically the data appears in that order, TTL then User-Agent. From that information I can grab the data I want, but I don't know how to write the pattern that identifies this. If it helps, I'm getting this data from a Wireshark capture saved as a text file.
Thanks to Shellter I got to the point where I have:
egrep ' User-Agent:| Time to live:' ../*.txt
which finds if both (TTL and UA) are in the file.
I'd appreciate any assistance. Here is some sample data:
Fragment offset: 0
Time to live: 128
Protocol: TCP (6)
Header checksum: 0x7e4d [correct]
[Good: True]
[Bad: False]
Source: 1.1.1.3 (1.1.1.3)
Destination: 1.1.1.4 (1.1.1.4)
//packet 2
Fragment offset: 0
Time to live: 128
Protocol: TCP (6)
Hypertext Transfer Protocol
GET / HTTP/1.1\r\n
[Expert Info (Chat/Sequence): GET / HTTP/1.1\r\n]
[Message: GET / HTTP/1.1\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /
Request Version: HTTP/1.1
Host: mail.yahoo.com\r\n
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n
I apologize for the slow reply, I had to do some editing.
So basically I just need to identify when only a TTL occurs, and when a TTL occurs together with User-Agent data; I use this to identify clients behind a gateway.
So if the expected TTL is 126 (Windows) and I see 125, we assume it's behind a gateway and count++.
If we get that same count but with a different user-agent and the same OS, the count doesn't change.
If we get that same count but with a different user-agent and a different OS, count++.
so output could be as simple as:
1 (ttl)
1 (ttl+os)
2 (ttl+os+ua)
from the example (not the data) above.
It's still a little unclear what you're looking to report, but maybe this will help.
We're going to use awk as that tool was designed to solve problems of this nature (among many others).
And while my output doesn't match your output exactly, I think the code is self-documenting enough that you can work with this, and make a closer approximation to your final need. Feel free to update your question with your new code, new output, and preferably an exact example of the output you hope to achieve.
awk '
/Time to live/{ttl++}
/User-Agent/{agent++}
/Windows|Linux|Solaris/{os++}
END{print "ttl="ttl; print "os="os; print"agent="agent}
' ttlTest.txt
output
ttl=2
os=1
agent=1
The key thing to understand is that awk (and most Unix regex-based utilities, grep included) reads each line of input and decides whether to print (or do something else with) the current line of data.
awk will normally print every line of input if you give it something like
awk '{print $1}' file
In this example it prints just the first field from each line of data.
In the solution above, we're filtering the data with regular expressions and then applying an action when we have matched some data, i.e.
/Time to live/ { ttl++ }
      |        |   |   |
      |        |   |   +---- block end
      |        |   +-------- action (in this case, increment the value of the ttl variable)
      |        +------------ block begin
      +--------------------- regex to match
So we have 2 other 'regular expressions' that we're scanning each line for, and every time we match that regular expression, we increment the related variable.
Finally, awk allows for END blocks that execute after all data has been read from files.
This is how we create your summary report. awk also has BEGIN blocks that execute before any data has been read.
Another idiom of awk scanning that allows more complex patterns to be matched looks like
awk '{
    if ( /Time to live/ && /User-Agent/ ) {
        ttl_agent++
    }
}' ttlTest.txt
where the outermost { } block-definition characters indicate that this logic will be applied to each line that is read from the data. This block can be quite complex, and can use other variable values inside the if test, like if (var == 5) { print "found var=5" }.
IHTH
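If the goal is specifically to pair each Time to live with the User-Agent that follows it, rather than count them independently, a stateful variant along these lines might be closer (a sketch; ttlTest.txt and the counting rules are assumptions based on your description):
awk '
    /Time to live:/ { last_ttl = $NF; have_ttl = 1; ttl_count++ }   # remember the most recent TTL value
    have_ttl && /User-Agent:/ { pairs[last_ttl]++; have_ttl = 0 }   # a UA line following a TTL line
    END {
        print "ttl lines: " ttl_count
        for (t in pairs) print "ttl " t " with user-agent: " pairs[t]
    }
' ttlTest.txt
On the sample data above this reports 2 TTL lines and one TTL/User-Agent pair for TTL 128.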

AWSTATS issue: all records dropped

I have a problem when using AWSTATS to analyse my apache logs.
In the past everything worked well.
But now the log format has changed for my server.
Old format example:
194.206.22.25 - - [14/Dec/2009:12:23:33 +0100] "GET /gPM-Systems/css/default.css HTTP/1.1" 404 1036
New format example:
356652,mics,194.206.22.24,194.206.22.24,-,[05/Jul/2011:15:11:18 +0200],"GET /index.html HTTP/x.x",302,-
For the old format, the correct LogFormat to choose was 4.
Now it is this custom format:
LogFormat="%other %other %host %other %logname %time1 %methodurl %code"
I also changed LogSeparator to "," instead of " ".
My problem is that all records are dropped.
The -showdropped option shows this:
Dropped record (method/protocol 'GET /apache_pb.gif' not qualified when LogType=W):
356652,mics,194.206.22.24,194.206.22.24,-,[05/Jul/2011:15:11:18 +0200],"GET /apache_pb.gif HTTP/1.0",302,-
I had a similar issue when I changed the format of my logs: the format changed and the field separator became a tab, which caused the same error.
For the LogFile configuration option, I was already using a pipe. So I switched the tab out for a space by adding tr '\t' ' ' | to the end. Then I modified the AWStats config to separate on spaces.
I was able to get AWStats to parse the logs after this. Perhaps it will work for you as well.
If you are not already using a pipe for the LogFile configuration option, you can use cat to get the files into tr.
LogFile="cat /log/file/path/*.log | tr '\t' ' ' |"
Replacing "HTTP/1.x" with nothing solved this issue for me.
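If you would rather do that replacement on the fly than edit the log files, the same LogFile pipe trick from the previous answer might work (a sketch; the path is a placeholder and the sed pattern is an assumption):
LogFile="cat /log/file/path/*.log | sed 's# HTTP/[0-9x]\.[0-9x]##' |"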
