Get most popular domains from log - shell

I am trying to get the most popular domains from a log file.
The log format looks like this:
197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/
I am only interested in the domain, and I want output like this:
XXXX facebook.com
where XXXX is the number of matching entries in the log.
Does anyone have a one-line Unix command for this?
Edit
I tried the following
grep -i * sites.log | sort | uniq -c | sort -nr | head -10 &> popular.log
but popular.log is empty, which suggests the command is wrong.

perl -nle '$d{$1}++ if m!//([^/]+)!; END {foreach(sort {$d{$b} <=> $d{$a}} keys(%d)) {print "$d{$_}\t$_"};}' your.log
if you don't mind perl

uniq -c counts unique occurrences, but requires sorted input.
sort sorts a stream of data.
grep has a flag -o which prints only the part of each line that matched the regex.
These three parts put together are what you need to perform a map/reduce on this and get the data you want.
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c
The grep extracts only the domain name and its top-level domain, the sort sorts all the entries, and uniq -c counts the occurrences.
Adding a sort -nr at the end gives you a list with the most frequent entries at the top:
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c | sort -nr
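The question asks for facebook.com rather than m.facebook.com. Here is a minimal sketch of one way to get that, assuming the URL is always the third comma-separated field and that the last two dot-separated labels form the domain (which is not true for suffixes such as .co.uk). The sample lines and the names log and top_domains are made up for illustration:

```shell
# Sample log in the question's format (made-up entries).
log=$(mktemp)
cat > "$log" <<'EOF'
197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/
197.123.43.60, 27/May/2015:01:00:12 -0600, https://www.facebook.com/home
197.123.43.61, 27/May/2015:01:00:13 -0600, http://example.org/
EOF

top_domains=$(
  awk -F', ' '{ print $3 }' "$log" |                # keep the URL column
    sed -E 's#^[a-z]+://##; s#[/:].*$##' |          # drop scheme, then path or port
    awk -F. 'NF >= 2 { print $(NF-1) "." $NF }' |   # keep the last two labels
    sort | uniq -c | sort -rn | head -n 10          # count, most frequent first
)
printf '%s\n' "$top_domains"
rm -f "$log"
```

The count comes first and the domain second, which matches the XXXX facebook.com layout requested in the question.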

Related

Order grep output by second token

I am grepping for log output in many log files. I want the output of the grep command to be ordered by the time, which is the second token of every log line. The log lines are preceded with the date in the following format:
[2019-Jan-18 09:46:40.385624]
The logs are only ever from today so ordering only by the time will suffice.
I am using the following command to grep for a string:
grep "needle" /path/to/logs/*
How can I order the output by ascending time? I have tried piping to the sort command
grep "needle" /path/to/logs/* | sort
but that only sorts by the filename.
To make sort order the lines by everything from the second field onward, use a command like:
grep "needle" /path/to/logs/* | sort -k2
To sort on the second field only, use:
grep "needle" /path/to/logs/* | sort -k2,2
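A small sketch of the -k2,2 variant on made-up data may help. It assumes two hypothetical log files, a.log and b.log, whose names contain no whitespace (otherwise the field numbers shift), and it sets LC_ALL=C so the timestamps are compared byte by byte:

```shell
# Two hypothetical log files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '[2019-Jan-18 10:15:00.000001] needle late' > "$dir/a.log"
printf '%s\n' '[2019-Jan-18 09:46:40.385624] needle early' \
              '[2019-Jan-18 09:50:00.000000] haystack only' > "$dir/b.log"

# grep prefixes each match with "file:", so field 1 is "file:[date"
# and field 2 is the time. Sort on field 2 only.
sorted=$(cd "$dir" && grep "needle" a.log b.log | LC_ALL=C sort -k2,2)
printf '%s\n' "$sorted"
rm -rf "$dir"
```

The match from b.log is printed before the one from a.log because its time is earlier, even though a.log sorts first by filename.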

Find uniq items in column of gzip file

I want to return the number of unique items in one column of my gzip file.
To sort on a normal file I know you can use something like:
sort -u -t, -k1,1 filename | wc -l
but when I run this on a gzip file I get:
?BC??\ks?ʑ???
Is it possible to change this format to find the unique items in a column, given a gzip file?
Okay so I actually figured it out!
gzcat vcf_del.vcf.gz | cut -f 2 | sort | uniq | wc -l (and zcat also works)
Then if there are parts of the file you do not want (for instance in VCF files there are a series of lines with "#") you can simply remove them as such:
gzcat vcf_del.vcf.gz | awk '!/^#/{print $0}' | cut -f 2 | sort | uniq | wc -l
The gzip package comes with the zcat program, which works just like cat but on gz files.
zcat filename | sort -u -t, -k1,1
You cannot run search and sort commands directly on a compressed file; you have to decompress it first and run your commands on the decompressed output.
You can try the command below:
gunzip -c filename | sort -u -t, -k1,1
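Putting the pieces together, here is a hedged sketch that counts the unique values in the first column of a gzip-compressed, comma-separated file. The file name sample.csv.gz and its contents are made up, and gunzip -c is used because zcat may behave differently on some systems:

```shell
# Build a small compressed sample (made-up data, one comment line).
dir=$(mktemp -d)
printf '%s\n' '#header,line' 'chr1,100' 'chr2,200' 'chr1,300' |
  gzip -c > "$dir/sample.csv.gz"

unique_count=$(
  gunzip -c "$dir/sample.csv.gz" |  # decompress to stdout
    grep -v '^#' |                  # drop comment lines
    cut -d, -f1 |                   # keep column 1
    sort -u |                       # unique values
    wc -l                           # count them
)
# $(( )) strips the padding some wc implementations add.
echo "unique values in column 1: $((unique_count))"
rm -rf "$dir"
```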

How to find most frequent string in file

I have a question about a bash script. Let's say there is a file which contains lines; each line has a path to a file and a date. The problem is how to find the most frequent path.
Thanks in advance.
Here's a suggestion
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
# \____________________/   \__/   \_____/   \______/   \______/
# select the file column   sort    print    sort on    print top
#                          files  counts     count      result
Example use:
$ cat file.txt
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileB jan:17:13:46:27:2015
/home/admin/fileC jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
3 /home/admin/fileA
You can strip the count (3) from the final result with another cut.
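Because uniq -c pads the count with a varying number of spaces, cut can be awkward for that last step. A minimal sketch that uses awk instead, assuming the paths contain no whitespace (the variable names file and most_frequent are made up):

```shell
# Recreate the example input in a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileB jan:17:13:46:27:2015
/home/admin/fileC jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
EOF

# Same pipeline as above, then keep only the path (field 2 of "count path").
most_frequent=$(
  cut -d' ' -f1 "$file" | sort | uniq -c | sort -rn | head -n1 |
    awk '{ print $2 }'
)
echo "$most_frequent"
rm -f "$file"
```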
Reverse the lines, cut the beginning (the date), reverse them again, then sort and count unique lines:
cat file.txt | rev | cut -b 22- | rev | sort | uniq -c
If you're absolutely sure you won't have whitespace in your paths, you can avoid rev altogether:
cat file.txt | cut -d " " -f 1 | sort | uniq -c
If the output is too long to inspect visually, aioobe's suggestion of following this with sort -rn | head -n1 will serve you well
It's worth noticing, as aioobe mentioned, that many unix commands optionally take a file argument. By using it, you can avoid the extra cat command in the beginning, by supplying its argument to the next command:
cat file.txt | rev | ... vs rev file.txt | ...
While I personally find the first option easier both to remember and to understand, the second is preferred by many (most?) people, as it saves system resources (specifically, the memory and references used by an additional process) and can perform better in some specific use cases. Wikipedia's cat article discusses this in detail.

Unix shell script to sort files depending on the 'date string' present in their file name

I am trying to sort files in a directory, depending on the 'date string' attached in the file name, for example files looks as below
SSA_F12_05122013.request.done
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
Where 05122013,12142012 and 01062013 represents the dates in format.
Please help me with a Unix shell script to sort these files by the date string in their file names (in descending and ascending order).
Thanks in advance.
Hmmm... why call on heavyweights like awk and Perl when sort itself has the capability to define what exactly to sort by?
ls SSA_F*.request.done | sort -k 1.13,1.16 -k 1.9,1.10 -k 1.11,1.12
Each -k option defines a "sort key":
-k 1.13,1.16
This defines a sort key ranging from field 1, column 13 to field 1, column 16. (A field is by default delimited by whitespace, which your filenames don't have.)
If your filenames are varying in length, defining the underscore as field separator (using the -t option) and then addressing columns in the third field would be the way to go.
Refer to man sort for details. Use the -r option to sort in descending order.
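The -t variant mentioned above could look like the sketch below. It assumes the date is always the first eight characters of the third underscore-separated field and is laid out as month, day, year, which is what the sample names suggest. The variable names ascending and descending are made up:

```shell
names='SSA_F12_05122013.request.done
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done'

# Field 3 is "MMDDYYYY.request.done": sort by year (chars 5-8),
# then month (chars 1-2), then day (chars 3-4).
ascending=$(printf '%s\n' "$names" |
  LC_ALL=C sort -t_ -k3.5,3.8 -k3.1,3.2 -k3.3,3.4)

# -r reverses the whole comparison for descending order.
descending=$(printf '%s\n' "$names" |
  LC_ALL=C sort -r -t_ -k3.5,3.8 -k3.1,3.2 -k3.3,3.4)

printf 'ascending:\n%s\n' "$ascending"
printf 'descending:\n%s\n' "$descending"
```

Since the keys are addressed relative to the third field, this keeps working when the prefix before the date varies in length, as long as the prefix always contains exactly two underscores.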
one way with awk and sort:
ls -1|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|sort|awk '$0=$NF'
if we break it down:
ls -1|
awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
The ls -1 is just an example; I assume you have your own way to get the file list, one per line.
A quick test:
kent$ echo "SSA_F13_12142012.request.done
SSA_F12_05122013.request.done
SSA_F14_01062013.request.done"|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
SSA_F12_05122013.request.done
ls -lrt *.done | perl -lane '@a=split /_|\./,$F[scalar(@F)-1];$a[2]=~s/(..)(..)(....)/$3$1$2/g;print $a[2]." ".$_' | sort -rn | awk '{$1=""}1'
ls *.done | perl -pe 's/^.*_(..)(..)(....)/$3$1$2$&/' | sort -rn | cut -b9-
This would also do it.

Bash Script: count unique lines in file

Situation:
I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one ip/port per line. Lines are of this format:
ip.ad.dre.ss[:port]
Desired result:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort which will be able to reduce it to lines of the format
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done, treat different ports as different addresses.
So far, I'm using this command to scrape all of the ip addresses from the log file:
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to scrape out all of the ip addresses that were sent by my address (which I don't care about)
I can then use the following to extract the unique entries:
sort -u ips.txt > intermediate.txt
I don't know how I can aggregate the line counts somehow with sort.
You can use the uniq command to get counts of sorted repeated lines:
sort ips.txt | uniq -c
To get the most frequent results at top (thanks to Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
To count the total number of unique lines (i.e. not considering duplicate lines) we can use uniq or Awk with wc:
sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l
Awk's arrays are associative so it may run a little faster than sorting.
Generating text file:
$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175
real 0m1.193s
user 0m0.701s
sys 0m0.388s
$ time awk '!seen[$0]++' random.txt | wc -l
31175
real 0m0.675s
user 0m0.108s
sys 0m0.171s
This is a fast way to get the count of each repeated line, printed sorted from the least frequent to the most frequent:
awk '{seen[$0]++}END{for (i in seen) print seen[i], i}' ips.txt | sort -n
If you don't care about performance and you want something easier to remember, then simply run:
sort ips.txt | uniq -c | sort -n
PS:
sort -n parses the field as a number, which is correct here since we're sorting by the counts.
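The question asks for lines of the form ip.ad.dre.ss[:port] count, with the count last. A minimal sketch of that layout, using made-up addresses and assuming the lines contain no whitespace (the names ips and counts are hypothetical):

```shell
# Made-up sample: one address (and optional port) per line.
ips=$(mktemp)
printf '%s\n' 10.0.0.1:80 10.0.0.2 10.0.0.1:80 10.0.0.1:443 10.0.0.1:80 > "$ips"

counts=$(
  # Count each distinct line, printing "address count".
  awk '{ seen[$0]++ } END { for (line in seen) print line, seen[line] }' "$ips" |
    # Highest count first; break ties by address.
    LC_ALL=C sort -k2,2nr -k1,1
)
printf '%s\n' "$counts"
rm -f "$ips"
```

Here 10.0.0.1:80 and 10.0.0.1:443 are counted separately, since different ports are treated as different addresses.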

Resources