Bash Script: count unique lines in file - bash

Situation:
I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:
ip.ad.dre.ss[:port]
Desired result:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort which will be able to reduce it to lines of the format
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done, treat different ports as different addresses.
So far, I'm using this command to scrape all of the ip addresses from the log file:
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to scrape out all of the IP addresses that were sent by my address (which I don't care about).
I can then use the following to extract the unique entries:
sort -u ips.txt > intermediate.txt
I don't know how I can aggregate the line counts somehow with sort.

You can use the uniq command to get counts of sorted repeated lines:
sort ips.txt | uniq -c
To get the most frequent results at top (thanks to Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
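Note that uniq -c prints the count before the line, while the question asks for "ip.ad.dre.ss[:port] count". A minimal sketch of one way to swap the columns, assuming the lines contain no whitespace (true for ip[:port] entries); the function name count_lines is made up for illustration:

```shell
#!/usr/bin/env bash
# count_lines FILE (hypothetical helper name, not part of any tool)
# Prints "<line> <count>" for every distinct line, most frequent first.
# Assumes lines contain no whitespace, as with ip[:port] entries.
count_lines() {
    LC_ALL=C sort -- "$1" |
        uniq -c |                      # emits "<count> <line>"
        LC_ALL=C sort -k1,1nr -k2 |    # highest count first, ties ordered by line
        awk '{ print $2, $1 }'         # swap to "<line> <count>"
}
```

It would be called as count_lines ips.txt. LC_ALL=C is only there to make the ordering independent of the locale.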

To count the total number of unique lines (i.e. not considering duplicate lines) we can use uniq or Awk with wc:
sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l
Awk's arrays are associative (hash-based), so this avoids sorting the whole file and may run a little faster; the trade-off is that every distinct line is kept in memory.
Generating text file:
$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175
real 0m1.193s
user 0m0.701s
sys 0m0.388s
$ time awk '!seen[$0]++' random.txt | wc -l
31175
real 0m0.675s
user 0m0.108s
sys 0m0.171s
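If only the number is needed, awk can do the counting itself, so the pipe to wc can be dropped. A small sketch under that assumption; count_distinct is a hypothetical name:

```shell
#!/usr/bin/env bash
# count_distinct FILE (hypothetical helper name)
# Prints the number of distinct lines in one pass, without sorting.
# Memory use grows with the number of distinct lines, since each one
# is stored as a key of the "seen" array.
count_distinct() {
    # n is incremented only the first time a line is seen;
    # "n + 0" makes an empty file print 0 instead of an empty string.
    awk '!seen[$0]++ { n++ } END { print n + 0 }' "$1"
}
```

It would be called as count_distinct random.txt.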

This is a fast way to get the count of the repeated lines and have them nicely printed, sorted from the least frequent to the most frequent:
awk '{seen[$0]++} END {for (i in seen) print seen[i], i}' ips.txt | sort -n
If you don't care about performance and you want something easier to remember, then simply run:
sort ips.txt | uniq -c | sort -n
PS:
sort -n parses the field as a number, which is correct here since we're sorting by the counts.

Related

BASH script help using TOP, GREP and CUT

Use Top command which repeats 5 times, pipe the results to Grep and Cut command to print the PID for init process on your screen.
Hi all, I have my line of code:
top -n 5 | grep "init" | cut -d" " -f3 > topdata
But I cannot see any output to verify that it's working.
Also, the next script asks me to use a one line command which shows the total memory used in megabytes. I'm supposed to pipe results from Free to Grep to select or filter the lines with the pattern "Total:" then pipe that result to Cut and display the number representing total memory used. So far:
free -m -t | grep "total:" | cut -c25-30
Also not getting any print return on that one. Any help appreciated.
Expanding on my comments:
grep is case sensitive. free says "Total", you grep "total". So no match! Either grep for "Total" or use grep -i.
Instead of cut, I prefer awk when I need to get a number out of a line. You do not know what length the number will be, but you know it will be the first number after Total:. So:
free -m -t | grep "Total:" | awk '{print $2}'
For your top command, if you have no init process (which you should, but it would probably not show in top), just grep for something else to see if your code works. I used cinnamon (running Mint). The top command is:
top -n 5 | grep "cinnamon" | awk '{print $1}'
Replace "cinnamon" by "init" for your requirement. Why $1 in the awk? My top puts the PID in the first column. Adjust accordingly.
Overall, using cut is good when you have a string that is delimited by some character. Ex. aaa;bbb;ccc, you would cut on -d';'. But here the numbers might have different lengths so using cut is not (IMHO) the best solution.
The init process has PID 1, so there's no reason to do it this way.
To find the PID of a process in general, I'd recommend:
pidof <name>
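For the memory part of the question, grep and awk can be folded into a single awk call. The sketch below assumes the usual procps layout of free -t, where the Total: row lists total, used and free in that order, so "used" is the third field; total_used_mb is a hypothetical name, and it reads from stdin so it can be tried on saved output:

```shell
#!/usr/bin/env bash
# total_used_mb (hypothetical helper name)
# Reads `free -m -t` output on stdin and prints the "used" column
# of the Total: row. Assumes the column order: total, used, free.
total_used_mb() {
    awk '$1 == "Total:" { print $3 }'
}

# Typical use on a Linux system with procps installed:
#   free -m -t | total_used_mb
```

Two side notes on the top command in the question: its output was redirected into the file topdata, which is why nothing appeared on screen, and when top is piped its batch mode (top -b -n 5) keeps terminal control sequences out of the output.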

Get most popular domains from log

I am trying to get the most popular domains from a log file.
The log format is like this:
197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/
I am interested only in the domain, and I want output as follows:
XXXX facebook.com
where XXXX is the number of similar entries in the logs.
A one-liner Unix command, anyone?
Edit
I tried the following
grep -i * sites.log | sort | uniq -c | sort -nr | head -10 &> popular.log
but popular.log is empty, implying that the command is wrong.
perl -nle '$d{$1}++ if m!//([^/]+)!; END {foreach(sort {$d{$a} <=> $d{$b}} keys(%d)) {print "$d{$_}\t$_"};}' your.log
If you don't mind Perl.
uniq -c counts unique occurrences, but requires sorted input.
sort sorts a stream of data.
grep has a flag -o which returns only the part of the line that matched the regex.
These three parts put together are what you need to perform map/reduce on this and get the data you want.
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c
The grep keeps only the last two labels of each line (the domain name and its top-level domain), sort orders all the entries, and uniq -c counts the occurrences.
Adding sort -nr at the end gives you a list with the most frequent entries at the top:
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c | sort -nr
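With the sample log line above, that grep pattern also keeps the trailing slash ("facebook.com/"). A hedged awk sketch that strips the scheme and path first is shown below. It assumes fields are separated by a comma and a space with the URL in the third field, and it treats the last two labels of the host as "the domain", which is a simplification that is wrong for suffixes such as co.uk. The name top_domains is made up for illustration:

```shell
#!/usr/bin/env bash
# top_domains FILE (hypothetical helper name)
# Prints "<count> <domain>", most frequent first.
# Assumes lines like: "ip, timestamp, https://host/path".
top_domains() {
    awk -F', ' 'NF >= 3 {
        host = $3
        sub("^[a-z]+://", "", host)              # drop the scheme
        sub("[/:].*$", "", host)                 # drop port and path
        n = split(host, part, ".")
        if (n >= 2) host = part[n-1] "." part[n] # keep the last two labels
        count[host]++
    }
    END { for (h in count) print count[h], h }' "$1" |
        LC_ALL=C sort -k1,1nr -k2
}
```

It would be called as top_domains sites.log | head -10.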

Bash to find count of multiple strings in a large file

I'm trying to get the count of various strings in a large txt file using bash commands.
I.e. find the count of the strings 'pig', 'horse', and 'cat' using bash, and get an output say 'pig: 7, horse: 3, cat: 5'. I would like a way to search through the txt file only once, because it is very large (so I do not want to search for 'pig' through the whole txt file, then go back and search for 'horse', etc.)
Any help with commands would be appreciated. Thanks!
grep -Eo 'pig|horse|cat' txt.file | sort | uniq -c | awk '{print $2": "$1}'
Breaking that into pieces:
grep -Eo 'pig|horse|cat'    Print only the matching text (-o) for the
                            extended (-E) regex
sort                        Sort the resulting words
uniq -c                     Output unique values (of sorted input)
                            with the count (-c) of each value
awk '{print $2": "$1}'      For each line, print the second field (the word),
                            then a colon and a space, and then the first
                            field (the count).
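The pipeline above prints one word per line. To get the single-line form from the question ('pig: 7, horse: 3, cat: 5'), the last awk step can join the lines. This is only a sketch: count_words is a hypothetical name, the words are assumed to be plain text without regex metacharacters, and the output comes out in sorted order rather than in the order the words were given:

```shell
#!/usr/bin/env bash
# count_words FILE WORD... (hypothetical helper name)
# Reads FILE once and prints "word: count" pairs on a single line.
# Assumes the words contain no regex metacharacters.
count_words() {
    local file=$1
    shift
    local pattern
    pattern=$(IFS='|'; printf '%s' "$*")   # join words as word1|word2|...
    grep -Eo -- "$pattern" "$file" |
        sort |
        uniq -c |
        awk '{ printf "%s%s: %d", sep, $2, $1; sep = ", " }
             END { if (NR) print "" }'     # final newline only if anything matched
}
```

It would be called as count_words txt.file pig horse cat. Substrings count as matches (the "cat" in "category", for example); adding -w to the grep options restricts it to whole words.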

How to sort,uniq and display line that appear more than X times

I have a file like this:
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.1
80.13.178.3
80.13.178.3
80.13.178.3
80.13.178.4
80.13.178.4
80.13.178.7
I need to display unique entries for repeated line (similar to uniq -d) but only entries that occur more than just twice (twice being an example so flexibility to define the lower limit.)
Output for this example should be like this when looking for entries with three or more occurrences:
80.13.178.2
80.13.178.3
Feed the output from uniq -cd to awk
sort test.file | uniq -cd | awk -v limit=2 '$1 > limit{print $2}'
With pure awk:
awk '{a[$0]++}END{for(i in a){if(a[i] > 2){print i}}}' a.txt
It iterates over the file and counts the occurrences of every IP. At the end of the file it outputs every IP which occurs more than 2 times.
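To make the lower limit configurable in the pure-awk version, the threshold can be passed in with -v. A minimal sketch, with lines_at_least as a hypothetical name; note that it uses "at least MIN" rather than "more than", and pipes through sort only because awk's for-in order is unspecified:

```shell
#!/usr/bin/env bash
# lines_at_least FILE MIN (hypothetical helper name)
# Prints each distinct line that occurs at least MIN times.
lines_at_least() {
    awk -v min="$2" '
        { count[$0]++ }
        END { for (line in count) if (count[line] >= min) print line }
    ' "$1" | LC_ALL=C sort
}
```

For the example above, lines_at_least test.file 3 would list the entries with three or more occurrences.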

Unix shell script to sort files depending on the 'date string' present in their file name

I am trying to sort files in a directory, depending on the 'date string' attached in the file name, for example files looks as below
SSA_F12_05122013.request.done
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
Where 05122013,12142012 and 01062013 represents the dates in format.
Please help me in providing a unix shell script to sort these files on the date string present in their file name(in descending and ascending order).
Thanks in advance.
Hmmm... why call on heavyweights like awk and Perl when sort itself has the capability to define what exactly to sort by?
ls SSA_F*.request.done | sort -k 1.13,1.16 -k 1.9,1.10 -k 1.11,1.12
Each -k option defines a "sort key":
-k 1.13,1.16
This defines a sort key ranging from field 1, column 13 to field 1, column 16. (A field is by default delimited by whitespace, which your filenames don't have.)
If your filenames are varying in length, defining the underscore as field separator (using the -t option) and then addressing columns in the third field would be the way to go.
Refer to man sort for details. Use the -r option to sort in descending order.
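The -t variant mentioned above could look like the following sketch. It assumes the date is always the first eight characters of the third underscore-separated field and is laid out as MMDDYYYY (which the sample name 12142012 suggests); sort_by_name_date is a hypothetical name:

```shell
#!/usr/bin/env bash
# sort_by_name_date [SORT_OPTIONS] (hypothetical helper name)
# Reads file names on stdin and sorts them by the date in the third
# underscore-separated field, assumed to be MMDDYYYY.
sort_by_name_date() {
    # Keys: year (chars 5-8), then month (1-2), then day (3-4) of field 3.
    LC_ALL=C sort -t_ -k3.5,3.8 -k3.1,3.2 -k3.3,3.4 "$@"
}

# Ascending:  ls SSA_F*.request.done | sort_by_name_date
# Descending: ls SSA_F*.request.done | sort_by_name_date -r
```

Because the keys are addressed relative to the third field, this keeps working when the first two parts of the name vary in length.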
One way with awk and sort (gensub requires GNU awk):
ls -1|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|sort|awk '$0=$NF'
if we break it down:
ls -1|
awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
The ls -1 is just an example; you probably have your own way to get the file list, one name per line.
A quick test:
kent$ echo "SSA_F13_12142012.request.done
SSA_F12_05122013.request.done
SSA_F14_01062013.request.done"|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
SSA_F12_05122013.request.done
ls -lrt *.done | perl -lane '@a=split /_|\./,$F[scalar(@F)-1];$a[2]=~s/(..)(..)(....)/$3$1$2/g;print $a[2]." ".$_' | sort -rn | awk '{$1=""}1'
ls *.done | perl -pe 's/^.*_(..)(..)(....)/$3$1$2$&/' | sort -rn | cut -b9-
This would do.
