Sort by number of occurrences - bash

What's the quickest way to sort items by number of occurrences on a Linux terminal?
Ideally, I'm looking for a one-liner.

some_command | sort | uniq -c | sort -n
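For illustration, with printf standing in for some_command (hypothetical input):
$ printf 'b\na\nb\nc\nb\na\n' | sort | uniq -c | sort -n
      1 c
      2 a
      3 b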

Related

Get most popular domains from log

I am trying to get the most popular domains from a log file.
The log format is like this:
197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/
I am interested only in the domain, and I want output as follows:
XXXX facebook.com
where XXXX is the number of matching entries in the log.
A one-liner Unix command, anyone?
Edit
I tried the following:
grep -i * sites.log | sort | uniq -c | sort -nr | head -10 &> popular.log
but popular.log is empty, implying that the command is wrong.
perl -nle '$d{$1}++ if m!//([^/]+)!; END {foreach (sort {$d{$b} <=> $d{$a}} keys %d) {print "$d{$_}\t$_"}}' your.log
...if you don't mind Perl. The END block prints each domain with its count, most frequent first (note the numeric comparison operator <=> in the sort block).
uniq -c counts unique occurrences, but requires sorted input.
sort sorts a stream of data.
grep has a flag -o which returns only the part of each line that matched the regex.
These three parts put together are what you need to perform a map/reduce on this data and get what you want.
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c
The grep extracts just the domain name and its top-level domain, sort groups all the entries together, and uniq counts the occurrences.
Adding sort -nr at the end gives you a list with the most frequent entries at the top:
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c | sort -nr
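As a sanity check, a minimal sketch running the grep stage on the sample line from the question:
$ printf '197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/\n' | grep -o '[^.]*\.[^.]*$'
facebook.com/
(The trailing slash comes along because it is not a dot; appending sed 's,/$,,' would strip it.)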

Why doesn't uniq give non-duplicated results?

find ./2012 -type f | cut -d '/' -f 5 | uniq
The filenames look like this:
./2012/NY/F/Zoe
./2012/NJ/M/Zoe
I expected the command above to give a non-duplicated list of file names, printing names like Zoe only once, but it turns out otherwise.
Why? And how should I write the command to get the desired result?
uniq only detects duplicates if they're in consecutive lines. The usual idiom
is to sort | uniq to ensure that any duplicates will appear together.
uniq requires the duplicates to be adjacent, which means you need to sort the input, which means you might as well use sort -u:
find 2012 -type f | cut -d/ -f5 | sort -u
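A minimal demonstration of the adjacency requirement:
$ printf 'Zoe\nMia\nZoe\n' | uniq
Zoe
Mia
Zoe
$ printf 'Zoe\nMia\nZoe\n' | sort -u
Mia
Zoe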

Unix shell script to sort files depending on the 'date string' present in their file name

I am trying to sort files in a directory by the 'date string' embedded in their file names. For example, the files look like this:
SSA_F12_05122013.request.done
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
where 05122013, 12142012, and 01062013 are dates in MMDDYYYY format.
Please help me with a Unix shell script to sort these files on the date string present in their file names (in both descending and ascending order).
Thanks in advance.
Hmmm... why call on heavyweights like awk and Perl when sort itself has the capability to define what exactly to sort by?
ls SSA_F*.request.done | sort -k 1.13,1.16 -k 1.9,1.10 -k 1.11,1.12
Each -k option defines a "sort key":
-k 1.13,1.16
This defines a sort key ranging from field 1, column 13 to field 1, column 16. (A field is by default delimited by whitespace, which your filenames don't have.)
If your filenames are varying in length, defining the underscore as field separator (using the -t option) and then addressing columns in the third field would be the way to go.
Refer to man sort for details. Use the -r option to sort in descending order.
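For example, a sketch of that variant, assuming the file names from the question: with -t _ the third field is "05122013.request.done", so columns 5-8 hold the year, 1-2 the month, and 3-4 the day.
ls SSA_F*.request.done | sort -t _ -k 3.5,3.8 -k 3.1,3.2 -k 3.3,3.4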
One way with gawk and sort:
ls -1|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|sort|awk '$0=$NF'
If we break it down:
ls -1 |                # list the file names, one per line
awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}' |  # turn the MMDDYYYY field into YYYYMMDD and prepend it as a sort key
sort |                 # sort by the YYYYMMDD key
awk '$0=$NF'           # strip the key, keeping only the original file name
The ls -1 is just an example; I assume you have your own way to produce the file list, one name per line. (Note that gensub is specific to GNU awk.)
A quick test:
kent$ echo "SSA_F13_12142012.request.done
SSA_F12_05122013.request.done
SSA_F14_01062013.request.done"|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
SSA_F12_05122013.request.done
ls -lrt *.done | perl -lane '@a=split /_|\./,$F[scalar(@F)-1];$a[2]=~s/(..)(..)(....)/$3$1$2/g;print $a[2]." ".$_' | sort -rn | awk '{$1=""}1'
ls *.done | perl -pe 's/^.*_(..)(..)(....)/$3$1$2$&/' | sort -rn | cut -b9-
This prepends a YYYYMMDD key built from the date in the name, sorts on it in descending order, and then cuts the 8-digit key back off.

Bash Script: count unique lines in file

Situation:
I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:
ip.ad.dre.ss[:port]
Desired result:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to run this through a shell script of some sort that reduces it to lines of the format
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done; treat different ports as different addresses.
So far, I'm using this command to scrape all of the IP addresses from the log file (the pattern needs quoting so the shell doesn't interpret the parentheses):
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to scrape out all of the IP addresses that were sent by my address (which I don't care about).
I can then use the following to extract the unique entries:
sort -u ips.txt > intermediate.txt
What I don't know is how to aggregate the per-line counts with sort.
You can use the uniq command to get counts of sorted repeated lines:
sort ips.txt | uniq -c
To get the most frequent results at top (thanks to Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
To count the total number of unique lines (i.e. not considering duplicate lines) we can use uniq or Awk with wc:
sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l
Awk's arrays are hash-based, so counting is a single O(n) pass and may run a little faster than sorting.
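The idiom works because seen[$0]++ evaluates to 0 (false) the first time a line appears, so !seen[$0]++ is true exactly once per distinct line:
$ printf 'a\nb\na\n' | awk '!seen[$0]++'
a
b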
Generating text file:
$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175
real 0m1.193s
user 0m0.701s
sys 0m0.388s
$ time awk '!seen[$0]++' random.txt | wc -l
31175
real 0m0.675s
user 0m0.108s
sys 0m0.171s
This is the fastest way to get the counts of the repeated lines and have them nicely printed, sorted from the least frequent to the most frequent:
awk '{ seen[$0]++ } END { for (i in seen) print seen[i], i }' ips.txt | sort -n
If you don't care about performance and you want something easier to remember, then simply run:
sort ips.txt | uniq -c | sort -n
PS:
sort -n parses the field as a number, which is correct since we're sorting by the counts.
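A minimal end-to-end illustration with made-up addresses:
$ printf '10.0.0.1:80\n10.0.0.2:443\n10.0.0.1:80\n' | sort | uniq -c | sort -n
      1 10.0.0.2:443
      2 10.0.0.1:80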

How can I sort file names by version numbers?

In the directory "data" are these files:
command-1.9a-setup
command-2.0a-setup
command-2.0c-setup
command-2.0-setup
I would like to sort the files to get this result:
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
I tried this
find /data/ -name 'command-*-setup' | sort --version-sort --field-separator=- -k2
but the output was
command-1.9a-setup
command-2.0a-setup
command-2.0c-setup
command-2.0-setup
The only way I found that gave me my desired output was
tree -v /data
How can I get the desired order using sort?
Edit: It turns out that Benoit was sort of on the right track and Roland tipped the balance
You simply need to tell sort to consider only field 2 (add ",2"):
find ... | sort --version-sort --field-separator=- --key=2,2
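The difference matters because -k2 alone defines a key running from field 2 to the end of the line, so the "-setup" suffix leaks into the version comparison; -k2,2 stops the key at field 2. A small sketch:
$ printf 'command-2.0a-setup\ncommand-2.0-setup\n' | sort -t- -V -k2,2
command-2.0-setup
command-2.0a-setup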
Original answer (superseded, kept for reference):
If none of your filenames contain spaces between the hyphens, you can try this:
find ... | sed 's/.*-\([^-]*\)-.*/\1 &/;s/[^0-9] /.&/' | sort --version-sort --field-separator=- --key=2 | sed 's/[^ ]* //'
The first sed command makes the lines look like this (I added "10" to show that the sort is numeric):
1.9.a command-1.9a-setup
2.0.c command-2.0c-setup
2.0.a command-2.0a-setup
2.0 command-2.0-setup
10 command-10-setup
The extra dot makes the letter suffixed version number sort after the version number without the suffix. The second sed command removes the prefixed version number from each line.
There are lots of ways this can fail.
If you specify to sort that you only want to consider the second field (-k2), don't complain that it does not consider the third one.
In your case, run sort --version-sort without any other arguments; maybe that will suit you better.
Looks like this works:
find /data/ -name 'command-*-setup' | sort -t - -V -k 2,2
Not with sort, but it works:
tree -ivL 1 /data/ | perl -nlE 'say if /\Acommand-[0-9][0-9a-z.]*-setup\z/'
-v: sort the output by version
-i: makes tree not print the indentation lines
-L level: max display depth of the directory tree
Another way to do this is to pad your numbers.
This example pads all numbers to 8 digits.
Then it does a plain alphanumeric sort.
Then it removes the padding.
$ pad() { perl -pe 's/(\d+)/0000000\1/g' | perl -pe 's/0*(\d{8})/\1/g'; }
$ unpad() { perl -pe 's/0*([1-9]\d*|0)/\1/g'; }
$ cat files | pad | sort | unpad
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
command-10.1-setup
To get some insight into how this works, let's look at the padded sorted result:
$ cat files | pad | sort
command-00000001.00000009a-setup
command-00000002.00000000-setup
command-00000002.00000000a-setup
command-00000002.00000000c-setup
command-00000010.00000001-setup
You'll see that with all the numbers nicely padded to 8 digits, the alphanumeric sort puts the filenames into their desired order.
Old post, but... ls -l --sort=version may be of assistance (although for the OP's example the order comes out the same as plain ls -l on RHEL 7.2):
command-1.9a-setup
command-2.0a-setup
command-2.0c-setup
command-2.0-setup
YMMV, I guess.
$ cat files
command-1.9a-setup
command-2.0c-setup
command-10.1-setup
command-2.0a-setup
command-2.0-setup
$ cat files | sort -t- -k2,2 -n
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
command-10.1-setup
$ tac files | sort -t- -k2,2 -n
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
command-10.1-setup
I have files in a folder and need to get their names in sorted order, based on the number. E.g.:
abc_dr-1.txt
hg_io-5.txt
kls_er_we-3.txt
sd-4.txt
sl_rt_we_yh-2.txt
I need to sort them based on the number, so I used this:
ls -1 | sort -t '-' -nk2
It gives me the files sorted by number.
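For the files above, that produces:
$ ls -1 | sort -t '-' -nk2
abc_dr-1.txt
sl_rt_we_yh-2.txt
kls_er_we-3.txt
sd-4.txt
hg_io-5.txt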
