Why doesn't uniq give non-duplicated results - shell

find ./2012 -type f | cut -d '/' -f 5 | uniq
The usual filenames look like
./2012/NY/F/Zoe
./2012/NJ/M/Zoe
I supposed the command above would give a non-duplicated list of file names, with Zoe appearing only once, but it turns out not so.
Why? And how should I write it to get the desired result?

uniq only detects duplicates if they're in consecutive lines. The usual idiom
is to sort | uniq to ensure that any duplicates will appear together.
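For example (made-up names, just to show the behaviour): with duplicates that are not adjacent, uniq leaves them in place; sorting first groups them so uniq can collapse them:
$ printf 'Zoe\nAmy\nZoe\n' | uniq
Zoe
Amy
Zoe
$ printf 'Zoe\nAmy\nZoe\n' | sort | uniq
Amy
Zoe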

uniq requires the duplicates to be adjacent, which means you need to sort the input, which means you might as well use sort -u:
find 2012 -type f | cut -d/ -f5 | sort -u

Related

Get most popular domains from log

I am trying to get the most popular domain from a log file.
The log format is like this:
197.123.43.59, 27/May/2015:01:00:11 -0600, https://m.facebook.com/
I am interested only in the domain, and I want output as follows:
XXXX facebook.com
where XXXX is the number of similar entries in the log.
A one-liner unix command, anyone?
Edit
I tried the following
grep -i * sites.log | sort | uniq -c | sort -nr | head -10 &> popular.log
but popular.log is empty, implying that the command is wrong
perl -nle '$d{$1}++ if m!//([^/]+)!; END {foreach(sort {$d{$b} <=> $d{$a}} keys(%d)) {print "$d{$_}\t$_"};}' your.log
if you don't mind perl
uniq -c counts unique occurrences, but requires sorted input.
sort sorts a stream of data.
grep has a flag, -o, which prints only the part of each line that matched the regex.
These three parts put together are what you need to perform a map/reduce on this and get the data you want.
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c
The grep extracts just the domain name and top-level domain, all the entries are sorted by sort, and uniq counts the occurrences.
Adding sort -nr at the end gives you a list with the most frequent entries at the top:
grep -o '[^.]*\.[^.]*$' logfile | sort | uniq -c | sort -nr

How to find most frequent string in file

I have a question about a bash script. Let's say there is a file which contains lines; each line has a path to a file and a date. The problem is how to find the most frequent path.
Thanks in advance.
Here's a suggestion
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
#   cut -d' ' -f1   select the file column
#   sort            sort the files
#   uniq -c         print counts
#   sort -rn        sort on the counts, descending
#   head -n1        print the top result
Example use:
$ cat file.txt
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileB jan:17:13:46:27:2015
/home/admin/fileC jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
3 /home/admin/fileA
You can strip the 3 out of the final result with another cut.
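For example, one way to do that (shown here with awk instead of a fixed-width cut, since uniq -c left-pads the count with spaces):
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1 | awk '{print $2}'
/home/admin/fileA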
Reverse each line, cut off the beginning (the date), reverse them again, then sort and count unique lines:
cat file.txt | rev | cut -b 22- | rev | sort | uniq -c
If you're absolutely sure you won't have whitespace in your paths, you can avoid rev altogether:
cat file.txt | cut -d " " -f 1 | sort | uniq -c
If the output is too long to inspect visually, aioobe's suggestion of following this with sort -rn | head -n1 will serve you well.
It's worth noting, as aioobe mentioned, that many unix commands optionally take a file argument. By using it, you can avoid the extra cat command at the beginning, supplying the file directly to the next command:
cat file.txt | rev | ... vs rev file.txt | ...
While I personally find the first option both easier to remember and understand, the second is preferred by many (most?) people, as it saves system resources (specifically, the memory and references used by an additional process) and can have better performance in some specific use cases. Wikipedia's cat article discusses this in detail.

How to sort the results of find (including nested directories) alphabetically in bash

I have a list of directories based on the results of running the find command in bash. As an example, find returns these files:
test/a/file
test/b/file
test/file
test/z/file
I want to sort the output so it appears as:
test/file
test/a/file
test/b/file
test/z/file
Is there any way to sort the results within the find command, or by piping the results into sort?
If you have the GNU version of find, try this:
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}'
To use these file names in a loop, do
find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}' | while read file; do
    # use $file
done
The find command prints three things for each file: (1) its directory, (2) its depth in the directory tree, and (3) its full name. By including the depth in the output we can use sort -n to sort test/file above test/a/file. Finally we use awk to strip out the first two columns since they were only used for sorting.
Using \0 as a separator between the three fields allows us to handle file names with spaces and tabs in them (but not newlines, unfortunately).
$ find test -type f
test/b/file
test/a/file
test/file
test/z/file
$ find test -type f -printf '%h\0%d\0%p\n' | sort -t '\0' -n | awk -F'\0' '{print $3}'
test/file
test/a/file
test/b/file
test/z/file
If you are unable to modify the find command, then try this convoluted replacement:
find test -type f | while read file; do
    printf '%s\0%s\0%s\n' "${file%/*}" "$(tr -dc / <<< "$file")" "$file"
done | sort -t '\0' | awk -F'\0' '{print $3}'
It does the same thing, with ${file%/*} being used to get a file's directory name and tr being used to keep only the slashes (one per directory level), which stands in for the file's depth.
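For instance, a quick illustration of those two pieces on one of the sample paths:
$ file=test/a/file
$ echo "${file%/*}"
test/a
$ tr -dc / <<< "$file"
//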
(I sure hope there's an easier answer out there. What you're asking doesn't seem that hard, but I am blanking on a simple solution.)
find test -type f -printf '%h\0%p\n' | sort | awk -F'\0' '{print $2}'
The result of find is, for example,
test/a'\0'test/a/file
test'\0'test/file
test/z'\0'test/z/file
test/b'\0'test/b/text file.txt
test/b'\0'test/b/file
where '\0' stands for null character.
These compound strings can be properly sorted with a simple sort:
test'\0'test/file
test/a'\0'test/a/file
test/b'\0'test/b/file
test/b'\0'test/b/text file.txt
test/z'\0'test/z/file
And the final result is
test/file
test/a/file
test/b/file
test/b/text file.txt
test/z/file
(Based on John Kugelman's answer, with the "depth" element removed, as it is absolutely redundant here.)
If you want to sort alphabetically, the best way is:
find test -print0 | sort -z
(The example in the original question actually wanted files before directories, which is not the same and requires extra steps)
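If you then need to loop over that NUL-delimited output, one possible sketch (the loop body is just a placeholder):
find test -print0 | sort -z | while IFS= read -r -d '' f; do
    printf '%s\n' "$f"   # replace with whatever you need to do with "$f"
done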
Try this. For reference, it first sorts on the second character of the second field, which only exists for the top-level file, with r for reverse, meaning it comes first; after that it sorts on the first character of the second field. [-t is the field delimiter, -k is the key]
find test -name file |sort -t'/' -k2.2r -k2.1
Do an info sort for more info. There are a ton of different ways to use -t and -k together to get different results.
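On the sample tree from the question, that command should print the files in the requested order:
test/file
test/a/file
test/b/file
test/z/file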

Sort by number of occurrences

What's the quickest way to sort items by number of occurrences on a Linux terminal?
Ideally, I'm looking for a one-liner.
some_command | sort | uniq -c | sort -n
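For example, with some throwaway input, so you can see the shape of the output (count first, least frequent at the top):
$ printf 'b\na\nb\nc\nb\na\n' | sort | uniq -c | sort -n
      1 c
      2 a
      3 b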

How can I sort file names by version numbers?

In the directory "data" are these files:
command-1.9a-setup
command-2.0a-setup
command-2.0c-setup
command-2.0-setup
I would like to sort the files to get this result:
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
I tried this
find /data/ -name 'command-*-setup' | sort --version-sort --field-separator=- -k2
but the output was
command-1.9a-setup
command-2.0a-setup
command-2.0c-setup
command-2.0-setup
The only way I found that gave me my desired output was
tree -v /data
How could I get with sort the output in the wanted order?
Edit: It turns out that Benoit was sort of on the right track and Roland tipped the balance
You simply need to tell sort to consider only field 2 (add ",2"):
find ... | sort --version-sort --field-separator=- --key=2,2
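Spelled out with the find from the question (equivalent to the shorter -t/-V/-k form shown further down):
find /data/ -name 'command-*-setup' | sort --version-sort --field-separator=- --key=2,2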
Original answer (can be ignored):
If none of your filenames contain spaces between the hyphens, you can try this:
find ... | sed 's/.*-\([^-]*\)-.*/\1 \0/;s/[^0-9] /.&/' | sort --version-sort --field-separator=- --key=2 | sed 's/[^ ]* //'
The first sed command makes the lines look like this (I added "10" to show that the sort is numeric):
1.9.a command-1.9a-setup
2.0.c command-2.0c-setup
2.0.a command-2.0a-setup
2.0 command-2.0-setup
10 command-10-setup
The extra dot makes the letter-suffixed version number sort after the version number without the suffix. The second sed command removes the prefixed version number from each line.
There are lots of ways this can fail.
If you tell sort that you only want to consider the second field (-k2), don't complain that it does not consider the third one.
In your case, run sort --version-sort without any other arguments; maybe this will suit better.
Looks like this works:
find /data/ -name 'command-*-setup' | sort -t - -V -k 2,2
Not with sort, but it works:
tree -ivL 1 /data/ | perl -nlE 'say if /\Acommand-[0-9][0-9a-z.]*-setup\z/'
-v: sort the output by version
-i: makes tree not print the indentation lines
-L level: max display depth of the directory tree
Another way to do this is to pad your numbers.
This example pads all numbers to 8 digits.
Then, it does a plain alphanumeric sort.
Then, it removes the pad.
$ pad() { perl -pe 's/(\d+)/0000000\1/g' | perl -pe 's/0*(\d{8})/\1/g'; }
$ unpad() { perl -pe 's/0*([1-9]\d*|0)/\1/g'; }
$ cat files | pad | sort | unpad
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
command-10.1-setup
To get some insight into how this works, let's look at the padded sorted result:
$ cat files | pad | sort
command-00000001.00000009a-setup
command-00000002.00000000-setup
command-00000002.00000000a-setup
command-00000002.00000000c-setup
command-00000010.00000001-setup
You'll see that with all the numbers nicely padded to 8 digits, the alphanumeric sort puts the filenames into their desired order.
Old post, but... ls -l --sort=version may be of assistance (although for the OP's example the sort is the same as that done by a plain ls -l on RHEL 7.2):
command-1.9a-setup
command-2.0a-setup
command-2.0c-setup
command-2.0-setup
YMMV, I guess.
$ cat files
command-1.9a-setup
command-2.0c-setup
command-10.1-setup
command-2.0a-setup
command-2.0-setup
$ cat files | sort -t- -k2,2 -n
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
command-10.1-setup
$ tac files | sort -t- -k2,2 -n
command-1.9a-setup
command-2.0-setup
command-2.0a-setup
command-2.0c-setup
command-10.1-setup
I have files in a folder and need to get their names in sorted order, based on the number. E.g.:
abc_dr-1.txt
hg_io-5.txt
kls_er_we-3.txt
sd-4.txt
sl_rt_we_yh-2.txt
I need to sort them based on the number, so I used this:
ls -1 | sort -t '-' -nk2
It gave me the files in sorted order based on the number.
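With the files listed above, that should print them in numeric order:
abc_dr-1.txt
sl_rt_we_yh-2.txt
kls_er_we-3.txt
sd-4.txt
hg_io-5.txt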
