I have some data separated by tabs
8/1/12 15:22 622070509 Pig 123123123
8/1/12 15:27 569038096 Monkey 123123123
8/1/12 15:21 389549550 CatDog 123123
8/1/12 15:26 558161100 Monkey 1231245
8/1/12 15:28 274990777 CatDog 112312
8/1/12 15:22 274990777 CatDog 12341
I want to sort column four by number of occurrences, in descending order, so the output would look like this:
8/1/12 15:22 274990777 CatDog 12341
8/1/12 15:28 274990777 CatDog 112312
8/1/12 15:21 389549550 CatDog 123123
8/1/12 15:26 558161100 Monkey 1231245
8/1/12 15:27 569038096 Monkey 123123123
8/1/12 15:22 622070509 Pig 123123123
So far:
sort -t$'\t' -k4 file.txt
This sorts alphabetically just fine, but I'm not seeing an option for sorting by number of occurrences.
Learn to think algorithmically. How would you process the data by hand?
Count the number of occurrences of each value in the fourth column, giving you a pair {Name, Count}.
Join the main data with the {Name, Count} data, giving you an extra column that tells you the number of occurrences.
Sort the augmented data by descending Count, and within equal counts by Name.
Drop the Count column from the output.
There are Unix tools to support all those operations, with greater or lesser degrees of difficulty. There are, indeed, multiple ways to do each step. You can do it all in Perl or Python (or, indeed, awk). Or you can do it in stages, using awk, join, sort, and perhaps sed.
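For instance, here is a minimal sketch of that recipe using awk, sort, and cut (a two-pass awk stands in for the join step; it assumes the input really is tab-separated, lives in file.txt, and has the name in column 4, as in the sample above):
awk -F'\t' 'NR==FNR { count[$4]++; next }    # pass 1: count each name in column 4
            { print count[$4] "\t" $0 }      # pass 2: prepend that count to each line
' file.txt file.txt | sort -t$'\t' -k1,1nr -k5,5 | cut -f2-
The sort orders by the prepended count (descending) and then by the name (now field 5), and cut -f2- drops the helper column again.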
awk -F'\t' '{print $4}' infile.txt | sort | uniq -c | sort -nr | awk '{print $2}' | xargs -I % grep % infile.txt > outfile.txt
You have to set the flag for numerical comparison (-n):
sort -t$'\t' -k 4 -n file.txt
You can also define the second sorting column like this:
sort -t$'\t' -k4n,4 -k3,3 file.txt
This will sort first by the 4th column numerically, and when it finds equal items, it will sort by the 3rd column alphabetically.
Related
I want to delete any lines that have the same number at the end, for example:
Input:
abc 77777
rgtds 77777
aswa 77777
gdf 845
sdf 845
ytn 963
fgnb 963
Output:
abc 77777
gdf 845
ytn 963
Note: every line with a repeated number must be deleted, but one line from each group that shares the same number must stay.
I want to convert this text file to my output:
Input:
c:/files/company/aj/psohz.mp4 905
c:/files/company/rs/oxija.mp4 905
c:/files/company/nw/kzlkg.mp4 905
c:/files/company/wn/wpqov.mp4 905
c:/files/company/qi/jzdjg.mp4 905
c:/files/company/kq/dadfr..mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/fx/jszmn.jpg 7839
c:/files/company/me/plsqx.mp4 7839
c:/files/company/xm/uswjb.mp4 7839
c:/files/company/ay/pnnhu.pdf 8636184
c:/files/company/os/glwou.pdf 8636184
c:/files/company/px/kucdu.pdf 8636184
Output:
c:/files/company/kq/dadfr..mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
If the same numbers are always grouped together, you can use uniq (tested with the version from GNU coreutils):
uniq -f1 input.txt
-f1 means skip the first field when checking for duplicates.
Note that it returns the first element of each group, i.e. psohz instead of dadfr in your example. It's not clear which element of each group you wanted, as your expected output keeps the last one from the first group but the first element of the other groups.
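If it is actually the last element of each group that you want (like dadfr in the first group), a small variation should do it, still assuming the equal numbers are grouped together: reverse the file, take the first line of each group, and reverse back:
tac input.txt | uniq -f1 | tac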
If the same numbers aren't grouped together, use sort to group them together:
sort -k2 -su input.txt
-s means stable, i.e. you'll always get the first element of each group, but the groups won't appear in the original order in the output
-u means unique
-k2 means use only field 2 in comparisons
If you want the first element of each group, with the output kept in the same order as the input, you can use perl.
perl -ane 'print unless $seen{ $F[1] }++' -- input.txt
-n reads the input line by line
-a splits the input on whitespace into the @F array
The second column of every line is saved as a key in the %seen hash. The first time a number is seen, the line is printed; any following occurrence isn't, as $seen{ $F[1] } is then greater than 0, i.e. true.
If you know that there are always just two columns (i.e., no blanks in the filename) and that the lines with the same number are always in the same block, you can use uniq:
$ uniq -f1 infile
c:/files/company/aj/psohz.mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
-f1 says to ignore the first field when asserting uniqueness.
If you don't know about blanks, and the same numbers might be anywhere in the file, you can use awk:
$ awk '!a[$NF]++' infile
c:/files/company/aj/psohz.mp4 905
c:/files/company/kp/xmpye.jpg 7839
c:/files/company/ay/pnnhu.pdf 8636184
This counts the number of occurrences of the last field of each line, and if that number is zero before incrementing, the line gets printed. It's a compact way of expressing
awk '{ if (a[$NF] == 0) { print; a[$NF] += 1 } }' infile
I want to sort input by number of appearances. However I don't want to delete either the unique or non-unique lines. For instance if I was given the following input:
Not unique
This line is unique
Not unique
Also not unique
Also unique
Also not unique
Not unique
I'd be looking for a set of pipelined commands that would output the following:
This line is unique
Also unique
Also not unique
Also not unique
Not unique
Not unique
Not unique
Thank you for any help you can provide. I've been trying different combinations of uniq and sort but can't figure it out; the solution would preferably be a one-liner.
UPDATE: Thank you to all who responded, especially @batMan, whose answer was exactly what I was looking for and used commands I was familiar with.
I'm still trying to learn how to pipe and combine multiple commands for seemingly simple tasks, so is it possible to adapt his answer to work with 2 columns? For instance, if the original input had been:
Notunique dog
Thislineisunique cat
Notunique parrot
Alsonotunique monkey
Alsounique zebra
Alsonotunique beaver
Notunique dragon
And I wanted the output to be sorted by first column like so:
Thislineisunique cat
Alsounique zebra
Alsonotunique monkey
Alsonotunique beaver
Notunique dog
Notunique parrot
Notunique dragon
Thank you all for being so helpful in advance!
awk alone would be best for your updated question.
$ awk '{file[$0]++; count[$1]++; max_count = count[$1]>max_count ? count[$1] : max_count} END{ k=1; for(n=1; n<=max_count; n++){ for(i in count) if(count[i]==n) ordered[k++]=i } for(j=1; j<k; j++) for(line in file) if(line ~ ordered[j]) print line }' file
Alsounique zebra
Thislineisunique cat
Alsonotunique beaver
Alsonotunique monkey
Notunique parrot
Notunique dog
Notunique dragon
Explanation:
Part-1:
{file[$0]++; count[$1]++; max_count= count[$1]>max_count?count[$1]:max_count;}:
We store the input file in the file array. The count array keeps track of the count of each unique first field, which is what you want the file sorted by. max_count keeps track of the maximum count.
Part-2:
Once awk finishes reading the file, the content of count would be as follows (keys, values):
Alsounique 1
Notunique 3
Thislineisunique 1
Alsonotunique 2
Now our aim is to sort these keys by their values, as shown below. This is the key step: for each key in the output below, we'll iterate over the file array and print the lines that contain that key, which gives us the final desired output.
Alsounique
Thislineisunique
Alsonotunique
Notunique
The loop below stores the content of the count array in another array called ordered, sorted by value. The content of ordered will be the same as the output shown above.
for(n=1; n<=max_count; n++)
{
    for(i in count)
        if(count[i]==n)
            ordered[k++]=i
}
The final step is to iterate over the file array and print the lines in the order of the keys stored in the ordered array:
for(j=1; j<k; j++)
    for(line in file)
        if(line ~ ordered[j])
            print line
Solution 2:
The other possible solution would be using sort, uniq and awk/cut. But I wouldn't recommend this if your input file is very large, as multiple pipes invoke multiple processes, which slows down the whole operation.
$ cut -d ' ' -f1 file | sort | uniq -c | sort -n | awk 'FNR==NR{ordered[i++]=$2; next} {file[$0]++} END{for(j=0; j<i; j++) for(line in file) if(line ~ ordered[j]) print line}' - file
Alsounique zebra
Thislineisunique cat
Alsonotunique beaver
Alsonotunique monkey
Notunique parrot
Notunique dog
Notunique dragon
Previous solution (before the OP edited the question)
This could be done using sort, uniq and awk like this:
$ uniq -c <(sort f1) | sort -n | awk '{ for (i=1; i<$1; i++){print}}1'
1 Also unique
1 This line is unique
2 Also not unique
2 Also not unique
3 Not unique
3 Not unique
3 Not unique
I would use awk to count the number of times each line occurs, then print the lines out (prepended with their frequency) and sort numerically using sort -n:
awk 'FNR==NR{freq[$0]++; next} {print freq[$0],$0}' data.txt data.txt | sort -n
Sample Output
1 Also unique
1 This line is unique
2 Also not unique
2 Also not unique
3 Not unique
3 Not unique
3 Not unique
It's a Schwartzian transform really. If you want to discard the leading frequency column, just add | cut -d ' ' -f 2- to the end of the command.
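Putting the two pieces together, the whole thing would be:
awk 'FNR==NR{freq[$0]++; next} {print freq[$0],$0}' data.txt data.txt | sort -n | cut -d ' ' -f 2-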
uniq + sort + grep solution:
Extended inputfile contents:
Not unique
This line is unique
Not unique
Also not unique
Also unique
Also not unique
Not unique
Also not unique
Also not unique
Sorting the initial file beforehand:
sort inputfile > /tmp/sorted
uniq -u /tmp/sorted; uniq -dc /tmp/sorted | sort -n | cut -d' ' -f8- \
| while read -r l; do grep -x "$l" /tmp/sorted; done
The output:
Also unique
This line is unique
Not unique
Not unique
Not unique
Also not unique
Also not unique
Also not unique
Also not unique
----------
You may also enclose the whole job in a bash script:
#!/bin/bash
sort "$1" > /tmp/sorted # $1 - the 1st argument (filename)
uniq -u /tmp/sorted
while read -r l; do
grep -x "$l" /tmp/sorted
done < <(uniq -dc /tmp/sorted | sort -n | cut -d' ' -f8-)
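Usage (assuming you saved the script as freqsort.sh; the name is arbitrary):
bash freqsort.sh inputfile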
When using sort on the command line, why does the sorted order depend on which field delimiter I use? As an example,
$ # The test file:
$ cat test.csv
2,az,a,2
3,a,az,3
1,az,az,1
4,a,a,4
$ # sort based on fields 2 and 3, comma separated. Gives correct order.
$ LC_ALL=C sort -t, -k2,3 test.csv
4,a,a,4
3,a,az,3
2,az,a,2
1,az,az,1
$ # replace , by ~ as field separator, then sort as before. Gives incorrect order.
$ tr "," "~" < test.csv | LC_ALL=C sort -t"~" -k2,3
2~az~a~2
1~az~az~1
4~a~a~4
3~a~az~3
The second case not only gets the ordering wrong, but is inconsistent between field 2 (where az < a) and field 3 (where a < az).
There is a mistake in -k2,3. It tells sort to use a single key that starts at the 2nd field and ends at the 3rd field. That means the delimiter between them is also part of what is sorted and therefore counts as a character. That's why you get different orders with different delimiters.
What you want is the following:
LC_ALL=C sort -t"," -k2,2 -k3,3 file
And:
tr "," "~" < file | LC_ALL=C sort -t"~" -k2,2 -k3,3
That tells sort to sort by the 2nd field and, if the 2nd field has duplicates, to sort by the 3rd field.
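You can see the effect directly: under LC_ALL=C the key bytes are compared one by one, and ',' (0x2C) sorts before every letter while '~' (0x7E) sorts after 'z' (0x7A), so a key that spans the delimiter orders differently depending on which delimiter it contains. A quick sketch:
printf 'a,az\naz,a\n' | LC_ALL=C sort     # comma: "a,az" comes first
printf 'a~az\naz~a\n' | LC_ALL=C sort     # tilde: "az~a" comes first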
I have a very big file (many gigabytes) which looks like
input.txt
a|textA|2
c|textB|4
b|textC|5
e|textD|1
d|textE|4
b|textF|5
As a first step, I want to sort the lines numerically by the third column in descending order, and if lines have the same value in the third column, they must be sorted by the text of the first column in ascending order. And if lines have equal values in their 1st and 3rd columns, they must be sorted by the 2nd column in ascending order. The values in the second column are guaranteed to be unique.
So, I want the result to be:
desiredOutput.txt
b|textC|5
b|textF|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
I can take the first step:
sort -t\| -bfrnk3 path/to/input.txt > path/to/output.txt
But what are the next steps? And maybe the result can be achieved in a single pass?
EDIT
I tested sort -t '|' -k 3,3nr -k 1,1 -k 2,2 input.txt > output.txt. It gives the following "output.txt":
b|textF|5
b|textC|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
which is not what I want.
$ cat file
a|textA|2
c|textB|4
b|textC|5
e|textD|1
d|textE|4
b|textF|5
$ sort -t '|' -k 3,3nr -k 1,1 -k 2,2 file
b|textC|5
b|textF|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
$ sort -t '|' -k 3,3nr file
b|textC|5
b|textF|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
$
The n in 3,3nr means numeric sorting and r means reverse. The -k 1,1 -k 2,2 keys look optional here only because, when all given keys compare equal, sort falls back to a last-resort comparison of the whole line, which happens to give the same ascending order for this input.
If this is UNIX:
sort -k 3 path/to/input.txt > path/to/output.txt
You can use multiple -k flags to sort on more than one column. For example, to sort by 3rd column then 1st column as a tie breaker:
sort -k 3,3 -k 1,1 input.txt > output.txt
Relevant options from "man sort":
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition.
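As a quick illustration of the F[.C] syntax, keying on the 5th character of the 2nd field of the input.txt shown earlier (i.e. the letter after "text") would look like this (a sketch):
sort -t '|' -k 2.5,2 input.txt     # orders textA, textB, ..., textF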
You can do it with the sort command only:
sort -t"|" -k3 -n -k1 -k2 inputFile.txt
-k3 specifies that sort should sort by the 3rd column, and similarly -k1 and -k2 refer to the 1st and 2nd columns respectively.
I've been looking for this here, but did not find the exact case. Sorry if it is a duplicate, but I couldn't find it.
I have a huge file in Debian that contains 4 columns separated by "#", with the following format:
username#source#date#time
For example:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I want to print unique rows based on the first two columns, and if duplicates are found, it has to print the latest event based on date/time. With the list above, the result should be:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I have tested it using two commands:
cat file | sort -u -t# -k1,2
cat file | sort -r -u -t# -k1,2
But both of them print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40 --> Wrong line, it is older than the duplicate one
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
Is there any way to do it?
Thanks!
This should work:
tac file | awk -F# '!a[$1,$2]++' | tac
tac reverses the file so that the last (here, the latest) entry for each username#source pair is seen first, awk keeps only the first line seen for each pair, and the final tac restores the original order.
Output
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
First, you need to sort the input file to ensure the order of the lines, so that for duplicate username#source combinations the times are ordered. It is best to sort in reverse, so the last event comes first. This can be done with a simple sort, like:
sort -r < yourfile
From your input this will produce the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A222222#Juniper#2014-08-07#14:31:40
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
reverse-ordered lines, where for each username#source combination the latest event comes first.
Next, you need to filter the sorted lines to keep only the first event of each combination. This can be done with several tools, such as awk, uniq, or perl.
So, the solution is:
sort -r <yourfile | uniq -w16
or
sort -r <yourfile | awk -F# '!seen[$1,$2]++'
or
sort -r yourfile | perl -F'#' -lanE 'say $_ unless $seen{"$F[0],$F[1]"}++'
all of the above will print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
Finally, you can re-sort the unique lines however you want or need.
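For example, to put them back in ascending order (a sketch using the awk variant from above):
sort -r yourfile | awk -F# '!seen[$1,$2]++' | sort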
awk -F\# '{ p = ($1 FS $2 in a); a[$1 FS $2] = $0 }      # remember the latest line seen for each $1#$2 key
  !p { keys[++k] = $1 FS $2 }                            # record each key the first time it appears
  END { for (k = 1; k in keys; ++k) print a[keys[k]] }   # print the stored lines in first-appearance order
' file
Output:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
If you know for a fact that the first column is always 7 characters long and the second column is also 7 characters long, you can extract unique lines considering only the first 16 characters (both columns plus the two # separators) with:
uniq file -w 16
Since you want the later duplicate, you can reverse the data with tac prior to uniq and then reverse the output again:
tac file | uniq -w 16 | tac
Update: As commented below, uniq needs the duplicate lines to be adjacent (i.e. sorted). In that case this starts to become contrived, and the awk-based suggestions are better. Something like this would still work though:
sort -s -t"#" -k1,2 file | tac | uniq -w 16 | tac