Sort and remove duplicates based on column - bash

I have a text file:
$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.
Approach 1
$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1
It is not sorting based on the first column.
Approach 2
$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
It removes the 542,9,1,418,1 line but I'd like to keep one copy.
It seems that the first approach removes duplicates but doesn't sort correctly, whereas the second one sorts correctly but removes more than I want. How do I get the correct result?

The problem is that when you provide a key, sort checks uniqueness against that field only. Since the line 542,8,1,418,1 is output first, sort treats the following lines starting with 542 as duplicates of it and filters them out.
Your best bet would be to either sort all columns:
sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
or
use awk to filter out duplicate lines and pipe the result to sort:
awk '!_[$0]++' text | sort -t, -nk1,1
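For example, on the sample file from the question, the awk-then-sort pipeline keeps one copy of the repeated line while still ordering by the first column:

```shell
# Sample file contents from the question.
printf '%s\n' '542,8,1,418,1' '542,9,1,418,1' '301,34,1,689070,1' \
              '542,9,1,418,1' '199,7,1,419,10' |
awk '!_[$0]++' |   # drop exact duplicate lines, keeping the first
sort -t, -nk1,1    # then sort numerically on the first field
```

This prints 199,7,1,419,10 first and keeps both distinct 542 lines.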

When sorting on a key, you should also give the key's end position (e.g. -k1,1); otherwise sort extends the key from that field to the end of the line. Note, however, that with -u the sort key is also the uniqueness key, so sort -t, -u -k1,1n text de-duplicates on the first field alone and drops 542,9,1,418,1. To sort on the first column while removing only exact duplicate lines, de-duplicate whole lines first:
sort -u text | sort -t, -k1,1n
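A quick check, using the sample lines from the question, of how the key span changes both the ordering and what -u considers a duplicate:

```shell
sample='542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10'

# Key limited to field 1: sorted correctly, but -u also treats
# field 1 as the uniqueness key, so only one 542 line survives.
printf '%s\n' "$sample" | sort -t, -u -k1,1n

# De-duplicating whole lines first keeps both distinct 542 lines.
printf '%s\n' "$sample" | sort -u | sort -t, -k1,1n
```

The first command prints three lines; the second prints four.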

Related

how to check if a file is sorted on nth column in unix?

Let's say that I have a file as below (comma-separated):
$ cat test.csv
Rohit,India
Rahul,India
Surya Kumar,India
Shreyas Iyer,India
Ravindra Jadeja India
Rishabh Pant India
zzabc,abc
Now I want to check if the above file is sorted on the 2nd column.
I tried the command sort -ct"," -k2,2 test.csv
I'm expecting it to report disorder on the last line, but it reports disorder on the 2nd line.
Could anybody tell me what is wrong here, and how to get the expected output?
By default, when the keys compare equal, sort falls back to comparing the entire lines (the "last-resort" comparison). That is why it flags line 2: the second fields of Rohit,India and Rahul,India are equal, but Rohit sorts after Rahul. Some implementations of sort support an option which disables that last-resort comparison. Try adding -s:
sort -sc -t, -k2,2 test.csv
but note that I would expect the first out-of-order line to be Ravindra Jadeja India, since that line has no comma, so its 2nd field is the empty string, which should sort before "India".
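The effect of -s on -c can be seen with just the first two lines, whose second fields are equal (assuming GNU sort):

```shell
# Without -s, equal keys fall back to a whole-line comparison,
# so "Rohit,India" vs "Rahul,India" is reported as disorder.
printf 'Rohit,India\nRahul,India\n' | sort -c -t, -k2,2 \
  || echo "disorder reported"

# With -s, only the second field is compared, so the check passes.
printf 'Rohit,India\nRahul,India\n' | sort -cs -t, -k2,2 \
  && echo "in order on field 2"
```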

How to sort by numbers that are part of a filename in bash?

I'm trying to assign a variable in bash to the file in this directory with the largest number before the '.tar.gz' and I'm drawing a complete blank on the best way to approach this:
ls /dirname | sort
daily-500-12345.tar.gz
daily-500-12345678.tar.gz
daily-500-987654321.tar.gz
monthly-100-8675309.tar.gz
weekly-200-1111111.tar.gz
ls /dirname | sort -Vr -t- -k3,3
-V Natural sort
-r Reverse, so you can use head -1 to get the first line only
-t - Use hyphen as field separator
-k3,3 Sort using only the third field
Output:
daily-500-987654321.tar.gz
daily-500-12345678.tar.gz
monthly-100-8675309.tar.gz
weekly-200-1111111.tar.gz
daily-500-12345.tar.gz
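To capture the winner in a variable, as the question asked (a printf over a glob is a bit more robust than parsing ls; /dirname is the directory name from the question):

```shell
# Version-sort descending on the third hyphen-separated field,
# then keep only the top line.
newest=$(printf '%s\n' /dirname/*.tar.gz | sort -Vr -t- -k3,3 | head -n 1)
echo "$newest"
```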

Sort list of files with multiple sort keys

I want to be able to list all files within a directory sorted by multiple sort keys. For example:
Level_5_10_1.jpg
Level_5_1_1.jpg
I want Level_5_1_1.jpg to show up first. The sort order should start from the last number, so:
Level_4_2_1.jpg > Level_4_1_10.jpg
Level_3_2_1.jpg > Level_3_1_10.jpg
and so on..
I tried:
ls | sort -h -k3,3n -k2,2n -k1,1n -t_
but didn't get the result I wanted. For example, it listed Level_5_1_2.jpg before Level_1_2_1.jpg, which is incorrect.
Any ideas?
PS: This is a pastebin of the file list.
I've taken a small sample of filenames. When you split the filenames by _ with the -t option, the first field is 1 which would be "Level", field 2 would be the first number and so on. I'm not entirely sure of the order that you are specifically after, but I think this solution should at least provide you with something to work with. Note that I have truncated some of the results so that the overall pattern can hopefully be viewed more easily.
me#machine:~$ ls Level*.jpg | sort -t_ -k2n -k3n -k4n
Level_1_1_1.jpg
Level_1_1_2.jpg
Level_1_1_3.jpg
Level_1_1_4.jpg
Level_1_1_5.jpg
Level_1_2_1.jpg
Level_1_2_2.jpg
Level_1_2_3.jpg
Level_1_2_4.jpg
Level_1_2_5.jpg
Level_1_3_1.jpg
...
Level_1_10_5.jpg
Level_2_1_1.jpg
...
Level_2_1_5.jpg
Level_2_2_1.jpg
...
Level_2_2_5.jpg
Level_2_3_1.jpg
...
Level_2_10_5.jpg
Level_3_1_1.jpg
From your description, I think I'm getting the right results from this:
$ ls | sort -nt_ -k4,4 -k3,3 -k2,2
Remember that your first field (-k1) is the word "Level" in the files you've included in your question.
If you have really complex sorting needs, of course, you can always "map" your criteria onto simpler sortable items. For example, if your sort didn't include a -k option, you might do this:
$ ls | awk '{printf "%2d %2d %2d %s\n", $4, $3, $2, $0}' FS="[_.]" - | sort -n | awk '{print $NF}'
This takes the important fields, translates them in prefixed digits, sorts, then prints only the filename. You could use this technique if you wanted to map weekdays, or months, or something that doesn't sort naturally.
Of course, all this suffers from the standard set of ParsingLS issues.
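As a sketch of that mapping idea, here is a hypothetical set of files named after months, decorated with a sortable month number, sorted, then stripped back:

```shell
# Decorate each name with a zero-padded month number, sort
# numerically, then print only the original filename.
printf '%s\n' report_Mar.txt report_Jan.txt report_Feb.txt |
awk 'BEGIN {
       n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
       for (i = 1; i <= n; i++) num[m[i]] = i
     }
     { split($0, a, /[_.]/); printf "%02d %s\n", num[a[2]], $0 }' |
sort -n | awk '{print $2}'
```

This prints report_Jan.txt, report_Feb.txt, report_Mar.txt in that order.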

Sort CSV file based on first column

Is there a way to sort a csv file based on the 1st column using some shell command?
I have this huge file with more than 150k lines hence I can do it in excel:( is there an alternate way ?
sort -k1 -n -t, filename should do the trick.
-k1 sorts by column 1 (strictly, from column 1 through the end of the line; use -k1,1 to sort by column 1 alone).
-n sorts numerically instead of lexicographically (so "11" will not come before "2,3...").
-t, sets the delimiter (what separates values in your file) to , since your file is comma-separated.
Using csvsort.
Install csvkit if not already installed.
brew install csvkit
Sort CSV by first column.
csvsort -c 1 original.csv > sorted.csv
I don't know why the above solution was not working in my case. Given this data:
15,5
17,2
18,6
19,4
8,25
8,90
9,47
9,49
10,67
10,90
13,96
159,9
However, this command solved my problem:
sort -t"," -k1n,1 fileName
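A quick check on a few of those lines shows why the field-restricted numeric key matters (a plain lexicographic sort would put 159 before 8):

```shell
# Numeric key restricted to field 1: 8 < 10 < 15 < 159.
printf '15,5\n159,9\n8,25\n10,67\n' | sort -t, -k1n,1
```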

sort and uniq csv files

Using Linux commands:
I have a quoted CSV file which I sorted by the first column and then the second column. Now I want to remove duplicates where lines match in both the first and second column. How can this be done? uniq doesn't seem to be enough, or is it?
You could reverse each line (rev), run uniq ignoring the first N-2 fields (everything but what were the first two columns), then rev again:
rev | uniq -f N-2 | rev
Two caveats: uniq splits fields on whitespace, not commas, so the separators would need converting first; and uniq -u would drop every copy of a duplicated line, so omit -u if you want to keep one copy.
Okay, I better understand what you need now. What about using awk?
http://www.unix.com/shell-programming-scripting/62574-finding-duplicates-columns-removing-lines.html
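A common awk idiom for this (assuming the quoted fields contain no embedded commas) keeps the first line for each distinct first-two-column pair:

```shell
# !seen[...]++ is true only the first time a key is encountered,
# so only the first line per ($1, $2) pair is printed.
printf 'a,1,x\na,1,y\na,2,x\nb,1,x\n' | awk -F, '!seen[$1 FS $2]++'
```

On a real file you would run it as awk -F, '!seen[$1 FS $2]++' sorted.csv (sorted.csv being a hypothetical file name).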
