Using Linux commands:
I have a quoted CSV file which I sorted by the first column and then the second column. Now I want to remove duplicates where both the first and the second column match. How can this be done? uniq doesn't seem to be enough, or is it?
You could reverse (rev) the file, then uniq ignoring the first N-2 fields (everything but the first two columns), then rev again.
rev | uniq -f N-2 -u | rev
Okay, I better understand what you need now. What about using awk?
http://www.unix.com/shell-programming-scripting/62574-finding-duplicates-columns-removing-lines.html
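For instance, a minimal awk sketch along those lines (assuming the file is named sorted.csv and the quoted fields contain no embedded commas) keeps only the first line for each first-column/second-column pair:
awk -F',' '!seen[$1 "," $2]++' sorted.csv > deduped.csv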
I'm trying to assign a variable in bash to the file in this directory with the largest number before the '.tar.gz' and I'm drawing a complete blank on the best way to approach this:
ls /dirname | sort
daily-500-12345.tar.gz
daily-500-12345678.tar.gz
daily-500-987654321.tar.gz
weekly-200-1111111.tar.gz
monthly-100-8675309.tar.gz
ls /dirname | sort -Vrt - -k3,3
-V Natural sort
-r Reverse, so you can use head -1 to get the first line only
-t - Use hyphen as field separator
-k3,3 Sort using only the third field
Output:
daily-500-987654321.tar.gz
daily-500-12345678.tar.gz
monthly-100-8675309.tar.gz
weekly-200-1111111.tar.gz
daily-500-12345.tar.gz
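To actually capture that in a variable, something like this should do it (the variable name is arbitrary; it just takes the first line of the reverse-sorted listing shown above):
largest=$(ls /dirname | sort -Vrt - -k3,3 | head -1)
echo "$largest"    # daily-500-987654321.tar.gz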
Is there a way to sort a csv file based on the 1st column using some shell command?
I have this huge file with more than 150k lines, hence I can't do it in Excel :( Is there an alternate way?
sort -k1 -n -t, filename should do the trick.
-k1 starts the sort key at column 1 (the key runs to the end of the line; use -k1,1 to sort on column 1 only).
-n sorts numerically instead of lexicographically (so "11" will not come before "2", "3", ...).
-t, sets the delimiter (what separates values in your file) to , since your file is comma-separated.
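A quick check on a couple of made-up rows shows why -n matters here:
$ printf '11,a\n2,b\n' | sort -k1 -n -t,
2,b
11,a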
Using csvsort.
Install csvkit if not already installed.
brew install csvkit
Sort CSV by first column.
csvsort -c 1 original.csv > sorted.csv
I don't know why the above solution was not working in my case.
15,5
17,2
18,6
19,4
8,25
8,90
9,47
9,49
10,67
10,90
13,96
159,9
However, this command solved my problem:
sort -t"," -k1n,1 fileName
I have a text file:
$ cat text
542,8,1,418,1
542,9,1,418,1
301,34,1,689070,1
542,9,1,418,1
199,7,1,419,10
I'd like to sort the file based on the first column and remove duplicates using sort, but things are not going as expected.
Approach 1
$ sort -t, -u -b -k1n text
542,8,1,418,1
542,9,1,418,1
199,7,1,419,10
301,34,1,689070,1
It is not sorting based on the first column.
Approach 2
$ sort -t, -u -b -k1n,1n text
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
It removes the 542,9,1,418,1 line but I'd like to keep one copy.
It seems that the first approach removes duplicates but does not sort correctly, whereas the second one sorts correctly but removes more than I want. How can I get the correct result?
The problem is that when you provide a key to sort, uniqueness is checked on that particular field only. Since the line 542,8,1,418,1 is output, sort sees the next two lines starting with 542 as duplicates and filters them out.
Your best bet would be to either sort all columns:
sort -t, -nk1,1 -nk2,2 -nk3,3 -nk4,4 -nk5,5 -u text
or
use awk to filter duplicate lines and pipe it to sort.
awk '!_[$0]++' text | sort -t, -nk1,1
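For the sample text above, that awk-then-sort pipeline should keep one copy of the duplicated 542,9,1,418,1 line and produce:
199,7,1,419,10
301,34,1,689070,1
542,8,1,418,1
542,9,1,418,1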
When sorting on a key, you must provide the end position of the key as well; otherwise sort extends the key to include all of the following fields too.
The following should work:
sort -t, -u -k1,1n text
Sorry for the vague title, I couldn't think of a better one...
I have 2 tab-delimited files with identical first columns (different numbers of total columns). I would like to sort both files by their first column.
I think I could do this either with the -t\t option or the -k1,12 option (since the first column is never longer than 12 characters). Both options produce the same (wrong) output.
Even though both files have the same first column, they are sorted differently. Notice that in file1 I get the order 23, 29, 2, while in file2 I get 2, 23, 29.
$ head file1 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs10000023
rs10000029
rs1000002
rs10000030
$ head file2 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs1000002
rs10000023
rs10000029
rs10000030
How can I sort both files such that the first column is in the same order in each?
Thank you!
sort -t $'\t' -k 1,1
Use $'\t' to have the shell interpret \t as a tab since sort doesn't parse escape sequences. Use -k to tell it to only sort on the first field rather than the entire line.
You might also want the -V flag if you want 2 to sort in between 0 and 10.
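A minimal sketch for getting both files into the same order (the .sorted output names are just placeholders):
sort -t $'\t' -k 1,1 file1 > file1.sorted
sort -t $'\t' -k 1,1 file2 > file2.sorted
Since both files are sorted with the same key and separator, their first columns should end up in the same order.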
I have a utility script in Python:
#!/usr/bin/env python
import sys

unique_lines = []
duplicate_lines = []

for line in sys.stdin:
    if line in unique_lines:
        duplicate_lines.append(line)
    else:
        unique_lines.append(line)
        sys.stdout.write(line)

# optionally do something with duplicate_lines
This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?
Reason for asking: I need this functionality on a system on which I cannot execute Python from anywhere.
The UNIX Bash Scripting blog suggests:
awk '!x[$0]++'
This command tells awk which lines to print. The variable $0 holds the entire contents of a line, and square brackets are array access. So, for each line of the file, the node of the array x is incremented, and the line is printed if the content of that node was not (!) previously set.
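A quick demonstration on made-up input (the first occurrence of each line is kept, and the original order is preserved):
$ printf 'b\na\nb\nc\na\n' | awk '!x[$0]++'
b
a
c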
A late answer - I just ran into a duplicate of this - but perhaps worth adding...
The principle behind #1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:
cat -n file_name | sort -uk2 | sort -n | cut -f2-
Use cat -n to prepend line numbers
Use sort -u to remove duplicate data (-k2 says 'start at field 2 for the sort key')
Use sort -n to sort by prepended number
Use cut to remove the line numbering (-f2- says 'select field 2 till end')
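On the same kind of made-up input, this pipeline should behave like a stable uniq (assuming GNU sort, where -u keeps the first of each run of equal keys):
$ printf 'b\na\nb\nc\na\n' | cat -n | sort -uk2 | sort -n | cut -f2-
b
a
c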
To remove duplicates across 2 files:
awk '!a[$0]++' file1.csv file2.csv
Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach, adding an index field with awk followed by multiple rounds of sort and uniq, involves less memory overhead. The following snippet works in bash:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
Now you can check out this small tool written in Rust: uq.
It performs uniqueness filtering without having to sort the input first, therefore can apply on continuous stream.
There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:
uq remembers the occurrence of lines using their hash values, so it doesn't use as much memory when the lines are long.
uq can keep the memory usage constant by setting a limit on the number of entries to store (when the limit is reached, there is a flag to control whether to override or to die), while the awk solution could run into OOM when there are too many lines.
Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave 1 copy of duplicates). The awk and perl solutions can't really be modified to do this, yours can! I may have also needed the lower memory use since I will be uniq'ing like 100,000,000 lines 8-). Just in case anyone else needs it, I just put a "-u" in the uniq portion of the command:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq -u --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:
awk '{
    if ($0 != PREVLINE) print $0;
    PREVLINE = $0;
}'
Note that plain uniq already handles consecutive duplicates: http://man7.org/linux/man-pages/man1/uniq.1.html