Sort command not working as expected? - shell

I have got a dataset like this
tack2#domain.com,2009-11-27
overflow#domain2.com,2009-11-27
overflow#domain2.com,2009-11-27
When I run this command to remove the duplicate entries in column 2:
sort -t ',' -k2 stars.txt -u
it removes entries based on column 1 instead. To remove the duplicates in column 2, I have to pass the -k3 flag:
sort -t ',' -k3 stars.txt -u
Can anyone explain why this happens? Why do I have to add 1 to the column number to deduplicate on that column?

On my system everything works correctly:
$ sort -t, -k1 -u 1.txt
overflow#domain2.com,2009-11-27
tack2#domain.com,2009-11-27
$ sort -t, -k2 -u 1.txt
tack2#domain.com,2009-11-27
It may be due to your locale.
Can you please repeat the commands with LANG=C? (If LC_ALL is set in your environment it overrides LANG, so LC_ALL=C is the more reliable test.)
$ LANG=C sort -t, -k1 -u 1.txt
$ LANG=C sort -t, -k2 -u 1.txt
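One thing worth checking besides the locale: -k2 with no end position makes the key run from field 2 to the end of the line, and -u keeps one line per distinct key, not per distinct line. A minimal sketch of the difference, assuming the goal is one line per date; the function names and the 2009-11-28 row are made up for illustration:

```shell
# Sample rows modelled on the question; the 2009-11-28 date is invented
# so that there are two distinct keys to look at.
sample() {
  printf '%s\n' \
    'tack2#domain.com,2009-11-27' \
    'overflow#domain2.com,2009-11-27' \
    'overflow#domain2.com,2009-11-28'
}

# One line per distinct value of field 2 only (-k2,2 stops the key at field 2).
unique_by_date() { sample | LC_ALL=C sort -t, -k2,2 -u; }

# One line per distinct (field 1, field 2) pair.
unique_by_both() { sample | LC_ALL=C sort -t, -k1,1 -k2,2 -u; }

unique_by_date | wc -l   # 2 lines: one per date
unique_by_both | wc -l   # 3 lines: every row differs in some field
```

Which of the lines sharing a key survives -u is up to the sort implementation, so this is best used when only the key matters.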

This is a typical awk job; no sorting is needed. Here is a short one-liner in case you want to give it a try:
awk -F, '!a[$2]++' file
will do the job.
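To spell out what that one-liner does: a[$2]++ evaluates to 0 the first time a given column-2 value is seen, so !a[$2]++ is true only for the first line with that value, and awk prints it. Input order is preserved. A small sketch on rows modelled on the question (the helper name and the extra 2009-11-28 row are assumptions for illustration):

```shell
# Keep the first line for each distinct value of column 2, in input order.
first_per_date() {
  printf '%s\n' \
    'tack2#domain.com,2009-11-27' \
    'overflow#domain2.com,2009-11-27' \
    'other#domain3.com,2009-11-28' |
    awk -F, '!a[$2]++'
}

first_per_date
# tack2#domain.com,2009-11-27
# other#domain3.com,2009-11-28
```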

Related

How do I get the total number of distinct values in a column in a CSV?

I have a CSV file named test.csv. It looks like this:
1,Color
1,Width
2,Color
2,Height
I want to find out how many distinct values are in the first column. The shell script should return 2 in this case.
I tried running sort -u -t, -k2,2 test.csv, which I saw on another question, but it printed out far more info than I need.
How do I write a shell script that prints the number of distinct values in the first column of test.csv?
Using awk you can do:
awk -F, '!seen[$1]++{c++} END{print c}' file
2
This awk command uses $1 as the key into an array named seen. seen[$1]++ evaluates to 0 the first time a key is encountered, so the condition !seen[$1]++ is true only for keys that have not been seen before. Each time that happens the counter c is incremented, and c is printed at the end.
Or
cut -d, -f1 file | sort -u | wc -l
Use cut to extract the first column, then sort to get the unique values, then wc to count them.
# List the first column of the CSV, sort it keeping only unique values, then count the lines.
awk -F, '{print $1}' test.csv |sort -u |wc -l
To ignore header:
awk -F, 'NR>1{print $1}' test.csv |sort -u |wc -l
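If it helps to see what the seen array ends up holding, here is a sketch that prints each distinct first-column value with its row count, followed by the total. The function name and the inline sample data (copied from the question) are only for illustration:

```shell
# Count rows per distinct value of column 1, then report the number of keys.
distinct_report() {
  printf '%s\n' '1,Color' '1,Width' '2,Color' '2,Height' |
    awk -F, '
      !seen[$1]++ { c++ }            # first sighting of this key
      END {
        for (k in seen) print k ": " seen[k] " rows"
        print "distinct: " c
      }' |
    LC_ALL=C sort                    # for-in order is unspecified in awk
}

distinct_report
# 1: 2 rows
# 2: 2 rows
# distinct: 2
```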

Sort text file with cat and sort concatenation

I have a text file whose content looks like this:
stuff,stuff,2012-12-12
morestuff,morestuff,2012-09-09
evenmorestuff,yeah,2012-08-02
and I want to pipe cat into sort so that the lines are printed on the command line in reverse order by date.
Not sure why you think you need to cat a file into sort, but here are two options:
cat yourFile | sort -t, -k3r
sort -t, -k3r yourFile
To test this I did
echo "stuff,stuff,2012-12-12
morestuff,morestuff,2012-09-09
evenmorestuff,yeah,2012-08-02" \
| sort -t, -k3r
output
stuff,stuff,2012-12-12
morestuff,morestuff,2012-09-09
evenmorestuff,yeah,2012-08-02
Finally, you can overwrite your existing file using the -o option:
sort -t, -o yourFile -k3r yourFile
Thanks to @karakfa for reminding me of your requirement for a reverse-order sort. This is accomplished by adding an r to the key specification, hence -k3r.
IHTH
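One refinement worth considering: -k3r starts the key at field 3 and runs to the end of the line, which is harmless here because the date is the last field, but -k3,3r states the intent explicitly. Because the dates are in YYYY-MM-DD form, plain string comparison already orders them chronologically. A sketch using the sample lines from the question (the function name is made up):

```shell
# Reverse-sort comma-separated lines on the ISO date in field 3.
newest_first() {
  printf '%s\n' \
    'evenmorestuff,yeah,2012-08-02' \
    'stuff,stuff,2012-12-12' \
    'morestuff,morestuff,2012-09-09' |
    LC_ALL=C sort -t, -k3,3r
}

newest_first
# stuff,stuff,2012-12-12
# morestuff,morestuff,2012-09-09
# evenmorestuff,yeah,2012-08-02
```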

Sort text file using bash sort

I'm trying to sort the following file by date, from earliest to latest:
$NAME DIA
# Date,Open,High,Low,Close,Volume,Adj Close
01-10-2014,169.91,169.98,167.42,167.68,11019000,167.68
29-04-2014,164.62,165.27,164.49,165.00,4581400,163.40
17-10-2013,152.11,153.59,152.05,153.48,9916600,150.26
06-09-2013,149.70,149.97,147.77,149.09,9001900,145.68
02-11-2012,132.56,132.61,130.47,130.67,5141300,125.01
01-11-2012,131.02,132.44,130.97,131.98,3807400,126.27
sort -t- -k3 -k2 -k1 DIA.txt gets the year right but scrambles the month and day.
Any help would be greatly appreciated.
This seems to produce correct output
sort -s -t- -k3,3 -k2,2 -k1,1
output:
$ sort -s -t- -k3,3 -k2,2 -k1,1 dia.txt
# Date,Open,High,Low,Close,Volume,Adj Close
01-11-2012,131.02,132.44,130.97,131.98,3807400,126.27
02-11-2012,132.56,132.61,130.47,130.67,5141300,125.01
06-09-2013,149.70,149.97,147.77,149.09,9001900,145.68
17-10-2013,152.11,153.59,152.05,153.48,9916600,150.26
29-04-2014,164.62,165.27,164.49,165.00,4581400,163.40
01-10-2014,169.91,169.98,167.42,167.68,11019000,167.68
I would try changing the date format first.
sed -r "s/(..)-(..)-(....)/\\3-\\2-\\1/" DIA.txt | sort
You can also change it back after sorting the lines.
sed -r "s/(..)-(..)-(....)/\\3-\\2-\\1/" DIA.txt | sort | sed -r "s/(....)-(..)-(..)/\\3-\\2-\\1/"
Each -k option takes a start position and an optional end position (-k POS1[,POS2]) and defines one sort key. To bring in further columns for resolving ties, repeat -k; the keys are compared in the order given. Note that with - as the separator, field 3 runs from the year to the end of the line, so the year key is limited to its first four characters:
sort -t'-' -k3.1,3.4 -k2,2 -k1,1 DIA.txt
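A caveat with -t-: each data line contains only two dashes, so field 3 is not just the year but everything from the year to the end of the line, prices included. Limiting the key to the first four characters of field 3 (-k3.1,3.4) lets the month and day keys actually break ties. A sketch under that assumption; the two rows are invented so that the later date has the lower opening price:

```shell
# Rows are hypothetical: same year, and the later date has the smaller Open value.
rows() {
  printf '%s\n' \
    '05-03-2014,100.00,101.00' \
    '20-01-2014,200.00,201.00'
}

# Key 1: year (chars 1-4 of field 3), key 2: month, key 3: day.
by_date() { rows | LC_ALL=C sort -t- -k3.1,3.4n -k2,2n -k1,1n; }

# For comparison: -k3,3 compares "2014,100.00,..." with "2014,200.00,...".
by_rest_of_line() { rows | LC_ALL=C sort -s -t- -k3,3 -k2,2 -k1,1; }

by_date          # 20-01-2014 first, then 05-03-2014
by_rest_of_line  # 05-03-2014 first, because 100.00 sorts before 200.00
```

On the sample data in the question both commands happen to agree, since prices rise with the date within each year.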

Unix shell script to sort files depending on the 'date string' present in their file name

I am trying to sort the files in a directory by the 'date string' embedded in their file names. For example, the files look like this:
SSA_F12_05122013.request.done
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
where 05122013, 12142012 and 01062013 represent the dates.
Please help me with a Unix shell script that sorts these files on the date string in their file names (in both descending and ascending order).
Thanks in advance.
Hmmm... why call on heavyweights like awk and Perl when sort itself has the capability to define what exactly to sort by?
ls SSA_F*.request.done | sort -k 1.13,1.16 -k 1.9,1.10 -k 1.11,1.12
Each -k option defines a "sort key":
-k 1.13,1.16
This defines a sort key ranging from field 1, column 13 to field 1, column 16. (A field is by default delimited by whitespace, which your filenames don't have.)
If your filenames are varying in length, defining the underscore as field separator (using the -t option) and then addressing columns in the third field would be the way to go.
Refer to man sort for details. Use the -r option to sort in descending order.
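For file names of varying length, the -t option mentioned above might be used like this. It is a sketch that assumes the date is always the third underscore-separated field and is laid out as MMDDYYYY (which the sample names suggest); the function name is made up:

```shell
# Field 3 is e.g. "05122013.request.done":
# chars 5-8 = year, chars 1-2 = month, chars 3-4 = day.
sort_by_name_date() {
  LC_ALL=C sort -t_ -k3.5,3.8 -k3.1,3.2 -k3.3,3.4 "$@"
}

printf '%s\n' \
  'SSA_F12_05122013.request.done' \
  'SSA_F13_12142012.request.done' \
  'SSA_F14_01062013.request.done' |
  sort_by_name_date
# SSA_F13_12142012.request.done
# SSA_F14_01062013.request.done
# SSA_F12_05122013.request.done
```

Because the keys carry no ordering options of their own, passing -r (sort_by_name_date -r) reverses all three and gives the descending order.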
One way with awk and sort:
ls -1|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|sort|awk '$0=$NF'
If we break it down:
ls -1|
awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
The ls -1 is just an example; I assume you have your own way of getting the file list, one name per line.
A quick test:
kent$ echo "SSA_F13_12142012.request.done
SSA_F12_05122013.request.done
SSA_F14_01062013.request.done"|awk -F'[_.]' '{s=gensub(/^([0-9]{4})(.*)/,"\\2\\1","g",$3);print s,$0}'|
sort|
awk '$0=$NF'
SSA_F13_12142012.request.done
SSA_F14_01062013.request.done
SSA_F12_05122013.request.done
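gensub is specific to GNU awk. Where only a POSIX awk is available, the same decorate, sort, undecorate idea can be sketched with substr instead. This again assumes MMDDYYYY in the third field and file names without spaces, and the function name is hypothetical:

```shell
# Prefix each name with a YYYYMMDD key, sort on it, then drop the key.
sort_names_by_date() {
  awk -F'[_.]' '{ print substr($3, 5, 4) substr($3, 1, 4), $0 }' |
    LC_ALL=C sort |
    cut -d' ' -f2-
}

printf '%s\n' \
  'SSA_F12_05122013.request.done' \
  'SSA_F13_12142012.request.done' \
  'SSA_F14_01062013.request.done' |
  sort_names_by_date
# SSA_F13_12142012.request.done
# SSA_F14_01062013.request.done
# SSA_F12_05122013.request.done
```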
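ls -lrt *.done | perl -lane '@a=split /_|\./,$F[scalar(@F)-1];$a[2]=~s/(..)(..)(....)/$3$1$2/g;print $a[2]." ".$_' | sort -rn | awk '{$1=""}1'
ls *.done | perl -pe 's/^.*_(..)(..)(....)/$3$1$2$&/' | sort -rn | cut -b9-
Either of these would do. Both prepend a YYYYMMDD key (the sample names appear to use MMDDYYYY), sort on it in descending order, and then strip the key again.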

Sorting with unix tools and multiple columns

I am looking for the easiest way to solve this problem. I have a huge data set, too large to load into Excel, in this format:
This is a sentence|10
This is another sentence|5
This is the last sentence|20
What I want to do is sort this from least to greatest based on the number.
cat MyDataSet.txt | tr "|" "\t" | ???
I'm not sure what the best way to do this is. I was thinking about using awk to switch the columns and then doing a sort, but I had trouble getting it to work.
Please help me out.
sort -t\| -k2n dataset.txt
should do it: -t sets the field separator and -k selects the key, with n requesting a numeric comparison.
You usually don't need cat to send the file to a filter. That said, you can use the sort filter.
sort -t "|" -k 2 -n MyDataSet.txt
This sorts the MyDataSet.txt file using the | character as field separator and sorting numerically according to the second field (the number).
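Have you tried sort -n? Note that without -t and -k it compares the start of each line numerically, and these lines all start with text, so the order below comes from tie-breaking on the whole line rather than from the numbers: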
$ sort -n inputFile
This is another sentence|5
This is a sentence|10
This is the last sentence|20
You could switch the columns with awk too:
$ awk -F"|" '{print $2"|"$1}' inputFile
10|This is a sentence
5|This is another sentence
20|This is the last sentence
Combining awk and sort:
$ awk -F"|" '{print $2"|"$1}' inputFile | sort -n
5|This is another sentence
10|This is a sentence
20|This is the last sentence
Per the comments, if there are numbers in the sentence itself, set the separator and the key explicitly:
$ sort -n -t"|" -k2 inputFile
This is another sentence|5
This is a sentence|10
This is the last sentence|20
this is a sentence with a number in it 2|22
And of course you could redirect it to a new file:
$ awk -F"|" '{print $2"|"$1}' inputFile | sort -n > outFile
Try this sort command:
sort -n -t '|' -k2 file.txt
Sort by number, change the separator and grab the second group using sort.
sort -n -t'|' -k2 dataset.txt
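One detail the commands above leave implicit: -k2 means from field 2 to the end of the line, which coincides with field 2 here only because the number is the last field. A sketch that pins the key to field 2 with -k2,2, using sample lines from this thread (the function name is an assumption):

```shell
# Numeric sort on field 2 of pipe-separated lines; extra sort flags pass through.
sort_by_count() {
  LC_ALL=C sort -t'|' -n -k2,2 "$@"
}

printf '%s\n' \
  'This is a sentence|10' \
  'This is another sentence|5' \
  'this is a sentence with a number in it 2|22' \
  'This is the last sentence|20' |
  sort_by_count
# This is another sentence|5
# This is a sentence|10
# This is the last sentence|20
# this is a sentence with a number in it 2|22
```

Passing -r (sort_by_count -r) gives greatest to least, since the key inherits the global -n and -r options.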
