Using sort | awk on one column from a csv? - sorting

Using p.txt:
$cat p.txt
R 3
R 4
S 1
S 2
R 1
T 1
R 3
The following command sorts based on the second column:
$cat p.txt | sort -k2
R 1
S 1
T 1
S 2
R 3
R 3
R 4
The following command removes repeated values in the second column:
$cat p.txt | sort -k2 | awk '!x[$2]++'
R 1
S 2
R 3
R 4
Now, substituting a comma for the space, we have the following file:
$cat p1.csv
R,3
R,4
S,1
S,2
R,1
T,1
R,3
The following command still sorts based on the second column:
$cat p1.csv | sort -t "," -k2
R,1
S,1
T,1
S,2
R,3
R,3
R,4
Below is NOT the correct output:
$cat p1.csv | sort -t "," -k2 | awk '!x[$2]++'
R,1
Correct output:
R,1
S,2
R,3
R,4
Any suggestions?

Well, since you have already used sort, you don't need awk at all: sort has -u.
The cat is not needed either:
sort -t, -k2 -u p1.csv
should give you expected output.

Try awk -F, in your last command. So:
cat p1.csv | sort -t "," -k2 | awk -F, '!x[$2]++'
Since your fields are separated by commas, you need to tell awk that the field separator is no longer whitespace, but instead the comma. The -F option to awk does that.
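With the separator set, awk again sees two fields per line and !x[$2]++ keeps only the first line for each value of the second field, so given the sorted listing shown above the pipeline should reproduce the correct output:
$cat p1.csv | sort -t "," -k2 | awk -F, '!x[$2]++'
R,1
S,2
R,3
R,4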

Well, you don't need all of that; sort and uniq are enough for this:
sort -t "," -k2 p1.csv | uniq -s 2
uniq -s 2 tells uniq to skip the first 2 characters of each line when comparing (i.e. everything up to and including the comma).
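On the sorted listing shown in the question, uniq therefore compares only "1", "1", "1", "2", "3", "3", "4" and keeps the first line of each run, which should again give:
sort -t "," -k2 p1.csv | uniq -s 2
R,1
S,2
R,3
R,4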

You need to provide the field separator to awk:
cat p1.csv | sort -t "," -k2 | awk -F, '!x[$2]++'

Related

How can I sort and list the word frequency in the second column?

My data input looks like this:
1RDD4_00022_02842 o220
1RDD4_00024_03137 o132
1RDD4_00035_05208 o216
1RDD4_00045_05573 o132
1RDD4_00046_02134 o132
1RDD4_00051_04040 o154
In numerical order, I want to sort and list the frequency of words in the right column so the output looks like this:
o132 3
o154 1
o216 1
o220 1
I've tried the following pipeline, but it only works on the input's left column, and I don't know how to modify for the right column:
sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputfile | sort | uniq -c
use
cat inputfile | cut -f2
or
cat inputfile | awk '{print $2}'
(the awk version is heavier)
to select the second column only:
cat inputfile | cut -f2 | sort | uniq -c
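Note that cut splits on tabs by default; if the columns in inputfile are separated by spaces (an assumption based on how the sample is shown), add -d' '. And to get the word-then-count layout from the question, you can swap the two columns that uniq -c prints, for example:
cut -d' ' -f2 inputfile | sort | uniq -c | awk '{print $2, $1}'
o132 3
o154 1
o216 1
o220 1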

How to do specific sorting in unix

How can I sort the following two lines
ABCTz.T.3a.B Student 1 1.4345
ABCTz.T.3.B Student 1 1.5465
to print them like below.
ABCTz.T.3.B Student 1 1.5465
ABCTz.T.3a.B Student 1 1.4345
It can definitely be done using a mixture of the sed and sort commands, but that's not a generic solution. Here is the sample code:
cat 1 | sed "s/\./ ./g" | sort -k3,3 | sed "s/ \././g"
This solution requires customization if the length of the string changes or the number of characters between two dots changes, e.g.:
ABCTz.T.SC.D.3a.B Student 1 1.4345
ABCTz.T.SC.D.3.B Student 1 1.5465
Again, I would need to modify the sort expression to account for the length in this case. Looking forward to having something more generic.
Regards, Divesh
You can use version sort (-V), available with GNU sort, on the first field:
sort -V -rk1 file
ABCTz.T.3.B Student 1 1.5465
ABCTz.T.3a.B Student 1 1.4345
If the format is based on tabs, it's easy.
cat 1|sort -t"[Control-V][TAB]" -n -r -k4
But if the number of spaces is variable, I sort with awk.
This pipeline puts the 4th field at the beginning, followed by |, then sorts on that field, and finally strips it back out:
cat 1|awk '{print $4 "|" $0}' |sort -t"|" -n -r -k1|cut -d"|" -f2-
Example:
[osboxes#osboxes Desktop]$ cat 1
asdfa safadf 1.2
asldfkañ sdlfsld 1.3
[osboxes#osboxes Desktop]$ cat 1 | awk '{print $3 "|" $0}'|sort -t"|" -n -r -k1|cut -d"|" -f2-
asldfkañ sdlfsld 1.3
asdfa safadf 1.2
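Applied to the two lines from the question (assuming they are in a file named 1, as in the examples above), the same decorate/sort/undecorate pipeline sorts on the 4th field and should give the desired order:
cat 1 | awk '{print $4 "|" $0}' | sort -t"|" -n -r -k1 | cut -d"|" -f2-
ABCTz.T.3.B Student 1 1.5465
ABCTz.T.3a.B Student 1 1.4345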
Enjoy!

How to sort one column and add values later

I would like to first sort on a specific column, which I do using sort -k2 <file>. Then, after it is sorted on the values of the second column, I would like to add up the values from column 1 for duplicate rows, delete the duplicates, and keep the summed value in column 1.
Example:
2 AAAAAA
3 BBBBBB
1 AAAAAA
2 BBBBBB
1 CCCCCC
sort -k2 <file> does this:
2 AAAAAA
1 AAAAAA
3 BBBBBB
2 BBBBBB
1 CCCCCC
I know uniq -c removes duplicates and outputs how many times each line occurred; however, I don't want the count of occurrences, I just need column 1 to be summed and displayed, so that I would get:
3 AAAAAA
5 BBBBBB
1 CCCCCC
I came up with a solution using two for loops:
The outer loop iterates over all the distinct strings in the file (test.txt); for each one, the inner loop finds all the matching numbers in the original file and adds them up. After adding all the numbers, we echo the total and the string.
for chars in `sort -k2 test.txt | uniq -f 1 | cut -d' ' -f 2 `;
do
total=0;
for nr in `grep "$chars" test.txt | cut -d' ' -f 1`;
do
total=$(($total+$nr));
done;
echo $total $chars
done
-c is your enemy; with it you are explicitly asking for the count. Here is my suggestion:
sort -k2 <file> | uniq -f1 > file2
which gives me
cat file2
1 AAAAAA
2 BBBBBB
1 CCCCCC
If you want only column 2 in the file, then use awk:
sort -k2 <file>| uniq -f1 |awk '{print $2}' > file2
leading to
AAAAAA
BBBBBB
CCCCCC
Now I get it at last.
But if you want to sum column 1, then just use awk; of course, you cannot make a grouped sum with uniq alone:
awk '{array[$2]+=$1} END { for (i in array) {print array[i], i}}' file |sort -k2
which leads to your desired output (even though I sorted afterwards):
3 AAAAAA
5 BBBBBB
1 CCCCCC
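For readability, here is the same awk program written out with comments (a sketch; the array and loop variable are only renamed):
awk '
  { sum[$2] += $1 }                        # add column 1 into a running total keyed on column 2
  END { for (k in sum) print sum[k], k }   # print each total followed by its key
' file | sort -k2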

uniq -c unable to count unique lines

I am trying to count the unique occurrences of numbers in the 3rd column of a text file with a very simple command:
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | uniq -c
which should say something like
1 10103
2 2093
3 109
but instead puts out nonsense, where the same number is counted multiple times, like
20 1
1 2
1 1
1 2
14 1
1 2
I've also tried
awk 'BEGIN {FS = "\t"}; {print $3}' bisulfite_seq_set0_v_set1.tsv | sed -e 's/ //g' -e 's/\t//g' | uniq -c
I've tried every combination I can think of from the uniq man page. How can I correctly count the unique occurrences of numbers with uniq?
uniq -c counts contiguous repeats, so to count them all you need to sort the input first. With awk, however, you don't need to sort at all:
$ awk '{count[$3]++} END{for(c in count) print count[c], c}' file
will do
awk-free version with cut, sort and uniq:
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c
uniq operates on adjacent matching lines, so the input has to be sorted first.
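If you also want the rows ordered by their counts, as the expected output in the question appears to be, a final numeric sort should do it (a sketch):
cut -f 3 bisulfite_seq_set0_v_set1.tsv | sort | uniq -c | sort -n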

How to sort in bash?

I have several lines in file, just like below:
/adbc/eee/ddd/baa/
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/fff/ddd/c/
/adbc/ccc/ddd/bf/
/adbc/ccc/ddd/bc/
The sort must first extract the string before the last /, that is:
baa
avfff
b
c
bf
bc
and then sort by the first character, then by the length of the string, and then alphabetically.
The expected result is
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/ccc/ddd/bc/
/adbc/ccc/ddd/bf/
/adbc/eee/ddd/baa/
/adbc/fff/ddd/c/
You could use awk in a pre-processing step to prepend 3 columns, all based on the field of interest, feed that to sort, and then use cut to discard the extra fields:
awk -F'/' -v OFS="/" '{x=substr($(NF-1), 1, 1);
print(x, length($(NF-1)), $(NF-1), $0)}' file.txt |
sort -k1,1 -k2,2n -k3,3 -t'/' |
cut -f4- -d'/'
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/ccc/ddd/bc/
/adbc/ccc/ddd/bf/
/adbc/eee/ddd/baa/
/adbc/fff/ddd/c/
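To see what the decorated records look like before cut strips them, you can run just the awk stage on its own; for the first two input lines it should print something like this (remaining lines omitted):
awk -F'/' -v OFS="/" '{x=substr($(NF-1), 1, 1);
print(x, length($(NF-1)), $(NF-1), $0)}' file.txt
b/3/baa//adbc/eee/ddd/baa/
a/5/avfff//adbc/fff/ddd/ccc/avfff/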
cat sortthisfile | while read line
do
field=$( echo "$line" | sed -e 's:/$::' -e 's:.*/::' )
firstchar=${field:0:1}
fieldlen=${#field}
echo "${firstchar},${fieldlen},${field},${line}"
done | sort -t, -k1,1 -k2,2n -k3,3 | sed 's:^[^,]*,[^,]*,[^,]*,::'
Obviously, sortthisfile is the name of your file.
