Use bash commands to sort a list by a certain column

I have a list of data with four columns, like below:
chr1 9778939 10199603 DEL
chr1 143804138 143808614 DEL
chr1 8541961 8757598 DEL
chr1 141480516 141909199 INV
chr1 3902285 4665319 INV
chr1 10212548 10467934 DEL
chr1 225767517 226730696 INV
chr1 10807309 11011343 DEL
chr1 23663773 23957334 DEL
chr1 4468523 4665322 DEL
chr1 24458662 24704306 DEL
....
....
chr2
....
....
chr10
....
....
chr22
....
....
chrX
....
....
chrY
....
....
I hope to:
First, sort by chromosome: chr1, chr2, chr3, ... up to chr22, then chrX, chrY. If I simply use sort -n, it sorts as chr10, chr1, chr11, and so on. I want to sort by the numeric value in the first column.
Then, within each chromosome (chr1, chr2, ...), how can I sort by the last column, that is, "DEL" or "INV"?
Finally, sort by the second column, again by numeric value. For example, 104000 should go after 10500 because 104000 > 10500, rather than being ordered by a digit-by-digit comparison (4 versus 5 in the third digit).
Thanks. I hope I've made it clear.

Assuming the columns in the file afile are separated by a single space character:
$ cat afile | sed 's/chr/chr /' | sort -k2,2n -k5,5 -k3,3n | sed 's/chr /chr/'

Convert X and Y to 23 and 24 so they sort numerically, then convert them back after the sort. The type column (DEL/INV) is sorted as text, since a numeric sort would treat both values as 0.
cat file | sed 's/chr/chr /' | sed 's/ X/ 23/' | sed 's/ Y/ 24/' | sort -k 2,2n -k 5,5 -k 3,3n | sed 's/chr 23/chrX/' | sed 's/chr 24/chrY/' | sed 's/chr /chr/'
It's a long string of seds, but they run quickly.
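If GNU sort is available, a version sort on the first column might avoid the sed round trip entirely. This is only a sketch under that assumption: the -V ordering is a GNU coreutils feature, and the file name sv_sample.txt and its rows are made up in the question's format.

```shell
# Made-up sample rows in the question's four-column format (hypothetical file name).
cat > sv_sample.txt <<'EOF'
chr10 500 900 DEL
chrX 100 200 INV
chr2 104000 105000 DEL
chr2 10500 11000 DEL
chr1 3902285 4665319 INV
chr1 9778939 10199603 DEL
chr2 300 400 INV
EOF

# -t ' '  : columns are separated by a single space (assumption from the answer above)
# -k1,1V  : version sort on the chromosome name (chr1 < chr2 < chr10 < chrX)
# -k4,4   : then text order on the type (DEL before INV)
# -k2,2n  : then numeric order on the start position
sorted=$(sort -t ' ' -k1,1V -k4,4 -k2,2n sv_sample.txt)
printf '%s\n' "$sorted"
```

Because the chromosome names are never rewritten, nothing has to be converted back after the sort.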

Related

A UNIX Command to Find the Name of the Student who has the Second Highest Score

I am new to Unix programming. Could you please help me solve this question?
For example, if the input file has the content below:
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
The output will be
ABC
I tried something like this
sort -k3,3 -rn -t" " | head -n2 | awk '{print $2}'
Using awk (gawk is required, since asorti is a GNU extension):
awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}'
Demo:
$ cat file.txt
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
$ awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}' file.txt
ABC
$
Explanation:
NR>1 --> Skip the first record (the header)
{arr[$3]=$2} --> Create an associative array with the score as index and the name as value
END --> Run this block after the whole file has been read
n=asorti(arr,arr_sorted) --> Sort the indices of arr (i.e. the scores) into arr_sorted; n is the number of elements. asorti compares indices as strings by default, so this relies on all scores having the same number of digits
print arr[arr_sorted[n-1]] --> arr_sorted[n-1] is the second-largest index (score); print the corresponding name from arr
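Your attempt is 90% correct; it needs just a single change.
Try this:
sort -k3,3 -rn -t" " | head -n2 | tail -n1 | awk '{print $2}'
head -n2 keeps the two highest-scoring lines (the header sorts last because "Score" counts as 0), and tail -n1 then keeps only the second of them.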
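If several students can share the top score, picking the second line of the sorted output may return another top scorer. A minimal sketch that ranks distinct scores instead, assuming a one-line header and space-separated columns (scores.txt is a hypothetical file name holding the question's data):

```shell
# The question's sample data, written to a hypothetical file.
cat > scores.txt <<'EOF'
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
EOF

# tail -n +2       : drop the header line
# sort -k3,3nr     : highest score first, compared numerically
# awk              : bump rank each time a new score value appears,
#                    and print the name while rank is 2
second=$(tail -n +2 scores.txt \
  | sort -t ' ' -k3,3nr \
  | awk '!seen[$3]++ { rank++ } rank == 2 { print $2 }')
echo "$second"
```

If two students tie for second place, this prints both names, one per line.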

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one more letter for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, despite the fact it is not supposed to.
Is there any solution? Or something I'm doing wrong?
I'm going to venture a guess: your awk expects Unix-style newlines (LF), but your dico.txt has Windows-style newlines (CR+LF). That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but with dico.txt converted to Windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
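If CR+LF line endings do turn out to be the cause, one possible fix is to strip a trailing carriage return inside awk before measuring the line. A small sketch, assuming that diagnosis is right (dico_crlf.txt is a hypothetical file created only for the demonstration):

```shell
# Build a two-word sample with Windows (CR+LF) line endings.
printf 'ABAISSA\r\nABAISSABLE\r\n' > dico_crlf.txt

# sub(/\r$/, "") removes a trailing CR from the current line, if present,
# so length($0) counts only the letters.
lengths=$(awk '{ sub(/\r$/, ""); print length($0) }' dico_crlf.txt)
printf '%s\n' "$lengths"
```

The same awk program leaves files with plain LF endings untouched, so it can be used in the original pipeline either way.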

Error using join, "is not sorted", with previously sorted files (Ubuntu terminal and gawk)

I have this tab-delimited document with over 60,000 records:
head -2 hg38.txt
717 NM_000525 chr11 - 17385248 17388659 17386918 17388091 1 17385248, 17388659, 0 KCNJ11 cmpl cmpl 0,
987 NM_000242 chr10 - 52765379 52771700 52768136 52771635 4 52765379,52769246,52770669,52771448, 52768510,52769315,52770786,52771700, 0 MBL2 cmpl cmpl 1,1,1,0,
Previously, I extracted some selected values of the third column from it and saved them in another file, chromosomes.txt:
gawk '{print $3}' hg38.txt | sort -u | grep -v "_" | sort -o chromosomes.txt
head -5 chromosomes.txt
chr1
chr10
chr11
chr12
chr13
Now I want to select the records whose chromosome field appears in that file, but since I also want another field in my end result, I do this:
gawk '{print $3, $13}' hg38.txt | sort | join - chromosomes.txt > final.txt
But the join command warns that:
join: -:833: is not sorted: chr10 GLRX3
How can I join them? Also, after joining them, could I do more processing by just adding another | instead of creating a temp file? For example:
gawk '{print $3, $13}' hg38.txt | sort | join - chromosomes.txt | gawk '{print $2}' | uniq -c | gawk 'BEGIN{t=0}{t=t+$1} END{print t/NR}'
Thank you for your answers in advance!
Why are you not doing the filtering in gawk as well?
gawk '{ if (!match($3,"_")) {print $3, $13} }' hg38.txt
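If you still want to use join, a likely cause of the warning (assuming a UTF-8 locale such as en_US.UTF-8) is that plain sort compares whole lines with locale collation rules, while join compares only the first field. A sketch that sorts on the join field alone and fixes the locale for both tools; the file names pairs.txt and chroms.txt and most of the gene names are made up for illustration:

```shell
# Use byte-wise collation for both sort and join so they agree on the order.
export LC_ALL=C

# Hypothetical "chromosome gene" pairs, as produced by gawk '{print $3, $13}'.
printf '%s\n' 'chr10 GLRX3' 'chr1 ZZZ3' 'chr1 AAA1' 'chr11 KCNJ11' 'chr1_alt FOO' > pairs.txt
# Hypothetical list of wanted chromosomes, sorted under the same locale.
printf '%s\n' chr1 chr10 chr11 | sort > chroms.txt

# -k1,1 restricts the primary sort key to the join field.
joined=$(sort -k1,1 pairs.txt | join - chroms.txt)
printf '%s\n' "$joined"
```

The joined output can be piped straight into further commands, so no temporary file is needed for the later gawk and uniq steps.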

Sort a file to put 10, 11, 12... before 1, 2, 3... and X,Y

I have a list of chromosome data with the columns (chromosome, start, and end) like this:
chr1 6252071 6253740
chr1 6965107 6966070
chr1 6966038 6967016
chr1 7066595 7068694
chr1 7100956 7102296
chr1 7153422 7154635
chr1 7155112 7156181
....
chr2
....
chr10
....
chrX
....
chrY
....
etc.
I am trying to use bash to sort the chromosome sections to this order:
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY
in the first column, and then in numerical order by start position in the second column, but no variation of sort seems to do the job. Any ideas? Thanks.
Split your file into two streams with separate filtering, sort each by chromosome and then numerically by start position, and recombine them:
cat <(grep '^chr1[[:digit:]][[:space:]]' <inputfile | sort -k1,1 -k2,2n) \
<(grep -v '^chr1[[:digit:]][[:space:]]' <inputfile | sort -k1,1 -k2,2n) \
>outputfile
perl -E '
open $f, "<", shift;
say join "",
map {$_->[0]}
sort {length($b->[1]) <=> length($a->[1]) or $a->[1] cmp $b->[1]}
map {[$_, (split)[0]]}
<$f>
' file
It first opens the file.
Then it uses a Schwartzian Transform. Read the command from the bottom up:
read the lines: <$f>
transform the lines to a list of pairs, the original line and its first word:
map {[$_, (split)[0]]}
sort, first by length of the first word (longest to shortest), then lexically (A to Z)
transform the list of pairs back to a list of lines (the first element of each pair):
map {$_->[0]}
join the lines (they still have their newlines, so join on the empty string)
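The same decorate, sort, undecorate idea can be sketched with standard shell tools, with a numeric sort on the start position added as a third key. This assumes space-separated columns and that two-digit chromosomes are exactly the names longer than the rest; regions.txt and its rows are hypothetical.

```shell
# Made-up sample rows in the question's three-column format.
cat > regions.txt <<'EOF'
chr2 500 600
chrX 10 20
chr10 300 400
chr1 7066595 7068694
chr1 6252071 6253740
chr11 5 6
chrM 1 2
EOF

# Decorate:   prefix each line with the length of the chromosome name.
# Sort:       longest names first, then by name, then numerically by start.
# Undecorate: drop the length column again.
ordered=$(awk '{ print length($1), $0 }' regions.txt \
  | LC_ALL=C sort -t ' ' -k1,1nr -k2,2 -k3,3n \
  | cut -d ' ' -f 2-)
printf '%s\n' "$ordered"
```

Names such as chr1_random would also count as long and land at the top, so they would need to be filtered out or handled separately.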

bash uniq, how to show count number at back

Normally when I do cat number.txt | sort -n | uniq -c , I get numbers like this:
3 43
4 66
2 96
1 97
But what I need is the number of occurrences at the back, like this:
43 3
66 4
96 2
97 1
Please give advice on how to change this. Thanks.
Use awk to change the order of columns:
cat number.txt | sort -n | uniq -c | awk '{ print $2, $1 }'
Perl version:
perl -lne '$occ{0+$_}++; END {print "$_ $occ{$_}" for sort {$a <=> $b} keys %occ}' < number.txt
With GNU sed (the pattern has to allow for the leading spaces that uniq -c prints before the count):
cat number.txt | sort -n | uniq -c | sed -r 's/^ *([0-9]+) ([0-9]+)$/\2 \1/'
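Another option is to let awk do the counting itself, which skips uniq and the column swap. A minimal sketch, assuming one number per line; the sample file is rebuilt here with the counts from the question:

```shell
# Recreate a sample number.txt with the counts shown in the question.
printf '%s\n' 43 66 97 66 43 96 66 43 96 66 > number.txt

# count[$1]++ tallies each value; the END block prints "value count".
# awk's for-in order is unspecified, so sort numerically afterwards.
counts=$(awk '{ count[$1]++ } END { for (v in count) print v, count[v] }' number.txt \
  | sort -n)
printf '%s\n' "$counts"
```

Since the tally is kept in memory, the input does not need to be sorted first.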
