Error using join: "is not sorted", with files previously sorted (Ubuntu terminal and Gawk) - sorting

I have this tab-delimited document with over 60,000 records:
head -2 hg38.txt
717 NM_000525 chr11 - 17385248 17388659 17386918 17388091 1 17385248, 17388659, 0 KCNJ11 cmpl cmpl 0,
987 NM_000242 chr10 - 52765379 52771700 52768136 52771635 4 52765379,52769246,52770669,52771448, 52768510,52769315,52770786,52771700, 0 MBL2 cmpl cmpl 1,1,1,0,
Previously, I extracted some selected values from its third column and saved them in another file, chromosomes.txt:
gawk '{print $3}' hg38.txt | sort -u | grep -v "_" | sort -o chromosomes.txt
head -5 chromosomes.txt
chr1
chr10
chr11
chr12
chr13
And now, I want to select those records which have the same value in the "chromosome" field, but since I also want another field in my end result, I do this:
gawk '{print $3, $13}' hg38.txt | sort | join - chromosomes.txt > final.txt
But the join command warns that:
join: -:833: is not sorted: chr10 GLRX3
How can I join them? Could I also, after joining them, do more things just by adding | instead of creating a temp file? For example:
gawk '{print $3, $13}' hg38.txt | sort | join - chromosomes.txt | gawk '{print $2}' | uniq -c | gawk 'BEGIN{t=0}{t=t+$1} END{print t/NR}'
Thank you for your answers in advance!

Why are you not doing the filtering in gawk as well?
gawk '{ if (!match($3,"_")) {print $3, $13} }' hg38.txt
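As for the "is not sorted" warning itself: a plain `sort` orders the whole line (under the current locale's collation), while `join` compares only the join field, and the two orderings can disagree. A minimal sketch with toy data (file names and gene names are made up for the demo), sorting both inputs on field 1 only and in one locale:

```shell
# Toy stand-ins for the extracted "chromosome gene" pairs and the
# chromosome list (contents are hypothetical).
printf 'chr3 GENEA\nchr10 GENEB\nchr3 GENEC\n' > pairs.txt
printf 'chr10\nchr3\n' > chroms.txt

# Sort both inputs on the join field only (-k1,1), in the same
# locale, so join sees exactly the ordering it expects.
LC_ALL=C sort -k1,1 pairs.txt > pairs.sorted
LC_ALL=C sort -k1,1 chroms.txt > chroms.sorted
join pairs.sorted chroms.sorted
```

Applied to the real files, that becomes something like `gawk '{print $3, $13}' hg38.txt | LC_ALL=C sort -k1,1 | join - <(LC_ALL=C sort chromosomes.txt)`, and yes, you can keep appending stages with | exactly as in the question.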

Related

Bash Shell: How do I sort by values on last column, but ignoring the header of a file?

file
ID First_Name Last_Name(s) Average_Winter_Grade
323 Popa Arianna 10
317 Tabarcea Andreea 5.24
326 Balan Ionut 9.935
327 Balan Tudor-Emanuel 8.4
329 Lungu Iulian-Gabriel 7.78
365 Brailean Mircea 7.615
365 Popescu Anca-Maria 7.38
398 Acatrinei Andrei 8
How do I sort it by the last column, except for the header?
This is what file should look like after the changes:
ID First_Name Last_Name(s) Average_Winter_Grade
323 Popa Arianna 10
326 Balan Ionut 9.935
327 Balan Tudor-Emanuel 8.4
398 Acatrinei Andrei 8
329 Lungu Iulian-Gabriel 7.78
365 Brailean Mircea 7.615
365 Popescu Anca-Maria 7.38
317 Tabarcea Andreea 5.24
If it's always 4th column:
head -n 1 file; tail -n +2 file | sort -n -r -k 4,4
If all you know is that it's the last column:
head -n 1 file; tail -n +2 file | awk '{print $NF,$0}' | sort -n -r | cut -f2- -d' '
You'd like to just sort by the last column, but sort doesn't allow you to do that easily. So rewrite the data with the column to be sorted at the beginning of each line:
Ignoring the header for the moment (although this will often work by itself):
awk '{print $NF, $0 | "sort -nr" }' input | cut -d ' ' -f 2-
If you do need to keep the header out of the sort (e.g., it's getting mixed in), you can do things like:
< input awk 'NR==1; NR>1 {print $NF, $0 | "sh -c \"sort -nr | cut -d \\\ -f 2-\"" }'
or
awk 'NR==1{ print " ", $0} NR>1 {print $NF, $0 | "sort -nr" }' OFS=\; input | cut -d \; -f 2-
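The head/tail approach can be sketched end to end on an inline copy of the file (rows trimmed to three for brevity):

```shell
# A short copy of the sample file for the demo.
cat > grades.txt <<'EOF'
ID First_Name Last_Name(s) Average_Winter_Grade
317 Tabarcea Andreea 5.24
323 Popa Arianna 10
326 Balan Ionut 9.935
EOF

# Print the header untouched, then numerically reverse-sort the
# remaining rows on column 4.
{ head -n 1 grades.txt; tail -n +2 grades.txt | sort -n -r -k 4,4; }
```

Grouping the two commands in { ...; } matters if you want to redirect the combined output to a file.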

bash- get all lines with the same column value in two files

I have two text files each with 3 fields. I need to get the lines with the same value on the third field. The 3rd field value is unique in each file. Example:
file1:
1 John 300
2 Eli 200
3 Chris 100
4 Ann 600
file2:
6 Kevin 250
7 Nancy 300
8 John 100
output:
1 John 300
7 Nancy 300
3 Chris 100
8 John 100
When I use the following command:
cat file1 file2 | sort -k 3 | uniq -c -f 2
I get only one row from an input file with the duplicate value. I need both!
this one-liner gives you that output:
awk 'NR==FNR{a[$3]=$0;next}$3 in a{print a[$3];print}' file1 file2
My solution is
join -1 3 -2 3 <(sort -k3 file1) <(sort -k3 file2) | awk '{print $2, $3, $1; print $4, $5, $1}'
or
join -1 3 -2 3 <(sort -k3 file1) <(sort -k3 file2) -o "1.1 1.2 0 2.1 2.2 0" | xargs -n3
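The join answer can be checked end to end; a plain-sh sketch of the same pipeline, with temp files standing in for the process substitutions:

```shell
# The question's sample files, reproduced inline.
printf '1 John 300\n2 Eli 200\n3 Chris 100\n4 Ann 600\n' > file1
printf '6 Kevin 250\n7 Nancy 300\n8 John 100\n' > file2

# Sort each file on its third field (equivalent to the <(...)
# process substitutions above, but plain-sh friendly).
sort -k3 file1 > file1.s
sort -k3 file2 > file2.s

# Join on field 3; -o lays the two matching records side by side,
# and xargs -n3 breaks them back onto separate lines.
join -1 3 -2 3 -o '1.1 1.2 0 2.1 2.2 0' file1.s file2.s | xargs -n3
```

Note the pairs come out in key order (100 before 300), not in the question's sample order, since join requires sorted input.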

bash uniq, how to show count number at back

Normally when I do cat number.txt | sort -n | uniq -c, I get numbers like this:
3 43
4 66
2 96
1 97
But what I need is the number of occurrences shown at the back, like this:
43 3
66 4
96 2
97 1
Please give advice on how to change this. Thanks.
Use awk to change the order of columns:
cat number.txt | sort -n | uniq -c | awk '{ print $2, $1 }'
Perl version:
perl -lne '$occ{0+$_}++; END {print "$_ $occ{$_}" for sort {$a <=> $b} keys %occ}' < numbers.txt
Through GNU sed,
cat number.txt | sort -n | uniq -c | sed -r 's/^ *([0-9]+) ([0-9]+)$/\2 \1/'
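All of these can be checked against a reconstruction of the input implied by the sample output; for instance the awk version:

```shell
# Input reconstructed from the counts in the question:
# three 43s, four 66s, two 96s, one 97.
printf '43\n43\n43\n66\n66\n66\n66\n96\n96\n97\n' > number.txt

# uniq -c puts the count first; awk swaps the two columns.
sort -n number.txt | uniq -c | awk '{ print $2, $1 }'
```

Note that uniq -c pads the count with leading spaces, which awk's default field splitting absorbs; a sed-based swap has to match that padding explicitly.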

replace string in comma delimiter file using nawk

I need to implement an if condition in the below nawk command, to process the input file only when the third column has more than three digits. Please help me with the command; what am I doing wrong? It is not working.
inputfile.txt
123 | abc | 321456 | tre
213 | fbc | 342 | poi
outputfile.txt
123 | abc | 321### | tre
213 | fbc | 342 | poi
cat inputfile.txt | nawk 'BEGIN {FS="|"; OFS="|"} {if($3 > 3) $3=substr($3, 1, 3)"###" print}'
Try:
awk 'length($3) > 3 { $3 = substr($3, 1, 3) "###" } 1' FS=' [|] ' OFS=' | ' inputfile.txt
This works with gawk:
awk -F '[[:blank:]]*\\\|[[:blank:]]*' -v OFS=' | ' '
$3 ~ /^[[:digit:]]{4,}/ {$3 = substr($3,1,3) "###"}
1
' inputfile.txt
It won't preserve the whitespace, so you might want to pipe the output through column -t.
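A quick end-to-end check, with the input reproduced inline. This variant splits on the padded delimiter so $3 is the bare number; the field separator ' [|] ' assumes single spaces around each |, as in the sample:

```shell
# The question's input, reproduced inline.
printf '123 | abc | 321456 | tre\n213 | fbc | 342 | poi\n' > inputfile.txt

# With FS=' [|] ', $3 holds the bare digits; when it is longer than
# three characters, keep the first three and append ###.
awk 'length($3) > 3 { $3 = substr($3, 1, 3) "###" } 1' \
    FS=' [|] ' OFS=' | ' inputfile.txt
```

Splitting on the bare | instead (FS=\|) would leave the padding spaces inside $3, so length($3) would count them and the 342 row would be mangled too.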

retrieve and add two numbers of files

In my file I have the following structure:
A | 12 | 10
B | 90 | 112
C | 54 | 34
What I have to do is add column 2 and column 3 and print the result together with column 1.
output:
A | 22
B | 202
C | 88
I retrieved the two columns but don't know how to add them.
What I did is:
cut -d ' | ' -f3,5 myfile.txt
How do I add those columns and display the result?
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk.
awk '{print $1," | ",($3+$5)}' myfile.txt
will work, perhaps.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS =/= OFS in this case due to formatting
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.
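The explanation above can be verified end to end with the sample rows inlined:

```shell
# The question's input, reproduced inline.
printf 'A | 12 | 10\nB | 90 | 112\nC | 54 | 34\n' > myfile.txt

# FS="|" leaves the trailing space inside $1 ("A "), which is why
# OFS="| " (no leading space) reassembles the original spacing;
# $2+$3 coerces " 12 " and " 10" to numbers before adding.
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' myfile.txt
```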
