How to merge, join, or concatenate the first two columns in a text file, separated by an underscore? - bash

I have a text file, SG_gen.txt, with multiple columns, looking like this:
snp_CHR POS HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
snp_3 47609552 0 1 1 1 1 0 1
snp_3 47614413 0 1 1 1 1 0 1
snp_3 47616151 0 1 1 1 1 0 1
snp_3 47616155 0 1 1 1 1 0 1
snp_3 47617504 0 1 1 1 1 0 1
snp_3 47617679 0 1 1 1 1 0 1
...
I would like to join the first two columns, snp_CHR and POS, with "_" and rename the result to ID, so that the column looks like this:
ID
snp_3_47609552
snp_3_47614413
snp_3_47616151
...
This new ID column would be the first column, and I would keep all the other columns (HG00096, HG00097, ...) but not the original snp_CHR and POS. How would I do this?
I tried using:
awk '{print $0, $1 "_" $NF}' SG_gen.txt > SG_gen1.txt
but this didn't give me my desired result.

This should do it:
awk '{$1=(NR==1?"ID":$1"_"$2); $2=""}1' file
There will be extra white space, which can be normalized afterwards if needed, as shown below.
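For example, piping through tr to squeeze the doubled separator (a minimal sketch, assuming the columns are separated by single spaces):
awk '{$1=(NR==1?"ID":$1"_"$2); $2=""}1' SG_gen.txt | tr -s ' ' > SG_gen1.txt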

If you want to try Perl: the first substitution strips leading whitespace, the second turns the first whitespace character into an underscore (joining the first two columns), and the third renames the joined header field to ID. Note that this retains the spaces between the other columns as in your sample input.
$ cat anika.txt
snp_CHR POS HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
snp_3 47609552 0 1 1 1 1 0 1
snp_3 47614413 0 1 1 1 1 0 1
snp_3 47616151 0 1 1 1 1 0 1
snp_3 47616155 0 1 1 1 1 0 1
snp_3 47617504 0 1 1 1 1 0 1
snp_3 47617679 0 1 1 1 1 0 1
$ perl -pe 's/^\s*//; s/\s/_/; s/^\S+/ID/ if $. == 1' anika.txt
ID HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
snp_3_47609552 0 1 1 1 1 0 1
snp_3_47614413 0 1 1 1 1 0 1
snp_3_47616151 0 1 1 1 1 0 1
snp_3_47616155 0 1 1 1 1 0 1
snp_3_47617504 0 1 1 1 1 0 1
snp_3_47617679 0 1 1 1 1 0 1
$

Related

Delete lines with a particular number of columns in Linux

My file.fam looks like the following and contains around 22k lines. I want to delete rows containing fewer than 6 columns.
06S14031708 36125 0 0 2 2
06S14031716 38824 0 0 1 2
06S14031729 27949 0 0 2 2
06S14031742 30585 0 0 2 2
5 5 0 0 1 1
6 6 0 0 1
12 12 0 0 1 2
16 16 0 0 1 2
18_0004 21213 0 0 1 1
18_0006 35931 0 0 1 1
18_0008 31975 0 0 1 1
An awk version that redirects all lines with at least 6 fields to a new file, then replaces the original:
awk 'NF>=6' file.fam > file.fam.new
mv file.fam.new file.fam
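With GNU awk 4.1 or later, the temporary file can be avoided using the in-place extension (a sketch; -i inplace is gawk-specific):
gawk -i inplace 'NF>=6' file.fam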
Or a somewhat more unsightly variant using sed with in-place editing (note that this keeps only lines with exactly six fields, so lines with more would also be deleted):
sed -i -r '/^\s*(\w+\s+){5}\w+\s*$/!d' file.fam
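\w, \s, and -r are GNU extensions; for other sed implementations the same filter can be spelled with POSIX character classes (a sketch that prints to stdout rather than editing in place):
sed -E '/^[[:blank:]]*([[:alnum:]_]+[[:blank:]]+){5}[[:alnum:]_]+[[:blank:]]*$/!d' file.fam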

Replace values of one column based on other column conditions in shell

I have the tab-separated text file below. I want to match values in column 2 and replace the values in column 5: if column 2 is X or Y, column 5 should become 1, as in the result below.
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 0
1:943937:C:T X 0 943937 0
1:944858:A:G Y 0 944858 0
1:945010:C:A X 0 945010 0
1:946247:G:A 1 0 946247 0
result:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
I tried awk -F'\t' '{ $5 = ($2 == X ? 1 : $2) } 1' OFS='\t' file.txt but I am not sure how to match both X and Y in one step.
With awk:
awk 'BEGIN{FS=OFS="\t"} $2=="X" || $2=="Y"{$5="1"}1' file
Output:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Assuming you want $5 to be zero (as opposed to remaining unchanged) if the condition is false:
$ awk 'BEGIN{FS=OFS="\t"} {$5=($2 ~ /^[XY]$/)} 1' file
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
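This works because in awk a comparison or regex match evaluates to 1 (true) or 0 (false), so its result can be assigned directly. A minimal demonstration:
$ awk 'BEGIN{x=("X" ~ /^[XY]$/); y=("2" ~ /^[XY]$/); print x, y}'
1 0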

Common lines from 2 files based on 2 columns per file

I have two files:
file1:
1 imm_1_898835 0 908972 0 A
1 vh_1_1108138 0 1118275 T C
1 vh_1_1110294 0 1120431 A G
1 rs9729550 0 1135242 C A
file2:
1 exm1916089 0 865545 0 0
1 exm44 0 865584 0 G
1 exm46 0 865625 0 G
1 exm47 0 865628 A G
1 exm51 0 908972 0 G
1 exmF 0 1120431 C A
I want to obtain a file that is the overlap between file1 and file2 based on columns 1 and 4, printing the common values of columns 1 and 4 together with column 2 from file1 and from file2.
E.g., I want:
1 908972 imm_1_898835 exm51
1 1120431 vh_1_1110294 exmF
Could you please try the following (a commented version is shown after it):
awk 'FNR==NR{a[$1,$4]=$2;next} (($1,$4) in a){print $1,$4,a[$1,$4],$2}' file1 file2
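The same one-liner, expanded with comments to show the standard two-file FNR==NR idiom (behavior unchanged):
awk '
FNR==NR {                       # FNR resets for each input file, so this is true only while reading file1
    a[$1,$4] = $2               # remember file1 column 2, keyed on columns 1 and 4
    next                        # skip the block below while still in file1
}
(($1,$4) in a) {                # now reading file2: was this key also seen in file1?
    print $1, $4, a[$1,$4], $2  # common key, then the names from file1 and file2
}' file1 file2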

Removing duplicate lines with different columns

I have a file which looks like follows:
ENSG00000197111:I12 0
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 0
ENSG00000197111:I5 1
I have some lines that are duplicated, but I cannot remove them with sort -u because the second column has different values for them (1 or 0). How do I remove such duplicates, keeping the lines whose second column is 1, so that the file becomes:
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
You can use awk and the OR operator, if the order isn't mandatory: d[$1] ends up as 1 if any line for that key had 1 in the second column.
awk '{d[$1]=d[$1] || $2}END{for(k in d) print k, d[k]}' file
You get:
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
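If the input order does matter, here is a sketch that remembers the first-seen order of the keys (assumes whitespace-separated input as above):
awk '
!($1 in best) { order[++n] = $1; best[$1] = $2 }  # first sighting: record key order and value
$2 > best[$1] { best[$1] = $2 }                   # a later 1 overrides an earlier 0
END { for (i = 1; i <= n; i++) print order[i], best[order[i]] }
' file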
Edit: a sort-only solution.
You can use sort in two passes, for example:
sort -k1,1 -k2,2r file | sort -u -k1,1
The first pass groups lines by key with the 1s sorted before the 0s (column 2 descending), and the second pass keeps only the first line for each key. You get:
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1

Replace blank fields with zeros in AWK

I wish to replace blank fields with zeros using awk. When I use sed 's/ /0/' file, I seem to replace all white space, when I only want to fill in the missing data. awk '{print NF}' file returns different field counts (e.g. 9 and 4) due to the empty fields.
input
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 M *BB
501666604 0 M *OO
90007162 7 M +178852
90007568 3 M +189182
output
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 0 M *BB 0 0 0 0
501666604 0 0 M *OO 0 0 0 0
90007162 0 7 M +178852 0 0 0 0
90007568 0 3 M +189182 0 0 0 0
Using GNU awk's FIELDWIDTHS feature for fixed-width processing:
$ awk '{for(i=1;i<=NF;i++)if($i~/^ *$/)$i=0}1' FIELDWIDTHS="11 9 5 2 16 3 2 11 2" file | column -t
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 0 M *BB 0 0 0 0
501666604 0 0 M *OO 0 0 0 0
90007162 0 7 M +178852 0 0 0 0
90007568 0 3 M +189182 0 0 0 0
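The same approach, expanded with comments (the widths here are tailored to the sample; adjust them for your real layout, and note that FIELDWIDTHS is specific to GNU awk):
gawk '
BEGIN { FIELDWIDTHS = "11 9 5 2 16 3 2 11 2" }  # split records by position, not by separators
{
    for (i = 1; i <= NF; i++)  # visit every fixed-width column
        if ($i ~ /^ *$/)       # field is empty or all blanks
            $i = 0             # assigning a field rebuilds $0 with single-space separators
    print
}' file | column -t            # realign the result into a readable table
This assumes each input line is padded with blanks out to the full record width; a shorter line simply yields fewer fields, and the missing trailing fields are never visited.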
