How to merge files depending on a string in a specific column - bash

I have two files that I need to merge together based on what string they contain in a specific column.
File 1 looks like this:
1 1655 1552 189
1 1433 1552 185
1 1623 1553 175
1 691 1554 182
1 1770 1554 184
1 1923 1554 182
1 1336 1554 181
1 660 1592 179
1 743 1597 179
File 2 looks like this:
1 1552 0 0 2 -9 G A A A
1 1553 0 0 2 -9 A A G A
1 1554 0 751 2 -9 A A A A
1 1592 0 577 1 -9 G A A A
1 1597 0 749 2 -9 A A G A
1 1598 0 420 1 -9 A A A A
1 1600 0 0 1 -9 A A G G
1 1604 0 1583 1 -9 A A A A
1 1605 0 1080 2 -9 G A A A
I want to match column 3 of file 1 to column 2 of file 2, with my output looking like:
1 1655 1552 189 0 0 2 -9 G A A A
1 1433 1552 185 0 0 2 -9 G A A A
1 1623 1553 175 0 0 2 -9 A A G A
1 691 1554 182 0 751 2 -9 A A A A
1 1770 1554 184 0 751 2 -9 A A A A
1 1923 1554 182 0 751 2 -9 A A A A
1 1336 1554 181 0 751 2 -9 A A A A
1 660 1592 179 0 577 1 -9 G A A A
1 743 1597 179 0 749 2 -9 A A G A
I am not interested in keeping any lines in file 2 that are not in file 1. Thanks in advance!

Thanks to @Abelisto I managed to figure something out 4 hours later!
sort -k 3,3 File1.txt > Pheno1.txt
awk '($2 >0)' File2.ped > Ped1.ped
sort -k 2,2 Ped1.ped > Ped2.ped
join -1 3 -2 2 Pheno1.txt Ped2.ped > Ped3.txt
cut -d ' ' -f 1,4,5 --complement Ped3.txt > Output.ped
My real File2 actually contained negative values in the 2nd column (thankfully my real File1 didn't have any negatives), hence the use of awk to remove those rows.
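If it helps, the same steps can be chained with process substitution so the intermediate Pheno1/Ped1/Ped2/Ped3 files are not needed. This is just a sketch of the commands above combined, untested against the real data:
# Same pipeline as above, without intermediate files (requires bash for <(...)):
join -1 3 -2 2 <(sort -k 3,3 File1.txt) <(awk '($2 > 0)' File2.ped | sort -k 2,2) |
    cut -d ' ' -f 1,4,5 --complement > Output.ped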

Using awk:
awk 'NR == FNR { arr[$2]=$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10 } NR != FNR { print $1" "$2" "$3" "$4" "arr[$3] }' file2 file1
Process file2 first (NR==FNR): set up an array called arr with the 2nd space-delimited field as the index and the 3rd to 10th fields, separated with spaces, as the value. Then, when processing the first file (NR!=FNR), print the 1st to 4th space-delimited fields followed by the contents of arr indexed by field 3.
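Note that as written, file1 lines whose 3rd field has no match in file2 still print, just with nothing appended. A small variant (my assumption about the desired behaviour, not part of the original answer) that drops such lines:
awk 'NR == FNR { arr[$2] = $3 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10; next }
     $3 in arr { print $1, $2, $3, $4, arr[$3] }' file2 file1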

Since $1 seems to be a constant 1 and I have no idea about the row counts of either file (800,000 columns in file2 sounded like a lot), I'm hashing file1 instead:
$ awk '
NR==FNR {                               # file1: store fields 2-4, keyed by field 3
    a[$3]=a[$3] (a[$3]==""?"":ORS) $2 OFS $3 OFS $4   # several file1 lines may share a key
    next
}
($2 in a) {                             # file2: only keys that were seen in file1
    n=split(a[$2],t,ORS)                # one entry per matching file1 line
    for(i=1;i<=n;i++) {
        $2=t[i]                         # splice file1 fields 2-4 in place of file2 field 2
        print
    }
}' file1 file2
Output:
1 1655 1552 189 0 0 2 -9 G A A A
1 1433 1552 185 0 0 2 -9 G A A A
1 1623 1553 175 0 0 2 -9 A A G A
1 691 1554 182 0 751 2 -9 A A A A
1 1770 1554 184 0 751 2 -9 A A A A
1 1923 1554 182 0 751 2 -9 A A A A
1 1336 1554 181 0 751 2 -9 A A A A
1 660 1592 179 0 577 1 -9 G A A A
1 743 1597 179 0 749 2 -9 A A G A
When posting a question, please add details such as row and column counts to it. Better requirements yield better answers.

Related

Print values from both files using awk as vlookup

I have two files, file1:
1 I0626_all 0 0 1 1
2 I0627_all 0 0 2 1
3 I1137_all_published 0 0 1 1
4 I1859_all 0 0 2 1
5 I2497_all 0 0 2 1
6 I2731_all 0 0 1 1
7 I4451_all 0 0 1 1
8 I0626 0 0 1 1
9 I0627 0 0 2 1
10 I0944 0 0 2 1
and file 2:
I0626_all 1 138
I0627_all 1 139
I1137_all_published 1 364
I4089 1 365
AfontovaGora2.SG 1 377
AfontovaGora3_d 1 378
At the end I want:
1 I0626_all 138
2 I0627_all 139
3 I1137_all_published 364
I tried using:
awk 'NR==FNR{a[$1]=$2;next} {b[$3]} {print $1,$2,b[$3]}' file2 file1
But it doesn't work.
You may use this awk:
awk 'NR == FNR {map[$1] = $NF; next} $2 in map {print $1, $2, map[$2]}' file2 file1
1 I0626_all 138
2 I0627_all 139
3 I1137_all_published 364
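The same one-liner spread out with comments, in case the NR==FNR idiom is unfamiliar (behaviour unchanged):
awk '
    NR == FNR {            # first file on the command line (file2)
        map[$1] = $NF      # remember its last field, keyed by the name in column 1
        next
    }
    $2 in map {            # second file (file1): keep only rows whose name was seen in file2
        print $1, $2, map[$2]
    }
' file2 file1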

Sort text file by the 16th column with p-values in bash

I have a tab-delimited text file with 18 columns and more than 300,000 rows. It also has a header line, and I would like to sort the whole file by the 16th column, which contains p-values, so that the lowest p-values come first and the header line stays at the top.
I already have some code; it doesn't give me any error message, but the output file contains only the header line and nothing else.
Here is my file:
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
Output should look like this:
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
Here is my code:
awk 'NR==1; NR > 1 {print $0 | "sort -g -rk 16,16"}' file.txt > file_out.txt
I'm guessing your sort doesn't have a -g option and so it's failing and not producing any output. Try this instead just using POSIX options:
$ awk 'NR==1; NR > 1 {print | "sort -nrk 16,16"}' file
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
Would you please try the following:
cat <(head -n 1 file.txt) <(tail -n +2 file.txt | sort -nk16,16) > file_out.txt
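An annotated version of the same idea, in case the process substitution is unclear (behaviour unchanged):
# <(head -n 1 file.txt)                    -> just the header line
# <(tail -n +2 file.txt | sort -nk16,16)   -> everything after the header,
#                                             sorted numerically on column 16
cat <(head -n 1 file.txt) <(tail -n +2 file.txt | sort -nk16,16) > file_out.txt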
Using GNU awk (for array sorting):
awk 'NR==1 { print;next } { map[$3][$16]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for(i in map) { for(j in map[i]) { print map[i][j] } } }' file
Explanation
awk 'NR==1 {
    print; next                            # Header record: print it and skip to the next line
}
{
    map[$3][$16]=$0                        # Non-header line: build a two-dimensional array indexed
                                           # by ID (assuming it is unique in the file) and by the
                                           # 16th field, with the whole line as the value
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"   # Set the array traversal order to index, numeric ascending
    for(i in map) {
        for(j in map[i]) {
            print map[i][j]                # Loop through the array printing the values
        }
    }
}' file
I suggest you try the following script:
#!/bin/bash
head -n 1 file.txt > file_out.txt
tail -n +2 file.txt | sort -k 16 >> file_out.txt
This works with your output sample, provided you convert the blanks into tabs first, obviously.
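If the sample was pasted with spaces instead of tabs, something like this could do that conversion first (a sketch; it assumes runs of spaces should become single tabs):
tr -s ' ' '\t' < file.txt > file_tabs.txt   # squeeze runs of spaces into single tabs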
awk to the rescue!
$ awk 'NR==1; NR>1{print | "sort -k16n"}' file | column -t
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571

How to use this awk command without affecting the header

Good night. I have these two files:
File 1 - with phenotype information; the first column holds the IDs, and the original file has 400 rows:
ID a b c d
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - with SNP information; the original file has 400 lines and 42,000 characters per line:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
217 2 0 2 1
218 0 2 0 2
And I need to remove from file 2 the individuals that do not appear in file 1, for example:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
I used this code:
awk 'NR==FNR{a[$1]; next}$1 in a{print $0}' file2 file1 > file3
and I get this output (file 3):
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
but I lose the header. How do I keep the header?
To keep the header of the second file, add a condition{action} like this:
awk 'NR==FNR {a[$1]; next}
FNR==1 {print $0; next} # <= this will print the header of file2.
$1 in a {print $0}' file1 file2
NR holds the total record number, while FNR is the per-file record number: it counts the records of the file currently being processed. The next statements are also important, so that awk continues with the next record and doesn't try the remaining actions.
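A quick way to see the NR/FNR difference for yourself (a toy command, not tied to the files above):
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' file1 file2
# NR keeps counting across both files; FNR restarts at 1 when file2 begins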

How to add text to the next line in a tab-separated file from other files?

I have a set of files containing tab-separated values; the third line from the end of each file holds my desired values. I have extracted that value with:
cat result1.tsv | tail -3 | head -1 > final1.tsv
cat result2.tsv | tail -3 | head -1 > final2.tsv
... and so on (I have almost 30-40 files).
I want the content of each final .tsv file on its own line in a single new file.
I tried
cat final1.tsv final2.tsv > final.tsv
but this only works for a limited number of files; it is difficult to write out the names of all of them.
I tried to put the file names in a loop as variables, but it didn't work.
final1.tsv contains:
270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2.tsv contains:
1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
All the files (final1.tsv, final2.tsv, final3.tsv, final5...) contain the same number of columns but different values.
I want the rows of each file merged into a new file like:
final.tsv
final1 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
final3 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final4 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
here you go...
for f in final{1..4}.tsv;
do
echo -en $f'\t' >> final.tsv;
cat $f >> final.tsv;
done
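If the labels in the sample output are meant to be the bare names without the .tsv extension, a small variation (my assumption about the desired naming) would be:
for f in final{1..4}.tsv; do
    printf '%s\t' "${f%.tsv}"   # row label: the file name with the .tsv extension stripped
    cat "$f"                    # the single extracted line stored in that file
done > final.tsv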
Try this:
rm final.tsv
for FILE in result*.tsv
do
tail -3 $FILE | head -1 >> final.tsv
done
As long as the files aren't enormous, it's simplest to read each file into an array and select the third record from the end.
This solves your problem for you: it looks for all files in the current directory that match result*.tsv and writes the required line from each of them to final.tsv.
use strict;
use warnings 'all';

my @results = sort {
    my ($aa, $bb) = map /(\d+)/, ($a, $b);
    $aa <=> $bb;
} glob 'result*.tsv';

open my $out_fh, '>', 'final.tsv';

for my $result_file ( @results ) {
    open my $fh, '<', $result_file or die qq{Unable to open "$result_file" for input: $!};
    my @data = <$fh>;
    next unless @data >= 3;
    my ($name) = $result_file =~ /([^.]+)/;
    print { $out_fh } "$name\t$data[-3]";
}

Filtering fields based on certain values

I wish you all a very happy New Year.
I have a file that looks like this (example); there is no header, and the file has about 10,000 such rows:
123 345 676 58 1
464 222 0 0 1
555 22 888 555 1
777 333 676 0 1
555 444 0 58 1
PROBLEM: I only want those rows where both field 3 and field 4 have a non-zero value, i.e. in the above example rows 1 and 3 should be included and the rest excluded. How can I do this?
The output should look like this:
123 345 676 58 1
555 22 888 555 1
Thanks.
awk is perfect for this kind of stuff:
awk '$3 && $4' input.txt
This will give you the output that you want.
$3 && $4 is a filter. $3 is the value of the 3rd field, $4 is the value of the fourth. Zero values are evaluated as false, anything else as true. If there can be negative values, then you need to write it more precisely:
awk '$3 > 0 && $4 > 0' input.txt
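For example, saving the sample rows to input.txt and running the stricter form keeps only the two rows where both fields are non-zero:
printf '%s\n' '123 345 676 58 1' '464 222 0 0 1' '555 22 888 555 1' \
    '777 333 676 0 1' '555 444 0 58 1' > input.txt
awk '$3 > 0 && $4 > 0' input.txt
# 123 345 676 58 1
# 555 22 888 555 1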
