I have a long text file (haplotypes.txt) that looks like this:
19 rs541392352 55101281 A 0 0 ...
19 rs546022921 55106773 C T 0 ...
19 rs531959574 31298342 T 0 0 ...
And a simple text file (positions.txt) that looks like this:
55103603
55106773
55107854
55112489
I would like to remove all the rows whose third field appears in positions.txt, to obtain the following output:
19 rs541392352 55101281 A 0 0 ...
19 rs531959574 31298342 T 0 0 ...
I hope someone can help.
With AWK:
awk 'NR == FNR{a[$0] = 1;next}!a[$3]' positions.txt haplotypes.txt
Breakdown:
NR == FNR { # If file is 'positions.txt'
a[$0] = 1 # Store line as key in associative array 'a'
next # Skip next blocks
}
!a[$3] # Print if third column is not in the array 'a'
This should work:
$ grep -vwFf positions.txt haplotypes.txt
19 rs541392352 55101281 A 0 0 ...
19 rs531959574 31298342 T 0 0 ...
-f positions.txt: read patterns from file
-v: invert matches
-w: match only complete words (avoid substring matches)
-F: fixed string matching (don't interpret patterns as regular expressions)
This expects that only the third column looks like a long number. If the pattern happens to match the exact same word in one of the columns that aren't shown, you can get false positives. To avoid that, you'd have to use an awk solution filtering by column (see andlrc's answer).
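To make the difference concrete, here is a small sketch with made-up sample files (the rsA/rsB/rsC rows and the scratch directory are hypothetical, not taken from the question). The first row carries a listed position in its fourth column instead of its third:

```shell
# Hypothetical sample data in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' 55106773 > "$tmp/positions.txt"
printf '%s\n' \
  '19 rsA 55101281 55106773' \
  '19 rsB 55106773 0' \
  '19 rsC 31298342 0' > "$tmp/haplotypes.txt"

# grep -w looks for the word anywhere on the line, so rsA is dropped as well
grep_out=$(grep -vwFf "$tmp/positions.txt" "$tmp/haplotypes.txt")

# awk compares only the third field, so rsA is kept
awk_out=$(awk 'NR == FNR { skip[$1]; next } !($3 in skip)' \
  "$tmp/positions.txt" "$tmp/haplotypes.txt")

printf 'grep:\n%s\nawk:\n%s\n' "$grep_out" "$awk_out"
rm -rf "$tmp"
```

With this data, grep keeps only the rsC row, while awk keeps both rsA and rsC.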
Related
I have the following example lines in a file:
sweet_25 2 0 4
guy_guy 2 4 6
ging_ging 0 0 3
moat_2 0 1 0
I want to process the file and have the following output:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
Notice that the change affects only lines 2 and 3: where a word is followed by an underscore and more text, the underscore and the text after it are removed.
I have not succeeded with the following:
sed -E 's/([a-zA-Z])_[a-zA-Z]/$1/g' file.txt >out.txt
Any bash or awk advice will be welcome. Thanks.
If you want to replace the whole word after the underscore, you have to repeat the character class one or more times using [a-zA-Z]+ and use \1 in the replacement.
sed -E 's/([a-zA-Z])_[a-zA-Z]+/\1/g' file.txt >out.txt
If the words should be the same before and after the underscore, you can use a repeating capture group with a backreference.
If you only want to do this for the start of the string you can prepend ^ to the pattern and omit the /g at the end of the sed command.
sed -E 's/([a-zA-Z]+)(_\1)+/\1/g' file.txt >out.txt
The pattern matches:
([a-zA-Z]+) Capture group 1, match 1 or more occurrences of a char a-zA-Z
(_\1)+ Capture group 2, repeat matching _ and the same text captured by group 1
The file out.txt will contain:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
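The two patterns give the same result on the sample, so here is a rough sketch of where they differ. It assumes GNU sed (backreferences in -E patterns are an extension), and the foo_bar line is a made-up addition to the question's data:

```shell
# Two lines from the question plus one hypothetical line (foo_bar)
input=$(printf '%s\n' 'sweet_25 2 0 4' 'guy_guy 2 4 6' 'foo_bar 1 1 1')

# Removes any run of letters that follows letter + underscore
any_word=$(printf '%s\n' "$input" | sed -E 's/([a-zA-Z])_[a-zA-Z]+/\1/g')

# Removes the suffix only when it repeats the word before the underscore
same_word=$(printf '%s\n' "$input" | sed -E 's/([a-zA-Z]+)(_\1)+/\1/g')

printf 'any word:\n%s\nsame word:\n%s\n' "$any_word" "$same_word"
```

The first command turns foo_bar into foo, while the backreference version leaves it alone because bar does not repeat foo.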
With your shown samples, please try following awk code.
awk 'split($1,arr,"_") && arr[1] == arr[2]{$1=arr[1]} 1' Input_file
Explanation: awk's split function splits the first field on _ into the array arr. If the first and second elements of arr are equal, the first field ($1) is set to the first element alone. The trailing 1 prints every line, edited or not.
You can do it more simply, like this:
sed -E 's/_[a-zA-Z]+//' file.txt >out.txt
This just replaces the first underscore followed by one or more alphabetical characters on each line with nothing.
$ awk 'NR~/^[23]$/{sub(/_[^ ]+/,"")} 1' file
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
I would do:
awk '$1~/[[:alpha:]]_[[:alpha:]]/{sub(/_.*/,"",$1)} 1' file
Prints:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
I have a file with 3 columns like this:
NC_0001 10 x
NC_0001 11 x
NC_0002 90 y
I want to rename the values in the first column using another .txt file that contains the conversion, like this:
NC_0001 1
NC_0001 1
NC_0002 2
...
So finally I should have:
1 10 x
1 11 x
2 90 y
How can I do that?
P.S. The first file is very large (50 GB), so I must use a Unix command like awk.
awk -f script.awk map_file data_file
NR == FNR { # for the first file
tab[$1] = $2 # create a k/v of the colname and rename value
}
NR != FNR { # for the second file
$1 = tab[$1] # set first column equal to the map value
print # print
}
As a one-liner
awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file data_file
If possible, you should split the large data file into chunks and run this command on each chunk in parallel. Then concatenate the results in the original order.
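A rough sketch of that split-and-join idea, assuming the standard split utility and a shell with background jobs. The scratch directory, the part_ prefix, and the chunk size are illustrative choices, and a real run would want far larger chunks and a cap on the number of simultaneous jobs:

```shell
# Small stand-in files in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' 'NC_0001 1' 'NC_0002 2' > "$tmp/map_file"
printf '%s\n' 'NC_0001 10 x' 'NC_0001 11 x' 'NC_0002 90 y' > "$tmp/data_file"

# Cut the data file into chunks of 2 lines (use a much larger count for real data)
split -l 2 "$tmp/data_file" "$tmp/part_"

# Rename the first column of each chunk in a background job
for f in "$tmp"/part_??; do
  awk 'NR==FNR{t[$1]=$2; next} {$1=t[$1]; print}' "$tmp/map_file" "$f" > "$f.out" &
done
wait

# split names chunks in sorted order (part_aa, part_ab, ...),
# so the glob puts the pieces back in their original order
result=$(cat "$tmp"/part_??.out)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Each job loads its own copy of the map, which is fine as long as the map file is small compared to the data.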
I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried the AWK substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
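As a quick check, the filter can be run on the five sample rows from the question. This sketch prints only the first column to keep the output short, which is a change made here for readability and not part of the one-liner above:

```shell
# Feed the question's sample rows to the filter and keep the IDs of matching rows
out=$(printf '%s\n' \
  '4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA' \
  '4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG' \
  '4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT' \
  '4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC' \
  '4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC' |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1) { print $1 }')

printf '%s\n' "$out"
```

It prints 4736, 4737 and 4740, matching the desired output.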
I have a text file like below.
1 1223 abc
2 4234 weroi
0 3234 omsder
1 1111 abc
2 6666 weroi
I want to have unique values for the column 3. So I want to have the below file.
1 1223 abc
2 4234 weroi
0 3234 omsder
Can I do this using some basic commands in Linux? without using Java or something.
You could do this with some awk scripting. Here is a piece of code I came up with to address your problem:
awk 'BEGIN {col=3; sep=" "; forbidden=sep} {if (match(forbidden, sep $col sep) == 0) {forbidden=forbidden $col sep; print $0}}' input.file
The BEGIN block initializes the forbidden string, which keeps track of the third-column values seen so far. The match function then checks whether the third column of the current line is already in forbidden. If not, the script appends the value to the forbidden list and prints the whole line.
Here, sep=" " sets the separator. We put sep between the forbidden values so that values placed next to one another cannot form other words. For instance:
1 1111 ta
2 2222 to
3 3333 t
4 4444 tato
In this case, without a separator, forbidden would be "tato" after the first two lines, so both t and tato would be considered forbidden values. We use " " as the separator because it is the default column separator, so a column value cannot contain a space.
Note that if you want to remove duplicates based on a different column, just change col=3 to the number of the column you need (0 for the whole line, 1 for the first column, 2 for the second, ...).
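For comparison, a different and commonly used technique keeps the seen values in an associative array instead of a string. This is a sketch of that alternative, not the script above; the array name seen is arbitrary:

```shell
# seen[$3]++ is 0 (false) the first time a value appears, so !seen[$3]++ is
# true only for the first line carrying each third-column value
out=$(printf '%s\n' \
  '1 1223 abc' \
  '2 4234 weroi' \
  '0 3234 omsder' \
  '1 1111 abc' \
  '2 6666 weroi' |
  awk '!seen[$3]++')

printf '%s\n' "$out"
```

Because the values are array keys and not regex patterns, special characters in the column do not need escaping.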
My file looks like this:
Tree:0,pos:0,len:2.29276,TMRCA:0.795328,ARG:,len:2.29276,TMRCA:0.795328
NEWICK_TREE: [169]((2:0.147398,(6:0.136844,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.0105546):0.647929,(0:0.199142,(1:0.0103058,3:0.0103058):0.188836):0.596186);
SITE: 0 0.0123617064 0.648849164 0010111111
iHistoryMax: 0
Tree:1,pos:0.0169589,len:2.28476,TMRCA:0.795328,ARG:,len:2.28476,TMRCA:0.795328
NEWICK_TREE: [303]((2:0.147398,((6:0.00230499,1:0.00230499):0.134539,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.0105546):0.647929,(0:0.199142,3:0.199142):0.596186);
iHistoryMax: 1
Tree:2,pos:0.0472255,len:2.77342,TMRCA:0.795328,ARG:,len:2.77342,TMRCA:0.795328
NEWICK_TREE: [67](((6:0.00230499,1:0.00230499):0.134539,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.658484,((0:0.199142,3:0.199142):0.436921,2:0.636062):0.159266);
iHistoryMax: 2
Tree:3,pos:0.0539094,len:2.96385,TMRCA:0.795328,ARG:,len:2.96385,TMRCA:0.795328
NEWICK_TREE: [40](((6:0.00230499,1:0.00230499):0.134539,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.658484,((0:0.389568,3:0.389568):0.246494,2:0.636062):0.159266);
iHistoryMax: 3
However, all I need is the pos of each Tree (from the line Tree:1,pos), and the output should be only the number following pos, in one column with 3 rows (or more). The Tree line does not always fall on every third line, as the part in between can vary in length. Can this be done in bash?
Use awk with a delimiter of : and , and then print the fields you want. For example, this will print the Tree and pos numbers:
awk -F'[:,]' '/^Tree:/{print $2,$4}' file
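Since the question asks for the pos value alone, a minimal variation prints only the fourth field. The sketch below feeds it a few lines copied from the question's sample:

```shell
# With : and , as separators, a Tree line splits into Tree, <n>, pos, <value>, ...
out=$(printf '%s\n' \
  'Tree:0,pos:0,len:2.29276,TMRCA:0.795328,ARG:,len:2.29276,TMRCA:0.795328' \
  'SITE: 0 0.0123617064 0.648849164 0010111111' \
  'iHistoryMax: 0' \
  'Tree:1,pos:0.0169589,len:2.28476,TMRCA:0.795328,ARG:,len:2.28476,TMRCA:0.795328' |
  awk -F'[:,]' '/^Tree:/ { print $4 }')

printf '%s\n' "$out"
```

The /^Tree:/ condition skips the SITE and iHistoryMax lines, so the number of lines between trees does not matter.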
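Using grep with -P. PCRE lookbehinds must have a fixed length, so (?<=Tree.*pos:) is rejected; \K does the same job by discarding everything matched before it:
grep -Po '^Tree:.*?pos:\K[0-9.]+' file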
0
0.0169589
0.0472255
0.0539094