I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried AWK's substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Moved the if statement out of the { ... } block and into a pattern, which acts as a filter
Used length($2) and length($4) instead of hardcoding the value 21
Dropped { print $0 }, as printing is the default action for matched lines
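To see the filter in action, here is a small self-contained run against the sample rows from the question (the file name probes.txt is just for illustration):

```shell
# Build the sample dataset (header row omitted).
cat > probes.txt <<'EOF'
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
EOF

# Keep only rows whose 2nd and 4th fields end in the same character.
# Prints the rows 4736, 4737 and 4740.
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)' probes.txt
```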
I have a function append_val_to_line which appends $append_val to the line in Input.txt and writes it back. I want to insert 45 at column 1; since it is two characters long, it should occupy columns 0 and 1. But my code below adds it at columns 2 and 3.
I am not sure why that is. Can someone help me achieve the goal mentioned above?
My working solution is below, but it does not add the 45 at columns 0 and 1. I want a generic solution: I should be able to insert the value at any column N, and the value should occupy columns N, N+1, ... depending on the length of $append_val.
As you can see in the sample input before and after the call to the append_val_to_line function, the value 45 is added at columns 2 and 3, but I wanted it to start at column 0 and end at column 1, as the value 45 is of length two.
A space also counts as a column position, but I will not be adding any values at the spaces in a line.
#!/bin/bash
function append_val_to_line {
    sed -i 's/\(.\{'"$1"'\}\)/\1'"$2"'/' "Input.txt"
}
column_num=1
append_val=45
append_val_to_line "$column_num" "$append_val"
Input.txt BEFORE the call to function append_val_to_line
1200 5600 775000 34555
Input.txt AFTER the call to function append_val_to_line
145200 5600 775000 34555
Note that 45 has been added at columns 2 and 3.
Since the OP wants positions to start from 0 (and I believe we are really talking about character positions rather than columns here), based on that and the samples shown, the following may help.
awk -v after="0" -v value="45" '{print substr($0,1,after+1) value substr($0,after+2)}' Input_file
Non-one-liner form of the above:
awk -v after="0" -v value="45" '
{
print substr($0,1,after+1) value substr($0,after+2)
}
' Input_file
Explanation: adding a detailed explanation for the above.
awk -v after="0" -v value="45" '  ##Start the awk program here, setting the after variable to 0 as per the OP and value to 45.
{
print substr($0,1,after+1) value substr($0,after+2)  ##Print the substring from position 1 through after+1; since the OP wants to insert value at the 2nd character, this prints the 1st character. Then print value, then the rest of the current line.
}
' Input_file  ##Mention the Input_file name here.
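If "column 0" is instead read as a 0-based character offset at which the inserted value should begin, a slightly different awk sketch (the variable name pos is my own) inserts before that offset rather than after it:

```shell
# Sample line from the question.
printf '1200 5600 775000 34555\n' > Input_file

# Insert value so that it starts at 0-based character position pos.
# With pos=0 the value is prepended, giving: 451200 5600 775000 34555
awk -v pos=0 -v value="45" '{print substr($0,1,pos) value substr($0,pos+1)}' Input_file
```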
I need to find this number: The '3' on the second column where there is a '3' in the first column.
(This is an example. I also could need to find the '25' on the second column where there is a '36' on the first column).
The numbers on the first column are unique. There is no other row starting with a '3' on the first column.
This data is in a text file, and I'd like to use bash (awk, sed, grep, etc.)
I need to find a number on the second column, knowing the (unique) number of the first column.
In this case, I need to grep the 3 below the 0, on the second column (and, in this case, third row):
108 330
132 0
3 3
26 350
36 25
43 20
93 10
101 3
102 3
103 1
This is not good enough, because the grep should apply only to the first-column elements, so that the output would be a single number:
cat foo.txt | grep -w '3' | awk '{print $2}'
Although it is a text file, let's imagine it is a MySQL table. The needed query would be:
SELECT column2 WHERE column1='3'
In this case (text file), I know the input value of the first column (eg, 132, or 93, or 3), and I need to find the value of the same row, in the second column (eg, 0, 10, 3, respectively).
Assuming you mean "find rows where the first column contains a specific string exactly, and the second contains that string somewhere within":
awk -v val="3" '$1 == val && $2 ~ $1 { print $2 }' foo.txt
Notice also how this avoids a useless use of cat.
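If an exact match on the first column is all that is needed (with no constraint on the second column), a plainer variant of the same idea is enough, shown here against a few rows of the sample data:

```shell
cat > foo.txt <<'EOF'
108 330
132 0
3 3
26 350
36 25
EOF

# Print the 2nd field of the row whose 1st field equals val exactly.
# With val=36 this prints: 25
awk -v val="36" '$1 == val { print $2 }' foo.txt
```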
You can find repeated patterns by grouping the first match and looking for a repeat. This works for your example:
grep -wE '([^ ]+) +\1' infile
Output:
3 3
I have a FASTA file (imagine it as a txt file in which even lines are sequences of characters and odd lines are sequence IDs). I would like to search for a string in the sequences and get the positions of the matching substrings as well as their IDs. Example:
Input:
>111
AACCTTGG
>222
CTTCCAACC
>333
AATCG
Search for "CC". Output:
3 111
4 8 222
$ awk -F'CC' 'NR%2==1{id=substr($0,2);next} NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}' file
3 111
4 8 222
Explanation:
-F'CC'
awk breaks input lines into fields. We instruct it to use the sequence of interest, CC in this example, as the field separator.
NR%2==1{id=substr($0,2);next}
On odd number lines, we save the id to variable id. The assumption is that the first character is > and the id is whatever comes after. Having captured the id, we instruct awk to skip the remaining commands and start over with the next line.
NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}
If awk finds only one field on an input line, NF==1, that means that there were no field separators found and we ignore those lines.
For the rest of the lines, we calculate the positions of each match in x and then save each value of x found in the string b.
Finally, we print the match locations, b, and the id.
This will print the start position of each match, along with the id:
awk 'NR%2==1{t=substr($0,2)}{z=a="";while(y=match($0,"CC")){a=a?a" "z+y:z+y;$0=substr($0,z=(y+RLENGTH));z-=1}}a{print a,t }' file
Neater
awk '
NR%2==1{t=substr($0,2)}
{
z=a=""
while ( y = match($0,"CC") ) {
a=a?a" "z+y:z+y
$0=substr($0,z=(y+RLENGTH))
z-=1
}
}
a { print a,t }' file
Output:
3 111
4 8 222
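A third approach, my own sketch rather than either answer above, scans each sequence with index() in a loop instead of splitting on the pattern; unlike the field-separator trick, it also reports overlapping matches (the file name seqs.fa is just for illustration):

```shell
cat > seqs.fa <<'EOF'
>111
AACCTTGG
>222
CTTCCAACC
>333
AATCG
EOF

# For each sequence line, find every occurrence of pat with index(),
# accumulating the 1-based start positions, then print them with the id.
# Prints: 3 111
#         4 8 222
awk -v pat="CC" '
NR % 2 == 1 { id = substr($0, 2); next }
{
  out = ""; p = 0
  while (i = index(substr($0, p + 1), pat)) {
    p += i                            # absolute position of this match
    out = out (out ? " " : "") p
  }
  if (out) print out, id
}' seqs.fa
```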
I have a long tab-delimited file and I am trying to fill a cell with a value that comes later in the same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads, like this awk solution, but the problem is that the value I want is not before the empty cell but after it, kind of like .FillUp in Excel.
Additional information:
input file may have different number of lines
"A" and "B" in input file may be at different rows and not evenly separated
second column may have only two values
last cell in second column may not have value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
This allows you to process the file in a single pass as long as you know the first value to fill. It should be quite a bit faster than reversing the file twice.
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes knowledge about the nature of that second column being only 'A' or 'B' or blank.
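If the strict A/B alternation cannot be assumed, a buffering pass (my own sketch, using spaces rather than tabs for brevity) fills each run of blank cells with the next value seen, falling back to "B" for the trailing rows as the question's edit specifies:

```shell
printf '0\n1\n1.345 B\n2\n2.86 A\n3\n4\n' > input.txt

# Buffer lines until a row with a 2nd field appears, then flush the buffered
# rows using that value; trailing blank rows get the known final value "B".
awk '{
  buf[NR] = $0
  if ($2) {
    for (i = last + 1; i < NR; i++) print buf[i], $2
    print
    last = NR
  }
}
END { for (i = last + 1; i <= NR; i++) print buf[i], "B" }' input.txt
```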
I have a text file like below.
1 1223 abc
2 4234 weroi
0 3234 omsder
1 1111 abc
2 6666 weroi
I want the values in column 3 to be unique, so I want the file below.
1 1223 abc
2 4234 weroi
0 3234 omsder
Can I do this using some basic commands in Linux, without using Java or anything like that?
You could do this with some awk scripting. Here is a piece of code I came up with to address your problem:
awk 'BEGIN {col=3; sep=" "; forbidden=sep} {if (match(forbidden, sep $col sep) == 0) {forbidden=forbidden $col sep; print $0}}' input.file
The BEGIN block initializes the forbidden string, which is used to track the 3rd-column values seen so far. Then, match() checks whether the 3rd column of the current line contains any forbidden value. If not, it adds the content of the column to the forbidden list and prints the whole line.
Here, sep=" " sets the separator. We put sep between the forbidden values to avoid accidental matches created by putting several values next to one another. For instance:
1 1111 ta
2 2222 to
3 3333 t
4 4444 tato
In this case, without a separator, t and tato would wrongly be considered forbidden values. We use " " as the separator because it is the default column separator, so a column value cannot contain a space.
Note that if you want to change the column from which duplicates are removed, just adapt col=3 to the column number you need (0 for the whole line, 1 for the first column, 2 for the second, ...).
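For completeness, the same deduplication is commonly written with an awk associative array, which sidesteps the substring-collision issue entirely:

```shell
cat > input.file <<'EOF'
1 1223 abc
2 4234 weroi
0 3234 omsder
1 1111 abc
2 6666 weroi
EOF

# seen[$3]++ evaluates to 0 (false) the first time a 3rd-column value
# appears, so !seen[$3]++ keeps only the first line for each value.
# Prints the first three lines of the sample file.
awk '!seen[$3]++' input.file
```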