Remove lines from a text file where a column value is repeated (Ubuntu / bash)

I have a text file like below.
1 1223 abc
2 4234 weroi
0 3234 omsder
1 1111 abc
2 6666 weroi
I want to have unique values in column 3, so I want to end up with the file below.
1 1223 abc
2 4234 weroi
0 3234 omsder
Can I do this using basic Linux commands, without using Java or anything like that?

You could do this with some awk scripting. Here is a piece of code I came up with to address your problem:
awk 'BEGIN {col=3; sep=" "; forbidden=sep} {if (match(forbidden, sep $col sep) == 0) {forbidden=forbidden $col sep; print $0}}' input.file
The BEGIN block declares the forbidden string, which is used to track the values already seen in the 3rd column. Then, the match function checks whether the 3rd column of the current line is already a forbidden value. If not, it adds the content of the column to the forbidden list and prints the whole line.
Here, sep=" " initializes the separator. We put sep between the forbidden values to avoid false matches created by putting several values next to one another. For instance:
1 1111 ta
2 2222 to
3 3333 t
4 4444 tato
In this case, without a separator, t and tato would wrongly be treated as forbidden values. We use " " as the separator because it is the default column separator, so a column value cannot itself contain a space.
Note that if you want to remove duplicates based on another column, just change col=3 to the column number you need (0 for the whole line, 1 for the first column, 2 for the second, ...).
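If you only need the default behaviour (first occurrence wins, deduplicating on column 3), a more compact alternative, not from the original answer but a standard awk idiom, uses an associative array instead of the forbidden string:
awk '!seen[$3]++' input.file
seen[$3]++ evaluates to 0 the first time a value appears in column 3 and to a positive number afterwards, so the negation prints only first occurrences. Replace $3 with $1, $2, ... to deduplicate on another column.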

Related

find and replace substrings in a file which match strings in another file

I have two txt files: File1 is a tsv with 9 columns. Following is its first row (SRR6691737.359236/0_14228//11999_12313 is the first column and after Repeat is the 9th column):
SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9
File2 is a tsv with 9 columns. Following is its first row (after Read is the 9th column):
CM011822.1 reefer discordance 63738705 63738727 . + . Read SRR6691737.359236 11999 12313; Dup 277
File1 contains information of read name (SRR6691737.359236), read length (0_14228) and coordinates (11999_12313) while file two contains only read name and coordinate. All read names and coordinates in file1 are present in file2, but file2 may also contain the same read names with different coordinates. Also file2 contains read names which are not present in file1.
I want to write a script which finds read names and coordinates in file2 that match those in file1 and adds the read length from file1 to file2. i.e. changes the last column of file2:
Read SRR6691737.359236 11999 12313; Dup 277
to:
Read SRR6691737.359236/0_14228//11999_12313; Dup 277
any help?
It is unclear what your input files look like.
You write:
I have two txt files: File1 is a tsv with 9 columns. Following is
its first row (SRR6691737.359236/0_14228//11999_12313 is the first
column and after Repeat is the 9th column):
SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9
If I try to check the columns (and put them in a 'Column,Value' pair):
Column,Value
1,SRR6691737.359236/0_14228//11999_12313
2,Censor
3,repeat
4,5
5,264
6,1169
7,+
8,.
9,Repeat
10,BOVA2
11,SINE
12,1
13,260
14,9
That seems to be 14 columns, but you specify 9 columns...
Can you edit your question, and be clear about this?
i.e. specify as csv
SRR6691737.359236/0_14228//11999_12313,Censor,repeat,5,.....
Added info, after feedback:
file1 contains the following fields (tab-separated):
SRR6691737.359236/0_14228//11999_12313
Censor
repeat
5
264
1169
+
.
Repeat BOVA2 SINE 1 260 9
You want to convert this (using a script) to a tab-separated file:
CM011822.1
reefer
discordance
63738705
63738727
.
+
.
Read SRR6691737.359236 11999 12313; Dup 277
More info is needed to solve this!
field 1: How/Where is the info for 'CM011822.1' coming from?
field 2 and 3: 'reefer'/'discordance'. Is this fixed text, should these fields always contain these texts, or are there exceptions?
field 4 and 5: Where are these values (63738705 ; 63738727) coming from?
OK, it's clear that there are more questions to be asked than can be answered here …
Second update:
Create a file, name it 'mani.awk':
FILENAME=="file1"{
split($1,a,"/");
x=a[1] " " a[4];
y=x; gsub(/_/," ",y);
r[y]=$1;
c=1; for (i in r) { print c++,i,"....",r[i]; }
}
FILENAME=="file2"{
print "<--", $0, "--> " ;
for (i in r) {
if ($9 ~ i) {
print "B:" r[i];
split(r[i],b,"/");
$9="Read " r[i];
print "OK";
}
};
print "<--", $0, "--> " ;
}
After this, gawk -f mani.awk file1 file2 should produce the correct result.
If not, then I suggest you learn AWK 😉 and change the script as needed.

How can I compare rows in Unix text files and add them together in another text file?

I have a text file 1 that has 3 columns. The first column contains a number, the second a word (which can be either a string like dog or a number like 1050), and the third a TAG in capital letters.
I have another text file 2 that has 2 columns. The first column has a number, the second one has a TAG in capital letters.
I want to compare every row in my text file 1 with every row in my text file 2. If the TAG in column [3] of text file 1 is the same as the TAG in column [2] of text file 2, then I want to output the number from text file 1, followed by the number from text file 2, followed by the word from text file 1. There are no duplicate TAGs in text file 2 and there are no duplicate words in text file 1.
Illustration:
Text file 1
2 2737 HPL
32 hello PLS
3 world PLS
323 . OPS
Text file 2
342 HPL
56 PLS
342 DCC
4 OPS
I want:
2 342 2737
32 56 hello
3 56 world
323 4 .
You can do this in awk like this:
awk 'FNR==NR { h[$2] = $1; next } $3 in h { print $1, h[$3], $2 }' file2 file1
The first part saves each TAG and its number from file 2 in an associative array (h); the second part looks up column 3 of file 1 in this array and prints the relevant parts.
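Spelled out as a standalone program (a sketch of the same logic; the file name join.awk is just a placeholder):
# run as: awk -f join.awk file2 file1
FNR == NR {              # true only while reading the first file listed (file2)
    h[$2] = $1           # remember: TAG -> number from file 2
    next                 # skip the second block for file2 lines
}
$3 in h {                # file1 lines whose TAG exists in file2
    print $1, h[$3], $2  # number from file1, number from file2, word from file1
}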

Awk: how to compare two strings in one line

I have a dataset with 20 000 probes; they are in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I tried AWK's substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
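A slightly shorter variant of the same filter: when substr is given only two arguments it returns everything from that position to the end of the string, so the explicit length of 1 can be dropped (probes.txt stands in for your input file):
awk 'substr($2, length($2)) == substr($4, length($4))' probes.txt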

Delete row if value in 3rd column is in another text file

I have a long text file (haplotypes.txt) that looks like this:
19 rs541392352 55101281 A 0 0 ...
19 rs546022921 55106773 C T 0 ...
19 rs531959574 31298342 T 0 0 ...
And a simple text file (positions.txt) that looks like this:
55103603
55106773
55107854
55112489
I would like to remove all the rows where the third field is present in positions.txt, to obtain the following output:
19 rs541392352 55101281 A 0 0 ...
19 rs531959574 31298342 T 0 0 ...
I hope someone can help.
With AWK:
awk 'NR == FNR{a[$0] = 1;next}!a[$3]' positions.txt haplotypes.txt
Breakdown:
NR == FNR { # If file is 'positions.txt'
a[$0] = 1 # Store line as key in associative array 'a'
next # Skip the remaining blocks
}
!a[$3] # Print if third column is not in the array 'a'
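If you prefer to keep the program in a file rather than on the command line, the same breakdown can be saved (say as filter.awk, a name chosen here for illustration) and run with:
awk -f filter.awk positions.txt haplotypes.txt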
This should work:
$ grep -vwFf positions.txt haplotypes.txt
19 rs541392352 55101281 A 0 0 ...
19 rs531959574 31298342 T 0 0 ...
-f positions.txt: read patterns from file
-v: invert matches
-w: match only complete words (avoid substring matches)
-F: fixed string matching (don't interpret patterns as regular expressions)
This expects that only the third column looks like a long number. If the pattern happens to match the exact same word in one of the columns that aren't shown, you can get false positives. To avoid that, you'd have to use an awk solution filtering by column (see andlrc's answer).

Bash - fill empty cell with following value in the column

I have a long tab-delimited file and I am trying to fill a cell with a value that appears later in the same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads, like this awk solution, but the problem is that the value I want is not before the empty cell but after it, kind of like a .FillUp in Excel.
Additional information:
input file may have different number of lines
"A" and "B" in input file may be at different rows and not evenly separated
second column may have only two values
last cell in the second column may not have a value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
The following allows you to process the file in a single pass, as long as you know the first value to fill. It should be quite a bit faster than reversing the file twice.
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes the second column only ever contains 'A', 'B', or blank, and in particular that the non-blank values alternate, so the value to fill is always the opposite of the last one seen.
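Written out with comments, the same one-liner reads as follows (same alternating A/B assumption):
awk '
BEGIN { fillvalue = "B" }                        # value for blanks before the first non-blank row
$2    { fillvalue = ($2 == "A" ? "B" : "A") }    # after an A, following blanks get B, and vice versa
!$2   { $2 = fillvalue }                         # blank second column: fill it in
1                                                # print every (possibly modified) line
' input.txt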
