Bash exclude lines where proportion of columns contain matched value - bash

I have a large text file that I would like to filter by excluding lines in which a certain number of columns match a particular value. I had previously removed lines where all columns from 2 onwards contained a 0 or a . like so:
awk '{
for (i=2; i<=NF; i++)
if ($i!~/^(\.|0)/) {
print
break
}
}'
but now I would like to print only the lines that have fewer than a specific number of columns with this value (".").
For example with data:
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
and a match value of 2 I would expect the bottom two lines to be excluded so that the output would be:
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
Any ideas?

With awk:
$ awk '{c=0;for(i=1;i<=NF;i++) c += ($i == ".")}c<2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
Basically it iterates over each column and adds one to the counter if the column equals a period (.).
The c<2 part prints the line only if fewer than two columns are periods.
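The same counting idea can be parameterised so that the value and the threshold are not hard-coded. This is only a sketch: the helper name filter_by_count and the awk variables val and max are made up for illustration, and the input is read from stdin.

```shell
# filter_by_count VALUE MAX: print the lines of stdin that have fewer than
# MAX fields exactly equal to VALUE. (Hypothetical helper name.)
filter_by_count() {
    awk -v val="$1" -v max="$2" '{
        c = 0
        # i runs up to and including NF so the last column is counted too
        for (i = 1; i <= NF; i++)
            if (($i "") == val) c++     # appending "" forces a string comparison
    } c < max'
}

# Sample data from the question; the last two lines should be dropped.
filter_by_count . 2 <<'EOF'
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
EOF
```

To start counting from a later column, change the start value of i in the loop.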
With sed one can use:
$ sed -r 'h;s/[^. ]+//g;s/\.\. *//g;/\. \./d;x' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
-r enables extended regular expressions (-E on *BSD).
Basically, a copy of the pattern space is stored in the hold buffer, then everything except spaces and periods is removed, and the doubled periods left over from ./. are dropped.
Now, if the pattern space contains two separate periods, the line is deleted; if not, the pattern space is exchanged with the hold buffer so that the original line is printed.

$ awk '{delete a; for(i=1;i<=NF;i++) a[$i]++; if(a["."]>=2) next} 1' foo
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
It iterates over all fields (for), counts the field values and, if there are 2 or more . in a record, refrains from printing (next). If you want to count the periods only from field 3 onward, change the start value of i in the for: for(i=3; ...).

$ cat ip.txt
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
$ perl -ne '(@c)=/\.\/\.|\./g; print if $#c < 1' ip.txt
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
(@c)=/\.\/\.|\./g array of ./. or . matches from current line
$#c indicates index of last element, i.e (size of array - 1)
So, to ignore lines containing 3 elements like ./. or . use $#c < 2

Similar to @spasic's answer, but easier (for me) to read!
perl -ane 'print if (grep { /^\.$/} @F) < 2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
The -a separates the space-separated fields into an array called @F for me. I then grep in the array @F looking for elements that consist of just a period - i.e. those that start with a period and end immediately after the period. That counts the lone periods in each line and I print the line if that number is less than 2.

Perhaps this is alright.
awk '$0 !~/\. \./' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
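One caveat worth checking before relying on this: the pattern only matches two periods that sit in neighbouring columns with a single space between them. A quick sketch with a made-up input line (not from the question) where the periods are separated by another column:

```shell
# Made-up line: two lone periods, but not in adjacent columns.
line='0 . 0 . 0'

# The regex needs the literal text ". .", so this line is NOT filtered out.
kept=$(printf '%s\n' "$line" | awk '$0 !~ /\. \./')
echo "regex test keeps:  [$kept]"

# Counting the fields instead does catch it, so nothing is kept here.
counted=$(printf '%s\n' "$line" | awk '{c=0; for(i=1;i<=NF;i++) c += ($i == ".")} c<2')
echo "field count keeps: [$counted]"
```

So the regex version is fine for the sample data, but the field-counting answers are the safer choice in general.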

Related

How to find column values and replace in bash

I could do this easily in R with grepl and row indexing, but wanted to try this in shell. I have a text file that looks like what I have below. I would like to find rows where It matches TWGX and wherever it match, I would like to concatenate column 1 and column 2 separated by _ and make it column values for both column 1 and column 2.
text:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP 10064-8036056040 0 0 0 -9
TWGX-MAP 11570-8036056502 0 0 0 -9
TWGX-MAP 11680-8036055912 0 0 0 -9
This is the result I want:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
The regex /TWGX/ selects the lines containing that string and applies the action that follows. The 1 is an awk shorthand that will print both the modified and unmodified lines.
$ awk 'BEGIN{FS=OFS="\t"} /TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}1' file
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
BEGIN { FS = OFS = "\t" }
# Just once, before processing the file, set FS (field separator) and OFS (output field separator) to be the tab character
/TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}
# For every line that contains a match for TWGX create a mashup of the first two columns, and assign it to each of columns 1 and 2. (Note that in awk string concatenation is done by simply putting expressions next to one another)
1
# This is an awk idiom that consists of the pattern 1, which is always true. By not explicitly specifying an action to go with that pattern, the default action of printing the whole line will be executed.

Common lines from 2 files based on 2 columns per file

I have two file:
file1:
1 imm_1_898835 0 908972 0 A
1 vh_1_1108138 0 1118275 T C
1 vh_1_1110294 0 1120431 A G
1 rs9729550 0 1135242 C A
file2:
1 exm1916089 0 865545 0 0
1 exm44 0 865584 0 G
1 exm46 0 865625 0 G
1 exm47 0 865628 A G
1 exm51 0 908972 0 G
1 exmF 0 1120431 C A
I want to obtain a file that is the overlap between file1 and file2 based on columns 1 and 4, printing the common values for columns 1 and 4 and also column 2 from file1 and from file2.
e.g
I want:
1 908972 imm_1_898835 exm51
1 1120431 vh_1_1110294 exmF
Could you please try following.
awk 'FNR==NR{a[$1,$4]=$2;next} (($1,$4) in a){print $1,$4,a[$1,$4],$2}' file1 file2
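For anyone reading this later, here is the same one-liner spread out with comments and run against the sample rows from the question. It assumes whitespace-separated columns and that each (column 1, column 4) pair occurs at most once in file1; with duplicates, the last column 2 seen would win. The function name common_rows is made up.

```shell
cat > file1 <<'EOF'
1 imm_1_898835 0 908972 0 A
1 vh_1_1108138 0 1118275 T C
1 vh_1_1110294 0 1120431 A G
1 rs9729550 0 1135242 C A
EOF
cat > file2 <<'EOF'
1 exm1916089 0 865545 0 0
1 exm44 0 865584 0 G
1 exm46 0 865625 0 G
1 exm47 0 865628 A G
1 exm51 0 908972 0 G
1 exmF 0 1120431 C A
EOF

# common_rows FILE1 FILE2 (hypothetical helper name)
common_rows() {
    awk '
        # First file (FNR == NR holds only while reading it): remember
        # column 2, keyed by the pair (column 1, column 4).
        FNR == NR { a[$1, $4] = $2; next }
        # Second file: if the same pair was seen in the first file, print the
        # shared columns, then column 2 from file1 and column 2 from file2.
        ($1, $4) in a { print $1, $4, a[$1, $4], $2 }
    ' "$1" "$2"
}

common_rows file1 file2
```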

Removing duplicate lines with different columns

I have a file which looks like follows:
ENSG00000197111:I12 0
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 0
ENSG00000197111:I5 1
I have some lines that are duplicated but I cannot remove by sort -u because the second column has different values for them (1 or 0). How do I remove such duplicates by keeping the lines with second column as 1 such that the file will be
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
You can use awk and the logical OR operator, if the order isn't mandatory:
awk '{d[$1]=d[$1] || $2}END{for(k in d) print k, d[k]}' file
you get
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
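If the original order does matter, one option is to remember the order in which the keys first appear. This is a sketch along the same lines rather than part of the answer above; the function name keep_flagged, the array order and the counter n are just illustrative.

```shell
# keep_flagged: collapse lines of stdin by column 1, reporting 1 if any
# duplicate had a non-zero column 2, with keys printed in first-seen order.
# (Hypothetical helper name.)
keep_flagged() {
    awk '
        !($1 in d) { order[++n] = $1 }      # first time this key is seen
        { d[$1] = d[$1] || $2 }             # logical OR of all column-2 values
        END { for (i = 1; i <= n; i++) print order[i], d[order[i]] }
    '
}

keep_flagged <<'EOF'
ENSG00000197111:I12 0
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I5 0
ENSG00000197111:I5 1
EOF
```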
Edit, only sort solution
You can use sort with a double pass, example
sort -k1,1 -k2,2r file | sort -u -k1,1
you get,
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1

How do I filter tab-separated input by the count of fields with a given value?

My data(tab separated):
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0
...
How can I grep the lines with exactly, for example, 5 '1's?
ideal output:
1 0 0 1 0 1 1 0 1
Also, how can I grep lines with 5 or more (>=) '1's?
ideal output:
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
I tried:
grep 1$'\t'1$'\t'1$'\t'1$'\t'1
However, this only outputs lines with consecutive '1's, which is not all I want.
I wonder if there is a simple method to achieve this. Thank you!
John Bollinger's helpful answer and anishane's answer show that it can be done with grep, but, as has been noted, that is quite cumbersome, given that regular expressions aren't designed for counting.
awk, by contrast, is built for field-based parsing and counting (often combined with regular expressions to identify field separators, or, as below, the fields themselves).
Assuming you have GNU awk, you can use the following:
Exactly 5 1s:
awk -v FPAT='\\<1\\>' 'NF==5' file
5 or more 1s:
awk -v FPAT='\\<1\\>' 'NF>=5' file
Special variable FPAT is a GNU awk extension that allows you to identify fields via a regex that describes the fields themselves, in contrast with the standard approach of using a regex to define the separators between fields (via special variable FS or option -F):
'\\<1\\>' identifies any "isolated" 1 (surrounded by non-word characters) as a field, based on word-boundary assertions \< and \>; the \ must be doubled here so that the initial string parsing performed by awk doesn't "eat" single \s.
Standard variable NF contains the count of input fields in the line at hand, which allows easy numerical comparison. If the conditional evaluates to true, the input line at hand is implicitly printed (in other words: NF==5 is implicitly the same as NF==5 { print } and, more verbosely, NF==5 { print $0 }).
A POSIX-compliant awk solution is a little more complicated:
Exactly 5 1s:
awk '{ l=$0; gsub("[\t0]", "") }; length($0)==5 { print l }' file
5 or more 1s:
awk '{ l=$0; gsub("[\t0]", "") }; length($0)>=5 { print l }' file
l=$0 saves the input line ($0) in its original form in variable l.
gsub("[\t0]", "") replaces all \t and 0 chars. in the input line with the empty string, i.e., effectively removes them, and only leaves (directly concatenated) 1 instances (if any).
length($0)==5 { print l } then prints the original input line (l) only if the resulting string of 1s (i.e., the count of 1s now stored in the modified input line ($0)) matches the specified count.
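If the fields can hold values other than 0 and 1 (say 10 or 11), stripping characters would miscount, so a plain loop over the fields is safer. The following is a sketch that should work in any POSIX awk, assuming tab-separated input; the helper name lines_with_at_least and the variables val and n are made up.

```shell
# lines_with_at_least VALUE N: print the lines of tab-separated stdin in
# which at least N fields are exactly VALUE. (Hypothetical helper name.)
lines_with_at_least() {
    awk -F'\t' -v val="$1" -v n="$2" '{
        c = 0
        for (i = 1; i <= NF; i++)
            if (($i "") == val) c++     # string comparison, so 1 does not match 10
    } c >= n'
}

# Three rows from the question; only the first two have five or more 1s.
printf '1\t0\t0\t1\t0\t1\t1\t0\t1\n1\t1\t0\t1\t0\t1\t0\t1\t1\n0\t0\t0\t0\t0\t0\t0\t0\t0\n' |
    lines_with_at_least 1 5
```

For an exact count, change c >= n to c == n.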
You can use grep. But that would be an abuse of regex.
$ cat countme
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0
$ grep -P '^[0\t]*(1[0\t]*){5}[0\t]*$' countme # Match exactly 5
1 0 0 1 0 1 1 0 1
$ grep -P '^[0\t]*(1[0\t]*){5,}[0\t]*$' countme # Match >=5
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
You can do this to get lines with exactly five '1's:
grep '^[^1]*\(1[^1]*\)\{5,5\}[^1]*$'
You can simplify that to this for at least five '1's:
grep '\(1[^1]*\)\{5,\}'
The enumerated quantifier (\{n,m\}) enables you to conveniently specify a particular number or range of numbers of consecutive matches to a sub-pattern. To avoid matching lines with extra matches to such a pattern, however, you must also anchor it to the beginning and end of the line.
The other trick is to make sure the gaps before the first 1, between the 1s, and after the last 1 are matched. In your case, all of those gaps can be represented pretty simply as runs of zero or more characters other than 1: [^1]*. Putting those pieces together gives you the above regular expressions.
Do
sed -nE '/^([^1]*1[^1]*){5}$/p' your_file
for exactly 5 matches and
sed -nE '/^([^1]*1[^1]*){5,}$/p' your_file
for 5 or more matches.
Note: In GNU sed you may not see the -E option in the manpage, but it is supported. Using -E is for portability to, say, Mac OSX.
with perl
$ perl -ane 'print if (grep {$_==1} @F) == 5' ip.txt
1 0 0 1 0 1 1 0 1
$ perl -ane 'print if (grep {$_==1} @F) >= 5' ip.txt
1 0 0 1 0 1 1 0 1
1 1 0 1 0 1 0 1 1
1 1 1 1 1 1 1 1 1
-a to automatically split input line on whitespace and save to @F array
grep {$_==1} @F returns array with elements from @F array which are exactly equal to 1
(grep {$_==1} @F) == 5 in scalar context, comparison will be done based on number of elements of array
See http://perldoc.perl.org/perlrun.html#Command-Switches for details on -ane options

command using awk only outputting 1 line

I have two files:
file 1:
rs3094315 1 0 742429 G A
rs12124819 1 0 766409 G A
rs2272756 1 0 871896 A G
rs3128126 1 0 952073 G A
rs3934834 1 0 995669 A G
rs3766192 1 0 1007060 G A
file 2:
rs12565286 1 0 711153 C G
rs12138618 1 0 740098 A G
rs3094315 1 0 742429 G A
rs3131968 1 0 744055 A G
rs12562034 1 0 758311 A G
rs2905035 1 0 765522 A G
rs12124819 1 0 766409 G A
rs2980319 1 0 766985 A T
rs4040617 1 0 769185 G A
rs2980300 1 0 775852 T C
rs4951864 1 0 787889 C T
rs12132517 1 0 788664 A G
rs950122 1 0 836727 C G
rs2272756 1 0 871896 A G
rs3128126 1 0 952073 G A
rs3121561 1 0 980243 T C
rs3813193 1 0 988364 C G
rs4075116 1 0 993492 C T
rs3934834 1 0 995669 T C
rs3766193 1 0 1007033 C G
rs3766192 1 0 1007060 C T
rs3766191 1 0 1007450 T C
The files have many more matches in the first column after these shown here, there are about 500k lines in both files.
I'm trying to use the following command to find matches in the first column (rs####) and, if found, put the matching lines on one line in a new file.
awk 'NF==FNR{s=$1; a[s]=$0; next} a[$1]{print $0" "a[$1]}' file1 file2 > mergedfiles
However, this command only gives 1 match (shown below) in mergedfiles and I just can't figure out what is going wrong. It's probably something really easy :s. Thanks in advance if you are able to clear this problem up.
rs3766192 1 0 1007060 C T rs3766192 1 0 1007060 G A
Use:
NR==FNR
Your condition only picks up the sixth line (because there are 6 fields in the first file)!
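Spelled out, the corrected command could look like the sketch below. Besides NR==FNR, it tests ($1 in a) rather than a[$1], which avoids creating an empty array entry for every unmatched key in file2; that second change is a suggestion and not part of the answer above. The sample files are cut down from the question and the function name is made up.

```shell
cat > file1 <<'EOF'
rs3094315 1 0 742429 G A
rs12124819 1 0 766409 G A
EOF
cat > file2 <<'EOF'
rs12565286 1 0 711153 C G
rs3094315 1 0 742429 G A
rs12124819 1 0 766409 G A
EOF

# merge_on_first_column FILE1 FILE2 (hypothetical helper name)
merge_on_first_column() {
    awk '
        NR == FNR { a[$1] = $0; next }      # true only for lines of the first file
        $1 in a   { print $0 " " a[$1] }    # file2 line followed by the file1 line
    ' "$1" "$2"
}

merge_on_first_column file1 file2 > mergedfiles
cat mergedfiles
```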
