Delete lines containing a range pattern in 4th column - bash

In a file, the 4th column contains floating-point numbers:
dsfsd sdfsd sdfds 4.5 dfsdfsd
I want to delete the entire line if the number is between -0.1 and 0.1 (or some other range).
Can sed or awk do that for me?
thanks

I recommend using the "pattern { expression }" syntax:
awk '($4 < -0.1) || ($4 > 0.1) {print}' test.txt
Or, even more concisely:
awk '($4 < -0.1) || ($4 > 0.1)' test.txt
This works because {print} is the default action. I've assumed that you have a file "test.txt" containing your data.

awk:
{ if ($4 > 0.1 || $4 < -0.1) print $0 }
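Since the question mentions "some other range", the bounds can also be passed in as awk variables instead of being hard-coded. A minimal sketch, assuming a hypothetical helper named filter_range and made-up variable names lo and hi; the sample lines are invented for illustration:

```shell
# filter_range LO HI: read lines on stdin and keep only those whose
# 4th field lies outside [LO, HI]. Lines with LO <= $4 <= HI are dropped.
# (filter_range, lo and hi are illustrative names, not from the question.)
filter_range() {
    awk -v lo="$1" -v hi="$2" '$4 < lo || $4 > hi'
}

# Made-up sample data: the 4th column holds the number to test.
printf '%s\n' \
    'a b c 4.5 x' \
    'a b c 0.05 x' \
    'a b c -0.1 x' \
    'a b c -2 x' | filter_range -0.1 0.1
```

With these bounds the lines holding 0.05 and -0.1 are dropped, because both ends of the range count as inside it.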

Loop to create a DF from values in bash

I'm creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all values between two values:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
And create files from 6 to 10, from 11 to 15... from 95 to 100. I was thinking of creating a loop for this with something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can I convert this into a loop and create files with these specific values?
While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, ' NR > 1 {
# bucket index: 0 for [0,5), 1 for [5,10), ... (same ranges as the shell loop below)
i=int($6 / 5)
fname=(i*5) "_" ((i+1)*5) ".txt"
print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" by dividing the value in column six by five. The filename is then constructed by multiplying that index (and its increment) by five. The whole line is then printed to that filename.
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for((i=0; i <= 19; i++))
do
floor=$((i * 5))
ceiling=$(( (i+1) * 5))
awk -F, -v floor="$floor" -v ceiling="$ceiling" \
'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
> "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file.
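To check how a value maps to its bucket before writing any files, the file name can be printed instead of used as a redirection target. This is only a sketch: it assumes non-negative values and half-open buckets [0,5), [5,10), ... to match the floor/ceiling test in the shell loop, and bucket_name is a made-up helper name.

```shell
# bucket_name: read the CSV on stdin, skip the header, and print
# "value -> file name" for column 6 instead of writing to the file.
# Assumes non-negative values and half-open buckets [0,5), [5,10), ...
bucket_name() {
    awk -F, 'NR > 1 {
        i = int($6 / 5)    # bucket index
        print $6 " -> " (i * 5) "_" ((i + 1) * 5) ".txt"
    }'
}

printf '%s\n' \
    'Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y' \
    '10,113934,A,C,0.18943,5.682,rs10904494,10' \
    '10,126070,C,T,0.030435000000000007,3.102,rs11591988,10' \
    '10,158946,C,T,2.03221,19.99,rs11253562,10' | bucket_name
```

Once the mapping looks right, replacing the print of the mapping with print $0 > fname gives the single-pass version.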

awk script for decimal values

I am using this script to extract lines where column 7 is < 1.0E-08 AND
column eight has one or more values > 0.2 (or 0.3).
Is this the right approach?
InputFile: head -1 test.txt
A2 DR28 P3379 72 7 5.008 8.252e-14 0.05132,0.04248,0.002704,0.116,0.04439,0.2,0.3
A2 DR28 P3379 72 7 5.008 0.05 0.05132,0.04248,0.002704,0.116,0.04439,0.006,0.004
Script: first I did
awk '{if($7 < 1.0E-08 || $8 > 0.2) print}' test.txt
This gives the first line as output, but I want to use && (AND) instead of || (OR).
When I use AND (&&)
awk '{if($7 < 1.0E-08 && $8 > 0.2) print}' test.txt
I get no result, though line one fits this criterion.
I also tried this, but it only considers column eight as a cut-off point:
awk -F',' '$8 > 0.2' test.txt
This script works fine, but I need to consider column 7 too. I only get a few lines of output, so I just want to make sure that I am not missing anything.
Not tested, but something like this should work (the extra parameters n, f8 and i keep those variables local to the function):
$ awk 'function anyGreater(x, v,    n, f8, i) {
n=split(x,f8,",");
for(i=1;i<=n;i++) if(f8[i]>v) return 1;
return 0}
$7<1.0E-08 && anyGreater($8,0.2)' file
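One likely reason the && version prints nothing: with whitespace as the separator, $8 is the whole comma-separated list, which does not look like a number, so awk compares it with 0.2 as a string. Splitting the list and testing each element avoids that. Below is a sketch of the same idea run against the two sample lines; the wrapper name any_greater_filter and the variable cutoff are made up for illustration.

```shell
# any_greater_filter CUTOFF: print lines where field 7 is below 1.0E-08
# and at least one comma-separated value in field 8 exceeds CUTOFF.
any_greater_filter() {
    awk -v cutoff="$1" '
    # n, parts and i are listed as extra parameters so they stay local.
    function anyGreater(list, limit,    n, parts, i) {
        n = split(list, parts, ",")
        for (i = 1; i <= n; i++)
            if (parts[i] + 0 > limit + 0) return 1   # force numeric comparison
        return 0
    }
    $7 + 0 < 1.0E-08 && anyGreater($8, cutoff)'
}

printf '%s\n' \
    'A2 DR28 P3379 72 7 5.008 8.252e-14 0.05132,0.04248,0.002704,0.116,0.04439,0.2,0.3' \
    'A2 DR28 P3379 72 7 5.008 0.05 0.05132,0.04248,0.002704,0.116,0.04439,0.006,0.004' |
    any_greater_filter 0.2
```

Only the first sample line is printed: its column 7 is below 1.0E-08 and its list contains 0.3, which exceeds the cutoff.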

How to do an if else match on pattern in awk

I've tried the below command:
awk '/search-pattern/ {print $1}'
How do I write the else part for the above command?
Classic way:
awk '{if ($0 ~ /pattern/) {then_actions} else {else_actions}}' file
$0 represents the whole input record.
Another idiomatic way,
based on the ternary operator syntax selector ? if-true-exp : if-false-exp (wrap the whole expression in parentheses after print so it parses the same way in every awk):
awk '{print (($0 ~ /pattern/) ? text_for_true : text_for_false)}'
awk '{x == y ? a[i++] : b[i++]}'
awk '{print (($0 ~ /two/) ? NR "yes" : NR "No")}' <<<$'one two\nthree four\nfive six\nseven two'
1yes
2No
3No
4yes
A straightforward method is,
/REGEX/ {action-if-matches...}
! /REGEX/ {action-if-does-not-match}
Here's a simple example,
$ cat test.txt
123
456
$ awk '/123/{print "O",$0} !/123/{print "X",$0}' test.txt
O 123
X 456
Equivalent to the above, but without repeating the pattern (DRY principle). Note the next, which stops a matching line from also reaching the second rule:
awk '/123/{print "O",$0; next}{print "X",$0}' test.txt
This is functionally equivalent to awk '/123/{print "O",$0} !/123/{print "X",$0}' test.txt
Depending what you want to do in the else part and other things about your script, choose between these options:
awk '/regexp/{print "true"; next} {print "false"}'
awk '{if (/regexp/) {print "true"} else {print "false"}}'
awk '{print (/regexp/ ? "true" : "false")}'
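The first of these options depends on next. A short sketch of what happens with and without it, using two throwaway helper names (without_next, with_next) and the same two-line input as the earlier example:

```shell
# Without "next", a matching line falls through to the pattern-less rule too.
without_next() { awk '/123/ {print "O", $0} {print "X", $0}'; }

# With "next", awk skips the remaining rules for a matching line,
# so the second rule acts as the "else" branch.
with_next() { awk '/123/ {print "O", $0; next} {print "X", $0}'; }

printf '123\n456\n' | without_next   # 123 is reported by both rules
printf '123\n456\n' | with_next      # each line is reported exactly once
```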
The default action of awk is to print a line. You're encouraged to use more idiomatic awk
awk '/pattern/' filename
#prints all lines that contain the pattern.
awk '!/pattern/' filename
#prints all lines that do not contain the pattern.
# If you find if(condition){}else{} an overkill to use
awk '/pattern/{print "yes";next}{print "no"}' filename
# Same as if(pattern){print "yes"}else{print "no"}
This command checks whether the values in columns $1, $2 and $7 are greater than 1, 1 and 5, respectively.
Lines whose values do not match are dropped by the filter we declared in awk.
(You can use the logical operators and = "&&" and or = "||".)
awk '($1 > 1) && ($2 > 1) && ($7 > 5)'
You can monitor your system with the "vmstat 3" command, where "3" means a 3-second delay between updates:
vmstat 3 | awk '($1 > 1) && ($2 > 1) && ($7 > 5)'
I stressed my computer with a 13 GB copy between USB-connected hard disks while scrolling a YouTube video in the Chrome browser.

Awk & Sort-Output as Comma Delimited?

I am trying to get this to output as comma delimited. The current version doesn't work at all (I get a blank file as an output), and previous versions (where I keep the awk BEGIN statements but don't have the sort delimiter) will just output as tab delimited, not comma delimited. In the previous versions, without attempting to get the comma delimiters, I do get the expected answer (with the complicated filters, etc), so I'm not asking for help with that portion of it. I realize this is a very ugly way to filter and the numbers are also ugly/very large.
The background of the question: Find the regions in the file lamina.bed that overlap with the region chr12:5000000-6000000, and to sort descending by column 4, output as comma delimited. Chromosome is the first column, start position of the region is column 2, end position is column 3, value is column 4. We are supposed to use awk (in Unix bash shell). Thank you in advance for your help!
awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000)' /vol1/opt/data/lamina.bed | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn > ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
cat ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
sample lines of input (tab delimited, including the lines on chr12 that should work):
#chrom start end value
chr1 11323785 11617177 0.86217008797654
chr1 12645605 13926923 0.934891485809683
chr1 14750216 15119039 0.945945945945946
chr12 3306736 5048326 0.913561847988077
chr12 5294045 5393088 0.923076923076923
chr12 5505370 6006665 0.791318864774624
chr12 7214638 7827375 0.8562874251497
chr12 8139885 10173149 0.884353741496599
To get comma-separated output, use the following:
$ awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1;print}' file | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn
chr12,5294045,5393088,0.923076923076923
chr12,3306736,5048326,0.913561847988077
chr12,5505370,6006665,0.791318864774624
The only change above is the addition of the action:
{$1=$1;print}
awk will only reformat a line with a new field separator if one or more of the fields on the line have been changed in some way. $1=$1 is sufficient to indicate that field 1 has been changed. Consequently, the new field separators are inserted.
Also, the two calls to awk can be combined into a single call:
awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1; if($1 == "chr12") print}' file | sort -t$"," -k4rn
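The four-way overlap test can also be shortened. Assuming every region has start <= end, a region overlaps the query interval exactly when it starts no later than the query ends and ends no earlier than the query starts. A sketch under that assumption, with boundaries treated as inclusive like the original conditions; the helper name overlaps and the variables chr, lo and hi are made up for illustration:

```shell
# overlaps CHR LO HI: read a tab-separated BED-like file on stdin and print,
# comma-separated, the regions on CHR that overlap [LO, HI].
# Assumes start <= end on every line.
overlaps() {
    awk -v chr="$1" -v lo="$2" -v hi="$3" '
    BEGIN { FS = "\t"; OFS = "," }
    $1 == chr && $2 <= hi && $3 >= lo { $1 = $1; print }'
}

# Sample lines from the question, rebuilt with real tabs.
printf '%s\t%s\t%s\t%s\n' \
    chr1 11323785 11617177 0.86217008797654 \
    chr12 3306736 5048326 0.913561847988077 \
    chr12 5294045 5393088 0.923076923076923 \
    chr12 5505370 6006665 0.791318864774624 \
    chr12 7214638 7827375 0.8562874251497 |
    overlaps chr12 5000000 6000000 |
    LC_ALL=C sort -t, -k4,4nr
```

The sort key is written as -k4,4nr so that only column 4 is compared, and LC_ALL=C keeps the decimal point handling independent of the locale.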
Simpler Example
In the following, the input is tab-separated and the output field separator, OFS, is set to a comma. In this first example, the awk command print is used:
$ echo $'a\tb\tc' | awk -v OFS=, '{print}'
a b c
Despite OFS=,, the output retains the tab-separator.
Now, we add the simple statement $1=$1 and observe the output:
$ echo $'a\tb\tc' | awk -v OFS=, '{$1=$1;print}'
a,b,c
The output is now comma-separated. Again, that is because awk only reformats a line with the new OFS if it thinks that a field on the line has been changed in some way. The assignment of $1 to itself is sufficient to trigger that reformat.
Note that it is not sufficient to make a change that affects the line as a whole. For example, the following does not trigger a reformat:
$ echo $'a\tb\tc' | awk -v OFS=, '{$0=$0;print}'
a b c
It is necessary to change one or more fields of the line individually. In the following, sub operates on $0 as a whole and, consequently, no reformat is triggered:
$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW");print}'
NEW b c
In the example below, however, sub operates specifically on field $1 and hence triggers a reformat:
$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW", $1);print}'
NEW,b,c

Creating an array with awk and passing it to a second awk operation

I have a column file and I want to print all the lines that do not contain the string SOL, and to print only those lines that do contain SOL and have the 5th column <1.2 or >4.8.
The file is structured as: MOLECULENAME ATOMNAME X Y Z
Example:
151SOL OW 6554 5.160 2.323 4.956
151SOL HW1 6555 5.188 2.254 4.690 ----> as you can see, this atom is out of the threshold, but it needs to be printed
151SOL HW2 6556 5.115 2.279 5.034
What I thought of is to save a vector with all the MOLECULENAMEs that I want, and then tell awk to match all the MOLECULENAMEs saved in vector "a" against the file, and print the complete output. (If I only do the first awk, I end up with bad atom linkage near the threshold.)
The problem is that I have to pass the vector from the first awk to the second... I tried it like this with a[], but of course it doesn't work.
How can I do this?
Here is the code I have so far:
a[] = (awk 'BEGIN{i=0} $1 !~ /SOL/{a[i]=$1;i++}; /SOL/ && $5 > 4.8 {a[i]=$1;i++};/SOL/ &&$5<1.2 {a[i]=$1;i++}')
awk -v a="$a[$i]" 'BEGIN{i=0} $1 ~ $a[i] {if (NR>6540) {for (j=0;j<3;j++) {print $0}} else {print $0}
You can put all lines with the same molecule name in one row by sorting the file and then running this AWK script, which uses printf to keep printing on the same line until a different molecule name is found. Then a new line starts. The second AWK script detects which molecule names have 3 valid lines in the original file. I hope this helps you solve your problem.
sort your_file | awk 'BEGIN{ molname=""; } ( $0 !~ "SOL" || ( $0 ~ "SOL" && ( $5<1.2 || $5>4.8 ) ) ){ if($1!=molname){printf("\n");molname=$1}for(i=1;i<=NF;i++){printf("%s ",$i);}}' | awk 'NF>12 {print $0}'
awk '!/SOL/ || $5 < 1.2 || $5 > 4.8' inputfile.txt
Print (default behaviour) lines where:
"SOL" is not found
SOL is found and fifth column < 1.2
SOL is found and fifth column > 4.8
SOLVED! Thanks to all, here is how I solved it.
#!/bin/bash
file=$1
awk 'BEGIN {molecola="";i=0;j=1;}
{if ($1 !~ /SOL/) {print $0}
else if ( $1 != molecola && $1 ~ /SOL/ ) {
for (j in arr_comp) {if( arr_comp[j] < 1.2 || arr_comp[j] > 5) {for(j in arr_comp) {print arr_mol[j] };break}}
delete(arr_comp)
delete(arr_mol)
arr_mol[0]=$0
arr_comp[0]=$5
molecola=$1
j=1
}
else {arr_mol[j]=$0;arr_comp[j]=$5;j++} }' $file
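Another way to keep whole molecules together is to read the file twice with a single awk call: the first pass records which SOL molecules have at least one atom outside the range, the second prints every line of those molecules plus all non-SOL lines. This is only a sketch. It assumes the five-column layout stated in the question (MOLECULENAME ATOMNAME X Y Z), that each molecule name in column 1 is unique, and the 1.2 / 4.8 thresholds; keep_whole_molecules, col, lo and hi are made-up names, and the sample data is invented.

```shell
# keep_whole_molecules FILE: print non-SOL lines, plus every atom of any
# SOL molecule that has at least one atom with column `col` outside [lo, hi].
keep_whole_molecules() {
    awk -v col=5 -v lo=1.2 -v hi=4.8 '
    # Pass 1 (NR == FNR is true only while reading the first copy of the file):
    # remember the names of SOL molecules that have an out-of-range atom.
    NR == FNR {
        if ($1 ~ /SOL/ && ($col < lo || $col > hi)) keep[$1]
        next
    }
    # Pass 2: print non-SOL lines and all atoms of remembered molecules.
    $1 !~ /SOL/ || ($1 in keep)' "$1" "$1"
}

# Invented sample: 151SOL has atoms above 4.8, 152SOL is entirely in range.
tmp=$(mktemp)
printf '%s\n' \
    '1LIG C1 2.000 2.000 2.500' \
    '151SOL OW 5.160 2.323 4.956' \
    '151SOL HW1 5.188 2.254 4.690' \
    '151SOL HW2 5.115 2.279 5.034' \
    '152SOL OW 1.000 1.000 3.000' \
    '152SOL HW1 1.050 1.020 3.080' > "$tmp"
keep_whole_molecules "$tmp"
rm -f "$tmp"
```

Here the 1LIG line and all three 151SOL atoms are printed, including HW1 at 4.690, while 152SOL is dropped as a whole. If the real file carries an extra atom-number column, as the example lines in the question do, col would need to be adjusted.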
