How to remove lines based on duplicated value in a specific field? - shell

For example, I have this chromosome file:
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
...
I'd like to remove the third line because Region2 appeared on line 2. I would greatly appreciate any suggestion. Thank you!

Assuming you've got a tab delimiter, this should work using awk:
awk -F'\t' '!x[$4]++' file.txt
If it's not tab, just change '\t' to whatever the delimiter is; by default awk splits fields on runs of whitespace.
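For instance, if a hypothetical file.csv used commas, the same one-liner becomes:
awk -F',' '!x[$4]++' file.csv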
Here's an example showing the results:
input:
~$ cat file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
awk:
awk -F'\t' '!x[$4]++' file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
This prints a line only when its key is added to the array for the first time. It's a pretty standard deduping one-liner, just modified to key on a specific field rather than the whole line.
It works by using the 4th field as the key into an associative array and post-incrementing its value. A post-increment evaluates to the old value, so the expression is 0 the first time a key is seen and non-zero for every subsequent duplicate. Adding the ! reverses this logic: we print when the old value is 0 (the key is new) and skip the line otherwise.
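Written out long-hand, the one-liner is equivalent to this sketch:
awk -F'\t' '{
    if (x[$4] == 0)    # first time this value appears in field 4
        print $0       # print the whole line
    x[$4]++            # remember we have seen it
}' file.txt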
For example, adding a few more lines to the file:
~$ cat file.txt
Chr1 0 145 Region1
Chr1 450 500 Region2
Chr1 499 549 Region2
Chr1 499 555 Region2
Chr1 499 555 Region3
Chr1 499 556 Region3
And then changing our print to show the output we're testing:
~$ awk -F'\t' '{print x[$4]++}' file.txt
0
0
1
2
0
1
That should make it much clearer what is happening.

Related

If value of a column equals value of same column in previous line plus one, give the same code

I have some data that looks like this:
chr1 3861154 N 20
chr1 3861155 N 20
chr1 3861156 N 20
chr1 3949989 N 22
chr1 3949990 N 22
chr1 3949991 N 22
What I need to do is assign a code based on column 2: if the value equals the value of the previous line plus one, the lines come from the same series, and I need to give them the same code in a new column. That code could be the value of the first line of the series. The desired output for this example would be:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
I was thinking of using awk, but of course that's not a requirement.
Any ideas on how I could make this work?
Edit to add the code I'm working with:
awk 'BEGIN {var = $2} {if ($2 == var+1) print $0"\t"var; else print $0"\t"$2; var = $2 }' test
I think the idea is there, but it's not quite right yet. The result I'm getting is:
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861155
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949990
Thanks!
$ cat tst.awk
(NR == 1) || ($2 != (prev+1)) {
    val = $2
}
{
    print $0, val
    prev = $2
}
$ awk -f tst.awk file
chr1 3861154 N 20 3861154
chr1 3861155 N 20 3861154
chr1 3861156 N 20 3861154
chr1 3949989 N 22 3949989
chr1 3949990 N 22 3949989
chr1 3949991 N 22 3949989
The big mistake in your script was this part:
BEGIN {var = $2}
because:
$2 is the 2nd field of the current line of input.
BEGIN is executed before any input lines have been read.
So the value of $2 in the BEGIN section is zero-or-null, just like any other unset variable.
There's a second problem too: you update var on every line, so by the third line of a series var holds the previous line's value rather than the value that started the series, which is exactly the off-by-one visible in your output.
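You can see this for yourself: with only a BEGIN block, awk reads no input at all, and $2 is empty no matter what the file contains:
$ awk 'BEGIN { print "[" $2 "]" }' test
[]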

How to merge two files into a unique file based on 2 columns of each file

I have two tab-delimited files as follows:
File1:
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
File2:
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
I want to merge these two files into a single file: when columns 3 and 4 of file1 equal columns 1 and 2 of file2, keep all columns of file2 plus column 2 of file1.
The output should look like this:
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
Thank you so much,
Vahid.
I tried this awk command:
awk 'NR==FNR{a[$3,$4]=$1OFS$2;next}{$6=a[$1,$2];print}' file1.tsv file2.tsv
But it does not give me the output I am looking for; the output is a mixture of both files, like this:
chr1 10468 10470 2 100 cg00000292 0.780482425 0.78
chr1 10483 10496 4 264 0.924
chr3 10524 10524 1 47 cg00008493 0.982994402 0.936
chr1 10541 10541 1 64 0.781
chr3 10562 10588 5 510 0.941
chr1 10608 10619 3 243 0.951
chr7 10630 10794 42 5292 cg00006414 0.000000456 0.952
chr1 10810 10815 3 135 0.756
The basic idea here is to read the first file and, using each line's third and fourth columns as a key, save the second column in an array. Then, for each line in the second file, if its first two columns were seen in the first file, print that line followed by the saved second column from the first file.
$ awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
($1,$2) in seen { print $0, seen[$1,$2] }' file1.tsv file2.tsv
chr1 10468 10470 2 100 78 0.780 0.780482425
chr3 10524 10524 1 47 44 0.936 0.982994402
chr7 10630 10794 42 5292 5040 0.952 0.000000456
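As an aside, if you instead wanted to keep every line of file2, matched or not, a left-join variant would look something like this sketch (the NA placeholder is my own choice, not from the question):
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR { seen[$3,$4]=$2; next }
{ print $0, (($1,$2) in seen ? seen[$1,$2] : "NA") }' file1.tsv file2.tsv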
# I want to merge these two files in a unique file
# if both values in columns 3 and 4 of file1
# are equal to columns 1 and 2 of file2
# and to keep all columns of file2 plus column 2 of file1.
join -t$'\t' -11 -21 -o2.2,2.3,2.4,2.5,2.6,2.7,2.8,1.3 <(
  <file1 awk -vFS=$'\t' -vOFS=$'\t' '{ print $3 "," $4, $0 }' |
  sort -t$'\t' -k1,1
) <(
  <file2 awk -vFS=$'\t' -vOFS=$'\t' '{ print $1 "," $2, $0 }' |
  sort -t$'\t' -k1,1
)
First preprocess each file and prepend a join key built from the fields you want to join on; the "," separator keeps keys like chr1/10468 and chr11/0468 from colliding.
Then sort on that key and join.
Finally, tell join the output format with -o.
Tested on repl against:
# recreate input files
tr -s ' ' <<EOF | tr ' ' '\t' >file1
cg00000292 0.780482425 chr1 10468 10470
cg00002426 0.914482257 chr3 57757816 57757817
cg00003994 0.017355388 chr1 15686237 15686238
cg00005847 0.065539061 chr1 176164345 176164346
cg00006414 0.000000456 chr7 10630 10794
cg00007981 0.018839033 chr11 94129428 94129429
cg00008493 0.982994402 chr3 10524 10524
cg00008713 0.018604172 chr18 11980954 11980955
cg00009407 0.002403351 chr3 88824577 88824578
EOF
tr -s ' ' <<EOF | tr ' ' '\t' >file2
chr1 10468 10470 2 100 78 0.780
chr1 10483 10496 4 264 244 0.924
chr3 10524 10524 1 47 44 0.936
chr1 10541 10541 1 64 50 0.781
chr3 10562 10588 5 510 480 0.941
chr1 10608 10619 3 243 231 0.951
chr7 10630 10794 42 5292 5040 0.952
chr1 10810 10815 3 135 102 0.756
EOF

Difference of column value received from operation done on 2 different files

I have the following two files, file1.txt and file2.txt, with the data given below.
Data in file1.txt
125
125
295
295
355
355
355
Data in file2.txt
125
125
295
355
I performed the operations below on the files and got the following output.
Operation1:
sort file1.txt | uniq -c
2 125
2 295
3 355
Operation2:
sort file2.txt | uniq -c
2 125
1 295
1 355
Now I want to compare the results of Operation1 and Operation2 and produce output that shows the difference of the values in column 1 of both results, with column 2 kept as is, like this:
0 125
1 295
2 355
Redirect the output of Operation1 and Operation2 to files, say file1 and file2, then write:
paste file1 file2 | awk '{print $1-$3,$2}'
You will get this output:
0 125
1 295
2 355
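If you would rather not create intermediate files, bash process substitution lets you feed both operations straight to paste (assuming bash is available):
paste <(sort file1.txt | uniq -c) <(sort file2.txt | uniq -c) | awk '{print $1-$3,$2}'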

awk trying to use output as input and erroring

In the awk below I am trying to combine all matching $4 into a single $5 (up to the -), and average all values in $7. Why is awk complaining about the output file not being found (that is, /home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/${pref}_genes.txt)? Thank you :).
input (/home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/*30reads_perbase.txt):
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27
desired output
chr1:955543-955763 4 AGRN 15
chr1:976035-976270 3 AGRN 27
awk
for f in /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/*30reads_perbase.txt ; do bname=`basename "$f"`; pref=${bname%%.txt}; awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
END{for(k in a)
split(k,ks,FS);
print ks[1],c[k],ks[2],a[k]/c[k]}' "$f" > /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/"${pref}"_genes.txt; done
current output
chr1:976035-976270 3 AGRN 27.3333
Use the substr and match functions when printing the variables. Also note the braces around the body of the for loop: without them only the split belongs to the loop, and the print runs once after it with whatever key came last, which is why your current output shows a single line. substr should also start at position 1, since awk strings are 1-indexed:
awk '{k=$4 FS $5; a[k]+=$7; c[k]++} END{for(k in a){split(k,ks,FS); print ks[1],c[k],substr(ks[2],1,match(ks[2],"-")-1),a[k]/c[k]}}' file
chr1:955543-955763 4 AGRN 15.25
chr1:976035-976270 3 AGRN 27.3333
(The order of for (k in a) is unspecified, so the two lines may come out in either order.)
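Spread out for readability, the same logic looks like this sketch:
awk '{
    k = $4 FS $5                 # key: interval plus gene annotation
    a[k] += $7                   # running sum of column 7
    c[k]++                       # number of lines for this key
}
END {
    for (k in a) {
        split(k, ks, FS)
        # keep the gene name up to (not including) the "-"
        gene = substr(ks[2], 1, match(ks[2], "-") - 1)
        print ks[1], c[k], gene, a[k] / c[k]
    }
}' file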

filtering fields based on certain values

I wish you all a very happy New Year.
I have a file that looks like this (example). There is no header, and the file has about 10000 such rows:
123 345 676 58 1
464 222 0 0 1
555 22 888 555 1
777 333 676 0 1
555 444 0 58 1
PROBLEM: I only want those rows where both fields 3 and 4 have a non-zero value, i.e. in the above example rows 1 and 3 should be included and the rest excluded. How can I do this?
The output should look like this:
123 345 676 58 1
555 22 888 555 1
Thanks.
awk is perfect for this kind of stuff:
awk '$3 && $4' input.txt
This will give you the output that you want.
$3 && $4 is a filter: $3 is the value of the 3rd field, $4 the value of the fourth. Zero values evaluate as false; anything else evaluates as true. If there can be negative values, then you need to be more precise:
awk '$3 > 0 && $4 > 0' input.txt
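To see the difference, take a hypothetical row with a negative value in field 3: the truthiness filter keeps it, while the explicit comparison drops it.
$ printf '777 333 -5 58 1\n' | awk '$3 && $4'
777 333 -5 58 1
$ printf '777 333 -5 58 1\n' | awk '$3 > 0 && $4 > 0'
$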
