How to exclude lines in a file based on a range of values taken from a second file - bash

I have a file with a list of value ranges:
2 4
6 9
13 14
and a second file that looks like this:
HiC_scaffold_1 1 26
HiC_scaffold_1 2 27
HiC_scaffold_1 3 27
HiC_scaffold_1 4 31
HiC_scaffold_1 5 34
HiC_scaffold_1 6 35
HiC_scaffold_1 7 37
HiC_scaffold_1 8 37
HiC_scaffold_1 9 38
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 13 39
HiC_scaffold_1 14 39
HiC_scaffold_1 15 42
and I would like to exclude rows from file 2 where the value of column 2 falls within a range defined by file 1. The ideal output would be:
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
I know how to extract a single range with awk:
awk '$2 == "2", $2 == "4"' file2.txt
but my file 1 has many many range values (lines) and I need to exclude rather than extract the rows that correspond to these values.

This is one way:
$ awk '
NR==FNR { # first file
min[NR]=$1 # store mins and maxes in pairs
max[NR]=$2
next
}
{ # second file
for(i in min)
if($2>=min[i]&&$2<=max[i])
next
}1' ranges data
Output:
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
If the ranges are not huge and integer valued but the data is huge, you could make an exclude map of the values to speed up comparing:
$ awk '
NR==FNR { # ranges file
for(i=$1;i<=$2;ex[i++]); # each value in the range goes to exclude hash
next
}
!($2 in ex)' ranges data # print if not found in ex hash

If your ranges aren't huge:
$ cat tst.awk
NR==FNR {
for (i=$1; i<=$2; i++) {
bad[i]
}
next
}
!($2 in bad)
$ awk -f tst.awk file1 file2
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
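As a quick sanity check, the pairwise loop and the exclude hash shown above can be run side by side on data shaped like the question's sample. This is only a sketch: the function names exclude_by_loop and exclude_by_hash, the temporary file names, and the third column (n + 25) are made up for illustration, and the hash variant assumes integer ranges of modest size.

```shell
#!/bin/sh
# Hypothetical helper: compare column 2 against every stored (min, max) pair.
exclude_by_loop() {
    awk '
    NR == FNR { min[NR] = $1; max[NR] = $2; next }   # first file: ranges
    {
        for (i in min)
            if ($2 >= min[i] && $2 <= max[i]) next   # inside a range: skip
        print
    }' "$1" "$2"
}

# Hypothetical helper: expand each integer range into a lookup hash.
exclude_by_hash() {
    awk '
    NR == FNR { for (i = $1; i <= $2; i++) bad[i]; next }
    !($2 in bad)' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '2 4' '6 9' '13 14' > "$tmp/ranges"
# Placeholder data: 15 rows, column 2 counts 1..15, column 3 is arbitrary.
awk 'BEGIN { for (n = 1; n <= 15; n++) print "HiC_scaffold_1", n, n + 25 }' > "$tmp/data"

exclude_by_loop "$tmp/ranges" "$tmp/data"
exclude_by_hash "$tmp/ranges" "$tmp/data"
```

Both calls should print the rows whose second column is 1, 5, 10, 11, 12 and 15. The loop costs one comparison per range for every data row, while the hash costs one lookup per row after paying for the expansion up front.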

sedception
If the second column of file2.txt always equals the line number of its own row, you can use sed to prune the lines. If that is not the case, see the awkception paragraph below.
sed $(sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' file1.txt) file2.txt
Where file1.txt contains your ranges and file2.txt is the data itself.
Basically it constructs a sed call that chains a list of -e i,jd expressions, each of which deletes lines i through j.
In your example sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' file1.txt would output -e 2,4d, -e 6,9d and -e 13,14d (one per line); because the $(...) is unquoted, the shell splits them into the list of expressions passed to sed for file2.txt.
In the end it will call:
sed -e 2,4d -e 6,9d -e 13,14d file2.txt
This command deletes all lines between the 2nd and the 4th, and all lines between the 6th and the 9th, and all lines between the 13th and the 14th.
Obviously it does not work if the second column of file2.txt does not match the index of its own line.
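To see the two stages separately, here is a small sketch that first prints the generated expressions and then feeds them to sed. The helper name build_sed_args and the generated sample data are assumptions for illustration only; the sed expression itself is the one from the answer.

```shell
#!/bin/sh
# Hypothetical helper: turn "start end" pairs into sed delete expressions.
build_sed_args() {
    sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' "$1"
}

tmp=$(mktemp -d)
printf '%s\n' '2 4' '6 9' '13 14' > "$tmp/file1.txt"
# Placeholder data where column 2 equals the line number.
awk 'BEGIN { for (n = 1; n <= 15; n++) print "HiC_scaffold_1", n, n + 25 }' > "$tmp/file2.txt"

# Stage 1: one expression per line.
build_sed_args "$tmp/file1.txt"

# Stage 2: left unquoted on purpose so the shell splits it into separate arguments.
sed $(build_sed_args "$tmp/file1.txt") "$tmp/file2.txt"
```

Quoting the $(...) would hand sed a single malformed argument, so the word splitting is part of the trick. That also means it only works while file1.txt contains nothing but digits and whitespace.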
awkception
awk "{$(awk '{printf "if ($2>=%d && $2<=%d) next\n", $1, $2}' file1.txt)}1" file2.txt
This solution works even if the second column does not match the index of its line.
The method uses awk to create an awk program, just like sed created sed expressions in the sedception solution.
In the end this will call:
awk '{
if ($2>=2 && $2<=4) next
if ($2>=6 && $2<=9) next
if ($2>=13 && $2<=14) next
}1' file2.txt
It should be noted that this solution is significantly slower than sed.

Related

Insert rows using awk

How can I insert a row using awk?
My file looks as:
1 43
2 34
3 65
4 75
I would like to insert three rows with "?" so that my desired file looks like this:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I am trying with the below script.
awk '{if(NR<=3){print "NR ?"}} {printf" " NR $2}' file.txt
Here's one way to do it:
$ awk 'BEGIN{s=" "; for(c=1; c<4; c++) print c s "?"}
{print c s $2; c++}' ip.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
$ awk 'BEGIN {printf "1 ?\n2 ?\n3 ?\n"} {printf "%d", $1 + 3; printf " %s\n", $2}' file.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
You could also add the 3 lines before awk, e.g.:
{ seq 3; cat file.txt; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t'
Output:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I would do it the following way using GNU AWK. Let the content of file.txt be
1 43
2 34
3 65
4 75
then
awk 'BEGIN{OFS=" "}NR==1{print 1,"?";print 2,"?";print 3,"?"}{print NR+3,$2}' file.txt
output
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
Explanation: I set the output field separator (OFS) to the desired spacing (shown here as a single space). On the 1st row I print three lines, each consisting of a consecutive number and ? separated by the output field separator. You might elect to do this with a for loop, especially if you expect the requirement to change. For every line I print the row number plus 3 (to keep the order) and the 2nd column ($2). Thanks to the use of OFS, you would need to make only one change if the required number of spaces is altered. Note that a construct like
{if(condition){dosomething}}
might be written in GNU AWK in more concise manner as
(condition){dosomething}
(tested in gawk 4.2.1)
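The for-loop variant mentioned in the explanation could look roughly like this. It is a sketch, not part of the original answer: the function name insert_rows and the variable n (the number of placeholder rows) are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: prepend n placeholder rows and renumber the rest.
insert_rows() {
    awk -v n="$1" '
    BEGIN { for (i = 1; i <= n; i++) print i, "?" }   # placeholder rows
    { print NR + n, $2 }                              # shift the original rows by n
    ' "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '1 43' '2 34' '3 65' '4 75' > "$tmp/file.txt"
insert_rows 3 "$tmp/file.txt"
```

With n passed in via -v, changing the number of inserted rows means changing one argument instead of editing the program.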

extracting lines if the first field matches another list saved in a different file -- shell command

I have two files. One contains a list of items, e.g.,
Allie
Bob
John
Laurie
Another file (file2) contains a different list of items in a different order, but some items might overlap with the items in file 1, e.g.,
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23
I want to intersect these two files and extract only those lines from file 2 whose first field matches an item in file 1.
i.e., my output should be
Allie 45 56 67 23
Laurie 45 56 6 75
(preferably in this order, but it's OK if not)
grep -f file1 file2 doesn't do what I want.
I also need something efficient because the second file is HUGE.
I also tried this:
awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' file2 file1
If order doesn't matter then
awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Explanation
FNR==NR{ arr[$1]; next } While reading the first file (file1), store each first field $1 as a key of the array arr.
$1 in arr While reading the second file (file2), test whether its first column exists as a key in arr (built from the first file); if it does, print the current record/row/line from file2.
Test Results:
akshay@db-3325:/tmp$ cat file1
Allie
Bob
John
Laurie
akshay@db-3325:/tmp$ cat file2
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23
akshay@db-3325:/tmp$ awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23
No need for complex joins, it is a filtering function
$ grep -wFf file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23
This has the benefit of keeping the order in file2 as well. The -w option requires full-word matches, which eliminates false positives from substring matches. Of course, if your sample input is not representative and your data may contain key-like entries in other fields, this will not work without anchoring the match to the beginning of the line.
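The caveat about keys showing up in other fields is easy to reproduce. In the sketch below the Moxipen row has been altered (a made-up variation of the sample data) so that "Bob" appears in its third field; the helper name match_first_field is also just an assumption for illustration.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' Allie Bob John Laurie > "$tmp/file1"
# Made-up variation of the sample: "Bob" now appears in a non-key field.
printf '%s\n' 'Laurie 45 56 6 75' 'Moxipen 10 Bob 56 56' 'Allie 45 56 67 23' > "$tmp/file2"

# grep matches the word anywhere on the line, so the Moxipen row slips through.
grep -wFf "$tmp/file1" "$tmp/file2"

# Hypothetical helper: compare only the first field against the key list.
match_first_field() {
    awk 'FNR == NR { keys[$1]; next } $1 in keys' "$1" "$2"
}
match_first_field "$tmp/file1" "$tmp/file2"
```

Here grep prints all three rows, while the first-field comparison prints only the Laurie and Allie rows.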
This is the job that join is built for.
Providing a reproducer testable via copy-and-paste with shell functions (which you could replace with your actual input files):
cat_file1() {
printf '%s\n' Allie Bob John Laurie
}
cat_file2() {
printf '%s\n' 'Laurie 45 56 6 75' \
'Moxipen 10 45 56 56' \
'Allie 45 56 67 23'
}
join <(cat_file1 | sort) <(cat_file2 | sort)
...properly emits:
Allie 45 56 67 23
Laurie 45 56 6 75
Of course, don't cat file1 | sort -- run sort <file1 to provide a real handle for better efficiency, or (better!) store your inputs in sorted form in the first place.

how to print a specific column but ignoring first 10 lines and last 10 lines in unix shell

I want to print 2nd column but i don't want first 10 and last 10 lines.
awk 'NR>10' filename.txt | awk '{ print $2 }'| head --lines=-10
It didn't work for me
What you want is:
tail -n+11 filename.txt | head -n-10 | awk '{print $2}'
Input
$cat lines_1-40.txt
line 1 in the file
line 2 in the file
line 3 in the file
line 4 in the file
...
line 38 in the file
line 39 in the file
line 40 in the file
Output
$ tail -n+11 lines_1-40.txt | head -n-10 | awk '{print $2}'
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
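A negative count for head -n is not accepted by every head implementation, so here is an awk-only sketch that does the same job with a small ring buffer. The function name middle_column2 is made up, and the code assumes n is at least 1.

```shell
#!/bin/sh
# Hypothetical helper: print column 2, skipping the first and last n lines.
middle_column2() {
    awk -v n="$1" '
    NR > n {
        slot = NR % n
        if (NR > 2 * n) print buf[slot]   # value stored n lines ago is now safe to print
        buf[slot] = $2                    # hold this value until n more lines arrive
    }' "$2"
}

tmp=$(mktemp -d)
# Same shape as the sample input: "line N in the file" for N = 1..40.
awk 'BEGIN { for (n = 1; n <= 40; n++) print "line", n, "in the file" }' > "$tmp/lines_1-40.txt"
middle_column2 10 "$tmp/lines_1-40.txt"
```

Each value is printed only once n further lines have been read, so the last n lines never make it out of the buffer. The file is read in a single pass and only n values are kept in memory.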

Collapse sequential numbers to ranges in bash

I am trying to collapse sequential numbers to ranges in bash. For example, if my input file is
1
2
3
4
15
16
17
18
22
23
45
46
47
I want the output as:
1 4
15 18
22 23
45 47
How can I do this with awk or sed in a single line command?
Thanks for any help!
$ awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' file
1 4
15 18
22 23
45 47
Explanation
NR==1{first=$1;last=$1;next}
On the first line, initialize the variables first and last and skip to next line.
$1 == last+1 {last=$1;next}
If this line continues in the sequence from the last, update last and jump to the next line.
print first,last;first=$1;last=first
If we get here, we have a break in the sequence. Print out the range for the last sequence and reinitialize the variables for a new sequence.
END{print first,last}
After we get to the end of the file, print the final sequence.
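One edge case the explanation does not cover: on empty input the END block still runs and prints a blank pair. A possible guard, sketched here with a made-up function name collapse_ranges that reads from standard input, is to check NR before printing.

```shell
#!/bin/sh
# Hypothetical helper: collapse consecutive integers on stdin into "first last" pairs.
collapse_ranges() {
    awk '
    NR == 1        { first = last = $1; next }        # start the first run
    $1 == last + 1 { last = $1; next }                # current run continues
    { print first, last; first = last = $1 }          # gap: emit the run, start a new one
    END { if (NR) print first, last }                 # emit the final run unless input was empty
    '
}

printf '%s\n' 1 2 3 4 15 16 17 18 22 23 45 46 47 | collapse_ranges
```

Like the original one-liner, this assumes the input is already sorted and contains one integer per line.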

Removing multiple block of lines of a text file in bash

Assume a text file with 40 lines of data. How can I remove lines 3 to 10, 13 to 20, 23 to 30, 33 to 40, in place using bash script?
I already know how to remove lines 3 to 10 with sed, but I wonder if there is a way to do all the removing, in place, with only one command line. I can use for loop but the problem is that with each iteration of loop the lines number will be changed and it needs some additional calculation of line numbers to be removed.
Here is an awk one-liner that works whether your file has 40 lines or 40k lines:
awk 'NR~/[12]$/' file
for example, with 50 lines:
kent$ seq 50|awk 'NR~/[12]$/'
1
2
11
12
21
22
31
32
41
42
sed -i '3,10d;13,20d;23,30d;33,40d' file
This might work for you (GNU sed):
sed '3~10,+7d' file
Starting at line 3 and at every 10th line after that (3, 13, 23, ...), this deletes that line and the 7 lines that follow it.
If the file was longer than 40 lines and you were only interested in the first 40 lines:
sed '41,$b;3~10,+7d' file
The first instruction tells sed to ignore lines 41 to end-of-file.
Could also be written:
sed '1,40{3~10,+7d}' file
@Kent's answer is the way to go for this particular case, but in general:
$ seq 50 | awk '{idx=(NR%10)} idx>=1 && idx<=2'
1
2
11
12
21
22
31
32
41
42
The above will work even if you want to select the 4th through 7th lines out of every 13, for example:
$ seq 50 | awk '{idx=(NR%13)} idx>=4 && idx<=7'
4
5
6
7
17
18
19
20
30
31
32
33
43
44
45
46
It's not constrained to N out of 10.
Or to select just the 3rd, 5th and 6th lines out of every 13:
$ seq 50 | awk 'BEGIN{split("3 5 6",tmp); for (i in tmp) tgt[tmp[i]]=1} tgt[NR%13]'
3
5
6
16
18
19
29
31
32
42
44
45
The point is - selecting ranges of lines is a job for awk, definitely not sed.
awk '{m=NR%10} !(m==0 || m>=3)' file > tmp && mv tmp file
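The same idea can be parameterised so the block size and the deleted range are not hard-coded. This is a sketch under assumptions: the function name delete_periodic and its argument order (period, lo, hi, file) are invented here, and positions are counted from 1 within each block.

```shell
#!/bin/sh
# Hypothetical helper: drop lines lo..hi (1-based) out of every block of period lines.
delete_periodic() {
    awk -v period="$1" -v lo="$2" -v hi="$3" '
    { pos = (NR - 1) % period + 1 }   # position within the current block, 1..period
    pos < lo || pos > hi              # keep only lines outside lo..hi
    ' "$4"
}

tmp=$(mktemp -d)
awk 'BEGIN { for (n = 1; n <= 40; n++) print n }' > "$tmp/file"

# Remove lines 3-10, 13-20, 23-30 and 33-40, then replace the original file.
delete_periodic 10 3 10 "$tmp/file" > "$tmp/file.new" && mv "$tmp/file.new" "$tmp/file"
cat "$tmp/file"
```

Writing to a temporary file and moving it over the original, as in the line above, gives the in-place behaviour with any awk.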
