I am trying to collapse sequential numbers to ranges in bash. For example, if my input file is
1
2
3
4
15
16
17
18
22
23
45
46
47
I want the output as:
1 4
15 18
22 23
45 47
How can I do this with awk or sed in a single line command?
Thanks for any help!
$ awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' file
1 4
15 18
22 23
45 47
Explanation
NR==1{first=$1;last=$1;next}
On the first line, initialize the variables first and last and skip to next line.
$1 == last+1 {last=$1;next}
If this line continues in the sequence from the last, update last and jump to the next line.
print first,last;first=$1;last=first
If we get here, we have a break in the sequence. Print out the range for the last sequence and reinitialize the variables for a new sequence.
END{print first,last}
After we get to the end of the file, print the final sequence.
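For readability, the same logic can be laid out as a multi-line program. This is only a sketch: the function name collapse_ranges is made up for illustration, and the `if (NR)` guard in END is an added assumption so that empty input prints nothing rather than an empty range.

```shell
# collapse_ranges is a hypothetical helper name, not part of the answer above.
collapse_ranges() {
  awk '
    NR == 1        { first = last = $1; next }     # start the first range
    $1 == last + 1 { last = $1; next }             # still consecutive: extend
                   { print first, last             # gap: emit the finished range
                     first = last = $1 }           # and start a new one
    END            { if (NR) print first, last }   # assumption: skip empty input
  ' "$@"
}

out=$(printf '%s\n' 1 2 3 4 15 16 17 18 22 23 45 46 47 | collapse_ranges)
printf '%s\n' "$out"
```

With the question's input this should print the same four ranges as the one-liner.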
Related
I have a file with a list of value ranges:
2 4
6 9
13 14
and a second file that looks like this:
HiC_scaffold_1 1 26
HiC_scaffold_1 2 27
HiC_scaffold_1 3 27
HiC_scaffold_1 4 31
HiC_scaffold_1 5 34
HiC_scaffold_1 6 35
HiC_scaffold_1 7 37
HiC_scaffold_1 8 37
HiC_scaffold_1 9 38
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 13 39
HiC_scaffold_1 14 39
HiC_scaffold_1 15 42
and I would like to exclude rows from file 2 where the value of column 2 falls within a range defined by file 1. The ideal output would be:
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
I know how to extract a single range with awk:
awk '$2 == "2", $2 == "4"' file2.txt
but my file 1 has many many range values (lines) and I need to exclude rather than extract the rows that correspond to these values.
This is one way:
$ awk '
NR==FNR { # first file
min[NR]=$1 # store mins and maxes in pairs
max[NR]=$2
next
}
{ # second file
for(i in min)
if($2>=min[i]&&$2<=max[i])
next
}1' ranges data
Output:
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
If the ranges are not huge and integer valued but the data is huge, you could make an exclude map of the values to speed up comparing:
$ awk '
NR==FNR { # ranges file
for(i=$1;i<=$2;ex[i++]); # each value in the range goes to exclude hash
next
}
!($2 in ex)' ranges data # print if not found in ex hash
If your ranges aren't huge:
$ cat tst.awk
NR==FNR {
for (i=$1; i<=$2; i++) {
bad[i]
}
next
}
!($2 in bad)
$ awk -f tst.awk file1 file2
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
sedception
If the second column of file2.txt always equals its line number, you can use sed to prune the lines. If that is not the case, please refer to the awkception section.
sed $(sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' file1.txt) file2.txt
Where file1.txt contains your ranges and file2.txt is the data itself.
Basically it constructs a sed call that chains a list of -e i,jd expressions, meaning that it will delete lines between the ith line and the jth line.
In your example sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' file1.txt would output -e 2,4d -e 6,9d -e 13,14d which is the list of expressions for calling sed on file2.txt.
In the end it will call:
sed -e 2,4d -e 6,9d -e 13,14d file2.txt
This command deletes all lines between the 2nd and the 4th, and all lines between the 6th and the 9th, and all lines between the 13th and the 14th.
Obviously it does not work if the second column of file2.txt does not match the index of its own line.
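A small way to see the two stages separately, as a sketch: the ranges are the ones from the question, and `seq 15` stands in for file2.txt (an assumption made here so that each value equals its line number, which is exactly what this trick requires).

```shell
# Stage 1: turn each "M N" range into a "-e M,Nd" sed expression.
exprs=$(printf '%s\n' '2 4' '6 9' '13 14' |
        sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/')
printf '%s\n' "$exprs"

# Stage 2: $exprs is deliberately left unquoted so the shell splits it
# into separate arguments for the outer sed.
kept=$(seq 15 | sed $exprs)
printf '%s\n' "$kept"
```

The second printf should list 1, 5, 10, 11, 12 and 15, matching the second column of the desired output.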
awkception
awk "{$(awk '{printf "if ($2>=%d && $2<=%d) next\n", $1, $2}' file1.txt)}1" file2.txt
This solution works even if the second column does not match the index of its line.
The method uses awk to create an awk program, just like sed created sed expressions in the sedception solution.
In the end this will call:
awk '{
if ($2>=2 && $2<=4) next
if ($2>=6 && $2<=9) next
if ($2>=13 && $2<=14) next
}1' file2.txt
It should be noted that this solution is significantly slower than sed.
I have a tabulated file something like that
Q8VYA50 210 69 2 8 3
Q8VYA50 208 69 1 2 8 3
Q9C8G30 316 182 4 4 7
P335430 657 98 1 10 7
What I would like to do is apply a cumulative sum from the 4th column up to NF, printing in every column the running sum up to that column while keeping the original values of the earlier columns. The desired output would be
Q8VYA50 210 69 2 10 13
Q8VYA50 208 69 1 3 11 14
Q9C8G30 316 182 4 8 15
P335430 657 98 1 11 18
I have tried to do it in different ways with a sum inside an awk script, including a for loop over the fields where the cumulative sum must be applied. However, the result obtained is wrong.
Is there some way to do it correctly in Unix (Bash)? Thanks in advance!
This is one way I have tried, @Inian:
gawk 'BEGIN {FS=OFS="\t"} {
for (i=4;i<=NF;i++)
{
sum[i]+=$i; print $1,$2,$3,$i
}
}' "input_file"
Another way is to write out every column manually: $4,$5+$4,$6+$5+$4,$7+$6+$5+$4 and so on, but I think that is a "seedy" method.
The following awk may help you here.
awk '{for(i=5;i<=NF;i++){$i+=$(i-1)}} 1' OFS="\t" Input_file
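The loop starts at field 5 because field 4 is already its own running total. Each later field then has the previous field added to it, and that previous field already holds the accumulated sum. A minimal sketch on two of the question's rows follows. The function name cumsum_from_col4 is hypothetical, and OFS is passed with -v rather than as a trailing assignment, which is a choice made here and not taken from the answer.

```shell
# cumsum_from_col4 is a hypothetical wrapper around the one-liner above.
cumsum_from_col4() {
  awk -v OFS='\t' '{ for (i = 5; i <= NF; i++) $i += $(i - 1) } 1' "$@"
}

out=$(printf 'Q8VYA50\t210\t69\t2\t8\t3\nQ8VYA50\t208\t69\t1\t2\t8\t3\n' |
      cumsum_from_col4)
printf '%s\n' "$out"
```

Assigning to a field makes awk rebuild the line with OFS, which is why the output stays tab-separated.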
So I'm trying to filter 'duplicate' results from a file.
I've a file that looks like:
7 14 35 35 4 23
23 53 85 27 49 1
35 4 23 27 49 1
....
that I mentally can divide up into item 1 and item 2. Item 1 is the first 3 numbers on each line and item 2 is the last 3 numbers on each line.
I've also got a list of 'items':
7 14 35
23 53 85
35 4 23
27 49 1
...
At a certain point in the file, let's say line number 3 (this number is arbitrary, just an example), the 'items' can be separated. Let's say lines 1 and 2 are red and lines 3 and 4 are blue.
I want to make sure on my original file that there are no red red or blue blues - only red blue or blue red, while retaining the original numbers.
So ideally the file would go from:
7 14 35 35 4 23 (red blue)
23 53 85 27 49 1 (red blue)
35 4 23 27 49 1 (blue blue)
....
to
7 14 35 35 4 23 (red blue)
23 53 85 27 49 1 (red blue)
....
I'm having trouble thinking of a good (or any) way to do it.
Any help is appreciated.
EDIT:
A filtering script I have that grabs lines if they contain blue or red items:
#!/bin/bash
while read name; do
grep "$name" Twoitems
done < Itemblue > filtered
while read name2; do
grep "$name2" filtered
done < Itemred > double_filtered
EDIT2:
Example input and item files:
This is pretty easy using grep with option -f.
First of all, generate four 'pattern' files out of your items file.
I am using AWK here, but you might as well use Perl or what not.
Following your example, I put the 'split' between line 2 and 3; please adjust when necessary.
awk 'NR <= 2 {print "^" $0 " "}' items.txt > starts_red.txt
awk 'NR <= 2 {print " " $0 "$"}' items.txt > ends_red.txt
awk 'NR >= 3 {print "^" $0 " "}' items.txt > starts_blue.txt
awk 'NR >= 3 {print " " $0 "$"}' items.txt > ends_blue.txt
Next, use a grep pipeline using the pattern files (option -f) to filter the appropriate lines from the input file.
grep -f starts_red.txt input.txt | grep -f ends_blue.txt > red_blue.txt
grep -f starts_blue.txt input.txt | grep -f ends_red.txt > blue_red.txt
Finally, concatenate the two output files.
Of course, you might as well use >> to let the second grep pipeline append its output to the output of the first.
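Put together on the question's sample data, the whole pipeline including the final concatenation might look like the sketch below. The scratch directory and the name result.txt are assumptions for this demo. The `|| true` is there only because grep exits with status 1 when nothing matches, which is not an error here.

```shell
work=$(mktemp -d)     # scratch directory, an assumption for this demo
printf '%s\n' '7 14 35' '23 53 85' '35 4 23' '27 49 1' > "$work/items.txt"
printf '%s\n' '7 14 35 35 4 23' '23 53 85 27 49 1' '35 4 23 27 49 1' > "$work/input.txt"

# Pattern files, split between lines 2 and 3 as in the answer above.
awk 'NR <= 2 {print "^" $0 " "}' "$work/items.txt" > "$work/starts_red.txt"
awk 'NR <= 2 {print " " $0 "$"}' "$work/items.txt" > "$work/ends_red.txt"
awk 'NR >= 3 {print "^" $0 " "}' "$work/items.txt" > "$work/starts_blue.txt"
awk 'NR >= 3 {print " " $0 "$"}' "$work/items.txt" > "$work/ends_blue.txt"

# The second pipeline appends with >>, so no separate cat step is needed.
grep -f "$work/starts_red.txt"  "$work/input.txt" | grep -f "$work/ends_blue.txt" >  "$work/result.txt" || true
grep -f "$work/starts_blue.txt" "$work/input.txt" | grep -f "$work/ends_red.txt"  >> "$work/result.txt" || true

result=$(cat "$work/result.txt")
rm -r "$work"
printf '%s\n' "$result"
```

Only the two red/blue lines should remain. Note that the output is grouped by colour order (red-blue first, then blue-red) and not in the original line order.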
Let's say file1 contents
7 14 35 35 4 23
23 53 85 27 49 1
35 4 23 27 49 1
and file2 contents are
7 14 35
23 53 85
35 4 23
27 49 1
Then, you can use a hash to map line-nos to colors based on your cutoff and using that hash, compare lines in first file for the existence of different colors after splitting on third space of each line.
I suppose you want something like the script below. Feel free to modify it according to your requirements.
#!/usr/bin/perl
use strict;
use warnings;
#declare a global hash to keep track of line and colors
my %color;
#open both the files
open my $fh1, '<', 'file1' or die "unable to open file1: $! \n";
open my $fh2, '<', 'file2' or die "unable to open file2: $! \n";
#iterate over the second file and store the lines as
#red or blue in hash based on line nos
while(<$fh2>){
chomp;
if($. <= 2){
$color{$_}="red";
}
else{
$color{$_}="blue";
}
}
#close second file
close($fh2);
#iterate over first file
while(<$fh1>){
chomp;
#split the line on 3rd space
my ($part1,$part2)=split /(?:\d+\s){3}\K/;
#remove trailing spaces present
$part1=~s/\s+$//;
#print if $part1 and $part2 do not belong to the same color
print "$_\n" if($color{$part1} ne $color{$part2});
}
#close first file
close($fh1);
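The same hash idea can also be sketched in awk, for readers who prefer to stay with the tools used elsewhere in this thread. This is not the answer's script: the function name filter_mixed and its argument order are hypothetical, and the sketch assumes every item is exactly three space-separated numbers.

```shell
# filter_mixed is a hypothetical name; usage: filter_mixed CUTOFF ITEMS PAIRS
filter_mixed() {
  awk -v cutoff="$1" '
    NR == FNR {                      # first file: the item list
      color[$1 " " $2 " " $3] = (FNR <= cutoff) ? "red" : "blue"
      next
    }
    {                                # second file: two items per line
      a = $1 " " $2 " " $3
      b = $4 " " $5 " " $6
      # keep the line only when both items are known and differ in colour
      if ((a in color) && (b in color) && color[a] != color[b]) print
    }
  ' "$2" "$3"
}

work=$(mktemp -d)                    # scratch directory for the demo
printf '%s\n' '7 14 35' '23 53 85' '35 4 23' '27 49 1' > "$work/file2"
printf '%s\n' '7 14 35 35 4 23' '23 53 85 27 49 1' '35 4 23 27 49 1' > "$work/file1"
mixed=$(filter_mixed 2 "$work/file2" "$work/file1")
rm -r "$work"
printf '%s\n' "$mixed"
```

Unlike the Perl version, this variant silently drops lines whose items are missing from the item list. That behaviour is a design choice of the sketch.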
Assume a text file with 40 lines of data. How can I remove lines 3 to 10, 13 to 20, 23 to 30 and 33 to 40, in place, using a bash script?
I already know how to remove lines 3 to 10 with sed, but I wonder if there is a way to do all the removing, in place, with only one command line. I could use a for loop, but the problem is that the line numbers change with each iteration, so the line numbers to remove would need additional calculation.
Here is an awk one-liner that works whether your file has 40 lines or 40k lines:
awk 'NR~/[12]$/' file
for example, with 50 lines:
kent$ seq 50|awk 'NR~/[12]$/'
1
2
11
12
21
22
31
32
41
42
sed -i '3,10d;13,20d;23,30d;33,40d' file
This might work for you (GNU sed):
sed '3~10,+7d' file
Starting at line 3 and at every 10th line after it (13, 23, 33, ...), this deletes that line and the 7 lines that follow it.
If the file was longer than 40 lines and you were only interested in the first 40 lines:
sed '41,$b;3~10,+7d' file
The first instruction tells sed to ignore lines 41 to end-of-file.
Could also be written:
sed '1,40{3~10,+7d}' file
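One way to convince yourself that the step address does what is claimed is to compare it with the explicit ranges on a 40-line input. This sketch assumes GNU sed, since both `first~step` and `addr,+N` are GNU extensions, and uses `seq 40` in place of a real file.

```shell
explicit=$(seq 40 | sed '3,10d;13,20d;23,30d;33,40d')   # explicit ranges
stepped=$(seq 40 | sed '3~10,+7d')                      # GNU sed extensions

if [ "$explicit" = "$stepped" ]; then
  echo "same result"
fi
echo $stepped        # unquoted on purpose: prints the kept lines on one row
```

On a GNU system both commands should keep lines 1, 2, 11, 12, 21, 22, 31 and 32.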
@Kent's answer is the way to go for this particular case, but in general:
$ seq 50 | awk '{idx=(NR%10)} idx>=1 && idx<=2'
1
2
11
12
21
22
31
32
41
42
The above will work even if you want to select the 4th through 7th lines out of every 13, for example:
$ seq 50 | awk '{idx=(NR%13)} idx>=4 && idx<=7'
4
5
6
7
17
18
19
20
30
31
32
33
43
44
45
46
It's not constrained to N out of 10.
Or to select just the 3rd, 5th and 6th lines out of every 13:
$ seq 50 | awk 'BEGIN{split("3 5 6",tmp); for (i in tmp) tgt[tmp[i]]=1} tgt[NR%13]'
3
5
6
16
18
19
29
31
32
42
44
45
The point is - selecting ranges of lines is a job for awk, definitely not sed.
awk '{m=NR%10} !(m==0 || m>=3)' file > tmp && mv tmp file
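Since awk has no in-place mode in its POSIX form, the edit goes through a temporary file. A slightly more defensive sketch of that idiom uses mktemp for the temporary name instead of a fixed `tmp`. The variable names are made up for the demo, and `seq 40` stands in for the real data.

```shell
f=$(mktemp)                       # stands in for the real file
seq 40 > "$f"

tmp=$(mktemp)                     # unique temp name instead of a fixed "tmp"
# Replace the file only if awk succeeded.
awk '{m=NR%10} !(m==0 || m>=3)' "$f" > "$tmp" && mv "$tmp" "$f"

kept=$(cat "$f")
rm -f "$f"
echo $kept                        # unquoted on purpose: one row of line numbers
```

One thing to keep in mind is that mv replaces the file, so the result takes on the temporary file's permissions and ownership, not those of the original.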
I need your help formatting my data. I have data like below:
Ver1
12 45
Ver2
134 23
Ver3
2345 980
ver4
21 1
ver36
213141222 22
....
...etc
I need my data like the below format
ver1 12 45
ver2 134 23
ver3 2345 980
ver4 21 1
etc.....
I also want the totals of columns 2 and 3 at the end of the output. I'm not sure how to script this; a simple script would be enough (maybe awk can do it, but I'm not sure). If possible, please share a detailed answer so I can learn and understand.
$ awk 'NR%2{printf "%s ", $0;next;}
{col1+=$1; col2+=$2} 1;
END{print "TOTAL col1="col1, "col2="col2}' file
Ver1 12 45
Ver2 134 23
Ver3 2345 980
ver4 21 1
ver36 213141222 22
TOTAL col1=213143734 col2=1071
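It merges every two lines, as solved by Kent: odd-numbered lines (NR%2 is non-zero) are printed followed by a space and no newline, so the next line lands on the same output line. For the even-numbered lines it also sums the 1st and 2nd columns into the variables col1 and col2. Finally, it prints the totals in the END {} block.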