Find overlapping ranges between different files - bash

I have two files, each with a column of ranges.
File 1
23241-24234
10023-12300
75432-82324
File 2
16722-17234
92000-94532
23600-25000
I am looking for ranges that overlap by a certain percentage (e.g. 50%) between the two files.
In the example above, only the following pair would be printed (50% overlap):
23241-24234 23600-25000
I can do this using Python, but was wondering if there is a quicker bash command that would do the same thing.

In Python, I would write something like this:
f1='''\
23241-24234
10023-12300
75432-82324'''
f2='''\
16722-17234
92000-94532
23600-25000'''
f1ranges = [tuple(map(int, l.split('-'))) for l in f1.splitlines()]
for l in f2.splitlines():
    b, e = map(int, l.split('-'))
    s2 = set(range(b, e))
    for r in f1ranges:
        s1 = set(range(*r))
        if len(s1 & s2) > len(s1) / 2:
            print r, (b, e)
Prints:
(23241, 24234) (23600, 25000)
That is hard to beat with Bash utilities; awk is the only real candidate.
The Python method uses the intersection of two sets as a shortcut to get the length of the overlapping interval. In awk you would either need to replicate that set-style functionality or use arithmetic comparisons on the endpoints.
Here is an awk framework:
awk 'FNR==NR { f1[$0]; next }
     {
       split($0, a, "-")
       for (e in f1) {
         split(e, b, "-")
         # add your range comparison logic here...
         print a[1], a[2], " ", b[1], b[2], a[2]-b[1], b[2]-a[1]
       }
     }' f1 f2
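For example, one way to fill in that comparison with plain arithmetic on the endpoints (a sketch that applies the question's criterion of covering more than 50% of the file-1 range):
awk 'FNR==NR { f1[$0]; next }
     {
       split($0, a, "-")                                   # a: range from file 2
       for (e in f1) {
         split(e, b, "-")                                  # b: range from file 1
         ov = (a[2] < b[2] ? a[2] : b[2]) - (a[1] > b[1] ? a[1] : b[1])   # overlap length
         if (ov > (b[2] - b[1]) / 2)                       # more than 50% of the file-1 range
           print b[1] "-" b[2], a[1] "-" a[2]
       }
     }' f1 f2
With the sample data this prints 23241-24234 23600-25000, matching the Python result.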

Convert the data into a "fake" BED format and use bedtools intersect: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
$ cat 1.bed
chr1 23241 24234
chr1 10023 12300
chr1 75432 82324
$ cat 2.bed
chr1 16722 17234
chr1 92000 94532
chr1 23600 25000
# sort both files
$ sort -k 1,1 -k2,2n 1.bed > 1.sort.bed
$ sort -k 1,1 -k2,2n 2.bed > 2.sort.bed
$ bedtools intersect -wa -wb -f 0.5 -a 1.sort.bed -b 2.sort.bed
chr1 23241 24234 chr1 23600 25000
You can parse the output and strip out the chr1 labels afterwards.
Obviously, bedtools is not a Bash builtin, but as you can see from the docs it has a huge number of options that will likely be useful for you as soon as your needs become more complicated.
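For instance, a small awk filter (just a sketch) drops the chr1 columns and restores the original start-end notation:
$ bedtools intersect -wa -wb -f 0.5 -a 1.sort.bed -b 2.sort.bed | awk '{print $2"-"$3, $5"-"$6}'
23241-24234 23600-25000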

Related

Faster way to extract data from large file

I have a file containing about 40000 frames of Cartesian coordinates of 28 atoms. I need to extract the coordinates of atoms 21 to 27 from each frame.
I tried using a bash script with a for-loop.
for i in {0..39999}
do
cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done
Data have following form:
28
-1373.82296 frame 0 xyz file generated by terachem
Re 1.6345663991 0.9571586961 0.3920887712
N 0.7107677071 -1.0248027788 0.5007181135
N -0.3626961076 1.1948218124 -0.4621264246
C -1.1299268126 0.0792071086 -0.5595954110
C -0.5157993503 -1.1509115191 -0.0469223696
C 1.3354467762 -2.1017253883 1.0125736017
C 0.7611763218 -3.3742177216 0.9821756556
C -1.1378354025 -2.4089069492 -0.1199253156
C -0.4944655989 -3.5108477831 0.4043826684
C -0.8597552614 2.3604180994 -0.9043060625
C -2.1340008843 2.4846545826 -1.4451933224
C -2.4023114639 0.1449111237 -1.0888703147
C -2.9292779079 1.3528434658 -1.5302429615
H 2.3226814021 -1.9233467458 1.4602019023
H 1.3128699342 -4.2076373780 1.3768411246
H -2.1105470176 -2.5059031902 -0.5582958817
H -0.9564415355 -4.4988963635 0.3544299401
H -0.1913951275 3.2219343258 -0.8231465989
H -2.4436044324 3.4620639189 -1.7693069306
H -3.0306593902 -0.7362803011 -1.1626515622
H -3.9523215784 1.4136948699 -1.9142814745
C 3.3621999538 0.4972227756 1.1031860016
O 4.3763020637 0.2022266109 1.5735343064
C 2.2906331057 2.7428149541 0.0483795630
O 2.6669163864 3.8206298898 -0.1683800650
C 1.0351398442 1.4995168190 2.1137684156
O 0.6510904387 1.8559680025 3.1601927094
Cl 2.2433490373 0.2064711824 -1.9226174036
It works, but it takes an enormous amount of time, and in the future I will be working with larger files. Is there a faster way to do this?
The reason your program is slow is that you keep re-reading the input file over and over in your for-loop. You can do everything in a single pass over the file with awk instead:
awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output
This answer assumes the following form of data:
frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...
The solution checks whether a line contains the word frame. If so, it resets the atom counter c to zero and skips to the next line. From that point on it increments the counter for every line it reads. If the counter is between 20 and 27 (exclusive), it prints the coordinates.
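For the single sample frame in the question, that first command would therefore print the x, y and z of the lines counted 21 through 26 after the frame line:
-3.9523215784 1.4136948699 -1.9142814745
3.3621999538 0.4972227756 1.1031860016
4.3763020637 0.2022266109 1.5735343064
2.2906331057 2.7428149541 0.0483795630
2.6669163864 3.8206298898 -0.1683800650
1.0351398442 1.4995168190 2.1137684156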
You can now easily expand on this. Assume you want the same atoms but only from frame 1000 to 1500. You can do this by introducing a frame counter fc:
awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output
If the frame numbers in the file are already in sorted order, e.g. they are numbered 0 - 39999 in this order, then maybe something like this could do the job (not tested, since we don't have a sample input file, as Jepessen suggested):
cat "$1" | grep -A 27 -E "frame [0-9]+ " | \
awk '{if ($2 == "frame") n = 0; else if ($1 == "--") next; if (n++ > 20) print $2, $3, $4}' > new_coors.xyz
(The code above is deliberately verbose, to be easier to understand and closer to your existing script. If you need a more compact solution, check kvantour's answer.)
You could perhaps use 2 passes of grep, rather than thousands?
Assuming you want the lines 21-27 after every frame, and you don't want to record the frame number itself, the following pipeline should get the lines you want, which you can then tidy up with awk:
grep -A27 ' frame ' | grep -B6 '-----'
If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.
Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers (-E because life is too short for backslashes):
-E ' frame ([0-9]{1,4}|[0-3][0-9]{1,4}) '

Having SUM issues with a bash script

I'm trying to write a script to pull the numbers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script; it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the following printout in the file, while my goal is to print only the final SUM, not the individual digits in a vertical format.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
Here is what I am getting in return:
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the individually listed digits, just the SUM total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
    while (match($0, /[0-9]+\.[0-9]+/))       # while matches on record
    {
        sum += substr($0, RSTART, RLENGTH)    # extract matches and sum them
        $0 = substr($0, RSTART + RLENGTH)     # reset to start after previous match
        count++                               # count matches
    }
}
END {
    print sum "/" count "=" sum/count         # print stuff
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk file
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
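For the label:value lines shown in the question, one more targeted option is to anchor on the colon (a sketch; adjust it to your real data):
grep -oE ':[0-9]+\.[0-9]{2}' "/location/$value-1.txt" | tr -d ':'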
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
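One safer pattern (a sketch, keeping the question's hypothetical /location/$value-1.txt path) is to write the sum to a temporary file and only append it once the pipeline has finished:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt" |
    awk '{ SUM += $1 } END { print SUM }' > "/location/$value-1.tmp" &&
    cat "/location/$value-1.tmp" >> "/location/$value-1.txt" &&
    rm "/location/$value-1.tmp"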

Print lines after a pattern in several files

I have a file with the same pattern several times.
Something like:
time
12:00
12:32
23:22
time
10:32
1:32
15:45
I want to print the lines after the pattern (time in the example) into several files.
The number of lines after the pattern is constant.
I found I can get the first part of my question with awk '/time/ {x=NR+3;next}(NR<=x){print}' filename
But I have no idea how to output each chunk into different files.
EDIT
My files are a bit more complex than my original question.
They have the following format.
4
gen
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
4
gen
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
I want to print the lines after
4
gen
EDIT 2
My expected output is x files, where x is the number of occurrences of the pattern.
From my second example, I want two files:
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
and
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
You can use this awk command:
awk '/time/{close(out); out="output" ++i; next} {print > out}' file
This awk command builds a variable out from a fixed prefix output and a counter i that is incremented every time a time line is seen. All subsequent lines are redirected to that output file. It is good practice to close these file handles to avoid running out of open file descriptors.
PS: If you want the time line in the output as well, remove next from the command above.
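With the time example from the question, the expected result is two files along these lines:
$ awk '/time/{close(out); out="output" ++i; next} {print > out}' file
$ cat output1
12:00
12:32
23:22
$ cat output2
10:32
1:32
15:45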
The revised "4/gen" requirements are somewhat ambiguous but the following script (which is just a variant of #anubhava's) conforms with those that are given and can easily be modified to deal with various edge cases:
awk '
  /^ *4 *$/             {close(out); out=0; next}
  /^ *gen *$/ && out==0 {out = "output." ++i; next}
  out                   {print > out}
'
I found another answer from anubhava here: How to print 5 consecutive lines after a pattern in file using awk,
and with head and a for loop I solved my problem:
for i in {1..23}; do grep -A25 "gen" allTemp | tail -n 25 > xyz$i; head -n -27 allTemp > allTemp2; cp allTemp2 allTemp; done
I know that the file allTemp has 23 occurrences of gen.
head removes the lines I printed to xyz$i, as well as the two lines I don't want, and writes a new file to allTemp2.

How to split text files by number of rows that corresponds to another set of files?

Cut a file into several files according to numbers in a list:
$ wc -l all.txt
8500 all.txt
$ wc -l STS.*.txt
2000 STS.input.answers-forums.txt
1500 STS.input.answers-students.txt
2000 STS.input.belief.txt
1500 STS.input.headlines.txt
1500 STS.input.images.txt
How do I split my all.txt according to the line counts of the STS.*.txt files and save the pieces to the respective STS.output.*.txt files?
I've been doing it manually as such:
$ sed '1,2000!d' all.txt > STS.output.answers-forums.txt
$ sed '2001,3500!d' all.txt > STS.output.answers-students.txt
$ sed '3501,5500!d' all.txt > STS.output.belief.txt
$ sed '5501,7000!d' all.txt > STS.output.headlines.txt
$ sed '7001,8500!d' all.txt > STS.output.images.txt
The all.txt input would look something like this:
$ head all.txt
2.3059
2.2371
2.1277
2.1261
2.0576
2.0141
2.0206
2.0397
1.9467
1.8518
Or sometimes all.txt looks like this:
$ head all.txt
2.3059 92.123
2.2371 1.123
2.1277 0.12452
2.1261123 213
2.0576 100
2.0141 0
2.02062 1
2.03972 34.123
1.9467 9.23
1.8518 9123.1
As for the STS.*.txt, they are just plain text lines, e.g.:
$ head STS.output.answers-forums.txt
The problem likely will mean corrective changes before the shuttle fleet starts flying again. He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.
The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650. The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.
"It's a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896. "It's a huge black eye," Arthur Sulzberger, the newspaper's publisher, said of the scandal.
Wish you'd posted some sample input for splitting an input file of, say, 10 lines into output files of, say, 2, 3, and 5 lines instead of 8500 lines into..., as that would have given us something to test a solution against. Oh well, this might work but is untested of course:
awk '
ARGIND < (ARGC-1) { outfile[NR] = gensub(/input/,"output","",FILENAME); next }
{ print > outfile[FNR] }
' STS.input.* all.txt
The above used GNU awk for ARGIND and gensub().
It just creates an array that maps each line number across all "input" files to the name of the "output" file that that same line number of "all.txt" should be written to.
Any time you write a loop in shell just to manipulate text you have the wrong approach. The guys who created shell also created awk for shell to call to manipulate text so just do that.
I would suggest writing a loop:
for file in answers-forums answers-students belief headlines images; do
lines=$(wc -l < "STS.input.$file.txt")
sed "$(( total + 1 )),$(( total + lines ))!d" all.txt > "STS.output.$file.txt"
(( total += lines ))
done
total keeps track of how many lines have been read so far. The sed command extracts the lines from total + 1 to total + lines, writing them to the corresponding output file.
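Either way, a quick sanity check is that the line counts of the output files should mirror the corresponding input files (the expected result, given the counts in the question):
$ wc -l STS.output.*.txt
2000 STS.output.answers-forums.txt
1500 STS.output.answers-students.txt
2000 STS.output.belief.txt
1500 STS.output.headlines.txt
1500 STS.output.images.txt
8500 total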

using paste command in a loop

I am using Fedora and bash to do some text manipulation with the files I have. I am trying to combine a large number of files, each one with two columns of data. From these files I want to extract the data in the 2nd column and put it in a single file. Previously, I used the following script:
paste 0_0.dat 0_6.dat 0_12.dat | awk '{print $1, $2, $4}' >0.dat
But this becomes painfully hard as the number of files gets larger -- I'm trying to do it with 100 files. So I looked through the web to see if there's a way to achieve this in a simple way, but came up empty-handed.
I'd like to invoke a 'for' loop, if possible -- for example,
for i in $(seq 0 6 600)
do
paste 0_0.dat | awk '{print $2}'>>0.dat
done
but this does not work, of course, with the paste command.
Please let me know if you have any recommendations on how to do what I'm trying to do ...
DATA FILE #1 looks like the one below (delimited by a space):
-180 0.00025432
-179 0.000309643
-178 0.000189226
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
DATA FILE #2
-180 0.0002352
-179 0.000423452
-178 0.00019304
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
The first column goes from -180 to 180, in increments of 1.
DESIRED OUTPUT
(n is the number of columns, i.e. the number of files)
-180 0.00025432 0.00025123 0.000235123 0.00023452 0.00023415 ... n
-179 0.000223432 0.0420504 0.2143450 0.002345123 0.00125235 ... n
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
179 0.002352534 ... n
180 0.001504992 ... n
Thanks,
join can get you your desired result.
join <(sort -r file1) <(sort -r file2)
Test:
[jaypal:~/Temp] cat file1
-180 0.00025432
-179 0.000309643
-178 0.000189226
[jaypal:~/Temp] cat file2
-180 0.0005524243
-179 0.0002424433
-178 0.0001833333
[jaypal:~/Temp] join <(sort -r file1) <(sort -r file2)
-180 0.00025432 0.0005524243
-179 0.000309643 0.0002424433
-178 0.000189226 0.0001833333
join works on exactly two files at a time, so to handle more files you can chain it, for example:
join <(sort -r file1) <(sort -r file2) | join - <(sort -r file3)
How about this:
paste "$#" | awk '{ printf("%s", $1);
for (i = 2; i < NF; i += 2)
printf(" %s", $i); printf "\n";
}'
This assumes that you don't run into a limit with paste (check how many open files it can have). The "$@" notation means 'all the arguments given, exactly as given'. The awk script simply prints $1 from each line of pasted output, followed by the even-numbered columns, followed by a newline. It doesn't validate that the odd-numbered columns all match; it would perhaps be sensible to do so, and you could code a vaguely similar loop in awk for that. It also doesn't check that the number of fields on a line is the same as on the previous line; that's another reasonable check. But this does do the whole job in one pass over all the files - for an essentially arbitrary list of files.
I have 100 input files -- how do I use this code to open up these files?
You put my original answer in a script 'filter-data'; you invoke the script with the 101 file names generated by seq. The paste command pastes all 101 files together; the awk command selects the columns you are interested in.
filter-data $(seq --format="0_%g.dat" 0 6 600)
The seq command with that format produces 101 file names; these are the 101 files that will be pasted.
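For instance, a shorter run of the same seq command shows the naming pattern (the first four of the 101 names):
$ seq --format="0_%g.dat" 0 6 18
0_0.dat
0_6.dat
0_12.dat
0_18.dat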
You could even do without the filter-data script:
paste $(seq --format="0_%g.dat" 0 6 600) |
    awk '{ printf("%s", $1);
           for (i = 2; i <= NF; i += 2)
               printf(" %s", $i);
           printf "\n";
         }'
I'd probably go with the more general script as the main script, and if need be I'd create a 'one-liner' that invokes the main script with the specific set of arguments currently of interest.
The other key point which might be a stumbling block: paste is not limited to 2 files only; it can paste as many files as you can have open (give or take about 3).
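To see the per-process ceiling that parenthetical alludes to, you can check the open-file limit (the value shown is just a common default):
$ ulimit -n
1024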
Based on the assumptions you see in the comments above, you don't need paste. Try this:
awk '{ arr[$1] = arr[$1] "\t" $2 }
     END { for (x = -180; x <= 180; x++) print x "\t" arr[x] }
    ' *.txt | sort -n
Note that we just read all of the values into an array keyed on the first field, appending each value to the entry for its $1 key. After all the data has been read in, the END section prints out the key and the accumulated values. I've added things like "x=", ":vals=" to help 'explain' what is happening; remove those for completely clean tab-separated data. Change '\t' to ':' or '|', or ... shudder ',' if you need to. Change the *.txt to whatever your filespec is.
Be aware that all Unix command lines have limits on the number of filenames, and on the total length of the filenames (not the data inside them), that can be processed in one invocation. Let us know if you get error messages about that.
The pipe to sort ensures that data is sorted by column1.
With my test data, the output was
-178 0.0001892261 0.0001892262 0.0001892263 0.000189226
-179 0.0003096431 0.0003096432 0.0003096433 0.000309643
-180 0.000254321 0.000254322 0.000254323 0.00025432
178 0.0001892261 0.0001892262 0.0001892263 0.000189226
179 0.0003096431 0.0003096432 0.0003096433 0.000309643
180 0.000254321 0.000254322 0.000254323 0.00025432
Based on 4 files of input.
I hope this helps.
P.S. Welcome to StackOverflow (S.O.). Please remember to read the FAQs, http://tinyurl.com/2vycnvr , vote for good Q/A by using the gray triangles, http://i.imgur.com/kygEP.png , and to accept the answer that best solves your problem, if any, by pressing the checkmark sign, http://i.imgur.com/uqJeW.png
This might work for you:
echo *.dat | sed 's/\S*/<(cut -d" " -f2 &)/2g;s/^/paste /' | bash >all.dat
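To see what this builds before piping it to bash, drop the | bash; with the three files from the original command the generated line looks roughly like this (the space delimiter for cut is an assumption based on the sample data):
$ echo *.dat | sed 's/\S*/<(cut -d" " -f2 &)/2g;s/^/paste /'
paste 0_0.dat <(cut -d" " -f2 0_12.dat) <(cut -d" " -f2 0_6.dat)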
