Given the four digit number 1234, there are six possible two digit subsequences (12, 13, 14, 23, 24, 34). Given some of the subsequences, is it possible to recover the original number?
Here's some example data. Each line lists some 3-digit subsequences of a different 6-digit number (to be found):
528, 508, 028, 502, 058, 528, 028, 528, 552, 050
163, 635, 635, 130, 163, 633, 130, 330, 635, 135
445, 444, 444, 444, 454, 444, 445,
011, 350, 601, 651, 601, 511, 511, 360, 601, 351
102, 021, 102, 221, 102, 100, 002, 021, 021, 121
332, 111, 313, 311, 132, 113, 132, 111, 112
362, 650, 230, 172, 120, 165, 372, 202, 702
103, 038, 138, 150, 110, 518, 510, 538, 108
343, 231, 431, 341, 203, 203, 401, 303, 031, 233
Edit: Sometimes the solution might not be unique (more than one number could have given the subsequences). In that case, it would be good to return one of them, or maybe even a list.
What you want to do is to find the Shortest common supersequence of all your subsequences. Clearly if you have all subsequences, including the original number, the SCS will be what you are looking for. Otherwise it can't be guaranteed, but there's a good chance.
Unfortunately there isn't a nice polynomial algorithm for this problem, but if you Google it you'll find there are a lot of approximation algorithms available. E.g. An ACO Algorithm for the Shortest Common Supersequence Problem, which mentions three overall approaches besides its own:
Dynamic Programming or Branch'n'Bound. These are usually too slow except for very few strings or small alphabets.
Finding the SCS of the strings pairwise using Dynamic Programming, using heuristics to choose which strings to 'merge'.
The Majority Merge heuristic which might be the nicest one for your case.
The approach described in the paper.
Here is another nice article about the problem: http://www.update.uu.se/~shikaree/Westling/
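As a rough illustration of the Majority Merge heuristic mentioned above, here is a minimal Python sketch. The function names are my own, ties are broken by taking the smallest symbol (an arbitrary choice), and the result is only guaranteed to be a common supersequence, not the shortest one and not necessarily the original number.

```python
from collections import Counter


def is_subsequence(small, big):
    """True if `small` can be obtained from `big` by deleting characters."""
    it = iter(big)
    return all(ch in it for ch in small)


def majority_merge(strings):
    """Greedy common supersequence: repeatedly emit the most frequent leading symbol."""
    remaining = [s for s in strings if s]
    out = []
    while remaining:
        counts = Counter(s[0] for s in remaining)
        # Most frequent leading symbol; ties go to the smallest symbol.
        best = min(counts, key=lambda ch: (-counts[ch], ch))
        out.append(best)
        # Consume that symbol from every string that starts with it.
        remaining = [s[1:] if s[0] == best else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(out)


clues = ["528", "508", "028", "502", "058", "552", "050"]
merged = majority_merge(clues)
print(merged, all(is_subsequence(c, merged) for c in clues))
```

If the result is longer than the known length of the number, the heuristic has failed for that input and one of the exact methods is needed.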
Build a directed graph with each digit connected to the digit following it in each sequence.
Dealing with cycles:
A cycle implies an impossible scenario - the same character cannot have 2 locations (there can be characters with the same value at multiple positions, but not the exact same character - as a metaphor, you can have many people named Bob, but any given Bob can only be in one place). Some node must be split into multiple nodes. The chosen node should be split such that all incoming edges go to one of the new nodes and all outgoing edges leave the other, with a connection between the two.
There may be multiple nodes that can be picked for splitting, with possibly only one being correct, so you may need to explore all possibilities until you find one that works. If a choice doesn't work, you'll get a longer string than is allowed somewhere down the line.
It might be a better idea to leave getting rid of cycles completely until right before the topological sort (resolving them in the same way).
Dealing with nodes with the same value (as a result of cycle resolution):
If there are multiple nodes with the same value that can be chosen, let the outgoing edges go from the first one (the one that has a directed path to all the others) and the incoming edges go to the last one (the one which all the others have a directed path to). This obviously needs to be slightly modified if there are multiple digits with the same value in the same sequence.
Finding the actual string:
To determine the string, do a topological sort on the graph.
Example:
Assume we're looking for a 5-digit number and the input is:
528, 508, 028, 502, 058, 058
I know the duplication of 058 is somewhat trivial, but it's just for illustration.
For 528, create nodes for 5, 2 and 8, and connect 5 and 2 and 2 and 8.
5 -> 2 -> 8
For 508, create 0, connect 5 and 0 and 0 and 8.
5 -> 2 -> 8
\ /
> 0
For 028, connect 0 and 2.
5 ------> 2 -> 8
\ / /
> 0 -----
For 502, all connections are already there.
For 058, we get a cycle (5->0->5), so we have 2 choices:
Split 0 into 2 nodes:
/-----------\----\
/ v v
0 -> 5 ------> 2 -> 8
\
> 0
Split 5 into 2 nodes:
/-----------\
/ v
5 ------> 2 -> 5 -> 8
\ ^ ^
\ / /
> 0 --------
Let's assume we go with the latter.
For 058, we need an outgoing edge from the last 5 (the right 5 in this case) and an incoming edge from the first 5 (the left 5 in this case). These edges (5->0 and 5->8) already exist, so there's nothing to do.
A topological sort will give us 50258, which is our number.
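For the simple case where every digit occurs only once (so no cycles and no node splitting), the graph construction and topological sort can be sketched in Python with the standard graphlib module (Python 3.9+). The function name is made up for illustration, and the cycle handling described above is deliberately left out.

```python
from graphlib import TopologicalSorter


def recover_distinct(clues):
    """Order the digits so that every clue is a subsequence (distinct digits only)."""
    sorter = TopologicalSorter()
    for clue in clues:
        sorter.add(clue[0])  # make sure the first digit is present as a node
        for before, after in zip(clue, clue[1:]):
            sorter.add(after, before)  # `before` must precede `after`
    # Raises graphlib.CycleError if the clues imply a cycle.
    return "".join(sorter.static_order())


print(recover_distinct(["12", "13", "14", "23", "24", "34"]))  # prints 1234
```

When the clues do not pin down a single order, static_order() returns just one of the valid orders.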
Let logic programming do the work for you.
This is via core.logic in Clojure.
Define what it means to be a subsequence
(defne subseqo [s1 s2]
([(h . t1) (h . t2)] (subseqo t1 t2))
([(h1 . t1) (h2 . t2)] (!= h1 h2) (subseqo s1 t2))
([() _]))
Run the constraints through the solver.
(defn recover6 [input-string]
(run* [q]
(fresh [a b c d e f]
(== q [a b c d e f])
(everyg (fn [s] (subseqo (seq s) q))
(re-seq #"\d+" input-string)))))
Examples (results are perceptually instantaneous at the REPL):
(recover6 "528, 508, 028, 502, 058, 528, 028, 528, 552, 050")
;=> ([\5 \0 \5 \2 \8 \0]
[\5 \0 \5 \2 \0 \8]
[\5 \0 \5 \0 \2 \8]
[\0 \5 \0 \5 \2 \8]
[\0 \5 \5 \0 \2 \8])
(recover6 "163, 635, 635, 130, 163, 633, 130, 330, 635, 135")
;=> ([\1 \6 \3 \5 \3 \0]
[\1 \6 \3 \3 \5 \0]
[\1 \6 \3 \3 \0 \5])
(recover6 "445, 444, 444, 444, 454, 444, 445")
;=> ([\4 \4 \5 \4 _0 _1]
... and many more
In the last example, the underscores indicate that _0 and _1 are free variables. They have not been constrained. It is easy enough to constrain any free variables to the set of digits.
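For readers without core.logic at hand, the same exhaustive search can be sketched in plain Python. This is not a translation of the Clojure code, just a depth-first search with pruning under my own names (recover, extend); unlike the logic version it fills unconstrained positions with every digit instead of leaving free variables.

```python
def recover(clues, length, alphabet="0123456789"):
    """Return every string of `length` that contains each clue as a subsequence."""
    clues = sorted(set(clues))
    results = []

    def extend(prefix, matched):
        # matched[i] = how many leading characters of clues[i] were found so far
        remaining = length - len(prefix)
        if any(len(c) - m > remaining for c, m in zip(clues, matched)):
            return  # some clue can no longer fit in the remaining positions
        if remaining == 0:
            results.append(prefix)
            return
        for ch in alphabet:
            advanced = tuple(
                m + 1 if m < len(c) and c[m] == ch else m
                for c, m in zip(clues, matched)
            )
            extend(prefix + ch, advanced)

    extend("", tuple(0 for _ in clues))
    return results


clues = "528, 508, 028, 502, 058, 528, 028, 528, 552, 050".split(", ")
print(recover(clues, 6))
# ['050528', '055028', '505028', '505208', '505280']
```

Matching each clue greedily against the prefix is enough, because a string contains a subsequence exactly when the greedy left-to-right match succeeds.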
I'm trying to have the "range" of compass headings over the last X seconds. Example: Over the last minute, my heading has been between 120deg and 140deg on the compass. Easy enough right? I have an array with the compass headings over the time period, say 1 reading every second.
[ 125, 122, 120, 125, 130, 139, 140, 138 ]
I can take the minimum and maximum values and there you go. My range is from 120 to 140.
Except it's not that simple. Take for example if my heading has shifted from 10 degrees to 350 degrees (i.e. it "passed" through North, changing by -20deg).
Now my array might look something like this:
[ 9, 10, 6, 3, 358, 355, 350, 353 ]
Now the Min is 3 and max 358, which is not what I need :( I'm looking for the most "right hand" (clockwise) value, and most "left hand" (counter-clockwise) value.
Only way I can think of, is finding the largest arc along the circle that includes none of the values in my array, but I don't even know how I would do that.
Would really appreciate any help!
Problem Analysis
To summarize the problem, it sounds like you want to find both of the following:
The two readings that are closest together (for simplicity: in a clockwise direction) AND
Contain all of the other readings between them.
So in your second example, 9 and 10 are only 1° apart, but they do not contain all the other readings. Conversely, traveling clockwise from 10 to 9 would contain all of the other readings, but they are 359° apart in that direction, so they are not closest.
In this case, I'm not sure if using the minimum and maximum readings will help. Instead, I'd recommend sorting all of the readings. Then you can more easily check the two criteria specified above.
Here's the second example you provided, sorted in ascending order:
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
If we start from the beginning, we know that traveling from reading 3 to reading 358 will encompass all of the other readings, but they are 358 - 3 = 355° apart. We can continue scanning the results progressively. Note that once we circle around, we have to add 360 to properly calculate the degrees of separation.
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
*--------------------------> 358 - 3 = 355° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
-> *----------------------------- (360 + 3) - 6 = 357° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
----> *-------------------------- (360 + 6) - 9 = 357° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
-------> *----------------------- (360 + 9) - 10 = 359° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
----------> *------------------- (360 + 10) - 350 = 20° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
--------------> *-------------- (360 + 350) - 353 = 357° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
-------------------> *--------- (360 + 353) - 355 = 358° separation
[ 3, 6, 9, 10, 350, 353, 355, 358 ]
------------------------> *---- (360 + 355) - 358 = 357° separation
Pseudocode Solution
Here's a pseudocode algorithm for determining the minimum degree range of reading values. There are definitely ways it could be optimized if performance is a concern.
// Somehow, we need to get our reading data into the program, sorted
// in ascending order.
// If readings are always whole numbers, you can use an int[] array
// instead of a double[] array. If we use an int[] array here, change
// the "minimumInclusiveReadingRange" variable below to be an int too.
double[] readings = populateAndSortReadingsArray();
if (readings.length == 0)
{
// Handle case where no readings are provided. Show a warning,
// throw an error, or whatever the requirement is.
}
else
{
// We want to track the endpoints of the smallest inclusive range.
// These values will be overwritten each time a better range is found.
int minimumInclusiveEndpointIndex1;
int minimumInclusiveEndpointIndex2;
double minimumInclusiveReadingRange; // This is convenient, but not necessary.
// We could determine it using the
// endpoint indices instead.
// Check the range of the greatest and least readings first. Since
// the readings are sorted, the greatest reading is the last element.
// The least reading is the first element.
minimumInclusiveReadingRange = readings[readings.length - 1] - readings[0];
minimumInclusiveEndpointIndex1 = 0;
minimumInclusiveEndpointIndex2 = readings.length - 1;
// Potential to skip some processing. If the ends are 180 or less
// degrees apart, they represent the minimum inclusive reading range.
// The for loop below could be skipped.
for (int i = 1; i < readings.length; i++)
{
if ((360.0 + readings[i-1]) - readings[i] < minimumInclusiveReadingRange)
{
minimumInclusiveReadingRange = (360.0 + readings[i-1]) - readings[i];
minimumInclusiveEndpointIndex1 = i;
minimumInclusiveEndpointIndex2 = i - 1;
}
}
// Most likely, there will be some different readings, but there is an
// edge case of all readings being the same:
if (minimumInclusiveReadingRange == 0.0)
{
print("All readings were the same: " + readings[0]);
}
else
{
print("The range of compass readings was: " + minimumInclusiveReadingRange +
" spanning from " + readings[minimumInclusiveEndpointIndex1] +
" to " + readings[minimumInclusiveEndpointIndex2]);
}
}
There is one additional edge case that this pseudocode algorithm does not cover, and that is the case where there are multiple minimum inclusive ranges...
Example 1: [0, 90, 180, 270] which has a range of 270 (90 to 0/360, 180 to 90, 270 to 180, and 0 to 270).
Example 2: [85, 95, 265, 275] which has a range of 190 (85 to 275 and 265 to 95)
If it's necessary to report each possible pair of endpoints that create the minimum inclusive range, this edge case would increase the complexity of the logic a bit. If all that matters is determining the value of the minimum inclusive range or it is sufficient to report just one pair that represents the minimum inclusive range, the provided algorithm should suffice.
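For anyone who wants something runnable, here is a rough Python sketch of the same sorted-scan idea. The function name and the returned tuple are my own choices; readings are reduced modulo 360 and de-duplicated first (an assumption that simplifies the wrap-around arithmetic), and only the first minimal range is reported when there are ties.

```python
def heading_range(readings):
    """Return (start, end, span) of the smallest clockwise arc covering all readings."""
    if not readings:
        raise ValueError("no readings supplied")
    ordered = sorted({r % 360 for r in readings})
    best = None
    for i, start in enumerate(ordered):
        # Going clockwise from ordered[i], the arc that covers everything
        # ends at the reading just before it (wrapping to the last one for i == 0).
        end = ordered[i - 1]
        span = (end - start) % 360
        if best is None or span < best[2]:
            best = (start, end, span)
    return best


print(heading_range([125, 122, 120, 125, 130, 139, 140, 138]))  # (120, 140, 20)
print(heading_range([9, 10, 6, 3, 358, 355, 350, 353]))         # (350, 10, 20)
```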
Let's say I have a number x which is a power of two that means x = 2^i for some i.
So the binary representation of x has only one '1'. I need to find the index of that one.
For example, x = 16 (decimal)
x = 10000 (in binary)
here index should be 4. Is it possible to find the index in O(1) time by just using bit operation?
The following is an explanation of the logic behind the use of de Bruijn sequences in the O(1) code of the answers provided by @Juan Lopez and @njuffa (great answers by the way, you should consider upvoting them).
The de Bruijn sequence
Given an alphabet K and a length n, a de Bruijn sequence is a sequence of characters from K that contains (in no particular order) all permutations with length n in it [1]. For example, given the alphabet {a, b} and n = 3, the following is a list of all permutations (with repetitions) of {a, b} with length 3:
[aaa, aab, aba, abb, baa, bab, bba, bbb]
To create the associated de Bruijn sequence we construct a minimum string that contains all these permutations without repetition. One such string is: babbbaaa
"babbbaaa" is a de Bruijn sequence for our alphabet K = {a, b} and n = 3. The notation for this is B(2, 3), where 2 is the size of K, also denoted k. The length of the sequence is k^n; in the previous example k^n = 2^3 = 8.
How can we construct such a string? One method consists of building a directed graph where every node represents a permutation and has an outgoing edge for every letter in the alphabet. The transition from one node to another appends the edge letter to the right of the node's string and removes its leftmost letter. Once the graph is built, collect the edge letters along a Hamiltonian path over it to construct the sequence.
The graph for the previous example would be:
Then, take a Hamiltonian path (a path that visits each vertex exactly once):
Starting from node aaa and following each edge, we end up having:
(aaa) -> b -> a -> b -> b -> b -> a -> a -> a (aaa) = babbbaaa
We could have started from the node bbb in which case the obtained sequence would have been "aaababbb".
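A quick way to check a candidate sequence is to slide a window of length n over it, wrapping around at the end, and compare the windows with the full list of permutations. A small Python sketch (the helper name is mine):

```python
from itertools import product


def is_de_bruijn(seq, alphabet, n):
    """Check that every length-n string over `alphabet` occurs once in cyclic `seq`."""
    if len(seq) != len(alphabet) ** n:
        return False
    wrapped = seq + seq[: n - 1]  # let the last windows wrap around
    windows = {wrapped[i:i + n] for i in range(len(seq))}
    expected = {"".join(p) for p in product(alphabet, repeat=n)}
    return windows == expected


print(is_de_bruijn("babbbaaa", "ab", 3))  # True
print(is_de_bruijn("aaababbb", "ab", 3))  # True
print(is_de_bruijn("abababab", "ab", 3))  # False
```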
Now that the de Bruijn sequence is covered, let's use it to find the number of trailing zeroes in an integer.
The de Bruijn algorithm [2]
To find the number of trailing zeroes in an integer value, the first step in this algorithm is to isolate the first set bit from right to left. For example, given 848 (1101010000₂):
isolate rightmost bit
1101010000 ---------------------------> 0000010000
One way to do this is using x & (~x + 1). You can find more info on how this expression works in the book Hacker's Delight (chapter 2, section 2-1).
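The expression is easy to try out. In Python (where integers behave like two's complement numbers of unlimited width, so the same trick applies), a minimal sketch with a helper name of my own looks like this:

```python
def lowest_set_bit(x):
    """Isolate the rightmost 1 bit of a non-negative integer."""
    return x & (~x + 1)  # ~x + 1 is -x in two's complement


print(bin(848), "->", bin(lowest_set_bit(848)))  # 0b1101010000 -> 0b10000
```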
The question states that the input is a power of 2, so the rightmost bit is isolated from the beginning and no effort is required for that.
Once the bit is isolated (thus turning the value into a power of two), the second step consists of using a hash table along with its hash function to map the filtered input to its number of trailing 0's. E.g., applying the hash function h(x) to 0000010000₂ should return the index in the table that contains the value 4.
The algorithm proposes the use of a perfect hash function highlighting these properties:
the hash table should be small
the hash function should be easy to compute
the hash function should not produce collisions, i.e., h(x) ≠ h(y) if x ≠ y
To achieve this, we can use a de Bruijn sequence with the binary alphabet K = {0, 1} and n = 6 if we want to solve the problem for 64-bit integers (there are 64 possible power-of-two values and 6 bits are required to count them all). B(2, 6) has length 2^6 = 64, so we need a de Bruijn sequence of length 64 that includes all permutations (with repetition) of binary digits with length 6 (000000₂, 000001₂, ..., 111111₂).
Using a program that implements a method like the one described above, you can generate a de Bruijn sequence that meets the requirement for 64-bit integers:
0000011111101101110101011110010110011010010011100010100011000010₂ = 7EDD5E59A4E28C2₁₆
The proposed hashing function for the algorithm is:
h(x) = (x * deBruijn) >> (k^n - n)
Where x is a power of two. For every possible power of two within 64 bits, h(x) returns a distinct 6-bit pattern, and we need to associate each of these patterns with the number of trailing zeroes to fill the table. For example, if x is 16 = 10000₂, which has 4 trailing zeroes, we have:
h(16) = (16 * 0x7EDD5E59A4E28C2) >> 58
= 9141566557743385632 >> 58
= 31 (011111b)
So, at index 31 of our table, we store 4. As another example, take 256 = 100000000₂, which has 8 trailing zeroes:
h(256) = (256 * 0x7EDD5E59A4E28C2) >> 58
= 17137856407927308800 (due to overflow) >> 58
= 59 (111011b)
At index 59, we store 8. We repeat this process for every power of two until we fill up the table. Generating the table manually is unwieldy; you should use a program like the one found here for this endeavour.
At the end we'd end up with the following table:
int table[] = {
63, 0, 58, 1, 59, 47, 53, 2,
60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20,
55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41,
50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12,
44, 24, 15, 8, 23, 7, 6, 5
};
And the code to calculate the required value:
// Assumes that x is a power of two
int numTrailingZeroes(uint64_t x) {
    return table[(x * 0x7EDD5E59A4E28C2ull) >> 58];
}
What guarantees that we are not missing an index for a power of two due to a collision?
Multiplying by x is just a left shift (multiplying a number by a power of two is the same as left-shifting it), and the right shift by 58 then keeps only the top 6 bits. So for each power of two the hash function extracts a different 6-bit window of the de Bruijn sequence. Because every 6-bit pattern occurs exactly once in the sequence, two different values of x can never collide (the third property of the desired hash function for this problem).
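To convince yourself that there are no collisions, the table can be generated and checked with a few lines of Python. This is only a sketch: the names are mine, and because Python integers are unbounded the 64-bit overflow is emulated with a mask.

```python
DEBRUIJN64 = 0x07EDD5E59A4E28C2
MASK64 = (1 << 64) - 1


def build_table():
    """Map each 6-bit hash value to the bit index that produces it."""
    table = [None] * 64
    for i in range(64):
        h = ((DEBRUIJN64 << i) & MASK64) >> 58
        assert table[h] is None, "collision"  # never triggers for a de Bruijn constant
        table[h] = i
    return table


TABLE = build_table()


def bit_index(x):
    """Index of the single set bit in x, where x is a power of two below 2**64."""
    return TABLE[((x * DEBRUIJN64) & MASK64) >> 58]


print(TABLE[31], TABLE[59])            # 4 8
print(bit_index(16), bit_index(256))   # 4 8
```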
References:
[1] De Bruijn Sequences - http://datagenetics.com/blog/october22013/index.html
[2] Using de Bruijn Sequences to Index a 1 in a Computer Word - http://supertech.csail.mit.edu/papers/debruijn.pdf
[3] The Magic Bitscan - http://elemarjr.net/2011/01/09/the-magic-bitscan/
The specifications of the problem are not entirely clear to me. For example, which operations count as "bit operations" and how many bits make up the input in question? Many processors have a "count leading zeros" or "find first bit" instruction exposed via intrinsic that basically provides the desired result directly.
Below I show how to find the bit position in 32-bit integer based on a De Bruijn sequence.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
/* find position of 1-bit in a = 2^n, 0 <= n <= 31 */
int bit_pos (uint32_t a)
{
static int tab[32] = { 0, 1, 2, 6, 3, 11, 7, 16,
4, 14, 12, 21, 8, 23, 17, 26,
31, 5, 10, 15, 13, 20, 22, 25,
30, 9, 19, 24, 29, 18, 28, 27};
// return tab [0x04653adf * a >> 27];
return tab [(a + (a << 1) + (a << 2) + (a << 3) + (a << 4) + (a << 6) +
(a << 7) + (a << 9) + (a << 11) + (a << 12) + (a << 13) +
(a << 16) + (a << 18) + (a << 21) + (a << 22) + (a << 26))
>> 27];
}
int main (void)
{
uint32_t nbr;
int pos = 0;
while (pos < 32) {
nbr = 1U << pos;
if (bit_pos (nbr) != pos) {
printf ("!!!! error: nbr=%08x bit_pos=%d pos=%d\n",
nbr, bit_pos(nbr), pos);
return EXIT_FAILURE;
}
pos++;
}
return EXIT_SUCCESS;
}
You can do it in O(1) if you allow a single memory access:
#include <iostream>
using namespace std;
int indexes[] = {
63, 0, 58, 1, 59, 47, 53, 2,
60, 39, 48, 27, 54, 33, 42, 3,
61, 51, 37, 40, 49, 18, 28, 20,
55, 30, 34, 11, 43, 14, 22, 4,
62, 57, 46, 52, 38, 26, 32, 41,
50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16, 9, 12,
44, 24, 15, 8, 23, 7, 6, 5
};
int main() {
unsigned long long n;
while(cin >> n) {
cout << indexes[((n & (~n + 1)) * 0x07EDD5E59A4E28C2ull) >> 58] << endl;
}
}
It depends on your definitions. First let's assume there are n bits, because if we assume there is a constant number of bits then everything we could possibly do with them is going to take constant time so we could not compare anything.
First, let's take the widest possible view of "bitwise operations" - they operate on bits but not necessarily pointwise, and furthermore we'll count operations but not include the complexity of the operations themselves.
M. L. Fredman and D. E. Willard showed that there is an algorithm of O(1) operations to compute lambda(x) (the floor of the base-2 logarithm of x, so the index of the highest set bit). It contains quite some multiplications though, so calling it bitwise is a bit funny.
On the other hand, there is an obvious O(log n) operations algorithm using no multiplications: just binary search for it. But we can do better: Gerth Brodal showed that it can be done in O(log log n) operations (and none of them are multiplications).
All the algorithms I referenced are in The Art of Computer Programming 4A, bitwise tricks and techniques.
None of these really qualify as finding that 1 in constant time, and it should be obvious that you can't do that. The other answers don't qualify either, despite their claims. They're cool, but they're designed for a specific constant number of bits, any naive algorithm would therefore also be O(1) (trivially, because there is no n to depend on). In a comment OP said something that implied he actually wanted that, but it doesn't technically answer the question.
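For reference, the plain binary search mentioned above could look roughly like this in Python. The width of 64 is my own assumption and must be a power of two; only shifts, comparisons and additions are used.

```python
def bit_index_binary_search(x, width=64):
    """Index of the highest set bit of x (0 < x < 2**width) in O(log width) steps."""
    index = 0
    shift = width // 2
    while shift:
        if x >> shift:       # is there a set bit in the upper half?
            x >>= shift      # yes: discard the lower half
            index += shift
        shift //= 2
    return index


print(bit_index_binary_search(16))       # 4
print(bit_index_binary_search(1 << 20))  # 20
```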
And the answer is ... ... ... ... ... yes!
Just for fun, since you commented below one of the answers that i up to 20 would suffice.
(multiplications here are by either zero or one)
#include <iostream>
using namespace std;
int f(int n){
return
0 | !(n ^ 1048576) * 20
| !(n ^ 524288) * 19
| !(n ^ 262144) * 18
| !(n ^ 131072) * 17
| !(n ^ 65536) * 16
| !(n ^ 32768) * 15
| !(n ^ 16384) * 14
| !(n ^ 8192) * 13
| !(n ^ 4096) * 12
| !(n ^ 2048) * 11
| !(n ^ 1024) * 10
| !(n ^ 512) * 9
| !(n ^ 256) * 8
| !(n ^ 128) * 7
| !(n ^ 64) * 6
| !(n ^ 32) * 5
| !(n ^ 16) * 4
| !(n ^ 8) * 3
| !(n ^ 4) * 2
| !(n ^ 2);
}
int main() {
for (int i=1; i<1048577; i <<= 1){
cout << f(i) << " "; // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
}
}
I need to render a horizontal calendar and render events on it. So I get two dates and the width in pixels. I want to distribute the days between the two provided dates over those pixels and maintain a minimum distance between the visual points.
For instance, I have 365 days (each day should consume at least 10 pixels) and I need to distribute them over 300 pixels. So I need to "pack" them in groups so each pixel would represent multiple dates. How can I calculate this mathematically speaking?
i.e.
(days)
1/1 8/1 16/1 24/1 2/2 10/2 18/2 ......
in the above example for instance, how can I calculate that I need to "pack/skip" the 7 days?
What I need in the end is to produce an array with the dates (days) and the x offset where it should be positioned in the horizontal axis.
i.e.
1/1/2013 = 0
2/1/2013 = 0
3/1/2013 = 0
4/1/2013 = 0
5/1/2013 = 0
6/1/2013 = 0
7/1/2013 = 0
8/1/2013 = 10
9/1/2013 = 10
10/1/2013 = 10
....
You have 300 pixels to use. Each 'package' should be at least 10 pixels. This means you should have 300/10=30 packages. You have 365 which should be distributed over 30 packages so that's 365/30=12.17 days per package. Or simply 12.
The same logic can be used to calculate the amount of days needed in a package if you have a different amount of pixels to use.
I hope that this was what you were asking for.
Jannes
Edit: I have just read your edit so I will alter my reply a bit here.
If you have converted your date to a number between 1 and 365 you can simply calculate each element of your array days like this.
days[i]=floor(i/12)*10
Where the 12 came from above calculations.
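Put together, the calculation might look like this in Python. The function and parameter names are only for illustration. Note one deliberate difference: I round the days per package up (13 instead of 12), which is my own assumption, so that the last package still fits inside the available width.

```python
def day_offsets(num_days, width_px, min_px):
    """Return the x offset for each day, packing days into groups of equal size."""
    packages = width_px // min_px                # 300 // 10 = 30 packages
    days_per_package = -(-num_days // packages)  # ceiling division: 365 / 30 -> 13
    return [(day // days_per_package) * min_px for day in range(num_days)]


offsets = day_offsets(365, 300, 10)
print(offsets[:15])   # thirteen 0s followed by two 10s
print(max(offsets))   # 280
```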
date_width = 10
display_width = 300
date_range = 365
num_of_dates = display_width // date_width
date_offsets = [x * date_range // num_of_dates for x in range(num_of_dates)]
gives dates for every 10 "pixels"
[0, 12, 24, 36, 48, 60, 73, 85, 97, 109, 121, 133, 146, 158, 170, 182, 194, 206, 219, 231, 243, 255, 267, 279, 292, 304, 316, 328, 340, 352]
If, seeing that you have about 12 days between data points, you want to shift it up to 2 weeks:
date_offset = 14
date_offsets = [x * date_offset for x in range(date_range//date_offset)]
date_positions = [display_width * o // date_range for o in date_offsets]
I am trying to teach myself bioinformatics, arriving to the party by way of computer science and high performance computing. (Essentially, I'm trying to learn the biology.) I've recently discovered BioPython and so far think it's great, but I am curious if anybody could help me identify why the translate() method used in BioPython to convert sequence data to ORF candidates and amino protein chains is behaving differently than expected.
The following is from this year's DNA60 challenge, and it's to find all of the ORF's in a sequence and sort them, convert them to amino chains, and then take the 25th amino acid from the longest top 15 chains to spell out a phrase.
Here's the challenge:
http://genomebiology.com/about/update/DNA60_TAGCGAC
So after doing some research, I settled on using the code directly out of the tutorial for finding and identifying ORF's using the translate method, found here:
http://www.bio-cloud.info/Biopython/en/ch16.html
Modifying it to print out the 25th amino acid for each chain, and sorting the output by chain length (using the Linux command line tool "sort"), the output is entirely wrong.
Knowing what the answer was supposed to be, I could not figure out why this wasn't working. So I wrote my own script to do the ORF identification and translation, sorted the output, and it worked! (Using NCBI table 1, min length of 25.)
So somehow, the ORF identification in the translate method is not working the way I think it should, and I was hoping somebody could tell me why. Below is my code for ORF identification in Python (and you pass in the reverse_complement for the second set of three frames)
from Bio.Seq import Seq

def getORF(sequence, threshold, start_codons, stop_codons):
    start_codon_index = 0
    end_codon_index = 0
    start_codon_found = False
    orfs = []
    # Scan each of the three reading frames of this strand.
    for j in range(0, 3):
        for indx in range(j, len(sequence), 3):
            current_codon = sequence[indx:indx+3]
            # Remember the first start codon seen since the last stop codon.
            if current_codon in start_codons and not start_codon_found:
                start_codon_found = True
                start_codon_index = indx
            # A stop codon closes the currently open ORF.
            if current_codon in stop_codons and start_codon_found:
                end_codon_index = indx
                length = end_codon_index - start_codon_index + 3
                if length >= threshold * 3:
                    orfs.append(start_codon_index)
                if length % 3 != 0:
                    #print "it's going to complain"
                    #print len(sequence)-end_codon_index-3
                    pass
                snippet = Seq(sequence[start_codon_index:end_codon_index])
                proteins = Seq.translate(snippet)
                if len(proteins) >= 25:
                    print("%i [%s]" % (length // 3, proteins[24]))
                start_codon_found = False
        # Reset the state before moving on to the next frame.
        start_codon_index = 0
        end_codon_index = 0
        start_codon_found = False
    return len(orfs), orfs
Pretty straightforward. Here's the rest of it:
f = open('genome.fa')
seqrecords = list(SeqIO.parse(f, 'fasta'))
rec = seqrecords[0]
seq = str(rec.seq)
comp_seq = str(rec.seq.reverse_complement())
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
length_orf, orfs = getORF(seq, 25, start, stop)
length_orf_complement, orfs_complement = getComplementORF(comp_seq, 25, start, stop)
Do this once for each strand (initial and reverse_complement) then if you sort the output using the following command to give you the 15 longest:
python orf.py | sort -k1,1n | tail -15
the output is:
155 [E]
157 [F]
158 [I]
163 [L]
166 [F]
167 [Q]
170 [T]
171 [E]
173 [R]
175 [C]
176 [E]
187 [S]
201 [E]
211 [H]
234 [T]
This is the correct phrase. The output using the straight up translate is:
TPKSSSILLRPCQCVSDRKHVTRTAYNFFI...KLA - length 178, strand 1, frame 2 [T]
EAQVRFPVFSSDCPLMMLFSRRLLIGLVRR...GRD - length 181, strand -1, frame 0 [L]
KSGGSTREFRGMSVPEAVRFLKILGNICEQ...RNS - length 181, strand 1, frame 2 [L]
YSLGQQGPEGGVSFEVIAVVVHPKTERGSR...TLA - length 181, strand 1, frame 2 [K]
TVDFQRHSLIVVVARNHLLSTRVQAGLSRD...SWG - length 183, strand -1, frame 0 [Q]
RGRLPDYKTTRACAENTIELRFPPSVYISE...TSN - length 185, strand 1, frame 0 [P]
YYSRLKETPPTQPNPAIMGRRSDETALTRQ...RSF - length 191, strand -1, frame 0 [E]
GPYLWFSCLARGTCKTGDIDYRNSSVVDPY...RPT - length 199, strand 1, frame 0 [S]
LHNQQAQECDDFCMRCRHEVSYSLLNKDGF...LIM - length 199, strand 1, frame 0 [L]
PWLHWESSLGNIFTLRPWVHGFYKEPGCNK...CLF - length 199, strand -1, frame 1 [K]
TQPVQFGLYLTHMAGVGTTREGLTQGLMLY...WHI - length 212, strand -1, frame 0 [T]
VSMVANTFIPLSMGCRYITHSICVSRHMRY...LPV - length 212, strand 1, frame 1 [V]
AVWTSGIELAVQQGTRDVILKNGRQIREVS...QSL - length 219, strand 1, frame 0 [R]
ELTRLDITVLSLCNVSPRNVYGINNASASQ...TIR - length 223, strand -1, frame 0 [N]
TRSKSNGLSMEDNRPLFALRRYWDTTSGSS...KGW - length 242, strand -1, frame 2 [D]
You can see that the lengths don't even match up. What gives here? Am I missing something?
Part of the problem is that the "straight up translate" does not take start codons into account.
It just translates in each frame and splits on "*" (stop codon).
Try looking for the first "M"
(= translation of ATG) in each of your sequences and start the sequence at that point...
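As a minimal sketch of that suggestion in plain Python (no Biopython needed once you have the translated frame as a string): split on "*", then cut each piece at its first "M". The function name and the toy protein string are made up for illustration.

```python
def orfs_from_translation(protein):
    """Split a translated frame on stop codons and trim each piece to its first M."""
    orfs = []
    for fragment in protein.split("*"):
        start = fragment.find("M")
        if start != -1:  # pieces without any M contain no start codon
            orfs.append(fragment[start:])
    return orfs


print(orfs_from_translation("TPKMSSI*LLRPC*MVSD*"))  # ['MSSI', 'MVSD']
```

The lengths of the trimmed pieces are then the ones to sort by.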
According to Marcin Ciura's Optimal (best known) sequence of increments for shell sort algorithm,
the best sequence for shellsort is 1, 4, 10, 23, 57, 132, 301, 701...,
but how can I generate such a sequence?
In Marcin Ciura's paper, he said:
Both Knuth’s and Hibbard’s sequences
are relatively bad, because they are
defined by simple linear recurrences.
but most algorithm books I found tend to use Knuth’s sequence: k = 3k + 1, because it's easy to generate. What's your way of generating a shellsort sequence?
Ciura's paper generates the sequence empirically -- that is, he tried a bunch of combinations and this was the one that worked the best. Generating an optimal shellsort sequence has proven to be tricky, and the problem has so far been resistant to analysis.
The best known increment is Sedgewick's, which you can read about here (see p. 7).
If your data set has a definite upper bound in size, then you can hardcode the step sequence. You should probably only worry about generality if your data set is likely to grow without an upper bound.
The sequence shown seems to grow roughly as an exponential series, albeit with quirks. There seems to be a majority of prime numbers, but with non-primes in the mix as well. I don't see an obvious generation formula.
A valid question, assuming you must deal with arbitrarily large sets, is whether you need to emphasise worst-case performance, average-case performance, or almost-sorted performance. If the latter, you may find that a plain insertion sort using a binary search for the insertion step might be better than a shellsort. If you need good worst-case performance, then Sedgewick's sequence appears to be favoured. The sequence you mention is optimised for average-case performance, where the number of comparisons outweighs the number of moves.
I would not be ashamed to take the advice given in Wikipedia's Shellsort article,
With respect to the average number of comparisons, the best known gap
sequences are 1, 4, 10, 23, 57, 132, 301, 701 and similar, with gaps
found experimentally. Optimal gaps beyond 701 remain unknown, but good
results can be obtained by extending the above sequence according to
the recursive formula h_k = \lfloor 2.25 h_{k-1} \rfloor.
Tokuda's sequence [1, 4, 9, 20, 46, 103, ...], defined by the simple formula h_k = \lceil h'_k \rceil,
where h'_k = 2.25 h'_{k-1} + 1, h'_1 = 1, can be recommended for
practical applications.
Guessing from the pseudonym, it seems Marcin Ciura edited the WP article himself.
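Tokuda's formula from the quote is easy to turn into code. A small Python sketch (the function name is mine; exact fractions are used so that rounding errors cannot affect the ceiling):

```python
from fractions import Fraction
from math import ceil


def tokuda_gaps(count):
    """First `count` gaps h_k = ceil(h'_k), with h'_k = 2.25 * h'_(k-1) + 1 and h'_1 = 1."""
    gaps = []
    h = Fraction(1)
    for _ in range(count):
        gaps.append(ceil(h))
        h = Fraction(9, 4) * h + 1
    return gaps


print(tokuda_gaps(6))  # [1, 4, 9, 20, 46, 103]
```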
The sequence is 1, 4, 10, 23, 57, 132, 301, 701, 1750. For every next number after 1750 multiply previous number by 2.25 and round down.
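In Python that could look like the sketch below. The names are mine, and the sort itself is just the textbook gapped insertion sort, included to show where the gaps are used.

```python
CIURA = [1, 4, 10, 23, 57, 132, 301, 701, 1750]


def ciura_gaps(n):
    """Ciura's gaps, extended by floor(2.25 * previous), limited to gaps below n."""
    gaps = list(CIURA)
    while gaps[-1] < n:
        gaps.append(gaps[-1] * 9 // 4)  # integer form of floor(2.25 * previous)
    return [g for g in gaps if g < n] or [1]


def shellsort(items):
    """Return a sorted copy of `items` using the gaps above, largest gap first."""
    a = list(items)
    for gap in reversed(ciura_gaps(len(a))):
        for i in range(gap, len(a)):
            value = a[i]
            j = i
            # Gapped insertion sort: shift larger elements `gap` positions up.
            while j >= gap and a[j - gap] > value:
                a[j] = a[j - gap]
                j -= gap
            a[j] = value
    return a


print(ciura_gaps(10000))  # [1, 4, 10, 23, 57, 132, 301, 701, 1750, 3937, 8858]
print(shellsort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```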
Sedgewick observes that coprimality is good. This rings true: if there are separate ‘streams’ not much cross-compared until the gap is small, and one stream contains mostly smalls and one mostly larges, then the small gap might need to move elements far. Coprimality maximises cross-stream comparison.
Gonnet and Baeza-Yates advise growth by a factor of about 2.2; Tokuda by 2.25. It is well known that if there is a mathematical constant between 2⅕ and 2¼ then it must† be precisely √5 ≈ 2.236.
So start {1, 3}, and then each subsequent is the integer closest to previous·√5 that is coprime to all previous except 1. This sequence can be pre-calculated and embedded in code. There follow the values up to 2⁶⁴ ≈ eighteen quintillion.
{1, 3, 7, 16, 37, 83, 187, 419, 937, 2099, 4693, 10499, 23479, 52501, 117391, 262495, 586961, 1312481, 2934793, 6562397, 14673961, 32811973, 73369801, 164059859, 366848983, 820299269, 1834244921, 4101496331, 9171224603, 20507481647, 45856123009, 102537408229, 229280615033, 512687041133, 1146403075157, 2563435205663, 5732015375783, 12817176028331, 28660076878933, 64085880141667, 143300384394667, 320429400708323, 716501921973329, 1602147003541613, 3582509609866643, 8010735017708063, 17912548049333207, 40053675088540303, 89562740246666023, 200268375442701509, 447813701233330109, 1001341877213507537, 2239068506166650537, 5006709386067537661, 11195342530833252689}
(Obviously, omit those that would overflow the relevant array index type. So if that is a signed long long, omit the last.)
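For anyone who wants to regenerate the list rather than embed it, here is a minimal sketch of the rule as I read it: start with {1, 3}, then repeatedly take the integer closest to previous·√5 that is coprime to every earlier entry other than 1. The function name `sqrt5_coprime_gaps` is made up for illustration. It uses exact integer arithmetic instead of a floating-point √5, which is an implementation choice of this sketch.

```python
from math import gcd, isqrt


def sqrt5_coprime_gaps(count):
    """First `count` gaps: 1, 3, then nearest-to-previous*sqrt(5) coprime integers."""
    gaps = [1, 3]
    product = 3  # product of all gaps so far; gcd(c, product) == 1 means c is coprime to each
    while len(gaps) < count:
        p = gaps[-1]
        floor_t = isqrt(5 * p * p)  # floor(p * sqrt(5)), computed exactly
        lo = floor_t                # nearest coprime candidate at or below the target
        while gcd(lo, product) != 1:
            lo -= 1
        hi = floor_t + 1            # nearest coprime candidate above the target
        while gcd(hi, product) != 1:
            hi += 1
        # hi is closer than lo  iff  hi - t < t - lo  iff  (lo + hi)^2 < 20 p^2
        nxt = hi if (lo + hi) ** 2 < 20 * p * p else lo
        gaps.append(nxt)
        product *= nxt
    return gaps[:count]


print(sqrt5_coprime_gaps(11))
```

Because √5 is irrational the comparison can never be a tie, so the strict inequality is enough.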
On average these have ≈1.96 distinct prime factors and ≈2.07 non-distinct prime factors; 19/55 ≈ 35% are prime; and all but three are square-free (2⁴, 13·19² = 4693, 3291992692409·23³ ≈ 4.0·10¹⁶).
I would welcome formal reasoning about this sequence.
† There’s a little mischief in this “well known … must”. Choosing ∉ℚ guarantees that the closest number that is coprime cannot be a tie, but rational with odd denominator would achieve same. And I like the simplicity of √5, though other possibilities include e^⅘, 11^⅓, π/√2, and √π divided by the Chow-Robbins constant. Simplicity favours √5.
I've found a sequence similar to Marcin Ciura's:
1, 4, 9, 23, 57, 138, 326, 749, 1695, 3785, 8359, 18298, 39744, etc.
For comparison, Ciura's sequence is:
1, 4, 10, 23, 57, 132, 301, 701, 1750
Apart from the leading 1, each term is the mean of the first 2^i primes (i = 2, 3, ...), rounded down. Python code to compute these means is here:
import numpy as np

def isprime(n):
    ''' Check if integer n is a prime '''
    n = abs(int(n))  # n is a positive integer
    if n < 2:  # 0 and 1 are not primes
        return False
    if n == 2:  # 2 is the only even prime number
        return True
    if not n & 1:  # all other even numbers are not primes
        return False
    # Range starts with 3 and only needs to go up to the square root
    # of n for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# To apply a function to a numpy array, one has to vectorize the function
vectorized_isprime = np.vectorize(isprime)
a = np.arange(10000000)
primes = a[vectorized_isprime(a)]
#print(primes)
# Mean of the first 2**i primes for i = 2..19
for i in range(2,20):
    print(primes[0:2**i].mean())
The output is:
4.25
9.625
23.8125
57.84375
138.953125
326.1015625
749.04296875
1695.60742188
3785.09082031
8359.52587891
18298.4733887
39744.887085
85764.6216431
184011.130096
392925.738174
835387.635033
1769455.40302
3735498.24225
The ratio between consecutive terms of the sequence decreases slowly, from about 2.5 toward 2.
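The trend in the ratios can be checked directly from the printed output. A small sketch; the list `means` is simply the output above copied by hand.

```python
# Means of the first 2**i primes (i = 2..19), copied from the output above.
means = [
    4.25, 9.625, 23.8125, 57.84375, 138.953125, 326.1015625,
    749.04296875, 1695.60742188, 3785.09082031, 8359.52587891,
    18298.4733887, 39744.887085, 85764.6216431, 184011.130096,
    392925.738174, 835387.635033, 1769455.40302, 3735498.24225,
]

# Ratio of each term to the one before it.
ratios = [b / a for a, b in zip(means, means[1:])]
for r in ratios:
    print(f"{r:.3f}")
```

After the first step the ratios fall steadily, ending a little above 2.1 for the values shown.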
Maybe this association could improve the Shellsort in the future.
I discussed this question here yesterday, including the gap sequences I have found to work best for a specific (low) n.
In the middle I wrote:
A nasty side-effect of shellsort is that when using a set of random
combinations of n entries (to save processing/evaluation time) to test
gaps you may end up with either the best gaps for n entries or the
best gaps for your set of combinations - most likely the latter.
The problem lies in testing the proposed gaps such that valid conclusions can be drawn. Obviously, testing the gaps against all n! orderings of a set of n unique values is infeasible. Testing in this manner for n=16, for example, means that 16! = 20,922,789,888,000 different orderings must be sorted to determine the exact average, worst and reverse-sorted cases - just to test one set of gaps, and that set might not be the best. 2^(16-2) = 16,384 sets of gaps are possible for n=16, the first being {1} and the last {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1}.
To illustrate how using random combinations might give incorrect results, take n=3, which has six different orderings: 012, 021, 102, 120, 201 and 210. You produce a set of two random sequences to test the two possible gap sets, {1} and {2,1}. Assume that these sequences turn out to be 021 and 201. For {1}, 021 can be sorted with three comparisons (02, 21 and 01) and 201 with three (20, 21 and 01), giving a total of six comparisons; divide by two and voilà, an average of 3 and a worst case of 3. Using {2,1} gives (01, 02, 21 and 01) for 021 and (21, 10 and 12) for 201: seven comparisons, with a worst case of 4 and an average of 3.5. The actual average and worst case for {1} are 8/3 and 3, respectively. For {2,1} the values are 10/3 and 4. The averages were too high in both cases and the worst cases were correct. Had 012 been one of the cases, {1} would have given a 2.5 average - too low.
Now extend this to finding a set of random sequences for n=16 such that no set of gaps tested will be favored in comparison with the others and the result is close (or equal) to the true values, all the while keeping processing to a minimum. Can it be done? Possibly. After all, everything is possible - but is it probable? I think that for this problem random is the wrong approach. Selecting the sequences according to some system may be less bad and might even be good.
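The exact n=3 figures quoted above can be reproduced by brute force. The following is a minimal sketch, not the code I used for my own tests: `shellsort_comparisons` and `exact_stats` are names invented for this illustration, and comparisons are counted in the usual insertion-style inner loop.

```python
from fractions import Fraction
from itertools import permutations


def shellsort_comparisons(values, gaps):
    """Shellsort a copy of `values` with `gaps`; return the number of comparisons."""
    a = list(values)
    comparisons = 0
    for gap in gaps:
        for i in range(gap, len(a)):
            item = a[i]
            j = i
            while j >= gap:
                comparisons += 1          # compare a[j - gap] with item
                if a[j - gap] <= item:
                    break
                a[j] = a[j - gap]         # shift the larger element up by one gap
                j -= gap
            a[j] = item
    assert a == sorted(values)            # sanity check: the gaps must end with 1
    return comparisons


def exact_stats(n, gaps):
    """Exact (average, worst) comparison counts over all n! orderings."""
    counts = [shellsort_comparisons(p, gaps) for p in permutations(range(n))]
    return Fraction(sum(counts), len(counts)), max(counts)


for gaps in ([1], [2, 1]):
    average, worst = exact_stats(3, gaps)
    print(gaps, average, worst)
```

This prints an average of 8/3 with worst case 3 for {1}, and 10/3 with worst case 4 for {2,1}. The same loop is usable only for small n, since the number of orderings grows as n!.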
More information regarding jdaw1's post:
Gonnet and Baeza-Yates advise growth by a factor of about 2.2; Tokuda by 2.25. It is well known that if there is a mathematical constant between 2⅕ and 2¼ then it must† be precisely √5 ≈ 2.236.
Since √5 · √5 = 5, I think every other entry should increase by a factor of five. So the first entry is 1 (plain insertion sort), the second is 3, and every second entry after that is five times the one two places before it. There follow the values up to 2⁶⁴ ≈ eighteen quintillion, with the in-between slots left blank.
{1, 3,, 15,, 75,, 375,, 1 875,, 9 375,, 46 875,, 234 375,, 1 171 875,, 5 859 375,, 29 296 875,, 146 484 375,, 732 421 875,, 3 662 109 375,, 18 310 546 875,, 91 552 734 375,, 457 763 671 875,, 2 288 818 359 375,, 11 444 091 796 875,, 57 220 458 984 375,, 286 102 294 921 875,, 1 430 511 474 609 375,, 7 152 557 373 046 875,, 35 762 786 865 234 375,, 178 813 934 326 171 875,, 894 069 671 630 859 375,, 4 470 348 358 154 296 875,}
The values for the blank slots can be calculated simply by multiplying the preceding value by √5 and rounding to the nearest whole number (that is, rounding 2.2360679775 · 3 · 5^n), giving the resulting array:
{1, 3, 7, 15, 34, 75, 168, 375, 839, 1 875, 4 193, 9 375, 20 963, 46 875, 104 816, 234 375, 524 078, 1 171 875, 2 620 392, 5 859 375, 13 101 961, 29 296 875, 65 509 804, 146 484 375, 327 549 020, 732 421 875, 1 637 745 101, 3 662 109 375, 8 188 725 504, 18 310 546 875, 40 943 627 518, 91 552 734 375, 204 718 137 589, 457 763 671 875, 1 023 590 687 943, 2 288 818 359 375, 5 117 953 439 713, 11 444 091 796 875, 25 589 767 198 563, 57 220 458 984 375, 127 948 835 992 813, 286 102 294 921 875, 639 744 179 964 066, 1 430 511 474 609 375, 3 198 720 899 820 328, 7 152 557 373 046 875, 15 993 604 499 101 639, 35 762 786 865 234 375, 79 968 022 495 508 194, 178 813 934 326 171 875, 399 840 112 477 540 970, 894 069 671 630 859 375, 1 999 200 562 387 704 849, 4 470 348 358 154 296 875, 9 996 002 811 938 524 246}
(Obviously, omit those that would overflow the relevant array index type. So if that is a signed long long, omit the last.)
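A short sketch of how this array could be generated instead of hard-coded. The names `round_times_sqrt5` and `factor5_gaps` and the `limit` parameter are made up for illustration. Note one assumption: the sketch rounds with exact integer arithmetic rather than the 10-decimal constant 2.2360679775, so the very largest entries may differ from the list above in their last digits.

```python
from math import isqrt


def round_times_sqrt5(x):
    """Nearest integer to x * sqrt(5), using exact integer arithmetic."""
    # isqrt(20 x^2) == floor(2 x sqrt(5)); adding 1 and halving rounds to nearest.
    return (isqrt(20 * x * x) + 1) // 2


def factor5_gaps(limit):
    """Gaps 1, 3, round(3*sqrt(5)), 15, round(15*sqrt(5)), 75, ... below `limit`."""
    gaps = [1]
    g = 3
    while g < limit:
        gaps.append(g)
        between = round_times_sqrt5(g)  # value for the slot between g and 5*g
        if between < limit:
            gaps.append(between)
        g *= 5
    return gaps


print(factor5_gaps(10000))
```

Passing the array length as `limit` drops the gaps that would be too large to be useful, which also takes care of the overflow remark above.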