Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7],               # sequence 1
     [6, 5, 10, 11],                      # sequence 2
     [9, 8, 2, 3, 4, 5],                  # sequence 3
     [12, 12, 6, 5],                      # sequence 4
     [5, 8, 3, 4, 2],                     # sequence 5
     [1, 5],                              # sequence 6
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  # sequence 7
     [7, 1, 7, 3, 4, 1, 2],               # sequence 8
     [9, 4, 12, 12, 6, 5, 1],             # sequence 9
     ]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in position i = M (or i = M+1, i = M+2, ..., i = L), then we count the subsequence of length M where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5],    # taken from sequence 1
           [2, 3, 4, 5],    # taken from sequence 3
           [12, 12, 6, 5],  # taken from sequence 4
           [8, 8, 3, 5],    # taken from sequence 7
           [1, 4, 12, 5],   # taken from sequence 7
           [12, 12, 6, 5],  # taken from sequence 9
           ]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
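For clarity, here is a brute-force reference implementation of the rules above (my own sketch, only meant to pin down the expected behaviour; it keeps ties at the N-th rank as described):

from collections import Counter

def top_subsequences(lists, target, M, N):
    counts = Counter()
    for seq in lists:
        # A window is counted for every occurrence of target that has at
        # least M - 1 elements before it; this covers all four rules.
        for i, value in enumerate(seq):
            if value == target and i >= M - 1:
                counts[tuple(seq[i - M + 1:i + 1])] += 1
    ranked = counts.most_common()
    if len(ranked) > N:
        cutoff = ranked[N - 1][1]          # keep ties at the N-th rank
        ranked = [(s, c) for s, c in ranked if c >= cutoff]
    return ranked

print(top_subsequences(x, target=5, M=4, N=2))
# [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]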
Important
This is super simplified but, in reality:
my actual list-of-sequences consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there known algorithms for performing this kind of analysis for various combinations of N and M? I've looked at suffix trees but I'd have to roll my own custom version to even get close to what I need.
For the same dataset, I need to repeatedly query it with different combinations of target, N, and M (where target <= 10,000, N <= 100 and M <= 100). How can I do this efficiently?
Extending on my comment, here is a sketch of how you could approach this using an out-of-the-box suffix array:
1) Reverse and concatenate your lists, inserting a stop symbol after each one (I used 0 here).
[7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 5, 6, 0, 5, 4, 3, 2, 8, 9, 0, 5, 6, 12, 12, 0, 2, 4, 3, 8, 5, 0, 5, 1, 0, 6, 5, 12, 4, 1, 9, 5, 3, 8, 8, 2, 0, 2, 1, 4, 3, 7, 1, 7, 0, 1, 5, 6, 12, 12, 4, 9]
2) Build a suffix array
[53, 45, 24, 30, 12, 19, 33, 7, 32, 6, 47, 54, 51, 38, 44, 5, 46, 25, 16, 4, 15, 49, 27, 41, 37, 3, 14, 48, 26, 59, 29, 31, 40, 2, 13, 10, 20, 55, 35, 11, 1, 34, 21, 56, 52, 50, 0, 43, 28, 42, 17, 18, 39, 60, 9, 8, 23, 36, 58, 22, 57]
3) Build the LCP array. The LCP array tells you how many leading numbers a suffix has in common with its neighbour in the suffix array; however, you need to stop counting when you encounter a stop symbol.
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 3, 2, 2, 1, 0, 1, 1, 1, 4, 1, 2, 4, 1, 0, 1, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0]
4) When a query comes in (target = 5, M = 4), you search for the first occurrence of your target in the suffix array and scan the corresponding LCP entries until the starting number of the suffixes changes. Below is the part of the LCP array that corresponds to all suffixes starting with 5.
[..., 1, 1, 1, 4, 1, 2, 4, 1, 0, ...]
This tells you that there are two sequences of length 4 that each occur twice. Brushing over some details: using the indexes you can find the sequences and reverse them back to get your final results.
Complexity
Building the suffix array is O(n) time and O(n) space, where n is the total number of elements in all lists
Building the LCP array is also O(n) in both time and space
Searching for a target number in the suffix array is O(log n) on average
The cost of scanning through the relevant suffixes is linear in the number of occurrences of the target, which should be about n/10,000 on average given your parameters
The first two steps happen offline. Querying is technically O(n) (due to step 4) but with a small constant (about 0.0001).
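To make the pipeline concrete, here is a deliberately naive sketch (my own code, not the poster's): the suffix array is built by plain sorting, which is O(n² log n), so a real deployment would use a linear-time construction such as SA-IS, and the query scans the whole suffix array instead of binary-searching and skipping through the LCP array. It only illustrates steps 1 and 4, using the example data x from the question.

from collections import Counter

def build_text(lists, stop=0):
    # Step 1: reverse each list and append a stop symbol.
    text = []
    for seq in lists:
        text.extend(reversed(seq))
        text.append(stop)
    return text

def suffix_array(text):
    # Step 2, naive version: sort suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])

def query(text, sa, target, M, N, stop=0):
    # Step 4, simplified: visit the suffixes starting with target in
    # suffix-array order; equal length-M windows are adjacent there, and a
    # Counter stands in for what the LCP scan would tell you.
    counts = Counter()
    for i in sa:
        if text[i] != target:
            continue
        window = text[i:i + M]
        if len(window) == M and stop not in window:
            counts[tuple(reversed(window))] += 1
    ranked = counts.most_common()
    if len(ranked) > N:
        cutoff = ranked[N - 1][1]          # keep ties at the N-th rank
        ranked = [(s, c) for s, c in ranked if c >= cutoff]
    return ranked

text = build_text(x)
sa = suffix_array(text)
print(query(text, sa, target=5, M=4, N=2))
# [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]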
Given two lists, for example:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
I wish to find a series of moves which will transform list a into list b, where each move is an operation:
move(from_index, to_index)
which moves the element at location from_index and places it at location to_index. So if:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
then the operation move(3,1) on the list a will transform a into:
a = [0, 3, 1, 2, 4, 5, 6, 7, 8, 9]
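In Python terms, the operation amounts to a pop followed by an insert; a minimal in-place helper (my own sketch, just to nail down the semantics) could be:

def move(seq, from_index, to_index):
    # Remove the element at from_index and re-insert it at to_index.
    seq.insert(to_index, seq.pop(from_index))

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
move(a, 3, 1)
print(a)  # [0, 3, 1, 2, 4, 5, 6, 7, 8, 9]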
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
move(0, 8)
a = [1, 2, 3, 4, 5, 6, 7, 0, 8, 9]
move(0, 8)
a = [2, 3, 4, 5, 6, 7, 0, 1, 8, 9]
move(1, 8)
a = [2, 4, 5, 6, 7, 0, 1, 3, 8, 9]
move(2, 8)
a = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
a==b
Hopefully that's what you're looking for.
Basically, start with the left-most element and move it to where it should be. For example, I took 0 and placed it right after the value it is supposed to end up behind, which is 7. I continued moving from left to right until all of the elements were in the desired order.
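Here is a sketch of that left-to-right idea in code (my own formulation, assuming the elements are distinct; it moves each wanted element directly into position i rather than pushing elements to the back, so the individual moves differ from the trace above, but it also needs four moves on this example):

def moves_to_transform(a, b):
    # Greedy: for each position i, move the element that b wants at i
    # into place. Assumes a and b hold the same distinct elements.
    a = list(a)
    ops = []
    for i, wanted in enumerate(b):
        j = a.index(wanted, i)         # current location of the element
        if j != i:
            a.insert(i, a.pop(j))      # this is move(j, i)
            ops.append((j, i))
    return ops

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
print(moves_to_transform(a, b))  # [(2, 0), (5, 1), (6, 2), (7, 3)]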
I'd iterate over the second sequence (the target order) and swap items in the first. I wrote this in Python:
>>> a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
>>> def swap(seq, i, j):
...     a = seq[i]
...     seq[i] = seq[j]
...     seq[j] = a
...
>>> for index_in_b, value in enumerate(b):
...     index_in_a = a.index(value)
...     if index_in_b != index_in_a:
...         swap(a, index_in_a, index_in_b)
...         print('move {} to {}'.format(index_in_a, index_in_b))
move 0 to 2
move 1 to 4
move 2 to 6
move 3 to 7
move 4 to 6
move 5 to 6
move 6 to 7
In this case I'm moving the items in the first sequence by swapping them.
Update
We can slightly improve the performance in Python by inlining the swap as a tuple assignment, which also removes the function call. Here is a performance comparison:
import timeit
s1 = """
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
def swap(seq, i, j):
a = seq[i]
seq[i] = seq[j]
seq[j] = a
for index_in_b, value in enumerate(b):
index_in_a = a.index(value)
if index_in_b != index_in_a:
swap(a, index_in_a, index_in_b)"""
s2 = """
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
for index_in_b, value in enumerate(b):
index_in_a = a.index(value)
if index_in_b != index_in_a:
a[index_in_a], a[index_in_b] = a[index_in_b], a[index_in_a]"""
# on an i7 macbook pro
timeit.timeit(s1)
4.087386846542358
timeit.timeit(s2)
3.5381240844726562
Slightly better, but there are surely better ways to achieve this.
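One such improvement (my own sketch, not from the original answer) is to precompute each value's index in a, so the repeated a.index scans that make the loop quadratic become O(1) dictionary lookups kept up to date as we swap:

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [2, 4, 6, 7, 0, 1, 3, 5, 8, 9]
pos = {value: index for index, value in enumerate(a)}

for i, value in enumerate(b):
    j = pos[value]
    if i != j:
        a[i], a[j] = a[j], a[i]
        pos[a[j]] = j                  # the displaced element moved to j
        pos[value] = i
print(a == b)  # True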
Quite often, I have to generate sequences of numbers in some semi-random way, meaning not totally random but satisfying some extra property. For example, we might need a random sequence of 1s, 2s, 3s and 4s in which no number is repeated three times in a row. These are usually not very complicated, but I ran into a tricky one: I need to generate a semi-random sequence that is a bit over 400 elements long, is composed of 1s, 2s, 3s and 4s, in which each number appears the same number of times (or, if the length is not divisible by four, as close to that as possible), and no number repeats three times in a row (so 1,3,4,4,4,2 is not OK).
I tried two methods:
Create a list with the desired length and counts of each number; shuffle; check the consecutive-number condition; if it fails, shuffle again.
Create a list with the desired length and counts of each number; generate all permutations and select the ones that are OK; save these for later and randomly select one of them when needed.
Method number one runs for minutes before yielding an acceptable sequence, and method number two generates so many permutations that my Jupyter notebook gave up.
Here's the Python code for the first one:
from random import shuffle

v = []
for x in range(108):
    v += [1, 2, 3, 4]

shouldicontinue = 1
while shouldicontinue:
    shuffle(v)
    shouldicontinue = 0
    for h in range(len(v) - 2):  # stop at len(v) - 2 so v[h + 2] stays in range
        if v[h] == v[h + 1] and v[h] == v[h + 2]:
            shouldicontinue = 1
            break
and the second one:
import itertools

v = []
for x in range(108):
    v += [1, 2, 3, 4]

good = []
for p in itertools.permutations(v):
    notok = 0
    for h in range(len(p) - 2):
        if p[h] == p[h + 1] and p[h] == p[h + 2]:
            notok = 1
            break
    if not notok:
        good.append(p)
I'm looking for a way to solve this problem efficiently, i.e. if it runs in real time it shouldn't need more than, say, a minute to generate a sequence on slower computers, or if it is prepared in advance in some way (like the idea of method 2), it can be prepared on a moderate computer in a few hours.
Before you can check all the permutations of a >400-element list, the universe will likely have died. Thus you need another approach.
Here, I recommend trying to insert the elements in the list at random, but shifting to the next index when the insertion would break one of the requirements.
Cycling through your elements, 1 to 4 in your case, should ensure an insertion is always possible.
from itertools import cycle, islice
from random import randint

def has_repeated(target, n, lst):
    """A helper to check if insertion would break the max repetition requirement"""
    count = 0
    for el in lst:
        count += el == target
        if count == n:
            return True
    return False

def sequence(length, max_repeat, elements=(1, 2, 3, 4)):
    # Iterator that will yield our elements in cycle
    values = islice(cycle(elements), length)
    seq = []
    for value in values:
        # Pick an insertion index at random
        init_index = randint(0, len(seq))
        # Loop over indices from that index until a legal position is found
        for shift in range(len(seq) + 1):
            # Wrap around so the candidate index always stays in [0, len(seq)]
            index = (init_index - shift) % (len(seq) + 1)
            slice_around_index = seq[max(0, index - max_repeat):index + max_repeat]
            # If the insertion would cause no forbidden subsequence, insert
            if not has_repeated(value, max_repeat, slice_around_index):
                seq.insert(index, value)
                break
        # This will likely never happen, except if a solution truly does not exist
        else:
            raise ValueError('failed to generate the sequence')
    return seq
Sample
Here is some sample output to check the result is correct.
for _ in range(10):
    print(sequence(25, 2))
Output
[4, 1, 4, 1, 3, 2, 1, 2, 4, 1, 4, 2, 1, 2, 2, 4, 3, 3, 1, 4, 3, 1, 2, 3, 3]
[3, 1, 3, 2, 2, 4, 1, 2, 2, 4, 3, 4, 1, 3, 4, 3, 2, 4, 4, 1, 1, 2, 1, 1, 3]
[1, 3, 2, 4, 1, 3, 4, 4, 3, 2, 4, 1, 1, 3, 1, 2, 4, 2, 3, 1, 1, 2, 4, 3, 2]
[1, 3, 2, 4, 1, 2, 2, 1, 2, 3, 4, 3, 2, 4, 2, 4, 1, 1, 3, 1, 3, 4, 1, 4, 3]
[4, 1, 4, 4, 1, 1, 3, 1, 2, 2, 3, 2, 4, 2, 2, 3, 1, 3, 4, 3, 2, 1, 3, 1, 4]
[2, 3, 3, 1, 3, 3, 1, 2, 1, 2, 1, 2, 3, 4, 4, 1, 3, 4, 4, 2, 1, 1, 4, 4, 2]
[3, 2, 1, 4, 3, 2, 3, 1, 4, 1, 1, 2, 3, 3, 2, 2, 4, 1, 1, 2, 4, 1, 4, 3, 4]
[4, 4, 3, 1, 4, 1, 2, 2, 4, 4, 3, 2, 2, 3, 3, 1, 1, 2, 1, 1, 4, 1, 2, 3, 3]
[1, 4, 1, 4, 4, 2, 4, 1, 1, 2, 1, 2, 2, 3, 3, 2, 2, 3, 1, 4, 4, 3, 3, 1, 3]
[4, 3, 2, 1, 4, 1, 1, 2, 2, 3, 3, 1, 4, 4, 1, 3, 2, 3, 4, 2, 1, 1, 4, 2, 3]
Efficiency-wise, it takes around 10 ms to generate a list of length 10,000 with the same requirements, hinting that this might be an efficient enough solution for most purposes.
I think it should be possible (with about 4 gigabytes of memory and 1 minute of precomputation) to generate uniformly distributed random sequences faster than 1 second per random sequence.
The idea is to prepare a cache of results for the question "How many sequences with exactly a 1s, b 2s, c 3s, d 4s are there which end with count copies of a particular digit?".
Once you have this cache, you can compute how many sequences (N) there are that satisfy your constraint, and you can generate one at random by picking a random number n between 1 and N and using the cache to generate the n-th sequence.
To save memory in the cache you can use a couple of tricks:
The answer is symmetric in a/b/c/d so you only need to store results with a>=b>=c>=d
The count of the last digit will always be 1 or 2 in legal sequences
These tricks should mean the cache only needs to hold about 40 million results.
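A minimal sketch of that counting-plus-sampling idea (my own code, written without the symmetry and memory tricks above, so as written it only scales to short sequences):

import random
from functools import lru_cache

@lru_cache(maxsize=None)
def ways(counts, last, run):
    # Number of valid completions, given the remaining budget per digit
    # (counts), the last digit placed (0-3, or -1 at the start) and the
    # length of its current run (always <= 2).
    if sum(counts) == 0:
        return 1
    total = 0
    for d in range(4):
        if counts[d] == 0 or (d == last and run == 2):
            continue
        nxt = counts[:d] + (counts[d] - 1,) + counts[d + 1:]
        total += ways(nxt, d, run + 1 if d == last else 1)
    return total

def sample(budgets=(5, 5, 5, 5)):
    # Draw uniformly among all valid sequences by weighting each next-digit
    # choice with the number of completions it leaves open.
    seq, last, run, counts = [], -1, 0, tuple(budgets)
    while sum(counts):
        options = []
        for d in range(4):
            if counts[d] == 0 or (d == last and run == 2):
                continue
            nxt = counts[:d] + (counts[d] - 1,) + counts[d + 1:]
            w = ways(nxt, d, run + 1 if d == last else 1)
            if w:
                options.append((d, nxt, w))
        r = random.randrange(sum(w for _, _, w in options))
        for d, nxt, w in options:
            if r < w:
                break
            r -= w
        seq.append(d + 1)
        run = run + 1 if d == last else 1
        last, counts = d, nxt
    return seq

print(sample())  # a uniformly random valid sequence of length 20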
import random

# Start from one random digit, then repeatedly append a digit that differs
# from the last element followed by a free pick: the forced pick caps every
# run at two, though it does nothing to balance how often each digit appears.
rc = [random.choice([1, 2, 3, 4])]
for _ in range(22):
    rc.append(random.choice([d for d in (1, 2, 3, 4) if d != rc[-1]]))
    rc.append(random.choice([1, 2, 3, 4]))
print(rc)
Using ElasticSearch I'm trying to use the minimum_should_match option on a Terms Query to find documents that have a list of longs that is X% similar to the list of longs I'm querying with.
e.g.:
{
    "filter": {
        "fquery": {
            "query": {
                "terms": {
                    "mynum": [1, 2, 3, 4, 5, 6, 7, 8, 9, 13],
                    "minimum_should_match": "90%",
                    "disable_coord": False
                }
            }
        }
    }
}
will match two documents with a mynum list of:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
and:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]
This works and is correct, since the first document has a 10 at the end where the query contained a 13, and the second document contains an 11 where, again, the query contained a 13.
This means that 1 out of 10 numbers in my query's list is different in the returned document, which amounts to the allowed 90% similarity (minimum_should_match) in the query.
Now, the issue I have is that I would like the behaviour to be different: since the second document is longer, with 11 numbers in place of 10, its difference level should ideally be higher, because it actually has two values, 11 and 12, that are not in the query's list. E.g.:
Instead of computing the intersection of:
(list1) [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
with:
(list2) [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]
which is a 10% difference
it should say that since list2 is longer than list1, the intersection should be:
(list2) [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]
with:
(list1) [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
which is roughly an 18% difference (2 of list2's 11 values are not in list1)
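To make the two computations concrete, here is a tiny plain-Python illustration (not an ElasticSearch query) of both directions:

query = [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
doc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]
shared = len(set(query) & set(doc))
print(1 - shared / len(query))  # 0.1   -> difference relative to the query
print(1 - shared / len(doc))    # ~0.18 -> difference relative to the document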
Is this possible?
If not, how could I factor in the length of the list, besides using a dense vector rather than a sparse one? E.g.:
using
[1, 2, 3, 4, 5, 6, 7, 8, 9, , , , 13]
rather than:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 13]
I need to calculate the possible number of outcomes for detail screens.
The details are: we have one textbox into which a number from 0 to 7 has to be entered. There are 13 categories of outcomes, but the average of all outcomes should be equal to the number entered in the textbox.
For example: textbox: __enter a number from 1 to 7__(if 3)______.
categories 1: 1, 2, 3, 4, 5, 6, 7
categories 2: 1, 2, 3, 4, 5, 6, 7
categories 3: 1, 2, 3, 4, 5, 6, 7
categories 4: 1, 2, 3, 4, 5, 6, 7
categories 5: 1, 2, 3, 4, 5, 6, 7
categories 6: 1, 2, 3, 4, 5, 6, 7
categories 7: 1, 2, 3, 4, 5, 6, 7
categories 8: 1, 2, 3, 4, 5, 6, 7
categories 9: 1, 2, 3, 4, 5, 6, 7
categories 10: 1, 2, 3, 4, 5, 6, 7
categories 11: 1, 2, 3, 4, 5, 6, 7
categories 12: 1, 2, 3, 4, 5, 6, 7
categories 13: 1, 2, 3, 4, 5, 6, 7
The average should be 3. This is one possibility; I need the number of possibilities with screens like this.
Can anyone help me out with this? I guess it deals with some probability distributions.
As a Haskell list comprehension (with only 3 parameters, but you get the point):
lists input = [ [a,b,c] | a <- [0..7], b <- [0..7], c <- [0..7], a + b + c == 3 * input ]  -- average == input, expressed via the sum to stay exact
Then you select a random list from this list of lists.
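If only the count is needed, a small dynamic program avoids materialising the outcomes; here is a hedged Python sketch (my own, assuming 13 categories scored 1 to 7 and an exact average):

from functools import lru_cache

@lru_cache(maxsize=None)
def count(categories, total):
    # Ways to pick `categories` values in 1..7 that sum to `total`.
    if categories == 0:
        return 1 if total == 0 else 0
    return sum(count(categories - 1, total - v)
               for v in range(1, 8) if v <= total)

print(count(13, 13 * 3))  # screens whose 13 scores average exactly 3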