Algorithm Complexity: Time and Space

I have two solutions to one coding problem. Both of the solutions work fine. I just need to find the time/space complexity for both of them.
Question: Given a string of words, return the largest set of unique words in the given string that are anagrams of each other.
Example:
Input: 'I am bored and robed, nad derob'
Correct output: {bored, robed, derob}
Wrong output: {and, nad}
Solution 1:
In the first solution, I iterate over the words in the given string, sort the characters of each word, and use the sorted result as a key in a dictionary. The original (unsorted) word is added to a set of words stored as the value for that key. Sorting is what groups the words that are anagrams of each other. At each iteration, I also keep track of the key whose set of words is the longest. At the end, I return the set stored under that key.
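Roughly, that first solution could look like the sketch below (the function name and the punctuation handling are my assumptions, mirroring the second solution):
from collections import defaultdict

def solution_one(string):
    d = defaultdict(set)
    best_key = None
    for word in string.replace(",", "").replace(".", "").split():
        key = "".join(sorted(word.lower()))   # anagrams share the same sorted key
        d[key].add(word)
        if best_key is None or len(d[key]) > len(d[best_key]):
            best_key = key
    return d[best_key] if best_key else set()

print(solution_one('I am bored and robed, nad derob'))
# {'bored', 'robed', 'derob'} (set order may vary)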
Solution 2:
In the second solution, I do almost the same. The only difference is that instead of sorting the characters of a word, I map each character to a prime number and use the product of those primes as the key in my dictionary.
from collections import defaultdict

def solution_two(string):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
    d = defaultdict(set)
    longest_set = ''
    for word in string.replace(",", "").replace(".", "").split():
        key = 1
        for ch in word:
            key *= primes[ord(ch.lower()) - 97]   # the prime product is identical for anagrams
        d[key].add(word)
        if len(d[key]) > len(d[longest_set]):
            longest_set = key
    return d[longest_set]
My thoughts:
I think that the runtime of the first solution is O(n), where n is the number of words in the string. But if I sort each word of the string, wouldn't it make the runtime O(n) * O(n log n)?
As for the second one, I think it has a linear runtime too. But there is a second loop inside the first one, where I iterate through each character of a word...
I am also confused about the space complexity for both of the solutions.
Any guidance would be greatly appreciated. Thank you.

ALG1: time is O(n) * O(awl log(awl)) and space is O(n) * O(awl), where n is the number of words and awl is the average word length. This looks pretty good, but be careful: if awl is much smaller than n you get roughly O(n) time, while if awl is bigger you get O(awl log(awl)); the worst case of all is when n and awl are about the same size.
ALG2: time is O(n) * O(awl), where O(n) is for iterating over the given string and O(awl) is for computing the prime product of each word (i.e. take a word, look up a prime in primes for each character, and multiply them), and space is O(n).
The same considerations on n and awl apply as for the previous algorithm.
So, if I am correct, in the worst case your second algorithm has a complexity of n², better than the first one, and it is better in space too!
Just to recap :)

Related

Splitting a list of elements summing up to a cutoff value

Recently I ran into a very basic problem. My feeling is that it must have a well-known theoretical solution that I am not aware of. The problem statement is this:
Let [x_1, x_2, ..., x_n] be a list of positive integers. I need to split these numbers into the minimum number of groups such that the sum of the numbers in each group does not exceed N.
For example: if the list is [67, 56, 12345, 555555, 555555, 555555] and N = 1000000, then one minimum split would be [[555555, 12345, 67, 56], [555555], [555555]].
One solution I thought of is this:
1. Sort the list.
2. Take the max value and add the minimum values one by one so that the sum does not exceed N.
3. Remove the elements used in step 2 from the list, and repeat steps 2 and 3 on the modified list until the list is exhausted.
The problem is that I am not sure this will always produce the minimum number of groups.
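For reference, a minimal sketch of that greedy heuristic (the function name is mine, and it is not claimed to be optimal):
def greedy_split(numbers, N):
    # Step 1: sort ascending, so the maximum is at the end and the minimums at the front.
    remaining = sorted(numbers)
    splits = []
    while remaining:
        # Step 2: start a group with the current maximum ...
        group = [remaining.pop()]
        total = group[0]
        # ... and keep adding the smallest remaining values while the sum stays within N.
        while remaining and total + remaining[0] <= N:
            smallest = remaining.pop(0)
            group.append(smallest)
            total += smallest
        # Step 3: the used elements are already removed; repeat on what is left.
        splits.append(group)
    return splits

print(greedy_split([67, 56, 12345, 555555, 555555, 555555], N=1000000))
# [[555555, 56, 67, 12345], [555555], [555555]]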
Thanks for any solution.

Algorithm to find sentences in text

Friends, I am looking for a good algorithm to search for given multi-word phrases in a large text. For simplicity, I assume the text is tokenized and all the words have already been located in it. Thus, if I have a phrase of three words (in practice there may be more), I first look up the positions of each of these words in the text, so an array of integers is associated with each of the three words in the phrase. These arrays do not necessarily have the same length.
Maybe an example will help. Assume we need to find the phrase "all white cats" in this text:
...this is just a dummy text about cats. In this text I want to write the phrase that all cats are white but the fact is not all cats are white. But in case there are some white cats, anyway, we need to write about them. All facts about the cats...
If we assume the word "this" has position 30, then the positions of each word from the initial phrase are:
all: 48, 57, 76
white: 51, 60, 67
cats: 37, 49, 58, 68, 80
As you can see, we can combine those words into different "phrases", and each phrase will have its own "quality". Quality can be calculated as the sum of the distances from each word to the virtual "phrase center".
The two occurrences of "all cats are white" are good phrases, with a quality of 3.33. The other words can also be combined into phrases, but those will be low-quality.
My question is how to find a good algorithm that produces a list of all such phrases, where each phrase has a center coordinate and the positions of its words. I know it can be done by directly computing the distance between every word and every other word, but that can take ages for a sufficiently big text and sufficiently long phrases.
To simplify, I am thinking of limiting the lookup distance (say, 5 words) from each word.
But beyond that, I cannot see how to make the calculation faster.
I feel there is a ready-made algorithm for this, but I cannot find one.
Thanks!
Let's prepare an intermediate data structure of sorted positions with their corresponding words (pos_words below). For each triplet of consecutive positions we check that all required words are present, and for valid triplets we calculate the score/quality value.
See the model implementation in Python:
def calculate_score(data):
    def score(positions):
        center = sum(positions) / len(positions)
        return sum(abs(p - center) for p in positions)

    word_set = set(data)
    word_count = len(word_set)
    pos_words = {p: word for word, positions in data.items() for p in positions}
    positions = sorted(pos_words)
    return [
        (positions[i], score(positions[i:i+word_count]))
        for i in range(len(positions) - word_count + 1)
        if set(pos_words[positions[i+j]] for j in range(word_count)) == word_set
    ]

data = {
    "all": [48, 57, 76],
    "white": [51, 60, 67],
    "cats": [37, 49, 58, 68, 80],
}
print(calculate_score(data))
The result contains positions of the first word of the triplet together with calculated scores.
[(48, 3.3333333333333357),
(49, 9.333333333333336),
(51, 8.666666666666664),
(57, 3.3333333333333357),
(67, 11.333333333333329)]

Finding set of products formed by two lists

Given two lists, say A = [1, 3, 2, 7] and B = [2, 3, 6, 3]
Find the set of all products that can be formed by multiplying a number in A with a number in B (by set, I mean no duplicates). I am looking for the fastest running time possible. Hash functions are not allowed.
The first approach would be brute force, where we multiply every number in A with every number in B, and if we find a product that is not already in the list, we add it to the list. Computing all possible products costs O(n^2), and checking whether a product is already present in the list costs up to O(n^2) per check, so the total comes to O(n^4).
I am looking to optimize this solution. The first thing that comes to mind is to remove duplicates in list B. In my example, 3 is a duplicate, and I do not need to compute the products of all elements of A with the duplicate 3 again. But this still does not reduce the overall runtime.
I am guessing the fastest possible running time is O(n^2) if all the numbers in A and B combined are unique AND prime. That way it is guaranteed that there will be no duplicates, and I do not need to verify whether a product is already in the list. So I am wondering whether we can pre-process the input lists so that unique product values are guaranteed (one way to pre-process is to remove duplicates in list B, as mentioned above).
Is this possible in O(n^2) time, and does it make a difference if I only care about the number of unique products instead of the actual products?
for i = 1 to A.length:
    for j = 1 to B.length:
        if (A[i] * B[j]) not already present in list:   // takes O(n^2) time to verify this
            add (A[i] * B[j]) to list
        end if
    end for
end for
print list
Expected result for the above input: 2, 3, 6, 9, 18, 4, 12, 14, 21, 42
EDIT:
I can think of an O(n^2 log n) solution:
1) Generate all possible product values without worrying about duplicates; this is O(n^2).
2) Sort these product values; this is O(n^2 log n), because we have n^2 numbers to sort.
3) Remove the duplicates in linear time, since the elements are now sorted.
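A rough sketch of that sort-based approach, avoiding hashing entirely (the cost is dominated by sorting the n^2 products):
def unique_products(A, B):
    # Step 1: generate all n^2 products, duplicates included -- O(n^2).
    products = [a * b for a in A for b in B]
    # Step 2: sort them -- O(n^2 log n), since there are n^2 values.
    products.sort()
    # Step 3: one linear pass over the sorted list drops the duplicates.
    result = []
    for p in products:
        if not result or p != result[-1]:
            result.append(p)
    return result

print(unique_products([1, 3, 2, 7], [2, 3, 6, 3]))
# [2, 3, 4, 6, 9, 12, 14, 18, 21, 42]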
Use sets to eliminate duplicates.
A = [3, 6, 6, 8]
B = [7, 8, 56, 3, 2, 8]
setA = set(A)
setB = set(B)
prod = {i * j for i in setA for j in setB}   # set comprehension: duplicates are discarded automatically
print(prod)
{64, 448, 6, 168, 9, 42, 12, 16, 48, 18, 336, 21, 24, 56}
Complexity is O(n^2).
Another way is the following, with O(n^3) complexity:
# Keep prod sorted: insert each product at its sorted position, skipping duplicates.
prod = []
A = [1, 2, 2, 3]
B = [5, 6, 6, 7]
for i in A:
    for j in B:
        if prod == []:
            prod.append(i * j)
            continue
        for k in range(len(prod)):
            if i * j < prod[k]:
                prod.insert(k, i * j)   # insert at the sorted position
                break
            elif i * j == prod[k]:
                break                   # duplicate, skip it
            if k == len(prod) - 1:
                prod.append(i * j)      # larger than everything seen so far
print(prod)
Yet another way, although toolz.unique most likely uses hashing internally (it tracks already-seen elements in a set).
from toolz import unique
A=[1,2,2,3]
B=[5,5,7,8]
print(list(unique([i*j for i in A for j in B])))

Algorithm to find similar strings in a list of many strings

I know about approximate string matching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any pairs that are similar to each other (say, at most 1 Damerau-Levenshtein distance apart). So something like this:
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
matching_strings(l)
# Output
# [["moose","mouse"],["rat", "cat"]]
I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.
UPDATE:
Thanks to Collapsar's help, here is a solution in Python
import numpy
import functools
alphabet = {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist = []
for string in l:
    fv = [0] * 26   # letter-frequency vector
    for letter in string:
        fv[alphabet[letter]] += 1
    fvlist.append(fv)
fvlist.sort(key=functools.cmp_to_key(lambda fv1, fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))
However, the sorted vectors are returned in the following order:
"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"
which I would consider sub-optimal, because I would want "moose" and "mouse" to end up next to each other. I understand that however I sort these words, there is no way to get every word next to all of its closest pairs. Still, I am open to alternative solutions.
One way to do that (with complexity O(n k^2), where n is the number of strings and k is the length of the longest string) is to convert every string into a set of masks like this:
rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?
This way, if two words differ in one letter, like 'rat' and 'cat', they will both have the mask ?at among others, while if one word is the other with a single extra letter, like 'rat' and 'rats', they will both have the mask 'rat?'.
Then you just group strings based on their masks and print the groups that have more than one string. You might want to dedupe your array first if it has duplicates.
Here is example code, with an extra "cats" string in it.
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}
def add(mask, s):
if mask not in d:
d[mask] = []
d[mask].append(s)
for s in l:
for pos in range(len(s)):
add(s[:pos] + '?' + s[pos + 1:], s)
add(s[:pos] + '?' + s[pos:], s)
add(s + '?', s)
for k, v in d.items():
if len(v) > 1:
print v
Outputs
['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']
First, index your list with some fuzzy-search index.
Second, iterate over your list and find each entry's neighbors with a quick lookup in the pre-built index.
About fuzzy indexing:
About 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and this modification is named the "Wilbur-Khovayko algorithm".
Basic idea: split strings into trigrams and search for the maximal intersection score.
For example, the string "hello world" generates the trigrams hel, ell, llo, "lo ", "o_w", and so on; it also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.
Then, for each trigram, an index is built recording the terms in which it is present.
So, there is a list of term_IDs for each trigram.
When the user submits a query string, it is also split into trigrams; the program searches for the maximal intersection score and generates an N-sized result list.
It works quickly: I remember that on an old Sun/Solaris machine (256 MB RAM, 200 MHz CPU) it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz
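To illustrate only the trigram idea (this is not the original Wilbur-Khovayko implementation; the padding scheme and function names here are my own simplifications):
from collections import defaultdict

def trigrams(s):
    # Pad each word with '$' so prefixes and suffixes get their own trigrams.
    padded = " ".join("$" + w + "$" for w in s.lower().split())
    return {padded[i:i+3] for i in range(len(padded) - 2)}

def build_index(terms):
    # Map each trigram to the set of term ids containing it.
    index = defaultdict(set)
    for tid, term in enumerate(terms):
        for tg in trigrams(term):
            index[tg].add(tid)
    return index

def n_closest(query, terms, index, n=3):
    # Score terms by the number of trigrams they share with the query.
    scores = defaultdict(int)
    for tg in trigrams(query):
        for tid in index[tg]:
            scores[tid] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [terms[tid] for tid in ranked[:n]]

terms = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
index = build_index(terms)
print(n_closest("mousse", terms, index))   # ['mouse', 'moose']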
The naive implementation amounts to setting up a boolean matrix indexed by the strings (i.e. their position in the sorted list) and comparing each pair of strings, setting the corresponding matrix element to true iff the strings are 'similar' wrt your criterion. This will run in O(n^2).
You might be better off by transforming your strings into tuples of character frequencies ( e.g. 'moose' -> (0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,0,0,0,1,0,0,0,0,0,0,0) where the i-th vector component represents the i-th letter in the alphabet). Note that the frequency vectors will differ in 'few' components only ( e.g. for D-L distance 1 in at most 2 components, the respective differences being +1,-1 ).
Sort your transformed data. Candidates for the pairs you wish to generate will be adjacent or at least 'close' to each other in your sorted list of transformed values. You check the candidates by comparing each list entry with at most k of its successors in the list, k being a small integer (actually comparing the corresponding strings, of course). This algorithm will run in O(n log n).
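A sketch of that sort-then-check-k-successors step (k, the candidate filter, and the vector encoding are my assumptions; candidate pairs should still be confirmed with an actual Damerau-Levenshtein check):
def freq_vector(word):
    # 26-dimensional letter-frequency vector (lowercase a-z assumed).
    fv = [0] * 26
    for ch in word:
        fv[ord(ch) - ord('a')] += 1
    return fv

def candidate_pairs(words, k=4):
    # Sort by frequency vector, then compare each entry only with its next k neighbours.
    items = sorted(words, key=freq_vector)
    pairs = []
    for i, w1 in enumerate(items):
        for w2 in items[i + 1:i + 1 + k]:
            diff = [a - b for a, b in zip(freq_vector(w1), freq_vector(w2))]
            # Candidate filter: the vectors may differ by at most 1 in at most two components.
            if sum(abs(d) for d in diff) <= 2:
                pairs.append((w1, w2))
    return pairs

print(candidate_pairs(["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]))
# [('mouse', 'moose'), ('rat', 'cat')]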
You have to trade off the added overhead of transformation and sorting (with comparison operations that depend on the representation you choose for the frequency vectors) against the reduced number of comparisons. The method considers only the occurrence of characters, not their positions within the word, so depending on the actual set of strings there will be many candidate pairs that do not turn out to be actually 'similar' pairs.
As your data appears to consist of English lexemes, a potential optimisation would be to define character classes ( e.g. vowels/consonants, 'e'/other vowels/syllabic consonants/non-syllabic consonants ) instead of individual characters.
Additional optimisation:
Note that precisely those pairs of strings in your data set that are permutations of each other (e.g. [art, tar]) produce identical values under the given transformation. So if you limit yourself to a D-L distance of 1, and you do not count the transposition of adjacent characters as a single edit step, you should never pick list items with identical transformation values as candidates.

Recovering element of an array, given sums of items at indexes matching bitmasks

Suppose there was an array E of 2^n elements. For example:
E = [2, 3, 5, 7, 11, 13, 17, 19]
Unfortunately, someone has come along and scrambled the array. They took every element whose index in binary is of the form XX1 and added it into the element at index XX0 (i.e. they did E[0] += E[1], E[2] += E[3], etc.). Then they did the same thing for indexes of the form X1X into X0X, and for 1XX into 0XX.
More specifically, they ran this pseudo-code over the array:
def scramble(e):
    n = len(e).bit_length() - 1   # lg_2(len(e)), assuming the length is a power of two
    for p in range(n):
        m = 1 << p
        for i in range(len(e)):
            if (i & m) != 0:
                e[i - m] += e[i]
In terms of our example, this causes:
E_1 = [2+3, 3, 5+7, 7, 11+13, 13, 17+19, 19]
E_1 = [5, 3, 12, 7, 24, 13, 36, 19]
E_2 = [5+12, 3+7, 12, 7, 24+36, 13+19, 36, 19]
E_2 = [17, 10, 12, 7, 60, 32, 36, 19]
E_3 = [17+60, 10+32, 12+36, 7+19, 60, 32, 36, 19]
E_3 = [77, 42, 48, 26, 60, 32, 36, 19]
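For a quick sanity check, running the scramble code above on the example array reproduces E_3:
E = [2, 3, 5, 7, 11, 13, 17, 19]
scramble(E)
print(E)   # [77, 42, 48, 26, 60, 32, 36, 19]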
You're given the array after it's been scrambled (i.e. your input is E_3). Your goal is to recover the original first element of E, (i.e. the number 2).
One way to get the 2 back is to undo all the scrambling: run the scrambling code, but with the += replaced by a -=. However, doing that is expensive: it takes n * 2^n time. Is there a faster way?
Alternate Form
Stated another way: I give you an array S where the element at index i is the sum of all elements of a list E whose index j satisfies (j & i) == i. For example, S[101110] is E[101110] + E[111110] + E[101111] + E[111111]. How expensive is it to recover an element of E, given S?
The item at 111111... is easy, because S[111111...] = E[111111...], but S[000000...] depends on all the elements of E in a non-uniform way, so it seems to be harder to get back.
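A brute-force sketch of that definition, just to pin it down (it reproduces the scrambled array from above):
def superset_sums(E):
    # S[i] = sum of E[j] over all j that contain every set bit of i, i.e. (j & i) == i.
    n = len(E)
    return [sum(E[j] for j in range(n) if (j & i) == i) for i in range(n)]

print(superset_sums([2, 3, 5, 7, 11, 13, 17, 19]))
# [77, 42, 48, 26, 60, 32, 36, 19] -- equals E_3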
Extended
What if we don't just want to recover the original items, but want to recover sums of the original items that match a mask which can specify must-be-1, no-constraint, and must-be-0 for each bit? Is this harder?
Call the number of items in the array N, and the size of the bitmasks being used B so N = 2^B.
You can't do better than O(N).
The example solution in the question, which just runs the scrambling in reverse, takes O(N B) time. We can reduce that to O(N) by discarding items that won't contribute to the actual value we read at the end. This makes the unscrambling much simpler, actually: just iteratively subtract the last half of the array from the first half, then discard the last half, until you have one item left.
def unscrambleFirst(S):
    S = list(S)
    while len(S) > 1:
        h = len(S) // 2
        # Item-by-item subtraction: subtract the last half from the first half,
        # then discard the last half.
        S = [S[i] - S[h + i] for i in range(h)]
    return S[0]
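For example, with the scrambled array from the question:
print(unscrambleFirst([77, 42, 48, 26, 60, 32, 36, 19]))   # 2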
It's not possible to go faster than O(N). We can prove it with linear algebra.
The original array has N independent items, i.e. it is a vector with N degrees of freedom.
The scrambling operation only uses linear operations, and so is equivalent to multiplying that vector by a matrix. (The matrix is [[1, 1], [0, 1]] tiled inside of itself B times; it ends up looking like a Sierpinski triangle).
The scrambling operation matrix is invertible (that's why we can undo the scrambling).
Therefore the scrambled vector must still have N degrees of freedom.
But our O(N) solution is a linear combination of every element of the scrambled vector.
And since the elements of the scrambled vector must all be linearly independent for there to be N degrees of freedom in it, we can't rewrite the usage of any one element with usage of the others.
Therefore we can't change which items we rely on, and we know that we rely on all of them in one case so it must be all of them in all cases.
Hopefully that's clear enough. The scrambling distributes the first item in a way that requires you to look at every item to get it back.
