Range Summation of Array - algorithm

We are given an array A and two types of queries:
Update: A[x] = V
Query: find the sum over [1, R] with stride d, i.e. A[1] + A[1+d] + A[1+2d] + ... (up to R)
I know that for d = 1 this can be done with a BIT or a segment tree, but what do I do when d > 1?
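A common technique for this kind of query (a sketch of the standard square-root decomposition idea, not taken from this thread) is to keep one Fenwick tree per (stride, residue) pair for every stride up to about sqrt(N), and to answer queries with a larger stride by direct summation, which then touches at most about sqrt(N) elements. A hedged Python sketch with illustrative names and 0-based indexing:

import math

class _Fenwick:
    # minimal Fenwick (BIT): point add, prefix sum
    def __init__(self, vals):
        self.t = [0] * (len(vals) + 1)
        for i, v in enumerate(vals):
            self.add(i, v)
    def add(self, i, delta):
        i += 1
        while i < len(self.t):
            self.t[i] += delta
            i += i & -i
    def prefix(self, i):  # sum of vals[0..i]; 0 when i < 0
        i += 1
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

class StridedSums:
    def __init__(self, a):
        self.a = list(a)
        self.D = max(1, math.isqrt(len(a)))
        # one tree per (stride d, residue r); total memory is O(N*sqrt(N)),
        # so in practice the threshold D may need tuning downward
        self.trees = {(d, r): _Fenwick(self.a[r::d])
                      for d in range(1, self.D + 1) for r in range(d)}
    def update(self, i, v):  # point update A[i] = v
        delta = v - self.a[i]
        self.a[i] = v
        for d in range(1, self.D + 1):
            self.trees[(d, i % d)].add(i // d, delta)
    def query(self, start, bound, d):  # A[start] + A[start+d] + ... while index <= bound
        if d <= self.D:
            t = self.trees[(d, start % d)]
            lo = start // d
            return t.prefix(lo + (bound - start) // d) - t.prefix(lo - 1)
        return sum(self.a[i] for i in range(start, bound + 1, d))

s = StridedSums([5, 1, 4, 2, 3])
print(s.query(0, 4, 2))  # a[0] + a[2] + a[4] = 12
s.update(2, 10)
print(s.query(0, 4, 2))  # now 18

With this layout, small-stride queries cost O(log N), large-stride queries O(sqrt(N)), and an update O(sqrt(N) log N).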

Related

Finding a Summation in range

I am given two arrays, A and B, where B contains indices into A.
Let
A = [2,3,5,6,7,8,9]
B = [1,3,1,4,5,2,6]
There are Q queries where I have to find the sum over the range L to R using B.
Example: L = 2, R = 4
Sum = A[B[2]] + A[B[3]] + A[B[4]]
Sum = A[3] + A[1] + A[4] = 5 + 2 + 6 = 13
There is also an update query, e.g.
U 2 10
A[2] = 10 // previously 3
S 6 7
Sum = A[B[6]] + A[B[7]] = A[2] + A[6] = 10 + 8 = 18
Is there any solution better than O(Q*N) - some data structure that supports both updates and sums in this case (something like a segment tree or BIT)?
Constraints:
N, Q, Value < 10^6
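One plausible approach (a sketch only, with illustrative names and 0-based indices): keep a Fenwick tree over the derived array C[i] = A[B[i]], answer S queries as ordinary prefix sums, and on an update of A[x] push the delta to every position of B that references x.

from collections import defaultdict

class IndirectSums:
    def __init__(self, a, b):
        self.a = list(a)
        self.pos = defaultdict(list)       # value x -> positions i with b[i] == x
        for i, x in enumerate(b):
            self.pos[x].append(i)
        self.fen = [0] * (len(b) + 1)      # Fenwick tree over C[i] = a[b[i]]
        for i, x in enumerate(b):
            self._add(i, a[x])

    def _add(self, i, delta):
        i += 1
        while i < len(self.fen):
            self.fen[i] += delta
            i += i & -i

    def _prefix(self, i):                  # sum of C[0..i]; 0 when i < 0
        i += 1; s = 0
        while i > 0:
            s += self.fen[i]; i -= i & -i
        return s

    def update(self, x, v):                # A[x] = v
        delta = v - self.a[x]
        self.a[x] = v
        for i in self.pos[x]:
            self._add(i, delta)

    def query(self, l, r):                 # sum of A[B[l..r]]
        return self._prefix(r) - self._prefix(l - 1)

# The example from the question, converted to 0-based indices:
ds = IndirectSums([2, 3, 5, 6, 7, 8, 9], [0, 2, 0, 3, 4, 1, 5])
print(ds.query(1, 3))   # L=2, R=4 in 1-based terms -> 13
ds.update(1, 10)        # U 2 10
print(ds.query(5, 6))   # S 6 7 -> 18

Queries cost O(log N); an update costs O(k log N), where k is the number of occurrences of x in B, so this is only fast when no single index dominates B.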

Haskell Performance Optimization

I am writing code to find the nth Ramanujan-Hardy number. A Ramanujan-Hardy number is defined as
n = a^3 + b^3 = c^3 + d^3
meaning n can be expressed as a sum of two cubes in two different ways.
I wrote the following code in Haskell:
-- my own implementation of cube root. Expected time complexity is O(n^(1/3))
cube_root n = chelper 1 n
  where
    chelper i n = if i*i*i > n then (i-1) else chelper (i+1) n

-- Checks whether the given number can be expressed as a^3 + b^3 = c^3 + d^3 (is it a Ramanujan-Hardy number?)
is_ram n = length [a | a<-[1..crn], b<-[(a+1)..crn], c<-[(a+1)..crn], d<-[(c+1)..crn], a*a*a + b*b*b == n && c*c*c + d*d*d == n] /= 0
  where
    crn = cube_root n

-- Finds the nth Ramanujan number by iterating from 1 until the nth number is found. In the recursion, if x is a Ramanujan number, decrement n; otherwise increment x. When n reaches 0, the preceding number was the desired Ramanujan number.
ram n = give_ram 1 n
  where
    give_ram x 0 = (x-1)
    give_ram x n = if is_ram x then give_ram (x+1) (n-1) else give_ram (x+1) n
In my opinion, the time complexity of checking whether a number is a Ramanujan number is O(n^(4/3)).
Running this code in GHCi, it takes a long time even to find the 2nd Ramanujan number.
What are possible ways to optimize this code?
First a small clarification of what we're looking for. A Ramanujan-Hardy number is one which may be written two different ways as a sum of two cubes, i.e. a^3+b^3 = c^3 + d^3 where a < b and a < c < d.
An obvious idea is to generate all of the cube-sums in sorted order and then look for adjacent sums which are the same.
Here's a start - a function which generates all of the cube sums with a given first cube:
cubes a = [ (a^3+b^3, a, b) | b <- [a+1..] ]
All of the possible cube sums in order is just:
allcubes = sort $ concat [ cubes 1, cubes 2, cubes 3, ... ]
but of course this won't work, since concat and sort don't work on infinite lists.
However, since cubes a is an increasing sequence we can sort all of
the sequences together by merging them:
allcubes = cubes 1 `merge` cubes 2 `merge` cubes 3 `merge` ...
Here we are taking advantage of Haskell's lazy evaluation. The definition
of merge is just:
merge [] bs = bs
merge as [] = as
merge as@(a:at) bs@(b:bt)
  = case compare a b of
      LT -> a : merge at bs
      EQ -> a : b : merge at bt
      GT -> b : merge as bt
We still have a problem since we don't know where to stop. We can solve that
by having cubes a initiate cubes (a+1) at the appropriate time, i.e.
cubes a = ...an initial part... ++ (...the rest... `merge` cubes (a+1) )
The definition is accomplished using span:
cubes a = first ++ (rest `merge` cubes (a+1))
  where
    s = (a+1)^3 + (a+2)^3
    (first, rest) = span (\(x,_,_) -> x < s) [ (a^3+b^3, a, b) | b <- [a+1..] ]
So now cubes 1 is the infinite series of all the possible sums a^3 + b^3 where a < b in sorted order.
To find the Ramanujan-Hardy numbers, we just group adjacent elements of the list together which have the same first component:
sameSum (x,a,b) (y,c,d) = x == y
rjgroups = groupBy sameSum $ cubes 1
The groups we are interested in are those whose length is > 1:
rjnumbers = filter (\g -> length g > 1) rjgroups
The first 10 solutions are:
ghci> take 10 rjnumbers
[(1729,1,12),(1729,9,10)]
[(4104,2,16),(4104,9,15)]
[(13832,2,24),(13832,18,20)]
[(20683,10,27),(20683,19,24)]
[(32832,4,32),(32832,18,30)]
[(39312,2,34),(39312,15,33)]
[(40033,9,34),(40033,16,33)]
[(46683,3,36),(46683,27,30)]
[(64232,17,39),(64232,26,36)]
[(65728,12,40),(65728,31,33)]
Your is_ram function checks for a Ramanujan number by trying all values for a, b, c and d up to the cube root, and then you loop over all candidates n.
An alternative approach would be to simply loop over values for a and b up to some limit and increment an array at index a^3+b^3 by 1 for each choice.
The Ramanujan numbers can then be found by iterating over non-zero values in this array and returning places where the array content is >=2 (meaning that at least 2 ways have been found of computing that result).
I believe this would be O(n^(2/3)), compared to your method which is O(n·n^(4/3)).
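A hedged Python sketch of this counting approach (the limit and names are illustrative):

# Count, for every value up to `limit`, the number of ways it can be
# written as a^3 + b^3 with a < b; values hit at least twice are
# Ramanujan-Hardy numbers.
def ramanujan_numbers(limit):
    counts = [0] * (limit + 1)
    a = 1
    while a**3 + (a + 1)**3 <= limit:
        b = a + 1
        while (s := a**3 + b**3) <= limit:
            counts[s] += 1
            b += 1
        a += 1
    return [n for n, c in enumerate(counts) if c >= 2]

print(ramanujan_numbers(100000))
# [1729, 4104, 13832, 20683, 32832, 39312, 40033, 46683, 64232, 65728]

Memory is O(limit), which is the price paid for the better time bound.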

find all indices of multiple value pairs in a matrix

Suppose I have a matrix A, containing possible value pairs, and a matrix B, containing all value pairs:
A = [1,1;2,2;3,3];
B = [1,1;3,4;2,2;1,1];
I would like to create a matrix C that contains all pairs that are allowed by A (i.e. C = [1,1;2,2;1,1]).
Using C = ismember(A,B,'rows') will only show the first occurrence of [1,1], but I need both.
Currently I use a for-loop to create C, which looks like:
TFtot = false(size(B,1),1);
for i = 1:size(A,1)
    TF1 = A(i,1) == B(:,1) & A(i,2) == B(:,2);
    TFtot = TF1 | TFtot;
end
C = B(TFtot,:);
I would like to create a faster approach, because this loop currently greatly slows down the algorithm.
You're pretty close. You just need to swap B and A, then use this output to index into B:
L = ismember(B, A, 'rows');
C = B(L,:);
How ismember works in this particular case is that it outputs a logical vector with the same number of rows as B, where the ith value tells you whether the ith row of B has been found somewhere in A (logical 1) or not (logical 0).
You want to select out those entries in B that are seen in A, and so you simply use the output of ismember to slice into B to extract out the affected rows, and grab all of the columns.
We get for C:
>> C
C =
1 1
2 2
1 1
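For comparison, a rough NumPy analogue of the same row-membership test, keeping the duplicate rows of B (array names mirror the question):

import numpy as np

A = np.array([[1, 1], [2, 2], [3, 3]])
B = np.array([[1, 1], [3, 4], [2, 2], [1, 1]])

# does each row of B equal some row of A?
mask = (B[:, None, :] == A[None, :, :]).all(axis=2).any(axis=1)
print(B[mask])  # [[1 1] [2 2] [1 1]]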
Here's an alternative using bsxfun:
C = B(all(any(bsxfun(@eq, B, permute(A, [3 2 1])),3),2),:);
Or you could use pdist2 (Statistics Toolbox):
B(any(~pdist2(A,B),1),:);
Using matrix-multiplication-based Euclidean distance calculations:
Bt = B.';
[m,n] = size(A);
dists = [A.^2 ones(size(A)) -2*A ]*[ones(size(Bt)) ; Bt.^2 ; Bt];
C = B(any(dists==0,1),:);
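The same matrix-multiplication distance trick, sketched in NumPy; comparing against exact zero is safe here because the inputs are integers:

import numpy as np

A = np.array([[1, 1], [2, 2], [3, 3]])
B = np.array([[1, 1], [3, 4], [2, 2], [1, 1]])

# squared Euclidean distances: |A_i|^2 + |B_j|^2 - 2*A_i.B_j
d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * (A @ B.T)
print(B[(d2 == 0).any(axis=0)])  # [[1 1] [2 2] [1 1]]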

Enumerate matrix combinations with fixed row and column sums

I'm attempting to find an algorithm (not a MATLAB command) to enumerate all possible N×M matrices with the constraints of having only non-negative integers in each cell and fixed sums for each row and column (these sums are the parameters of the algorithm).
Example:
Enumerate all 2x3 matrices with row totals 2, 1 and column totals 0, 1, 2:
| 0 0 2 | = 2
| 0 1 0 | = 1
0 1 2
| 0 1 1 | = 2
| 0 0 1 | = 1
0 1 2
This is a rather simple example, but as N and M increase, as well as the sums, there can be a lot of possibilities.
Edit 1
I might have a valid arrangement to start the algorithm:
matrix = new Matrix(N, M) // N×M matrix filled with 0s
FOR i FROM 0 TO N - 1
    FOR j FROM 0 TO M - 1
        a = target_row_sum[i] - matrix.rows[i].sum()
        b = target_column_sum[j] - matrix.columns[j].sum()
        matrix[i, j] = min(a, b)
    END FOR
END FOR
target_row_sum[i] being the expected sum on row i.
In the example above it gives the 2nd arrangement.
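For reference, a runnable Python version of that pseudocode; on the example it indeed produces the 2nd arrangement:

# Fill each cell with the largest value its row and column still allow.
def initial_arrangement(row_sums, col_sums):
    n, m = len(row_sums), len(col_sums)
    matrix = [[0] * m for _ in range(n)]
    rem_r, rem_c = list(row_sums), list(col_sums)
    for i in range(n):
        for j in range(m):
            matrix[i][j] = min(rem_r[i], rem_c[j])
            rem_r[i] -= matrix[i][j]
            rem_c[j] -= matrix[i][j]
    return matrix

print(initial_arrangement([2, 1], [0, 1, 2]))
# [[0, 1, 1], [0, 0, 1]] -- the 2nd arrangement from the example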
Edit 2:
(based on j_random_hacker's last statement)
Let M be any matrix satisfying the given conditions (fixed row and column sums, non-negative cell values).
Let (a, b, c, d) be 4 cell values in M where (a, b) and (c, d) each share a row, and (a, c) and (b, d) each share a column.
Let Xa be the row number of the cell containing a and Ya be its column number.
Example:
| 1 a b |
| 1 2 3 |
| 1 c d |
-> Xa = 0, Ya = 1
-> Xb = 0, Yb = 2
-> Xc = 2, Yc = 1
-> Xd = 2, Yd = 2
Here is an algorithm to get all the combinations satisfying the initial conditions while varying only a, b, c and d:
// A matrix array initially containing a single element, M.
// It will be filled with all possible combinations.
matrices = [M]
I = min(a, d)
J = min(b, c)
FOR i FROM 1 TO I
    tmp_matrix = M
    tmp_matrix[Xa, Ya] = a - i
    tmp_matrix[Xb, Yb] = b + i
    tmp_matrix[Xc, Yc] = c + i
    tmp_matrix[Xd, Yd] = d - i
    matrices.add(tmp_matrix)
END FOR
FOR j FROM 1 TO J
    tmp_matrix = M
    tmp_matrix[Xa, Ya] = a + j
    tmp_matrix[Xb, Yb] = b - j
    tmp_matrix[Xc, Yc] = c - j
    tmp_matrix[Xd, Yd] = d + j
    matrices.add(tmp_matrix)
END FOR
It should then be possible to find every possible combination of matrix values:
Apply the algorithm to the first matrix for every possible group of 4 cells;
Recursively apply the algorithm to each sub-matrix obtained in the previous iteration, for every possible group of 4 cells except any group already used in a parent execution;
The recursive depth should be (N*(N-1)/2)*(M*(M-1)/2), each execution resulting in ((N*(N-1)/2)*(M*(M-1)/2) - depth)*(I+J+1) sub-matrices. But this creates a LOT of duplicate matrices, so this could probably be optimized.
Do you need this to calculate Fisher's exact test? That requires exactly what you're doing, and based on that page it seems there will in general be a vast number of solutions, so you probably can't do better than a brute-force recursive enumeration if you want every solution. OTOH, it seems Monte Carlo approximations are successfully used by some software instead of full-blown enumerations.
I asked a similar question, which might be helpful. Although that question deals with preserving frequencies of letters in each row and column rather than sums, some results can be translated across. E.g. if you find any submatrix (pair of not-necessarily-adjacent rows and pair of not-necessarily-adjacent columns) with numbers
x y
y x
Then you can rearrange these to
y x
x y
without changing any row or column sums. However:
mhum's answer proves that there will in general be valid matrices that cannot be reached by any sequence of such 2x2 swaps. This can be seen by taking his 3x3 matrices and mapping A -> 1, B -> 2, C -> 4 and noticing that, because no element appears more than once in a row or column, frequency preservation in the original matrix is equivalent to sum preservation in the new matrix. However...
someone's answer links to a mathematical proof that it actually will work for matrices whose entries are just 0 or 1.
More generally, if you have any submatrix
a b
c d
where the (not necessarily unique) minimum is d, then you can replace this with any of the d+1 matrices
e f
g h
where h = d-i, g = c+i, f = b+i and e = a-i, for any integer 0 <= i <= d.
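A tiny Python sketch of those d+1 replacements for one 2x2 submatrix (returned as flat tuples for brevity):

def variants(a, b, c, d):
    # all d+1 replacements (e, f, g, h) with e = a-i, f = b+i, g = c+i, h = d-i
    return [(a - i, b + i, c + i, d - i) for i in range(d + 1)]

print(variants(2, 1, 1, 1))
# [(2, 1, 1, 1), (1, 2, 2, 0)] -- row sums (3, 2) and column sums (3, 2) are preserved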
For an N×M matrix you have N·M unknowns and N+M equations, of which only N+M-1 are independent (the row totals and the column totals share the same grand total). Put arbitrary numbers into the top-left (N-1)×(M-1) sub-matrix. Now you can find the closed form for the rest of the elements trivially.
More details: there are a total of T = N·M elements.
There are R = (N-1)·(M-1) freely filled elements.
Remaining number of unknowns: T - R = N·M - (N-1)·(M-1) = N+M-1.
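A hedged Python sketch of that closed-form idea: choose the free (N-1)×(M-1) block, solve for the forced last row and column, and keep the matrix only if every entry stays non-negative. The per-cell bounds and names are illustrative:

from itertools import product

def enumerate_matrices(row_sums, col_sums):
    n, m = len(row_sums), len(col_sums)
    # candidate values for each free cell of the top-left (n-1)x(m-1) block
    free = [range(min(row_sums[i], col_sums[j]) + 1)
            for i in range(n - 1) for j in range(m - 1)]
    for cells in product(*free):
        mat = [[0] * m for _ in range(n)]
        it = iter(cells)
        for i in range(n - 1):
            for j in range(m - 1):
                mat[i][j] = next(it)
            mat[i][m - 1] = row_sums[i] - sum(mat[i][:m - 1])  # forced by row total
        for j in range(m):
            # last row forced by column totals
            mat[n - 1][j] = col_sums[j] - sum(mat[i][j] for i in range(n - 1))
        if (all(v >= 0 for row in mat for v in row)
                and sum(mat[n - 1]) == row_sums[n - 1]):
            yield mat

for mat in enumerate_matrices([2, 1], [0, 1, 2]):
    print(mat)
# [[0, 0, 2], [0, 1, 0]] and [[0, 1, 1], [0, 0, 1]]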

algorithm to find closest string using same characters

Given a list L of n character strings and an input character string S, what is an efficient way to find the character string in L that contains the most characters that exist in S? We want to find the string in L that is most closely made up of the letters contained in S.
The obvious answer is to loop through all n strings and check how many characters in the current string exist in S. However, this algorithm will be run frequently, and the list L of n strings will be stored in a database... looping manually through all n strings would require something like O(n*m^2), where n is the number of strings in L and m is the max length of any string in L, as well as the max length of S... in this case m is actually a constant of 150.
Is there a better way than just a simple loop? Is there a data structure I can load the n strings into that would give me fast search ability? Is there an algorithm that uses the pre-calculated meta-data about each of the n strings that would perform better than a loop?
I know there are a lot of geeks out there that are into the algorithms. So please help!
Thanks!
If you are after substrings, a trie or Patricia trie might be a good starting point.
If you don't care about the order, just about the number of each symbol or letter, I would calculate the histogram of all strings and then compare them with the histogram of the input.
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...
This will lower the cost to O(26·n + m), plus a one-time preprocessing pass, if you consider only case-insensitive Latin letters.
If m is constant, you could interpret a histogram as a 26-dimensional vector on the 26-dimensional unit sphere by normalizing it. Then you could just calculate the dot product of two vectors, yielding the cosine of the angle between them, and this value should be proportional to the similarity of the strings.
Assume m = 3, an alphabet A = { 'U', 'V', 'W' } of size three, and the following list of strings.
L = { "UUU", "UVW", "WUU" }
The histograms are the following.
H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }
A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r) with r the Euclidean norm of the histogram h - that is, r = sqrt(x² + y² + z²).
H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }
The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).
Now we can compare two histograms h1 = (a, b, c) and h2 = (x, y, z) using the Euclidean distance between them.
d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)
For the example we obtain.
d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828
Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).
Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z'), we can instead calculate a similarity score as the dot product of both histograms.
d'(h1', h2') = a'x' + b'y' + c'z'
For the example we obtain.
d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200
Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).
Both versions yield different numbers, but the resulting ranking is always the same. One could also use other norms - Manhattan distance (L1 norm), for example - but this would only change the numbers, because norms on finite-dimensional vector spaces are all equivalent.
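A compact Python sketch of this histogram comparison (restricted, as above, to case-insensitive Latin letters):

from collections import Counter
import math

def histogram(s):
    # 26-entry letter histogram, case-insensitive, non-letters ignored
    counts = Counter(c for c in s.lower() if 'a' <= c <= 'z')
    return [counts[chr(ord('a') + i)] for i in range(26)]

def euclidean(h1, h2):  # smaller = more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(h1, h2)))

def cosine(h1, h2):     # larger = more similar; dot product of normalized vectors
    dot = sum(x * y for x, y in zip(h1, h2))
    n1 = math.sqrt(sum(x * x for x in h1))
    n2 = math.sqrt(sum(x * x for x in h2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

L = ["UUU", "UVW", "WUU"]
S = "VVW"
hs = histogram(S)
print(min(L, key=lambda w: euclidean(histogram(w), hs)))  # UVW
print(max(L, key=lambda w: cosine(histogram(w), hs)))     # UVW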
Sounds like you need a trie. Tries are used to search for words similar to the way a spell checker works. So if the string S has its characters in the same order as the strings in L, then this may work for you.
If, however, the order of the characters in S is not relevant - like a set of Scrabble tiles from which you want to find the longest word - then this is not your solution.
What you want is a BK-Tree. It's a bit unintuitive, but very cool - and it makes it possible to search for elements within a Levenshtein (edit) distance threshold in O(log n) time.
If you care about ordering in your input strings, use them as is. If you don't, you can sort the individual characters before inserting them into the BK-Tree (or querying with them).
I believe what you're looking for can be found here: Fuzzy Logic Based Search Technique
It's pretty heavy, but so is what you're asking for. It talks about word similarities, and character misplacement.
i.e.:
L I N E A R T R N A S F O R M
L I N A E R T R A N S F O R M
L E N E A R T R A N S F R M
It seems to me that the order of the characters is not important in your problem; rather, you are searching for "near-anagrams" of the word S.
If that's so, then you can represent every word in the set L as an array of 26 integers (assuming your alphabet has 26 letters). You can represent S similarly as an array of 26 integers; now, to find the best match, you just run once through the set L and calculate a distance metric between the S-vector and the current L-vector, however you want to define the distance metric (e.g. Euclidean / sum-of-squares or Manhattan / sum of absolute differences). This is an O(n) algorithm because the vectors have constant length.
Here is a T-SQL function that has been working great for me; it gives you the edit distance:
Example:
SELECT TOP 1 [StringValue], edit_distance([StringValue], 'Input Value')
FROM [SomeTable]
ORDER BY edit_distance([StringValue], 'Input Value')
The Function:
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
            @cv0 varbinary(8000), @cv1 varbinary(8000)
    SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
    WHILE @i <= @s1_len
    BEGIN
        SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
                CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
            IF @c > @c_temp SET @c = @c_temp
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int) + 1
            IF @c > @c_temp SET @c = @c_temp
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
        END
        SELECT @cv1 = @cv0, @i = @i + 1
    END
    RETURN @c
END
