Computing partial sums in OpenCL - parallel-processing

A 1D dataset is divided into segments, and each work item processes one segment. It reads a number of elements from the segment; the number of elements is not known beforehand and differs for each segment.
For example:
+----+----+----+----+----+----+----+----+----+  <-- segments
 A    BCD  E    FG   HIJK L    M    N           <-- elements in each segment
After all segments have been processed, they should write their elements contiguously to output memory, like
A B C D E F G H I J K L M N
So the absolute output position of the elements from one segment depends on the number of elements in the previous segments. E is at position 4 because segment 1 contains 1 element (A) and segment 2 contains 3 elements (BCD).
The OpenCL kernel writes the number of elements for each segment into a local/shared memory buffer, and works like this (pseudocode)
kernel void k(
    constant uchar* input,
    global int* output,
    local int* segment_element_counts)
{
    int segment = get_local_id(0);
    int count = count_elements(&input[segment * segment_size]);
    segment_element_counts[segment] = count;
    barrier(CLK_LOCAL_MEM_FENCE);

    ptrdiff_t position = 0;
    for (int previous_segment = 0; previous_segment < segment; ++previous_segment)
        position += segment_element_counts[previous_segment];

    global int* output_ptr = &output[position];
    read_elements(&input[segment * segment_size], output_ptr);
}
So each work item has to calculate a partial sum using a loop, where the work items with larger id do more iterations.
Is there a more efficient way to implement this (each work item calculate a partial sum of a sequence, up to its index), in OpenCL 1.2? OpenCL 2 seems to provide work_group_scan_inclusive_add for this.

You can do N partial (prefix) sums in log2(N) iterations using something like this:
offsets[get_local_id(0)] = count;
barrier(CLK_LOCAL_MEM_FENCE);
for (ushort combine = 1; combine < total_num_segments; combine *= 2)
{
    if (get_local_id(0) & combine)
    {
        offsets[get_local_id(0)] +=
            offsets[(get_local_id(0) & ~(combine * 2u - 1u)) | (combine - 1u)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
Given segment element counts of
a b c d
The successive iterations will produce:
a b+a c d+c
and
a b+a c+(b+a) (d+c)+(b+a)
Which is the result we want.
So in the first iteration, we've divided the segment element counts into groups of 2, and sum within them. Then we merge 2 groups at a time into 4 elements, and propagate the result from the first group into the second. We grow the groups again to 8, and so on.
The key observation is that this pattern also matches the binary representation of the index of each segment:
0: 0b00 1: 0b01 2: 0b10 3: 0b11
Index 0 performs no sums. Both indices 1 and 3 perform a sum in the first iteration (bit 0/LSB = 1), whereas indices 2 and 3 perform a sum in the second iteration (bit 1 = 1). That explains this line:
if (get_local_id(0) & combine)
The other statement that really needs an explanation is of course
offsets[get_local_id(0)] +=
offsets[(get_local_id(0) & ~(combine * 2u - 1u)) | (combine - 1u)];
Calculating the index at which we find the previous prefix sum we want to accumulate onto our work-item's sum is a little tricky. The subexpression (combine * 2u - 1u) takes the value (2^n - 1) on each iteration (for n starting at 1):
1 = 0b001
3 = 0b011
7 = 0b111
…
By bitwise-masking these bit suffixes off the work-item index (i.e. i & ~x), this gives you the index of the first item in the current group.
The (combine - 1u) subexpression then gives you the index within the current group of the last item of the first half. Putting the two together gives you the overall index of the item you want to accumulate into the current segment.
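As a sanity check, the scan can be replayed on the host. This plain-Python sketch (the function name is mine) performs the same in-place updates; the inner loop stands in for the work items running between two barriers, which is safe to serialise because each updated slot reads only from a slot whose `combine` bit is clear, i.e. a slot not written in the same iteration:

```python
def inclusive_scan(counts):
    # Simulates the work-group scan above on a list of segment counts.
    offsets = list(counts)
    combine = 1
    while combine < len(offsets):
        # one barrier-to-barrier step: "work item" i updates its own slot
        for i in range(len(offsets)):
            if i & combine:
                offsets[i] += offsets[(i & ~(combine * 2 - 1)) | (combine - 1)]
        combine *= 2
    return offsets

print(inclusive_scan([1, 3, 1, 2]))  # [1, 4, 5, 7]
```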
There is one slight ugliness in the result: it's shifted to the left by one: so segment 1 needs to use offsets[0], and so on, while segment 0's offset is of course 0. You can either over-allocate the offsets array by 1 and perform the prefix sums on the subarray starting at index 1 and initialise index 0 to 0, or use a conditional.
There are probably profiling-driven micro-optimisations you can make to the above code.

Related

Fast O(1) algorithm to evenly distribute continuous numbers into continuous subdivisions and output a specific subdivision boundary

I have a subset of natural numbers specified by a given max number. For example if the given is 7 then the list is
1,2,3,4,5,6,7
Now I am given another input, the number of subdivisions to evenly divide the list. For any remainder, one extra number is added to each subdivision starting from beginning. If this number is 3, then the subdivided list would be
[1,2,3][4,5][6,7]
Finally a third input, the "subdivision order (between 1 and the subdivision number)" is given. In the above example if the order is 1 then the output is [1,2,3], if the order is 2 then the output is [4,5]
The trivial dumb way would be to first do 7/3=2 and calculate the remainder 7-2*3=1, then generate the first group by assigning 1,2 first, and then, since the first group's order is no bigger than the remainder, add one element to get 1,2,3. Then generate the second group, etc.
However it seems to me there must be a way to directly get a middle group without the need to generate all the previous group. i.e. get [6,7] given the input max_num=7, subdivision_num=3, subdivision_order=3 without going through a for loop.
Now the actual subdivision output needed is indicated by only the smallest and the largest number (i.e. the output for 7,3,1 would be 1,3), so the latter would imply a worst case O(1) algorithm while the trivial dumb way has worst case O(n) where n is the subdivision number.
It seems not so hard, but I have struggled for a while and have not been able to come up with the "direct O(1)" algorithm. Any help would be appreciated.
Not considering the time to generate the list and assuming branching and arithmetic operations to take constant time, we can do it in O(1).
parts = max_num // subdivisions_num
rems = max_num % subdivisions_num
if subdivision_order <= rems:
    startindex = (parts + 1) * (subdivision_order - 1)
    length = parts + 1
else:
    startindex = (parts + 1) * rems + parts * (subdivision_order - rems - 1)
    length = parts
print(numlist[startindex:startindex + length])
I think you don't need to divide the list into sublists. You can just calculate the start and length of the subarray.
In this case, if subdivision_order is no greater than the remainder of max_num divided by subdivision_num, you calculate the start index by multiplying subdivision_order - 1 (in the 0-indexed case) by max_num // subdivision_num + 1, and the length will be max_num // subdivision_num + 1.
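Wrapped up as a self-contained function that returns the (smallest, largest) pair directly (the function and variable names are mine):

```python
def subdivision_bounds(max_num, subdivision_num, subdivision_order):
    # parts = base size of each subdivision, rems = how many get one extra
    parts, rems = divmod(max_num, subdivision_num)
    if subdivision_order <= rems:
        start = (parts + 1) * (subdivision_order - 1)
        length = parts + 1
    else:
        start = (parts + 1) * rems + parts * (subdivision_order - rems - 1)
        length = parts
    # 1-based smallest and largest number of the subdivision
    return start + 1, start + length

print(subdivision_bounds(7, 3, 1))  # (1, 3)
print(subdivision_bounds(7, 3, 3))  # (6, 7)
```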
If I understand your question correctly, you're given three values
maximum
segments
segment index (one based)
and you want to return the minimum and maximum of the segment range.
Let's work through an example with slightly larger numbers
maximum = 107
segments = 10
segment index = 4
Ok, doing the math
107 (maximum) / 10 (segments) = 10 values in a segment with 7 left over.
So, the first 7 segments have 11 values and the last 3 segments have 10 values. The remainders go in the first segments.
So 3 * 11 = 33, where 3 is the zero-based segment index and each of the first 3 segments has 11 values.
The 4th segment also has 11 values, so you would return 33 + 1 and 33 + 11, or 34 and 44. The only "trick" is to make sure that you differentiate between the segments with a remainder and the segments without a remainder.
You need to calculate four numbers. The number of segments with a remainder, the length of the segment with a remainder, the number of segments without a remainder, and the length of the segment without a remainder.
In the example I gave that would be 7 & 11, 3 & 10. Then you add the counts of the prior segments and the count of the wanted segment. You do this with two multiplications.
The first multiplication is the number of remainder segments times the length of the remainder segments. The second multiplication is the number of non-remainder segments times the length of the non-remainder segments.
Again, using the example I gave, that would be
3 * 11 + 0 * 10 = 33
where 3 is the zero-based segment index, 11 is the length of a remainder segment, 0 is the number of non-remainder segments and 10 is the length of the non-remainder segment.
The complexity is O(1).
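Those four numbers and two multiplications translate directly into code. Here is a Python sketch of the method (names are mine), using the 107/10 example:

```python
def segment_range(maximum, segments, index):
    small = maximum // segments        # length of a non-remainder segment
    rem = maximum % segments           # number of remainder segments
    large = small + 1                  # length of a remainder segment
    z = index - 1                      # zero-based segment index
    # counts of the prior segments: remainder ones first, then non-remainder ones
    before = min(z, rem) * large + max(0, z - rem) * small
    length = large if z < rem else small
    return before + 1, before + length

print(segment_range(107, 10, 4))  # (34, 44)
```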
Here is an implementation of the algorithm in a JavaScript snippet. You can input the parameters interactively to test it.
function partition(size, partitionCount, partitionPosition) {
  if (partitionCount < 1 || partitionCount > size) return null; // Out of range
  if (partitionPosition < 1 || partitionPosition > partitionCount) return null; // Out of range
  // Get the largest partition size
  let partitionSize = Math.ceil(size / partitionCount);
  // Determine how many partitions are that large
  let largerPartitionCount = partitionCount - (partitionSize - size % partitionSize) % partitionSize;
  // Convert 1-based position to 0-based index
  let partitionIndex = partitionPosition - 1;
  // Derive the first and last value of the requested partition
  let first = partitionIndex * partitionSize + 1 - Math.max(0, partitionIndex - largerPartitionCount);
  let last = first + partitionSize - 1 - (partitionIndex >= largerPartitionCount ? 1 : 0);
  return [first, last];
}
// I/O management
let inputs = document.querySelectorAll("input");
let output = document.querySelector("span");
document.addEventListener("change", refresh);

function refresh() {
  let [size, partitionCount, partitionPosition] = Array.from(inputs, input => +input.value);
  let result = partition(size, partitionCount, partitionPosition);
  output.textContent = JSON.stringify(result);
}
refresh();
input { width: 3em }
Array size: <input type="number" value="7" ><br>
Number of partitions: <input type="number" value="3" ><br>
Partition to return: <input type="number" value="1" ><br>
<br>
Returned partition: <span></span>

Number of distinct sequences of fixed length which can be generated using a given set of numbers

I am trying to find different sequences of fixed length which can be generated using the numbers from a given set (distinct elements) such that each element from set should appear in the sequence. Below is my logic:
e.g. Let the set consist of S elements, and we have to generate sequences of length K (K >= S).
1) First we have to choose S places out of K and place each element from the set in random order. So, C(K,S)*S!
2) After that, the remaining places can be filled with any values from the set, so the factor S^(K-S) should be multiplied.
So, the overall result is
C(K,S) * S! * (S^(K-S))
But I am getting the wrong answer. Please help.
PS: C(K,S) is the number of ways of selecting S elements out of K elements (K >= S) irrespective of order. Also, ^ is the power symbol, i.e. 2^3 = 8.
Here is my code in python:
# m is the no. of elements to select from a set of n elements
# fact is a list containing factorial values, i.e. fact[0] = 1, fact[3] = 6, and so on.
def ways(m, n):
    # C(n, m) * m! * m^(n-m)
    res = fact[n] // fact[n - m] * (m ** (n - m))
    return res
What you are looking for is the number of surjective functions whose domain is a set of K elements (the K positions that we are filling out in the output sequence) and the image is a set of S elements (your input set). I think this should work:
static int Count(int K, int S)
{
    int sum = 0;
    for (int i = 1; i <= S; i++)
    {
        sum += Pow(-1, S - i) * Fact(S) / (Fact(i) * Fact(S - i)) * Pow(i, K);
    }
    return sum;
}
...where Pow and Fact are what you would expect.
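The same inclusion-exclusion sum is easy to check in Python (math.comb supplies the binomial coefficient; the function name is mine):

```python
from math import comb

def count_sequences(K, S):
    # surjections from K positions onto S symbols, by inclusion-exclusion
    return sum((-1) ** (S - i) * comb(S, i) * i ** K for i in range(1, S + 1))

print(count_sequences(4, 3))  # 36
```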
Check out this math.se question.
Here's why your approach won't work. I didn't check the code, just your explanation of the logic behind it, but I'm pretty sure I understand what you're trying to do. Let's take for example K = 4, S = {7,8,9}. Let's examine the sequence 7,8,9,7. It is a unique sequence, but you can get to it by:
Randomly choosing positions 1,2,3, filling them randomly with 7,8,9 (your step 1), then randomly choosing 7 for the remaining position 4 (your step 2).
Randomly choosing positions 2,3,4, filling them randomly with 8,9,7 (your step 1), then randomly choosing 7 for the remaining position 1 (your step 2).
By your logic, you will count it both ways, even though it should be counted only once as the end result is the same. And so on...

Different way to index threads in CUDA C

I have a 9x9 matrix and I flattened it into a vector of 81 elements; then I defined a grid of 9 blocks with 9 threads each, for a total of 81 threads; here's a picture of the grid
Then I tried to verify what the index related to thread (0,0) of block (1,1) was; first I calculated the i-th column and the j-th row like this:
i = blockDim.x*blockIdx.x + threadIdx.x = 3*1 + 0 = 3
j = blockDim.y*blockIdx.y + threadIdx.y = 3*1 + 0 = 3
therefore the index is:
index = N*i + j = 9*3 +3 = 30
As a matter of fact, thread (0,0) of block (1,1) does correspond to element 30 of the matrix.
Now here's my problem: let's say I choose a grid with 4 blocks (0,0)(1,0)(0,1)(1,1), with 4 threads each (0,0)(1,0)(0,1)(1,1).
Let's say I keep the original vector with 81 elements; what should I do to get the index of a generic element of the vector using just 4*4 = 16 threads? I have tried the formulas written above but they don't seem to apply.
My goal is that every thread handles a single element of the vector...
A common way to have a smaller number of threads cover a larger number of data elements is to use a "grid-striding loop". Suppose I had a vector of length n elements, and I had some smaller number of threads, and I wanted to take every element, add 1 to it, and store it back in the original vector. That code could look something like this:
__global__ void my_inc_kernel(int *data, int n){
    int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y)
            + (threadIdx.x+blockDim.x*blockIdx.x);
    while (idx < n){
        data[idx]++;
        idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);
    }
}
(the above is coded in browser, not tested)
The only complicated parts above are the indexing parts. The initial calculation of idx is just a typical creation/assignment of a globally unique id (idx) to each thread in a 2D threadblock/grid structure. Let's break it down:
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y)  // (width of grid in threads) * (thread y-index)
        + (threadIdx.x+blockDim.x*blockIdx.x);                        // + (thread x-index)
The amount added to idx on each pass of the while loop is the size of the 2D grid in total threads. Therefore, each iteration of the while loop does one "grid's width" of elements at a time, and then "strides" to the next grid-width, to process the next group of elements. Let's break that down:
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);  // (width of grid in threads) * (height of grid in threads)
This methodology does not require that the total number of elements be evenly divisible by the number of threads. The conditional check of the while-loop handles all cases of the relationship between vector size and grid size.
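A quick way to convince yourself of this is to replay the kernel's index arithmetic on the host. This Python sketch (names are mine) uses a hypothetical 2x2 grid of 2x2 blocks, i.e. 16 threads, against an 81-element vector, and shows that every index is visited exactly once:

```python
def covered_indices(n, grid=(2, 2), block=(2, 2)):
    gx, gy = grid
    bx, by = block
    width = gx * bx                 # grid width in threads
    stride = width * (gy * by)      # total number of threads in the grid
    hits = []
    for block_y in range(gy):
        for block_x in range(gx):
            for ty in range(by):
                for tx in range(bx):
                    idx = width * (ty + by * block_y) + (tx + bx * block_x)
                    while idx < n:          # the kernel's grid-striding loop
                        hits.append(idx)
                        idx += stride
    return sorted(hits)

print(covered_indices(81) == list(range(81)))  # True
```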
This particular grid-striding loop methodology has the additional benefit (in terms of mapping elements to threads) that it tends to naturally promote coalesced access. The reads and writes to the data vector in the code above will coalesce perfectly, due to the behavior of the grid-striding loop. You can enhance coalescing behavior in this case by choosing block sizes that are a whole-number multiple of 32, but that is not central to your question.

number to unique permutation mapping of a sequence containing duplicates

I am looking for an algorithm that can map a number to a unique permutation of a sequence. I have found out about Lehmer codes and the factorial number system thanks to a similar question, Fast permutation -> number -> permutation mapping algorithms, but that question doesn't deal with the case where there are duplicate elements in the sequence.
For example, take the sequence 'AAABBC'. There are 6! = 720 ways it could be arranged, but I believe there are only 6! / (3! * 2! * 1!) = 60 unique permutations of this sequence. How can I map a number to a permutation in these cases?
Edit: changed the term 'set' to 'sequence'.
From Permutation to Number:
Let K be the number of character classes (example: AAABBC has three character classes)
Let N[K] be the number of elements in each character class (example: for AAABBC, we have N[K] = [3,2,1]), and let N = sum(N[K])
Every legal permutation of the sequence then uniquely corresponds to a path in an incomplete K-way tree.
The unique number of the permutation then corresponds to the index of the tree-node in a post-order traversal of the K-ary tree terminal nodes.
Luckily, we don't actually have to perform the tree traversal -- we just need to know how many terminal nodes in the tree are lexicographically less than our node. This is very easy to compute, as at any node in the tree, the number of terminal nodes below the current node is equal to the number of permutations using the unused elements in the sequence, which has a closed-form solution that is a simple ratio of factorials.
So given our 6 original letters, if the first element of our permutation is a 'B', we determine that there will be 5!/2!2!1! = 30 permutations that start with 'A', so our permutation number has to be at least 30. Had our first letter been a 'C', we could have calculated it as 5!/2!2!1! (not A) + 5!/3!1!1! (not B) = 30 + 20, or alternatively as
60 (total) - 5!/3!2!0! (C) = 60 - 10 = 50
Using this, we can take a permutation (e.g. 'BAABCA') and perform the following computations:
Permutation # = 5!/2!2!1! ('B') + 0 ('A') + 0 ('A') + 2!/0!1!1! ('B') + 1!/0!0!1! ('C')
= 30 + 2 + 1 = 33
Checking that this works: CBBAAA corresponds to
(5!/2!2!1! (not A) + 5!/3!1!1! (not B)) 'C'+ 4!/2!2!0! (not A) 'B' + 3!/2!1!0! (not A) 'B' = (30 + 20) +6 + 3 = 59
Likewise, AAABBC =
0 ('A') + 0 ('A') + 0 ('A') + 0 ('B') + 0 ('B') + 0 ('C') = 0
Sample implementation:
import math
import copy
from functools import reduce
from operator import mul

def computePermutationNumber(inPerm, inCharClasses):
    # inPerm is the permutation as a list of character-class indices;
    # inCharClasses is the count of each class, e.g. [3, 2, 1] for AAABBC
    permutation = copy.copy(inPerm)
    charClasses = copy.copy(inCharClasses)
    n = len(permutation)
    permNumber = 0
    for i, x in enumerate(permutation):
        # count the permutations that place a smaller class at this position
        for j in range(x):
            if charClasses[j] > 0:
                charClasses[j] -= 1
                permNumber += multiFactorial(n - i - 1, charClasses)
                charClasses[j] += 1
        if charClasses[x] > 0:
            charClasses[x] -= 1
    return permNumber

def multiFactorial(n, charClasses):
    val = math.factorial(n) // reduce(mul, map(math.factorial, charClasses))
    return val
From Number to Permutation:
This process can be done in reverse, though I'm not sure how efficiently:
Given a permutation number, and the alphabet that it was generated from, recursively subtract the largest number of nodes less than or equal to the remaining permutation number.
E.g. Given a permutation number of 59, we first can subtract 30 + 20 = 50 ('C') leaving 9. Then we can subtract 'B' (6) and a second 'B'(3), re-generating our original permutation.
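A sketch of that reverse process in Python (names are mine; the symbol alphabet 'ABC' and counts [3, 2, 1] follow the running example):

```python
from math import factorial

def multi_fact(n, counts):
    # number of permutations of n remaining symbols with the given class counts
    r = factorial(n)
    for c in counts:
        r //= factorial(c)
    return r

def number_to_perm(num, counts, symbols="ABC"):
    counts = list(counts)
    n = sum(counts)
    out = []
    for pos in range(n):
        for j, c in enumerate(counts):
            if c == 0:
                continue
            counts[j] -= 1
            block = multi_fact(n - pos - 1, counts)
            if num < block:            # the permutation falls in this block
                out.append(symbols[j])
                break
            counts[j] += 1
            num -= block               # skip past the whole block
    return "".join(out)

print(number_to_perm(59, [3, 2, 1]))  # CBBAAA
print(number_to_perm(0, [3, 2, 1]))   # AAABBC
```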
Here is an algorithm in Java that enumerates the possible sequences by mapping an integer to the sequence.
public class Main {
    private int[] counts = { 3, 2, 1 }; // 3 Symbols A, 2 Symbols B, 1 Symbol C
    private int n = sum(counts);

    public static void main(String[] args) {
        new Main().enumerate();
    }

    private void enumerate() {
        int s = size(counts);
        for (int i = 0; i < s; ++i) {
            String p = perm(i);
            System.out.printf("%4d -> %s\n", i, p);
        }
    }

    // calculates the total number of symbols still to be placed
    private int sum(int[] counts) {
        int n = 0;
        for (int i = 0; i < counts.length; i++) {
            n += counts[i];
        }
        return n;
    }

    // calculates the number of different sequences with the symbol configuration in counts
    private int size(int[] counts) {
        int res = 1;
        int num = 0;
        for (int pos = 0; pos < counts.length; pos++) {
            for (int den = 1; den <= counts[pos]; den++) {
                res *= ++num;
                res /= den;
            }
        }
        return res;
    }

    // maps the sequence number to a sequence
    private String perm(int num) {
        int[] counts = this.counts.clone();
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; ++i) {
            int p = 0;
            for (;;) {
                while (counts[p] == 0) {
                    p++;
                }
                counts[p]--;
                int c = size(counts);
                if (c > num) {
                    sb.append((char) ('A' + p));
                    break;
                }
                counts[p]++;
                num -= c;
                p++;
            }
        }
        return sb.toString();
    }
}
The mapping used by the algorithm is as follows. I use the example given in the question (3 x A, 2 x B, 1 x C) to illustrate it.
There are 60 (=6!/3!/2!/1!) possible sequences in total, 30 (=5!/2!/2!/1!) of them have an A at the first place, 20 (=5!/3!/1!/1!) have a B at the first place, and 10 (=5!/3!/2!/0!) have a C at the first place.
The numbers 0..29 are mapped to all sequences starting with an A, 30..49 are mapped to the sequences starting with B, and 50..59 are mapped to the sequences starting with C.
The same process is repeated for the next place in the sequence, for example if we take the sequences starting with B we have now to map numbers 0 (=30-30) .. 19 (=49-30) to the sequences with configuration (3 x A, 1 x B, 1 x C)
A very simple algorithm for mapping a permutation consisting of n digits to a number is
number <- digit[0]*10^(n-1) + digit[1]*10^(n-2) + ... + digit[n-1]*10^0
You can find plenty of resources for algorithms to generate permutations. I guess you want to use this algorithm in bioinformatics. For example you can use itertools.permutations from Python.
Assuming the resulting number fits inside a word (e.g. a 32- or 64-bit integer) relatively easily, much of the linked article still applies. Encoding and decoding from a variable base remains the same. What changes is how the base varies.
If you're creating a permutation of a sequence, you pick an item out of your bucket of symbols (from the original sequence) and put it at the start. Then you pick out another item from your bucket of symbols and put it on the end of that. You'll keep picking and placing symbols at the end until you've run out of symbols in your bucket.
What's significant is which item you picked out of the bucket of the remaining symbols each time. The number of remaining symbols is something you don't have to record because you can compute that as you build the permutation -- that's a result of your choices, not the choices themselves.
The strategy here is to record what you chose, and then present an array of what's left to be chosen. Then choose, record which index you chose (packing it via the variable base method), and repeat until there's nothing left to choose. (Just as above when you were building a permuted sequence.)
In the case of duplicate symbols it doesn't matter which one you picked, so you can treat them as the same symbol. The difference is that when you pick a symbol which still has a duplicate left, you didn't reduce the number of symbols in the bucket to pick from next time.
Let's adopt a notation that makes this clear:
Instead of listing duplicate symbols left in our bucket to choose from like c a b c a a we'll list them along with how many are still in the bucket: c-2 a-3 b-1.
Note that if you pick c from the list, the bucket has c-1 a-3 b-1 left in it. That means next time we pick something, we have three choices.
But on the other hand, if I picked b from the list, the bucket has c-2 a-3 left in it. That means next time we pick something, we only have two choices.
When reconstructing the permuted sequence we just maintain the bucket the same way as when we were computing the permutation number.
The implementation details aren't trivial, but they're straightforward with standard algorithms. The only thing that might trip you up is what to do when a symbol in your bucket is no longer available.
Suppose your bucket was represented by a list of pairs (like above): c-1 a-3 b-1, and you choose c. Your resulting bucket is c-0 a-3 b-1. But c-0 is no longer a choice, so your list should only have two entries, not three. You could move the entire list down by 1, resulting in a-3 b-1, but if your list is long this is expensive. A fast and easy solution: move the last element of the bucket into the removed location and decrease your bucket size: c-0 a-3 b-1 becomes b-1 a-3 <empty>, or just b-1 a-3.
Note that we can do the above because it doesn't matter what order the symbols in the bucket are listed in, as long as it's the same way when we encode or decode the number.
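A minimal Python sketch of that constant-time bucket update (names are mine; entries are (symbol, remaining-count) pairs):

```python
def take(bucket, i):
    # pick one instance of bucket[i]'s symbol, shrinking the bucket if needed
    sym, cnt = bucket[i]
    if cnt > 1:
        bucket[i] = (sym, cnt - 1)
    else:
        bucket[i] = bucket[-1]   # move the last entry into the hole
        bucket.pop()             # and decrease the bucket size
    return sym

bucket = [("c", 1), ("a", 3), ("b", 1)]
take(bucket, 0)                  # pick the last remaining 'c'
print(bucket)                    # [('b', 1), ('a', 3)]
```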
As I was unsure of the code in gbronner's answer (or of my understanding), I recoded it in R as follows
ritpermz <- function(n, parclass){
    return(factorial(n) / prod(factorial(parclass)))
}

rankum <- function(confg, parclass){
    n <- length(confg)
    permdex <- 1
    for (i in 1:(n - 1)){
        x <- confg[i]
        if (x > 1){
            for (j in 1:(x - 1)){
                if (parclass[j] > 0){
                    parclass[j] <- parclass[j] - 1
                    permdex <- permdex + ritpermz(n - i, parclass)
                    parclass[j] <- parclass[j] + 1
                }
            }
        }
        parclass[x] <- parclass[x] - 1
    }
    return(permdex)
}
which does produce a ranking with the right range of integers

Find the two repeating elements in a given array

You are given an array of n+2 elements. All elements of the array are in range 1 to n. And all elements occur once except two numbers which occur twice. Find the two repeating numbers.
For example, array = {4, 2, 4, 5, 2, 3, 1} and n = 5
I know 4 probable solutions to this problem, but recently I encountered a solution which I am not able to interpret. Below is an algorithm for the solution:
traverse the list for i = 1st to n+2 elements
{
    check for sign of A[abs(A[i])];
    if positive then
        make it negative by A[abs(A[i])] = -A[abs(A[i])];
    else    // i.e., A[abs(A[i])] is negative
        this element (ith element of list) is a repetition
}
Example: A[] = {1,1,2,3,2}
i=1 ; A[abs(A[1])], i.e. A[1], is positive; so make it negative;
      the list is now {-1,1,2,3,2}
i=2 ; A[abs(A[2])], i.e. A[1], is negative; so A[i], i.e. A[2] = 1, is a repetition;
      the list is still {-1,1,2,3,2}
i=3 ; A[abs(A[3])], i.e. A[2], is positive; so make it negative;
      the list is now {-1,-1,2,3,2}, and A[3] is not a repetition
i=4 ; A[abs(A[4])], i.e. A[3], is positive; so make it negative;
      the list is now {-1,-1,-2,3,2}, and A[4] = 3 is not a repetition
i=5 ; we find A[abs(A[5])] = A[2] is negative, so A[5] = 2 is a repetition
This method modifies the original array.
How is this algorithm producing proper results, i.e. how does it work? Please don't take this as a homework question, as it was recently asked in a Microsoft interview.
You are given an array of n+2 elements. All elements of the array are in range 1 to n. And all elements occur once except two numbers which occur twice
Lets modify this slightly, and go with just n, not n+2, and the first part of the problem statement, it becomes
You are given an array of n elements. All elements of the array are in range 1 to n
So now you know you have an array, the numbers in the array start at 1 and go up by one for every item in the array. So if you have 10 items, the array will contain the numbers 1 to 10. 5 items, you have 1 to 5 and so forth.
It follows that the numbers stored in the array can be used to index the array. i.e. you can always say A[A[i]] where i <= size of A. e.g. A={5,3,4,1,2}; print A[A[2]]
Now, lets add in one duplicate number.
The algorithm takes the value of each number in the array, and visits that index. If we visit the same index twice, we know we have found a duplicate.
How do we know if we visit the same index twice?
Yup, we change the sign of the number in each index we visit, if the sign has already changed, we know we've already been here, ergo, this index (not the value stored at the index) is a duplicate number.
You could achieve the same result by keeping a second array of booleans, initialised to false. That algorithm becomes
A = {1,2,3,4,1,2}
B = {false, false, false, false}
for(i = 1; i <= 6; i++)
{
    if(B[A[i]])
        // Duplicate
    else
        B[A[i]] = true;
}
However in the MS question you're changing the sign of the element in A instead of setting a boolean value in B.
Hope this helps,
What you are doing is using the array values in two ways: they have a number AND they have a sign. You 'store' the fact that you've seen the number n in the n-th spot in your array, without losing the original value in that spot: you're just changing the sign.
You start out with all positives, and if the spot where you want to 'save' the fact that you've seen your current value is already negative, then this value has already been seen.
example:
So if you see 4 for the first time, you change the sign on the fourth spot to negative. That doesn't change the 4th spot, because you are using [abs] on that when you would go there, so no worries there.
If you see another 4, you check the 4th spot again, see that it is negative: presto: a double.
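The whole sign-flipping idea fits in a few lines of Python (names are mine; values are assumed to lie in 1..n as the problem states):

```python
def find_repeats(A):
    # Flips the sign at index v-1 when value v is first seen;
    # a negative entry there means v was seen before. Modifies A in place.
    repeats = []
    for i in range(len(A)):
        v = abs(A[i])
        if A[v - 1] < 0:
            repeats.append(v)
        else:
            A[v - 1] = -A[v - 1]
    return repeats

print(find_repeats([4, 2, 4, 5, 2, 3, 1]))  # [4, 2]
```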
When you find some element in position i, let's say n, then you make A[abs(A[i])] = A[abs(n)] negative. So if you find another position j containing n, you will also check A[abs(A[j])] = A[abs(n)]. Since you find it negative, n is repeated :)
The best approach for finding the two repeated elements is the XOR method.
This solution works only if the array has positive integers and all the elements are in the range from 1 to n.
As we know, A XOR A = 0. We have n + 2 elements in the array with 2 repeated elements (say the repeated elements are X and Y), and we know the range of elements is from 1 to n.
XOR all the numbers in the array together with the numbers from 1 to n. The result will be X XOR Y.
Since 1 XOR 1 = 0 and 1 XOR 0 = 1, if the kth bit of X XOR Y is set to 1, it implies the kth bit is 1 in either X or Y, but not in both.
Use the above step to divide all the elements in the array, and the numbers from 1 to n, into 2 groups: one group with the elements whose kth bit is set to 1, and a second group with the elements whose kth bit is 0.
Let that kth bit be the rightmost set bit (read up on how to find the rightmost set bit).
Now we can claim that these two groups are responsible to produce X and Y.
Group -1: XOR all the elements whose kth bit is 1 will produce either X or Y.
Group -2: XOR all the elements whose kth bit is 0 will produce either X or Y.
public class TwoRepeatingXOR {
    public static void twoRepeating(int[] A, int n){
        int XOR = A[0];
        int right_most_bit, X = 0, Y = 0, size = A.length;
        for (int i = 1; i <= n; i++)
            XOR ^= i;
        for (int i = 1; i < size; i++)   // A[0] is already included above
            XOR ^= A[i];
        // Now XOR contains X XOR Y
        // get the rightmost set bit
        right_most_bit = XOR & ~(XOR - 1);
        // divide the elements into 2 groups based on the rightmost set bit
        for (int i = 0; i < size; i++) {
            if ((A[i] & right_most_bit) != 0)
                X = X ^ A[i];
            else
                Y = Y ^ A[i];
        }
        for (int i = 1; i <= n; i++) {
            if ((i & right_most_bit) != 0)
                X = X ^ i;
            else
                Y = Y ^ i;
        }
        System.out.println("Two Repeated elements are: " + X + " and " + Y);
    }

    public static void main(String[] args) {
        int[] A = {1, 4, 5, 6, 3, 2, 5, 2};
        int n = 6;
        twoRepeating(A, n);
    }
}
credits go to https://algorithms.tutorialhorizon.com/find-the-two-repeating-elements-in-a-given-array-6-approaches/
Simple, use a Hashtable.
For each item, check if it already exists O(1) , and if not, add it to the hashtable O(1).
When you find an item that already exists... that's it.
I know this isn't really an answer to your question, but if I actually had to write this code on a real project, I would start with a sort algo like quicksort, and in my comparison function something like,
int Compare(int l, int r)
{
    if (l == r)
    {
        // duplicate; insert into duplicateNumbers array if it doesn't exist already.
        // if we found 2 dupes, quit the sort altogether
    }
    return r - l;
}
I would file this into the "good balance between performance and maintainability" bucket of possible solutions.
