calculating the number of bits using K&R method with infinite memory - algorithm

I found an answer to the question of counting the number of set bits here:
How to count the number of set bits in a 32-bit integer?
long count_bits(long n) {
    unsigned int c; // c accumulates the total bits set in n
    for (c = 0; n; c++)
        n &= n - 1; // clear the least significant bit set
    return c;
}
It is also simple to understand. The best answer, Brian Kernighan's method, was posted by hoyhoy, who adds the following at the end:
Note that this is a question used during interviews. The interviewer will add the caveat that you have "infinite memory". In that case, you basically create an array of size 2^32 and fill in the bit counts for the numbers at each location. Then, this function becomes O(1).
Can somebody explain how to do this, assuming I have infinite memory?

The fastest way I have ever seen to populate such an array is ...
array[0] = 0;
for (i = 1; i < NELEMENTS; i++) {
array[i] = array[i >> 1] + (i & 1);
}
Then to count the number of set bits in a given number (provided the given number is less than NELEMENTS) ...
numSetBits = array[givenNumber];
If your memory is not infinite, I often see NELEMENTS set to 256 (one byte's worth); you then add up the number of set bits in each byte of your integer.
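The byte-wise variant can be sketched as follows. This is a minimal illustration, assuming a 32-bit input; the names `bits_in_byte`, `init_bit_table` and `popcount32` are made up for this example and do not come from the answer above.

```c
#include <stdint.h>

/* Hypothetical names, chosen for this sketch only. */
static unsigned char bits_in_byte[256];

/* Fill the table with the recurrence from the answer above:
   bits(i) = bits(i >> 1) + (i & 1). */
void init_bit_table(void)
{
    bits_in_byte[0] = 0;
    for (int i = 1; i < 256; i++)
        bits_in_byte[i] = (unsigned char)(bits_in_byte[i >> 1] + (i & 1));
}

/* Sum the table entries for the four bytes of a 32-bit value. */
unsigned popcount32(uint32_t v)
{
    return bits_in_byte[v & 0xFF]
         + bits_in_byte[(v >> 8) & 0xFF]
         + bits_in_byte[(v >> 16) & 0xFF]
         + bits_in_byte[(v >> 24) & 0xFF];
}
```

`init_bit_table()` has to run once before the first call to `popcount32()`. The table costs 256 bytes instead of 2^32 entries, at the price of four lookups per call.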

int counts[MAX_LONG];
void init() {
    for (int i = 0; i < MAX_LONG; i++)
    {
        counts[i] = count_bits(i); // count_bits() as given in the question
    }
}
int count_bits_o1(long number)
{
    return counts[number];
}
You can probably pre-populate the array more wisely, i.e. fill it with zeros, then add one at every second index, then add one at every fourth index, then add one at every eighth index, etc., which might be a bit faster, although I doubt it...
Also, you might need to account for unsigned values.

Related

how do we calculate the number of reads/misses of the cache in this code snippet?

I'm trying to understand how to calculate the cache misses in the code from the textbook example linked on this page. I can see where the calculations come from, but as the values are the same (32), I cannot work out how to do the calculation should the values in the two loops differ. Using different sized loops, what would the calculations be please?
for (i = 32; i >= 0; i--) {
    for (j = 128; j >= 0; j--) {
        total_x += grid[i][j].x;
    }
}
for (i = 128; i >= 0; i--) {
    for (j = 32; j >= 0; j--) {
        total_y += grid[i][j].y;
    }
}
If we had a matrix with 128 rows and 24 columns (instead of the 32 x 32 in the example), using 32-bit integers, and with each memory block able to hold 16 bytes, how do we calculate the number of compulsory misses on the top loop?
Also, if we use a direct-mapped cache holding 256 bytes of data, how would we calculate the number of all the data cache misses when running the top loop?
Finally, if we flip it and use the bottom loop, how does the maths change (if it does) for the points above?
Apologies as this is all new to me and I just want to understand the maths behind it so I can answer the problem, rather than just be given an answer.
Nothing - it's a theoretical question

Rank of string solution

I was going through a question where it asks you to find the rank of the string amongst its permutations sorted lexicographically.
O(N^2) is pretty clear.
Some websites also have an O(n) solution. The part that is optimized is basically pre-populating a count array such that
count[i] contains the count of characters which are present in str and are smaller than i.
I understand that this would reduce the complexity, but I can't wrap my head around how this array is calculated. This is the function that does it (taken from the link):
// Construct a count array where value at every index
// contains count of smaller characters in whole string
void populateAndIncreaseCount (int* count, char* str)
{
int i;
for( i = 0; str[i]; ++i )
++count[ str[i] ];
for( i = 1; i < 256; ++i )
count[i] += count[i-1];
}
Can someone please provide an intuitive explanation of this function?
That solution is doing a Bucket Sort and then sorting the output.
A bucket sort is O(items + number_of_possible_distinct_inputs) which for a fixed alphabet can be advertised as O(n).
However, in practice UTF makes for a pretty large alphabet. I would therefore suggest a quicksort instead, because a quicksort that divides into the three buckets of <, > and = is efficient for a large character set but still takes advantage of a small one.
I understood it after going through it again; I had been confused by the C++ syntax. It's actually doing a pretty simple thing. Here's the Java version:
void populateAndIncreaseCount(int[] count, String str) {
// count is initialized to zero for all indices
for (int i = 0; i < str.length(); ++i) {
count[str.charAt(i)]++;
}
for (int i = 1; i < 256; ++i)
count[i] += count[i - 1];
}
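After the first step, the indices whose characters are present in the string are non-zero. After the second step, count[i] holds the number of characters in the string that are less than or equal to i, since the array indices are in lexicographic order; the number of characters strictly smaller than a character c is therefore count[c - 1]. After each character has been processed, we also update the count array: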
// Removes a character ch from count[] array
// constructed by populateAndIncreaseCount()
void updatecount (int* count, char ch)
{
int i;
for( i = ch; i < MAX_CHAR; ++i )
--count[i];
}
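To show how the two helpers fit together, here is a hedged sketch of the whole rank computation in C. It assumes the string has distinct characters and at most 20 of them, so that n! fits in an `unsigned long long`; the names `populate_count`, `remove_char` and `rank_of_string` are illustrative and not taken from the linked solution.

```c
#include <string.h>

#define MAX_CHAR 256

/* After this call, count[c] = number of characters in str that are <= c. */
static void populate_count(int *count, const char *str)
{
    memset(count, 0, MAX_CHAR * sizeof(int));
    for (int i = 0; str[i]; ++i)
        ++count[(unsigned char)str[i]];
    for (int i = 1; i < MAX_CHAR; ++i)
        count[i] += count[i - 1];
}

/* Remove one occurrence of ch from the prefix sums. */
static void remove_char(int *count, unsigned char ch)
{
    for (int i = ch; i < MAX_CHAR; ++i)
        --count[i];
}

/* 1-based rank of str among the sorted permutations of its characters.
   Assumption: distinct characters, length <= 20. */
unsigned long long rank_of_string(const char *str)
{
    int count[MAX_CHAR];
    size_t n = strlen(str);
    unsigned long long fact = 1, rank = 1;

    for (size_t i = 2; i <= n; ++i)
        fact *= i;                        /* fact = n! */
    populate_count(count, str);

    for (size_t i = 0; i < n; ++i) {
        unsigned char c = (unsigned char)str[i];
        fact /= (n - i);                  /* now (n - i - 1)! */
        /* count[c - 1] = characters still unused that are smaller than c */
        int smaller = (c > 0) ? count[c - 1] : 0;
        rank += (unsigned long long)smaller * fact;
        remove_char(count, c);            /* c is used up from here on */
    }
    return rank;
}
```

For each position, every smaller unused character could have started (n - i - 1)! earlier permutations, which is why the lookup `count[c - 1]` replaces the inner loop of the O(N^2) version. The update step still walks the 256-entry array, so the running time is linear in the string length only for a fixed alphabet size.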

Selection Sort in Cuda

So, I'm trying to implement selection sort in CUDA, but so far I haven't been successful.
__device__ void selection_sort( int *data, int left, int right ){
for( int i = left ; i <= right ; ++i ){
int min_val = data[i];
int min_idx = i;
// Find the smallest value in the range [left, right].
for( int j = i+1 ; j <= right ; ++j ){
int val_j = data[j];
if( val_j < min_val ){
min_idx = j;
min_val = val_j;
}
}
// Swap the values.
if( i != min_idx ){
data[min_idx] = data[i];
data[i] = min_val;
}
}
}
My main aim here is to find the minimum and parallelize the solution. I realize the code looks very C++-ish, but I'm nowhere near skilled in CUDA.
Is there a way to parallelize the solution? Are there any more additions to be made?
Selection sort algorithm for N numbers can be roughly described as:
for i from N-1 down to 0
find the maximum element among data[0] ~ data[i]
swap that maximum element with data[i] within the data array
The first part (finding the maximum element) falls into a widely known and well documented class of problems called reduction. However, to perform the second part (swapping), you must track the index of the maximum element while comparing the values, and it is not so natural to do that while performing a reduction. This is one of the reasons why selection sort does not port well to parallel architectures.
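For reference, the two steps can be written sequentially in plain C as below. This is only a sketch of the loop described above, with the hypothetical name `selection_sort_max`; it is host code, not a CUDA kernel. Note how the inner loop has to carry an index along with the comparison, which is the part that makes the reduction awkward.

```c
#include <stddef.h>

/* For each i from n-1 down to 1, find the maximum of data[0..i]
   and swap it into data[i]. */
void selection_sort_max(int *data, size_t n)
{
    for (size_t i = n; i-- > 1; ) {
        size_t max_idx = 0;
        /* "Reduction" step: track the index of the maximum, not just its value. */
        for (size_t j = 1; j <= i; ++j)
            if (data[j] > data[max_idx])
                max_idx = j;
        /* Swap step: this is what needs the index found above. */
        int tmp = data[i];
        data[i] = data[max_idx];
        data[max_idx] = tmp;
    }
}
```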
Also, you can see that the problem size diminishes by one on each iteration, and this is another aspect of the selection sort algorithm that does not map well to parallel architectures. In the case of CUDA, 32 threads form a warp, and they execute at the same time. Although you can have an arbitrary number of threads active within a warp, it is generally not recommended because the idle lanes are wasted computing power.
I've tried to build a CUDA version of selection sort myself, but I stopped doing it because it seems there are better algorithms well suited for CUDA. But I'll just show you what I've done so far to illustrate why selection sort is not good for CUDA.
Firstly, start from a small and simple problem: sorting 32 elements. Since 32 threads form a warp, you can use shuffle instructions to find maximum value. (Full code)
// Finds the maximum element within a warp and gives the maximum element to
// thread with lane id 0. Note that other elements do not get lost but their
// positions are shuffled.
__inline__ __device__ int warpMax(int data, unsigned int threadId)
{
    for (int mask = 16; mask > 0; mask /= 2) {
        // __shfl_xor_sync replaces __shfl_xor, which was deprecated in CUDA 9
        int dual_data = __shfl_xor_sync(0xffffffff, data, mask, 32);
        if (threadId & mask)
            data = min(data, dual_data);
        else
            data = max(data, dual_data);
    }
    return data;
}
__global__ void selection32(int* d_data, int* d_data_sorted)
{
unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int laneId = threadIdx.x % 32;
int n = N;
while(n-- > 0) {
// get the maximum element among d_data and put it in d_data_sorted[n]
int data = d_data[threadId];
data = warpMax(data, threadId);
d_data[threadId] = data;
// now maximum element is in d_data[0]
if (laneId == 0) {
d_data_sorted[n] = d_data[0];
d_data[0] = INT_MIN; // this element is ignored from now on
}
}
}
int main()
{
// ... build data and transfer to d_data ...
selection32<<<1, 32>>>(d_data, d_data_sorted);
// ... get the sorted array stored at d_data_sorted ...
}
(Some may argue that this is not exactly a selection sort since 1) the array elements of the unsorted area keep shuffling, and 2) it is not an in-place sort. Please note that I'm just trying to show that selection sort does not fit in for CUDA. Also, note that warpMax has highly divergent branches, making it less optimal for CUDA.)
The case with only 1 warp of elements may look parallel-ish, but things get worse when the problem size increases to multiple warps. Let's see the case for 1024 elements. (I've chosen the number 1024 because it is the maximum number of threads in a block.) Now there are 32 warps, and after calling warpMax for each warp, we must compare the maximum elements of each warp to get the maximum element among the 1024 elements. This comparison of 32 warp-maximum values cannot be done with warpMax, because we need to track which warp the maximum value came from in order to swap the maximum value with the last element in the data array. One way I can think of for doing this is to use one single thread to compare the warp-maximum values. This is not a good implementation for CUDA because the other 1023 threads in the block become idle.
Furthermore, if the problem size grows larger than a block can cover, we need to compare the maximum values of each block, implying that we will have to launch separate kernels since we need to synchronize between blocks. Needless to say, we also need to keep track of which block the maximum value came from. All of this just tells us that implementing selection sort for CUDA is not a good idea.

How can we find a repeated number in array in O(n) time and O(1) space complexity

How can we find a repeated number in an array in O(n) time and O(1) space complexity?
eg
array 2,1,4,3,3,10
output is 3
EDIT:
I tried the following.
I found that if a number is repeated an odd number of times, we can get the result with XOR. So I thought of turning every element that occurs an odd number of times into one that occurs an even number of times, and every element that occurs an even number of times into one that occurs an odd number of times. But for that I need to extract the unique elements of the input array in O(n), and I couldn't find a way to do it.
Assuming that there is an upper bound for the values of the numbers in the array (which is the case with all built-in integer types in all programming languages I've ever used; for example, let's say they are 32-bit integers), there is a solution that uses constant space:
Create an array of N elements, where N is the upper bound for the integer values in the input array, and initialize all elements to 0 or false or some equivalent. I'll call this the lookup array.
Loop over the input array, and use each number to index into the lookup array. If the value you find is 1 or true (etc), the current number in the input array is a duplicate.
Otherwise, set the corresponding value in the lookup array to 1 or true to remember that we have seen this particular input number.
Technically, this is O(n) time and O(1) space, and it does not destroy the input array. Practically, you would need things to be going your way to have such a program actually run (e.g. it's out of the question if talking about 64-bit integers in the input).
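A minimal C sketch of the lookup-array idea, under an assumed bound that is small enough to be practical. `VALUE_BOUND` and `first_repeated` are names invented for this example, and 65536 is an arbitrary choice, not something stated in the answer.

```c
#include <stddef.h>
#include <string.h>

#define VALUE_BOUND 65536u  /* assumed upper bound on the input values */

/* Returns the first value that is seen twice, or -1 if there is none.
   Assumption: every input value lies in [0, VALUE_BOUND). */
int first_repeated(const unsigned *input, size_t n)
{
    /* One flag per possible value; the size depends on VALUE_BOUND, not on n. */
    static unsigned char seen[VALUE_BOUND];
    memset(seen, 0, sizeof seen);

    for (size_t i = 0; i < n; ++i) {
        if (seen[input[i]])
            return (int)input[i];   /* already marked: duplicate */
        seen[input[i]] = 1;         /* remember this value */
    }
    return -1;
}
```

The lookup array is the same size whatever the input length, which is the sense in which the space is "constant"; clearing it still costs time proportional to the bound.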
Without knowing more about the possible values in the array you can't.
With O(1) space requirement the fastest way is to sort the array so it's going to be at least O(n*log(n)).
Use bit manipulation and traverse the list in one loop.
Check whether bit i of the mask is set by shifting the mask right by the value i.
If it is, print the repeated value i.
If the bit is unset, set it.
*If you only want to report each repeated value once, add another integer, show, and set its bits as well, as in the example below.
**This is in Java. An int mask has only 32 bits (and Java uses only the low five bits of the shift count), so this works only for values from 0 to 31; you might want to add a range check on the input.
public static void repeated( int[] vals ) {
    int mask = 0;
    int show = 0;
    for( int i : vals ) {
        // bit i of mask is set: i has been seen before
        if( (( mask >> i ) & 1) == 1 ) {
            // report each repeated value only once
            if( (( show >> i ) & 1) == 0 ) {
                System.out.println( "\n\tfound: " + i );
                show = show | (1 << i);
            }
        }
        // first occurrence: set bit i of mask
        else {
            mask = mask | (1 << i);
            System.out.println( "new: " + i );
        }
        System.out.println( "mask: " + mask );
    }
}
This is impossible without some restriction on the input array: either the memory complexity will depend on the input size, or the time complexity will be higher.
The two answers above are in fact the best for getting close to what you asked; one trades off time and the other trades off memory, but you can't have it run in O(n) time and O(1) space on an arbitrary, unknown input array.
I ran into this problem too, and my solution uses a hash map. The Python version is the following:
def findRepeatNumber(lists):
    hashMap = {}
    for i in range(len(lists)):
        if lists[i] in hashMap:
            return lists[i]
        else:
            hashMap[lists[i]] = i + 1
    return
It is possible only if the data has specific properties, e.g. all numbers fall in a small range. Then you can store the repeat information in the source array itself without disturbing the scanning and analysis.
Simplified example: if you know that all the numbers are smaller than 100, you can mark the repeat count for a number using extra zeroes, e.g. store 900 instead of 9 when 9 has occurred twice.
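One common way to store the repeat information in the source array is sketched below in C. It is a related variant rather than the exact "extra zeroes" scheme described above: it assumes every value lies in [0, n), where n is the array length, and that 2*n fits in an int, and it uses each value as an index to mark. The function name `find_repeat_in_place` is hypothetical. The example array from the question (2,1,4,3,3,10) does not meet this assumption because 10 is not below 6.

```c
#include <stddef.h>

/* Assumptions: every value lies in [0, n) and 2*n fits in an int.
   Returns a repeated value, or -1 if there is none.
   The input is modified during the scan and restored before returning. */
int find_repeat_in_place(int *a, size_t n)
{
    int result = -1;
    int bound = (int)n;

    for (size_t i = 0; i < n; ++i) {
        int v = a[i] % bound;       /* original value stored at position i */
        if (a[v] >= bound) {        /* slot v already marked: v was seen before */
            result = v;
            break;
        }
        a[v] += bound;              /* mark value v as seen */
    }
    for (size_t i = 0; i < n; ++i)  /* undo the marks */
        a[i] %= bound;
    return result;
}
```

This runs in O(n) time with O(1) extra space, but only because the range restriction lets the array double as its own lookup table.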
It is easy when NumMax-NumMin
http://www.geeksforgeeks.org/find-the-maximum-repeating-number-in-ok-time/
public static string RepeatedNumber()
{
    int[] input = {66, 23, 34, 0, 5, 4};
    int[] indexer = {0,0,0,0,0,0};
    var found = 0;
    for (int i = 0; i < input.Length; i++)
    {
        var toFind = input[i];
        for (int j = 0; j < input.Length; j++)
        {
            if (input[j] == toFind && (indexer[j] == 1))
            {
                found = input[j];
            }
            else if (input[j] == toFind)
            {
                indexer[j] = 1;
            }
        }
    }
    return $"most repeated item in the array is {found}";
}
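You can do this (note that it only detects equal values in adjacent positions, e.g. in sorted input):
#include <iostream>
using namespace std;
int main()
{
    int array[5], rep = 0;
    for (int i = 0; i < 5; i++)
    {
        cout << "enter elements" << endl;
        cin >> array[i];
    }
    // stop at index 3 so that array[i+1] stays inside the array
    for (int i = 0; i < 4; i++)
    {
        if (array[i] == array[i+1])
        {
            rep = array[i];
        }
    }
    cout << " repeat value is " << rep;
    return 0;
}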

How can I count the digits in an integer without a string cast?

I fear there's a simple and obvious answer to this question. I need to determine how many digits wide a count of items is, so that I can pad each item number with the minimum number of leading zeros required to maintain alignment. For example, I want no leading zeros if the total is < 10, 1 if it's between 10 and 99, etc.
One solution would be to cast the item count to a string and then count characters. Yuck! Is there a better way?
Edit: I would not have thought to use the common logarithm (I didn't know such a thing existed). So, not obvious - to me - but definitely simple.
This should do it:
int length = (number ==0) ? 1 : (int)Math.log10(number) + 1;
int length = (int)Math.Log10(Math.Abs(number)) + 1;
You may need to account for the negative sign..
A more efficient solution than repeated division would be repeated if statements with multiplies... e.g. (where n is the number whose number of digits is required)
unsigned long long test = 1; // wider than a 32-bit n, so test *= 10 cannot wrap around
unsigned int digits = 0;     // note: stays 0 when n == 0
while (n >= test)
{
    ++digits;
    test *= 10;
}
If there is some reasonable upper bound on the item count (e.g. the 32-bit range of an unsigned int) then an even better way is to compare with members of some static array, e.g.
// this covers the whole range of 32-bit unsigned values
const unsigned int test[] = { 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 };
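unsigned int digits = 10;
// test[digits - 1] is the smallest number with that many digits; stop at 1 so n == 0 gives 1
while (digits > 1 && n < test[digits - 1]) --digits;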
If you are going to pad the number in .Net, then
num.ToString().PadLeft(10, '0')
might do what you want.
You can use a while loop, which will likely be faster than a logarithm because this uses integer arithmetic only:
int len = 0;
while (n > 0) {
len++;
n /= 10;
}
I leave it as an exercise for the reader to adjust this algorithm to handle zero and negative numbers.
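One possible way to do that exercise in C is sketched below, with the hypothetical name `count_digits`. It counts zero as one digit and ignores the minus sign of negative numbers, which is an assumption about what "length" should mean here.

```c
/* Number of decimal digits in n, ignoring any minus sign.
   Uses integer division only; 0 counts as one digit. */
int count_digits(long long n)
{
    int len = 1;
    /* Integer division truncates toward zero in C99 and later,
       so the loop also terminates for negative n, including LLONG_MIN. */
    while (n /= 10)
        ++len;
    return len;
}
```

Starting at 1 and dividing before the test avoids a special case for zero, and never negating n avoids overflow for the most negative value.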
I would have posted a comment but my rep score won't grant me that distinction.
All I wanted to point out was that even though the Log(10) is a very elegant (read: very few lines of code) solution, it is probably the one most taxing on the processor.
I think jherico's answer is probably the most efficient solution and therefore should be rewarded as such.
Especially if you are going to be doing this for a lot of numbers..
Since a number doesn't have leading zeroes, you're converting anyway to add them. I'm not sure why you're trying so hard to avoid it to find the length when the end result will have to be a string anyway.
One solution is provided by the base 10 logarithm, but that is a bit overkill.
You can loop, dividing by 10 each time, and count the number of times you loop:
int num = 423;
int minimum = 1;
while (num >= 10) {
    num = num/10;
    minimum++;
}
Okay, I can't resist: use /=:
#include <stdio.h>
int
main(){
int num = 423;
int count = 1;
while( num /= 10)
count ++;
printf("Count: %d\n", count);
return 0;
}
534 $ gcc count.c && ./a.out
Count: 3
535 $
