Initializing an Array in the Context of Studying Data Structures - data-structures

I am reading CLRS's Introduction to Algorithms, and Exercise 11.1-4 in the book, under the section Direct-Address Tables, reads:
We wish to implement a dictionary by using direct addressing on a huge array. At
the start, the array entries may contain garbage, and **initializing** the entire array
is impractical because of its size. Describe a scheme for implementing a direct-address
dictionary on a huge array. Each stored object should use O(1) space;
the operations SEARCH, INSERT, and DELETE should take O(1) time each; and
initializing the data structure should take O(1) time. (Hint: Use an additional array,
treated somewhat like a stack whose size is the number of keys actually stored in
the dictionary, to help determine whether a given entry in the huge array is valid or
not.)
I understand the solution is just to create another array, and have it store pointers to this array for elements that exist.
But I'm slightly confused as to the meaning of "initialize" in this context.
If the array is not initialized, how can we even access the data (i.e. get the value at the i-th position with A[i])?
I'm also not sure why the question states this memory constraint. Suppose we could initialize the array, how would the answer change?

The problem is that initializing an array of length N -- setting all the elements to a known value like NULL -- takes O(N) time.
If you have an array that is initialized to NULL, then implementing a direct access table is super easy -- A[i] == NULL means there is no value for i, and if there is a value for i, then it's stored in A[i].
The question is about how to avoid the O(N) initialization cost. If the array is not initialized, then the initial values for all A[i] could be anything at all... so how do you tell if it's a real value or just the initial garbage?
The solution is not just to create another array that stores pointers to the original -- you would have to initialize that other array and then you've wasted O(N) time again.
To avoid that cost altogether, you have to be more clever.
Make 3 arrays A, B, and C, and keep a count N of the total number of values in the dictionary.
Then, if the value for i is v:
(1) A[i] = v;
(2) 0 <= B[i] < N; and
(3) C[B[i]] = i
This way, the B and C arrays let you keep track of which indexes in A have been set to a real value, without initializing any of the arrays. When you add a new item, you check conditions (2) and (3) to see if the index is valid, and if it isn't, then you do:
A[i] = NULL
B[i] = N
C[N++] = i
This marks index i as valid, and conditions (2) and (3) will then pass for all future checks.
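Here is a minimal Python sketch of the scheme above. Python has no truly uninitialized arrays, so the "garbage" is simulated with random values; the class and names are illustrative only.
import random

class DirectAddressDict:
    """Direct-address table with O(1)-style initialization, per the A/B/C scheme above."""

    def __init__(self, universe_size):
        # Simulate "uninitialized" memory by filling the arrays with garbage.
        self.A = [random.randint(0, 10**9) for _ in range(universe_size)]
        self.B = [random.randint(0, 10**9) for _ in range(universe_size)]
        self.C = [random.randint(0, 10**9) for _ in range(universe_size)]
        self.n = 0  # number of indexes marked valid so far

    def _valid(self, i):
        # Conditions (2) and (3): B[i] points into the used part of C,
        # and C[B[i]] points back at i.
        return 0 <= self.B[i] < self.n and self.C[self.B[i]] == i

    def search(self, i):
        return self.A[i] if self._valid(i) else None

    def insert(self, i, v):
        if not self._valid(i):
            self.B[i] = self.n
            self.C[self.n] = i
            self.n += 1
        self.A[i] = v

    def delete(self, i):
        if self._valid(i):
            self.A[i] = None  # None in A[i] means "no value stored for i"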
Because of the amount of memory it takes, this technique isn't often used in practice, BUT it does mean that theoretically, you never have to count the cost of array initialization when calculating run time complexity.

In that context, initializing means setting the values inside the array to NULL, 0 or the empty value for the stored type. The idea is that when allocating the memory for the array, the content of that allocated memory is random, so the array ends up containing random values. In this situation initializing the values means setting them to the "empty" value.

Related

Ruby- delete a value from sorted (unique) array at O(log n) runtime

I have a sorted array (unique values, not duplicated).
I know I can use Array#bsearch, but it finds values; it doesn't delete them.
Can I delete a value at O(log n) as well? How?
Let's say I have this array:
arr = [-3, 4, 7, 12, 15, 20] #very long array
And I would like to delete the value 7.
So far I have this:
arr.delete(7) #I'm quite sure it's O(n)
Assuming Array#delete_at works in O(1),
I could do arr.delete_at(value_index)
Now I just need to get the value's index.
Binary search can do it, since the array is already sorted.
But the only method that takes advantage of the sorted property (that I know of) is binary search, which returns values; it says nothing about deleting or returning indexes.
To sum it up:
1) How do I delete a value from a sorted, duplicate-free array in O(log n)?
Or
2) Assuming Array#delete_at works in O(1) (does it?), how can I get the value's index in O(log n)? (I mean, the array is already sorted; must I implement the search myself?)
Thank you.
The standard Array implementation has no constraints on sorting or duplicates. Therefore, the default implementation has to trade performance for flexibility.
Array#delete deletes an element in O(n). Here's the C implementation. Notice the loop
for (i1 = i2 = 0; i1 < RARRAY_LEN(ary); i1++) {
...
}
The cost is justified by the fact that Ruby has to scan all the items matching the given value (note that delete deletes all the entries matching a value, not just the first), then shift the following items to compact the array.
delete_at has the same cost. It deletes the element at the given index, but then it uses memmove to shift the remaining entries down by one position in the array.
Using a binary search will not change the cost. The search will cost you O(log n), but you will still need to delete the element at the given index. In the worst case, when the element is in position [0], the cost of shifting all the other items in memory by one position will be O(n).
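To illustrate the point in code (a Python sketch rather than Ruby, since the cost argument is the same): the binary search finds the index in O(log n), but deleting from a contiguous array still shifts the tail, so the whole operation stays O(n).
import bisect

arr = [-3, 4, 7, 12, 15, 20]          # sorted, unique

i = bisect.bisect_left(arr, 7)        # O(log n) to find the index
if i < len(arr) and arr[i] == 7:
    del arr[i]                        # O(n): remaining elements are shifted left

print(arr)                            # [-3, 4, 12, 15, 20]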
In all cases, the cost is O(n). This is not unexpected: the default Array in Ruby is a plain, contiguous array, and, as said before, there are no specific constraints that could be used to optimize operations. Easy iteration and manipulation of the collection is the priority.
Array, sorted array, list and sorted list: all these data structures are flexible, but you pay the cost in some specific operations.
Back to your question, if you care about performance and your array is sorted and unique, you can definitely take advantage of it. If your primary goal is finding and deleting items from your array, there are better data structures. For instance, you can create a custom class that stores your array internally as a d-heap, where delete() costs O(log_d n); the same applies if you use a binomial heap.

Data structure supporting O(1) remove/insert/findOldest?

This question was asked in the interview:
Propose and implement a data structure that works with integer data from a finite, contiguous range of integers. The data structure should support O(1) insert and remove operations, as well as findOldest (return the oldest value inserted into the data structure).
No duplication is allowed (i.e. if some value is already inside, it should not be added once more).
Also, if needed, some init operation may be used for initialization.
I proposed a solution using an array (sized to the range) of 1/0 flags indicating whether each value is inside. It solves insert/remove but requires O(range size) initialization.
But I have no idea how to implement findOldest with the given constraints.
Any ideas?
P.S. No dynamic allocation is allowed.
I apologize if I've misinterpreted your question, but the sense I get is that
You have a fixed range of values you're considering (say, [0, N))
You need to support insertions and deletions without duplicates.
You need to support findOldest.
One option would be to build an array of length N, where each entry stores a boolean "is active" flag as well as a pointer. Additionally, each entry has a doubly-linked list cell in it. Intuitively, you're building a bitvector with a linked list threaded through it storing the insertion order.
Initially, all bits are set to false and the pointers are all NULL. When you do an insertion, set the bit on the appropriate cell to true (returning immediately if it's already set), then update the doubly-linked list of elements by appending this new cell to it. This takes time O(1). To do a findOldest step, just query the pointer to the oldest element. Finally, to do a removal step, clear the bit on the element in question and remove it from the doubly-linked list, updating the head and tail pointer if necessary.
All in all, all operations take time O(1) and no dynamic allocations are performed because the linked list cells are preallocated as part of the array.
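Here is a small Python sketch of that design, purely illustrative: Python lists stand in for the preallocated fixed-size arrays, and -1 plays the role of a null pointer.
class OldestTracker:
    def __init__(self, n):
        self.active = [False] * n
        self.prev = [-1] * n       # previous value in insertion order
        self.next = [-1] * n       # next value in insertion order
        self.head = -1             # oldest inserted value
        self.tail = -1             # newest inserted value

    def insert(self, v):
        if self.active[v]:
            return                 # no duplicates allowed
        self.active[v] = True
        self.prev[v] = self.tail
        self.next[v] = -1
        if self.tail != -1:
            self.next[self.tail] = v
        else:
            self.head = v
        self.tail = v

    def remove(self, v):
        if not self.active[v]:
            return
        self.active[v] = False
        p, n = self.prev[v], self.next[v]
        if p != -1:
            self.next[p] = n
        else:
            self.head = n
        if n != -1:
            self.prev[n] = p
        else:
            self.tail = p

    def find_oldest(self):
        return self.head if self.head != -1 else None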
Hope this helps!

Best data structure to store lots of one-bit data

I want to store lots of data so that:
they can be accessed by an index, and
each datum is just yes or no (so probably one bit is enough for each).
I am looking for the data structure that has the highest performance and occupies the least space.
Storing the data in flat memory, one bit per datum, is probably not a good choice; on the other hand, different kinds of tree structures still use lots of memory (e.g. pointers are needed in each node to build the tree, even though each node holds just one bit of data).
Does anyone have any Idea?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient)?
Well, in Java the BitSet might be a good choice: http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants) and returning the result of arr[n / n_bits] & (1 << (n % n_bits)).
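As a concrete illustration of the bit-packing approach described above, here is a small Python sketch (the class and method names are just for the example); it uses a bytearray as the single block of memory, with 8 flags per byte.
class BitArray:
    def __init__(self, n_bits):
        self.data = bytearray((n_bits + 7) // 8)   # packed storage, 8 bits per byte

    def get(self, i):
        return (self.data[i // 8] >> (i % 8)) & 1

    def set(self, i, value):
        if value:
            self.data[i // 8] |= 1 << (i % 8)
        else:
            self.data[i // 8] &= ~(1 << (i % 8))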
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-). Quote from the wiki page above:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
Removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. Such removal would introduce a possibility for false negatives, which are not allowed.
One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. However, it is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
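To give a feel for the add/query steps described in the quote, here is a toy Python sketch of a Bloom filter that derives its k indices from two hash values using the double-hashing trick mentioned above; it is illustrative only, not a production implementation.
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k indices from two 64-bit hash values: pos_j = (h1 + j*h2) mod m.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for j in range(self.k):
            yield (h1 + j * h2) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))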

Algorithm to find duplicates in multiple linked lists

What is the fastest method of finding duplicates across multiple (large) linked lists?
I will attempt to illustrate the problem with arrays instead just to make it a bit more readable. (I used numbers from 0-9 for simplicity instead of pointers).
list1[] = {1,2,3,4,5,6,7,8,9,0};
list2[] = {0,2,3,4,5,6,7,8,9,1};
list3[] = {4,5,6,7,8,9,0,1,2,3};
list4[] = {8,2,5};
list5[] = {1,1,2,2,3,3,4,4,5,5};
If I now ask: 'does the number 8 exist in list1-5?' I could sort the lists, remove duplicates, repeat this for all lists, merge them into a "superlist", and see whether the number of (new) duplicates equals the number of lists I searched through. Assuming I get the correct number of duplicates, I can conclude that what I searched for (8) exists in all of the lists.
If I instead searched for 1, I would only get four duplicates, so it is not found in all of the lists.
Is there a faster/smarter/better way to achieve the above without sorting and/or changing the lists in any way?
P.S.: This question is asked mostly out of pure curiosity and nothing else! :)
Just put each number into a hash table and store the number of occurrences for that item in the table. When you find another, just increment the counter. O(n) algorithm (n items across all the lists).
If you want to store the lists that each number occurs in, then you need a set representation to be stored under each item as well. You can use any set representation -- bit vector, list, array, etc. This will tell you which lists that item is a member of. This does not change it from O(n), just increases the work by a constant factor.
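As a quick Python sketch of this hash-table idea, using the example lists from the question: count occurrences and record which lists each value appears in, then "present in every list" becomes a single lookup.
from collections import Counter, defaultdict

lists = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
    [0, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    [4, 5, 6, 7, 8, 9, 0, 1, 2, 3],
    [8, 2, 5],
    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
]

counts = Counter()                  # value -> total number of occurrences
seen_in = defaultdict(set)          # value -> indices of the lists containing it

for idx, lst in enumerate(lists):
    for value in lst:
        counts[value] += 1
        seen_in[value].add(idx)

def in_all_lists(value):
    return len(seen_in[value]) == len(lists)

print(in_all_lists(2))   # True: 2 appears in every list above
print(in_all_lists(1))   # False: 1 does not appear in the fourth list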
Define an array hash and set all the location values to 0
define hash[MAX_SYMBOLS] = {0};
define new_list[LENGTH]
define list[LENGTH] and populate
Now, for each element in your list, use the number as an index into hash and increment that location of hash. Each occurrence of that number increments the value at that hash location once, so a duplicated value i will have hash[i] > 1
for i=0 to (n - 1)
do
increment hash[list[i]]
endfor
If you want to remove the duplicates and create a new list, then scan the original list and, for each value list[i] whose hash entry is non-zero, copy it into the new list and clear the entry, so that each value is kept once, in the order in which it first appeared in the original list.
define j = 0
for i=0 to (n - 1)
do
if hash[list[i]] is not 0
then
new_list[j] := list[i]
increment j
hash[list[i]] := 0
endif
endfor
Note that with negative numbers you will not be able to use the values directly as indexes. To handle negative numbers, first find the largest magnitude among the negative numbers and add that magnitude to each number when using it to index the hash array.
find the highest magnitude of negative value into min_neg
for i=0 to (n - 1)
do
increment hash[list[i] + min_neg]
endfor
Alternatively, in an implementation you can allocate contiguous memory and then set a pointer to the middle of the allocated block, so that you can move in both directions and use negative indexes with it. You need to make sure that there is enough memory on both sides of the pointer.
int *hash = malloc(sizeof(int) * SYMBOLS);
int *hash_ptr = hash + SYMBOLS / 2;
Now you can do hash_ptr[-6] or, in general, hash_ptr[i] with -SYMBOLS/2 <= i < SYMBOLS/2.
The question is a bit vague, so the answer depends on what you want.
A hash table is the correct answer for asking general questions about duplicates, because it allows you to go through each list just once to build a table that will answer most questions; however, some questions will not require one.
Possible cases that seem to answer your question:
Do you just need to know if a certain value is present in each list? - Check through the first list until the value is found. If not, you're done: it is not. Repeat for each successive list. If all lists are searched and the value found, it is duplicated in each list. In this algorithm, it is not necessary to look at each value in each list, or even each list, so this would be the quickest.
Do you need to know whether any duplicates exist at all?
- If any value in a hash table keyed by number has a count greater than 1, there are duplicates... If that is all you need to know, you can quit right there.
Do you need the number of duplicates
in each list, separately?
- Multiply each value by the number of lists and add the index of the list being processed. Store that as the hash key and count occurrences. When all lists are processed, you have a table that can answer all kinds of questions (see the sketch after this list). To check duplicates for a specific value, multiply it by the list count and examine the consecutive hash keys. If there is one for each list, the number is present in each list. If all the counts over that range are greater than 1, the number is duplicated within each list.
Etc.
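As a small illustration of the composite-key scheme in the last case, here is a hypothetical Python sketch (the example lists are made up for the illustration):
from collections import Counter

lists = [[1, 2, 3], [2, 3, 4], [2, 5, 6]]
k = len(lists)

counts = Counter()
for list_no, lst in enumerate(lists):
    for value in lst:
        # Composite key: value * (number of lists) + index of the list.
        counts[value * k + list_no] += 1

def present_in_every_list(value):
    return all(counts[value * k + i] > 0 for i in range(k))

print(present_in_every_list(2))   # True: 2 appears in all three lists
print(present_in_every_list(1))   # False: 1 appears only in the first list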

Data structure for an ordered set with many defined subsets; retrieve subsets in same order

I'm looking for an efficient way of storing an ordered list/set of items where:
The order of items in the master set changes rapidly (subsets maintain the master set's order)
Many subsets can be defined and retrieved
The number of members in the master set grow rapidly
Members are added to and removed from subsets frequently
Must allow for somewhat efficient merging of any number of subsets
Performance would ideally be biased toward retrieval of the first N items of any subset (or merged subset), and storage would be in-memory (and maybe eventually persistent on disk)
I am a new member of this forum; I hope you have not forgotten about this old question :)
Solution
Store the master set in an indexed data structure, such as an array (or an ArrayList, if your library supports it). Let's assume you can associate an id with each set (if not, then how do you know which set to retrieve?). So we now need a way to find out which elements of your array participate in that set and which ones don't.
Use a matrix (n x m) with n being the number of elements in your array and m being the initial number of sets. i refers to the row index and j refers to the column index.
A[i][j] = 0 if ith element is not in jth set
A[i][j] = 1 if ith element is in jth set
Do not use a simple two-dimensional array; go for an ArrayList<ArrayList>. Java/C#/C++ support such generic constructs, and it shouldn't be terribly hard to do the same in other languages such as Perl. In C# you can even use a DataTable.
Time to add a new set
You can add a new set in O(n) time. Simply add a new column for that set and set the appropriate rows to 1 for that column. There will be no need to sort this set as long as the original array is sorted.
Time to add a new element
In a simple sorted array, time for insertion is O(log n). In our case, we will first add the element to the array (and at whatever index we add it, the matrix also gets an all-zero row at that index). Then we set entries in that row to 1 for each set the element belongs to. That way, the worst case runtime becomes O(log n) + O(m).
Time to fetch first N elements from a set
Pick up the column corresponding to the set in O(1) time and then pick the first N entries that are 1. This will be linear.
Time to merge two sets
Let's say we are merging sets at j1 and j2 into a third set at j3.
for (int i = 0; i < n; i++) {
A[i][j3] = A[i][j1] | A[i][j2];
}
This is again linear.
Time to remove an element
First find the element in the master array; this takes O(log n) time. Then remove it from that array and remove the row at that index from the matrix.
Efficient deletions from the array
Don't simply remove entries; just mark them defunct. Once a threshold number of rows/columns are defunct, you can consolidate. Similarly, start with a high initial capacity for the arrays. Modern implementations should do this automatically, though.
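To tie the pieces together, here is a rough Python sketch of the master-array-plus-membership-matrix idea (class and method names are mine, purely illustrative):
import bisect

class SubsetStore:
    def __init__(self):
        self.master = []        # sorted master set
        self.matrix = []        # matrix[i][j] == 1 iff master[i] is in set j
        self.num_sets = 0

    def add_set(self):
        for row in self.matrix:              # O(n): one new column of zeros
            row.append(0)
        self.num_sets += 1
        return self.num_sets - 1             # id (column index) of the new set

    def add_element(self, value):
        i = bisect.bisect_left(self.master, value)   # find sorted position
        if i < len(self.master) and self.master[i] == value:
            return                           # already present
        self.master.insert(i, value)
        self.matrix.insert(i, [0] * self.num_sets)   # all-zero row for the element

    def set_membership(self, value, set_id, member=True):
        # Assumes the value was added to the master set first.
        i = bisect.bisect_left(self.master, value)
        self.matrix[i][set_id] = 1 if member else 0

    def first_n(self, set_id, n):
        # Linear scan down the column; master order is preserved automatically.
        out = []
        for i, row in enumerate(self.matrix):
            if row[set_id]:
                out.append(self.master[i])
                if len(out) == n:
                    break
        return out

    def merge(self, j1, j2):
        j3 = self.add_set()
        for row in self.matrix:
            row[j3] = row[j1] | row[j2]
        return j3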
