I have an assignment to write an algorithm that finds duplicates in a Dynamic Sorted Array. I want to write this algorithm, but before starting I need to understand the data structure called a Dynamic Sorted Array, and I don't know it. I tried googling but couldn't find anything called a Dynamic Sorted Array. Would you please guide me? What is this data structure and what does it look like? Thanks.
I think your instructor is simply referring to an array that can change and sort itself, so you can assume that it's always in the correct order and that it is of variable length. If the algorithm is to be written in pseudo-code that's probably all you need to know.
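If it helps to see the target algorithm itself, here is a minimal sketch in Java (assuming any language is acceptable for the assignment); because the array is always sorted, duplicates are always adjacent:

import java.util.ArrayList;
import java.util.List;

// One pass comparing neighbours finds every duplicate in a sorted array.
static List<Integer> findDuplicates(List<Integer> sorted) {
    List<Integer> dups = new ArrayList<>();
    for (int i = 1; i < sorted.size(); i++) {
        boolean sameAsPrev = sorted.get(i).equals(sorted.get(i - 1));
        boolean alreadyRecorded = !dups.isEmpty()
                && dups.get(dups.size() - 1).equals(sorted.get(i));
        if (sameAsPrev && !alreadyRecorded) {
            dups.add(sorted.get(i));   // record each duplicated value once
        }
    }
    return dups;
}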
Let's see, you need to understand what a Dynamic Sorted Array is:
You already know what a sorted array is, so let's try to understand what a dynamic array is: it's a growable array with no restriction on its size.
So, to summarize, you need an array which is:
A. Sorted
B. Dynamic in nature (expanding)
How to implement? Read Dynamic arrays overview and implementation in Java and C++
I assume it means an array whose length is dynamic (i.e. unknown at compile-time), and whose values are sorted.
I've never heard of that data structure, but based on the individual words I would guess that it:
Behaves like an array, that is with O(1) access operations get(index) and set(index).
Can be resized if necessary.
Is always sorted.
I don't think though that such a data structure is very efficient for finding duplicates. I would prefer some sort of map, unless you need very simple algorithms.
I would say you may have a typo in your assignment. It probably ought to read "sorted Dynamic Array".
However, a dynamic array which always inserts new items in sorted order would probably fit that terminology. So take your dynamic array:
[2][5][7][9]
Inserting the element '8' would result in the following array:
[2][5][7][8][9]
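For example, in Java the standard library's dynamic array (ArrayList) plus the built-in binary search gives exactly this behaviour:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

List<Integer> a = new ArrayList<>(List.of(2, 5, 7, 9));
int pos = Collections.binarySearch(a, 8);  // negative when 8 is absent
if (pos < 0) pos = -pos - 1;               // decode -(insertionPoint) - 1
a.add(pos, 8);                             // a is now [2, 5, 7, 8, 9]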
I was reading an answer to a question asked here:
Why does hashCode() return an integer and not a long?
My question is: Why hashcode based data structures use an array to create bins?
Because an array is a low-level data structure which allows random access to its elements.
You need a "low-level" data structure to base a "higher-level" data structure on.
You need random access so that you can address bins very fast.
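As a sketch (key and bins stand in for a map's key and its bucket array), this is the kind of index computation a hash-based map performs:

// Reduce the 32-bit hash code to a valid index into the bin array.
int bin = Math.floorMod(key.hashCode(), bins.length);
// java.util.HashMap keeps the bin count a power of two so it can mask instead:
int bin2 = key.hashCode() & (bins.length - 1);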
Because an array is based on integer indexes! You might then show some curiosity and ask why arrays use integer-based indexing. One way to see it: if you could index with other types (say, real numbers) instead of integers, think how many dimensions you would be able to add --
for example --
under index 1 you could add sub-indexes like 1.1, 1.2, 1.1.2, 1.1.1.1.2, and so on and so forth!
Allowing that would create more overhead, rather than getting us to the solution we want.
I have to implement a Trie of codes of a given fixed length. Each code is a sequence of integers, and since some patterns are common, I decided to implement a Trie in order to store all the codes.
I also need to iterate through the codes in their lexicographic order, and I expect to work with millions (maybe billions) of codes.
This is why I considered implementing this particular Trie as a dictionary where each key is the index of a given prefix.
Let's say key 0 has a list of its prefix's children, and for each one I save the corresponding entry in the dictionary...
Example: If my first insertion is the code 231, then the dictionary would look like:
[0]->{(2,1)}
[1]->{(3,2)}
[2]->{(1,3)}
This way, if my second insertion would be 243, the dictionary would be updated this way:
[0]->{(2,1)}
[1]->{(3,2),(4,3)} *Here each list is sorted using a flat_map
[2]->{(1,endMark)}
[3]->{(3,endMark)}
My problem is that I have been using a vector for this purpose, because keeping the whole dictionary in contiguous memory gives me better performance while iterating over it.
Now that I need to work with BIG instances of my problem, resizing the vector means I cannot work with millions of codes (memory consumption could be as much as 200GB).
I have now tried Google's sparse hash instead of the vector, and my question is: do you have any suggestions? Any other alternative in mind? Is there any other way to work with integers as keys to improve performance?
I know I won't have any collisions because each key is different from the rest.
Best regards,
Quentin
I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is storing the file path of each file in a lo----ong array, and then checking it every time for duplication.
But I think there should be some better way.
Is it possible for me to generate a KEY (which is a number) or something that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each file F in filelist
    hash = md5(name of F)
    if hash not in storage
        process file F
        add hash to storage to remember it
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
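A runnable sketch of the same idea in Java; fileList and processFile are placeholders for your own code:

import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// getInstance("MD5") throws the checked NoSuchAlgorithmException,
// so call this from a method that declares or handles it.
Set<String> seen = new HashSet<>();
MessageDigest md5 = MessageDigest.getInstance("MD5");
for (String path : fileList) {
    // digest() computes the hash and resets the digest for the next file
    String hash = Base64.getEncoder().encodeToString(md5.digest(path.getBytes()));
    if (seen.add(hash)) {   // add() returns false when the hash is already stored
        processFile(path);
    }
}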
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that takes on a specific value, and from that value you should be able to deduce which files have been processed already? I don't know if you are going to be able to do that, simply because your space is quite big, and generating unique key representations over such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
A Bloom filter can solve your problem.
The idea of a Bloom filter is simple. It begins with an empty array of some length, with all its members set to zero, and K hash functions.
Whenever we need to insert an item into the Bloom filter, we hash the item with all K hash functions. These hash functions give K indexes into the array, and we set the member at each of those indexes to 1.
To check whether an item exists in the Bloom filter, simply hash it with all K hash functions and check the corresponding array indexes. If all of them are 1s, the item is probably present in the Bloom filter.
Kindly note that a Bloom filter can give false positive results, but it will never give a false negative. You need to tune the Bloom filter (its array size and number of hash functions) to keep the false positive rate acceptable.
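A minimal Bloom filter sketch in Java; the double-hashing trick used below to simulate K hash functions is an illustrative choice, and in practice you would size the bit array and K from the expected item count and your target false-positive rate:

import java.util.BitSet;

class BloomFilter {
    private final BitSet bits;
    private final int size;   // length of the bit array
    private final int k;      // number of hash functions

    BloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Simulate K hash functions via double hashing: h_i = h1 + i * h2.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;   // derived second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(index(item, i));
    }

    boolean mightContain(String item) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(item, i))) return false;  // definitely absent
        return true;   // present, or a false positive
    }
}

Usage would be along the lines of: create one filter, call add(path) after processing a file, and check mightContain(path) before processing the next.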
What you need, IMHO, is some sort of tree-based or hash-based set implementation. It is basically a data structure that supports very fast add, remove, and query operations and keeps only one instance of each element (i.e. no duplicates). A few hundred thousand strings (assuming they are not themselves hundreds of thousands of characters long) should not be a problem for such a data structure.
Your programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable-based sets, O(log(n)) for tree-based sets). Other than those, there are lots of free implementations of sets, as well as general-purpose binary search trees and hashtables, that you can use.
I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(numbers, 100);
var indexAfter7 = getIndexOfValueGreaterThan(numbers, 7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C might even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
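For example, here is the standard "first index greater than" binary search, sketched in Java (the same logic ports directly to Objective-C or C); it matches the semantics of the question's examples:

// Returns the first index whose value is strictly greater than target,
// or a.length when no such value exists.
static int getIndexOfValueGreaterThan(int[] a, int target) {
    int lo = 0, hi = a.length;
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;   // unsigned shift avoids int overflow
        if (a[mid] <= target) {
            lo = mid + 1;            // everything at or before mid is too small
        } else {
            hi = mid;                // mid could be the answer; keep it in range
        }
    }
    return lo;
}
// With the question's sample data:
// getIndexOfValueGreaterThan(new int[]{1, 7, 23, 23, 23, 89, 1002, 1003}, 100) == 6
// getIndexOfValueGreaterThan(new int[]{1, 7, 23, 23, 23, 89, 1002, 1003}, 7)   == 2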
I don't know about Objective-C, but C (plain 'ol C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it by the INT, keep the part of the serialized string that holds the bigger INTs, then unserialize it, and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It is easy, but the work needed is in proportion to the amount of data to be searched. Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if its value is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration; it's logarithmic.
What is the best approach to find out whether a given set (unsorted) is a perfect subset of a main set? I have to do some validation in my program where I have to compare the client's request set with the registered internal capability set.
I thought of doing it by having the internal capability set sorted (it will not change once registered) and doing a binary search for each element in the client's request set. Is that the best I can get? I suspect there might be a better approach.
Any ideas?
Regards,
Microkernel
Assuming that your language of choice doesn't already implement a set class with a "contains in a set" method, like Java does with HashSet...
A good approach is to use hashmaps (aka hashes, aka associative arrays).
If your superset is not too big, generate a hashmap mapping each object in the larger set to a true value.
Then loop over each element in the subset and try to find the element in the generated hashmap.
If you fail, your small set is NOT a proper subset. If you finish the loop without failing, it is.
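In Java, for instance, HashSet gives you this check directly (capabilityList and requestList are hypothetical stand-ins for your data):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

List<String> capabilityList = List.of("read", "write", "execute");  // placeholder data
List<String> requestList = List.of("read", "write");                // placeholder data

Set<String> capabilities = new HashSet<>(capabilityList);   // build the hash set once
boolean isSubset = capabilities.containsAll(requestList);   // each lookup is O(1) amortized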
It depends on how many elements are in your sets.
For bigger sets, using a HashSet for the main set usually turns out to give the best performance.
Since you know the internal capability set, you can use a perfect hash function to test the elements of the client request set.