Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array for the index of the first occurrence of a number greater than some input, as efficiently as possible. The only way I'd know how to do this without even thinking about it is to iterate through the array, testing the condition until it returns true, at which point I'd stop iterating. However, that is the most expensive solution to this problem, and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?

You should use binary search. Objective-C may even have a built-in method for it (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
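For illustration, here is a minimal lower-bound-style binary search sketch in Java (the function name is mine; the same O(log n) loop ports directly to Objective-C or JavaScript). It returns the index of the first element strictly greater than the input, matching the example in the question:

    // Returns the index of the first element strictly greater than `target`,
    // or numbers.length if no such element exists. Requires a sorted array.
    static int indexOfValueGreaterThan(int[] numbers, int target) {
        int lo = 0, hi = numbers.length;   // search window is [lo, hi)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;  // avoids overflow vs. (lo + hi) / 2
            if (numbers[mid] <= target) {
                lo = mid + 1;              // answer must be to the right of mid
            } else {
                hi = mid;                  // mid could be the answer; keep it
            }
        }
        return lo;
    }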

I don't know about Objective-C, but plain ol' C comes with a function called bsearch (and AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That does a binary search, which sounds like what you need (though note that bsearch only reports an exact match, so for a first-element-greater-than query you'd still hand-roll a loop like the one in the previous answer).

A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a B-tree is probably overkill...

Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it at the input integer, keep the part of the serialized string that holds the bigger integers, then unserialize it and voilà.

Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is in proportion to the amount of data to be searched: doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if the value is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration, so it's logarithmic.

Related

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is: for a hash table whose size is the standard choice of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there are some empty slots? What kind of hash function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing an empty slot merely simplifies the search algorithm: alternatively, the search can remember the starting index i and halt once the entire table has been scanned and it lands back at index i.
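To make that concrete, here is a minimal sketch of the linear-probe lookup in Java (table layout and names are mine; null marks an empty slot). The empty-slot check is what normally stops a failed search; the start-index check is the fallback that keeps a completely full table from looping forever:

    // Linear-probe lookup in an open-addressing table; null marks an empty slot.
    static int probeFor(String[] table, String key) {
        int start = Math.floorMod(key.hashCode(), table.length);
        int i = start;
        do {
            if (table[i] == null) return -1;      // empty slot: key is not present
            if (table[i].equals(key)) return i;   // found it
            i = (i + 1) % table.length;           // probe the next bucket
        } while (i != start);                     // scanned the whole table: give up
        return -1;
    }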

Optimizing algorithms for multiple queries of the same kind

There is a particular class of algorithmic coding problems that require us to evaluate multiple queries, which can be of two kinds:
Perform search over a range of data
Update the data over a given range
One example I've been working on recently is this (though it's not the only one): Quadrant Queries
Now, to optimize my algorithm, I have had one idea :
I can use dynamic programming to keep the search results for a particular range, and generate data for other ranges as required.
For example, if I have to calculate the sum of the numbers in an array from index 4 to 7, I can already keep the sum of elements up to index 4 and the sum of elements up to index 7, which is easy; then I just need the difference of the two plus the 4th element, which is O(1). But this raises another problem: during an update operation, I'd have to update my stored search data for all the elements following the updated element. That seems inefficient, though I haven't tried it in practice.
Someone suggested me that I can combine subsequent update operations using some special data structure.(Actually read it on some forum).
Question: Is there a known way to optimize these kinds of problems? Is there a special data structure that does it? The idea I mentioned: is it possible that it's more efficient than the direct approach? Should I try it out?
It might help:
Segment Trees (Range-Range part)
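For the point-update / range-sum flavor of these problems, a Fenwick (binary indexed) tree is a common lightweight alternative to a full segment tree: both operations are O(log n), which avoids the O(n) re-propagation that the stored-prefix-sum idea in the question would need on every update. A minimal sketch in Java (class and method names are mine):

    // Fenwick tree over n values: point updates and prefix sums in O(log n).
    class Fenwick {
        private final long[] tree; // 1-based internal array

        Fenwick(int n) { tree = new long[n + 1]; }

        void add(int i, long delta) {       // add delta to element i (0-based)
            for (int j = i + 1; j < tree.length; j += j & -j) tree[j] += delta;
        }

        long prefixSum(int i) {             // sum of elements 0..i inclusive
            long s = 0;
            for (int j = i + 1; j > 0; j -= j & -j) s += tree[j];
            return s;
        }

        long rangeSum(int lo, int hi) {     // sum of elements lo..hi inclusive
            return prefixSum(hi) - (lo > 0 ? prefixSum(lo - 1) : 0);
        }
    }

For range updates, a segment tree with lazy propagation is the usual generalization.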

Optimizing Inserting into the Middle of a List

I have algorithms that work with dynamically growing lists (contiguous memory, like a C++ vector, Java ArrayList, or C# List). Until recently, these algorithms would insert new values into the middle of the lists. Of course, this was usually a very slow operation: every time an item was added, all the items after it needed to be shifted to a higher index. Do this a few times per algorithm and things get really slow.
My realization was that I could add the new items to the end of the list and then rotate them into position later. That's one option!
Another option, when I know how many items I'm adding ahead of time, is to add that many items to the back, shift the existing items and then perform the algorithm in-place in the hole I've made for myself. The negative is that I have to add some default value to the end of the list and then just overwrite them.
I did a quick analysis of these options and concluded that the second option is more efficient. My reasoning was that the rotation with the first option would result in in-place swaps (requiring a temporary). My only concern with the second option is that I am creating a bunch of default values that just get thrown away. Most of the time, these default values will be null or a mem-filled value type.
However, I'd like someone else familiar with algorithms to tell me which approach would be faster. Or, perhaps there's an even more efficient solution I haven't considered.
Arrays aren't efficient for lots of insertions or deletions into anywhere other than the end of the array. Consider whether using a different data structure (such as one suggested in one of the other answers) may be more efficient. Without knowing the problem you're trying to solve, it's near-impossible to suggest a data structure (there's no one solution for all problems). That being said...
The second option is definitely the better of the two. A somewhat better variant (avoiding the default-value issue): say you are inserting 456 into the middle of 0123789; simply copy the tail 789 to the end and then overwrite the old 789 with 456, so the only intermediate step is 0123789789.
Your default-value concern is, however, (generally) not a big issue:
In Java, for one, you cannot (to my knowledge) even allocate memory for an array that isn't 0- or null-filled. C++ STL containers also enforce this, I believe (but not C++ arrays themselves).
The size of a pointer is minimal compared to any moderate-sized class, so assigning it a default value also takes minimal time. In Java and C#, everything other than primitives is a reference; in C++ you can store pointers (something like boost::shared_ptr or a pointer-vector is preferable to raw pointers). Primitives are small to start with, so they generally aren't a big issue either.
I'd also suggest forcing a reallocation to a specified size before you start inserting to the end of the array (Java's ArrayList::ensureCapacity or C++'s vector::reserve). In case you didn't know - varying-length-array implementations tend to have an internal array that's bigger than what size() returns or what's accessible (in order to prevent constant reallocation of memory as you insert or delete values).
Also note that there are more efficient methods to copy parts of an array than doing it manually with for loops (e.g. Java's System.arraycopy).
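Putting those pieces together, a minimal Java sketch of the second option: grow once, shift the tail with a single bulk copy, then write the new items into the hole (method and variable names are mine):

    import java.util.Arrays;

    // Insert `items` into `array` at `pos`, shifting the tail right in one copy.
    // Returns a new array for simplicity; a real list would reuse spare capacity.
    static int[] insertAll(int[] array, int pos, int[] items) {
        int[] result = Arrays.copyOf(array, array.length + items.length); // grow once
        System.arraycopy(array, pos, result, pos + items.length, array.length - pos);
        System.arraycopy(items, 0, result, pos, items.length); // fill the hole
        return result;
    }

    // e.g. insertAll({0,1,2,3,7,8,9}, 4, {4,5,6}) -> {0,1,2,3,4,5,6,7,8,9}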
You might want to consider changing your representation of the list from using a dynamic array to using some other structure. Here are two options that allow you to implement these operations efficiently:
An order statistic tree is a modified type of binary tree that supports insertions and selections anywhere in O(log n) time, as well as lookups in O(log n) time. This will increase your memory usage quite a bit because of the overhead for the pointers and extra bookkeeping, but should dramatically speed up insertions. However, it will slow down lookups a bit.
If you always know the insertion point in advance, you could consider switching to a linked list instead of an array, and just keep a pointer to the linked list cell where insertions will occur. However, this slows down random access to O(n), which could possibly be an issue in your setup.
Alternatively, if you always know where insertions will happen, you could consider representing your array as two stacks - one stack holding the contents of the array to the left of the insert point and one holding the (reverse) of the elements to the right of the insertion point. This makes insertions fast, and if you have the right type of stack implementation could keep random access fast.
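For the two-stack idea, here is a minimal sketch in Java (names are mine). Note that O(1) random access by index would require array-backed stacks with positional access rather than ArrayDeque:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Two-stack list ("gap buffer"): `left` holds elements before the cursor
    // (top = nearest), `right` holds the rest in reverse. Insert at cursor is O(1).
    class GapList<T> {
        private final Deque<T> left = new ArrayDeque<>();
        private final Deque<T> right = new ArrayDeque<>();

        void insert(T value) { left.push(value); }   // O(1) insert at the cursor
        void moveLeft()  { right.push(left.pop()); } // shift the cursor one step
        void moveRight() { left.push(right.pop()); }
        int size() { return left.size() + right.size(); }
    }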
Hope this helps!
HashMaps and linked lists were designed for the kind of problem you are having. Given an indexed data structure with numbered items, inserting an item in the middle requires renumbering every item that follows it.
You need a data structure that is optimized to make inserts constant, O(1), complexity. HashMaps were designed to make insert and delete operations lightning quick regardless of dataset size.
I can't pretend to do the HashMap subject justice by describing it. Here is a good intro: http://en.wikipedia.org/wiki/Hash_table

Hashing - What Does It Do?

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute a hash code for each (a cheap operation), and only if the hash codes are equal do I try the more expensive check. If I want to divide representations of numbers into buckets so that representations of the same number land in the same bucket, I can choose the bucket by hash code.
Hashing is a heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code, though they usually don't; two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
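In Java this is exactly the hashCode()/equals() contract: equal objects must have equal hash codes, so a cheap integer comparison can screen out most non-equal pairs before the expensive comparison runs. A minimal sketch (the method name is mine):

    // Cheap screen first; run the expensive comparison only on a hash match.
    static boolean screenedEquals(Object a, Object b) {
        if (a.hashCode() != b.hashCode()) return false; // definitely not equal
        return a.equals(b);                             // hash match: confirm fully
    }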
Hashing is summarizing.
The hash of the sequence of numbers (2, 3, 4, 5, 6) is a summary of those numbers. Their sum, 20, for example, is one kind of summary; it doesn't capture all the available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2 billion, for example), you tend to keep a lot of the right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions between different sequences of numbers that happen to have the same sum.
First, let's state the problem a hashing algorithm is meant to solve.
Suppose you have some data (maybe an array, a tree, or database entries). You want to find a concrete element in this data store (for example, in an array) as fast as possible. How do you do it?
As you build the data store, you can calculate a special value for every item you put in (called its hash value). The way this value is calculated can vary, but ideally it satisfies one condition: the calculated value should be unique for every item.
So now you have a collection of items and a hash value for each. How do you use them? Consider an array of N elements: let's put the items into this array according to their hash values.
Suppose you have to answer the question: does the item "it1" exist in this array? To answer it, simply compute the hash value for "it1" (call it f("it1")) and look at the array at position f("it1"). If the element at this position is not null (and equals our item "it1"), the answer is true; otherwise it is false.
There is also the collision problem: how do you find a function that gives unique hash values for all different elements? Actually, such a function doesn't exist in general, but there are plenty of good functions that give well-distributed values.
An example for better understanding:
Suppose you have an array of strings: A = {"aaa", "bgb", "eccc", "dddsp", ...}, and you have to answer the question: does this array contain the string S?
First, we choose a function for calculating hash values. Let's take the function f that, for a given string, returns its length (actually a very bad hash function, but it makes the idea easy to understand).
So f("aaa") = 3, f("qwerty") = 6, and so on...
Now we calculate hash values for every element in A: f("aaa") = 3, f("eccc") = 4, ...
Let's take an array to hold these items (also called a hash table); call it H (an array of strings). We put our elements into it according to their hash values:
H[3] = "aaa", H[4] = "eccc", ...
And finally, how do we find a given string in this array?
Suppose we are given the string s = "eccc". f("eccc") = 4. So if H[4] == "eccc", the answer is true; otherwise it is false.
But how do we handle the situation where two elements have the same hash value? There are many ways; one of them: each slot in the hash table holds a list of items, so H[4] contains all items whose hash value equals 4. How do we find a concrete element? Very easily: calculate the item's hash value and look through the list of items in HashTable[HashValue]. If one of those items equals the element we are searching for, the answer is true; otherwise it is false.
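A minimal Java sketch of that chained lookup, using the same toy length-based hash (taken modulo the table size so it always fits; class and method names are mine):

    import java.util.ArrayList;
    import java.util.List;

    // Toy chained hash table keyed by the (deliberately bad) length-based hash.
    class ToyHashTable {
        private final List<List<String>> buckets = new ArrayList<>();

        ToyHashTable(int size) {
            for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
        }

        private int hash(String s) { return s.length() % buckets.size(); }

        void add(String s) { buckets.get(hash(s)).add(s); }

        boolean contains(String s) {
            // Only the single bucket for hash(s) needs to be scanned.
            return buckets.get(hash(s)).contains(s);
        }
    }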
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
A hash function applied to some data generates some new data.
It is always the same for the same data.
That's about it.
Another constraint that is often put on it, which I think is not really essential, is that you cannot reconstruct the original data from the hash.
For me that is its own category, called cryptographic or one-way hashing.
There are a lot of demands on certain kinds of hash functions, for example that the hash is always the same length, or that hashes are distributed randomly for any given sequence of input data.
The only important point is that it is deterministic (always the same hash for the same data).
So you can use it, for example, to verify data integrity, validate passwords, etc.
Read all about it here:
http://en.wikipedia.org/wiki/Hash_function
You should read the Wikipedia article first, then come back with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually shorter) value from it (chop), but the obtained value should change even if only a small part of the original value changes (mix).
Let's take x % 9 as an example hash function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand, if we take x % 10, we get:
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... and so on, up to the 193,372nd string)
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear probing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, then the next one, then the next, etc. until the item is found or one hits an empty slot. Linear probing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear probing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain-bucket hashing is often the best approach. The overhead of dealing with the extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
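For instance, a sketch of the double-hashing probe loop in Java (names are mine; it assumes a prime table size so every step size visits all slots):

    // Double hashing: probe i lands at (h1 + i * h2) mod size, where h2 is a
    // second hash that is never zero. With a prime table size, the probe
    // sequence is guaranteed to visit every slot.
    static int findSlot(String[] table, String key) {
        int size = table.length;
        int h1 = Math.floorMod(key.hashCode(), size);
        int h2 = 1 + Math.floorMod(Integer.reverse(key.hashCode()), size - 1);
        for (int i = 0; i < size; i++) {
            int slot = (int) ((h1 + (long) i * h2) % size);
            if (table[slot] == null || table[slot].equals(key)) return slot;
        }
        return -1; // table is full and the key is absent
    }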

Is there a method to generate a single key that remembers all the string that we have come across

I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is storing the file path of each file in a long array and then checking it every time for duplication.
But I think there should be some better way.
Is it possible for me to generate a KEY (which is a number) or something that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist:
    hash = md5(F.name)
    if hash not in storage:
        process file F
        store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
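In Java, that pseudocode could look roughly like this (MessageDigest ships with the JDK; processFile is a hypothetical stand-in for the real work):

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Process each file once, remembering the MD5 of its path.
    static void processAll(List<String> filePaths) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Set<String> seen = new HashSet<>();
        for (String path : filePaths) {
            byte[] digest = md5.digest(path.getBytes(StandardCharsets.UTF_8));
            String hash = new BigInteger(1, digest).toString(16);
            if (seen.add(hash)) {       // add() returns false if already present
                processFile(path);      // hypothetical processing hook
            }
        }
    }

Strictly speaking, hashing buys little over putting the path strings themselves into the set (as the answers below point out); it mainly caps the memory used per entry.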
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that takes on a specific value, such that from that value you can deduce which files have already been processed? I don't know if you'll be able to do that: your space is quite big, and generating unique key representations for such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
A Bloom filter can solve your problem.
The idea of a Bloom filter is simple. It begins as an array of some length with all its members set to zero, plus K hash functions.
Whenever we need to insert an item into the Bloom filter, we hash the item with all K hash functions. Those hashes give us K indexes into the array, and we set the member at each of those indexes to 1.
To check whether an item exists in the Bloom filter, simply hash it with all K hashes and check the corresponding array indexes. If all of them are 1s, the item is (probably) present in the Bloom filter.
Kindly note that a Bloom filter can give false positive results, but it never gives false negatives. You need to tune the filter's size and the number of hashes to keep the false-positive rate acceptable.
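A minimal Bloom filter sketch in Java, deriving the K probe positions from two base hashes (a common trick; class and method names are mine, and real implementations use stronger, independent hashes):

    import java.util.BitSet;

    // Tiny Bloom filter: probe i uses h1 + i * h2 (mod size).
    class ToyBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int k; // number of hash functions

        ToyBloomFilter(int size, int k) {
            this.size = size;
            this.k = k;
            this.bits = new BitSet(size);
        }

        private int index(String s, int i) {
            int h1 = s.hashCode();
            int h2 = Integer.reverse(h1) | 1; // second hash, forced nonzero
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String s) {
            for (int i = 0; i < k; i++) bits.set(index(s, i));
        }

        // true = probably present (false positives possible); false = definitely absent.
        boolean mightContain(String s) {
            for (int i = 0; i < k; i++) if (!bits.get(index(s, i))) return false;
            return true;
        }
    }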
What you need, IMHO, is some sort of tree- or hash-based set implementation. It is basically a data structure that supports very fast add, remove, and query operations and keeps only one instance of each element (i.e. no duplicates). A few hundred thousand strings (assuming they are not themselves hundreds of thousands of characters long) should not be a problem for such a data structure.
Your programming language of choice probably already has one, so you don't need to write it yourself. C++ has std::set, Java has the Set implementations TreeSet and HashSet, and Python has set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable-based sets, O(log n) for tree-based sets). Beyond those, there are lots of free implementations of sets, as well as general-purpose binary search trees and hashtables, that you can use.
