Find Next and Previous Timestamps Within Ruby Array of Timestamps for a Given Timestamp - ruby

So if I have an array of timestamps like so (many more than this in reality):
2013-07-27 18:02:59.865572
2013-07-27 18:29:00.132601
2013-07-27 19:00:00.081585
2013-07-27 19:29:00.273857
2013-07-27 20:00:00.011761
And I wanted to find which two timestamps 2013-07-27 19:13:00.081585 falls between, what would be the most elegant way with Ruby?
I can envision an ugly bunch of loops and if statements to do this, but being a novice Ruby programmer I suspect there is a much more elegant way to do this (that I absolutely cannot find!).
Thanks!

It depends on a few things.
Whether the timestamp you're looking for is known to be in the array.
What between means.
Whether elements in the array are unique.
Let's assume the array is sorted, or that you'll sort it yourself beforehand.
If your_timestamp is known to be in the array, you can find its index with timestamp_array.index(your_timestamp). Logically, the elements your_timestamp is between will have indexes immediately above and below. There are two things to watch for.
Falling off either end of the array.
Duplicate timestamps.
If your_timestamp is either the first or last element in the array, you won't have an element with an index immediately below the first or immediately above the last.
If your array contains duplicate timestamps, you're liable to return your_timestamp as one of the values. It seems like you don't want to do that, but there isn't strictly a right or wrong answer here. It's application-dependent.
If you don't know whether your_timestamp is in the array, or if you don't want your_timestamp as one of the values (unless it's the first or last element of the sorted array, that is), then this might be a better approach.
answer = []
timestamp_array.sort.each_cons(2) { |ts|
  # If your desired timestamp is in the timestamp array, you'll
  # get at least two pairs of timestamps.
  answer.concat ts if your_desired_timestamp.between?(ts[0], ts[1])
}
# If you have more than 2 elements, return only the first and last element.
if answer.length > 2
  answer = answer.first, answer.last
end
p answer
["2013-07-27 18:29:00.132601", "2013-07-27 19:29:00.273857"]
This works correctly for duplicate timestamps, and there's no danger of falling off either end of the array.
Some optimizations are available. For example, you can switch to a binary search (bsearch method), which might be worthwhile if you have very large arrays; you can eliminate the conditional if answer.length > 2; etc.
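The binary-search optimization mentioned above could look roughly like this. It is only a sketch under stated assumptions: neighbors is a hypothetical helper name, Array#bsearch_index needs Ruby 2.3 or later, and the timestamps are assumed to be either Time objects or strings in the fixed-width format from the question (which sort chronologically when compared as strings).

```ruby
# Hypothetical helper: returns [previous, next] for target in a sorted array.
# Entries equal to target are skipped, so target itself is never returned.
def neighbors(sorted, target)
  # Index of the first element >= target, or the array length if there is none.
  i = sorted.bsearch_index { |ts| ts >= target } || sorted.length
  prev_ts = i.zero? ? nil : sorted[i - 1]
  # Step over any duplicates of target to reach the next larger element.
  j = i
  j += 1 while j < sorted.length && sorted[j] == target
  [prev_ts, sorted[j]]
end

timestamp_array = [
  "2013-07-27 18:02:59.865572",
  "2013-07-27 18:29:00.132601",
  "2013-07-27 19:00:00.081585",
  "2013-07-27 19:29:00.273857",
  "2013-07-27 20:00:00.011761"
].sort

p neighbors(timestamp_array, "2013-07-27 19:13:00.081585")
# => ["2013-07-27 19:00:00.081585", "2013-07-27 19:29:00.273857"]
```

A nil in either position means the target falls before the first or after the last element, which covers the "falling off either end" case.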

So someone else left an answer and then redacted it for some reason, I think it was because there was an error, but it led me in the right direction, as did #squiguy.
timestamp_array.sort.each_cons(2).select{ |a,b|
  puts a
  if a < your_desired_timestamp and b > your_desired_timestamp
    puts 'this is the valid range for ' + your_desired_timestamp.to_s
  end
  puts b
}
Thanks Guys and Gals!

Related

How to search in polynomial hash table in ascending order in C?

Hi there guys,
I'm developing a small program in C that reads strings from a .txt file in a format of 2 letters followed by 3 numbers. Like this
AB123
I developed a polynomial hash function that calculates a hash key like this
hash key(k) = k1 + k2*A^2 + k3*A^3... + kn*A^n
where k1 is the 1st letter of the word, k2 the 2nd (...) and A is a prime number chosen to reduce the number of collisions; in my case it's 11.
OK, so I got the table generated and I can search in it no problem, but only if I have the full word. That much I could figure out.
But what if I only want to use the first letter? Is it possible to search in the hash table and get the elements starting with, for example, 'A' without going through every element?
In order to have more functionality you have to introduce more data structures. It all depends on how deep you want to go, which depends on what exactly you need the code to do.
I suspect that you want some kind of filtering for the user. When the user enters "A" they should be given all strings that start with "A", and when they then enter "B" the list should be filtered down to all strings starting with "AB".
If this is the case then you don't need over-complicated structures. Just iterate through the list and give the user the appropriate sublist. Humans are slow, and they won't notice the difference between 3 ms response and 300 ms response.
If your hash function is well designed, every place in the table is capable of storing a string beginning with any prefix, so this approach is doomed from the start.
It sounds like what you really want might be a trie.
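To make the trie suggestion concrete, here is a minimal sketch in Ruby rather than the asker's C. The helper names trie_insert, trie_with_prefix and trie_collect are hypothetical, and the trie is simply built from nested hashes with an :end marker for complete words.

```ruby
# Insert a word into a trie made of nested hashes; :end marks a complete word.
def trie_insert(root, word)
  node = root
  word.each_char { |ch| node = (node[ch] ||= {}) }
  node[:end] = true
end

# Collect every complete word stored at or below node.
def trie_collect(node, so_far)
  words = []
  words << so_far if node[:end]
  node.each do |ch, child|
    next if ch == :end
    words.concat(trie_collect(child, so_far + ch))
  end
  words
end

# Walk down the prefix, then gather everything underneath it.
def trie_with_prefix(root, prefix)
  node = root
  prefix.each_char do |ch|
    node = node[ch]
    return [] if node.nil?
  end
  trie_collect(node, prefix)
end

root = {}
%w[AB123 AC456 BA789].each { |w| trie_insert(root, w) }
p trie_with_prefix(root, "A")   # => ["AB123", "AC456"]
```

Only the nodes under the prefix are visited, so entries starting with other letters are never touched.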

Hash Table and Substring Matching

I have hundreds of keys for example like:
redapple
maninred
foraman
blueapple
I have data related to these keys; each piece of data is a string that ends with its key.
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple
I am expected to use a hash table and a hash function to store the data by key, and to be able to retrieve data from the table.
I know how to use a hash function and a hash table; there is no problem here.
But:
I am expected to give the program a string that occurs as a substring of the keys, and retrieve the data for the matching keys.
For example:
I must give "red" and must be able to get
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
as output.
or
I must give "apple" and must be able to get
redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple
as output.
I can only think of searching all keys for a matching substring. Is there some other solution? If I search all the key strings for every query, then hashing is unnecessary and meaningless, isn't it?
But searching all keys for a substring is O(N), and I am expected to solve the problem in O(1).
With hashing I can hash a key, e.g. "redapple" to 943 and "maninred" to 332.
If the person querying gives the string "red", how can I find out from 943 and 332 that those keys contain the substring "red"? It is beyond my CS thinking skills.
Thanks for any advice or ideas.
You could use an inverted index over n-grams; the same approach is used for spelling correction. For the word redapple you will have the following set of 3-grams: red, eda, dap, app, ppl, ple. For each n-gram you keep a list of the strings that contain it. For example, for red it will be
red -> maninred, redapple
The words in each list must be kept sorted. When you want to find all strings that contain a given substring, you divide the substring into n-grams and intersect the word lists for those n-grams.
This algorithm is not O(n), and in practice it is fast enough.
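A rough Ruby sketch of the n-gram inverted index described above follows. The function names ngrams, build_index and search are hypothetical, a Set stands in for the sorted word lists, and a final include? check is added as an assumption of this sketch, because sharing every n-gram does not by itself guarantee that the key contains the query as a contiguous substring.

```ruby
require 'set'

# All contiguous substrings of length n (empty if the string is shorter than n).
def ngrams(str, n = 3)
  (0..str.length - n).map { |i| str[i, n] }
end

# Map each n-gram to the set of keys that contain it.
def build_index(keys, n = 3)
  index = Hash.new { |h, k| h[k] = Set.new }
  keys.each { |key| ngrams(key, n).each { |g| index[g] << key } }
  index
end

# Intersect the key sets of the query's n-grams, then verify each candidate.
def search(index, keys, query, n = 3)
  grams = ngrams(query, n)
  # Queries shorter than n have no n-grams, so fall back to a full scan.
  return keys.select { |k| k.include?(query) }.sort if grams.empty?
  candidates = grams.map { |g| index.fetch(g, Set.new) }.reduce(:&)
  candidates.select { |k| k.include?(query) }.to_a.sort
end

keys = %w[redapple maninred foraman blueapple]
index = build_index(keys)
p search(index, keys, "red")     # => ["maninred", "redapple"]
p search(index, keys, "apple")   # => ["blueapple", "redapple"]
```

The matching keys can then be used to look up the associated data in an ordinary hash table.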
It cannot be done nicely in a hash table. Given a substring, you cannot predict the hashed result of the entire string (1).
A reasonable alternative is a suffix tree. Each terminal in the suffix tree holds a list of references to the complete strings that the suffix belongs to.
Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s such that t is a prefix of x. Traverse the suffix tree while reading t, then find all the terminals reachable from the node you reached. These terminals contain all the needed strings.
(1) Assuming a reasonable hash function; if hashCode() == 0 for each element, you can obviously predict the hash value.
I researched this problem recently and I'm sure that it cannot be done. Like you, I hoped a hash table would help me improve search speed, but I was disappointed.

Hashing - What Does It Do?

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hash codes for each (a cheap operation), and only if the hash codes are equal do I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number are in the same bucket, I can choose the bucket by hash code.
Hashing is heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code but usually don't, whereas two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
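The rule that equivalent objects must produce the same hash code can be illustrated with a small Ruby sketch. Fraction is a hypothetical class written for this example (it assumes a positive denominator); it relies on Ruby's convention that Hash lookups use hash first and eql? second.

```ruby
# Hypothetical value class: fractions are equivalent when their reduced forms match.
class Fraction
  attr_reader :num, :den

  def initialize(num, den)
    g = num.gcd(den)
    @num = num / g
    @den = den / g
  end

  # The expensive-in-principle equivalence check.
  def eql?(other)
    other.is_a?(Fraction) && num == other.num && den == other.den
  end
  alias_method :==, :eql?

  # Cheap hash code computed from the reduced form, so equivalent
  # fractions always land in the same bucket.
  def hash
    [num, den].hash
  end
end

table = { Fraction.new(42, 1) => "forty-two" }
p table[Fraction.new(84, 2)]   # => "forty-two"
p table[Fraction.new(85, 2)]   # => nil
```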
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary that doesn't include all available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2 billion, for example) you tend to keep a lot of right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions between other sequences of numbers that also happen to have the same sum.
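A tiny Ruby sketch of the "sum hash" makes the collision problem visible. The name sum_hash and the modulus of 2 billion are illustrative assumptions taken from the description above.

```ruby
# Simplistic hash: add everything up and keep the result under the modulus.
def sum_hash(values, modulus = 2_000_000_000)
  values.sum % modulus
end

p sum_hash([2, 3, 4, 5, 6])   # => 20
p sum_hash([6, 5, 4, 3, 2])   # => 20 (same numbers, different order)
p sum_hash([1, 19])           # => 20 (entirely different sequence)
```

All three inputs collapse to the same summary, which is exactly the kind of collision a fairer hash tries to avoid.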
First we should describe the problem that hashing solves.
Suppose you have some data (maybe an array, a tree, or database entries). You want to find a particular element in this datastore (for example in an array) as fast as possible. How do you do it?
When you build this datastore, you can calculate a special value for every item you put in it (it is called the HashValue). There are different ways to calculate this value, but all methods should satisfy a special condition: the calculated value should be unique for every item.
So now you have an array of items, and for every item you have its HashValue. How do you use it? Say you have an array of N elements. Let's put your items into this array according to their HashValues.
Suppose you have to answer this question: does the item "it1" exist in this array? To answer it you can simply find the HashValue for "it1" (let's call it f("it1")) and look in the array at position f("it1"). If the element at this position is not null (and equals our "it1" item), the answer is true. Otherwise the answer is false.
There is also the collision problem: how do you find the ideal function that gives unique HashValues for all different elements? Actually, such a function doesn't exist. There are a lot of good functions that can give you good values.
An example for better understanding:
Suppose you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}, and you have to answer the question: does this array contain String S?
First, we have to choose a function for calculating HashValues. Let's take a function f that, for a given string, returns the length of the string (actually, it's a very bad function, but I chose it because it is easy to understand).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
So now we calculate HashValues for every element in array A: f("aaa")=3, f("eccc")=4,...
Let's take an array for holding these items (it is also called the HashTable) and call it H (an array of strings). Now we put our elements into this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc",...
And finally, how do we find a given String in this array?
Suppose you are given a String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer will be true, otherwise it will be false.
But how do we handle situations where two elements have the same HashValue? There are a lot of ways to do it. One of them: each element in the HashTable contains a list of items. So H[4] will contain all items whose HashValue equals 4. And how do we find a particular element? It's very easy: calculate the HashValue for this item and look through the list of items in HashTable[HashValue]. If one of these items equals the element we are searching for, the answer is true, otherwise the answer is false.
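The walk-through above, including the list-per-slot fix for collisions, can be sketched in Ruby as follows. LengthHashTable is a hypothetical class name, and the string-length function f is kept deliberately bad, as in the explanation.

```ruby
# Hash table whose slots each hold a list of items (separate chaining).
class LengthHashTable
  def initialize(size = 16)
    @buckets = Array.new(size) { [] }
  end

  # The deliberately poor hash function from the example: the string's length.
  def f(str)
    str.length
  end

  def add(str)
    bucket = @buckets[f(str) % @buckets.length]
    bucket << str unless bucket.include?(str)
  end

  # Only the one bucket selected by f(str) is searched.
  def include?(str)
    @buckets[f(str) % @buckets.length].include?(str)
  end
end

h = LengthHashTable.new
%w[aaa bgb eccc dddsp].each { |s| h.add(s) }
p h.include?("eccc")   # => true
p h.include?("bgb")    # => true  (shares slot 3 with "aaa")
p h.include?("zzz")    # => false (slot 3 is searched, no match)
```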
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
a hash function applied to some data generates some new data.
it is always the same for the same data.
thats about it.
another constraint that is often put on it, which i think is not really true, is that you cannot recover the original data from the hash.
for me this is its own category, called cryptographic or one-way hashing.
there are a lot of demands on certain kinds of hash functions
for example that the hash is always the same length.
or that hashes are distributed randomly for any given sequence of input data.
the only important point is that it's deterministic (always the same hash for the same data).
so you can use it for example to verify data integrity, validate passwords, etc.
read all about it here
http://en.wikipedia.org/wiki/Hash_function
You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).
Let's take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand, if we took x % 10, we would get
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
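The two tables above can be reproduced with a couple of lines of Ruby; the variable names are only illustrative.

```ruby
inputs = [345, 355, 344, 2345]

mod9  = inputs.map { |x| x % 9 }
mod10 = inputs.map { |x| x % 10 }

p mod9    # => [3, 4, 2, 5]
p mod10   # => [5, 5, 4, 5]

# Count how many distinct hash values each function produced for these inputs.
p mod9.uniq.length    # => 4
p mod10.uniq.length   # => 2
```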
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... up to the193371ndstring)
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear hashing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, and then next one, then the next, etc. until the item is found or one hits an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
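As an illustration of the first approach in the list (the one called linear hashing above), here is a minimal Ruby sketch. LinearProbeTable is a hypothetical class; it assumes keys are never nil, supports no deletion, and always keeps one slot empty so that a failed lookup is guaranteed to stop.

```ruby
# Open-addressing table: on a collision, step to the next slot until one is free.
class LinearProbeTable
  def initialize(capacity = 8)
    @slots = Array.new(capacity)
    @count = 0
  end

  def add(key)
    return if include?(key)
    # Keep at least one empty slot so lookups for absent keys terminate.
    raise 'table is full' if @count == @slots.length - 1
    i = key.hash % @slots.length
    i = (i + 1) % @slots.length until @slots[i].nil?
    @slots[i] = key
    @count += 1
  end

  # Start at the slot the hash indicates and scan forward until
  # the key is found or an empty slot is hit.
  def include?(key)
    i = key.hash % @slots.length
    until @slots[i].nil?
      return true if @slots[i] == key
      i = (i + 1) % @slots.length
    end
    false
  end
end

t = LinearProbeTable.new
%w[if else while for return].each { |kw| t.add(kw) }
p t.include?("while")   # => true
p t.include?("goto")    # => false
```

Deletion is left out on purpose: as noted above, removing an item can break the probe sequence for items that were displaced past it.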

Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C could even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain 'ol C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in a particular ASCending order and you only need the bigger ones, I would serialize that array, explode it by the INT and keep the part of the serialized string that holds the bigger INTs, then unserialize it and voilá.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is in proportion to the amount of data to be searched. Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. In this we check the middle element. If the value is bigger than what we are looking for, then look in the first half; otherwise, look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration, so it's logarithmic.
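For the lookup asked about in the question (the index of the first value greater than some input in a sorted array), a binary search can be sketched in Ruby with the built-in Array#bsearch_index, available from Ruby 2.3. The method names below are hypothetical, and Ruby is used here in place of the asker's Objective-C.

```ruby
# Index of the first element strictly greater than x, or nil if there is none.
def index_of_first_greater_than(sorted, x)
  sorted.bsearch_index { |v| v > x }
end

# Index of the first element greater than or equal to x, or nil if there is none.
def index_of_first_at_least(sorted, x)
  sorted.bsearch_index { |v| v >= x }
end

numbers = [1, 7, 23, 23, 23, 89, 1002, 1003]
p index_of_first_greater_than(numbers, 100)    # => 6
p index_of_first_greater_than(numbers, 7)      # => 2
p index_of_first_at_least(numbers, 23)         # => 2
p index_of_first_greater_than(numbers, 1003)   # => nil
```

Each call inspects only about log2(n) elements, so an array of hundreds of thousands of integers needs roughly 20 comparisons per query.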

How to determine differences in two lists of data

This is an exercise for the CS guys to shine with the theory.
Imagine you have 2 containers with elements. Folders, URLs, Files, Strings, it really doesn't matter.
What is AN algorithm to calculate the added and the removed?
Notice: If there are many ways to solve this problem, please post one per answer so it can be analysed and voted up.
Edit: All the answers solve the matter with 4 containers. Is it possible to use only the initial 2?
Assuming you have two lists of unique items, and the ordering doesn't matter, you can think of them both as sets rather than lists.
If you think of a Venn diagram, with list A as one circle and list B as the other, then the intersection of these two is the constant pool.
Remove all the elements in this intersection from both A and B, and anything left in A has been deleted, whilst anything left in B has been added.
So, iterate through A looking for each item in B. If you find it, remove it from both A and B.
Then A is a list of things that were deleted, and B is a list of things that were added.
I think...
[edit] Ok, with the new "only 2 container" restriction, the same still holds:
foreach( A ) {
    if( eleA NOT IN B ) {
        DELETED
    }
}
foreach( B ) {
    if( eleB NOT IN A ) {
        ADDED
    }
}
Then you aren't constructing a new list or destroying your old ones, but it will take longer: with the previous example you could just loop over the shorter list and remove the elements from the longer one, whereas here you need to loop over both lists.
And I'd argue my first solution didn't use 4 containers, it just destroyed two ;-)
I have not done this in a while but I believe the algorithm goes like this...
sort left-list and right-list
adds = {}
deletes = {}
get first right-item from right-list
get first left-item from left-list
while (either list has items)
    if left-item < right-item or right-list is empty
        add left-item to deletes
        get new left-item from left-list
    else if left-item > right-item or left-list is empty
        add right-item to adds
        get new right-item from right-list
    else
        get new right-item from right-list
        get new left-item from left-list
In regards to right-list's relation to left-list, deletes contains items removed and adds now contains new items.
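The pseudocode above could be turned into runnable Ruby along these lines. It is one possible reading rather than the answerer's own code: merge_diff is a hypothetical name, both lists are assumed to hold unique, mutually comparable items, and the emptiness checks are evaluated before any comparison so that an exhausted list is never indexed.

```ruby
# Walk two sorted lists in step, like the merge phase of merge sort.
# Returns [adds, deletes] describing how right differs from left.
def merge_diff(left, right)
  left = left.sort
  right = right.sort
  adds = []
  deletes = []
  i = j = 0
  while i < left.length || j < right.length
    if j >= right.length || (i < left.length && left[i] < right[j])
      # Present only in left: it was removed.
      deletes << left[i]
      i += 1
    elsif i >= left.length || left[i] > right[j]
      # Present only in right: it was added.
      adds << right[j]
      j += 1
    else
      # Present in both: skip it on both sides.
      i += 1
      j += 1
    end
  end
  [adds, deletes]
end

adds, deletes = merge_diff(%w[a b c d], %w[b d e])
p adds      # => ["e"]
p deletes   # => ["a", "c"]
```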
What Joe said. And, if the lists are too large to fit into memory, use an external file sorting utility or a Merge sort.
Missing information: How do you define added/removed? E.g. if the lists (A and B) show the same directory on Server A and Server B, that is in sync. If I now wait for 10 days, generate the lists again and compare them, how can I tell if something has been removed? I cannot. I can only tell there are files on Server A not found on Server B and/or the other way round. Whether that is because a file has been added to Server A (thus the file is not found on B) or a file has been deleted on Server B (thus the file is not found on B anymore) is something I cannot determine by just having a list of file names.
For the solution I suggest, I will just assume that you have one list named OLD and one list named NEW. Everything found on OLD but not on NEW has been removed. Everything found on NEW, but not on OLD has been added (e.g. the content of the same directory on the same server, however lists have been created at different dates).
Further I will assume there are no duplicates. That means every item on either list is unique in the sense of: If I compare this item to any other item on the list (no matter how this compare works), I can always say the item is either smaller or bigger than the one I'm comparing it to, but never equal. E.g. when dealing with strings, I can compare them lexicographically and the same string is never twice in the list.
In that case the simplest (not necessarily best solution, though) is:
Sort the OLD list. E.g. if the list consists of strings, sort them alphabetically. Sorting is necessary, because it means I can use binary search to quickly find an object in the list, assuming it does exist there (or to quickly determine that it does not exist in the list at all). If the list is unsorted, finding the object has a complexity of O(n) (I need to look at every single item on the list). If the list is sorted, complexity is only O(log n), as after every try to match an item on the list I can always exclude 50% of the items on the list as not being a match. Even if the list has 100 items, finding an item (or detecting that the item is not on the list) takes at most 7 tests (or is it 8? Anyway, far less than 100). The NEW list doesn't have to be sorted.
Now we perform list elimination. For every item on the NEW list, try to find this item on the OLD list (using binary search). If the item is found, remove this item from the OLD list and also remove it from the NEW list. This also means the lists get smaller the further the elimination progresses and thus the lookups will become faster and faster. Since removing an item from a list has no effect on the correct sort order of the list, there is no need to ever re-sort the OLD list during the elimination phase.
At the end of elimination, both lists might be empty, in which case they were equal. If they are not empty, all items still on the OLD list are items missing on the NEW list (otherwise we would have removed them), hence these are the removed items. All items still on the NEW list are items that were not on the OLD list (again, we would have removed them otherwise), hence these are the added items.
Are the objects in the list "unique"? In this case I would first build two maps (hashmaps) and then scan the lists and lookup every object in the maps.
map1
map2
removedElements
addedElements
list1.each |item|
{
map1.add(item)
}
list2.each |item|
{
map2.add(item)
}
list1.each |item|
{
removedElements.add(item) unless map2.contains?(item)
}
list2.each |item|
{
addedElements.add(item) unless map1.contains?(item)
}
Sorry for the horrible meta-language mixing Ruby and Java :-P
In the end removedElements will contain the elements belonging to list1, but not to list2, and addedElements will contain the elements belonging to list2, but not to list1.
The cost of the whole operation is O(4*N), i.e. O(N), since the lookup in the map/dictionary may be considered constant. On the other hand, linearly searching for each element in the lists would make that O(N^2), and binary searching O(N log N).
EDIT: on second thought, by moving the last check into the second loop you can remove one of the loops... but that's ugly... :)
list1.each |item|
{
map1.add(item)
}
list2.each |item|
{
map2.add(item)
addedElements.add(item) unless map1.contains?(item)
}
list1.each |item|
{
removedElements.add(item) unless map2.contains?(item)
}
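In actual Ruby, the map-based approach above might be sketched like this. diff_with_sets is a hypothetical name, Set from the standard library stands in for the two maps, and items are assumed to be usable as hash keys.

```ruby
require 'set'

# Returns [added, removed]: items only in list2, and items only in list1.
def diff_with_sets(list1, list2)
  set1 = list1.to_set
  set2 = list2.to_set
  # Each include? is an average constant-time hash lookup.
  removed = list1.reject { |item| set2.include?(item) }
  added   = list2.reject { |item| set1.include?(item) }
  [added, removed]
end

added, removed = diff_with_sets(%w[a.txt b.txt c.txt], %w[b.txt c.txt d.txt])
p added     # => ["d.txt"]
p removed   # => ["a.txt"]
```

The original lists are left untouched, at the cost of the two extra sets.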
