I have some JSON data that I want to match to a particular array of IDs. So for example, the JSON {temperature: 80, weather: tornado} can map to an array of IDs [15, 1, 82]. This array of IDs is completely arbitrary and something I will define myself for that particular input; it's simply meant to give recommendations based on conditions.
So while a temperature >= 80 in tornado conditions always maps to [15, 1, 82], the same temperature in cloudy conditions might be [1, 16, 28], and so on.
The issue is that there are a LOT of potential "branches". My program has 7 times of day, each time-of-day node has 7 potential temperature ranges, and each temperature-range node has 15 possible weather events. So manually writing if statements for all 735 combinations (if I did the math correctly) would be very unwieldy.
I have drawn a "decision tree" representing one path for demonstration purposes, above.
What are some recommended ways to represent this in code besides massively nested conditionals/case statements?
Thanks.
No need for massive branching. It's easy enough to create a lookup table with the 735 possible entries. You said that you'll add the values yourself.
Create enums for each of your times of day, temperature ranges, and weather events. So your times of day are mapped from 0 to 6, your temperature ranges are mapped from 0 to 6, and your weather events are mapped from 0 to 14. You basically have a 3-dimensional array. And each entry in the array is a list of ID lists.
In C# it would look something like this:
List<List<int>>[,,] lookupTable = new List<List<int>>[7, 7, 15];
To populate the lookup table, write a program that generates JSON that you can include in your program. In pseudocode:
for (i = 0 to 6) {           // loop for time of day
    for (j = 0 to 6) {       // loop for temperature ranges
        for (k = 0 to 14) {  // loop for weather events
            // here, output JSON for the record [i, j, k]
            // You'll probably want a comment with each record
            // to say which combination it's for.
            // The JSON here is basically just the list of
            // ID lists that you want to assign.
        }
    }
}
Perhaps you want to use that program to generate the JSON skeleton (i.e. one record for each [time-of-day, temperature, weather-event] combination), and then manually add the list of ID lists.
It's a little bit of preparation, but in the end your lookup is dead simple: convert the time-of-day, temperature, and weather event to their corresponding integer values, and look it up in the array. Just a few lines of code.
You could do something similar with a map or dictionary. You'd generate the JSON as above, but rather than load it into a three-dimensional array, load it into your dictionary with the key being the concatenation of the three dimensions. For example, a key would be:
"early morning,lukewarm,squall"
There are probably other lookup table solutions, as well. Those are the first two that I came up with. The point is that you have a whole lot of static data that's very amenable to indexed lookup. Take advantage of it.
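For illustration, here is how the dictionary variant might look, sketched in Python (the enum members and ID lists below are made-up placeholders; in practice you'd fill in all 735 entries, ideally generated from the JSON skeleton described above):

```python
from enum import IntEnum

# Hypothetical enums -- the real names and member counts are whatever
# your program defines (7 times of day, 7 ranges, 15 weather events).
class TimeOfDay(IntEnum):
    EARLY_MORNING = 0
    MORNING = 1
    # ... 7 members total

class TempRange(IntEnum):
    LUKEWARM = 0
    # ... 7 members total

class Weather(IntEnum):
    TORNADO = 0
    SQUALL = 1
    # ... 15 members total

# Dictionary variant: the key is the (time, temp, weather) triple.
lookup = {
    (TimeOfDay.EARLY_MORNING, TempRange.LUKEWARM, Weather.TORNADO): [15, 1, 82],
    (TimeOfDay.EARLY_MORNING, TempRange.LUKEWARM, Weather.SQUALL): [1, 16, 28],
    # ... one entry per combination, 735 in total
}

def recommend(time, temp, weather):
    # The whole "decision tree" collapses into a single lookup.
    return lookup[(time, temp, weather)]

print(recommend(TimeOfDay.EARLY_MORNING, TempRange.LUKEWARM, Weather.TORNADO))  # → [15, 1, 82]
```

The same data could equally be loaded from generated JSON at startup; the lookup itself stays one line either way.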
I am in love with probabilistic data structures. For my current problem, it seems that the count-min-sketch structure is almost the right candidate. I want to use count-min-sketch to store events per ID.
Let's assume I do have the following
Map<String, Int> {
[ID1, 10],
[ID2, 12],
[ID3, 15]
}
If I use a count-min sketch, I can query the data structure by ID and retrieve the approximate counts.
Question
Actually I am interested in the average occurrence over all IDs, which in the example above would be 12.33. If I am using the count-min sketch, it seems that I need to store the set of IDs, iterate over the set, query the count-min for each ID, and calculate the average. Is there an improved way that doesn't require storing all the IDs? Ideally I just want to retrieve the average right away without remembering all IDs.
Hope that makes sense!
You should be able to calculate the average count if you know the number of entries, and the number of distinct entries:
averageCount = totalNumberOfEntries / numberOfDistinctEntries
Right? And to calculate the number of distinct entries, you can use e.g. HyperLogLog. You already added the hyperloglog tag to your question, so maybe you already know this?
A hash table is a data structure that maps keys to values. Given a key, the hash function calculates the index of the slot/bucket where the value is stored. If multiple keys map to the same slot, the table might start a linked list from that slot. If there aren't enough slots for the values, it performs a resizing operation to obtain a bigger space.
Is the first level of a hash table's buckets always an array?
Where are the keys stored? Or is it the case that the table doesn't have to store the keys at all, since the hash function takes a key and calculates the position each time?
In the Ruby language, does a hash object such as {:name => "Wix", :age => 18} count as a hash table? If it does, I need the answer to question 2.
The Ruby name Hash is somewhat misleading. To most developers, these objects are really maps, meaning you give them a key and they give you the associated value. The fact that they are hash maps is really just an implementation detail that makes them fast, and it is in fact the same principle behind hash sets, which, given a value, just tell you whether the value is in the set or not.
To simplify it a bit, imagine this:
Storing
You have an array of 10 elements. You are told to remember that 35 = "some data". You then hash the key (35), which I will simplify as just taking it modulo the array length, so the result is 35 % 10 = 5.
We then store the data 35 = "some data" at that index, for example as a tuple [35, "some data"].
We then get some more data, 25 = "more data" and 78 = "cool stuff". So again, we hash the keys and get 5 and 8. Storing the second one is easy, we just have to store [78, "cool stuff"] at position 8 in the array.
But storing [25, "more data"] is a problem, because position 5 is already occupied. As you already pointed out, that is solved by storing a linked list. So we go back to the beginning and instead store [35, "some data", nil] for our first value.
To insert 25 we then just change it so that the first element points to the second, and get array[5] = [35, "some data", <pointer>] -> [25, "more data", nil]
Accessing
After a while the user wants to know what the value associated with the key 25 is.
Since we implement a hashmap, we can just hash the key, 25 % 10 = 5, and know our pair is stored at position 5. We then only have to iterate a linked list with 2 elements looking for the key 25, and when we find it, take the second value of the tuple and return it to the user.
In Practice
The above is, of course, an oversimplified example, but it shows the basic idea of how hash-maps operate.
In the real world, the hashing algorithm would, of course, be more complicated than just modulo-dividing, but the idea is the same. The hash of a key is always turned into an index in the array. A good hashing algorithm should be 1. fast and 2. random, to avoid having lots of empty buckets and a few buckets with lots of elements.
Also, our array wouldn't have a fixed length of 10, but be smart about it and try to both save memory by not being excessively big, but at the same time be generous enough with the memory to avoid unnecessary shrinking/growing all the time and keep the buckets reasonably short.
In the best case, you can have a map of a few thousand elements, and to access one you just hash its key, which takes the same amount of time regardless of the size of the map, instead of having to iterate over all those thousands of elements and compare each one to the one you're looking for.
Regarding your third question, the answer is yes.
As for the second, the keys themselves are stored in the buckets (often alongside their cached hash values), so that colliding entries can be told apart by comparing the actual keys.
I'm not sure how ruby internally stores the buckets, but generally they could be implemented in many ways, as arrays, structs, etc.
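The storing/accessing walkthrough above can be sketched in a few lines of Python (a toy chained hash table for illustration only, not how Ruby actually implements Hash):

```python
class ToyHashMap:
    def __init__(self, size=10):
        # Each bucket is a list, playing the role of the linked list.
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for pair in bucket:
            if pair[0] == key:       # key already present: overwrite
                pair[1] = value
                return
        bucket.append([key, value])  # new key: append to the bucket

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:          # walk the short chain, comparing keys
            if k == key:
                return v
        raise KeyError(key)

m = ToyHashMap()
m.put(35, "some data")
m.put(25, "more data")   # 35 and 25 both hash to bucket 5, so they chain
m.put(78, "cool stuff")
print(m.get(25))  # → more data
```

Note that the chain stores the full key, which is what makes it possible to tell 35 and 25 apart inside bucket 5.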
I am new to the world of data science and am trying to understand the concepts behind the outcomes of ML. I have started off with the scikit-learn clustering example. Using the scikit-learn library is well documented everywhere, but all the examples assume ready-made numerical data.
Now how does a data scientist convert business data into machine learning data? Just to give an example, here is some customer and sales data I have prepared.
The first picture shows the customer data with some parameters having an integer, string and boolean values
The second picture shows the historical sales data for those customers.
Now how does such real business data get translated to feed into a Machine Learning algorithm? How do I convert each field to a common format which the algorithm can understand?
Thanks
K
Technically, there are many ways, such as one-hot encoding, standardization, and going into log space for skewed attributes.
But the problem is not just of a technical nature.
Finding a way is not enough; you need to find one that works really well for your problem. This usually differs a lot from one problem to another. There is no "turnkey solution".
In addition to the comment by @Anony-Mousse: you can convert the Won/Lost column to the values 1 and 0 (e.g. 1 for Won, 0 for Lost). For the Y column, suppose you have 3 unique values in the column; you can convert A to [1, 0, 0], B to [0, 1, 0] and C to [0, 0, 1] (this is called one-hot encoding). Same for the Z column: you can convert TRUE to 1 and FALSE to 0.
To merge the two tables or Excel files together, you can use an additional library called pandas, which allows you to merge two dataframes, e.g. df1.merge(df2, on='CustID', how='left'). Now you can feed your feature set to scikit-learn properly.
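Putting those steps together, a small pandas sketch might look like this (the column names and values are toy placeholders standing in for the tables in the pictures):

```python
import pandas as pd

# Toy stand-ins for the two tables; real column names will differ.
customers = pd.DataFrame({
    "CustID": [1, 2, 3],
    "Y": ["A", "B", "C"],      # categorical column
    "Z": [True, False, True],  # boolean column
})
sales = pd.DataFrame({
    "CustID": [1, 2, 3],
    "Outcome": ["Won", "Lost", "Won"],
})

# Merge the two tables on the shared customer ID.
df = customers.merge(sales, on="CustID", how="left")

df["Outcome"] = (df["Outcome"] == "Won").astype(int)  # Won/Lost -> 1/0
df["Z"] = df["Z"].astype(int)                         # TRUE/FALSE -> 1/0
df = pd.get_dummies(df, columns=["Y"])                # one-hot encode Y

print(df.columns.tolist())
```

The resulting frame is all-numeric (Y becomes the columns Y_A, Y_B, Y_C), which is the form scikit-learn estimators expect.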
Hi there guys,
I'm developing a small program in C that reads strings from a .txt file with a 2-letters-and-3-numbers format, like this:
AB123
I developed a polynomial hash function, that calculates an hash key like this
hash key(k) = k1 + k2*A + k3*A^2 + ... + kn*A^(n-1)
where k1 is the 1st letter of the word, k2 the 2nd, and so on, and A is a prime number chosen to reduce the number of collisions; in my case it's 11.
OK, so I got the table generated and I can search in the table, no problem, but only if I have the full word. That much I could figure out.
But what if I only want to use the first letter? Is it possible to search the hash table and get the elements starting with, for example, 'A', without going through every element?
To get more functionality you have to introduce more data structures. It all depends on how deep you want to go, which depends on what exactly you need the code to do.
I suspect that you want some kind of filtering for the user. When the user enters "A" they should be given all strings that start with "A", and when they then enter "B" the list should be filtered down to all strings starting with "AB".
If this is the case then you don't need over-complicated structures. Just iterate through the list and give the user the appropriate sublist. Humans are slow, and they won't notice the difference between 3 ms response and 300 ms response.
If your hash function is well designed, every place in the table is capable of storing a string beginning with any prefix, so this approach is doomed from the start.
It sounds like what you really want might be a trie.
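To make the trie suggestion concrete, here is a minimal sketch in Python (the hash table can stay for exact lookups; the trie answers the prefix queries):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # one child per next character
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        node = self.root
        for ch in prefix:                 # walk down to the prefix node
            if ch not in node.children:
                return []                 # no word has this prefix
            node = node.children[ch]
        results = []
        def collect(n, acc):              # gather every word below it
            if n.is_word:
                results.append(prefix + acc)
            for ch, child in n.children.items():
                collect(child, acc + ch)
        collect(node, "")
        return results

t = Trie()
for code in ["AB123", "AB200", "AC999", "BA123"]:
    t.insert(code)
print(sorted(t.with_prefix("A")))  # → ['AB123', 'AB200', 'AC999']
```

Each query costs time proportional to the prefix length plus the number of matches, rather than a scan of the whole table.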
So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hash codes for each (a cheap operation), and only if the hash codes are equal do I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number end up in the same bucket, I can choose the bucket by hash code.
Hashing is a heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code, but usually don't; two objects in the same equivalence class, however, must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. The sum, 20, for example, is one kind of summary; it doesn't preserve the available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get discarded. If you use sum-and-mod (to keep the sum under 2 billion, for example) you tend to keep a lot of the right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will collide with other sequences of numbers that happen to have the same sum.
First, we should state the problem that a hashing algorithm solves.
Suppose you have some data (maybe an array, a tree, or database entries). You want to find a concrete element in this data store (for example in an array) as fast as possible. How do you do it?
When you build this data store, you can calculate a special value (called the HashValue) for every item you put in. There are different ways to calculate this value, but all of them should satisfy a special condition: ideally, the calculated value should be unique for every item.
So now you have an array of items, and for every item you have its HashValue. How do you use it? Consider an array of N elements. Let's put your items into this array according to their HashValues.
Suppose you have to answer this question: does the item "it1" exist in this array? To answer it, you can simply compute the HashValue for "it1" (let's call it f("it1")) and look at position f("it1") in the array. If the element at this position is not null (and equals our "it1" item), the answer is true. Otherwise the answer is false.
There is also the collision problem: how to find a function which gives unique HashValues for all different elements. Actually, such a function doesn't exist in general, but there are a lot of good functions which give you well-spread values.
An example for better understanding:
Suppose you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}, and you have to answer the question: does this array contain the String S?
First, we have to choose a function for calculating the HashValues. Let's take the function f which, for a given string, returns the length of that string (actually it's a very bad hash function, but I chose it for easy understanding).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
Now we calculate the HashValues for every element in array A: f("aaa") = 3, f("eccc") = 4, ...
Let's take an array for holding these items (also called a HashTable); call it H (an array of strings). Now we put our elements into this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc", ...
And finally, how do we find a given String in this array?
Suppose you are given the String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer is true; otherwise it is false.
But how do we handle the situation where two elements have the same HashValue? There are a lot of ways to deal with it. One of them: each element in the HashTable holds a list of items, so H[4] contains all items whose HashValue equals 4. And how do we find a concrete element then? It's very easy: calculate the HashValue for the item and look at the list of items in HashTable[HashValue]. If one of these items equals the element we're searching for, the answer is true; otherwise it's false.
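The string-length walkthrough above, sketched in Python (still a deliberately bad hash function, kept only to mirror the example):

```python
def f(s):
    return len(s)  # the toy HashValue function from the example

# HashTable with chained buckets: each slot holds a list of items.
H = [[] for _ in range(16)]

for item in ["aaa", "bgb", "eccc", "dddsp"]:
    H[f(item)].append(item)  # "aaa" and "bgb" collide in bucket 3

def contains(s):
    # Look only at the one bucket the HashValue points to.
    return s in H[f(s)]

print(contains("eccc"))  # → True
print(contains("zzz"))   # → False
```

A search never scans the whole array A; it checks only the (hopefully short) list in one bucket.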
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
A hash function applied to some data generates some new data.
It is always the same for the same data.
That's about it.
Another constraint that is often put on it, which I think is not really part of the definition, is that you cannot reconstruct the original data from the hash.
For me this is its own category, called cryptographic or one-way hashing.
There are a lot of demands on certain kinds of hash functions,
for example that the hash is always the same length,
or that hashes are distributed randomly for any given sequence of input data.
The only important point is that it's deterministic (always the same hash for the same data).
So you can use it, for example, to verify data integrity, validate passwords, etc.
read all about it here
http://en.wikipedia.org/wiki/Hash_function
You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).
Let's take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand, if we take x % 10, we get:
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
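You can see the difference by counting how the sample values pile up in buckets (a quick sketch):

```python
from collections import Counter

values = [345, 355, 344, 2345]

for m in (9, 10):
    hashes = [x % m for x in values]
    counts = Counter(hashes)
    # With m=9 every value lands in its own bucket;
    # with m=10 three of the four share bucket 5.
    print(f"x % {m}: hashes={hashes}, largest bucket={max(counts.values())}")
```

The larger the biggest bucket, the more collisions, and the worse the hash function spreads its inputs.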
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... and so on, up to returning 193,371 for the last string),
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear probing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, then the next one, then the next, etc. until the item is found or one hits an empty slot. Linear probing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear probing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain-bucket hashing is often the best approach. The overhead of dealing with the extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
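For concreteness, here is a toy open-addressing table using the linear strategy described above (insert and find only; as noted, deletion is the tricky part and is omitted):

```python
class LinearProbeTable:
    def __init__(self, size=16):
        self.slots = [None] * size  # None marks an empty slot

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        # On collision, step forward until we find our key or an empty slot.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def find(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:   # an empty slot ends the search
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None

t = LinearProbeTable()
t.insert(3, "fox")
t.insert(19, "owl")   # 19 % 16 == 3: collides with key 3, lands in slot 4
print(t.find(19))  # → owl
print(t.find(35))  # 35 % 16 == 3 as well, but not stored → None
```

The probe sequence for a missing key stops at the first empty slot, which is exactly why naive deletion (just blanking a slot) would break later searches.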