storing set as a value for hash in redis - data-structures

I have file consisting of thousands of lines(each containing 3 fields, first is a k length string, then a number, third another string) of form:-
k|1|r1
k|1|r2
k|2|r2
k1|1|r3
I load it using redis-py, by using:-
sadd('k:1', 'r1')
sadd('k:1', 'r2')
sadd('k:2', 'r2')
sadd('k1:1', 'r3')
to form a mapping like
{
"k:1" : ("r1", "r2"),
"k:2" : ("r2"),
"k1:1" : ("r3")
}
I intend to store the values of the form, by removing the repetitive information of k(which is a k length string common for the first 3 records):
{
"k": {
"1" : ("r1", "r2"),
"2" : ("r2")
}
"k1": {
"1" : ("r3")
}
}
I have the idea of storing the value of storing the set under a different key, which can act as value for k in the hash. Is there a better way than that?

Storing the sets under a different key will work, however if the sets are static then you don't need the set functionality - instead you can save yourself a lookup and store the set directly in the map, e.g. as a comma separated value (or whatever separator is appropriate for your data) if you've only got a few values, or as a gzipped comma separated value if you've got a lot of values (in my experience, if the string is large enough then the cost of g(un)zip is offset by the reduced network cost).

Related

Performance: Replacing Series values with keys from a Dictionary in Python

I have a data series that contains various names of the same organizations. I want harmonize these names into a given standard using a mapping dictionary. I am currently using a nested for loop to iterate through each series element and if it is within the dictionary's values, I update the series value with the dictionary key.
# For example, corporation_series is:
0 'Corp1'
1 'Corp-1'
2 'Corp 1'
3 'Corp2'
4 'Corp--2'
dtype: object
# Dictionary is:
mapping_dict = {
'Corporation_1': ['Corp1', 'Corp-1', 'Corp 1'],
'Corporation_2': ['Corp2', 'Corp--2'],
}
# I use this logic to replace the values in the series
for index, value in corporation_series.items():
for key, list in mapping_dict.items():
if value in list:
corporation_series = corporation_series.replace(value, key)
So, if the series has a value of 'Corp1', and it exists in the dictionary's values, the logic replaces it with the corresponding key of corporations. However, it is an extremely expensive method. Could someone recommend me a better way of doing this operation? Much appreciated.
I found a solution by using python's .map function. In order to use .map, I had to invert my dictionary:
# Inverted Dict:
mapping_dict = {
'Corp1': ['Corporation_1'],
'Corp-1': ['Corporation_1'],
'Corp 1': ['Corporation_1'],
'Corp2': ['Corporation_2'],
'Corp--2':['Corporation_2'],
}
# use .map
corporation_series.map(newdict)
Instead of 5 minutes of processing, took around 5s. While this is works, I sure there are better solutions out there. Any suggestions would be most welcome.

Simple way of getting key depending on value from hashmap in Golang

Given a hashmap in Golang which has a key and a value, what is the simplest way of retrieving the key given the value?
For example Ruby equivalent would be
key = hashMap.key(value)
There is no built-in function to do this; you will have to make your own. Below is an example function that will work for map[string]int, which you can adapt for other map types:
func mapkey(m map[string]int, value int) (key string, ok bool) {
for k, v := range m {
if v == value {
key = k
ok = true
return
}
}
return
}
Usage:
key, ok := mapkey(hashMap, value)
if !ok {
panic("value does not exist in map")
}
The important question is: How many times will you have to look up the value?
If you only need to do it once, then you can iterate over the key, value pairs and keep the key (or keys) that match the value.
If you have to do the look up often, then I would suggest you make another map that has key, values reversed (assuming all keys map to unique values), and use that for look up.
I am in the midst of working on a server based on bitcoin and there is a list of constants and byte codes for the payment scripts. In the C++ version it has both identifiers with the codes and then another function that returns the string version. So it's really not much extra work to just take the original, with opcodes as string keys and the byte as value, and then reverse the order. The only thing that niggles me is duplicate keys on values. But since those are just true and false, overlapping zero and one, all of the first index of the string slice are the numbers and opcodes, and the truth values are the second index.
To iterate the list every time to identify the script command to execute would cost on average 50% of the map elements being tested. It's much simpler to just have a reverse lookup table. Executing the scripts has to be done maybe up to as much as 10,000 times on a full block so it makes no sense to save memory and pay instead in processing.

Data structure to check if number of elements is equal in O(1) time?

What is the best data structure to check if the number of elements of different types of objects is the same?
For example, if I have
2 a's
3 b's
3 c's
The number of elements of the different types of objects is not the same.
If I have
2 a's
2 b's
2 c's
then this is the same.
What is the best data structure that allows you do this in O(1) time and how would you implement it?
One method is to use two dictionaries to be able to do it in O(1) dynamically.
The first maps each type to a count, {a:2,b:3,c:3}. The second maps each count to a set of types with that count. {2:{a},3:{b,c}}. If the size of the second dictionary is less than 2 (0 or 1) then clearly all types have the same count as if that was not the case then there would be at least two key-item pairs in that dictionary, presuming that the dictionary is updated when the counts change.
Adding a type just means adding it to each dictionary.
Removing a type just means removing it from each dictionary.
Updating a type requires first updating the second dictionary by removing the previous count (obtained from the first dictionary) and adding the current count, after which the first dictionary is updated.
Dictionary<Type, int> typeCounts = new Dictionary<Type, int>();
// read and store type counts
Type type = typeof(A);
if (typeCounts.Contains(type))
{
typeCounts[type]++;
}
else
{
typeCounts.Add(type, 1);
}
// populate type counts by type and finally:
if (typeCounts[typeA] == typeCounts[typeB])
{
// so on
}

Reduce properties which I'm not sure about

I'm a beginner in writing map-reduces and I'm not sure about some reduce function properties.
So, reduce gets (key, list of values) as an input parameter...
is it guaranteed that the list of input values always contains at least 2 members? So, an unique key emitted by the mapper would never be passed to the reducer?
or, if there is just one item in the input list, is it guaranteed that the key is unique?
can reduce emit more values then the input values list size?
I have a large list of strings. I need to find all of them which are not unique. Can I make it with just one map/reduce? The only way I see is to count all the unique strings by one map/reduce and then select those which are not unique by the another map/reduce
Thanks
The list of input values to the reduce() method may have one or more, but not zero members.
All of the values mapped from/to a unique key value are passed as a list to the reduce along with the key value. If that list contains one member then you can assume that that key value was mapped to only one value (or once, if you're counting)
Your reducer can write any number, including zero, of key value pairs for a given input key and list of values. The types of the input key/values may be different from the types of the output key/value pairs.
You can solve your problem with one map/reduce step
So, the problem with the strings, pseudocode:
map(string s) {
emit(s, 0);
}
reduce(string key, list values) {
if (valies.size() > 1) { emit(key, 1); return; }
if (valuse.contains(1)) { emit(key, 1); return; }
}
right?

Word-separating algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?
Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words which is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.
Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):
List<String> breakUp(String str, Trie root) {
Set<(List<String>, Trie)> set = {([], root)};
for (char c : str) {
Set<(List<String>, Trie)> newSet = {};
for (List<String> ls, Trie t : set) {
Trie tNext = t.follow(c);
if (tNext != null) {
newSet.add((ls, tNext));
if (tNext.isWord()) {
newSet.add((t.follow(c).getWord() :: ls, root));
}
}
}
set = newSet;
}
for (List<String> ls, Trie t : set) {
if (t == root) return ls;
}
return null;
}
Let me know if I need to clarify or I missed something...
I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
I'd imaging these sites do it similar to this:
Get a list of word for your target language
Remove "useless" words like "a", "the", ...
Run through the list and check which of the words are substrings of the domain name
Take the most common words of the remaining list (Or the ones with the highest adsense rating,...)
Of course that leads to nonsense for expertsexchange, but what else would you expect there...
(disclaimer: I did not try it myself, so take it merely as a food for experimentation. 4-grams are taken mostly out of the blue sky, just from my experience that 3-grams won't work all too well; 5-grams and more might work better, even though you will have to deal with a pretty large table). It's also simplistic in a sense that it does not take into the account the ending of the string - if it works for you otherwise, you'd probably need to think about fixing the endings.
This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.
So, first: Take a lot of human-readable texts. for each of the text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation, assumes the [] is a hashtable-like indexing, and that nonexistent indexes return '0'):
for(i=0;i<length(s)-5;i++) {
// take 4-character substring starting at position i
subs2 = substring(str, i, 4);
if(has_space(subs2)) {
subs = substring(str, i, 5);
delete_space(subs);
yes_space[subs][position(space, subs2)]++;
} else {
subs = subs2;
no_space[subs]++;
}
}
This will build you the tables which will help to decide whether a given 4-gram would need to have a space in it inserted or not.
Then, take your string to split, I denote it as xstr, and do:
for(i=0;i<length(xstr)-5;i++) {
subs = substring(xstr, i, 4);
for(j=0;j<4;j++) {
do_insert_space_here[i+j] -= no_space[subs];
}
for(j=0;j<4;j++) {
do_insert_space_here[i+j] += yes_space[subs][j];
}
}
Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)

Resources