Trust based dictionary - algorithm

Let's say we have a system with a number of "users". Each of those users can have their own "key/value" dictionary.
Let's say we store this data in a dictionary itself, data, with key (user, key) and value value.
Also, users can trust other users. Let's say we store this in trust, with key user and a value that is an ordered array of the users that user trusts.
Let's say "Alice" trusts "Bob" the most, and then "Chris".
So we store in trust: "Alice": ["Bob", "Chris"]
I then want a function get, which works like the following:
get(user, key)
    if (data.haskey((user, key))) then
        return data.get((user, key))
    else
        trusted_users = trust.get(user)
        for (trusted_user in trusted_users)
            result = get(trusted_user, key)
            if (result is not null)
                return result
        return null
    end if
In plain English, if I've got the key in my store, return its value. If not, check my most trusted friend (and anyone they trust, and so on). If that finds nothing, check my next most trusted friend, and so on. If nobody reachable through my friends (or friends of friends, etc.) has this key, then return that it's not found.
I've given a trivial implementation to explain my problem, however it won't scale well: because of the recursion, it's quite possible that its execution time increases linearly with the number of users.
I'm looking for a better data design and algorithm which scales better.

The code presented is worse than linear, because it is possible to visit friends A-B-C and then A-C-B if A, B, and C are all friends (and if two users trust each other, the recursion never terminates). In the recursive friend search you should keep a set of friends visited so far, and not recurse through a friend you have already visited. You would have to pass a pointer to this set through as part of the recursive call so that it is updated properly.
If values in the map do not change and keys may be accessed again, you could keep a copy of key-value pairs from friends to save bothering them again.
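As a rough illustration of the visited-set idea, here is a minimal Python sketch. It assumes data is a dict keyed by (user, key) tuples and trust is a dict mapping each user to an ordered list of trusted users, as in the question; passing data and trust as parameters and the name visited are choices made for this example only.

```python
def get(data, trust, user, key, visited=None):
    """Depth-first lookup that never visits the same user twice."""
    if visited is None:
        visited = set()          # one shared set per top-level call
    if user in visited:
        return None              # already explored, skip to avoid cycles
    visited.add(user)

    if (user, key) in data:
        return data[(user, key)]

    # Try trusted users in order of trust, recursing into their networks.
    for trusted_user in trust.get(user, []):
        result = get(data, trust, trusted_user, key, visited)
        if result is not None:
            return result
    return None


trust = {"Alice": ["Bob", "Chris"], "Bob": ["Alice", "Chris"], "Chris": ["Bob"]}
data = {("Chris", "color"): "blue"}
print(get(data, trust, "Alice", "color"))
```

Because every user is visited at most once, a lookup does work proportional to the number of users and trust links reachable from the starting user, even when the trust graph contains cycles.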

Related

Is temporarily overriding equals method possible?

Is there a way to temporarily override hashCode and equals in Java?
I have a User class with id, name and age attributes. The current equals compares the id, rightly so: if two ids are the same, then the Users are equal.
Now, I have two Lists of User and I want to find users with common names.
Other than looping every element of list1 over every element of list2, I have no other option (Java 7). Is there a way I can temporarily change the behaviour of equals? In this case, where I am coding the logic for searching users with common names, I want to be able to say that if the name is the same then the users are equal (even though it's technically incorrect), without touching the actual User class's equals.

Using redis to store a structured event log

I'm a bit new to Redis, so please forgive if this is basic.
I'm working on an app that sends automatic replies to users for certain events. I would like to use Redis to store who has received what event.
Essentially, in ruby, the data structure could look like this where you have a map of users to events and the dates that each event was sent.
{
  "mary#example.com" => {
    "sent_comment_reply" => ["12/12/2014", "3/6/2015"],
    "added_post_reply" => ["1/4/2006", "7/1/2016"]
  }
}
What is the best way to represent this in a Redis data structure so you can ask: did Mary get a sent_comment_reply, and if so, when was the latest?
In short, the question is how (if possible) you can have a Hash structure that holds an array in Redis.
The rationale, as opposed to using a set or list with a compound key, is that hashes have O(1) lookup time, whereas lookups on lists (LRANGE) and sets (SMEMBERS) will be O(S+N) and O(N), respectively.
One way of structuring it in Redis, assuming that you know the events of the user and you want the latest to be fresh in memory:
A sorted set per user. The members of the sorted set will be event codes (sent_comment_reply, added_post_reply), scored so that the latest event has the highest score. You can use ZRANK to answer the question:
Did Mary get a sent_comment_reply?
A hash for the user as well. This time the field is the event (sent_comment_reply) and the value is its content, which should be updated with the latest value, including the body, date, etc. This will answer the question:
and if so, when was the latest?
Note: Sorted sets are really fast, and in this example we are depending on the events as the data.
With sorted sets you can add, remove, or update elements in a very
fast way (in a time proportional to the logarithm of the number of
elements). Since elements are taken in order and not ordered
afterwards, you can also get ranges by score or by rank (position) in
a very fast way. Accessing the middle of a sorted set is also very
fast, so you can use Sorted Sets as a smart list of non repeating
elements where you can quickly access everything you need: elements in
order, fast existence test, fast access to elements in the middle!
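A variation on the sorted-set idea above can be sketched in Python as follows. This sketch stores the send time as the score, so one sorted set answers both questions. It assumes a client object r exposing zadd(name, mapping) and zscore(name, value) as redis-py does; the key pattern events:<user> and the function names are hypothetical choices for this example.

```python
import time


def record_event(r, user, event, timestamp=None):
    """Store `event` for `user`, using the send time (Unix seconds) as score."""
    ts = time.time() if timestamp is None else timestamp
    # ZADD replaces the score of an existing member, so each event code
    # keeps only the most recently written timestamp.
    r.zadd(f"events:{user}", {event: ts})


def latest_event_time(r, user, event):
    """Return the stored timestamp for `event`, or None if never recorded."""
    # ZSCORE returns None when the member is not in the sorted set.
    return r.zscore(f"events:{user}", event)
```

With redis-py, r would typically be created with redis.Redis(...). A None result from latest_event_time means Mary never received that event; any other result is the time of the latest one recorded.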
A possible approach to use a hash to map an array is as follows:
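add_element(key, value):
    len := redis.hlen(key)
    redis.hset(key, len, value)
This maps the element array[i] to field i in the hash stored at key. Note that HLEN followed by HSET is not atomic, so concurrent writers could end up writing to the same index.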
this will work for some cases, but I would probably go with the answer suggested in https://stackoverflow.com/a/34886801/2868839

Best way to architect unique identifier generation

I have an API in Rails that will handle a lot of requests, let's say millions a day. I want to be able to assign a unique id to each of these requests. The way I am doing it is starting with a uid of 3 letters/numbers and continuing up to 9 letters/numbers once the previous bracket is all taken.
One way I am doing it is to generate the uid in real time when the request comes in, so the app will try to find the first available uid and assign it. But I have the impression that after a while this will impact the performance of the app.
The second way I am thinking of is to auto-generate all the possible uids in advance and have a flag assigned to them [free/taken], so that when a request comes in I assign the first free uid to it, which should be very fast if that field is indexed.
Any suggestions are much appreciated. Thank you
I would just generate a random string and assign it if it doesn't already exist. Before I get into that, let me just mention that SecureRandom.uuid is the best way to go. It generates random UUIDs for which a collision is extremely improbable.
Anyway, here is a way to use your own custom random string generator that will only assign a uid if it doesn't already exist:
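def generate_random_uid
  # Keep generating until no existing record has this uid.
  begin
    uid = my_custom_random_string_method
    object.uid = uid
  end while ObjectModel.exists?(uid: uid)
  # The uid was unused at the time of the check, so persist the record.
  object.save
end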
The begin ... end while block executes at least once and sets a variable called uid to a random string generated by my_custom_random_string_method. The while condition then checks whether a record already exists whose uid attribute matches the one you just generated. If it does, the block runs again, and this repeats until the condition returns false, meaning the uid doesn't exist yet. Then your object is saved and the uid is written to the db. As long as no other process inserts the same uid between the check and the save, this ensures you only ever save the object with a uid that doesn't exist in your db; a unique index on the uid column closes that gap.
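The same generate-and-retry loop can be sketched in Python. This is only an illustration under stated assumptions: an in-memory set called taken stands in for the database lookup, and the names ALPHABET, random_uid and assign_uid are hypothetical.

```python
import secrets
import string

# Lowercase letters and digits; an assumption about the allowed uid characters.
ALPHABET = string.ascii_lowercase + string.digits


def random_uid(length):
    """Return a random string of `length` characters drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def assign_uid(taken, length=9, max_attempts=100):
    """Generate uids until one is not in `taken`, then reserve and return it."""
    for _ in range(max_attempts):
        uid = random_uid(length)
        if uid not in taken:
            taken.add(uid)   # stands in for saving the record
            return uid
    raise RuntimeError("no free uid found; consider a longer uid")


taken = set()
print(assign_uid(taken))
```

With 9 characters from a 36-symbol alphabet there are 36**9 (about 10**14) possible uids, so retries should be rare at millions of requests per day. In a real database the check and the insert would still need a unique constraint to be safe under concurrency.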

Method to determine if keys of a dictionary are in sequence or a range

I'm trying to determine when to remove entries from a SortedDictionary once a sequence is found, i.e. where the keys form a sequence such as 1,2,3,4,5,6,7,8,9,10, etc.
I have:
SortedDictionary<int, string>
It's hard to explain. I'm adding pairs where the key can be any integer value, generally on a random-ish basis. So, the program may add
<2,"jim"> <15,"Jack"> <62,"jill"> and so on.
So when it executes, the dictionary is going to be filled with a sorted list which is not necessarily in sequence, but I want to check whether, say, key values 1..10 are all present in a proper sequence, i.e. 1,2,3,4,5,6,7,8,9,10.
The background is that I've got stuff coming in from a messaging pipe, which is not in order. So it goes into this dictionary, and then on another thread I check the dictionary; if the check returns success for the range I provide, I remove that range from the dictionary and enqueue it, in order, onto a ConcurrentQueue. Fundamentally, it is an unordered-to-ordered exchange.
Any help is appreciated.
Bob.
If you get the highest and lowest keys, then the count would tell you if you've got a sequence.
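The counting argument can be sketched in Python as follows. Since dictionary keys are unique integers, a set of keys is gap-free exactly when highest - lowest + 1 equals the number of keys. The function names here are hypothetical, and has_full_range is an extension of the same idea for checking a specific range such as 1..10 inside a larger dictionary.

```python
def is_contiguous(keys):
    """True if the unique integer keys form an unbroken run."""
    keys = set(keys)
    return bool(keys) and max(keys) - min(keys) + 1 == len(keys)


def has_full_range(keys, lo, hi):
    """True if every integer in lo..hi is present among the unique keys."""
    present = sum(1 for k in set(keys) if lo <= k <= hi)
    return present == hi - lo + 1


messages = {2: "jim", 1: "ann", 3: "joe", 15: "Jack"}
print(is_contiguous(messages))         # False: 15 leaves a gap
print(has_full_range(messages, 1, 3))  # True: 1, 2 and 3 are all present
```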

Creating an id from name and address data. Hash/Digest

My problem:
I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alphanumeric characters, be collision-proof, and be represented in the smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important, and it doesn't need to be one-way) and I'd prefer to find something that would produce a shorter id. Does anyone know of an existing algorithm that fits this problem?
In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?
>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734
Application Context (for those that are interested):
This will be used for a record linkage app. Given an input name and address, we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match, I need to generate this pseudo/generated/derived id from the search input (the entity's name and address data). Every search record should result in an output record with either a real id (the actual database id resulting from a match/link) or this pseudo/generated/derived id. The pseudo id will be prefixed with a character (e.g. N) to differentiate it from a real id.
I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well-studied hashing algorithms, their length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones are generally less collision-prone than something you could come up with yourself.
Use a cryptographic hash for its collision resistance, not its other qualities
Use as many bytes from the hash as you want (truncate)
Convert to alpha-numeric characters
You can also truncate the alpha-numeric string instead of the hash
An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.
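N_HASH_CHARS = 11
import base64, hashlib, re
def digest(name, address):
    # Hash the combined fields, then base64-encode the raw digest (Python 3).
    raw = hashlib.md5((name + "|" + address).encode("utf-8")).digest()
    b64_hash = base64.b64encode(raw).decode("ascii")
    # Keep only alphanumeric characters and truncate to the desired length.
    alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", b64_hash)
    return alnum_hash[:N_HASH_CHARS]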
How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log2(62)). 11 characters give you about 65.5 bits of entropy, so by the birthday bound collisions only become likely as you approach roughly 2**32.7 ids (about 7 billion).
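To make the arithmetic concrete, here is a small Python sketch of the entropy figure and the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)), where N is the number of possible ids. It assumes ids are uniformly distributed over a 62-character alphabet, which is only approximately true for stripped base64; the function names are hypothetical.

```python
import math


def entropy_bits(length, alphabet_size=62):
    """Bits of entropy in a uniformly random id of the given length."""
    return length * math.log2(alphabet_size)


def collision_probability(n_ids, length, alphabet_size=62):
    """Birthday approximation of the chance that any two of n_ids collide."""
    space = alphabet_size ** length
    # expm1 keeps precision when the exponent is very close to zero.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


print(round(entropy_bits(11), 1))
print(collision_probability(1_000_000, 11))
print(collision_probability(7_000_000_000, 11))
```

Under these assumptions, a million ids give a negligible collision chance, while several billion ids bring the probability into the tens of percent, so the number of retained characters should be chosen with the expected record count in mind.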
A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.
I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:
Create a SHA1 hash from the data, and store it in a table with a serial number field.
Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.
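The hash-to-serial lookup described above might look like the following Python sketch. An in-memory dict stands in for the database table, and the class name SerialRegistry and method id_for are hypothetical.

```python
import hashlib


class SerialRegistry:
    """Maps the SHA1 hash of each distinct input to a small serial number."""

    def __init__(self):
        self._serial_by_hash = {}   # stands in for the hash/serial table

    def id_for(self, data):
        digest = hashlib.sha1(data.encode("utf-8")).hexdigest()
        # Insert a new row with the next serial if the hash is unseen.
        if digest not in self._serial_by_hash:
            self._serial_by_hash[digest] = len(self._serial_by_hash) + 1
        return self._serial_by_hash[digest]


registry = SerialRegistry()
print(registry.id_for("Oscar|Grouch|1|Sesame|12345"))
print(registry.id_for("Oscar|Grouch|1|Sesame|12345"))
```

The same input always maps back to the same serial, and the serial stays short because it only grows with the number of distinct inputs seen.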
I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?
So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle sequence or a SQL Server IDENTITY column (also easy).
If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.
EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.
Is that feasible in your situation?
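As a rough sketch of canonicalizing before hashing, the Python below normalizes letter case and whitespace and then hashes the joined fields. It deliberately does not attempt street-address standardization (e.g. "St" versus "Street"), which would need domain-specific rules; the function names are hypothetical.

```python
import hashlib
import re


def canonicalize(text):
    """Trim, collapse internal whitespace, and upper-case a single field."""
    return re.sub(r"\s+", " ", text.strip()).upper()


def record_key(*fields):
    """SHA1 hex digest of the canonicalized fields joined with '|'."""
    canonical = "|".join(canonicalize(f) for f in fields)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Trivial differences in case and spacing no longer change the key.
print(record_key("Oscar ", "grouch", "1  Sesame") ==
      record_key("oscar", "GROUCH", "1 sesame"))
```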
Well, if there's more than one person at the same address with the same name, you're toast here (without adding code to detect this and add a discriminator of some kind).
But assuming that is not an issue, the street address and zip code portion of the full address is sufficient to guarantee uniqueness there, so adding enough data from the name should take care of the issue...
Do you have access to a database, or other persistence mechanism, where you could generate and maintain key values for each address? Then keep the address and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address or person encountered... and then use the autogenerated alpha-numeric key...
You could use AAAAA01 for the first person at the first address,
AAAAA02 for the second person at the first address,
AAAAB07 for the seventh resident at the second address, etc.
If you don't have any way to generate and maintain these entity-key mappings, then you need to use the full street address/zip and full name, or a hash value of the same, although the hash value approach has a small chance of generating duplicates...

Resources