Hash for unordered set? - algorithm

I am trying to solve a one-way identity problem: a group of authors wants to publish something without revealing their real usernames. Is there an algorithm or library for hashing an unordered set of usernames?
Some people would suggest sorting the set alphabetically, joining it, and then hashing the result, but that is not an ideal solution for a dynamically growing array.
Additional questions (not compulsory for the main question):
If such algorithm exists, can we verify if a username is one of the authors by hash?
If we already know the hash of a group of usernames and a new author is added, can we get a new hash without knowing the previous authors' usernames?

Are you willing to accept a small probability of false positives, that is, of names that aren't authors being incorrectly identified as authors when someone checks? (The probability can be made arbitrarily small.)
If you are, then a Bloom filter would fit the bill perfectly.
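To make that concrete, here is a minimal Bloom filter sketch in Python. The class name, the bit-array size and the number of hash functions are arbitrary choices for illustration, not recommendations; a real deployment would size them from the expected number of authors and the false-positive rate you can tolerate.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Setting bits is order-independent and needs no earlier names.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


authors = BloomFilter()
for name in ["alice", "bob", "carol"]:
    authors.add(name)

print("alice" in authors)  # True: an added name is always found
```

Adding an author only sets more bits, so the filter can grow without anyone knowing the earlier usernames, and the order of insertion does not change the result.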

You can always generate a hash, regardless of whether or not you know the other authors' user names. You can't guarantee that it's a unique hash, though.
If you know all the user names in advance, you can generate a minimal perfect hash, but any time you add a user name you'll have to generate a completely new hash table--with different hashes. That's obviously not a good solution.
It depends on what you want your final keys to look like.
One possibility is to assign unique sequential IDs to the user names and then obfuscate those IDs so that they don't look sequential. This is similar to what YouTube does with its IDs: it turns a 64-bit number into an 11-character base64 string. I wrote a little article about that, with code in C#. Check out http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=839.
And, yes, the process is reversible.
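For illustration only, here is one way such a reversible obfuscation could look in Python. This is not necessarily the method from the linked article: the multiplier constant and the function names are assumptions made for this sketch, and multiplying by a constant is obfuscation, not encryption.

```python
import base64

MASK = (1 << 64) - 1
# Arbitrary odd constant (an assumption for this sketch); any odd number works.
MULTIPLIER = 0x9E3779B97F4A7C15
# An odd number always has an inverse modulo 2**64 (needs Python 3.8+).
INVERSE = pow(MULTIPLIER, -1, 1 << 64)


def encode_id(n):
    """Map a sequential 64-bit ID to an 11-character URL-safe string."""
    scrambled = (n * MULTIPLIER) & MASK
    raw = scrambled.to_bytes(8, "big")
    # 8 bytes encode to 11 base64 characters plus one "=" of padding.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_id(text):
    """Reverse encode_id and recover the original sequential ID."""
    raw = base64.urlsafe_b64decode(text + "=")
    scrambled = int.from_bytes(raw, "big")
    return (scrambled * INVERSE) & MASK


token = encode_id(12345)
print(token, decode_id(token))
```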

It sounds like a single hash won't do you any good:
1. You can't verify that a single username is in the hash; you would need to know all the usernames.
2. You can't add a new user to the hash without knowing something about the unhashed usernames (the order in which you add users to the hash will matter, for all good hash algorithms).
For #2, a partial solution is to keep not the usernames themselves but just something like an XOR of all the existing users. When you want to add a new user, XOR it with the existing value and re-hash the result. Then it won't matter in which order you added the users.
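A rough Python sketch of the XOR idea, with one assumption added on top of the description: each username is hashed with SHA-256 before it is XORed in, so that all values have the same size. The function names are made up for this example.

```python
import hashlib


def user_digest(username):
    # Hash each name first so the XOR works on fixed-size, well-mixed values.
    return int.from_bytes(hashlib.sha256(username.encode("utf-8")).digest(), "big")


def add_user(accumulator, username):
    # XOR is commutative and associative, so insertion order is irrelevant.
    return accumulator ^ user_digest(username)


def group_hash(accumulator):
    # Re-hash the accumulator to get the value that is actually published.
    return hashlib.sha256(accumulator.to_bytes(32, "big")).hexdigest()


acc = 0
for name in ["alice", "bob", "carol"]:
    acc = add_user(acc, name)
print(group_hash(acc))
```

One thing to watch: because XOR is its own inverse, adding the same username twice removes it from the accumulator again.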
But the real solution, I think, is just to have a set of hashes, rather than a hash of a set. Is there a reason you can't do this? Then you can easily keep the set ordered or unordered as you wish, you can easily add users to the set, and easily check to see if a given author is already in the set.
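A set of hashes could be as small as the following Python sketch. The salt value and the helper names are placeholders for illustration; note that unsalted hashes of short usernames are easy to guess by trying candidate names, which is why a salt parameter is included.

```python
import hashlib


def hash_name(username, salt=b"example-salt"):
    # The salt is a hypothetical placeholder, not a recommended value.
    return hashlib.sha256(salt + username.encode("utf-8")).hexdigest()


def is_author(username, hashes):
    return hash_name(username) in hashes


# A Python set is unordered, so there is no ordering problem to solve.
author_hashes = {hash_name(name) for name in ["alice", "bob"]}

# Adding an author needs only the new name, not the existing ones.
author_hashes.add(hash_name("carol"))

print(is_author("carol", author_hashes))    # True
print(is_author("mallory", author_hashes))  # False
```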

Related

How can an untrusted client prove that an ID isn't human meaningful?

I have a situation where an untrusted client is generating string IDs, but I don't want them to be human readable.
They don't need to be cryptographically random or even unique, I just don't want IDs like "idiot" to be accepted.
How can I go about preventing this?
EDIT: It would be nice if the IDs were "aspirationally-unique", like GUIDs.
Here are some ideas:
Convert the supplied ID to a hash or encrypt it. This will result in meaningless strings.
Create a dictionary of words you don't want used, and when the supplied ID contains one of those words, reject it... a PHP example can be found at https://scvinodkumar.wordpress.com/2009/06/17/bad-word-filter-and-replace/
Require that the IDs not contain sequences where two (or however many) alpha characters are next to each other
If you have any additional info/preference/requirements, let me know.
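As a rough sketch of the first and third ideas in Python (the function names are made up for illustration, and SHA-256 is just one possible choice of hash):

```python
import hashlib
import re

# Matches any two ASCII letters sitting next to each other.
TWO_LETTERS = re.compile(r"[A-Za-z]{2}")


def has_no_letter_pairs(candidate):
    """Accept only IDs in which no two letters are adjacent."""
    return TWO_LETTERS.search(candidate) is None


def to_opaque_id(candidate):
    """Replace whatever the client sent with a hex digest of it."""
    return hashlib.sha256(candidate.encode("utf-8")).hexdigest()


print(has_no_letter_pairs("idiot"))      # False: rejected
print(has_no_letter_pairs("a1b2-c3d4"))  # True: accepted
print(to_opaque_id("idiot"))
```

Hashing on the server side means the client's choice of string never shows up in the stored ID, whereas the letter-pair rule keeps the client's ID but restricts what it may look like.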

Is there a way of knowing which hash algorithm is used?

Is there a way of knowing which hash algorithm is used?
My question arises because I've got a database from a customer with some users and passwords. I have no idea what the passwords are (so they're correctly stored in the database), and the customer would rather not give these passwords away (which is understandable).
I have access to the database and I know that the password hash is 60 characters long, but nothing else.
I basically want to create a new user (directly in the database if possible) with a temporary password so I can log in to the system, but that's kind of impossible if I don't know how to create the password hash. Any thoughts?
The system is built with CodeIgniter, but I don't know what authentication method is used.
What data do the password hashes contain? Do they contain only 0-9 and a-f, i.e. hex values, or can they contain other characters too? If you want to know the algorithm, it is crucial to answer this question.
If they contain hex values only, 60*4 = 240, and there is no common algorithm that gives a hash 240 bits long.
It has been suggested that the stored value contains a salt, which might explain the unusual length.
Why not ask the customer what hash algorithm is used? It is understandable that the customer doesn't want to give away these passwords, but there should be no objection to giving away the hash algorithm.
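The checks described above are easy to script. The Python sketch below only reports surface features of a stored value; the function name and the returned fields are invented for this example. For what it is worth, bcrypt strings in the usual crypt-style format are 60 characters long and start with a prefix such as $2a$ or $2y$, which would fit the salt suggestion, but whether that applies to this database cannot be told from the question.

```python
import re

HEX_ONLY = re.compile(r"[0-9a-fA-F]+")


def describe_hash(stored):
    """Report basic surface facts about a stored password hash string."""
    is_hex = HEX_ONLY.fullmatch(stored) is not None
    return {
        "length": len(stored),
        "hex_only": is_hex,
        # Each hex character encodes 4 bits.
        "bits_if_hex": len(stored) * 4 if is_hex else None,
        # A leading "$" hints at a crypt-style string that embeds an
        # algorithm identifier and a salt (a hint only, not proof).
        "crypt_style": stored.startswith("$"),
    }


print(describe_hash("a" * 60))
```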

Is it possible to get items from DynamoDB where the primary key ends with a given string?

Is it possible, using the AWS Ruby SDK (or just DynamoDB in general), to get an item or items from a table that uses a primary key only, and where that primary key ends with a certain string?
I haven't come across anything in the docs that explicitly answers this question, either in the ruby ddb docs or the general docs for ddb. I'm not saying the question is not answered, but if it is, I can't find it.
If it is possible, could someone provide an example for ruby or link to the docs where an example exists?
Although @Ryan is correct and this can be done with a query, just bear in mind that you're doing a full table scan here. That might be OK for a one-time job, but it's probably not the best practice for a routine task (and of course not as part of your API calls).
If your use-case involves quickly finding objects based on their suffix in a specific field, consider extracting this suffix (assuming it's a fixed-size suffix) as another field and have a secondary index on that one. If you want to query arbitrary length suffixes, I would create a lookup table and update it with possible suffixes (or some of them, to save some calls, and then filter when querying).
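The fixed-size suffix idea can be sketched without any AWS calls. The Python below is an in-memory stand-in for a secondary index, not DynamoDB code: in a real table the extracted suffix would be stored as its own attribute and indexed. The suffix length of 4 and the function names are assumptions for the example.

```python
from collections import defaultdict

SUFFIX_LEN = 4  # assumed fixed suffix length


def build_suffix_index(keys, suffix_len=SUFFIX_LEN):
    """Group keys by their last suffix_len characters."""
    index = defaultdict(list)
    for key in keys:
        index[key[-suffix_len:]].append(key)
    return index


def find_by_suffix(index, suffix, suffix_len=SUFFIX_LEN):
    """Look up candidates by the indexed suffix, then filter locally."""
    if len(suffix) < suffix_len:
        raise ValueError("suffix is shorter than the indexed length")
    candidates = index.get(suffix[-suffix_len:], [])
    return [key for key in candidates if key.endswith(suffix)]


index = build_suffix_index(["order-2021", "invoice-2021", "order-2022"])
print(find_by_suffix(index, "2021"))    # ['order-2021', 'invoice-2021']
print(find_by_suffix(index, "r-2021"))  # ['order-2021']
```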
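It looks like you would want to use the Query method on the SDK to find the items you're looking for. It seems that "EndsWith" is not available as a comparison operator in the SDK, though, so you would need to use CONTAINS and then check your results locally. Note that a key condition in Query only supports equality on the partition key (plus range comparisons and begins_with on a sort key), so for a suffix match on the primary key the CONTAINS condition would in practice be applied as a filter on a Scan.
This should lead to the best performance, letting DynamoDB do the initial heavy lifting and then further pruning the results once you receive them.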
http://docs.aws.amazon.com/sdkforruby/api/Aws/DynamoDB/Client.html#query-instance_method

dynamically classify categories

I am new at the idea of programming algorithms. I can work with simplistic ideas, but my current project requires that I create something a bit more complicated.
I'm trying to create a categorization system based on keywords and subsets of 'general' categories that filter down into more detailed categories that requires as little work as possible from the user.
I.E.
Sports >> Baseball >> Pitching >> Nolan Ryan
So, if a user decides they want to talk about "Baseball" and they filter the search, I would like to also include "Sports".
User enters: "baseball"
User is then taken to Sports >> Baseball
Now I understand that this would be impossible without a living, breathing dynamic program that connects those two categories in some way. It would also require some user input initially, and many more inputs throughout the lifetime of the software in order to maintain it and keep it up to date.
But alas, asking for such an algorithm would be frivolous without very concrete specifics about what I'm trying to do, and I'm not asking for a handout.
Instead, I am curious if people are aware of similar systems that have already been implemented and if there is documentation out there describing how it has been done. Or even some real life examples of your own projects.
In short, I have a 'plan' but it requires more user input than I really want. I feel getting more info on the subject would be the best course of action before jumping head first into developing this program.
Thanks
IMHO it isn't as hard as you think. What you want is called tagging, and you can do it automatically just by recording the correlation between tags (i.e. a tag carries its own meaningful information plus its relations to other tags). Then, if the user selects a tag, you relate it to the others by looking it up in your ADT collection (which can be as simple as an array).
Tag:
    Sport
Related Tags
    Football
    Soccer
    ...
I'm hoping this helps!
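A small Python sketch of that tag structure, assuming relations are symmetric and tags are compared case-insensitively (both are choices made for this example, as are the class and method names):

```python
from collections import defaultdict


class TagIndex:
    """Stores which tags are related to which."""

    def __init__(self):
        self._related = defaultdict(set)

    def relate(self, first, second):
        # Record the relation in both directions.
        first, second = first.lower(), second.lower()
        self._related[first].add(second)
        self._related[second].add(first)

    def expand(self, tag):
        """Return the tag itself plus every tag related to it."""
        tag = tag.lower()
        return {tag} | self._related.get(tag, set())


tags = TagIndex()
tags.relate("Sport", "Football")
tags.relate("Sport", "Soccer")
print(tags.expand("soccer"))  # contains 'soccer' and 'sport'
```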
It sounds like what you want to do is create a tree/menu structure, and then be able to rapidly retrieve the "breadcrumb" for any given key in the tree.
Here's what I would think:
Create the tree with all the branches. It's okay if you want branches to share keys - as long as you can give the user a "choice" of "Multiple found, please choose which one... ?"
For every key in the tree, generate the breadcrumb. This is time-consuming, and if the tree is very large and updated regularly, it may be better done offline, in the cloud, or via Hadoop, etc.
Store the key and the breadcrumb in a key/value store such as Redis, or in memory/cached as desired. You'll want every value to be an array if you want to share keys across categories/branches.
When the user selects a key - the key is looked up in the store, and if the resulting value contains only one match, then you simply construct the breadcrumb to take the user where you want them to go. If it has multiple, you give them a choice.
I would even say, if you need something more organic, say a user can create "new topic" dynamically from anywhere else, then you might want to not use a tree at all after the initial import - instead just update your key/value store in real-time.
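The steps above can be sketched in Python with a plain dict standing in for the key/value store. The sample tree is invented for illustration (the "Softball" branch exists only to show a key shared by two branches), and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical category tree: each key maps to its sub-categories.
TREE = {
    "Sports": {
        "Baseball": {"Pitching": {"Nolan Ryan": {}}},
        "Softball": {"Pitching": {}},
    }
}


def build_breadcrumbs(tree, path=()):
    """Return {lowercased key: [breadcrumb, ...]} for every node."""
    store = defaultdict(list)
    for name, children in tree.items():
        crumb = path + (name,)
        store[name.lower()].append(" >> ".join(crumb))
        # Merge the breadcrumbs generated for the sub-tree.
        for key, crumbs in build_breadcrumbs(children, crumb).items():
            store[key].extend(crumbs)
    return store


def lookup(store, query):
    """One result: go straight there. Several: let the user choose."""
    return store.get(query.strip().lower(), [])


store = build_breadcrumbs(TREE)
print(lookup(store, "baseball"))  # ['Sports >> Baseball']
print(lookup(store, "pitching"))  # two matches, so the user must choose
```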

associate multiple strings to only one

I'm trying to make an algorithm that easily simplifies and groups synonyms (with misspellings, capitals, acronyms, etc.) into only one. I suppose there should be a standard way to build a structure such that, when looking up a string with possible misspellings, if the string exists in the structure, it returns a normalized string key. In short, sometimes the same concept can be written in several ways, but I only want to keep the concept.
For instance: suppose I want to normalize or simplify the appearances of
"General Director", "General Manager", "G, Dtor", "Gen Dir", ...
into
"GEN_DIR"
and keep only this result for further reference.
By the way, I suppose that building a Hash with key/value pairs like
hash["General Director"]="GEN_DIR"
hash["General Manager"]="GEN_DIR"
hash["G, Dtor"]="GEN_DIR"
hash["G, Dir"]="GEN_DIR"
could be a solution, but I suspect that there are more elegant or adequate solutions to that.
I would also need the way to persist this associative structure easily without any database because it should grow as I find more mismatches of the same word or sentence. A possible approach I think is to define this structure by means of a DSL, but I'm open to suggestions.
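One possible shape for this, sketched in Python rather than a DSL: a dictionary keyed on a cleaned-up form of the string, saved to a JSON file so that no database is needed. The class name, the cleaning rule (lowercase, letters and digits only) and the file format are all assumptions for illustration.

```python
import json
import re


def clean(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


class Normalizer:
    def __init__(self, mapping=None):
        self._mapping = dict(mapping or {})

    def learn(self, variant, canonical):
        # Store the cleaned variant so case and punctuation do not matter.
        self._mapping[clean(variant)] = canonical

    def normalize(self, text):
        # Returns None when the string is not known yet.
        return self._mapping.get(clean(text))

    def save(self, path):
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(self._mapping, handle, indent=2, sort_keys=True)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as handle:
            return cls(json.load(handle))


titles = Normalizer()
titles.learn("General Director", "GEN_DIR")
titles.learn("G, Dtor", "GEN_DIR")
print(titles.normalize("general  DIRECTOR"))  # GEN_DIR
```

Cleaning the key before lookup removes the need for separate entries that differ only in capitals, spacing or punctuation; genuinely different spellings still need their own entries.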
Well, there is no rule, at least a clear one.
My aim is to scrape from the web some "structured" data that is sometimes incorrectly or incompletely typed. Some fields are descriptions and can be left as is. But some fields are supposed to be "sets" and aren't correctly typed (as in my example). When a human reads them, they immediately know what is meant and can associate the text with its meaning.
But I would like to automate as much as possible the process of reducing those possible mismatches to only one "string" (or symbol) before, for instance, saving it into a database. So what I need is a kind of hash or dictionary, as sawa correctly stated, that I can use to look up any such dirty string and get the normalized string or symbol.
Also, of course, it would be desirable to have a way for this hash (or whatever else it could be) to learn from new mismatches and add a new association automatically (possibly based on a distance measure between the mismatched string and the normalized string: if it is lower than X, a new association is built). The whole association (i.e. the hash) should grow as new mismatches and concepts arise, and therefore it should be persisted somewhere (possibly in an XML file, or something like what Mori answered below) for future use.
Any new ideas?
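As a starting point for the "learn from new mismatches" part, here is a hedged Python sketch using the standard library's difflib. It uses a similarity ratio instead of a distance (higher means closer), the 0.85 cutoff is an arbitrary assumption, and the function name is invented. Heavy abbreviations will probably still need manual entries.

```python
import difflib

known = {
    "general director": "GEN_DIR",
    "general manager": "GEN_DIR",
    "gen dir": "GEN_DIR",
}


def normalize_fuzzy(text, known, cutoff=0.85):
    """Exact lookup first, then the closest known variant above cutoff."""
    key = text.strip().lower()
    if key in known:
        return known[key]
    matches = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
    if not matches:
        return None  # too far from anything known; needs a human decision
    # Learn the new spelling so the next lookup is an exact hit.
    known[key] = known[matches[0]]
    return known[key]


print(normalize_fuzzy("Genral Director", known))  # GEN_DIR
```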
