I have been looking at using MapReduce to build a parallelized record-combining system. The language doesn't matter: I can use a pre-existing framework such as Hadoop, or build my own if necessary; I'm not worried about that.
The problem that I keep running into, however, is that I need the records to be matched on multiple criteria. For example, I may need to match records on the person's name or the person's phone number, but not necessarily on both the name and the phone number.
For instance, given the following keys for each record:
'John Smith' and '555-555-5555'
'Jane Smith' and '555-555-5555'
'John Smith' and '555-555-1111'
I want the system to take all three records, figure out that they match on one of the keys, and combine them into a single combined record that has both names ('John Smith' and 'Jane Smith') as well as both phone numbers ('555-555-5555' and '555-555-1111').
Is this something that I can accomplish using MapReduce? If so, how would I go about matching the keys produced by the Map function so that all of the matched records can be passed into the Reduce function?* Alternatively, is there a different/better way I could be doing this? My only real requirement is that it needs to be parallelized.
[*] Please note: I am assuming that the Reduce function could be used in such a way that each call to the Reduce function produces a single combined record, rather than the Reduce function producing a single result for the entire job.
You can definitely do this in the map/reduce paradigm.
Let's say that you're matching on anything containing "smith" or phone numbers starting with "555". You would canonicalize your search string into "smith|^555", for example. In the Map phase, you would do:
John Smith / 555-555-5555 → K: smith|^555, V = (John Smith,555-555-5555)
Jane Smith / 555-555-5555 → K: smith|^555, V = (Jane Smith,555-555-5555)
John Smith / 555-555-1111 → K: smith|^555, V = (John Smith,555-555-1111)
Since you've given them all the same key ("smith|^555") they will all be handed off to the same reducer instance, which would now get, as input:
K: smith|^555, V: [(John Smith,555-555-5555),(Jane Smith,555-555-5555),(John Smith,555-555-1111)]
Now, in your reducer step, you can instantiate a hashset for names and another one for numbers; when you're done processing the array of values, output everything in the names hashset and everything in the numbers hashset.
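A minimal sketch of that reducer in Python (the function name and the dictionary output format are just for illustration):

def reduce_records(key, values):
    # `values` is the list of (name, phone_number) tuples that the map
    # phase emitted under the shared canonical key.
    names, numbers = set(), set()
    for name, phone in values:
        names.add(name)
        numbers.add(phone)
    # Emit one combined record holding every distinct name and number.
    return {"names": sorted(names), "numbers": sorted(numbers)}

combined = reduce_records("smith|^555", [
    ("John Smith", "555-555-5555"),
    ("Jane Smith", "555-555-5555"),
    ("John Smith", "555-555-1111"),
])
# {'names': ['Jane Smith', 'John Smith'],
#  'numbers': ['555-555-1111', '555-555-5555']}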
I don't think Map is useful here, because you can't really create a meaningful key for each record that will help identify the groupings of records.
It is not possible to implement this with Reduce either. Consider the example you yourself gave: if you query for 'Jane Smith', you cannot tell at that point that the first record is related to the query, so you will ignore it. In fact, you could end up chaining names and numbers together until you've got every record in the file. The only way to pick up all the matches is to iteratively scan over the list until you stop finding new links.
This is very easy to parallelize, though: you can just share out the records amongst some number of threads, and each can search its own records for new links. I'd suggest treating these sets as rings of data, so that you can record the point you were at with the most up-to-date information, and you know you're finished once all threads have done a complete loop.
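A rough single-machine sketch of that chaining idea, using union-find to group records that share a name or a phone number (the record data is from the question; the union-find helpers are my own illustration, and each thread could apply the same unions to its own shard of records before a final merge):

records = [
    ("John Smith", "555-555-5555"),
    ("Jane Smith", "555-555-5555"),
    ("John Smith", "555-555-1111"),
]

parent = {}

def find(x):
    # Union-find with path halving.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Link every record's name to its phone number; transitive links form groups.
for name, phone in records:
    union(("name", name), ("phone", phone))

groups = {}
for name, phone in records:
    root = find(("name", name))
    group = groups.setdefault(root, {"names": set(), "numbers": set()})
    group["names"].add(name)
    group["numbers"].add(phone)

for group in groups.values():
    print(sorted(group["names"]), sorted(group["numbers"]))
# ['Jane Smith', 'John Smith'] ['555-555-1111', '555-555-5555']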
Related
Is there an efficient way to store first and last names in a data structure so that we can look up a person using either first or last name? I would consider a binary search tree keyed on first name. It would be efficient for searching by first name, but wouldn't be efficient when searching by last name. We could also consider a second BST keyed on last name. Any ideas for implementing this efficiently?
What if the question is:
String names[] = { "A B", "C D" };
A requirement is to be able to extend this directory dynamically at runtime, without persistent storage. The directory can eventually grow to hundreds or thousands of names and must be searchable by first or last name.
Now suppose we can't use hash tables to store the names. Any ideas?
Two hash tables: one from first name to person, and one from last name to person.
Simple is best.
Why not put both first and last names in a trie?
As a bonus, this way you can even get suggestions on partial names by traversing all leaves below the current node (maybe in an asynchronous call).
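A compact sketch of that trie in Python (class and method names are hypothetical): both the first and the last name of each person are inserted, and suggestions come from walking the subtree under the typed prefix.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.people = []          # people whose first or last name ends here

class NameTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, full_name):
        first, last = full_name.split(" ", 1)
        for key in (first, last):
            node = self.root
            for ch in key.lower():
                node = node.children.setdefault(ch, TrieNode())
            node.people.append(full_name)

    def suggest(self, prefix):
        # Find everyone whose first or last name starts with `prefix`.
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        matches, stack = [], [node]
        while stack:                       # traverse the whole subtree
            n = stack.pop()
            matches.extend(n.people)
            stack.extend(n.children.values())
        return matches

directory = NameTrie()
for name in ["Ada Lovelace", "Alan Turing", "Grace Hopper"]:
    directory.add(name)
print(directory.suggest("lo"))   # ['Ada Lovelace'] (matched on the last name)
print(directory.suggest("a"))    # Ada Lovelace and Alan Turing, in either order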
Your idea is pretty good, but here's another option: how about implementing two hash tables?
The first hash table would use first names as a key, and the associated value would either be the last name or a pointer to a Name object. The second hash table would use last names as keys, with the first names or pointers to Name as the values.
Personally, for the values, I would go with a pointer to a Name object, since this approach is more flexible if you'd like to store even more information later (e.g. date of birth, etc.).
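A minimal sketch of the two-hash-table approach (the Name class and the list-valued entries are assumptions, the latter to cope with several people sharing a name):

class Name:
    def __init__(self, first, last, date_of_birth=None):
        self.first = first
        self.last = last
        self.date_of_birth = date_of_birth   # room for more fields later

by_first = {}   # first name -> list of Name objects
by_last = {}    # last name  -> list of Name objects

def add(person):
    by_first.setdefault(person.first, []).append(person)
    by_last.setdefault(person.last, []).append(person)

def lookup(name):
    # Search by either first or last name.
    return by_first.get(name, []) + by_last.get(name, [])

add(Name("Jose", "Guitierrez"))
add(Name("Jane", "Smith"))
print([(p.first, p.last) for p in lookup("Smith")])   # [('Jane', 'Smith')]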
Also, see Does Java have a HashMap with reverse lookup?…, which is specific to Java but the discussion on the data structures is relevant to any language.
Note that structures such as Bidirectional Sorted Maps also allow range searches (which dual hash tables don't).
If you need to search only by first name or only by last name, then yes, two hashmaps are best (and notice you're not duplicating the data, you're partitioning it). But if you don't mind, put both first and last names in a single hashmap and don't differentiate between the two.
I have inherited a mapreduce codebase which mainly calculates the number of unique user IDs seen over time for different ads. To me it doesn't look like it is being done very efficiently, and I would like to know if anyone has any tips or suggestions on how to do this kind of calculation as efficiently as possible in mapreduce.
We use Hadoop, but I'll give an example in pseudocode, without all the cruft:
map(key, value):
    ad_id = ...   // extract from value
    user_id = ... // extract from value
    collect(ad_id, user_id)

reduce(ad_id, user_ids):
    unique_user_ids = new Set()
    foreach (user_id in user_ids):
        unique_user_ids.add(user_id)
    collect(ad_id, unique_user_ids.size)
It's not much code, and it's not very hard to understand, but it's not very efficient. Every day we get more data, and so every day we need to look at all the ad impressions from the beginning to calculate the number of unique user IDs for that ad, so each day it takes longer, and uses more memory. Moreover, without having actually profiled the code (not sure how to do that in Hadoop) I'm pretty certain that almost all of the work is in creating the set of unique IDs. It eats enormous amounts of memory too.
I've experimented with non-mapreduce solutions and have gotten much better performance (though the question there is how to scale it in the same way that I can scale with Hadoop), but it feels like there should be a better way of doing it in mapreduce than the code I have. It must be a common enough problem for others to have solved it.
How do you implement the counting of unique IDs in an efficient manner using mapreduce?
The problem is that the code you inherited was written with the mindset "I'll determine the unique set myself" instead of the "let's leverage the framework to do it for me".
I would do something like this (pseudocode) instead:

// Phase 1: make (ad_id, user_id) the key, so the framework deduplicates the pairs.
map(key, value):
    ad_id = ...   // extract from value
    user_id = ... // extract from value
    collect(ad_id & user_id, unused dummy value)

reduce(ad_id & user_id, list of unused dummy values):
    output(ad_id, 1)   // one unique user id

// Phase 2: count the 1s per ad_id.
map(ad_id, 1):   // identity mapper
    collect(ad_id, 1)

reduce(ad_id, set of a lot of '1's):
    summarize
    output(ad_id, unique_user_ids)
Niels' solution is good, but for an approximate alternative that is closer to the original code and uses only one map reduce phase, just replace the set with a bloom filter. The membership queries in a bloom filter have a small probability of error, but the size estimates are very accurate.
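A rough sketch of that substitution in Python (the bloom filter here is hand-rolled to keep the example self-contained; the size and hash count would need tuning for real data volumes):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=10_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def reduce(ad_id, user_ids):
    # Approximate distinct count: far less memory than a real set, at the
    # cost of occasionally treating a brand-new user ID as already seen.
    seen = BloomFilter()
    count = 0
    for user_id in user_ids:
        if user_id not in seen:
            seen.add(user_id)
            count += 1
    return ad_id, count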
I've never built an algorithm for matching before and don't really know where to start. So here is my basic set up and why I'm doing it. Feel free to correct me if I'm not asking the right questions.
I have a database of names and unique identifiers for people. Several generated identifiers (internally generated and some third party), last name, first name, and birth date are the primary ones that I would be using.
Several times throughout the year I receive a list from a third party that needs to be imported and tied to the existing people in my database but the data is never as clean as mine. IDs could change, birth dates could have typos, names could have typos, last names could change, etc.
Each import could have 20,000 records so even if it's 99% accurate that's still 200 records I'd have to go in manually and match. I think I'm looking for more like 99.9% accuracy when it comes to matching the incoming people to my users.
So, how do I go about making an algorithm that can figure this out?
PS: Even if you don't have an exact answer, pointers to materials I could reference would also be helpful.
PPS Some examples would be similar to what m3rLinEz wrote:
ID: 9876234 Fname: Jose LName: Guitierrez Birthdate:01/20/84 '- Original'
ID: 9876234 Fname: Jose LName: Guitierrez Birthdate:10/20/84 '- Typo in birth date'
ID: 0876234 Fname: Jose LName: Guitierrez Birthdate:01/20/84 '- Wrong ID'
ID: 9876234 Fname: Jose LName: Guitierrez-Brown Birthdate:01/20/84 '- Hyphenated last name'
ID: 9876234 Fname: Jose, A. LName: Guitierrez Birthdate:01/20/84 '- Added middle initial'
ID: 3453555 Fname: Joseph LName: Guitierrez Birthdate:01/20/84 '- Probably someone else with same birthdate and same last name'
You might be interested in Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
You could compare each of your fields and compute the total distance, and by trial and error you may discover the appropriate threshold at which records can be interpreted as matched. I have not implemented this myself, just thought of the idea :}
For example:
Record A - ID: 4831213321, Name: Jane
Record B - ID: 431213321, Name: Jann
Record C - ID: 4831211021, Name: John
The distance between A and B will be lower than between A and C or B and C, which indicates a better match.
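A quick sketch of that comparison in Python (the unweighted sum across fields is my own simplification):

def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions, or substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def record_distance(rec_a, rec_b):
    # Sum the edit distance over every field of the two records.
    return sum(levenshtein(x, y) for x, y in zip(rec_a, rec_b))

a = ("4831213321", "Jane")
b = ("431213321", "Jann")
c = ("4831211021", "John")
print(record_distance(a, b), record_distance(a, c), record_distance(b, c))
# A-B has the smallest total distance, i.e. the best match.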
When it comes to something like this, do not reinvent the wheel. The Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database queries and fuzzy searches. They've been doing it longer than you; theirs will probably be better, too.
Good luck!
If you're dealing with data sets of this size and different resources being imported, you may want to look into an Identity Management solution. I'm mostly familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. It might be worth looking into.
If the data you are getting from 3rd parties is consistent (same format each time) I'd probably create a table for each of the 3rd parties you are getting data from. Then import each new set of data to the same table each time. I know there's a way to then join the two tables based on common columns in each using an SQL statement. That way you can perform SQL queries and get data from multiple tables, but make it look like it came from one single unified table. Similarly records that were added that don't have matches in both tables could be found and then manually paired. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import you could then use that joined table to create a third table containing all your data.
I would start with the easy near 100% certain matches and handle them first, so now you have a list of say 200 that need fixing.
For the remaining rows you can use a simplified version of Bayes' Theorem.
For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that the row matches, considering all the fields that have changed. Then pick the one that has the highest probability of being a match.
For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of being a match, whereas a row which differs in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
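A back-of-the-envelope sketch of that scoring (the probabilities are the illustrative ones from this answer, not calibrated values, and the field names are hypothetical):

P_SURNAME_CHANGE   = 0.001    # person changed their surname
P_FIRSTNAME_CHANGE = 0.0001   # person changed their first name
P_TYPO             = 0.002    # probability of a single-character typo

def levenshtein(a, b):
    # Same edit-distance computation as in the earlier sketch.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_likelihood(incoming, existing):
    # Multiply the probability of every change needed to turn one row into the other.
    likelihood = 1.0
    if incoming["last"] != existing["last"]:
        likelihood *= P_SURNAME_CHANGE
    if incoming["first"] != existing["first"]:
        likelihood *= P_FIRSTNAME_CHANGE
    likelihood *= P_TYPO ** levenshtein(incoming["dob"], existing["dob"])
    return likelihood

def best_match(incoming, existing_rows):
    return max(existing_rows, key=lambda row: match_likelihood(incoming, row))

existing = [
    {"first": "Jose",   "last": "Guitierrez", "dob": "01/20/84"},
    {"first": "Joseph", "last": "Guitierrez", "dob": "01/20/84"},
]
incoming = {"first": "Jose", "last": "Guitierrez", "dob": "10/20/84"}
print(best_match(incoming, existing))   # the first row wins: only a birth-date typo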
Regular expressions are what you need, why reinvent the wheel?
Just wondering, does anyone here have a good idea for generating nice order IDs?
For example:
832-28-394, which looks like a quite nice and formal order ID (rather than just using a database auto-increment number like ID=35).
The order ID needs to look random so that it cannot be guessed by a user.
E.g. 832-28-395 shouldn't exist, so there will always be some gap between IDs.
Just like the account number on your bank card?
Cheers
If you are using .NET you can use System.Guid.NewGuid()
Auto-incremented IDs are stored as integer or long integer data. One of the reasons for this is that the format is compact, saving space, including in indexes, which typically include the primary key for use with joins and such.
If you wish to create a nice looking id following a particular format syntax, you'll need to manage the generation of the IDs yourself, and store these in a "regular" column not one that is auto-incremented.
I suggest you keep using "ugly looking" IDs, be they auto-incremented or not, and format these values for display purposes only, using whatever format you desire, including formats that combine values from several columns. Depending on the database system you are using, you may be able to declare custom functions at the level of the database itself, allowing you to obtain the readily formatted value with a simple query, as in:
SELECT MakeAFancyId(id_field), some_other_columns, ..
FROM ...
If you cannot use some built-in or custom function at the level of SQL, you'll need to format the value supplied by SQL (an integer of sorts), into the desired format, on the client-side, using the language associated with your UI / presentation framework.
I'd create something where the first eight numbers are loosely in a pattern, and the third quartet looks random but is really a sort of checksum.
So, for example, the first eight digits increment based on the current seconds on the server clock.
The last four could be something like the sum of the first four, plus twice the sum of the second four, which will give either a two or three digit number. The final digit is calculated so that the sum of all 11 digits plus this last one is a multiple of 9.
This is slightly akin to how barcode numbers are verified. You can format the resulting 12 digits any way you want, although it is the first eight that are unique here.
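One way to read that scheme, sketched out in Python (the exact digit layout and the 4-4-4 grouping are my own guesses at the description):

import time

def make_order_id(now=None):
    # First eight digits: loosely patterned, derived from the server clock.
    seconds = int(now if now is not None else time.time())
    first8 = f"{seconds % 100_000_000:08d}"

    # Next three digits: sum of the first four digits plus twice the sum of
    # the second four, padded to three digits (the value is at most 108).
    q1, q2 = first8[:4], first8[4:]
    check = sum(map(int, q1)) + 2 * sum(map(int, q2))
    first11 = first8 + f"{check:03d}"

    # Final digit: chosen so the sum of all twelve digits is a multiple of 9,
    # similar in spirit to barcode check digits.
    last = (9 - sum(map(int, first11)) % 9) % 9
    digits = first11 + str(last)
    return f"{digits[:4]}-{digits[4:8]}-{digits[8:]}"

print(make_order_id())   # e.g. '1234-5678-0621'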
Hash the clock time.
Mod by 100,000 or something.
Format with hyphens.
Check for duplicates. If found, restart.
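A literal reading of those four steps in Python (I've kept eight digits and the 3-2-3 hyphenation to match the question's 832-28-394 example, rather than the 100,000 modulus suggested above):

import hashlib
import time

existing_ids = set()   # stand-in for a uniqueness check against the orders table

def new_order_id():
    while True:
        digest = hashlib.sha256(str(time.time_ns()).encode()).hexdigest()  # hash the clock
        number = int(digest, 16) % 100_000_000                             # mod down to 8 digits
        candidate = f"{number:08d}"
        candidate = f"{candidate[:3]}-{candidate[3:5]}-{candidate[5:]}"    # format with hyphens
        if candidate not in existing_ids:                                  # restart on duplicates
            existing_ids.add(candidate)
            return candidate

print(new_order_id())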
I would suggest using an autoincrement ID in the database to link tables and as the primary key. Integer fields are always faster than string fields for indexing as well as searching.
You can have the order number (which is for display) as a separate field in the order table. And whenever you are planning to send or display a URL to the user which contains the order ID (an autoincremented number), you can encrypt it with some algorithm.
Both of your purposes will then be served.
But I suggest not making a string the primary key, though you can put a unique constraint on the order number that is going to be displayed.
Hope this helps.
Kalpak Luniya
I would suggest internally you keep the database derived primary key, which is auto-incremented.
For the visible order number, you will probably need a longer length than 8 characters, if you are using this for security.
If you are using Ruby, look at SecureRandom, which will generate sufficiently random strings to accommodate this. For example, SecureRandom.hex(16) will give you a 32-character hex string (16 random bytes). I believe it can also give you base64 strings, which will look weirder but be shorter.
Make sure this is not your only security on an order, as it may not be that hard to find a valid order number within an 8-digit space, especially if some of the digits are just a checksum.
For security reasons I suggest that you use a cryptographically secure random number generator. Also think about increasing the ID length: with 8-digit IDs and 1 million users, the probability of guessing a valid user ID on the first try is 0.01, and it takes only about 69 tries to push the probability of a hit above 0.5.
I have a big load of documents, text files, that I want to search for relevant content. I've seen a searching tool, can't remember where, that implemented a nice method, as I describe in my requirement below.
My requirement is as follows:
I need an optimised search function: I supply this search function with a list of one or more partially-complete (or complete) words separated by spaces.
The function then finds all the documents containing words starting with or equal to the first word, then searches those found documents in the same way using the second word, and so on. At the end it returns a list containing the actual words found, linked with the documents (name & location) containing them, for the complete list of words.
The documents must contain all the words in the list.
I want to use this function to do an as-you-type search so that I can display and update the results in a tree-like structure in real-time.
A possible approach to a solution I came up with is as follows:
I create a database (most likely using mysql) with three tables: 'Documents', 'Words' and 'Word_Docs'.
'Documents' will have (idDoc, Name, Location) of all documents.
'Words' will have (idWord, Word) , and be a list of unique words from all the documents (a specific word appears only once).
'Word_Docs' will have (idWord, idDoc) , and be a list of unique id-combinations for each word and document it appears in.
The function is then called with the content of an editbox on each keystroke (except space):
the string is tokenized
(here my wheels spin a bit): I am sure a single SQL statement can be constructed to return the required dataset (actual_words, doc_name, doc_location), though I'm not a hotshot with SQL; alternatively, a sequence of calls for each token, parsing out the non-repeating idDocs?
this dataset (/list/array) is then returned
The returned list-content is then displayed:
e.g.: called with: "seq sta cod"
displays:
sequence - start - code - Counting Sequences [file://docs/sample/con_seq.txt]
- stop - code - Counting Sequences [file://docs/sample/con_seq.txt]
sequential - statement - code - SQL intro [file://somewhere/sql_intro.doc]
(and-so-on)
Is this an optimal way of doing it? The function needs to be fast, or should it be called only when a space is hit?
Should it offer word-completion? (I've got the words in the database.) At least this would prevent useless calls to the function for words that do not exist.
If word-completion: how would that be implemented?
(Maybe SO could also use this type of search-solution for browsing the tags? (In top-right of main page))
What you're talking about is known as an inverted index or postings list, and it operates similarly to what you propose and what Mecki proposes. There's a lot of literature about inverted indexes out there; the Wikipedia article is a good place to start.
Better, rather than trying to build it yourself, use an existing inverted index implementation. Both MySQL and recent versions of PostgreSQL have full text indexing by default. You may also want to check out Lucene for an independent solution. There are a lot of things to consider in writing a good inverted index, including tokenisation, stemming, multi-word queries, etc, etc, and a prebuilt solution will do all this for you.
The fastest way is certainly not to use a database at all, since if you do the search manually over optimized data you can easily beat the search performance of SELECT statements. The fastest way, assuming the documents don't change very often, is to build index files and use these for finding the keywords. The index file is created like this:
(1) Find all unique words in the text file: that is, split the text file on spaces into words and add every word to a list unless it is already on that list.
(2) Take all the words you have found and sort them alphabetically; the fastest way to do this is Three-Way Radix Quicksort, which is hard to beat in performance when sorting strings.
(3) Write the sorted list to disk, one word per line.
When you then want to search the document file, ignore it completely; instead, load the index file into memory and use binary search to find out whether a word is in the index file or not. Binary search is hard to beat when searching large, sorted lists.
Alternatively, you can merge step (1) and step (2) into a single step. If you use insertion sort (which uses binary search to find the right position to insert a new element into an already-sorted list), you not only have a fast algorithm for finding out whether a word is already on the list; if it is not, you immediately get the correct position to insert it, and if you always insert new words like that, you will automatically have a sorted list by the time you get to step (3).
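A small sketch of steps (1)-(3) and the binary-search lookup in Python (file names are hypothetical, the built-in sort stands in for the radix quicksort, and the prefix variant is added to cover the partially-typed words from the question):

import bisect

def build_index(document_path, index_path):
    # Steps 1-3: collect the unique words, sort them, write one per line.
    with open(document_path, encoding="utf-8") as f:
        words = sorted(set(f.read().split()))
    with open(index_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))

def load_index(index_path):
    with open(index_path, encoding="utf-8") as f:
        return f.read().splitlines()

def find_with_prefix(index, prefix):
    # Binary search into the sorted list, then walk forward while the prefix
    # still matches; an exact-word lookup is the special case where the
    # prefix is the whole word.
    start = bisect.bisect_left(index, prefix)
    matches = []
    for word in index[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

# Usage: build the index once, then search it as the user types.
# build_index("con_seq.txt", "con_seq.idx")
# index = load_index("con_seq.idx")
# print(find_with_prefix(index, "seq"))   # e.g. ['sequence', 'sequential']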
The problem is that you need to update the index whenever the document changes... however, wouldn't this be true for the database solution as well? On the other hand, the database solution buys you some advantages: you can use it even if the documents contain so many words that the index files wouldn't fit into memory anymore (unlikely, as even a list of all English words will fit into the memory of any average user PC); however, if you need to load the index files of a huge number of documents, then memory may become a problem. Okay, you can work around that using clever tricks (e.g. searching directly within files that you have mapped into memory with mmap, and so on), but these are the same tricks databases already use to perform speedy look-ups, so why reinvent the wheel? Furthermore, you can also avoid locking problems between searching words and updating indexes when a document changes (that is, if the database can perform the locking for you, or can perform the update or updates as an atomic operation). For a web solution with AJAX calls for list updates, using a database is probably the better solution (my first suggestion is more suitable for a locally running application written in a low-level language like C).
If you feel like doing it all in a single select call (which might not be optimal, but when you dynamically update web content with AJAX it usually proves to be the solution causing the fewest headaches), you need to JOIN all three tables together. My SQL is a bit rusty, but I'll give it a try:
SELECT COUNT(Documents.idDoc) AS NumOfHits, Documents.Name AS Name, Documents.Location AS Location
FROM Documents INNER JOIN Word_Docs ON Word_Docs.idDoc=Documents.idDoc
INNER JOIN Words ON Words.idWord=Word_Docs.idWord
WHERE Words.Word IN ('Word1', 'Word2', 'Word3', ..., 'WordX')
GROUP BY Documents.idDoc HAVING NumOfHits=X
Okay, maybe this is not the fastest select... I guess it can be done faster. Anyway, it will find all matching documents that contain at least one of the words, group equal documents together by ID, count how many rows have been grouped together, and finally only show results where NumOfHits (the number of words from the IN list that were found) is equal to the number of words within the IN list (if you search for 10 words, X is 10).
Not sure about the syntax (this is SQL Server syntax), but:
-- N is the number of elements in the list
SELECT idDoc, COUNT(1)
FROM Word_Docs wd INNER JOIN Words w on w.idWord = wd.idWord
WHERE w.Word IN ('word1', ..., 'wordN')
GROUP BY wd.idDoc
HAVING COUNT(1) = N
That is, without using LIKE. With LIKE, things are MUCH more complex.
Google Desktop Search or a similar tool might meet your requirements.