Imagine I have two sources of data. One source calls Mærsk "A.P. Møller - Mærsk A" while the other calls it "A.P. Møller - Mærsk A/S". I have a lot of companies and I want to streamline the naming of these.
Both sources are indexed in Elasticsearch, but I am too much of a newbie with this technology to come up with a proper search query. My initial thought was to use the common terms query, which gives decent results, but I figure there are better ways.
Any suggestions?
EDIT
A little clarification. My two sources are simply data sources that deliver company names. I've stored the names from each source in its own index - a document is just the name.
So I have two indices with company names (nothing else there). Now, for each company name in index A, I want to find the corresponding company in index B. The challenge is that there are various ways to write a company name - it is not standardized. I want to create this link with as little manual labour as possible and minimal risk of errors.
The OP has probably moved on from this question, given it was asked a while ago. And, for example, the common terms query has since been deprecated. But in case it helps others, here are some guidelines:
The Problem
As I understand it from the question, the problem is exemplified by this: I have two company names in two different data sources. One is:
A.P. Møller - Mærsk A
The other is:
A.P. Møller - Mærsk A/S
Assuming these represent the same company, the problem is how to resolve these to a single canonical name (for example, "Mærsk" if that is an appropriate name in this case).
Furthermore, how can we perform this matching process across a large set of company names in as automated a way as possible?
One warning - it usually pays to make such tasks repeatable - even if you think it's going to be a one-time-only clean-up exercise, it often doesn't end up that way (IMHO).
One Solution
Getting to a fully-automated matching solution is typically not possible in cases like this - some manual intervention is usually needed. But you may be able to get close.
I will take some liberties - for example, I will ignore the "two different data sources" aspect. Instead, I will assume we have one overall list, the union of both sources (because maybe there are name variants within each list).
Here is what has broadly worked for me in a similar domain (film titles).
FULL DISCLOSURE: I did not use Elasticsearch in my case. I used Lucene and some custom Java. But in this context, there are many similarities. My references below are all to Elasticsearch v7.5 functionality.
Tokenization
The question indicates that data has already been indexed - but using what tokenization steps? Some suggestions (which may already have been implemented in the OP's case):
Consider leaving in stop-words. Not a hard-and-fast rule, but consider what would happen to the band The The if stop-words were removed. There would be nothing to index. In relatively short text such as names, stop-words may be too important to remove.
Consider ASCII folding, etc. to normalize text (removal of diacritics, such as é to e; expansion of ligatures, such as æ to ae; and so on). This assumes you are using Latin-based text. Less relevant for other scripts (Chinese, etc.).
Consider customizations specific to your problem domain. For example, there may be nomenclature variations such as "LTD", "Ltd", etc. representing the word "Limited" in company names. Or the use of ampersands (&) in some examples, but "and" in others. "Smith & Sons, Ltd" versus "Smith and Sons Limited".
Other transformations, such as lowercasing and removal of punctuation, are more straightforward.
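To make the custom normalization idea concrete, here is a minimal Python sketch (not Elasticsearch configuration - in practice these steps would be token filters in your analyzer; the abbreviation map is purely illustrative):

```python
import re
import unicodedata

# Illustrative abbreviation map - extend for your own domain.
ABBREVIATIONS = {
    "ltd": "limited",
    "inc": "incorporated",
    "corp": "corporation",
}

def normalize_company_name(name: str) -> list[str]:
    """Lowercase, fold diacritics/ligatures, normalize punctuation and
    domain-specific abbreviations, and return the resulting tokens."""
    name = name.lower()
    # Strip diacritics (é -> e) and expand a couple of ligatures explicitly
    # (æ and ø survive NFKD, so they are handled by replace()).
    name = unicodedata.normalize("NFKD", name).replace("æ", "ae").replace("ø", "o")
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Treat '&' as the word "and" before stripping punctuation.
    name = name.replace("&", " and ")
    # Replace remaining punctuation with spaces and split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", name).split()
    return [ABBREVIATIONS.get(t, t) for t in tokens]

print(normalize_company_name("A.P. Møller - Mærsk A/S"))
# ['a', 'p', 'moller', 'maersk', 'a', 's']
print(normalize_company_name("Smith & Sons, Ltd"))
# ['smith', 'and', 'sons', 'limited']
```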
Supporting Metadata
The OP may not have access to any of this - but supporting metadata can be vital in determining if two name variants refer to the same entity. An example from the world of film titles: There are two movies in IMDb called "Kicking and Screaming" - and numerous TV episodes. They can be distinguished from each other by comparing related metadata such as:
type of release (movie, TV episode, etc).
year of initial release (perhaps with a +/- tolerance threshold).
I don't know what the equivalent might be for companies.
A fairly crude technique would be to append such data to each company name, thus increasing the number of tokens available in each indexable term.
Or, the metadata can be used downstream to further verify whether two terms match or not.
Matching & Score Thresholds
Let's assume we have simple word-boundary indexed terms (although there are plenty of other ways to go - ngrams, shingles, etc.).
Now we perform a search on each company name (plus additional metadata, if we added it).
Let's assume we have defined a threshold score that must be reached for a search result to be considered a match. The score should be easily adjustable to tune matching behavior.
If we get only one match which exceeds this score, we can assume we have an automated match: the two names represent the same underlying company.
If we get zero matches which exceed this score, then we can assume the company name is unique in our data set.
If we get multiple matches, then that is the point at which manual intervention may be needed, to determine if the names are equivalent or not.
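As a rough sketch of that three-way decision, assuming a hypothetical search_similar(name) helper that queries the other index and returns (candidate, score) pairs:

```python
SCORE_THRESHOLD = 12.0  # illustrative value - tune against your test cases

def classify(name, search_similar):
    """Decide whether a name auto-matches, is unique, or needs manual review.

    `search_similar` is a hypothetical helper that searches the other index
    and returns (candidate_name, score) pairs, excluding `name` itself.
    """
    hits = [(cand, score) for cand, score in search_similar(name)
            if score >= SCORE_THRESHOLD]
    if len(hits) == 1:
        return ("auto-match", hits[0][0])   # one confident match
    if not hits:
        return ("unique", None)             # nothing exceeded the threshold
    return ("needs-review", hits)           # several candidates: manual intervention
```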
Test Cases
The aim is to minimize false positive matches, while also minimizing match misses.
How do you know?
The only good answer I have for this is to generate a set of test cases. And the best way to do that is to study the data, so you can find suitably cunning & devious cases to test.
Conclusion
This all sounds like a lot of work. How much of it you actually do, or how little - how rigorous or how cursory - is up to you. Depends on your context, of course.
I wanted to download gene expression data generated by microarray experiments. I do not know too much about this subject, but as I understand it, rows often correspond to genes and columns correspond to samples. Ideally, I expect a matrix of gene expression data.
I've been searching on the internet, and although it may seem like there are many places to download such data, when I actually do download the data, I do not get the matrix of gene expression. Could someone please let me know if there is a place to get, or a way to download, gene expression data in the format I expect above?
Any help is appreciated.
If you look at e.g. this entry in the Gene Expression Omnibus, one of the file formats is "TXT" and contains a matrix like you are asking for, after some metadata.
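If the entry offers a "series matrix" TXT download, the metadata lines begin with "!", so a small pandas sketch like the following should recover the probes-by-samples matrix (the file name here is illustrative):

```python
import pandas as pd

# Hypothetical file name: a "series matrix" TXT file downloaded from a GEO entry.
# Lines beginning with "!" are metadata and are skipped as comments; what
# remains is a tab-separated matrix of probes (rows) by samples (columns).
expr = pd.read_csv("GSE1563_series_matrix.txt", sep="\t", comment="!", index_col=0)
print(expr.shape)   # (number of probes, number of samples)
print(expr.head())
```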
In principle microarray data can be expressed (please pardon the pun) as a matrix with samples as columns and genes as rows. In practice it is a good bit more complicated to derive such a representation from the raw data of an experiment. If you just get a pre-processed dataset, you have little guarantee that the raw data was processed in a way that makes it comparable to other experiments, or that the underlying raw data was of sufficiently high quality.
You are also going to need high quality metadata to derive any meaning from the data matrix. What were the biological conditions and sources from which the samples were derived? Which genes do the probes on the particular array correspond to? (Note that an identifier like 9890_at is a "probeset ID", a unique identifier for a molecular probe of a particular sequence design, which then needs to be mapped to a gene; different probes for the same gene won't give exactly the same response.)
The public microarray databases therefore provide a lot of additional information alongside the processed data matrix. In addition to GEO, which has already been mentioned, I would recommend ArrayExpress, which in my opinion has the better search interface.
For many people, the tool of choice for working with microarray data is the Bioconductor suite of software for the statistical programming language R.
Bioconductor provides APIs to download raw data with accompanying metadata from both repositories, see the GEO bioc package and ArrayExpress bioc package.
Both packages, in common with most Bioconductor software, come with excellent "vignettes" that introduce the software:
GEO bioc vignette and
ArrayExpress bioc vignette
Those vignettes should also give you examples of taking the raw data and deriving "ESets" (expression sets) from it. At that point you can access the gene expression matrix in the Bioconductor ESet object, and you have an object and APIs to interrogate the necessary metadata.
Note that there are different types of microarray. I would recommend starting with data from Affymetrix arrays as they have probably the most straightforward analysis APIs.
I would like to create an algorithm to distinguish persons writing on a forum under different nicknames.
The goal is to discover people registering a new account to flame the forum anonymously, rather than under their main account.
Basically, I was thinking about stemming the words they use and comparing users according to the similarities of these words.
As shown in the picture, user3 and user4 use the same words, which means there is probably one person behind both accounts.
It's clear that there are a lot of common words which are used by all users, so I should focus on "user specific" words.
Input is (related to the image above):
<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesn't matter.
Output should be:
user1
user2
user3 = user4
I am doing this in Java but I want this question to be language independent.
Any ideas how to do it?
1) How to store words/users? What data structures?
2) How to get rid of common words everybody uses? I have to somehow ignore them among the user-specific words. Maybe I could just ignore them because they get diluted, but I am afraid they will hide the significant differences in "user specific" words.
3) How to recognize the same users? Perhaps by somehow counting the same words between each pair of users?
I am very thankful in advance for any advice.
In general this is the task of author identification, and there are several good papers like this one that may give you a lot of information. Here are my own suggestions on the topic.
1. User recognition/author identification itself
The simplest kind of text classification is classification by topic, and there you take meaningful words first of all. That is, if you want to distinguish text about Apple the company from text about apple the fruit, you count words like "eat", "oranges", "iPhone", etc., but you commonly ignore things like articles, word forms, part-of-speech (POS) information and so on. However, many people may talk about the same topics but use different styles of speech, that is, the articles, word forms and all the things you ignore when classifying by topic. So the first and main thing you should consider is collecting the most useful features for your algorithm. An author's style may be expressed by the frequency of words like "a" and "the", POS information (e.g. some people tend to use the present tense, others the future), common phrases ("I would like" vs. "I'd like" vs. "I want") and so on. Note that topic words should not be discarded completely - they still show the themes the user is interested in. However, you should treat them somewhat specially; e.g. you can pre-classify texts by topic and then rule out users who are not interested in that topic.
When you are done with feature collection, you may use one of the machine learning algorithms to find the best guess for the author of a text. For me, the two best suggestions here are probability and cosine similarity between the text vector and the user's common vector.
2. Discriminating common words
Or, in the present context, common features. The best way I can think of to get rid of features that are used by all people more or less equally is to compute the entropy of each such feature:
entropy(x) = -sum(P(Ui|x) * log(P(Ui|x)))
where x is a feature, Ui is the i-th user, P(Ui|x) is the conditional probability of the i-th user given feature x, and the sum runs over all users.
A high entropy value indicates that the distribution of this feature across users is close to uniform, and thus the feature is almost useless for discriminating between them.
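A small Python illustration of that entropy calculation (the usage counts are made up):

```python
import math
from collections import Counter

# usage[user] maps each feature to how often that user produced it (toy data).
usage = {
    "user1": Counter({"the": 50, "gonna": 1}),
    "user2": Counter({"the": 47, "whilst": 9}),
    "user3": Counter({"the": 52, "gonna": 12}),
}

def feature_entropy(feature, usage):
    """Entropy of the user distribution for one feature:
    -sum_i P(Ui|feature) * log(P(Ui|feature))."""
    counts = [u.get(feature, 0) for u in usage.values()]
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# "the" is used roughly equally by everyone -> high entropy, little value;
# "whilst" is concentrated in one user -> low entropy, discriminative.
print(feature_entropy("the", usage), feature_entropy("whilst", usage))
```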
3. Data representation
A common approach here is to have a user-feature matrix. That is, you just build a table where rows are user ids and columns are features. E.g. cell [3][12] shows how many times user #3 used feature #12 (don't forget to normalize these frequencies by the total number of features the user ever used!).
Depending on the features you are going to use and the size of the matrix, you may want to use a sparse matrix implementation instead of a dense one. E.g. if you use 1000 features and for every particular user around 90% of the cells are 0, it doesn't make sense to keep all these zeros in memory and a sparse implementation is the better option.
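For example, a minimal sketch with scipy's sparse matrices (assuming Python is an option; the same idea applies in Java with a sparse matrix library):

```python
import numpy as np
from scipy.sparse import lil_matrix

n_users, n_features = 4, 1000

# Build incrementally in LIL format, then convert to CSR for fast row operations.
m = lil_matrix((n_users, n_features))
m[3, 12] = 7          # user #3 used feature #12 seven times
m[3, 40] = 3
m = m.tocsr()

# Normalize each row by the user's total feature count.
row_sums = np.asarray(m.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1          # avoid division by zero for empty users
m = m.multiply(1.0 / row_sums[:, None]).tocsr()
```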
I recommend a language modelling approach. You can train a language model (unigram, bigram, parsimonious, ...) on each of your user accounts' words. That gives you a mapping from words to probabilities, i.e. numbers between 0 and 1 (inclusive) expressing how likely it is that a user uses each of the words you encountered in the complete training set. Language models can be stored as arrays of pairs, hash tables or sparse vectors. There are plenty of libraries on the web for fitting LMs.
Such a mapping can be considered a high-dimensional vector, in the same way documents are considered vectors in the vector space model of information retrieval. You can then compare these vectors using KL-divergence or any of the popular distance metrics: Euclidean distance, cosine distance, etc. A strong similarity/small distance between two accounts' vectors might then indicate that they belong to one and the same user.
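A toy sketch of that idea with smoothed unigram models and cosine similarity (the posts are invented; a real setup would use far more text per account):

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Unigram language model with add-alpha smoothing over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def cosine(p, q):
    dot = sum(p[w] * q[w] for w in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm

# posts_by_user maps each account to all tokens it has produced (toy data).
posts_by_user = {
    "user3": "i reckon this forum is rubbish honestly".split(),
    "user4": "honestly this forum is rubbish i reckon".split(),
    "user1": "great discussion thanks everyone for the links".split(),
}
vocab = set(w for toks in posts_by_user.values() for w in toks)
lms = {u: unigram_lm(toks, vocab) for u, toks in posts_by_user.items()}

print(cosine(lms["user3"], lms["user4"]))  # high similarity -> possibly the same person
print(cosine(lms["user3"], lms["user1"]))  # lower similarity
```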
how to store words/users? What data structures?
You probably have some kind of representation for the users and the posts that they have made. I think you should have a list of words, and a list corresponding to each word containing the users who use it. Something like:
<word: <user#1, user#4, user#5, ...> >
how to get rid of common words everybody uses?
Hopefully, you have a set of stopwords. Why not extend it to include commonly used words from your forum? For example, for stackoverflow, some of the most frequently used tags' names should qualify for it.
how to recognize the same users?
In addition to using similarity or word-frequency based measures, you can also try using interactions between users. For example, user3 likes/upvotes/comments on each and every post by user8, or a new user does similar things for some other (older) user's posts.
I have been playing around with Markov chain text generation and naive Bayes classifiers. I am wondering if there is a way to apply either of those concepts towards identifying certain types of words in a novel, e.g. last names or place names.
I can look through my Markov chain and I see that certain words tend to relate in the same way to certain other types of words. E.g. "Mr." frequently comes before a last name, "went to" tends to come before a place name, and last names tend to follow first names.
Is there a good way that I can write a program that will take a list of example names and then go through a large set of books and identify all words like those names with decent accuracy? Is English regular enough for this to work? Has this been done before? Would this method have a name?
Thanks,
Andrew
In fact, there are only a few patterns for names, e.g.:
{FirstName}{Space}{Token with big first char}
{BigCharacter}{Dot}{Space}{Token with big first char}
{"Mr" | "Ms"}{Dot}{Space}{Token with big first char}
and several more. All you need is a dictionary of first names and a simple engine to catch such patterns. There's a good framework for this (and many other things) - GATE. It has a very large dictionary of first names and a special pattern language (JAPE) for manipulating token sequences. You can use it directly, or just get the dictionary and implement the logic yourself.
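If you implement the logic yourself, a rough regex sketch of the first and third patterns might look like this (the first-name dictionary is a tiny illustrative stand-in for GATE's):

```python
import re

# Tiny illustrative first-name dictionary - GATE ships a far larger one.
FIRST_NAMES = {"Andrew", "John", "Mary"}

# {"Mr" | "Ms"}{Dot}{Space}{Token with big first char}
honorific_pattern = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+([A-Z][a-z]+)")

# {FirstName}{Space}{Token with big first char}
firstname_pattern = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b")

def extract_candidate_names(text):
    names = set()
    for m in honorific_pattern.finditer(text):
        names.add(m.group(1))
    for m in firstname_pattern.finditer(text):
        if m.group(1) in FIRST_NAMES:
            names.add(f"{m.group(1)} {m.group(2)}")
    return names

print(extract_candidate_names("Mr. Darcy went to London with John Smith."))
# {'Darcy', 'John Smith'}
```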
I have got a simple contacts database but I'm having problems with users entering duplicate data. I have implemented a simple data comparison, but unfortunately the duplicated data that is being entered is not exactly the same. For example, names are incorrectly spelled, or one person will put in 'Bill Smith' and another will put in 'William Smith' for the same person.
So is there some sort of algorithm that can give a percentage for how similar an entry is to another?
So is there some sort of algorithm that can give a percentage for how similar an entry is to another?
Algorithms such as Soundex and edit distances (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, this will not be enough. As others have stated, "Bill" does not sound anything like "William".
The best solution I have found is to use a reduction algorithm and table to reduce each name to its root name.
To your regular Address table, add root-versions of the names, e.g.
Person (Firstname, RootFirstName, Surname, Rootsurname....)
Now, create a mapping table.
FirstNameMappings (Primary KEY Firstname, Rootname)
Populate your Mapping table by:
INSERT IGNORE INTO FirstNameMappings (Firstname, Rootname) SELECT Firstname, 'UNDEFINED' FROM Person;
This will add all firstnames that you have in your person table together with the RootName of "UNDEFINED"
Now, sadly, you will have to go through all the unique first names and map them to a RootName. For example "Bill", "Billl" and "Will" should all be translated to "William"
This is very time consuming, but if data quality really is important for you I think it's one of the best ways.
Now use the newly created mapping table to update the "Rootfirstname" field in your Person table. Repeat for surname and address. Once this is done you should be able to detect duplicates without suffering from spelling errors.
You can compare the names with the Levenshtein distance. If the names are the same, the distance is 0, else it is given by the minimum number of operations needed to transform one string into the other.
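A self-contained sketch of the standard dynamic-programming computation, which also shows why nicknames remain a problem for pure edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Bill Smith", "Bill Smyth"))     # 1 - likely the same person
print(levenshtein("Bill Smith", "William Smith"))  # 4 - nicknames stay "far apart"
```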
I imagine that this problem is well understood but what occurs to me on first reading is:
compare fields individually
count those that match (for a possibly loose definition of "match", and possibly weighting the fields differently)
present for human intervention any cases which pass some threshold
Use your existing database to get a good first guess for the threshold, and correct as you accumulate experience.
You may prefer a fairly strong bias toward false positives, at least at first.
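A minimal sketch of that weighted-field scoring, with illustrative field names, weights and threshold:

```python
# Illustrative field weights and a loose notion of "match" per field.
WEIGHTS = {"first_name": 1.0, "surname": 2.0, "email": 3.0, "phone": 2.0}
REVIEW_THRESHOLD = 4.0   # tune against known duplicates in your existing data

def fields_match(field, a, b):
    if not a or not b:
        return False
    a, b = a.strip().lower(), b.strip().lower()
    # Loose match: exact, or a prefix relationship for name fields.
    return a == b or (field in ("first_name", "surname") and (a.startswith(b) or b.startswith(a)))

def duplicate_score(rec_a, rec_b):
    """Weighted count of matching fields between two contact records (dicts)."""
    return sum(w for f, w in WEIGHTS.items() if fields_match(f, rec_a.get(f), rec_b.get(f)))

a = {"first_name": "Bill", "surname": "Smith", "email": "b.smith@example.com"}
b = {"first_name": "Billy", "surname": "Smith", "email": "b.smith@example.com"}
if duplicate_score(a, b) >= REVIEW_THRESHOLD:
    print("flag for human review")
```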
While I do not have an algorithm for you, my first action would be to take a look at the process involved in entering a new contact. Perhaps users do not have an easy way to find the contact they are looking for. Much like on Stack Overflow's new question form, you could suggest contacts that already exist on the new contact screen.
If you have access to SSIS, check out the Fuzzy Grouping and Fuzzy Lookup transformations.
http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
http://msdn.microsoft.com/en-us/library/ms137786.aspx
If you have a large database with string fields, you can very quickly find a lot of duplicates by using the simhash algorithm.
This may or may not be related, but minor misspellings might be detected by a Soundex search; e.g., this will allow you to consider Britney Spears, Britanny Spares, and Britny Spears as duplicates.
Nickname contractions, however, are difficult to treat as duplicates, and I doubt it is wise. There are bound to be multiple people named Bill Smith and William Smith, and you would have to repeat that mapping for Charles->Chuck, Robert->Bob, etc.
Also, if you are considering, say, Muslim users, the problem becomes more difficult (very many people, for example, are named Mohammed/Mohammad).
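If you want to experiment with Soundex in code, one option (assuming the jellyfish Python library, one of several implementations, is acceptable) is:

```python
import jellyfish  # one of several libraries that implement Soundex

for name in ("Britney Spears", "Britanny Spares", "Britny Spears"):
    first, last = name.split()
    print(name, "->", jellyfish.soundex(first), jellyfish.soundex(last))
# All three should collapse to the same pair of Soundex codes,
# whereas "Bill" and "William" produce different codes entirely:
print(jellyfish.soundex("Bill"), jellyfish.soundex("William"))
```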
I'm not sure it will work well for the names vs nicknames problem, but the most common algorithm in this sort of area would be the edit distance / Levenshtein distance algorithm. It's basically a count of the number of character changes, additions and removals required to turn one item into another.
For names, I'm not sure you're ever going to get good results with a purely algorithmic approach - What you really need is masses of data. Take, for example, how much better Google spelling suggestions are than those in a normal desktop application. This is because Google can process billions of web queries and look at what queries lead to each other, what 'did you mean' links actually get clicked etc.
There are a few companies which specialise in the name matching problem (mostly for national security and fraud applications). The one I could remember, Search Software America, seems to have been bought out by these guys http://www.informatica.com/products_services/identity_resolution/Pages/index.aspx, but I suspect any of these sorts of solutions would be far too expensive for a contacts application.
FullContact.com has APIs that can solve this for you, see their documentation here: http://www.fullcontact.com/developer/docs/?category=name.
They have APIs for Name Normalization (Bill into William), Name Deducer (for raw text), and Name Similarity (comparing two names).
All APIs are free at the moment, it could be a good way to get started.
You might also want to look into probabilistic matching.
For those wandering around the web who end up here, might I suggest that you try a Google Sheets add-on I created called Flookup.
It's particularly good with names and it has a couple of other awesome features which I'll describe below:
Say you have a list of names and there are 2 people called "John Smith". You can use the rank parameter from Flookup to instruct the algorithm to return the 1st, 2nd, 3rd or nth best match. This is helpful if you have additional information that you can use to identify the "John Smith" you want.
Say you have an additional database/list of apartment numbers. You can specify which "John Smith" you want by typing John Smith & Apartment A or John Smith & Apartment B as the lookup parameter, to help distinguish between the two names.
I hope you find Flookup as beneficial as others have.