Searching for companies with elasticsearch - elasticsearch

Imagine I have two sources of data. One source calls Mærsk "A.P. Møller - Mærsk A" while the other calls it "A.P. Møller - Mærsk A/S". I have a lot of companies and I want to streamline their naming.
Both sources are indexed in elasticsearch but I am too much of a newbie with this technology to come up with a proper search query. My initial thought was to use common, which gives decent results, but I figure there are better ways.
Any suggestions?
EDIT
A little clarification. My two sources are just data sources that deliver company names. I've stored the names from each source in its own index; a document is just the name.
So I have two indices with company names (nothing else there). Now for each company name in index A I want to find the corresponding company in index B. The challenge is that there are various ways to write a company name - it is not standardized. I want to create this link with as little manual labour as possible and minimal risk of errors as well.

The OP has probably moved on from this question, given it was asked a while ago. And, for example, common has now been deprecated. But in case it helps others, here are some guidelines:
The Problem
As I understand it from the question, the problem is exemplified by this: I have two company names in two different data sources. One is:
A.P. Møller - Mærsk A
The other is:
A.P. Møller - Mærsk A/S
Assuming these represent the same company, the problem is how to resolve these to a single canonical name (for example, "Mærsk" if that is an appropriate name in this case).
Furthermore, how can we perform this matching process across a large set of company names in as automated a way as possible?
One warning - it usually pays to make such tasks repeatable - even if you think it's going to be a one-time-only clean-up exercise, it often doesn't end up that way (IMHO).
One Solution
Getting to a fully-automated matching solution is typically not possible in cases like this - some manual intervention is usually needed. But you may be able to get close.
I will take some liberties - for example, I will ignore the "two different data sources" aspect. Instead, I will assume we have one overall list, the union of both sources (because maybe there are name variants within each list).
Here is what has broadly worked for me in a similar domain (film titles).
FULL DISCLOSURE: I did not use ElasticSearch, in my case. I used Lucene and some custom Java. But in this context, there are many similarities. My references below are all to ElasticSearch v7.5 functionality.
Tokenization
The question indicates that data has already been indexed - but using what tokenization steps? Some suggestions (which may already have been implemented in the OP's case):
Consider leaving in stop-words. Not a hard-and-fast rule, but consider what would happen to the band The The if stop-words were removed. There would be nothing to index. In relatively short text such as names, stop-words may be too important to remove.
Consider ASCII folding, etc. to normalize text (removal of diacritics, such as é to e; expansion of ligatures, such as æ to ae; and so on). This assumes you are using Latin-based text; it is less relevant for other scripts (Chinese, etc.).
Consider customizations specific to your problem domain. For example, there may be nomenclature variations such as "LTD", "Ltd", etc. representing the word "Limited" in company names. Or the use of ampersands (&) in some examples, but "and" in others. "Smith & Sons, Ltd" versus "Smith and Sons Limited".
Other transformations such as lowercase and removal of punctuation are more straightforward.
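The suggestions above can be sketched in plain Python, outside any search engine. This is only an illustration: the function name normalize_name and the two lookup tables are hypothetical and deliberately tiny, and in Elasticsearch the same steps would normally live in an analyzer rather than in application code.

```python
import re
import unicodedata

# Hypothetical, illustrative tables; extend them for your own data.
LIGATURES = {"æ": "ae", "ø": "o", "œ": "oe", "ß": "ss"}
SYNONYMS = {"ltd": "limited", "&": "and"}


def normalize_name(name):
    """Return a lowercased, folded, punctuation-free form of a company name."""
    text = name.lower()
    # Letters such as æ and ø have no Unicode decomposition, so map them by hand.
    for source, target in LIGATURES.items():
        text = text.replace(source, target)
    # Decompose accented letters and drop the combining marks (é -> e).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep "&" as its own token; every other punctuation run becomes a space.
    text = text.replace("&", " & ")
    text = re.sub(r"[^\w&]+", " ", text)
    # Stop-words are intentionally kept (think of "The The").
    return " ".join(SYNONYMS.get(token, token) for token in text.split())
```

With these tables, "Smith & Sons, Ltd" and "Smith and Sons Limited" normalize to the same string, while "The The" survives intact.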
Supporting Metadata
The OP may not have access to any of this - but supporting metadata can be vital in determining if two name variants refer to the same entity. An example from the world of film titles: There are two movies in IMDb called "Kicking and Screaming" - and numerous TV episodes. They can be distinguished from each other by comparing related metadata such as:
type of release (movie, TV episode, etc).
year of initial release (perhaps with a +/- tolerance threshold).
I don't know what the equivalent might be for companies.
A fairly crude technique would be to append such data to each company name, thus increasing the number of tokens available in each indexable term.
Or, the metadata can be used downstream to further verify whether two terms match or not.
Matching & Score Thresholds
Let's assume we have simple word-boundary indexed terms (although there are plenty of other ways to go - ngrams, shingles, etc.).
Now we perform a search on each company name (plus additional metadata, if we added it).
Let's assume we have defined a threshold score that must be reached for a search result to be considered a match. The score should be easily adjustable to tune matching behavior.
If we get only one match which exceeds this score, we can assume we have an automated match: the two names represent the same underlying company.
If we get zero matches which exceed this score, then we can assume the company name is unique in our data set.
If we get multiple matches, then that is the point at which manual intervention may be needed, to determine if the names are equivalent or not.
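The three outcomes above can be sketched as a small triage function. The score function here is an assumption: it uses Python's difflib as a stand-in for whatever relevance score your search engine returns, and the names score and triage and the default threshold of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in similarity in [0, 1]; a real setup would use the
    # relevance score returned by the search engine instead.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(name, candidates, threshold=0.8):
    """Classify a name as an automated match, unique, or needing review."""
    hits = [c for c in candidates if score(name, c) >= threshold]
    if len(hits) == 1:
        return ("match", hits)   # exactly one hit: automated match
    if not hits:
        return ("unique", hits)  # nothing above the threshold
    return ("review", hits)      # several hits: manual intervention
```

Keeping the threshold as a parameter makes it easy to re-run the whole process with a different value while tuning.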
Test Cases
The aim is to minimize false positive matches, while also minimizing match misses.
How do you know?
The only good answer I have for this is to generate a set of test cases. And the best way to do that is to study the data, so you can find suitably cunning & devious cases to test.
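One possible shape for such test cases, as a rough sketch: a list of labelled pairs and a helper that reports false positives and misses for any matcher you plug in. The helper name evaluate and the naive matcher are hypothetical; the example pairs are the ones mentioned earlier in this answer.

```python
def evaluate(labelled_pairs, is_match):
    """Return (false_positives, misses) for a matcher over labelled pairs."""
    false_positives = [
        (a, b) for a, b, same in labelled_pairs if is_match(a, b) and not same
    ]
    misses = [
        (a, b) for a, b, same in labelled_pairs if same and not is_match(a, b)
    ]
    return false_positives, misses


# Each case: (name A, name B, do they refer to the same entity?)
CASES = [
    ("Smith & Sons, Ltd", "Smith and Sons Limited", True),
    ("Kicking and Screaming", "Kicking and Screaming", False),  # two different films
    ("The The", "The The", True),
]


def naive_matcher(a, b):
    # Deliberately simplistic: case-insensitive string equality.
    return a.lower() == b.lower()
```

Running evaluate(CASES, naive_matcher) reports the second pair as a false positive and the first as a miss, which is exactly the kind of feedback you want while tuning.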
Conclusion
This all sounds like a lot of work. How much of it you actually do, or how little - how rigorous or how cursory - is up to you. Depends on your context, of course.

Related

What is the purpose of Tags in Doc2Vec TaggedDocument?

Is it to aid in classification tasks? The docs and tutorials don't explain this; they seem to assume a level of understanding that I don't have. These SO answers get near it but do not say explicitly:
https://datascience.stackexchange.com/questions/10216/doc2vec-how-to-label-the-paragraphs-gensim
Multiple tags for single document in doc2vec. TaggedDocument
The 'tag' is just the key with which to look-up the learned document vector, after training is done.
The original 'Paragraph Vectors' research papers, on which Gensim's Doc2Vec is based, tended to just assume each document had one unique ID – perhaps, a string token just like any other word. (So, too, did a small patch to the original Google word2vec.c that was once shared, long ago, as a limited example of one mode of 'paragraph vectors'.)
In those original formulations, documents had just one unique ID – lookup key for their vector.
However, it was a fairly obvious/straightforward extension to allow these associated vectors to potentially map to other known shared labels, across many documents. (That is, not a unique vector per document, but a unique vector per label, which might appear on multiple texts.) And further, that multiple such range-of-text vectors might be relevant to a single text, that's known to deserve more-than-one label.
So the word 'tag' was used in the Gensim implementation, to convey that this is an association more general than either a unique-ID, or a known-label, though it might in some cases be either.
If you're just starting out, or trying to match early papers, just consider the 'tag' a single unique ID per document. Give every independent document its own unique name – whether it's something natural from your data source (like a unique article title or primary key), or a mere serial number, from '0' to the count of docs in your data.
Only if you're trying expert/experimental other approaches, after understanding the basic approach, would you want to either repeat a 'tag' across multiple documents, or use more than one 'tag' per document. Neither of those approaches is necessary, or typical, in the initial application of Doc2Vec.
(And if you start to re-use known tags in training, Doc2Vec is no longer a strictly 'unsupervised' machine-learning technique, but starts to behave more like a 'supervised' or 'semi-supervised' technique, where you're nudging the algorithm towards desired answers. That's sometimes useful, and appropriate, but starts to complicate estimates of how well your steps are working: you then have to use things like held-back test/validation data to get trustworthy estimates of your system's success.)
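To make the "one unique ID per document" advice concrete, here is a minimal sketch that only builds the tagged corpus. The namedtuple is a stand-in so the snippet runs without Gensim installed; I'm assuming it mirrors the words and tags fields of Gensim's own TaggedDocument, which you would import instead in real use. The helper name tag_corpus is hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as Gensim's TaggedDocument (assumption);
# with Gensim installed, import the real class from gensim.models.doc2vec.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_corpus(texts):
    """Give every document a single, unique tag: its serial number."""
    return [
        TaggedDocument(words=text.lower().split(), tags=[str(i)])
        for i, text in enumerate(texts)
    ]
```

After training, each tag ("0", "1", ...) is simply the key you use to look up that document's learned vector.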

How to check if two unstructured street address strings are the same?

I need to compare two unstructured addresses and be able to identify if they are the same (or similar enough).
Scenario
Address is supplied by the end user in plain text.
There is nothing to help the user write in a more identifiable manner (no autocomplete, nothing. Just an empty textbox).
"#102 Nice-Looking Street, Gotham City, NY" should match with "Nice Loking St., Gotham City, New York, apt 102".
Using a third-party service is not an option.
Search is not a problem. I already have the two strings. What I need is to check if they represent the same address, despite its differences on structure.
What I have found
I know we can use some Fuzzy logic for this kind of comparison, with some tolerance for misspelling, but...
There are some keywords (like, for instance, comparing "Street" to "St." or comparing "#102" to "apt 102", or "NY" to "New York") that are not supposed to penalize the degree of reliability.
Some words can be placed in a different order (like the apartment in the above example).
I do not want to reinvent the Wheel. This problem seems like a common concern in different contexts and I think there is an algorithm (with some slight modifications, maybe) that might be a fit for this scenario.
Thanks in advance
I've helped build some open source tools to do this.
Basically, the approach is to try to split an address into its constituent parts and then intelligently compare those parts.
Both parts of the problem are hard.
The first part is often called address parsing. Here's what we use: https://github.com/datamade/usaddress
The second part has many, many names but, let's call it fuzzy matching. Here's the library we made for that: https://github.com/datamade/dedupe
We also provided some facilities for using them together: http://dedupe.readthedocs.io/en/latest/Variable-definition.html#address-type
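To give a feel for the idea without installing anything, here is a much cruder standard-library sketch. It does not use usaddress or dedupe: instead of real address parsing it just expands a few abbreviations, sorts the tokens so that word order stops mattering, and compares the results fuzzily. The abbreviation table, the function names and any threshold you pick are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, tiny abbreviation table; note "st" could also mean "saint".
EXPANSIONS = {"st": "street", "ny": "new york"}


def canonical_tokens(address):
    """Lowercase, expand abbreviations, and return the tokens sorted."""
    text = address.lower().replace("#", " apt ")
    text = re.sub(r"[^a-z0-9]+", " ", text)
    tokens = []
    for token in text.split():
        tokens.extend(EXPANSIONS.get(token, token).split())
    return sorted(tokens)  # sorting removes the effect of word order


def address_similarity(a, b):
    """Fuzzy similarity in [0, 1] between two free-text addresses."""
    left = " ".join(canonical_tokens(a))
    right = " ".join(canonical_tokens(b))
    return SequenceMatcher(None, left, right).ratio()
```

On the example from the question, the two strings differ only by the "Loking" typo after canonicalization, so they score very close to 1. A real parser is still the better choice, because it knows which token is the street and which is the unit.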

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01], ["grow", 0.1], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is ok, better to be mistaken saying easy is tough than the other way
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate word popularity, which could indicate difficulty. However, both solutions give different results depending on the form of the entered words. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must but solves the ambiguity problem (and makes it easy to create dictionary links); however, it's not a simple task and its sense could be found arguable. Obviously fond should be taken instead of fonder, however investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a moment when modifying a basic word creates a new one, like -> likely) and words like doorkeeper shouldn't be cut into two.
Some ideas of what should be considered a basic word can be found here on Wikipedia, but maybe a simpler way of determining it would be the use of a dictionary. For instance, according to dictionary.reference.com, unbelievable is a basic word whereas fonder comes from fond, but then grow is not the same as growing.
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might be (probably is) a simpler and better way, or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put in practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (=affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kinds of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology, it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes and keep the derivational ones, since derivational affixes can change how common the word is considered to be. (Think about it this way: if I tell you a new word in English, you will always know how to make it plural or 3rd-person singular; however, you may not know some of the other words you can derive from it.) English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use Google's stemming engine just by running your word forms through Google search and getting out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural -s: "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past tense forms and participial forms. The regular ones are easy: -ed marks both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English; I may be forgetting a few that you could discover by picking up an introductory Linguistics book. Also there are going to be borderline cases, such as "un-" which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides Google you could use various freely-available text corpora. The Wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
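As a rough sketch of both steps together (strip inflectional suffixes, then count stems in a corpus), something like the following could be a starting point. Everything here is an assumption for illustration: the irregular list is tiny, the suffix rules are crude (they will mangle words such as "this" or "speed", and comparative -er is skipped because it cannot be told apart from derivational -er without part-of-speech information), and the log-scaled mapping from frequency to a 0..1 difficulty is just one possible choice.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of irregular forms.
IRREGULAR = {"fell": "fall", "fallen": "fall", "children": "child", "geese": "goose"}


def strip_inflection(word):
    """Crudely remove -ing, -ed and -s; leave derivational affixes alone."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3 and not word.endswith("ss"):
            return stem
    return word


def difficulty_by_stem(corpus_text):
    """Map each stem to a difficulty: 0 for the most frequent, 1 for stems seen once."""
    words = re.findall(r"[a-z]+", corpus_text.lower())
    counts = Counter(strip_inflection(w) for w in words)
    if not counts:
        return {}
    top = max(counts.values())
    if top == 1:
        return {stem: 1.0 for stem in counts}
    # Assumed mapping: rarer stems get scores closer to 1 (log scale).
    return {stem: 1 - math.log(n) / math.log(top) for stem, n in counts.items()}
```

With a real corpus the counts become meaningful; on a toy text the scores only show the direction of the scale.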
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, of which language analysis was a part. You need some database from which you can get them.
At the same time, please take note that the number of words people use in everyday language is not that big. You can always ask a user what the base form is of a word you have not seen before (unless this is your homework, which will be automatically checked).
Eventually, if you don't care about covering all words, you can create a simple database which would contain different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as actually the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note however, I'm no specialist, I'm simply trying to help :-)

Finding personal information in documents (hard problem)

I am tasked with trying to create an automated system that removes personal information from text documents.
Emails, phone numbers are relatively easy to remove. Names are not. The problem is hard because there are names in the documents that need to be kept (eg, references, celebrities, characters etc). The author name needs to be removed from the content (there may also be more than one author).
I have currently thought of the following:
Quite often personal names are located at the beginning of a document
Look at how frequently the name is used in the document (personal names tend to be written just once)
Search for words around the name to find patterns (mentions of university and so on...)
Any ideas? Anyone solved this problem already??
With current technology, doing what you are describing in a fully automated way with a low error rate is impossible.
It might be possible to come up with an approximate solution, but it would still make a lot of errors: either false positives or false negatives or some combination of the two.
If you are still really determined to try, I think your best approach would be Bayesian filtering (as used in spam filtering). The reason for this is that it is quite good at assigning probabilities based on relative positions and frequencies of words, and could also learn which names are more likely / less likely to be celebrities etc.
The area of machine learning that you would need to learn about to make an attempt at this would be natural language processing. There are a few different approaches that could be used: Bayesian networks (something better than a naive Bayes classifier), support vector machines, or neural nets would be areas to research. Whatever system you end up building would probably need to use an annotated corpus (labeled set of data) to learn where names should be. Even with a large corpus, whatever you build will not be 100% accurate, so you would probably be better off setting flags at the names for deletion instead of just deleting all of the words that might be names.
This is a common problem in basic cryptography courses (my first programming job).
If you generated a word histogram of your entire document corpus (each bin on the x-axis is a word, and its height on the y-axis is that word's frequency), words like "this", "the", "and" and so forth would be easy to identify because of their large y-values (frequency). Surnames should be at the far right of your histogram (very infrequent); given names towards the left, but not by much.
Does this technique definitively identify the names in each document? No, but it could be used to substantially constrain your search, by eliminating all words whose frequency is larger than X. Likewise, there should be other attributes that constrain your search, such as the fact that author names appear only once, on the documents they authored, and not on any other documents.
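A minimal sketch of that filtering step, assuming the corpus fits in memory as a list of strings. The function names and the max_count parameter (the "X" above) are hypothetical. It keeps, per document, only the words that are rare across the corpus and appear in no other document; as said, this narrows the search and does not identify names by itself.

```python
import re
from collections import Counter


def tokens(document):
    return re.findall(r"[a-z]+", document.lower())


def name_candidates(documents, max_count=1):
    """Per document, return the rare words that occur in that document only."""
    corpus_counts = Counter(w for doc in documents for w in tokens(doc))
    # Number of distinct documents each word appears in.
    doc_counts = Counter(w for doc in documents for w in set(tokens(doc)))
    return [
        {
            w
            for w in set(tokens(doc))
            if corpus_counts[w] <= max_count and doc_counts[w] == 1
        }
        for doc in documents
    ]
```

Ordinary but rare words will still show up among the candidates, so a second stage (position in the document, surrounding words, a classifier) is needed on top.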

Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided to approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.
If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that are before and after them, but now I really cannot figure out what to do next. The information relating to the 2 and 3 word phrases is present, but how do I extract this data?
Before anything, try to preserve the info about "boundaries" which comes in the input text.
(if such info has not already been lost; your question implies that maybe the tokenization has already been done)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries (such as punctuation, particularly periods, and also multiple LF/CR separation) and use these. Also words like "the" can often be used as boundaries. Such expression boundaries are typically "negative", in the sense that they separate two token instances which are sure to not be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful to filter out some of the n-grams (see next paragraph). Also word sequences such as "for example" or "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging on using "priors" which I discuss later).
Without using external data (other than the input text), you can have relative success with this by running statistics on the text's digrams and trigrams (sequences of 2 and 3 consecutive words). Then [most of] the sequences with a significant (*) number of instances will likely be the type of "expression/phrases" you are looking for.
This somewhat crude method will yield a few false positives, but on the whole may be workable. Having filtered the n-grams known to cross "boundaries" as hinted in the first paragraph may help significantly, because in natural languages sentence endings and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of tokens that may appear to be statistically well represented, but which are typically not semantically related.
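Here is a small sketch of that crude method: split the text at "negative" boundaries first, then count 2- and 3-word sequences inside each segment only. The function name, the choice of boundary characters and the min_count threshold are all assumptions to tune against your own data.

```python
import re
from collections import Counter


def frequent_ngrams(text, sizes=(2, 3), min_count=2):
    """Count n-grams that do not cross punctuation or line-break boundaries."""
    counts = Counter()
    # Punctuation and line breaks act as negative boundaries.
    for segment in re.split(r'[.!?;:,"()\n\r]+', text.lower()):
        words = re.findall(r"[a-z0-9']+", segment)
        for n in sizes:
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep only sequences seen often enough to be interesting.
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

Because segments are processed separately, a pair like "machine, learning" split by a comma is not counted as the phrase "machine learning".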
Better methods (possibly more expensive, processing-wise, and design/investment-wise), will make the use of extra "priors" relevant to the domain and/or national languages of the input text.
POS (Part-Of-Speech) tagging is quite useful in several ways: it provides additional, more objective expression boundaries, and also "noise" word classes. For example, all articles, even when used in the context of entities, are typically of little use in tag clouds such as the one the OP wants to produce.
Dictionaries, lexicons and the like can be quite useful too. In particular, those which identify "entities" (aka instances in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them, it is also possible to normalize them (the many different expressions which can be used for, say, "Senator T. Kennedy"), hence eliminate duplicates, but also increase the frequency of the underlying entities.
If the corpus is structured as a document collection, it may be useful to use various tricks related to the TF (Term Frequency) and IDF (Inverse Document Frequency).
[Sorry, gotta go, for now (plus would like more detail from your specific goals etc.). I'll try and provide more detail and pointers later]
[BTW, I want to plug here Jonathan Feinberg and Dervin Thunk responses from this post, as they provide excellent pointers, in terms of methods and tools for the kind of task at hand. In particular, NLTK and Python-at-large provide an excellent framework for experimenting]
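For the TF/IDF point, a bare-bones sketch may help, treating each of the entries as one document. It uses the plain textbook weighting (relative term frequency times log of N over document frequency); the function name is hypothetical, and real toolkits usually add smoothing that is left out here.

```python
import math
import re
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf score} dict per document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # Number of documents each word appears in.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    n_docs = len(documents)
    scores = []
    for words in tokenized:
        term_freq = Counter(words)
        scores.append({
            w: (count / len(words)) * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        })
    return scores
```

Words that occur in every entry (is, the, for, ...) get a score of zero, so they drop out of the tag list without needing a hand-made stop-word list.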
I'd start with a wonderful chapter, by Peter Norvig, in the O'Reilly book Beautiful Data. He provides the ngram data you'll need, along with beautiful Python code (which may solve your problems as-is, or with some modification) on his personal web site.
It sounds like you're looking for collocation extraction. Manning and Schütze devote a chapter to the topic, explaining and evaluating the 'proposed formulas' mentioned in the Wikipedia article I linked to.
I can't fit the whole chapter into this response; hopefully some of their links will help. (NSP sounds particularly apposite.) nltk has a collocations module too, not mentioned by Manning and Schütze since their book predates it.
The other responses posted so far deal with statistical language processing and n-grams more generally; collocations are a specific subtopic.
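One of the simpler association measures used for collocations is pointwise mutual information (PMI), which compares how often two words occur together with how often you would expect that by chance. The sketch below is only an illustration with hypothetical names; it is not the nltk collocations API, and PMI is known to over-reward very rare pairs, hence the min_count filter.

```python
import math
import re
from collections import Counter


def pmi_scores(text, min_count=2):
    """Return {(word1, word2): PMI} for adjacent pairs seen at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total_words = len(words)
    total_bigrams = max(len(words) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_xy = count / total_bigrams
        p_x = unigrams[x] / total_words
        p_y = unigrams[y] / total_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Pairs whose words mostly occur together (such as "new york") score higher than pairs that involve a common word that appears everywhere.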
Do a matrix for words. Then if there are two consecutive words then add one to that appropriate cell.
For example you have this sentence.
mat['for']['example'] ++;
mat['example']['you'] ++;
mat['you']['have'] ++;
mat['have']['this'] ++;
mat['this']['sentence'] ++;
This will give you values for two consecutive words.
You can do this for three words also. Beware this requires O(n^3) memory.
You can also use a heap for storing the data like:
heap['for example']++;
heap['example you']++;
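In Python, the same counting could be sketched with a Counter keyed by word tuples. This is my own translation of the idea, not the answerer's code; it stores only the pairs and triples that actually occur, so memory grows with the number of distinct sequences seen rather than with a full matrix.

```python
from collections import Counter


def pair_counts(sentence):
    """Count each pair of consecutive words."""
    words = sentence.lower().split()
    return Counter(zip(words, words[1:]))


def triple_counts(sentence):
    """Count each run of three consecutive words."""
    words = sentence.lower().split()
    return Counter(zip(words, words[1:], words[2:]))
```

For the example sentence, pair_counts gives a count of one for each of the five consecutive pairs listed above.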
One way would be to build yourself an automaton, most likely a Nondeterministic Finite Automaton (NFA).
NFA
Another, simpler way would be to create a file that contains the words and/or word groups that you want to ignore, find, compare, etc., load them into memory when the program starts, and then compare the file you are parsing against the words/word groups contained in that file.

Resources