Keywords inside a spaceless string (using NLP?) - algorithm

I am trying to find the relevant set of keywords inside a spaceless string. An example would be:
freelancemarketingconsultant
By reading it, you can distinguish the following keywords:
freelance marketing consultant
You can see the task is not trivial, as a common confusion would be to distinguish between "free" and "lance".
Are there known (potentially NLP) techniques to extract keywords from such strings?

You can use the Viterbi algorithm to find the most likely (best) way to split the string. There's a library called wordsegment that does this in Python, and you can read more about the technique at Peter Norvig's page.
There's also a recent research project called Hashtag Master which uses neural methods to tokenize hashtags.
This isn't a common problem in English, but it's standard in languages were spaces don't split words, like Japanese. There are a variety of methods, and research continues, but Viterbi based methods generally have the best balance of speed and accuracy.

Related

Is there an algorithm to compound multiple sentences into a more complex one?

I'm looking to do the opposite to what is described here: Tools for text simplification (Java)
Finding meaningful sub-sentences from a sentence
That is, take two simple sentences and combine them as a compound sentence.
Are there any algorithms to do this?
I'm particularly sure that you will not be able to compound sentences like in the example from the linked question (John played golf. John was the CEO of a company. -> John, who was the CEO of a company, played golf), because it requires such language understanding that is too far from now.
So, it seems that the best option is to bluntly replace dot by comma and concatenate simple sentences (if you have to choose sentences to be compounded from text, you can try simple heuristics like approximating semantic similarity by number of common words or tools like those based on WordNet). I guess, in most cases human readers can infer missed conjunction from the context.
Of course, you could develop more sophisticated solutions, but it requires either narrow domain (e.g. all sentences share very similar structure), or tools that can determine relations between sentences, e.g. relationship of cause and effect. I'm not aware of such tools and doubt in their existence, because this level (sentences and phrases) are much more diverse and sparse than the level of words and collocations.

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01"], ["grow", 0.1"], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is ok, better to be mistaken saying easy is though than the other way
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than it's basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate words popularity that could indicate its difficulty. However, both solutions give different results depending on the form of entered words. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must but solves ambiguity problem (and makes it easy to create dictionary links), however it's not a simple task and its sense could me found arguable. Obviously fond should be taken instead of fonder, however investigating believe instead of unbelievable seems to be an overkill ([edit] it might be not the best example, but there is a moment when modifying basic word we create a new one like -> likely) and words like doorkeeper shouldn't be cut into two.
Some ideas of what should be consider basic word can be found here on Wikipedia but maybe a simpler way of determining it would be a use of a dictionary. For instance according to dictionary.reference.com unbelievable is a basic word whereas fonder comes from fond but then grow is not the same as growing
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might (probably is a simpler and better) way or ready to use algorithms. I would appreciate any solution that deals with this problem and is easy to put in practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (=affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kind of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver")
The stem, is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes, and keep the derivational ones, since derivational affixes can change how common the word is considered to be. Think about it this way: if I tell you a new word in English, you will always know how to make it plural, 3rd-person singular, however, you may not know some of the other words you can derive from this). English being inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use the Google's stemming engine just by running your word forms through google search and getting out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural `-s': "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past tense forms and participial forms. The regular ones are easy: the past tense has -ed for past tense and past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few of irregular ones (fall/fell/fallen, dive/dove/dived?, etc). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English: I may be forgetting a few that you could discover by picking up an introductory Linguistics books. Also there are going to be borderline cases, such as "un-" which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides google you could various freely-available text corpora. The wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, of which language analysis was part of. You need some database, from which you can get them.
At the same time, please take note that the amount of words people use in everyday language is not that big. You can always ask a user what is the base form of a world you have not seen before. (unless this is your homework, which will be automatically checked)
Eventually, if you don't care about covering all words, you can create simple database, which would contain different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as actually, the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note however, i'm no specialist, i'm simply trying to help :-)

Word splitting statistical approach

I want to solve word splitting problem (parse words from long string with no spaces).
For examle we want extract words from somelongword to [some, long, word].
We can achieve this by some dynamic approach with dictionary, but another issue we encounter is parsing ambiguity. I.e. orcore => or core or orc ore (We don't take into account phrase meaning or part of speech). So i think about usage of some statistical or ML approach.
I found that Naive Bayes and Viterbi algorithm with train set can be used for solving this. Can you point me some information about application of these algorithms to word splitting problem?
UPD: I've implemented this method on Clojure, using some advices from Peter Norvig's code
I think that slideshow by Peter Norvig and Sebastian Thurn is a good point to start. It presents real-world work made by google.
This problem is entirely analagous to word segmentation in many Asian languages that don't explicitly encode word boundaries (e.g. Chinese, Thai). If you want background on approaches to the problem, I'd recommend you look at Google Scholar for current Chinese Word Segmentation approaches.
You might start by looking at some older approaches:
Sproat, Richard and Thomas Emerson. 2003. The first international Chinese word segmentation bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf)
If you want a ready-made solution, I'd recommend LingPipe's tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I've used it on unsegmented English text with good results. I trained the underlying character language model on a couple million words of newswire text, but I suspect that for this task you'll get reasonable performance using any corpus of relatively normal English text.
They used a spelling-correction system to recommend candidate 'corrections' (where the candidate corrections are identical to the input but with spaces inserted). Their spelling corrector is based on Levenshtein edit distance; they just disallow substitution and transposition, and restrict allowable insertions to only a single space.

Identifying names and places in literature

I have been playing around with Markov Chain Text Generation and Naive Bayes classifiers. I am wondering if there is a way to apply either of those concepts towards identifying certain types of words in a novel. E.G. Last names or place names
I can look through my markov chain and I see that certain words tend to relate the same way to certain other types of words. E.G. Mr. frequently comes before a last name, 'went to' tends to come before a place name and last names tend to follow first names.
Is there a good way that I can write a program that will take a list of example names and then go through a large set of books and identify all words like those names with decent accuracy? Is English regular enough for this to work? Has this been done before? Would this method have a name?
Thanks,
Andrew
In fact, there are only few patterns for names, e.g.:
{FirstName}{Space}{Token with big first char}
{BigCharacter}{Dot}{Space}{Token with big first char}
{"Mr" | "Ms"}{Dot}{Space}{Token with big first char}
and several more. All you need is a dictionary of first names and simple engine to catch such patterns. There's a good framework for this (and many other things) - GATE. It has very large dictionary of first names and special pattern language (JAPE) for manipulating token sequences. You can use it directly or just get the dictionary and implement the logic by yourself.

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an english text of any size.
http://tagcrowd.com/ is quite interesting but the algorithm seems very basic (just word counting)
Is there any other existing algorithm to do this?
There are existing web services for this. Two Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
Going one step further, you can use Wordnet to find synonyms and replace words by their synonym, if the frequency of the synonym is higher.
Manning & Schütze contains a lot more introduction on text categorization.
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do the semantic analysis. Unfortunately (and obviously) it is the least accurate one. It can be improved by using special dictionaries (like for synonims or forms of a word), "stop-lists" with common words, other texts (to find those "common" words and exclude them)...
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are mereley heuristics for semantic analysis, not the strict algorithms of achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
You can use this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that you can select the most representative word of every topic as a tag

Resources