Multi language full text: Which stemming [Snowball] language should be used?

Which stemming language should I use if I want to support full text
search across all languages? As far as I know, the index needs to be
created using a specific stemming language to support search in that
language, but this is not possible for me, as my search program may
contain different languages.

Stemming is not a concept applicable to all languages. It is not, for example, applicable in Chinese. But to languages of the Indo-European group (and most of the stemmers on Snowball site are for Indo-European languages), a common pattern of word structure does emerge. Assuming words are written left to right, the stem, or root of a word is on the left, and zero or more suffixes may be added on the right. If the root is modified by this process it will normally be at its right hand end. And also prefixes may be added on the left.
#Resource: Snowball Website


How should I index and search on hyphenated words in English?

I'm using Elasticsearch to search over a fairly broad range of documents, and I'm having trouble finding best practices for dealing with hyphenated words.
In my data, words frequently appear either hyphenated or as compound words, e.g. pre-eclampsia and preeclampsia. At the moment, searching for one won't find the other (the standard tokenizer indexes the hyphenated version as pre eclampsia).
This specific case could easily be fixed by stripping hyphens in a character filter. But often I do want to tokenize on hyphens: searches for jean claude and happy go lucky should match jean-claude and happy-go-lucky.
One approach to solving this is in the application layer, by essentially transforming any query for hyphenated-word into hyphenated-word OR hyphenatedword. But is there any way of dealing with all these use cases within the search engine, e.g. with some analyzer configuration? (Assume that my data is large and varied enough that I can't manually create exhaustive synonym files.)
You can use a compound word token filter - hyphenation_decompounder should probably work well enough.
It seems like your index contains many domain-specific words that aren't necessarily in a regular English dictionary, so I'd spend some time creating my own dictionary first, with the words that are important to your domain. This can be based on domain-specific literature, taxonomies, etc. The dictionary_decompounder is suitable for that kind of thing.
This assumes that your question was relevant to Elasticsearch and not Solr, where the filter is named DictionaryCompoundWordTokenFilter instead.
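As a rough sketch of what that could look like, here are index settings written as a Python dict that could be sent as the body of an index-creation request. The names domain_decompounder and domain_text and the two-word list are made up for illustration; a real word list would come from your own domain dictionary.

```python
# Hypothetical analysis settings for Elasticsearch. The filter and analyzer
# names and the word list are illustrative assumptions, not from the question.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "domain_decompounder": {
                    "type": "dictionary_decompounder",
                    # Domain dictionary: subwords to look for inside compounds.
                    "word_list": ["pre", "eclampsia"],
                }
            },
            "analyzer": {
                "domain_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "domain_decompounder"],
                }
            },
        }
    }
}
```

The idea is that the standard tokenizer already splits pre-eclampsia into pre and eclampsia, while the decompounder should add the same subwords next to the original token for preeclampsia, so both spellings share terms in the index.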

Partial and Full Phrase Match

Say I have the sentence: "John likes to take his pet lamb in his Lamborghini Huracan more than in his Lamborghini Gallardo" and I have a dictionary containing "Lamborghini", "Lamborghini Gallardo" and "Lamborghini Huracan". What's a good way of extracting the bold terms, achieving the terms "Lamborghini Gallardo" and "Lamborghini Huracan" as phrase matches, and other partial matches "Lamborghini" and "lamb"? Giving preference to the phrase matches over individual keywords.
Elasticsearch provides exact term match, match phrase, and partial matching. Exact term would obviously not work here, and neither would match phrase, since the whole sentence is considered as the phrase in this case. I believe partial match would be appropriate if I only had the keywords of interest in the sentence. Going through previous SO threads, I found proximity for relevance, which seems relevant, although I'm not sure if this is the 'best option' since it requires setting a threshold. Or are there simpler / better alternatives than Elasticsearch (which seems more suited to full text search than to simple keyword matching against a database)?
It sounds like you'd like to perform keyphrase extraction from your documents using a controlled vocabulary (your dictionary of industry terms and phrases).
[Italicized terms above to help you find related answers on SO and Google]
This level of analysis takes you a bit out of the search stack into the natural-language processing stack. Since NLP tends to be resource-intensive, it tends to take place offline, or in the case of search-applications, at index-time.
To implement this, you'd:
Integrate a keyphrase extraction tool into your search-indexing code to generate a list of recognized key phrases for each document.
Index those key phrases as shingles into a new Elasticsearch field.
Include this shingled keyphrase field in the list of fields searched at query-time — most likely with a score boost.
For a quick win tool to help you with controlled keyphrase extraction, check out KEA (written in java).
(You could also probably write your own, but if you're also hoping to extract uncontrolled key phrases (not in dictionary) as well, a trained extractor will serve you better. More tools here.)
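If you do write your own for the controlled-vocabulary case, a minimal sketch could be a greedy longest-match lookup over the tokens, which is what gives phrase matches preference over single keywords. This is not how KEA works; the function name extract_key_phrases and the simple \w+ tokenization are assumptions for illustration.

```python
import re


def extract_key_phrases(text, vocabulary):
    """Return vocabulary phrases found in text, preferring the longest match."""
    phrases = {tuple(p.lower().split()) for p in vocabulary}
    max_len = max((len(p) for p in phrases), default=0)
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                found.append(" ".join(candidate))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here
    return found


sentence = ("John likes to take his pet lamb in his Lamborghini Huracan "
            "more than in his Lamborghini Gallardo")
vocabulary = ["Lamborghini", "Lamborghini Gallardo", "Lamborghini Huracan"]
key_phrases = extract_key_phrases(sentence, vocabulary)
```

The resulting list is what you would index into the separate keyphrase field. Note that "lamb" is not reported here because it is not in the vocabulary; sub-token partial matches would need extra handling.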

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01], ["grow", 0.1], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is ok, better to be mistaken saying easy is tough than the other way
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its basic form (because as smart beings we can create unhappily if we know happy), unless it creates a new word (unlikely is not the same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate word popularity, which could indicate difficulty. However, both solutions give different results depending on the form of the entered word. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must, but it solves the ambiguity problem (and makes it easy to create dictionary links); however, it's not a simple task and its sense could be found arguable. Obviously fond should be taken instead of fonder; however, investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a moment when, by modifying a basic word, we create a new one, like -> likely), and words like doorkeeper shouldn't be cut into two.
Some ideas of what should be considered a basic word can be found here on Wikipedia, but maybe a simpler way of determining it would be to use a dictionary. For instance according to unbelievable is a basic word whereas fonder comes from fond but then grow is not the same as growing
Idea of a solution:
It seems to me that the best way to handle the problem would be using a dictionary to find basic words, apply some of the Wikipedia rules and then use Wordcount (maybe combined with number of Google searches) to estimate difficulty.
Still, there might be (and probably is) a simpler and better way, or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put in practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (=affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". These kinds of agreement suffixes don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology, it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes and keep the derivational ones, since derivational affixes can change how common the word is considered to be. (Think about it this way: if I tell you a new word in English, you will always know how to make it plural or 3rd-person singular; however, you may not know some of the other words you can derive from it.) English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use Google's stemming engine just by running your word forms through Google search and getting out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural `-s': "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past tense forms and participial forms. The regular ones are easy: -ed marks both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English; I may be forgetting a few that you could discover by picking up an introductory linguistics book. Also, there are going to be borderline cases, such as "un-", which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
As far as "grading" how common various stems are, besides Google you could use various freely available text corpora. The Wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
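To make this concrete, here is a minimal sketch under some loud assumptions: base_forms is a hypothetical set of known dictionary headwords, the irregular table is a tiny hand-made sample, and the log-scaled score is just one arbitrary way to map corpus counts onto the 0 to 1 range from the question. Only the inflectional suffixes listed above are stripped, and only when the result is a known base form, so a derived word such as "driver" survives as long as it is itself in base_forms.

```python
import math
import re
from collections import Counter

# Tiny illustrative sample of irregular forms; a real list would be longer.
IRREGULAR = {"children": "child", "geese": "goose", "oxen": "ox",
             "fell": "fall", "fallen": "fall", "better": "good", "best": "good"}
SUFFIXES = ("ing", "est", "ed", "er", "s")  # inflectional suffixes only


def strip_inflection(word, base_forms):
    """Strip one inflectional suffix if that yields a known base form."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in base_forms:
        return word  # already a headword (e.g. the derived noun "driver")
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]
            candidates = [stem, stem + "e"]      # walk-ing, mak(e)-ing
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])     # big(g)-er
            for candidate in candidates:
                if candidate in base_forms:
                    return candidate
    return word  # unknown word: leave it alone


def difficulty_scores(sentence, corpus_text, base_forms):
    """Score each stem in sentence: 0.0 = most common in corpus, 1.0 = unseen."""
    def stems(text):
        return [strip_inflection(w, base_forms)
                for w in re.findall(r"[a-z]+", text.lower())]

    counts = Counter(stems(corpus_text))
    scale = math.log(1 + max(counts.values(), default=1))
    return [[stem, round(1 - math.log(1 + counts[stem]) / scale, 2)]
            for stem in stems(sentence)]
```

With a large enough corpus, calling difficulty_scores("Absence makes the heart grow fonder.", corpus_text, base_forms) gives a list in the shape the question asks for, with unseen stems erring on the "difficult" side.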
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my Machine Learning textbook, of which language analysis was a part. You need some database from which you can get them.
At the same time, please note that the number of words people use in everyday language is not that big. You can always ask a user what the base form of a word you have not seen before is. (Unless this is your homework, which will be checked automatically.)
Eventually, if you don't care about covering all words, you can create simple database, which would contain different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation, as actually, the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note however, i'm no specialist, i'm simply trying to help :-)

Does ICU handle the collation of a list of strings of varying languages?

My application may have strings comprised of different alphabets / languages in a single list. I can't seem to find any information on what the correct method for sorting these should be or any indication that ICU supports this functionality.
Example List:
There is no sensible way to do this well. There is no universal sort for all languages, even within the same alphabet. Different languages (cultures, basically) have come up with different collation rules for how words should be sorted.
The only way to do this consistently at all, I think, is to use plain old codepoint sorting (e.g. in Java, String.compareTo).
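For comparison, Python's built-in string comparison is also by code point, so a minimal sketch looks like this. The sample words are made up, and the NFC normalization step is an addition of mine so that accented letters compare the same way regardless of how they were typed.

```python
import unicodedata

words = ["zebra", "Apple", "éclair", "apple", "Ωmega"]

# Normalize first so "é" is always one code point, then sort by code point.
by_code_point = sorted(unicodedata.normalize("NFC", w) for w in words)
```

This puts "Apple" before "apple", and both "éclair" and "Ωmega" after "zebra": consistent, but not an order any particular reader would expect.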
You could come up with some heuristics, depending on what your data represents. You could group the strings based on guesses about the alphabet and language, and then use locale-specific sorting for each group. But you'd have to do this the hard way (code it yourself), I think, because you would guess differently depending on the terms (e.g. is 'mar' the English verb or the Spanish noun?). It's conceivable that you would end up with a worse result than the naive Unicode numerical sort, in terms of unpredictable "errors".
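A very rough sketch of the grouping half of that idea follows. It guesses the script from the Unicode character name of the first letter, which is a crude heuristic of my own choosing, and it uses a plain code point sort inside each group as a stand-in where a locale-specific collator would go. It does nothing about the language-guessing problem (the 'mar' case).

```python
import unicodedata
from collections import defaultdict


def guess_script(text):
    """Guess a script label from the Unicode name of the first letter."""
    for ch in text:
        if ch.isalpha():
            # e.g. "GREEK CAPITAL LETTER OMEGA" -> "GREEK"
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def group_by_script(strings):
    """Bucket strings by guessed script, then sort each bucket."""
    groups = defaultdict(list)
    for s in strings:
        groups[guess_script(s)].append(s)
    # Placeholder: code point order; swap in a per-locale collator here.
    return {script: sorted(items) for script, items in sorted(groups.items())}
```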
As with anything else, it depends on how much you can afford to put into the solution, and what kind of performance you need.
This suggestion is not the answer you're looking for: if there's any way to identify the locale when initially storing the strings, you should do so, and record it as part of the string's metadata. Then you won't have this problem.
With all the caveats above, here is one "standard universal multilingual sorting": the Unicode Collation Algorithm (UCA), which is NOT the codepoint order. From a cursory glance at this page, ICU seems to handle the mixture of UCA and local preference.
As mentioned by #Zac there is no universal sort. A code point sort will be consistent, but may not be what the user expects.
So you should probably use the preferred sort order for the user's selected locale. Any code points not defined in that sort order will be grouped together.
You could transliterate into your 'target' language (all in one script) and then sort. But languages have conflicting rules for sorting.

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an English text of any size. is quite interesting but the algorithm seems very basic (just word counting).
Is there any other existing algorithm to do this?
There are existing web services for this. Three examples:
Yahoo's Term Extraction API
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
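A minimal sketch of tf.idf-based tag selection, assuming a small in-memory document collection and a naive regex tokenizer (both are illustrative choices, as are the function names):

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def top_tags(documents, doc_index, k=10):
    """Return up to k words of one document, ranked by tf * idf."""
    tokenized = [tokenize(d) for d in documents]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    tokens = tokenized[doc_index]
    counts = Counter(tokens)
    n_docs = len(documents)
    scores = {word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
              for word, count in counts.items()}
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Words that occur in every document score 0 and are dropped.
    return [w for w in ranked if scores[w] > 0][:k]
```

Words that appear in every document of the collection get an idf of zero, so very common words drop out without needing an explicit stop list.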
Going one step further, you can use Wordnet to find synonyms and replace words by their synonym, if the frequency of the synonym is higher.
Manning & Schütze contains a lot more introduction on text categorization.
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously), it is also the least accurate one. It can be improved by using special dictionaries (for synonyms or for the forms of a word), "stop-lists" of common words, or other texts (to find those "common" words and exclude them)...
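As a small illustration of the stop-list variant, here is a sketch that counts words and skips a hand-made stop list. The stop list shown is a tiny made-up sample; a real one would be much longer.

```python
import re
from collections import Counter

# Tiny illustrative stop list, not a complete one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}


def frequent_words(text, limit=10, stop_words=STOP_WORDS):
    """Return the most frequent non-stop words as (word, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(limit)
```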
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms for achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
You can use this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that you can select the most representative word of every topic as a tag
