Elasticsearch subset filter - elasticsearch

I have a dataset about books, each of which can be in one or more languages. Every user is registered as having one or more languages.
When a user searches for books, I'd like to return only those books where they understand all of its languages.
For example, the following two books are in the system:
Book A: English, French, German
Book B: English, Greek
If John is registered as knowing English, German, French, and Italian, then his query results should never include Book B.
My system is currently written using Apache Solr, where I ended up writing a plugin to perform a subset operation (where a record matches if the languages of the record are a subset of the languages of the user, where the user's languages are declared in the query).
However, I'd like to transition to an Elasticsearch backend. This particular subsetting behavior, however, doesn't seem to be part of the core filter package. Am I missing something, or should I look at writing a similar plugin / custom filter?

This can be done using a script filter , you can pass it a comma separated list of strings as a param and use for loop to ensure each component is contained , if even one is not use break and return false. if all present loop exits and it returns a true.
I'm not sure how efficient this is, but theoretically this can be done on elasticsearch. Ideally apply an optimized filter to narrow down the set of books and then run this on those subsets look at https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets and docs on post_filters, the efficiency should be ideally tested over a bunch of queries as this filter will preform better once its result begins to be cached

Another possible answer to this is to invert the problem on its head.This data has certain characteristics. Assuming sufficient scale and real world practicalities the basic idea is that the cardinality of the language field is extremely low wrt books, users and authors (you could further improve this by using language roots as a field eg Latin- for english, italian and proto languages http://en.wikipedia.org/wiki/List_of_proto-languages at index time) Frequently users tend to know languages from the same family so you can exploit this fact to your benefit.
Then the user query would be essentially be the difference of the sets of all present and the one he knows. These can easily be modeled as a bunch of filters using the execution:bool flag (extremely optimized bitsets internally) to cache and combine them. Make sure you are wise about execution order of filters have a look at https://www.elastic.co/blog/all-about-elasticsearch-filter-bitsets

Related

How to index mixed language contents on Elasticsearch?

How to index mixed language contents in Elasticsearch. Let's say that we have a system where people submit contents from various parts of the world. Countries ranges from US, Canada, Europe, Japan, Korea, India, China, Kenya, Arabs, Russia to all other parts of the world.
Contents can be in any language that we can't know beforehand and can even be in mixed language. We don't want to guess the language of the contents and create multiple language specific indexes for each of the inputted language, we believe this is unmanageable.
We need an easy solution to index those contents efficiently in Elasticsearch with full text search capability as well as fuzzy string searching. Can anyone help in this regard?
What is the target you want to achieve? Do you want to have hits only in the language used at query time? Or would you also accept hits in any other language?
One approach would be to run all of elasticsearch's different language analyzers on the input and store the result in separate fields, for instance suffixed by the language of the current analyzer.
Then, at query time, you would have to search in all of these fields if you have no method to guess the most relevant ones.
However, this is likely to explode since you create a multitude of unused duplicates. This is IMHO also less elegant than having separate indices.
I would strongly recommend to evaluate if you really do not know the number of languages you will see during production. Having a distinct index per language would give you much more control over the input/output and enable you to fine tune your engine to the actual use case.
Alternatively, you may start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case).
You will not have language specific stemming but at least token streams for most languages.

a simple filtering language that can be embedded in ruby?

I have a ruby project where part of the operation is to select entities given user-specified constraints. So far, I've been hacking my own filter language, using regular expressions and specifying inclusion/exclusion based on the fields in the entities.
If you are interested in my current approach, here's an example: For instance, given this list of entities:
[{"type":"dog", "name":"joe"}, {"type":"dog", "name":"fuzz"}, {"type":"cat", "name":"meow"}]
A user could specify a filter like so:
{"filter":{
"type":{"included":["dog"] },
"name":{"excluded":["^f.*"] }
}}
Would match all dogs but exclude fuzz.
This is sort of working now. However, I am starting to require more sophisticated selection parameters. I am thinking that rather than continuing to hack on my own filter language, there might be a more general-purpose filter language I can just embed in my application? For instance, is there a parser that can in-app filter using a SQL where clause? Or are there some other general, simple filter languages that I'm not aware of? I would especially like to move away from regexps since I want to do range querying on numbers (like is entity["size"] < 50 ?)
It is a little bit of an extrapolation, but I think you may be looking for a search engine, or at least enough of one that you may as well use one just for the query language.
If so you might want to look at elasticsearch which does have Ruby client bindings, and could be a good fit for what you are trying to do. Especially if you want or need to express the data you want to search as JSON for use by client code, as that format is natively supported by the search engine.
The query language is quite expressive, and there are a variety of built-in and plugin tools available to explore and use it.
in the end, i ended up implementing a ruby dsl. it's easy, fun, and powerful.

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love gmail's new feature, that ask you if you want to include X in a mail, if you type Y's name and that you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags "related-ness" in an object or database ... thoughts ?
It all boils down to create associations between certain characteristics of your posts and certain tags, and then - when you press the "publish" button - to analyse the new post and propose all tags matched with your post characteristics.
This can be done in several ways from a "totally hard-coded" association to some sort of "learning AI"... and everything in-between.
Hard-coded solutions
This are the simplest algorithms to implement. You should first decide what characteristics of your post are relevant for tagging (e.g.: it's length if you tag them "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc...). The most obvious is however to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {'code-development' : ['programming',
'language', 'python', 'function',
'object', 'method'],
'family' : ['Theresa', 'kids',
'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the list (the code between [ and ] ) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other word to put a number that indicates the probability a given tag is the right one. For example: if your post would contain the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming" the program should indicate family as the most likely tag to use, as there are many more words hinting.
Learning AI's
One of the obvious limitations of the above method is that - say one day you pick up java beside python - you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those who refine their outcome the more you use them (so they indeed... learn!). Some algorithm requires initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common AI's are: the Naive Bayes Classifier and some flavour of Neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
HTH!
You should have a look at this post :
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using Collaborative filtering. You might want to give that a look see.
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun , or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。
For Data, dig Google's Japanese IME (Mozc) data files here.
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
And you may also try Mecab, it can use IPA dictionary and can convert kanjis to katakana for most of the words
https://taku910.github.io/mecab/
and there is ruby bindings for that too.
https://taku910.github.io/mecab/bindings.html
and here is somebody tested, ruby with mecab with tagger -Oyomi
http://hirai2.blog129.fc2.com/blog-entry-4.html
just a quick followup to explain the eventual actual solution we used. Thanks to all who recommended mecab--this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely conveninent new Ruby 1.9.2 version of CSV, combine the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our rails app with its built-in values.
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is good idea.
Then, I'll introduce another method.
If your app is written in Microsoft VBA, you can call "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by its pronunciation is not common. Most Japanese are used to prefectures sorted by 「都道府県コード」.
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" or "ISO-3166-2 JP".
see (Wikipedia Japanese) :
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89

Searching algorithmics: Parsing and processing a request OOP style

Say you were to create a search engine that can accept a query statement under the form of a String. The statement can be used to retrieve different types of objects with a given set of characteristics and possibly linked to other objects. In plain english or pseudo-code using an OOP approach, how would you go about parsing and processing statements as follows to get the series of desired objects ?
get fruit with colour green
get variety of apples, pears from Andy
get strawberry with colour "deep red" and origin not Spain
get total of sales of melons between 2010-10-10 and 2010-12-30
get last deliverydate of bananas from "Pete" and state not sold
Hope the question is clear. If not I'll be more than happy to reformulate.
P.S: This isn't homework ;)
Your problem is well suited to a document-oriented store such as Lucene. For example you can design a schema such as
Type
Variety
Color
Origin
DateSold
etc
:
Then you can write a Lucene query such as Type:Fruit AND Color:Green. You can also build nested queries such as (Fruit:Straberry AND Color:Deep Red) AND NOT Origin:Spain.
Apache Lucene is a Java library with portts available for most major languages. Apache Solr is a full-fledged search server built using Lucene lib and easily integrable into your platform-of-choice because it has a RESTful API.
BTW Solr has something called faceting which lets the user filter results using each of the criteria above. So user types fruit into search box and then gets results back.
Type:
- Fruit (109)
- Nut (99)
Origin:
- Spain(32)
- France(39)
Color:
- Red (22)
- Deep Red(45)
Clicking on each of the facets filters the results with the intersection. So if you want a more user-friendly interaction model, faceting/filtering is much easier, than getting users to type extensive Lucene queries.
Update: You might still need to do some lexical parsing if you wish to let users type natural language queries and break it down, but given the tremendously difficult challenge, my suggestion would be to use the simple & powerful faceting approach.
Hope that helps.
It sounds like you're developing a mini language, since you're concerned with syntax and parsing. So, check out the many tools used to generate lexers and parsers. You can start here: http://en.wikipedia.org/wiki/Lexical_analysis
I agree with John.
a) Start with lexical analysis
b) Take statistics of searches and use them to index
c) Find relationships by analysing possibly related searches
This is just a wild guess though, never tried it before.

Resources