Multilanguage elastic search - elasticsearch

I will be indexing posts in ElasticSearch. For now there are two languages: English and Chinese. So Each post has one (English) or two translations plus some data that are common for both languages. My question is how should I index posts?
Create two indices: posts-en and posts-cn and store posts separately?
Create single index posts and keep data in format like this:
{
commonParam1: 1,
commonParam2: "somevalue",
...
titleEn: "English title",
titleCn: "Chinese title",
contentEn: "Content EN",
contentCn: "Content CN",
...
}

Unless you have a compelling reason to split a single document across two indexes I'd strongly advise keeping it all in one index.
With one index you can easily use a different analyzer for each each language specific field. Adding additional mappings in the future for new languages is fairly straightforward. It allows you to index each document in a single call as opposed to two, one for each language, if you index separately. You reduce duplicated data (e.g. the common data).
I'd also take a good look at this post: http://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/
It's a good discussion on analyzing and indexing for multiple languages into Elasticsearch.

Related

Elasticsearch query for wikipedia pages

I have indexed all wikipedia pages on elasticsearch, and now I would like to search through them according to a list of keywords that I have created. The documents on elasticsearch have only three fields: id for the page id, title for the page title and content for the page content (already clean of wikipedia markup).
My goal is to reproduce the mediawiki query api as much as possible, with parameters action=query and list=search. For instance, given the keywords "non riemannian metric spaces", a call to
https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=10&srprop=&srsearch=non%20riemannian%20metric%20spaces
gives a list of the most relevant pages for those keywords.
So far I have been using rather simple elasticsearch search queries, like for instance
POST _search
{
"query": {
"bool" : {
"must" : {
"match" : {
"content": {
"query": "non riemannian metric spaces"
}
}
},
"should" : {
"match" : {
"title": {
"query": "non riemannian metric spaces",
"boost": x
}
}
}
}
}
}
for several values of boost, like 1, 2 or 0.5. This gives already some decent results, in the sense that the pages I obtain are relevant to the keywords, but sometimes they are not quite the same I get with the mediawiki api.
I would be glad to hear some suggestions on how to fine-tune the elasticsearch query to mimic more accurately the mediawiki api behavior. Or even, since the mediawiki api itself is built with elasticsearch and cirrussearch, I would like to know whether the actual elasticsearch query for the entry point above with those specific parameters is openly available.
Thank you in advance!
UPDATE (after Robis Koopmans' answer): Seeing the actual query with cirrusDumpQuery has indeed been very useful. I do however have some followup questions concerning the query:
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I ignore what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related with the match_all clause?
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
Please feel free to answer whichever question you know the answer to, and again thanks a lot!
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I ignore what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
The .plain fields are generated as part of the elasticsearch mapping. The current settings and mappings are available to see how exactly they work. mediawiki.org includes a search glossary entry on the plain field as well. In general the top level field contains a highly processed form of the text, and the plain field uses minimal analysis.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related with the match_all clause?
mediawiki.org also contains an (incomplete) CirrusSearch schema that gives a brief description of these fields and the various analysis chain components used. The all field is an optimization to give a strong first-pass filter against the search index.
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
Suggest field contains shingles (word ngrams) of the articles title and redirects, essentially a pre-calculation of phrase queries. The suggest might look like synonyms in the explain output, and they often contain those, but it also includes misspellings, translations, and numerous other reasons editors have for creating redirects. Matches on redirects are generally a strong relevance signal.
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
This is the fraction of page views on the wiki that go to that article.
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
This model is generated by mjolnir and essentially overwrites the score from the rest of the query. There is some information in wikitech (found there as it is more specific to the WMF deployment of mediawiki than mediawiki itself), also a slide deck called From Clicks to Models might give some insight into whats happening in that code base. Perhaps important to know mjolnir only applies to bag of words queries, queries invoking phrases or other expert functionality skip the ML model.
Noone had asked for the models before, if they might be useful i dumped the current models from the ranking plugin. This contains both the feature definitions used and the decision trees generated by xgboost.
I didn't find an excuse to link it above, but maybe the draft page at CirrusSearch/Scoring that mentions some of the factors that go into retrieval and scoring, particularly for queries that can't be run through mjolnir models, might help as well.
You can add cirrusDumpQuery to your query
example:
https://en.wikipedia.org/w/index.php?title=Special:Search&cirrusDumpQuery=&search=cat+dog+chicken&ns0=1
more information:
https://www.mediawiki.org/wiki/Extension:CirrusSearch#API
You can't make Elasticsearch queries to Wikipedia directly, but CirrusSearch can generate many types of queries beyond fulltext search. It's not clear from your question exactly what type of query you are looking for, but it might be worth to look at sorting options, if you prefer to weight results by text similarity only, and not things like page views.

ElasticSearch Index Modeling

I am new to ElasticSearch (you will figure out after reading the question!) and I need help in designing ElastiSearch index for a dataset similar to described in the example below.
I have data for companies in Russell 2000 Index. To define an index for these companies, I have the following mapping -
`
{
"mappings": {
"company": {
"_all": { "enabled": false },
"properties": {
"ticker": { "type": "text" },
"name": { "type": "text" },
"CEO": { "type": "text" },
"CEO_start_date": {"type": "date"},
"CEO_end_date": {"type": "date"}
}
}
}
`
As CEO of a company changes, I want to update end_date of the existing document and add a new document with start date.
Here,
(1) For such dataset what is an ideal id scheme? Since I want to keep multiple documents should I consider (company_id + date) combination as id
(2) Since CEO changes are infrequent should Time Based indexing considered in this case?
You're schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. Keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example common words like "the" will be dropped and word endings like "ing" will be removed. Depending on how you want to search for names in your index you may also want to switch that to keyword. Also note that you have the option of indexing a field twice using BOTH keyword and text if you need to support both search methods.
Recommendation 2:
Sid raised a good point in his comment about using this a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade off you generally make by selecting ES over something more traditional like an RDBMS is you get way more powerful read operations (searching by any field, full text search, etc) but lose relational operations (joins). Also I find that loading/updating data into ES is slower than an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking state of operations, or if you rely heavily on JOIN operations you may want to look at using a RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for that is guaranteed to be unique and evenly distributed. Sometimes you will still need to put your own IDs in though. If that is the case for your use case then adding a new field where you combine the company ID and date would probably work fine.
Question 2: Time based index
Time based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes to 2000 companies you probably won't have very many events. I would probably skip them since it adds a little bit of complexity that doesn't buy you much in this use case.

Elasticsearch - what is better: query several fields or single combined field?

Declaimer: Possible duplicate of this SO question, not sure...
Let's assume I have something similar to IMDB (e.g. catalog of movies) and I want to store it in Elasticsearch.
Single Movie record contains Title, Description, and Categories (strings, e.g. "Children", "Action", etc).
Let's assume that users allowed to search a free text, which can be everything: words from title, from description or from categories (e.g. "movie for children").
I wondering, from search performance perspective, what is more efficient: to query on each of the fields, or to create a special big field which is a concatenation of all of the fields and then to query only on it.

Elastic Search: Modelling data containing variable fields

I need to store data that can be represented in JSON as follows:
Article{
Id: 1,
Category: History,
Title: War stories,
//Comments could be pretty long and also be changed frequently
Comments: "Nice narration, Reminds me of the difficult Times, Tough Decisions"
Tags: "truth, reality, history", //Might change frequently
UserSpecifiedNotes:[
//The array may contain different users for different articles
{
userid: 20,
note: "Good for work"
},
{
userid: 22,
note: "Homework is due for work"
}
]
}
After having gone through different articles, denormalization of data is one of the ways to handle this data. But since common fields could be pretty long and even be changed frequently, I would like to not repeat it. What could be the other ways better ways to represent and search this data? Parent-child? Inner object?
Currently, I would be dealing with a lot of inserts, updates and few searches. But whenever search is to be done, it has to be very fast. I am using NEST (.net client) for using elastic search. The search query to be used is expected to work as follows:
Input: searchString and a userID
Behavior: The Articles containing searchString in either Title, comments, tags or the note for the given userIDsort in the order of relevance
In a normal scenario the main contents of the article will be changed very rarely whereas the "UserSpecifiedNotes"/comments against an article will be generated/added more frequently. This is an ideal use case for implementing parent-child relation.
With inner object you still have to reindex all of the "man article" and "UserSpecifiedNotes"/comments every time a new note comes in. With the use of parent-child relation you will be just adding a new note.
With the details you have specified you can take the approach of 4 indices
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that what need to be kept in mind is your actual requirement. Having parent-child relation will need more memory, and ma slow down search performance a tiny bit. But indexing will be faster.
On the other hand a nested object will increase your indexing time significantly as you need to collect all the data related to an article before indexing. You can of course store everything and just add as an update. As a simpler maintenance and ease of implementation I would suggest use parent-child.

How to implement Tag search?

I've designed a news hub system which read Rss links and stores whole news in the database. Now I want to implement a search system using tags. Each news has it's own tags. There are lots of algorithms to implement this but I don't know what is the most common to have the best performance. Currently I'm using Elastic search database and I use multiple keyword search. Which one of these are the best?
1- to store tags in a list or a string with a separator and search among them?
2- work like a relational system and have a table of tags, and a table of news tags to have a record for each news tag. and 5 records for 5 tags of one news
3- another algorithm which I don't know
Seems like you want something like the inverted index
This is an index, that for each term (hashtag in your case) holds a list of document ids which contain this hashtag.
For example, if you have 3 documents: d1,d2,d3 with the hash tags:
d1: #tag1, #tag2
d2: #tag3
d3: tag3, #tag2
The inverted index will be:
#tag1: d1
#tag2: d1,d3
#tag3: d2,d3
It is fairly easy using the inverted index to find all documents that contain a certain term (hashtag in your case), by simply going over the list the is attached to this term.
This datastructure is also very efficient for union (or queries) and intersection (and queries).
This DS is very popular for information retrieval for full text search and also is often used in semi-structured search.
For more information, you can read about Information Retrieval in general. Mannings Introduction to Information Retrieval represents this Data structure in the book's first chapter.
ElasticSearch will handle that very well and you have multiple ways of implementing that behavior.
What you want is a parent child relationship between a news article (parent) and its tags (children).
Depending on whether you need to update the hashtags after indexing your news articles or not, you could go with storing them in the news article or as separate documents pointing to the news article document as their parent.
See more details here: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
You mentioned a choice between storing the tags as a list or a comma separated string. Go with the list as that is more idiomatic and ElasticSearch can handle json objects (you would actually analyze the string and turn it into a list of token anyways).

Resources