Elasticsearch - which is better: querying several fields or a single combined field?

Disclaimer: possibly a duplicate of this SO question, not sure...
Let's assume I have something similar to IMDB (e.g. a catalog of movies) and I want to store it in Elasticsearch.
A single Movie record contains Title, Description, and Categories (strings, e.g. "Children", "Action", etc).
Let's assume that users are allowed to search free text, which can be anything: words from the title, from the description or from the categories (e.g. "movie for children").
I am wondering, from a search performance perspective, which is more efficient: to query each of the fields, or to create one big special field that is a concatenation of all the fields and query only that.
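To make the two options concrete, here is a minimal sketch of the request bodies each one would send. The lowercase field names (title, description, categories), the combined field name all_text and the ^2 boost on title are assumptions for illustration, not something given in the question. The sketch only builds the JSON bodies; it does not measure which option is faster.

```python
import json


def multi_field_query(text):
    """Option 1: one multi_match query over the separate fields.

    The field names and the ^2 boost on title are illustrative assumptions.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title^2", "description", "categories"],
            }
        }
    }


def combined_field_query(text):
    """Option 2: a plain match query on one concatenated field.

    "all_text" is a hypothetical field that would have to be filled at index time.
    """
    return {"query": {"match": {"all_text": text}}}


if __name__ == "__main__":
    print(json.dumps(multi_field_query("movie for children"), indent=2))
    print(json.dumps(combined_field_query("movie for children"), indent=2))
```

One design difference is visible even without benchmarks: with separate fields you can weight a title match above a description match, whereas a concatenated field treats all the text alike.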

Related

Elasticsearch similarity mechanism in array field

My use case: I have a field called subjects in an Elasticsearch index, which is a list and holds multiple values. For example, one doc has ['subject one', 'subject two', 'subject three'] in field subjects, and another doc has ['one test', 'one example', 'two'] in field name. So when I search for subject one in field name, I should get the first document first since it is the most relevant, but I was getting the second doc first, even though I am sorting the results by _score.
Basically, when the user searches for multiple terms and all of them are present in one document's field, that document should be listed first. For text fields this works fine, but for array fields it doesn't. My list field has more data.
Is there any way to achieve this using an ES similarity mechanism such as BM25?
Thank you

Sphinx: How can I change default ranking method?

I have a table of movies (movie_id, title); one movie can have many titles (different languages).
I would like to implement full-text search across all titles, with movies of the same relevance ordered by date. Right now I'm using Sphinx and doing this:
sql_joined_field = all_movie_titles from query; select movie_id as id, title from tbl_movie_titles order by movie_id
It's the only field used for search.
As I understand it, this way Sphinx matches the keyword in each title of a movie, but some movies have 2 titles while others have, for example, 10. Because keywords are often duplicated across the different titles of one movie, Sphinx calculates the relevance weight from the matches in all titles of that movie. As a result, two movies that should have the same relevance get different weights. I've tried different rankers, but the results are still bad. How can I make Sphinx calculate the weight for each title of a movie independently and then take the highest?
If this task can be solved more easily by another search engine, like Elasticsearch, tell me.
Thanks
You've effectively created a field that contains all the titles concatenated into one long string (the 'joined' in the definition).
So a multi-titled movie will contain the same words multiple times, which, as you say, can affect ranking.
You currently seem to be set up with a movie as your Sphinx document, i.e. one document per movie (regardless of what data you have for the movie).
One option would be to have one document per title instead (i.e. per movie/language combination); the ranking will then be computed 'within' the one language.
Because you (presumably) only want one result per movie, you can use the query-time GROUP BY option (which means making sure you have movie_id as an attribute).
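The effect of "one document per title, then group by movie" can be sketched in plain Python. This is not Sphinx code: the scoring function is a toy stand-in for a real ranker, the sample titles and dates are made up, and "newest first" is an assumed tie-break direction since the question does not say which way to order by date.

```python
from datetime import date


def title_score(query, title):
    """Toy relevance: fraction of query words that appear in this one title."""
    words = query.lower().split()
    if not words:
        return 0.0
    title_words = set(title.lower().split())
    return sum(w in title_words for w in words) / len(words)


def rank_movies(query, titles, release_dates):
    """Score every title on its own, keep the best score per movie,
    then sort by score (descending) and by date (newest first, an assumption).

    titles: list of (movie_id, title) rows, one row per title.
    release_dates: dict mapping movie_id to a datetime.date.
    """
    best = {}
    for movie_id, title in titles:
        score = title_score(query, title)
        best[movie_id] = max(best.get(movie_id, 0.0), score)
    hits = [(movie_id, score) for movie_id, score in best.items() if score > 0]
    return sorted(hits, key=lambda h: (-h[1], -release_dates[h[0]].toordinal()))


if __name__ == "__main__":
    # Hypothetical sample data: movie 1 has three titles, movie 2 has one.
    titles = [(1, "Alpha One"), (1, "Alpha Uno"), (1, "Alpha Eins"), (2, "Alpha Two")]
    dates = {1: date(2001, 1, 1), 2: date(2005, 1, 1)}
    print(rank_movies("alpha", titles, dates))
```

Because each title is scored independently, a movie with ten titles gets no advantage over a movie with two: both end up with the score of their single best title.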

Multilanguage elastic search

I will be indexing posts in Elasticsearch. For now there are two languages: English and Chinese. Each post has one (English) or two translations, plus some data that is common to both languages. My question is: how should I index the posts?
Create two indices, posts-en and posts-cn, and store the posts separately?
Create a single index, posts, and keep the data in a format like this:
{
commonParam1: 1,
commonParam2: "somevalue",
...
titleEn: "English title",
titleCn: "Chinese title",
contentEn: "Content EN",
contentCn: "Content CN",
...
}
Unless you have a compelling reason to split a single document across two indexes I'd strongly advise keeping it all in one index.
With one index you can easily use a different analyzer for each language-specific field. Adding mappings for new languages in the future is fairly straightforward. It lets you index each document in a single call, as opposed to two calls (one per language) if you index separately. You also reduce duplicated data (e.g. the common data).
I'd also take a good look at this post: http://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/
It's a good discussion on analyzing and indexing for multiple languages into Elasticsearch.
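As a rough illustration of "a different analyzer per language-specific field", the index body for the single posts index might look like the sketch below. It assumes the typeless mapping syntax of recent Elasticsearch versions and the built-in english and cjk analyzers; the types chosen for commonParam1 and commonParam2 are guesses based on the sample values in the question.

```python
import json


def posts_index_body():
    """Index creation body with one analyzer per language-specific field.

    Assumptions: typeless mappings (Elasticsearch 7+), built-in "english"
    and "cjk" analyzers, and guessed types for the common parameters.
    """
    return {
        "mappings": {
            "properties": {
                "commonParam1": {"type": "integer"},
                "commonParam2": {"type": "keyword"},
                "titleEn": {"type": "text", "analyzer": "english"},
                "contentEn": {"type": "text", "analyzer": "english"},
                "titleCn": {"type": "text", "analyzer": "cjk"},
                "contentCn": {"type": "text", "analyzer": "cjk"},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(posts_index_body(), indent=2))
```

Adding a third language later would then mean adding two more properties with their own analyzer, without touching the existing fields.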

Elasticsearch indexed database table column structure

I have a question regarding the setup of my Elasticsearch index. I have created a table which I index into Elasticsearch through a river. The table is built by a script that queries multiple tables to denormalize the data, making it easier to index by a unique id in a 1:1 ratio.
An example of a set of fields I have is street, city, state, zip, which I can query on. My question is: should I keep those fields individually indexed, or concatenate them into one big field such as address that contains all of them? Or should I put in the extra time to set up parent-child indexes?
The use case is this: I have a customer with billing info coming from one direction, and I want to query Elasticsearch to see if that customer already exists, or at least return the closest result.
I know this question is more conceptual than programming; I just can't find any information on best practices.
Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Having multiple fields gives you the advantage of calculating facets and aggregates on those fields, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to search for information across the different fields.
In addition to keeping the information in separate fields, I would use multi-fields with an analyzed and a not_analyzed part (fieldname.raw). This again allows for aggregates, facets and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if the field is analyzed, it is stored as the separate terms ['new', 'york'] (the standard analyzer also lowercases), so an aggregation on it cannot show all people from 'New York'. What you'd see instead are counts for 'new' and 'york'.
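A minimal sketch of such a multi-field mapping, together with an aggregation on the raw sub-field, is shown below. It uses the syntax of recent Elasticsearch versions, where the not_analyzed part is expressed as a keyword sub-field; the linked 0.90 page uses an older multi_field syntax. Mapping zip as keyword only is an assumption.

```python
import json


def text_with_raw():
    """Analyzed text field plus an exact-value "raw" sub-field (keyword)."""
    return {"type": "text", "fields": {"raw": {"type": "keyword"}}}


def address_mapping():
    """Separate address fields; mapping zip as keyword only is an assumption."""
    return {
        "mappings": {
            "properties": {
                "street": text_with_raw(),
                "city": text_with_raw(),
                "state": text_with_raw(),
                "zip": {"type": "keyword"},
            }
        }
    }


def customers_per_city():
    """Terms aggregation on city.raw, so 'New York' stays a single bucket."""
    return {"size": 0, "aggs": {"by_city": {"terms": {"field": "city.raw"}}}}


if __name__ == "__main__":
    print(json.dumps(address_mapping(), indent=2))
    print(json.dumps(customers_per_city(), indent=2))
```

Full-text queries would still target city, while aggregations and sorting would use city.raw.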
_all field
There is a special _all field in elasticsearch which does the concatenation in the background. You don't have to do it yourself. It is possible to enable/disable it.
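Note that the _all field was deprecated in Elasticsearch 6.0 and removed in 7.0. On those versions a similar catch-all field can be built with the copy_to mapping parameter. The sketch below assumes a hypothetical combined field called address and keeps the individual fields searchable on their own.

```python
import json


def address_copy_to_mapping():
    """Each part is copied into one combined "address" field at index time.

    "address" is a hypothetical field name; copy_to is the mapping parameter
    that replaces the old _all behaviour on newer Elasticsearch versions.
    """
    parts = ["street", "city", "state", "zip"]
    properties = {name: {"type": "text", "copy_to": "address"} for name in parts}
    properties["address"] = {"type": "text"}
    return {"mappings": {"properties": properties}}


def closest_customer_query(text):
    """Match the whole billing address against the combined field."""
    return {"query": {"match": {"address": text}}}


if __name__ == "__main__":
    print(json.dumps(address_copy_to_mapping(), indent=2))
```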
Parent Child relationship
Concerning the part whether to use nested objects or parent child relationship: I think that using a parent child relationship is more appropriate for your case. Arrays of plain inner objects (the default object mapping) are stored in a 'flattened' way, i.e. the information from all the objects in the array is stored as being part of one object. Consider the following example:
You have orders for two clients:
client: 'Samuel Thomson'
orderline: 'Strong Thinkpad'
orderline: 'Light Macbook'
client: 'Jay Rizzi'
orderline: 'Strong Macbook'
Using plain inner objects, if you search for clients who ordered 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orders are stored all together, i.e. ['Strong' 'Thinkpad' 'Light' 'Macbook'], with no distinction between the two orderlines.
By using parent child documents, the orderlines for the same client are not mixed and preserve their identity. (Fields mapped with the dedicated nested type are also indexed per object, so they avoid this mixing as well.)
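The difference can be reproduced with a small simulation. This is plain Python, not Elasticsearch: it models "flattened" storage as one bag of words per client and "separate" storage as one bag of words per orderline, and requires every query word to match.

```python
clients = [
    {"client": "Samuel Thomson", "orderlines": ["Strong Thinkpad", "Light Macbook"]},
    {"client": "Jay Rizzi", "orderlines": ["Strong Macbook"]},
]


def tokens(text):
    """Lowercased whitespace tokens, a rough stand-in for an analyzer."""
    return set(text.lower().split())


def match_flattened(clients, query):
    """All orderlines of a client are merged into a single bag of words."""
    wanted = tokens(query)
    result = []
    for c in clients:
        bag = set()
        for line in c["orderlines"]:
            bag |= tokens(line)
        if wanted <= bag:
            result.append(c["client"])
    return result


def match_per_orderline(clients, query):
    """Every orderline is matched on its own, as with separate child documents."""
    wanted = tokens(query)
    return [
        c["client"]
        for c in clients
        if any(wanted <= tokens(line) for line in c["orderlines"])
    ]


if __name__ == "__main__":
    print(match_flattened(clients, "Strong Macbook"))      # both clients
    print(match_per_orderline(clients, "Strong Macbook"))  # only Jay Rizzi
```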

different field according to categories

I'm trying to use Elasticsearch to search through products. If a product is a car, for instance, it will have fields like "color", "brand", "model", "km", ...
If it is clothes, it will only have "color", "size", ...
I would like to index all this info in Elasticsearch so that I can then search for cars with km between aaa km and bbb km, and/or with model xxxx, and do the same for clothes or any other products.
How can I create such field(s) in Elasticsearch? I want all products to be in the same index, so that a user can search through all products; but if the user searches for one type of product, he should be able to specify more details specific to that kind of product.
I was thinking about an array field, but does that mean that all products will have all the fields of all product types, even if some fields are not relevant to some products (i.e. clothes will have a km field)? Or is it possible at indexing time to store only the info needed for each product?
thanks
You could use types. Create a type called car with fields color, brand, model, km etc. and then a type called cloth with fields color, size, etc.
A single index can have many types. The following two links might help you in this:
Creating indices
Creating types and mapping to the index
You can easily search across types, so you could issue a search like this to return all documents from all types within that index:
curl -XGET 'http://localhost:9200/_search?pretty=true' -H 'Content-Type: application/json' -d '{"query":{"match_all":{}}}'
Additional information - Searching across types
Having an array field is not a good idea since you would not be utilizing the ability of elasticsearch to index semi structured documents.
All the best.
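Mapping types were deprecated in Elasticsearch 6 and later removed, so on newer versions the same idea is usually expressed differently. One possible sketch, under assumptions: keep all products in one index, give each document only the fields that apply to it, and add a discriminator field. The field name product_type, the sample documents and the keyword mapping implied by the term filters are all hypothetical.

```python
import json

# Hypothetical sample documents: each one carries only its relevant fields.
car_doc = {"product_type": "car", "color": "red", "brand": "X", "model": "M1", "km": 42000}
clothes_doc = {"product_type": "clothes", "color": "blue", "size": "L"}


def car_search(min_km, max_km, model=None):
    """Cars with km in [min_km, max_km], optionally restricted to one model.

    Assumes product_type and model are mapped as keyword fields.
    """
    filters = [
        {"term": {"product_type": "car"}},
        {"range": {"km": {"gte": min_km, "lte": max_km}}},
    ]
    if model is not None:
        filters.append({"term": {"model": model}})
    return {"query": {"bool": {"filter": filters}}}


if __name__ == "__main__":
    print(json.dumps(car_search(10000, 50000, model="M1"), indent=2))
```

A search without the product_type filter still runs across all products, and documents that lack a field such as km simply do not match filters on it.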
