Elasticsearch Index Modeling - elasticsearch

I am new to Elasticsearch (you will figure that out after reading the question!) and I need help designing an Elasticsearch index for a dataset similar to the one described in the example below.
I have data for companies in the Russell 2000 Index. To define an index for these companies, I have the following mapping:
{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "text" },
        "name": { "type": "text" },
        "CEO": { "type": "text" },
        "CEO_start_date": { "type": "date" },
        "CEO_end_date": { "type": "date" }
      }
    }
  }
}
When the CEO of a company changes, I want to update the end date on the existing document and add a new document with the new CEO's start date.
Here,
(1) For such a dataset, what is an ideal ID scheme? Since I want to keep multiple documents per company, should I use a (company_id + date) combination as the ID?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?

Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. keyword allows you to use term queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example, common words like "the" will be dropped and word endings like "ing" will be removed. Depending on how you want to search for names in your index, you may also want to switch that field to keyword. Also note that you have the option of indexing a field twice, using BOTH keyword and text, if you need to support both search methods.
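For example, a multi-field version of the mapping could look like this (a sketch against the 5.x-style mapping from the question; the sub-field name raw is just a convention, any name works):

{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "keyword" },
        "name": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}

Full-text queries then go against name, while exact term queries and aggregations go against name.raw.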
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get much more powerful read operations (searching by any field, full-text search, etc.) but lose relational operations (joins). I also find that loading/updating data in ES is slower than in an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for your use case, then adding a new field where you combine the company ID and date would probably work fine.
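If you do go with the combined ID, indexing a CEO tenure could look something like this (the index name, ticker, and dates are made up for illustration):

// hypothetical index, type, and combined ID
PUT companies/company/ACME_2017-01-15
{
  "ticker": "ACME",
  "name": "Acme Corp",
  "CEO": "Jane Doe",
  "CEO_start_date": "2017-01-15"
}

When the CEO changes, you update this document's CEO_end_date and index a new document whose ID embeds the new start date.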
Question 2: Time based index
Time-based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes for 2000 companies you probably won't have very many events. I would probably skip them, since they add a bit of complexity that doesn't buy you much in this use case.

Related

Elasticsearch - Test new analyzers against an existing data set

New to Elasticsearch.
I need to update an index to treat both plurals and singulars as matches, so green apple should match green apples as well (and vice versa).
Through my research, I understand I need to recreate the index with a stemmer filter.
So:
"analysis": {
"analyzer": {
"std_analyzer": {
"tokenizer": "whitespace",
"filter": [ "stemmer" ]
}
}
}
Can anyone confirm if the above is correct? If not, what will I need to use?
I also understand that I cannot modify the existing index, but rather I will need to create a new one with this analyzer, and then re-add all the documents to the new index. Is that correct? If so, is there a shortcut or easy way to tell it to "add all documents from index X to new index Y?"
Thank you for your help
Inline answers below:
In most cases it should work, but it is really difficult to cover all future use cases, and in your case we don't even know your current ones. You can use the Analyze API to test some of your use cases before pushing these analyzer changes to production.
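For example, once the index exists you can hit the Analyze API with your custom analyzer and inspect the tokens it produces (my_index is a placeholder for your index name):

// run the question's std_analyzer against sample text
GET my_index/_analyze
{
  "analyzer": "std_analyzer",
  "text": "green apples"
}

With the stemmer filter in place, both "apple" and "apples" should reduce to the same token, which is what makes the singular and plural forms match each other.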
Adding or changing an analyzer is a breaking change, since it controls how tokens are generated and stored in Elasticsearch's inverted index; hence you have to reindex all the documents with the updated analyzer setting. You can use the Reindex API with an alias to do it with zero downtime, as sketched below.
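A minimal sketch of that flow, assuming hypothetical indices products_v1 and products_v2 sitting behind an alias products:

// copy all documents into the new index with the updated analyzer
POST _reindex
{
  "source": { "index": "products_v1" },
  "dest": { "index": "products_v2" }
}

// atomically repoint the alias at the new index
POST _aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add": { "index": "products_v2", "alias": "products" } }
  ]
}

Since your application only ever talks to the alias, the swap in the second call is atomic and clients never notice the cutover.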

difference between a field and the field.keyword

If I add a document with several fields to an Elasticsearch index, when I view it in Kibana I see the same field twice each time: one of them is called some_field and the other is called some_field.keyword.
Where does this behaviour come from, and what is the difference between the two?
PS: one of them is aggregatable (not sure what that means) and the other (without .keyword) is not.
Update: A short answer would be that type: text is analyzed, meaning it is broken up into distinct words when stored, and allows for free-text searches on one or more words in the field. The .keyword field takes the same input and keeps it as one single string, meaning it can be aggregated on, and you can use wildcard searches on it. Aggregatable means you can use it in aggregations in Elasticsearch, which resemble a SQL GROUP BY if you are familiar with that. In Kibana you would probably use the .keyword field with aggregations to count distinct values etc.
Please take a look at this article about text vs. keyword.
Briefly: in Elasticsearch 5.0 the string type was replaced by the text and keyword types. Since then, when you do not specify an explicit mapping, for a simple document with a string field:
{
  "some_field": "string value"
}
the following dynamic mapping will be created:
{
  "some_field": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
As a consequence, it will both be possible to perform full-text search on some_field, and keyword search and aggregations using the some_field.keyword field.
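For instance (my_index is a placeholder), the first query below performs an analyzed full-text match, while the second requires the exact stored value:

// matches because "string" is one of the analyzed tokens
GET my_index/_search
{
  "query": { "match": { "some_field": "string" } }
}

// matches only the complete, unanalyzed value
GET my_index/_search
{
  "query": { "term": { "some_field.keyword": "string value" } }
}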
I hope this answers your question.
Look at this issue; there is some explanation of your question in it. Roughly speaking, some_field is analyzed and can be used for full-text search. On the other hand, some_field.keyword is not analyzed and can be used in term queries or in aggregations.
I will try to answer your questions one by one.
Where does this behavior come from?
It is introduced in Elastic 5.0.
What is the difference between the two?
some_field is used for full text search and some_field.keyword is used for keyword searching.
Full-text searching is used when we want individual tokens of a field's value to be included in the search. For instance, if you are searching for all the hotel names that have "farm" in them, such as hay farm house, Windy harbour farm house, etc.
Keyword searching is used when we want to match against the whole value of the field, not individual tokens from it. For example, suppose you are aggregating documents based on a city field. Aggregating on the analyzed text would produce separate counts for "new" and "york" instead of a single count for "new york", which is usually what you expect.
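As a sketch of that group-by-like behavior (hotels and city are placeholder names), the aggregation is run against the .keyword sub-field:

// count documents per whole city name, like a SQL GROUP BY
GET hotels/_search
{
  "size": 0,
  "aggs": {
    "by_city": {
      "terms": { "field": "city.keyword" }
    }
  }
}

Running the same terms aggregation on the analyzed city field would require enabling fielddata and would count the individual tokens ("new", "york") rather than the whole value.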
From Elastic 5.0 onwards, strings will be mapped as both text and keyword by default.

Excluding users in a search, the most optimal way

I have two indexes: one for a collection of profiles, and another containing each user's excludes, e.g. blocked profiles.
The per-user exclude lists will be updated very often, while in comparison the profiles are seldom updated. In this situation it is recommended to separate the data into two indexes, as I understand it.
EDIT [2017-01-25]
This is the mappings for the two indexes:
PROFILES MAPPING
PUT xyz_profiles
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "_all": { "enabled": false },
      "dynamic": "strict",
      "properties": {
        "profile_id": { "type": "integer" },
        "firstname": { "type": "text" },
        "age": { "type": "integer" },
        "gender": { "type": "keyword" },
        "height": { "type": "integer" },
        "area_id": { "type": "integer" },
        "created": {
          "type": "date",
          "format": "date_time_no_millis"
        },
        "location": { "type": "geo_point" }
      }
    }
  }
}
EXCLUDE LISTS MAPPING
PUT xyz_exclude_from_search
{
  "settings": {
    "auto_expand_replicas": "0-all"
  },
  "mappings": {
    "exclude_profile": {
      "_all": { "enabled": false },
      "dynamic": "strict",
      "properties": {
        "profile_id": { "type": "integer" },
        "exclude_ids": { "type": "integer" }
      }
    }
  }
}
number_of_shards is 1 since this is on a single node (my test server).
auto_expand_replicas set to 0-all is to make sure that the exclude list is copied to all nodes. I am aware that this is superfluous on a single node, but I don't want to forget it when this is implemented on the production cluster.
exclude_ids will be an array of integers (profile ids) to exclude from the search.
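Note that for the terms lookup in the query below to work, each exclude-list document must be indexed with the profile ID as its document _id, for example (the excluded IDs are made up for illustration):

// _id 3076 is what the terms lookup's "id" field points at
PUT xyz_exclude_from_search/exclude_profile/3076
{
  "profile_id": 3076,
  "exclude_ids": [512, 1044, 2071]
}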
This is the part of a search where certain profiles are excluded using the current user's (id 3076) exclude list:
GET /xyz_profiles/profile/_search
{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "profile_id": {
            "index": "xyz_exclude_from_search",
            "type": "exclude_profile",
            "id": "3076",
            "path": "exclude_ids"
          }
        }
      }
    }
  }
}
Being very new to Elasticsearch, I would very much like to know if this is the optimal way of doing it. I imagine there are some very experienced people out there who can pinpoint whether my mappings or my search are missing something obvious that would improve performance.
For example, I haven't fully understood the analyzed/not_analyzed part of mappings, nor how to use routing in searches.
This is an interesting question. I think it's quite a common pattern, but at the moment there is not much information about it on the Internet.
I was in a similar situation some time ago and solved it in a similar way to the one you propose, except I did not separate the data into two indexes; I just added an exclude_ids field to our user index. For example, when the user with id 1 is searching, we use a term query to check that id 1 is not inside the exclude_ids of the target users, a query like:
{ "term": { "exclude_ids": 1 }
After using it with around two million documents I found out that:
Search is fast
Taking into account how the inverted index works I think this usage is correct
Search is done inside the same index (having to search in other indexes means checking more shards)
Updates are slow
Each time an id is added to exclude_ids, the whole document is reindexed, since partial updates to a document are not possible at the Lucene level. If the exclude_ids array gets very long, the updates can become especially slow.
For the same reason, data that is not usually updated, like name or age, gets reindexed as well.
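Note that the Update API does not change this: it gives you partial-update syntax, but internally Elasticsearch still fetches, modifies, and reindexes the whole document. A sketch of such an update (index, type, and IDs are made up; the script syntax is 5.x-style):

// appends one id to the user's exclude list; the full document is rewritten internally
POST users/user/1/_update
{
  "script": {
    "inline": "ctx._source.exclude_ids.add(params.excluded)",
    "params": { "excluded": 42 }
  }
}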
In your case, since you are separating the exclude list into another index, the data that is not usually updated does not have to be reindexed each time, as you said. But the problem of arrays that grow indefinitely is still there.
Plus, taking into account the way you would do the query (with a terms query using a lookup, I guess), there is a possibility of ending up with some overhead when filtering a large amount of data. But I'm not sure about this. This is discussed here.
It's difficult to decide which one would scale better with a huge amount of data; doing load tests could be a good idea.
A way to solve the expensive-updates problem could be not storing exclude_ids in Elasticsearch, but keeping only the active users' exclude lists in memory (using Redis or similar) with a TTL set on them. I suppose the original data is still being stored in MySQL, so it can be read from there and loaded into memory whenever necessary (for example, when a user becomes active). But I think this is not recommended, since it seems that a terms query with many terms degrades performance a lot (explained in this issue).
There is already a similar question, but in my opinion there are many things that should be taken into account that are not discussed there. I would be happy to read more opinions about search and update performance with big amounts of data.

Elastic Search: Modelling data containing variable fields

I need to store data that can be represented in JSON as follows:
// an Article document
{
  "Id": 1,
  "Category": "History",
  "Title": "War stories",
  // Comments could be pretty long and also be changed frequently
  "Comments": "Nice narration, Reminds me of the difficult Times, Tough Decisions",
  // Might change frequently
  "Tags": "truth, reality, history",
  // The array may contain different users for different articles
  "UserSpecifiedNotes": [
    {
      "userid": 20,
      "note": "Good for work"
    },
    {
      "userid": 22,
      "note": "Homework is due for work"
    }
  ]
}
After having gone through different articles, denormalizing the data is one of the ways to handle it. But since the common fields could be pretty long and even change frequently, I would rather not repeat them. What would be other, better ways to represent and search this data? Parent-child? Inner objects?
Currently, I would be dealing with a lot of inserts and updates and few searches. But whenever a search is done, it has to be very fast. I am using NEST (the .NET client) for Elasticsearch. The search query is expected to work as follows:
Input: searchString and a userID
Behavior: return the articles containing searchString in either the title, comments, tags, or the note for the given userID, sorted in order of relevance.
In a normal scenario the main contents of the article will change very rarely, whereas the "UserSpecifiedNotes"/comments against an article will be generated/added more frequently. This is an ideal use case for a parent-child relation.
With an inner object you still have to reindex all of the main article and the "UserSpecifiedNotes"/comments every time a new note comes in. With a parent-child relation you will just be adding a new note.
With the details you have specified you can take the approach of 4 indices
Main Article (id, category, title, description etc)
Comments (commented by, comment text etc)
Tags (tags, any other meta tag)
UserSpecifiedNotes (userId, notes)
Having said that, what needs to be kept in mind is your actual requirement. A parent-child relation needs more memory and may slow down search performance a tiny bit, but indexing will be faster.
On the other hand, a nested object will increase your indexing time significantly, as you need to collect all the data related to an article before indexing. You can of course store everything and just add to it as an update. For simpler maintenance and ease of implementation I would suggest using parent-child, as sketched below.
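A minimal sketch of the parent-child variant, using the pre-6.x _parent mechanism that matches the 5.x-style mappings elsewhere in this thread (in 6.x and later this is replaced by the join field type; all names here are illustrative):

PUT articles
{
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "text" },
        "category": { "type": "keyword" }
      }
    },
    "note": {
      "_parent": { "type": "article" },
      "properties": {
        "userid": { "type": "integer" },
        "note": { "type": "text" }
      }
    }
  }
}

// adding a note indexes only the small child document; the parent article is untouched
PUT articles/note/100?parent=1
{
  "userid": 20,
  "note": "Good for work"
}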

many indexes for mongodb refined searches

Referring to this question here:
I am working on a similar site using MongoDB as my main database. As you can imagine, each user object has a lot of fields that need to be searchable, say for example mood, city, age, sex, smoker, drinker, etc.
Now, apart from the problem that there cannot be more than 64 indexes per collection, is it wise to assign an index to every one of my fields?
There might be another viable way of doing it: tags (refer to this other question). If I set the index on an array of predetermined tags and then text-search over them, would that be better, since I would be using only ONE index? What do you think? E.g.:
{
  name: "john",
  tags: ["happy", "new-york", "smoke0", "drink1"]
}
MongoDB doesn't (yet) support index intersection, so the rule is: one index per query. Some of your query parameters have extremely low selectivity, the extreme example being the boolean ones, and indexing those will usually slow things down rather than speed them up.
As a simple approximation, you could create a compound index that starts with the highest-selectivity fields, for instance { city: 1, age: 1, mood: 1, ... }. However, you will then always have to include a city constraint: if you query on { age, mood } alone, that index can't be used.
If you can narrow down your result set to a reasonable size using indexes, a scan within that set won't be a performance hog. More precisely, if you say limit(100) and MongoDB has to scan 200 items to fill up those 100, it won't be critical.
The danger lies in very narrow searches across the database: if you have to perform a scan on the entire dataset to find the only unhappy, drinking non-smoker older than 95, things get ugly.
If you want to allow very fine-grained searches, a dedicated search database such as Solr might be a better option.
EDIT: The tags suggestion looks a bit like using a crowbar to me; maybe the key/value multikey index recommended in the MongoDB FAQ is a cleaner solution:
{
  _id: ObjectId(...),
  attrib: [
    { k: "mood", v: "happy" },
    { k: "city", v: "new york" },
    { k: "smoker", v: false },
    { k: "drinker", v: true }
  ]
}
However, YMMV and 'clean' and 'fast' often don't point in the same direction, so the tags approach might not be bad at all.
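As a hedged sketch of how the key/value pattern above would be indexed and queried in the mongo shell (the users collection name is made up), $elemMatch keeps k and v paired within the same array element:

// one compound index covers every attribute
db.users.createIndex({ "attrib.k": 1, "attrib.v": 1 })

// find all happy users
db.users.find({ attrib: { $elemMatch: { k: "mood", v: "happy" } } })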
