What is the best way to index data on Elasticsearch?

I have 4 tables:
country
state
city
address
These tables are related by ids where country is the top parent:
state.countryId
city.stateId
address.cityId
I want to integrate Elasticsearch into my application and want to know the best way to index these tables.
Should I create one index per table, so that I have one index each for country, state, city and address?
Or should I denormalize the tables and create only one index, storing all the data with redundancy?

ES is not afraid of redundancy in your data, so I would clearly denormalize so that each document represents one address like this:
{
  "country_id": 1,
  "country_name": "United States of America",
  "state_id": 1,
  "state_name": "California",
  "state_code": "CA",
  "city_id": 1,
  "city_name": "San Mateo",
  "zip_code": 94402,
  "address": "400 N El Camino Real"
}
You can then aggregate your data on whatever city, state, country field you wish.
Your mileage may vary as it ultimately depends on how you want to query/aggregate your data, but it's much easier to query address data like this in a single index instead of hitting several indices.
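For instance, a sketch of such an aggregation over the denormalized documents, bucketing addresses by state and then by city (field names taken from the sample document above; on older ES versions the string fields may need to be not_analyzed, or keyword on newer ones, to be aggregatable), sent as the body of a _search request:

```json
{
  "size": 0,
  "aggs": {
    "by_state": {
      "terms": { "field": "state_code" },
      "aggs": {
        "by_city": {
          "terms": { "field": "city_name" }
        }
      }
    }
  }
}
```

With size set to 0 you skip the hits entirely and get back only the per-state and per-city counts.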

I like Val's answer; it is the most straightforward option. But if you really want to reduce duplication (for example, to minimize size on disk) you could use a parent-child mapping. It will make indexing and querying a bit more verbose, though. I still suggest going with the "flat" mapping.
You asked "what if you need the individual country or state or city records?" I'd recommend adding an additional field (not_analyzed or integer) which indicates which level of the hierarchy the document represents. It is fine not to have the fields that correspond to lower levels of the hierarchy. This way you could easily filter a search to just states or countries.
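A sketch of that hierarchy-level field (the name doc_level is hypothetical, and the filter uses the ES 2.x+ bool/filter syntax): the first object is a document representing a state, which simply omits the city- and address-level fields, and the second is a query matching only state-level documents:

```json
{
  "doc_level": "state",
  "country_id": 1,
  "country_name": "United States of America",
  "state_id": 1,
  "state_name": "California",
  "state_code": "CA"
}

{
  "query": {
    "bool": {
      "filter": { "term": { "doc_level": "state" } }
    }
  }
}
```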

Here is a very useful article by @adrien-grand which elaborates on the trade-offs between creating many indices versus fewer indices with many types.
Hope it helps!


ElasticSearch Index Modeling

I am new to ElasticSearch (you will figure that out after reading the question!) and I need help in designing an ElasticSearch index for a dataset similar to the one described in the example below.
I have data for companies in Russell 2000 Index. To define an index for these companies, I have the following mapping -
{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "text" },
        "name": { "type": "text" },
        "CEO": { "type": "text" },
        "CEO_start_date": { "type": "date" },
        "CEO_end_date": { "type": "date" }
      }
    }
  }
}
As the CEO of a company changes, I want to update the end date of the existing document and add a new document with the new start date.
Here,
(1) For such a dataset, what is an ideal ID scheme? Since I want to keep multiple documents, should I consider a (company_id + date) combination as the ID?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?
Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. Keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example common words like "the" will be dropped and word endings like "ing" will be removed. Depending on how you want to search for names in your index you may also want to switch that to keyword. Also note that you have the option of indexing a field twice using BOTH keyword and text if you need to support both search methods.
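A sketch of such a double mapping for ticker (the sub-field name raw is an arbitrary choice): the top-level field stays analyzed text for full-text matching, while ticker.raw is a keyword suitable for exact term queries and aggregations:

```json
"ticker": {
  "type": "text",
  "fields": {
    "raw": { "type": "keyword" }
  }
}
```

A match query on ticker then goes through the analyzer, while a term query on ticker.raw matches the stored value exactly.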
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get much more powerful read operations (searching by any field, full-text search, etc.) but lose relational operations (joins). Also, I find that loading/updating data into ES is slower than an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at using an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for your use case, then adding a new field where you combine the company ID and date would probably work fine.
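If you do go with explicit IDs, a bulk action line for one CEO tenure might look like the following sketch (the index name, ticker, and people are made up for illustration; the document shape follows the mapping in the question):

```json
{ "index": { "_index": "companies", "_type": "company", "_id": "ABCD-2011-08-24" } }
{ "ticker": "ABCD", "name": "Acme Corp", "CEO": "Jane Smith", "CEO_start_date": "2011-08-24" }
```

Using the start date in the ID keeps one document per (company, tenure) pair, and re-indexing the same event twice overwrites rather than duplicates it.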
Question 2: Time based index
Time based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes to 2000 companies you probably won't have very many events. I would probably skip them since it adds a little bit of complexity that doesn't buy you much in this use case.

How to use the elasticsearch type?

Note: it would be much appreciated if you told me in a comment why you think this is a bad question. Please do not just downvote without saying why.
We know there is a concept called type under index, but I do not know why we need it.
At first I thought we use it to organize data, with an index like below:
curl -XPOST 'localhost:9200/customer/USA/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
But in the above situation, we can always eliminate the type and move it into the JSON body, like:
curl -XPOST 'localhost:9200/customer/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe","country":"USA" }
{"index":{"_id":"2"}}
{"name": "Jane Doe","country":"USA" }
'
In this way we can always add a field to replace the type.
Then I thought it might be performance related. I thought that if you split the data into different types, there is less data under each type, so querying each type should be faster. But it is not like that either.
The performance of an Elasticsearch index is related to its shards. So even if you split the data into different types, it is still stored in the same set of shards.
Then why do we need types?
First of all, although Elasticsearch determines the types of fields at runtime, once it has assigned a particular type to a field it will always expect the same type of value for that field. So you need multiple types if you need to store differently shaped data. Secondly, it allows storing multiple types with different mappings in a single index. Besides, it makes querying a particular type easier if you are sure about its schema.
From my understanding of ES, a type is something we can relate to the table concept in a relational database, in which a database can be seen as a group of related tables. Likewise in ES, an index is a group of related types, and each type in an index holds documents that share some common properties or fields.
In your example, for an index, say customer, we can have customers from different countries like the USA, India, the UK, etc. Customer records from each country can be grouped under different types so that they are organized. And when we run a search query for customers in a particular country, we only need to run that query on the type USA. We don't need to look up the whole index to get the data of customers from the USA.
Another example : Let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data. So we are logically organizing the data to different types and looking up to the required type whenever we do a search.
So in general, a type is a logical category/partition of your index whose semantics are completely up to you. It can be defined as a set of documents that have some fields in common.
You may refer to this post for better understanding https://www.elastic.co/blog/index-vs-type
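For what it's worth, searching a single type, as in the /customer/USA examples above, is effectively the same as searching the whole index with a filter on the built-in _type metadata field. A sketch of that equivalent request body (assuming an ES version that still has types, and the 2.x+ bool/filter syntax):

```json
{
  "query": {
    "bool": {
      "filter": { "term": { "_type": "USA" } }
    }
  }
}
```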

ElasticSearch routing based on array field

I've tried to google my question but search results are flooded with articles using very basic implementation of routing. I didn't manage to find anything useful.
Let's say I have an object "product":
{ price: 100, category: 1 }
Routing based on "category" field will work as expected.
Now I change my "product" to:
{ price: 100, categories: [1,2,3] }
How will routing behave based on the "categories" field? Is it safe to do? Are there any side effects (except possible product duplication in search results)?
Will the ES engine split the array and put the product into different shards? Or will it combine all the values into one (for example, "1_2_3") and put the product into a single shard?
I would be really thankful for any thoughts on this topic.

Drawing "opened count" over time given open-event and close-event documents

I have documents modeling the creation of a ticket, such as:
{
"number": 12,
"created_at": "2015-07-01T12:16:17Z",
"closed_at": null,
"state": "open"
}
At some point in the future, a second document models the closing event:
{
"number": 12,
"created_at": "2015-07-01T12:16:17Z",
"closed_at": "2015-07-08T8:12:42Z",
"state": "closed"
}
Problem: I want to draw the history of opened tickets. In the example above, I'd like ticket number 12 to contribute to the count on the whole 2015-07-01 to 2015-07-08 timespan. What I tried:
Bucketing with date_histogram only seems to be able to give the number of tickets created or closed on any given date bucket.
Scripted metrics only seem to allow me to change the metric computation, not the particular bucketing of a document.
This is my very first day playing with Elastic Search and Kibana so I might be missing something obvious. Especially, I cannot tell if buckets act as partitions (hence if a document can only be in a single bucket), and hence if my problem can only be solved by creating additional documents for each datapoint I want to appear on the graph.
Additional note: I have control over the feeding process and the schema if storing additional data can help, but I'd like to avoid doing so if possible.
Though that's not a big deal: either maintain hashing on the basis of dates, or keep created_at as a grouping key for documents created on a given day, so that you can distinguish and query them as you want.
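One way to get the running count of open tickets without adding extra documents is to bucket openings and closings separately and compute the cumulative difference on the client: open(day) = cumsum(opened per day) - cumsum(closed per day). A sketch of the aggregation request body (the term filters on state avoid double-counting, since each closed ticket has both an open and a closed document; exact date_histogram syntax varies by ES version, and the term queries assume state is not analyzed):

```json
{
  "size": 0,
  "aggs": {
    "opened": {
      "filter": { "term": { "state": "open" } },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "created_at", "interval": "day" }
        }
      }
    },
    "closed": {
      "filter": { "term": { "state": "closed" } },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "closed_at", "interval": "day" }
        }
      }
    }
  }
}
```

The two running totals can then be subtracted bucket by bucket in the feeding or charting code.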

many indexes for mongodb refined searches

Referring to this question here:
I am working on a similar site using MongoDB as my main database. As you can imagine, each user object has a lot of fields that need to be searchable, say for example mood, city, age, sex, smoker, drinker, etc.
Now, apart from the problem that there cannot be more than 64 indexes per collection, is it wise to assign an index to all of my fields?
There might be another viable way of doing it: tags (refer to this other question). If I set the index on an array of predetermined tags and then text-search over them, would it be better, since I would be using only ONE index? What do you think? E.g.:
{
name: "john",
tags: ["happy", "new-york", "smoke0", "drink1"]
}
MongoDB doesn't (yet) support index intersection, so the rule is: one index per query. Some of your query parameters have extremely low selectivity, the extreme example being the boolean ones, and indexing those will usually slow things down rather than speed them up.
As a simple approximation, you could create a compound index that starts with the highest-selectivity fields, for instance { city: 1, age: 1, mood: 1 }. However, then you will always have to include a city constraint: if you query for { age, mood } only, the above index won't be used.
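As a sketch in plain JSON (field names taken from the question): the first object is the specification you would pass to createIndex, and the second is a query whose { city, age } prefix can use that index:

```json
{ "city": 1, "age": 1, "mood": 1 }

{ "city": "new york", "age": { "$gte": 30, "$lte": 40 } }
```

A query on { age, mood } alone skips the leading city key and therefore cannot use this index.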
If you can narrow down your result set to a reasonable size using indexes, a scan within that set won't be a performance hog. More precisely, if you say limit(100) and MongoDB has to scan 200 items to fill up those 100, it won't be critical.
The danger lies in very narrow searches across the database - if you have to perform a scan on the entire dataset to find the only unhappy, drinking non-smoker older than 95, things get ugly.
If you want to allow very fine-grained searches, a dedicated search database such as Solr might be a better option.
EDIT: The tags suggestion looks a bit like using a crowbar to me -- maybe the key/value multikey index recommended in the MongoDB FAQ is a cleaner solution:
{
  _id: ObjectId(...),
  attrib: [
    { k: "mood", v: "happy" },
    { k: "city", v: "new york" },
    { k: "smoker", v: false },
    { k: "drinker", v: true }
  ]
}
However, YMMV and 'clean' and 'fast' often don't point in the same direction, so the tags approach might not be bad at all.
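A sketch of the index and a query for the key/value pattern above: the first object is the compound index specification covering both k and v, and the second finds all happy users, using $elemMatch so that k and v must match within the same array element:

```json
{ "attrib.k": 1, "attrib.v": 1 }

{ "attrib": { "$elemMatch": { "k": "mood", "v": "happy" } } }
```

Combining several such conditions takes one $elemMatch clause per attribute (e.g. under $and), since each condition targets a different array element.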
