Elasticsearch indexed database table column structure

I have a question about the setup of my Elasticsearch index. I have created a table which I index into Elasticsearch through a river. The table is built by a script that queries multiple tables to denormalize the data, so that each row can be indexed by a unique id (1:1 ratio).
An example of a set of fields I have is street, city, state, zip, all of which I can query on. My question is: should I keep those fields individually indexed, concatenate them into one big field such as address that contains all of them, or put in the extra time to set up parent-child indexes?
The use case: a customer with billing info comes in from one direction, and I want to query Elasticsearch to see whether that customer already exists, or at least return the closest result.
I know this question is more conceptual than programming, I just can't find any information on best practices.

Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Keeping separate fields lets you calculate facets and aggregations on them, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to search across the different fields.
In addition to keeping the information in separate fields, I would use multi-fields with an analyzed and a not_analyzed part (fieldname.raw). This again allows for aggregations, facets and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if the field is analyzed, it is indexed as the separate terms ['new', 'york'] (the standard analyzer also lowercases), so a facet or aggregation on it cannot show you all people from 'New York'. What you'd see instead are counts for 'new' and 'york'.
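As a rough illustration of the multi-field idea, the mapping and a matching search body could be built like this. It is only a sketch, assuming Elasticsearch 1.x-style string mappings (the 0.90 documentation linked above uses the older multi_field type instead); the type name customer and the aggregation name customers_per_city are made up for the example.

```python
import json


def multifield(raw_name="raw"):
    """String field with an analyzed main field and a not_analyzed sub-field."""
    return {
        "type": "string",
        "fields": {raw_name: {"type": "string", "index": "not_analyzed"}},
    }


# Field names come from the question; "customer" is a hypothetical type name.
address_fields = ["street", "city", "state", "zip"]
mapping = {"customer": {"properties": {f: multifield() for f in address_fields}}}

# Search on the analyzed fields, aggregate on the exact value (city.raw).
search_body = {
    "query": {"multi_match": {"query": "New York", "fields": address_fields}},
    "aggs": {"customers_per_city": {"terms": {"field": "city.raw"}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

The search still runs against the analyzed fields, while the terms aggregation counts whole values such as 'New York' through the .raw sub-field.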
_all field
There is a special _all field in Elasticsearch which does the concatenation in the background, so you don't have to do it yourself. It can be enabled or disabled in the mapping.
Parent Child relationship
Concerning whether to use objects embedded in the document or a parent-child relationship: I think a parent-child relationship is more appropriate for your case. Arrays of plain inner objects (that is, objects not mapped with the nested type) are stored in a 'flattened' way, i.e. the values from all objects in the array are stored as part of one object. Consider the following example:
You have orders for two clients:
client: 'Samuel Thomson'
orderline: 'Strong Thinkpad'
orderline: 'Light Macbook'
client: 'Jay Rizzi'
orderline: 'Strong Macbook'
With plain inner objects, if you search for clients who ordered 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orders are stored all together, i.e. ['Strong' 'Thinkpad' 'Light' 'Macbook'], with no distinction between the two orderlines.
With parent-child documents (or with orderlines mapped as nested), the orderlines for the same client are not mixed and preserve their identity.
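The effect of flattening in the example above can be imitated in plain Python. This is only a toy simulation, not Elasticsearch itself: it assumes whitespace tokenization with lowercasing and a query that requires all terms to match.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


clients = [
    {"client": "Samuel Thomson", "orderlines": ["Strong Thinkpad", "Light Macbook"]},
    {"client": "Jay Rizzi", "orderlines": ["Strong Macbook"]},
]


def match_flattened(doc, query):
    """All orderline tokens of one client are pooled into a single bag."""
    pooled = {t for line in doc["orderlines"] for t in tokens(line)}
    return all(t in pooled for t in tokens(query))


def match_per_orderline(doc, query):
    """Each orderline is matched on its own, as with separate child documents."""
    wanted = tokens(query)
    return any(all(t in tokens(line) for t in wanted) for line in doc["orderlines"])


flattened_hits = [c["client"] for c in clients if match_flattened(c, "Strong Macbook")]
separate_hits = [c["client"] for c in clients if match_per_orderline(c, "Strong Macbook")]

print(flattened_hits)  # ['Samuel Thomson', 'Jay Rizzi']
print(separate_hits)   # ['Jay Rizzi']
```

With pooled tokens, 'Strong' from one orderline and 'Macbook' from another combine into a false match for Samuel Thomson; keeping orderlines separate avoids that.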

Related

Which is the best way to index the data from relational database table of One to many relationship

Can you please let me know the best way to index the records in Elasticsearch for my scenario?
My scenario is:
1) I need to index around 40 million records from an Oracle table whose entries have one-to-many relationships. The uniqueness of a record is based on a composite key of 4 columns.
2) After indexing, search should support "full text search" on all the fields.
3) Filters and sorting on selected fields need to be supported.
After going through the official documentation I found a couple of options, but I want to know which of the approaches below would be most useful:
1) For each record in the table, create an entry in the Elasticsearch index
2) Create a nested JSON object based on the composite key and then add it to the Elasticsearch index
3) The parent-child relationship mechanism and application-side joins are not suitable for my scenario
Thanks
Girish T S
Your question is not particularly clear, here's how I understand it: you have 40M child records in one table, each with a reference to a parent record.
You want to index your records so as to be able to search for a parent record whose children match certain criteria.
There are two solutions here:
Indexing one document per parent, with all children indexed as nested documents within the parent
Indexing each child record as a separate document, with a parent-child relationship in Elasticsearch
The first solution will have better search performance, but it means that every time a child is updated, the full parent document must be reindexed with all its children.
In any case, you say that a parent-child scheme is not suitable for your case, so you're left with only the first solution.
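A minimal sketch of the first solution, grouping flat table rows into one document per parent before indexing. The column names k1 to k4, the children field name and the '|'-joined _id are all assumptions made for illustration; the sketch also assumes the four key columns identify the parent.

```python
from collections import defaultdict

# Hypothetical names for the four composite-key columns.
KEY_COLUMNS = ("k1", "k2", "k3", "k4")

# The children field must be mapped as nested so each child keeps its identity.
nested_mapping = {"properties": {"children": {"type": "nested"}}}


def build_parent_documents(rows):
    """Group rows by composite key and embed the remaining columns as children."""
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in KEY_COLUMNS)
        child = {k: v for k, v in row.items() if k not in KEY_COLUMNS}
        grouped[key].append(child)

    docs = []
    for key, children in grouped.items():
        source = dict(zip(KEY_COLUMNS, key))
        source["children"] = children
        # A deterministic _id lets a re-import overwrite the same parent document.
        docs.append({"_id": "|".join(str(part) for part in key), "_source": source})
    return docs
```

Each returned entry can then be sent to the index (for example through the bulk API); updating a single child means rebuilding and reindexing its whole parent document, as noted above.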

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string, to match parent entries. Parent-child relations are just described through fields containing each entry's ancestors ids.
The question is the following: if both the search string and the parent search string are provided, is there a way to know, before executing the queries, which one should be executed first in order to return results faster?
For example, a specific parent search might result in only 2 docs/parent entries, after which we can fetch all children matching the search string. In that case we should execute the parent query first and then the entry query.
One option would be to get the count of both queries and then first execute the one with the smaller count, but isn't this solution worse, since each query is going to be executed twice: once for the count and once for the actual results?
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and their id is one of the ancestors ids.
Just to clarify, both parent and child entries have the exact same structure and reside in the same index. We cannot have different indices, since the parent-child relation can be nested up to 10 levels deep, so an entry can be both a parent and a child. An entry looks more or less like:
{
id: "e32452365321",
name: "name",
type: "type",
ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you to upgrade your Elasticsearch version, if possible. A lot has happened since 1.7 and, to be honest, I can't tell whether everything written in the article linked below is valid for such an old version (probably it isn't).
But to your actual question: if I understand you correctly, you are trying to estimate how costly a query is for Elasticsearch? Well, you don't have to. If you provide all 'queries' in one compound query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing I can mention: calculating scores takes time. So if sorting is not based on the Elasticsearch _score, you want to use filters inside a bool query. The same applies if you want to sort only by the _score of parent matches: then you could put the query for children into a filter.
update
Thanks to your example, I now see the problem. Self-referential parent-child relations are unfortunately not supported by Elasticsearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-side joins.
So yes, in general, you want to send the second query with the smallest possible number of ids/terms. Getting counts for both queries is not as bad as you might think, because the results are most likely still cached, but does it actually help? If you're going from child to parent, you would have to count the distinct ancestors (field values), not the number of matching documents.
I would argue that the most expensive operation is very often fetching the result source from disk. So whichever way you go, you should probably fetch only what you need in the first query. So your options are:
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or fetch only the ancestors field of child matches, and use an ids filter in your second query.
Unfortunately, I can't help you more than that, since I don't have enough experience comparing the speed of those approaches. My guess would be that an ids filter might be faster in general. But that's just a guess...
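The two options can be sketched as request bodies like the ones below. This is a hedged sketch using the older filtered query syntax of Elasticsearch 1.x; the function names are made up, and it assumes the ancestors field is indexed so that each id is an exact term. Sending the bodies and paging through results is left out.

```python
def parents_first_queries(parent_search, search):
    """Option 1: fetch parent ids only, then filter entries by their ancestors."""
    first = {
        "_source": False,  # ids are returned with every hit, no source needed
        "query": {"query_string": {"query": parent_search}},
    }

    def second(parent_ids):
        return {
            "query": {
                "filtered": {
                    "query": {"query_string": {"query": search}},
                    "filter": {"terms": {"ancestors": list(parent_ids)}},
                }
            }
        }

    return first, second


def children_first_queries(search, parent_search):
    """Option 2: fetch only the ancestors field, then filter parents by id."""
    first = {
        "_source": ["ancestors"],
        "query": {"query_string": {"query": search}},
    }

    def second(ancestor_ids):
        return {
            "query": {
                "filtered": {
                    "query": {"query_string": {"query": parent_search}},
                    "filter": {"ids": {"values": sorted(ancestor_ids)}},
                }
            }
        }

    return first, second


def collect_ancestor_ids(hits):
    """Distinct ancestor ids from hits, accepting 'id1 id2' strings or lists."""
    ids = set()
    for hit in hits:
        value = hit["_source"].get("ancestors", [])
        ids.update(value.split() if isinstance(value, str) else value)
    return ids
```

In both cases the first query returns as little as possible per hit, and the second query carries the collected ids as a non-scoring filter.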

What does the "Type" mean in Elasticsearch?

I am totally confused by Elasticsearch's documentation.
In Basic Concepts: Type, types are described as somewhat like collections in MongoDB:
In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
But in Types and Mappings: Type Takeaways, it says:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
Don't the "user" and "blog" types mentioned above have mutually exclusive sets of fields?
For example: there are "name", "age" fields for "user", and "createdAt", "content" for "blog".
I used to believe the mapping between Elasticsearch and MongoDB is:
index <=> database
type <=> collection
Isn't that right?
If not, what is the recommended mapping between them?
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems.
The type is just another field in Elasticsearch, at the very basic level. When you do GET /my_index/my_type/_search ES will run a pre-filter for my_type value for field _type - it's like an automatic filter.
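The automatic pre-filter can be pictured roughly as follows. This is only an illustration of the idea using the older filtered query syntax, not a statement of what Elasticsearch does internally; the helper name is made up.

```python
def with_type_filter(query, type_name):
    """Wrap a query so that it only matches documents whose _type is type_name."""
    return {
        "query": {
            "filtered": {
                "query": query,
                "filter": {"term": {"_type": type_name}},
            }
        }
    }


# Searching /my_index/_search with this body is roughly what
# GET /my_index/my_type/_search does with the plain match_all query.
body = with_type_filter({"match_all": {}}, "my_type")
print(body)
```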
Don't think about indices and types as databases and tables in SQL world, because they are not that.
If you have type1 with fields f1 and f2, and type2 with fields f1 and f3, the index will contain documents with fields f1, f2 and f3. Why this matters: when a document is scored for a query that searches field f1, the term frequencies for f1 are global (across both type1 and type2). So if you search for some value in f1 of type1, the score you get back is also slightly influenced by the values of f1 in type2.
Also, please, don't translate a set of SQL tables to ES by simply following the primary key/foreign key approach to define parent/child relationships in ES.
You're right, index == database and type == collection for Elasticsearch. In RDBMS terms, an index is a database and a type can be a table which contains many rows (documents in Elasticsearch).
You could have a different index maintaining user information, with the "name", "age" and other such fields generally attributed to a person, and a different one for blogs with "createdAt", "content", etc. Yet, you might want to have a "user" field inside each blog document to be able to identify the person who posted it. Later, you can apply application-side joins, if need be.

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON file, which are added as children of a parent document. I just post the items to the index, without taking care of the id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target-parent and add new documents, if there is no equal child.
I wondered if there is a way, to let elasticsearch handle duplicates.
Duplication needs to be handled in ID handling itself.
Choose a key that is unique for a document and use it as the _id. If the key is too large or consists of multiple keys, create a SHA checksum of it and use that as the _id.
If you already have duplicates in the index, you can use a terms aggregation with a nested top_hits aggregation to detect them.
You can read more about this approach here.
When a document is added to Elasticsearch with an explicit ID, Elasticsearch first checks whether a document with that ID already exists. If it does, the existing document is replaced instead of a duplicate being added (the _version field is incremented at the same time to track the number of updates that have occurred). You will therefore need to keep track of your document IDs somehow and use the same ID for matching documents to eliminate the possibility of duplicates.
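A minimal sketch of deriving the _id from the fields that make an item unique. SHA-1 is chosen here only as one example of a SHA checksum, and the function and field names are hypothetical.

```python
import hashlib
import json


def document_id(item, key_fields):
    """Deterministic _id: SHA-1 hex digest over the values of the key fields."""
    key_values = [item[field] for field in key_fields]
    # Serializing as a JSON list keeps ("ab", "c") distinct from ("a", "bc").
    payload = json.dumps(key_values, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


item = {"title": "Release notes", "date": "2015-01-01", "body": "first version"}
print(document_id(item, ["title", "date"]))
```

Parsing the same JSON again yields the same ids, so re-posting an existing item overwrites it and only genuinely new items create new documents.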

parent-child documents relation retrieval

Could you help me with a little problem regarding parent-child document relations?
Considering JSON, I have objects, each of them contains an array of sub-objects. Sub-objects contain some text fields.
I need to maintain full-text-search on these objects and construct snippets. I need highlighting for building snippets.
If I use nested objects, highlighting does not deal with them.
Therefore, I use Parent-Child relationships.
Now I need to retrieve parent documents whose children match the query_string. Furthermore, I need to get the highlighted fields of the matched children and associate each child with its corresponding parent to construct snippets in my application.
Is it possible to accomplish my goal in one query?
I think you should consider using the children aggregation. With it you can retrieve child items grouped under their parents. Since it's an aggregation, you are not able to get the whole document (just the id), but you do retrieve the relationship. Then, with another query, you can quickly get the document details.
Link here : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html
And more details : https://www.elastic.co/guide/en/elasticsearch/guide/current/children-agg.html

Resources