Aggregating on entities within Kibana using values in nested entities - elasticsearch

An ElasticSearch index contains a Product entity. Each product has an array of Component entities.
A component may contain an optional outOfStock field.
Given the following example:
"Product":
"name": "blue_toy"
"Components": [
{
"partnumber": "100"
"supplier": "smith and sons"
"outOfStock": "true"
}
{
"partnumber": "200"
"supplier": "smith and sons"
}]
}
"Product":
"name": "green_toy"
"Components": [
{
"partnumber": "300"
"supplier": "smith and sons"
}]
}
blue_toy cannot be built because one part is unavailable.
I want to show in a chart how many products cannot be built, as opposed to the number which can be built.
Given that even one unavailable component makes the entire product unbuildable, the distribution in the above example would be 50% - 50%.
Note that this is different from how many components of the total set are out of stock (which would be 33% - 66%).
In essence, the question is how to mark or flag a root entity based on the contents of one of its nested entities.
How could one do this in Kibana?
Thanks

I don't know if it will fit your example, but I once had a similar problem which I solved with the "copy_to" parameter.
In your example, you have to change the mapping of Product to add a "copy_to" to your "outOfStock" field.
It will create a field (with a specified name) in the root document containing your "outOfStock" value.
This field is added at indexing time, and you can then say that if the field created by the "copy_to" is "true", the Product cannot be built.
See: https://www.elastic.co/guide/en/elasticsearch/reference/1.4/mapping-core-types.html
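As a rough sketch of what that mapping change could look like (the index and field names here are assumptions, and Components is left as a plain object because copy_to cannot copy out of a nested document to the root in recent Elasticsearch versions):
PUT products
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "product_out_of_stock": { "type": "keyword" },
      "Components": {
        "properties": {
          "partnumber": { "type": "keyword" },
          "supplier": { "type": "keyword" },
          "outOfStock": {
            "type": "keyword",
            "copy_to": "product_out_of_stock" // copied to the root document at index time
          }
        }
      }
    }
  }
}
In Kibana you could then build a terms or filters visualization that splits products where product_out_of_stock exists (cannot be built) from those where it is absent (can be built).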

Related

Elastic Search indexing for many different mapping types

I have implemented something like Class and Instance logic in my application, where I create an object named Category which acts as a blueprint for its instances.
Users are free to create as many Categories as they like, with whatever fields, so I used to create one new TYPE for each category in my Elasticsearch index mapping until mapping types were deprecated in the latest upgrades.
With the latest upgrades of ES, I can think of only these 2 approaches -
creating one index for each category
keeping one object-type field, named after the TYPE, that holds the fields for each category, and updating this one mapping every time.
I am trying to decide which approach to take for the ES upgrade from version 5 to 7 while keeping this dynamic nature of my data modelling. Searches would be governed by the TYPE string, which is a system-generated ID for each category, hence the need to group fields based on the category they belong to.
OLD MAPPINGS - NOW DEPRECATED
first one - one for each TYPE (category)
{
  "type_cat1": {
    "dynamic": "strict",
    "mapping": {
      "field11": {...}
    }
  }
}
second one and so on
{
  "type_cat2": {
    "dynamic": "strict",
    "mapping": {
      "field21": {...}
    }
  }
}
NEW MAPPING WITH OBJECTS FOR EACH OLD TYPE
{
  "mapping": {
    "properties": {
      "type_cat1": {
        "properties": {
          "field11": {...}
        }
      },
      "type_cat2": {
        "properties": {
          "field21": {...}
        }
      }
    }
  }
}
ALTERNATIVE NEW MAPPING - ONE INDEX PER CATEGORY (not more than 500)
One index would be created separately for each category...
Please advise if a better approach is out there, or which one to choose among these...
I have a similar use-case at my workplace, where the user can create an object with any number of fields, and each field can be of any datatype.
Our approach is similar to one of yours:
All categories will be mapped to a single index.
Whenever a new object is created, the index mappings are updated to accommodate the new object (a category in your case).
This is what our mappings look like when molded to your needs:
{
  "mappings": {
    "properties": {
      "category": {     // this is a field present in all documents
        "type": "keyword"
      },
      "createdTime": {  // this is a field present in all documents
        "type": "date"
      },
      "id": {           // this is a field present in all documents
        "type": "long"
      },
      "fields": {
        "properties": {
          "type_cat1": {
            "properties": {
              "field1": {...},
              "field2": {...}
            }
          },
          "type_cat2": {
            "properties": {
              "field1": {...},
              "field2": {...}
            }
          }
          // ...one object per category
        }
      }
    }
  }
}
Get all records of a certain category:
"category": "cat1"
Get all records of cat1 where field2 == "dummy_value":
"category": "cat1" AND "fields.type_cat1.field2.keyword": "dummy_value"
When a new category is created, the fields part of our mappings gets updated.
Extracting out the common fields (category, createdTime, id) eliminates redundancy in mappings.
Some worthy points:
As the number of unique categories is only 500, you can also go with a separate index per category. This is more beneficial if there are going to be many records (> 100,000) per category.
If the categories are sparse in nature (each category has only a few records), then ES can easily handle everything in a single index.
If we assume 50 fields per category on average, then the total fields in the single index approach will be 50*500 = 25000. This is a manageable number.
Of course, in the end, many things will depend upon resources allocated to the cluster.

ElasticSearch: index document correctly and create correct search request

I apologize for my silly question, but as always, being a newbie to a new SW stack, it is hard to find an exact answer quickly. So please help.
Question 1:
I have many documents of the following shape. Quite simple: a description of a company and some products that the company offers. The question is how to post this document into Elasticsearch (ES)? The nested structure carries important meaning, but ES does not seem to accept it as is.
Question 2:
It is about the search request itself. I need to search through all such documents, looking for an appropriate phrase in the "description" field and also for particular types of products.
For example, I need to find all companies whose description includes the phrase "South Africa", which offer fruits, and at the same time offer only onions from the vegetable category.
The "description" field is just text, whereas everything under products comes from pre-defined, known lists. There can be many different categories and names under categories.
What could the search request be in such a case?
{
"description": "The best goods from Africa",
"products": [
{
"category": "fruits",
"name": [ "oranges", "cocos" ]
},
{
"category": "vegetables",
"name": [ "cabbage", "cucumbers", "onion" ]
},
...
]
}
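A rough sketch of what such a request could look like, assuming products is mapped with "type": "nested" and the index is called companies (both assumptions); the "only onions" restriction would need additional must_not clauses and is left out here:
GET companies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "description": "South Africa" } },
        {
          "nested": {
            "path": "products",
            "query": { "match": { "products.category": "fruits" } }
          }
        },
        {
          "nested": {
            "path": "products",
            "query": {
              "bool": {
                "must": [
                  { "match": { "products.category": "vegetables" } },
                  { "match": { "products.name": "onion" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}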

How to index and query Nested documents in the Elasticsearch

I have 1 million users in a Postgres table. It has around 15 columns of different datatypes (like integer, array of strings, string). Currently I am using normal SQL queries to filter the data as per my requirements.
I also have an "N" number of projects (max 5 projects) under each user. I have indexed these projects in Elasticsearch and am doing fuzzy search on them. Currently, for each project (text file) I have created a document in Elasticsearch.
Both systems are working fine.
Now I need to query the data across both systems. Ex: I want all the records which have the keyword java (in Elasticsearch) and with experience of more than 10 years (available in Postgres).
Since the user count will be increasing drastically, I have moved all the Postgres data into Elasticsearch.
Filters will only be applied on fields related to the user (not on project-related fields).
Now I need to nest the projects under the corresponding users. I tried parent-child types and it didn't work for me.
Could anyone help me with the following things?
What would be the correct way of indexing projects associated with the users?
Since each project document has a field called category, is it possible to get the matched category name in the response?
Is there any better way to implement this?
From your description, we can tell that the "base document" is based on users.
Now, regarding your questions:
Based on what I said before, you can add all the projects associated with each user as an array, like this:
{
  "user_name": "John W.",
  ..., #More information from this user
  "projects": [
    {
      "project_name": "project_1",
      "role": "Dev",
      "category": "Business Intelligence"
    },
    {
      "project_name": "project_3",
      "role": "QA",
      "category": "Machine Learning"
    }
  ]
},
{
  "user_name": "Diana K.",
  ..., #More information from this user
  "projects": [
    {
      "project_name": "project_1",
      "role": "Project Leader",
      "category": "Business Intelligence"
    },
    {
      "project_name": "project_4",
      "role": "DataBase Manager",
      "category": "Mobile Devices"
    },
    {
      "project_name": "project_5",
      "role": "Project Manager",
      "category": "Web services"
    }
  ]
}
The goal of this structure is to add all of the user's info to each document, no matter whether the info is repeated. Doing this will allow you to bring back, for example, all the users that work on a specific project with queries like this:
{
  "query": {
    "match": {
      "projects.project_name": "project_1"
    }
  }
}
Yes. Like the query above, you can match all the projects by their "category" field. However, keep in mind that since your base document is about users, it will bring back the whole user document.
For that case, you might want to use the Terms aggregation, which brings back the unique values of certain fields. This can be combined with a query, like this:
{
  "size": 0, #Set this to 0 since you want to focus on the aggregation's result
  "query": {
    "match": {
      "projects.category": "Mobile Devices"
    }
  },
  "aggs": {
    "unique_projects_names": {
      "terms": { "field": "projects.project_name.keyword" }
    }
  }
}
That last query will bring back, in the aggregation results, all the unique project names with the category "Mobile Devices".
You can create a new index where you'll store all the information related to your projects. However, the relationships between users and projects won't be easy to maintain (remember that ES is NOT intended to be a structured or relational DB, like SQL), and the queries will become very complex, even if you decide to name both of your indices (users and projects) in a way that lets you query them with a wildcard.
EDIT: Additionally, you could consider storing all the info related to your projects in Postgres and making the calls separately: first get the project ID (or name) from ES and then the project's info from Postgres (since I assume that is the info least likely to change).
Hope this is helpful! :D

Index main-object, sub-objects, and do a search on sub-objects (that returns sub-objects)

I have an object like this (simplified here). Each strain has many chromosomes, which have many loci, which have many features, which have many products... Here I just put one of each.
The structure in json is:
{
"name": "my strain",
"public": false,
"authorized_users": [1, 23, 51],
"chromosomes": [
{
"name": "C1",
"locus": [
{
"name": "locus1",
"features": [
{
"name": "feature1",
"products": [
{
"name": "product1"
//...
}
]
}
]
}
]
}
]
}
I want to add this object to Elasticsearch. For the moment I have added the objects separately: locus, features and products. That works for search (I want to type a keyword and look in the name of locus, the name of features, and the name of products), but I need to duplicate data like public and authorized_users in each subobject.
Can I index the whole object in Elasticsearch and still do a search at each level (locus, features and products)? And get them back individually (not the whole Strain object)?
Yes, you can search at any level (i.e., with a query on a field like "chromosomes.locus.name").
But as you have arrays at each level, you will have to use nested objects (and nested queries) to get exactly what you want, which is a bit more complex:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.3/query-dsl-nested-query.html
For your last question, no, you cannot get subobjects individually; Elasticsearch returns the whole JSON source object.
If you only want data from subobjects, you will have to use nested aggregations.
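As a rough sketch, assuming chromosomes and chromosomes.locus are both mapped with "type": "nested" and the index is called strains (an assumption), a search at the locus level could look like this:
GET strains/_search
{
  "query": {
    "nested": {
      "path": "chromosomes",
      "query": {
        "nested": {
          "path": "chromosomes.locus",
          "query": {
            "match": { "chromosomes.locus.name": "locus1" }
          }
        }
      }
    }
  }
}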

Elasticsearch find uniqueness of content

I have a system that pulls in articles and stores them in an elasticsearch index. When a new article is available I want to determine how unique the article's content is before I publish it on my site, so that I can try and reduce duplicates.
Currently I search for the new article's title against the index using a min_score threshold, and if there are 0 results then it can be published:
{
"index": "articles",
"type": "article",
"body": {
"min_score": 1,
"query": {
"multi_match": {
"query": "[ARTICLE TITLE HERE]",
"type": "best_fields",
"fields": [
"title^3",
"description"
]
}
}
}
}
This is not very accurate as you can imagine, most articles get published with a fair amount of duplicates.
How do you think I could improve this (if at all)?
Well, you need to handle this before indexing the document.
The best solution would be to derive the _id from the title, so that if the same title already exists, the new document can be discarded (using the _create API), or all duplicate documents can be discarded.
Even better, you can use an upsert so that the existing document is updated with the duplicate's info; for example, you can record that news from this source has also appeared in that other source.
You can see a practical example of the same here.
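A minimal sketch of the _create idea, assuming the _id is derived from a hash of the title (the hash shown is a placeholder, and the index/type names follow the question):
PUT articles/article/<hash-of-title>?op_type=create
{
  "title": "[ARTICLE TITLE HERE]",
  "description": "..."
}
A second article whose title hashes to the same _id makes this request fail with a version conflict instead of indexing a duplicate, so it would simply not be published.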
