How to index and query nested documents in Elasticsearch

I have 1 million users in a Postgres table. It has around 15 columns of different data types (integer, string, array of strings). Currently I use plain SQL queries to filter the data as required.
Each user also has up to 5 projects. I have indexed these projects in Elasticsearch and run fuzzy searches on them. Currently, I create one Elasticsearch document per project (each project is a text file).
Both systems are working fine.
Now I need to query the data across both systems. For example: I want all the records that contain the keyword java (in Elasticsearch) and have more than 10 years of experience (available in Postgres).
Since the number of users will grow drastically, I have moved all the Postgres data into Elasticsearch.
Some queries may filter only on user-related fields, without touching any project-related fields.
Now I need to create nested projects for the corresponding users. I tried parent-child types, but that didn't work for me.
Could anyone help me with the following?
1. What is the correct way of indexing the projects associated with the users?
2. Since each project document has a field called category, is it possible to get the matched category name in the response?
3. Is there a better way to implement this?

From your description, the "base document" should be the user.
Now, regarding your questions:
1. Given that, you can add all the projects associated with each user as an array, like this:
{
  "user_name": "John W.",
  ..., #More information from this user
  "projects": [
    {
      "project_name": "project_1",
      "role": "Dev",
      "category": "Business Intelligence"
    },
    {
      "project_name": "project_3",
      "role": "QA",
      "category": "Machine Learning"
    }
  ]
},
{
  "user_name": "Diana K.",
  ..., #More information from this user
  "projects": [
    {
      "project_name": "project_1",
      "role": "Project Leader",
      "category": "Business Intelligence"
    },
    {
      "project_name": "project_4",
      "role": "DataBase Manager",
      "category": "Mobile Devices"
    },
    {
      "project_name": "project_5",
      "role": "Project Manager",
      "category": "Web services"
    }
  ]
}
The goal of this structure is to keep all of a user's information in a single document, even if some of it is repeated across documents. This lets you retrieve, for example, all the users who work on a specific project with a query like this:
{
  "query": {
    "match": {
      "projects.project_name": "project_1"
    }
  }
}
2. Yes. As in the query above, you can match projects by their "category" field. However, keep in mind that since your base document is the user, the query will return the whole user document.
In that case, you might want to use the terms aggregation, which returns the unique values of a given field. It can be combined with a query, like this:
{
  "size": 0, #Set this to 0 since you want to focus on the aggregation's result.
  "query": {
    "match": {
      "projects.category": "Mobile Devices"
    }
  },
  "aggs": {
    "unique_projects_names": {
      "terms": { "field": "projects.project_name" }
    }
  }
}
That last query returns, in the aggregation results, the unique project names of the users that have a project in the category "Mobile Devices". Note that the aggregated field must be aggregatable (for example, mapped as keyword), and that unless "projects" is mapped as nested, the buckets also include the other projects of those matching users, because arrays of objects are flattened at index time.
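To get back only the projects that matched (and therefore their category) rather than just the whole user document, one option is to map "projects" as a nested field and request inner_hits. The following is a minimal sketch that only builds the request bodies as Python dicts; the typeless (7.x or later style) mapping, the keyword field types, and the helper name nested_category_query are assumptions for illustration, not something taken from the question.

```python
import json

# Assumed mapping: "projects" is declared as a nested field so that each
# project object is matched on its own instead of being flattened.
USERS_MAPPING = {
    "mappings": {
        "properties": {
            "user_name": {"type": "text"},
            "projects": {
                "type": "nested",
                "properties": {
                    "project_name": {"type": "keyword"},
                    "role": {"type": "keyword"},
                    "category": {"type": "keyword"},
                },
            },
        }
    }
}


def nested_category_query(category):
    """Build a search body matching users with a project in `category`.

    inner_hits asks Elasticsearch to also return the project objects that
    matched, so the matched category can be read from the response.
    """
    return {
        "query": {
            "nested": {
                "path": "projects",
                "query": {"term": {"projects.category": category}},
                "inner_hits": {},
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent to the index's _search endpoint.
    print(json.dumps(nested_category_query("Mobile Devices"), indent=2))
```

With this mapping, user-level filters (for example on experience) can still be added next to the nested clause inside a bool query.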
3. You could create a new index to store all the information related to your projects. However, the relationships between users and projects won't be easy to maintain (remember that ES is NOT intended to be a relational database, like SQL) and the queries will become very complex, even if you name both of your indices (users and projects) so that you can query them together with a wildcard.
EDIT: Additionally, you could consider storing all the project information in Postgres and doing the lookups separately: first get the project ID (or name) from ES, then fetch the project's details from Postgres (since I assume that is the information least likely to change).
Hope this is helpful! :D

Related

Elastic/Opensearch: HowTo create a new document from an _ingest/pipeline

I am working with Elastic/Opensearch and want to create a new document in a different index from within an _ingest/pipeline.
I found no help on the web...
All my documents (from Filebeat) are parsed and modified at the beginning by a pipeline, let's say "StartPipeline".
Triggered by a value in a field of the incoming document, let's say "Start", I want to store that value in a special way by creating a new document in a different long-term index, together with some more information from the triggering document.
I found ways to do this manually from the console (update_by_query / reindex / Painless scripts), but it has to be triggered by an incoming document...
Perhaps this is easier to understand. In my head it looks something like this:
PUT _ingest/pipeline/StartPipeline
{
  "description" : "create a document in/to a different index",
  "processors" : [ {
    "PutNewDoc" : {
      "if": "ctx.FieldThatTriggers == 'start'",
      "index": "DestinationIndex",
      "_id": "123",
      "document": {
        "message": "",
        "script": "start",
        "server": "alpha",
        ...
      }
    }
  } ]
}
Does anyone have an idea?
And sorry, I am not a native speaker; I am from Germany.

Is there an Elasticsearch standard solution to load recently changed relational data

I have the following tables, which hold millions of records and change frequently. Is there a way to load that data into Elasticsearch (with eventual consistency) with Spring Boot, both initially and incrementally?
Tables :
Employee
Role
contactmethod (Phone/email/mobile)
channel
department
Status
Address
The document will look like this:
{
  "id": 1,
  "name": "tom john",
  "Contacts": [
    {
      "mobile": 123,
      "type": "MOBILE"
    },
    {
      "phone": 223333,
      "type": "PHONE"
    }
  ],
  "Address": [
    {
      "city": "New york",
      "ZIP": 12343,
      "type": "PERMANENT"
    },
    {
      "city": "New york",
      "ZIP": 12343,
      "type": "TEMPORARY"
    }
  ]
}
.. similar data for the ROLE, DEPT etc. tables
How do I make sure that a change in the relational DB, e.g. to the mobile number of "tom john", is propagated to Elasticsearch?
You should have a background job in your application that pulls the data from the DB (you know when there is a change in the DB, of course) and, after whatever filtering or massaging you need, reindexes it into your Elasticsearch index.
Alternatively, you can use Logstash with its JDBC input plugin to keep your data in sync; please refer to the Elastic blog on how to do it.
The first is a flexible but not out-of-the-box solution, while the second works out of the box. Both approaches have pros and cons, so choose whichever best fits your use case.
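As a rough illustration of the background-job approach, here is a minimal sketch in Python (the question mentions Spring Boot; the same logic would apply there). It assumes every row carries an updated_at timestamp, which is not stated in the question, and it only prepares the newline-delimited payload for the _bulk API instead of sending it. The function names and sample rows are made up for the example.

```python
import json
from datetime import datetime


def changed_since(rows, last_sync):
    """Return the rows whose `updated_at` is newer than the last sync time."""
    return [row for row in rows if row["updated_at"] > last_sync]


def to_bulk_lines(rows, index_name):
    """Turn rows into newline-delimited JSON for the Elasticsearch _bulk API.

    Each document is preceded by an "index" action line; reusing the row's
    primary key as _id means a changed row overwrites its older version.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row["id"])}}
        lines.append(json.dumps(action))
        # The sync timestamp is bookkeeping only, so it is left out of the doc.
        doc = {key: value for key, value in row.items() if key != "updated_at"}
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Made-up sample rows standing in for the result of a SQL query.
    rows = [
        {"id": 1, "name": "tom john", "updated_at": datetime(2024, 1, 2)},
        {"id": 2, "name": "jane roe", "updated_at": datetime(2023, 12, 30)},
    ]
    recent = changed_since(rows, last_sync=datetime(2024, 1, 1))
    print(to_bulk_lines(recent, "employees"), end="")
```

In a real job the filter would be pushed into the SQL query itself, and the last sync time would be persisted between runs.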

Using Elasticsearch to build flow/funnel results based on unique identifiers

I want to be able to return a set of counts of individual documents from a single index based on a previous set of results, and am wondering if there is a way to do it without running a separate query for each.
So, given a data set like this (simplified version of my ES documents):
{
"name": "visit",
"sessionId": "session1"
},
{
"name": "visit",
"sessionId": "session2"
},
{
"name": "visit",
"sessionId": "session3"
},
{
"name": "click",
"sessionId": "session1"
},
{
"name": "click",
"sessionId": "session3"
}
What I would like to do is be able to search for name: visit and give a count of all those. That part is easy. But I would also like to be able to now count my name: click docs that have the sessionId of the name: visit result set and return a count of how many of those name: click there were as well as the name: visit.
Is there an easy way to do this? I have looked at aggregation APIs but they all seem to not quite fit my needs. There also seems to be a parent/child relationship but it doesn't apply to my situation since both documents I want to individually get counts of are of the same type.
Expected result would be something like this:
{
"count": {
// total number of visit events since this is my start point
"visit": 3,
// the amount of click results that have sessionId
// matching my previous search's sessionId values
"click": 2
}
}
At first glance, you need to do this in two queries:
the first aggregation query to retrieve the sessionIds and
a second aggregation query filtered with those sessionIds to find the count of clicks.
I don't think it's a big deal to run those two queries, but that depends on how much data you have and how many sessionIds you want to retrieve at once.
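The two steps could be sketched as follows. This is only an illustration under assumptions: name and sessionId are taken to be keyword fields, the function names are made up, and the request bodies are built as plain Python dicts without being sent anywhere. The funnel_counts helper replays the same logic in memory on the sample documents from the question.

```python
def session_ids_query(event_name, max_sessions=10000):
    """Step 1: collect the distinct sessionIds of events named `event_name`."""
    return {
        "size": 0,
        "query": {"term": {"name": event_name}},
        "aggs": {
            "sessions": {"terms": {"field": "sessionId", "size": max_sessions}}
        },
    }


def count_query(event_name, session_ids):
    """Step 2: body for the _count API, restricted to the given sessionIds."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name": event_name}},
                    {"terms": {"sessionId": list(session_ids)}},
                ]
            }
        }
    }


def funnel_counts(docs, first, second):
    """Pure-Python version of the same two steps, for checking the logic."""
    first_docs = [d for d in docs if d["name"] == first]
    sessions = {d["sessionId"] for d in first_docs}
    second_docs = [
        d for d in docs if d["name"] == second and d["sessionId"] in sessions
    ]
    return {first: len(first_docs), second: len(second_docs)}


if __name__ == "__main__":
    docs = [
        {"name": "visit", "sessionId": "session1"},
        {"name": "visit", "sessionId": "session2"},
        {"name": "visit", "sessionId": "session3"},
        {"name": "click", "sessionId": "session1"},
        {"name": "click", "sessionId": "session3"},
    ]
    print(funnel_counts(docs, "visit", "click"))  # {'visit': 3, 'click': 2}
```

For a very large number of sessions, the list of ids passed to the second query would need to be split into batches.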

Elasticsearch find uniqueness of content

I have a system that pulls in articles and stores them in an elasticsearch index. When a new article is available I want to determine how unique the article's content is before I publish it on my site, so that I can try and reduce duplicates.
Currently I search for the new article title against the index using a min_score filter and if there are 0 results then it can be published:
{
"index": "articles",
"type": "article",
"body": {
"min_score": 1,
"query": {
"multi_match": {
"query": "[ARTICLE TITLE HERE]",
"type": "best_fields",
"fields": [
"title^3",
"description"
]
}
}
}
}
This is not very accurate, as you can imagine: most articles get published with a fair number of duplicates.
How do you think I could improve this (if at all)?
Well, you need to handle this before indexing the document.
My best suggestion would be to derive the _id from the title, so that if the same title already exists, the new document can be discarded (using the _create API) or all such documents can be discarded.
Even better, you can use an upsert so that the existing document is updated with the duplicate's info; for example, you can record that news from this source has also appeared in that source.
You can see some practical examples of this here.
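A minimal sketch of the title-based _id idea, assuming a simple normalization (lower-casing and collapsing punctuation and whitespace) followed by a SHA-1 hash. Both the normalization rule and the helper names are illustrative choices, not part of the answer above.

```python
import hashlib
import re


def title_to_id(title):
    """Derive a stable document _id from an article title.

    The title is lower-cased and runs of non-word characters are collapsed
    to single spaces, so trivial formatting differences map to the same id.
    Only near-identical titles are caught; rewordings are not.
    """
    normalized = re.sub(r"\W+", " ", title.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def create_action(index_name, article):
    """Bulk "create" action: it fails for an _id that already exists."""
    return {
        "create": {"_index": index_name, "_id": title_to_id(article["title"])}
    }


if __name__ == "__main__":
    first = title_to_id("Hello,  World!")
    second = title_to_id("hello world")
    print(first == second)  # True
```

Because the id depends only on the normalized title, a rejected create tells you the article is a likely duplicate without running a search first.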

Aggregating on entities within Kibana using values in nested entities

An ElasticSearch index contains a Product entity. Each product has an array of Components entities.
A component may contain an optional outOfStock field.
Given the following example:
"Product": {
  "name": "blue_toy",
  "Components": [
    {
      "partnumber": "100",
      "supplier": "smith and sons",
      "outOfStock": "true"
    },
    {
      "partnumber": "200",
      "supplier": "smith and sons"
    }
  ]
}
"Product": {
  "name": "green_toy",
  "Components": [
    {
      "partnumber": "300",
      "supplier": "smith and sons"
    }
  ]
}
blue_toy cannot be built because one part is unavailable.
I want to show in a chart how many products cannot be built, as opposed to the number that can be built.
Given that if even one component is unavailable the entire product cannot be built, in the above example the distribution would be 50% - 50%.
Note that this is different from how many components of the total set are out of stock (which would be 33% - 66%).
In essence, the question is how to mark or flag a root entity based on the contents of one of its nested entities.
How could one do this in Kibana?
Thanks
I don't know if it will fit your example, but I once had a similar problem, which I solved with the "copy_to" parameter.
In your example, you have to change the mapping of Product to add a "copy_to" to your "outOfStock" field.
It will create a field (with the name you specify) in the root document holding your "outOfStock" value.
This field is added at indexing time, and you can then say that if the field created by "copy_to" is "true", the Product cannot be built.
See: https://www.elastic.co/guide/en/elasticsearch/reference/1.4/mapping-core-types.html
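A sketch of what such a mapping and a matching aggregation might look like, written as Python dicts. Several things here are assumptions: the typeless (7.x or later style) mapping syntax rather than the 1.4 syntax of the linked page, Components mapped as a plain object array (behaviour with the nested type may differ by version), the document root holding name and Components directly, and the target field name anyComponentOutOfStock, which is made up for the example.

```python
# Hypothetical root-level field that collects every component's outOfStock value.
FLAG_FIELD = "anyComponentOutOfStock"

# Assumed mapping: outOfStock values are copied to the root-level flag field.
PRODUCT_MAPPING = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},
            FLAG_FIELD: {"type": "boolean"},
            "Components": {
                "properties": {
                    "partnumber": {"type": "keyword"},
                    "supplier": {"type": "keyword"},
                    "outOfStock": {"type": "boolean", "copy_to": FLAG_FIELD},
                }
            },
        }
    }
}

# Two named buckets: products with at least one out-of-stock component,
# and all remaining products (flag missing or never true).
BUILDABLE_AGG = {
    "size": 0,
    "aggs": {
        "buildable": {
            "filters": {
                "filters": {
                    "cannot_be_built": {"term": {FLAG_FIELD: True}},
                    "can_be_built": {
                        "bool": {"must_not": {"term": {FLAG_FIELD: True}}}
                    },
                }
            }
        }
    },
}
```

In Kibana, the same split can be expressed by bucketing a chart on two filters over the copied field.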

Resources