Elasticsearch application-side join - pagination and aggregations

I have a many-to-many relationship in the DB. When the records are indexed into Elasticsearch I will be using an application-side join. Given that the two result sets returned from Elasticsearch have to be merged at the application level, I lose the out-of-the-box pagination that Elasticsearch offers (on the merged result sets on the application side) and, most importantly, aggregations. I am trying to find a way to still leverage ES aggregations and pagination while using an application-side join and merging the results on the application side.

Related

Filter results in memory to search in Elasticsearch across multiple indexes

I have 2 indexes and they both have one common field (basically a relationship).
Now, since Elasticsearch does not let me filter across multiple indexes, should we store the results in memory in a variable and filter them in Node.js (which basically means that my application itself is now acting as a database server)?
We were previously using MongoDB, which is also a NoSQL DB, but there we were able to manage this through aggregate queries; Elasticsearch does not seem to provide that.
So even if we use both databases together, we have to store their results somewhere in order to filter the data further, because we give users an advanced search feature where they can filter data from multiple collections.
So should we store results in memory to filter the data further? We currently offer advanced search over 100 million records to customers, but without the advanced text search that Elasticsearch provides; we are now planning to offer Elasticsearch text search to customers.
What approach do you suggest for making MongoDB and Elasticsearch work together? We are using Node.js to serve the data.
Or which of these options should we choose:
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all, the simplest way to solve this problem is to make one combined index. It is very easy to do: just insert all the documents from index-2 into index-1, prefixing every field coming from index-2 with something like "idx2_" so that you don't overwrite fields with the same name. You can use an ingest pipeline to do this, or just do it client side. You will only ever do this once.
After that you can perform aggregations on the single index, since all the data lives in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation so that everything that previously went into index-2 also goes into index-1 (with the prefixed fields).
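For illustration, a minimal sketch of that one-time copy using the Node.js @elastic/elasticsearch client (v8-style API); the index names and the "idx2_" prefix are just placeholders, and the painless script simply renames every top-level field:

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    async function mergeIndexes() {
      // Copy every document from index-2 into index-1, prefixing each
      // top-level field with "idx2_" so nothing already in index-1 is overwritten.
      await client.reindex({
        source: { index: 'index-2' },
        dest: { index: 'index-1' },
        script: {
          lang: 'painless',
          source: `
            def keys = new ArrayList(ctx._source.keySet());
            for (def k : keys) {
              ctx._source['idx2_' + k] = ctx._source.remove(k);
            }
          `,
        },
      });
    }

    mergeIndexes().catch(console.error);

Once this has run, a single search against index-1 can paginate and aggregate over data from both original indexes.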
100 million records is trivial for something like Elasticsearch. Doing any kind of "join" client side is NOT RECOMMENDED, as this will obviate the entire value of using ES.
If you need any further help executing this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/Postgres or even MongoDB is to restructure the indices to suit the needs of the queries. Never try to work with multiple indices; ES is not built for that.
HTH.

What are the best ways to do a one-time data load from Oracle to Elasticsearch

We are trying to do a one-time data load from Oracle to Elasticsearch.
We have evaluated Logstash, but the indexing is taking a lot of time.
We have tried Apache NiFi but are facing difficulty loading nested objects and computed results with it.
We are trying to maintain one-to-many relations in nested objects (we have an Oracle query to fetch these results), and we also maintain the result of a hierarchical query as a field in the index.
We are looking for an open-source alternative and an efficient approach to load around 10 tables with 3 million records each from Oracle to Elasticsearch.
Please suggest.

How to join two Elasticsearch inserts?

I am very new to Elasticsearch and come from a SQL background. We are trying to use an ELK stack to monitor a Jenkins server. We use the Elasticsearch report plugin to send a bunch of information about the job, but we also have some custom information that we need to send. How can I join these two pieces of information in Kibana? In a SQL database I would have two tables and join them on a key, but I don't know how to do that in Elasticsearch. Any suggestions?
Generally speaking, joins are the strong suit of relational DBs (aka SQL DBs) and the weak spot of NoSQL stores (Elasticsearch among them). Having said that, ES does support such operations, and if performance is not critical you can try them: Elasticsearch joining queries. In a nutshell:
Create a join-field mapping. This is the equivalent of a foreign key constraint in SQL. Since you have control over the Logstash part, I suggest you make it the parent and the ES report info the child.
Use the has_child query when you query the logs. This type of query acts like the join query in SQL.
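A rough sketch of both steps with the Node.js @elastic/elasticsearch client (v8-style API); the index name, the build/report relation names, and the status field are made up for illustration:

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    async function setupAndQuery() {
      // 1. Mapping with a join field: "build" (the Logstash-fed log document)
      //    is the parent, "report" (the ES report plugin info) is the child.
      await client.indices.create({
        index: 'jenkins',
        mappings: {
          properties: {
            job_relation: {
              type: 'join',
              relations: { build: 'report' },
            },
          },
        },
      });

      // 2. has_child query: return parent "build" documents that have at
      //    least one child "report" matching the condition.
      const result = await client.search({
        index: 'jenkins',
        query: {
          has_child: {
            type: 'report',
            query: { match: { status: 'FAILURE' } },
          },
        },
      });
      console.log(result.hits.hits);
    }

    setupAndQuery().catch(console.error);

Keep in mind that each child document has to be indexed with routing set to its parent's ID so that parent and children end up on the same shard.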

Querying the elastic search before inserting into it

I am using a Spring Boot application to load messages into Elasticsearch. I have a use case where I need to query the Elasticsearch data, get some ID value, and populate it in the Elasticsearch JSON document before inserting that document into Elasticsearch.
Will querying Elasticsearch before insertion be expensive? If yes, is there some other way to approach this issue?
You can use update_by_query to do it in one step.
But otherwise it shouldn't be slow if you do it in two steps (get + update). It depends on many things - how often you do it, how much data is transferred, etc.
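A rough sketch of both variants, shown with the Node.js client for brevity (the official Java client has equivalent calls); the index names, fields, and lookup logic are invented for illustration:

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    // Two-step variant: look up the id value first, then index the new
    // document with that value filled in.
    async function insertWithLookup(message: { orderRef: string; text: string }) {
      const lookup = await client.search<{ orderId: string }>({
        index: 'orders',
        size: 1,
        query: { term: { orderRef: message.orderRef } },
      });
      const orderId = lookup.hits.hits[0]?._source?.orderId;

      await client.index({
        index: 'messages',
        document: { ...message, orderId },
      });
    }

    // One-step variant: insert first, then fill the field in place for all
    // documents that are missing it, using update_by_query with a script.
    async function backfillOrderIds(orderId: string) {
      await client.updateByQuery({
        index: 'messages',
        query: { bool: { must_not: { exists: { field: 'orderId' } } } },
        script: {
          lang: 'painless',
          source: 'ctx._source.orderId = params.orderId',
          params: { orderId },
        },
      });
    }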

How to filter a huge list of ids from Solr at runtime

I have an index for products in Solr. I need to serve a customized list of products for each customer, such that I have to exclude some specific products for each customer.
Currently I am storing this relationship of customer and excluded products in a SQL database and then filtering them in Solr using a terms query. Is there a way I can store this relationship in Solr itself so that I don't have to compute the exclude list from SQL every time?
Something very similar to what we can do in elasticsearch using https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
Possible ways I could think of doing this in Solr:
Keeping a list of customers in the products index itself and filtering on that. But this will really be a pain if I have to reindex all the documents, and the list can be huge.
Another way I could think of is maintaining a separate core for keeping one document per customer and excluded product_id, and performing a join using {!join} to filter out products for a customer. Is that a scalable solution?
What would be the ideal approach for storing this kind of data in Solr?
Are there any performance issues with the SQL DB? It is perfectly fine to query the DB and get the IDs, and send them to Solr. You would avoid complexity and data duplication. You would anyway have to do some computation to send those IDs to Solr as well.
But to answer your question: yes, you could indeed store the excluded product IDs per customer in a separate index. You would use a multi-valued field and update it with atomic updates. If you do that, make sure to keep the indexing schema simple, with no analyzer used for the IDs (just use the string type without any tokenizer or filter).
You do not need a Solr join query. You only have to look up the product IDs for the customer (first query), massage them into a CSV list, and then run the terms query with the product IDs retrieved from the index (second query).
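A sketch of that two-query flow against Solr's HTTP API from Node.js; the customer_exclusions core, its fields, and the use of a plain negated boolean filter in the second query (rather than the {!terms} parser) are all assumptions for illustration:

    // Two round trips: (1) fetch the excluded product ids for a customer,
    // (2) query the products core while filtering those ids out.
    const SOLR = 'http://localhost:8983/solr';

    async function searchProducts(customerId: string, userQuery: string) {
      // 1st query: one exclusion document per customer, with a multi-valued
      // excluded_product_ids field (hypothetical core and schema).
      const exclRes = await fetch(
        `${SOLR}/customer_exclusions/select?` +
          new URLSearchParams({
            q: `customer_id:${customerId}`,
            fl: 'excluded_product_ids',
            rows: '1',
          }).toString()
      );
      const excl = await exclRes.json();
      const excluded: string[] = excl.response.docs[0]?.excluded_product_ids ?? [];

      // 2nd query: filter the excluded ids out of the product search.
      const params = new URLSearchParams({ q: userQuery, rows: '20' });
      if (excluded.length > 0) {
        params.append('fq', `-id:(${excluded.join(' OR ')})`);
      }
      const prodRes = await fetch(`${SOLR}/products/select?${params.toString()}`);
      return (await prodRes.json()).response.docs;
    }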
You need to find the best compromise for your case.
Best Query-Time Performance
You add a multi-valued field to the product index: allowed_users (or forbidden_users), depending on which cardinality you want to minimise.
This requires a full reindex the first time, plus an index update for each user permission change.
To reduce network traffic and optimise those updates you could take a look at atomic updates [1].
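A sketch of such an atomic update through Solr's JSON update API from Node.js; the core name, the forbidden_users field, and the add-distinct modifier are assumptions (atomic updates also impose some schema requirements around stored/docValues fields):

    // Add one user to a product's forbidden_users list without resending
    // the whole product document (Solr atomic update).
    async function forbidProductForUser(productId: string, userId: string) {
      await fetch('http://localhost:8983/solr/products/update?commit=true', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify([
          {
            id: productId,
            forbidden_users: { 'add-distinct': userId }, // or "remove" to undo
          },
        ]),
      });
    }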
Best Index-Time Performance
If the previous approach is not feasible in your case or doesn't satisfy you, you could try to optimise the indexing side.
You can index one document per relationship in a separate collection:
<Id>
<product_id>
<user_id>
You can use a query-time join to filter that collection for the current user and then get back the products to filter out of your query.
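A sketch of that query-time join from Node.js; the collection and field names are assumptions, and the negated _query_ wrapper is one way to turn the joined set into an exclusion filter:

    // products collection: one doc per product (id, name, ...).
    // exclusions collection: one doc per (product_id, user_id) pair.
    async function productsForUser(userId: string, userQuery: string) {
      const params = new URLSearchParams({ q: userQuery, rows: '20' });
      // Join exclusions -> products for this user, then negate the joined
      // set so those products are filtered out of the results.
      params.append(
        'fq',
        `-_query_:"{!join from=product_id to=id fromIndex=exclusions}user_id:${userId}"`
      );
      const res = await fetch(
        `http://localhost:8983/solr/products/select?${params.toString()}`
      );
      return (await res.json()).response.docs;
    }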
So basically, you have already thought about both of these ideas :)
[1] https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html
