How can I get information from 2 different Elasticsearch indexes?

So, I have 2 indexes in my Elasticsearch server.
I need to gather the results from the first index, and for each result I need to gather info from the second index.
How can I do that? I tried the foreach processor, but no luck so far.
Thanks

I need to gather the results from the first index, and for each result I need to gather info from the second index.
Unless you create parent/child relationships, that's not possible in Elasticsearch.
However, note:
In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.
Handle reading from multiple indexes within your application or rethink your index mapping.

The foreach processor is for ingest pipelines, i.e. stuff that gets done at indexing time, so it won't help you when you are trying to gather the results.
In general, it's not going to be possible to query another index (which might live on another shard) from within a query.
In some cases, you can use a join field. There are performance implications, so it's only recommended in specific cases.
If you are not in the join-field use case and you can restructure your data to use nested objects, they will be more performant than join fields.
Otherwise, you'll be better off running multiple queries in the application code (maybe you can fetch all the "secondary" results using just one query, so you'd have 2 queries in total?)
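For the application-side route, here is a minimal sketch of that two-query pattern with the Python client (elasticsearch-py 8.x). The index names (orders, customers) and the shared customer_id field are hypothetical stand-ins for your own schema:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Query 1: fetch the primary results from the first index.
primary = es.search(index="orders", query={"match": {"status": "open"}}, size=100)
hits = primary["hits"]["hits"]

# Collect the keys that link the two indexes (deduplicated).
customer_ids = list({hit["_source"]["customer_id"] for hit in hits})

# Query 2: fetch all the "secondary" documents in one terms query,
# assuming one customer document per customer_id.
secondary = es.search(index="customers",
                      query={"terms": {"customer_id": customer_ids}},
                      size=max(len(customer_ids), 1))

# Join in application code: map customer_id -> customer document.
by_id = {d["_source"]["customer_id"]: d["_source"] for d in secondary["hits"]["hits"]}
for hit in hits:
    hit["_source"]["customer"] = by_id.get(hit["_source"]["customer_id"])

However many primary hits there are, the whole "join" costs exactly two round trips (subject to the terms-query limit discussed further down this page).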

Related

Elasticsearch where in

Can I express a query in Elasticsearch that is similar to the following SQL query?
select * from data where data.uid in(select d2.uid from data d2 where d2.colX='val1');
There are two possible solutions.
You can first run a query to get a list of IDs and then use those IDs to run a second query (a terms query). Use this approach if you know that the result of the first query will stay under 65,536 IDs/terms; Elasticsearch has a default limit on this amount. You shouldn't increase this limit; it's there for a reason!
You can use nested or parent/child documents. The main difference is that nested documents are faster than parent/child, but nested docs require reindexing the parent with all its children, while parent/child allows you to reindex, add, or delete specific children. I don't have enough context to know which type of join will work best in your case.
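If you end up on the parent/child side, a minimal sketch of a join-field mapping and a has_child query looks like this (elasticsearch-py 8.x; the qa index and the question/answer relation names are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One join field defines the parent/child relation inside a single index.
es.indices.create(index="qa", mappings={
    "properties": {
        "my_join": {"type": "join", "relations": {"question": "answer"}}
    }
})

# Parent document.
es.index(index="qa", id="1",
         document={"text": "How do joins work?", "my_join": "question"})

# Child document: must be routed to the parent's shard.
es.index(index="qa", id="2", routing="1",
         document={"text": "Use a join field.",
                   "my_join": {"name": "answer", "parent": "1"}})

# Find parents (questions) whose children (answers) match a query.
res = es.search(index="qa", query={
    "has_child": {"type": "answer", "query": {"match": {"text": "join"}}}
})

Note that parents and children live in the same index; the routing requirement is what keeps the join shard-local.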
If Elasticsearch is not a hard requirement, you might want to take a look at ClickHouse. It supports join queries out of the box (in an SQL way).

Filter results in memory to search in Elasticsearch from multiple indexes

I have 2 indexes, and they both have one common field (basically a relationship).
Now, since Elasticsearch doesn't support filtering across multiple indexes, should we store the results in memory in a variable and filter them in Node.js (which basically means my application itself is now working as a database server)?
We were previously using MongoDB, which is also a NoSQL DB, and we were able to manage this through aggregation queries, but it seems Elasticsearch doesn't provide that.
So even if we use both databases combined, we have to store their results somewhere to filter the data further, as we are giving users advanced search functionality where they can filter data from multiple collections.
So should we store results in memory to filter the data further? We currently offer advanced search over 100 million records to customers, but without the advanced text search Elasticsearch provides; now we are planning to offer Elasticsearch's text search to customers.
What approach do you suggest for making MongoDB and Elasticsearch work together? We are using Node.js to serve data.
Or which of these options should we choose?
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects (see the mapping sketch after the links below)
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
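As a sketch of the nested-objects option, assume a hypothetical products index whose variants field holds an array of objects (elasticsearch-py 8.x):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="products", mappings={
    "properties": {
        "name": {"type": "text"},
        "variants": {
            "type": "nested",
            "properties": {
                "color": {"type": "keyword"},
                "stock": {"type": "integer"},
            },
        },
    }
})

# The nested query matches color and stock inside the *same* variant,
# which a flattened object mapping cannot guarantee.
es.search(index="products", query={
    "nested": {
        "path": "variants",
        "query": {"bool": {"must": [
            {"term": {"variants.color": "red"}},
            {"range": {"variants.stock": {"gt": 0}}},
        ]}},
    }
})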
Storing things client-side in memory is not the solution.
First of all, the simplest way to solve this problem is to simply make one combined index. It's trivial to do: just insert all the documents from index 2 into index 1, prefixing every field coming from index 2 with something like "idx2" so that you won't overwrite any similarly named fields. You can use an ingest pipeline to do this, or just do it client-side. You only ever need to do this once.
After that you can perform aggregations on the single index, since you have all the data in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation to redirect everything that was previously going into index 2 into index 1 as well (with the prefixed fields).
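A minimal client-side sketch of that one-off merge, assuming hypothetical index names index-1 and index-2 and an idx2_ prefix (elasticsearch-py 8.x):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch("http://localhost:9200")

def prefixed_docs():
    # Stream every document out of index-2 and prefix its fields so
    # they cannot collide with fields already present in index-1.
    for hit in scan(es, index="index-2", query={"query": {"match_all": {}}}):
        src = {f"idx2_{field}": value for field, value in hit["_source"].items()}
        yield {"_index": "index-1", "_source": src}

# One-off bulk copy into the combined index.
bulk(es, prefixed_docs())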
100 million records is trivial for something like Elasticsearch. Doing any kind of "join" client-side is NOT RECOMMENDED, as it obviates the entire value of using ES.
If you need any further help executing this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/Postgres or even MongoDB is to restructure the indices to suit the needs of your queries. Never try to work with multiple indices; ES is not built for that.
HTH.

How does Elasticsearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk-add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that keeping all of it under 1 index would be rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. Change the index mappings or queries you're using to achieve faster results, and add more nodes to your cluster if needed.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It disables analysis on the field, which prevents full-text searches inside the field and only allows exact string matches. Useful for fields like a "tags" field or a "status" field with values like ["draft", "review", "published"] (a mapping sketch follows below).
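A hypothetical mapping mixing an analyzed text field with keyword fields might look like this (elasticsearch-py 8.x):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="index-2019-08-11", mappings={
    "properties": {
        "message": {"type": "text"},    # analyzed: supports full-text search
        "status": {"type": "keyword"},  # not analyzed: exact matches only
        "tags": {"type": "keyword"},    # arrays of exact strings work too
    }
})

# An exact-match filter on a keyword field is cheap and precise.
es.search(index="index-2019-08-11", query={"term": {"status": "published"}})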
Good luck!

I have 2 indexes, which were 2 tables in SQL; how can I perform an inner join query?

I have 2 indexes, one called assignment and the other called user. In SQL there was a foreign-key data field, but I do not know how to perform an inner join in Elasticsearch. Can someone help me?
So you have a couple of options which might be useful. Without knowing your specific use case, I'm going to list some potentially useful links.
1)
Parent/child mapping is really useful when you want to return all documents associated with a specific document. To make the mapping process a bit easier, I typically index the data, retrieve the mapping using the /_mapping endpoint, modify the mapping, delete the index, then re-ingest the data. Sometimes that isn't an option, as in the case of short-lived data.
https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-mapping.html
After updating the current mapping, it's possible to use one of the joining queries.
https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html
2)
When deleting the index and re-ingesting the data isn't an option, create a new index and modify the mapping as described above, but instead of deleting the old index, use the reindex API to move the information to the new index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
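A minimal sketch of that reindex call with the Python client (elasticsearch-py 8.x), using hypothetical index names:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the new index with the modified mapping first, then copy the
# existing documents into it server-side.
es.reindex(source={"index": "assignment-old"},
           dest={"index": "assignment-new"},
           wait_for_completion=True)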
3)
It might also be possible to use an ingest processor to join the tables:
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-processors.html
4)
Possibly the quickest approach, until you get your head wrapped around how Elasticsearch works, is to either join the information prior to ingesting it, or write a script that joins the tables using one of the SDKs:
https://elasticsearch-py.readthedocs.io/en/master/
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/index.html
plus a lot more built by the community.

How to filter a huge list of ids from Solr at runtime

I have an index for products in Solr. I need to serve a customized list of products for each customer, such that I have to exclude some specific products for each customer.
Currently I am storing this relationship of customer & excluded products in a SQL database and then filtering them in Solr using a terms query. Is there a way I can store this relationship in Solr itself, so that I don't have to compute the exclude list from SQL first every time?
Something very similar to what we can do in elasticsearch using https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
Possible ways I could think of doing in Solr:
Keeping a list of customers in the products index itself and filtering on that. But this will really be a pain if I have to reindex all the documents. Also, the list can be huge.
Another way I could think of is maintaining a separate core that keeps documents per customer and excluded product_id, and performing a join using {!join} to filter out products for a customer. Is that a scalable solution?
What would be the ideal approach for storing this kind of data in Solr?
Are there any performance issues with the SQL DB? It is perfectly fine to query the DB and get the IDs, and send them to Solr. You would avoid complexity and data duplication. You would anyway have to do some computation to send those IDs to Solr as well.
But to answer your question: yes, you could indeed store the excluded product IDs per customer in a separate index. You would use a multi-valued field and update it using atomic updates. If you do that, make sure to keep the indexing schema simple, with no analyzer on the IDs (just use the string type without any tokenizer or filter).
You do not need to do a Solr join query. You only have to look up the product IDs per customer (1st query), format them as a comma-separated list, and run the terms query with the product IDs retrieved from the index (2nd query).
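As a sketch of those two queries over Solr's HTTP API, using Python's requests; the core names (exclusions, products), the field names, and the customer ID are hypothetical:

import requests

SOLR = "http://localhost:8983/solr"

# Query 1: look up the excluded product IDs for this customer.
resp = requests.get(f"{SOLR}/exclusions/select", params={
    "q": "customer_id:12345",
    "fl": "product_id",
    "rows": 10000,
    "wt": "json",
}).json()
excluded = [doc["product_id"] for doc in resp["response"]["docs"]]

# Query 2: exclude those IDs with the terms query parser, embedded via
# the lucene parser's _query_ hook so it can be negated.
fq = []
if excluded:
    fq.append('-_query_:"{!terms f=id}' + ",".join(excluded) + '"')
resp = requests.get(f"{SOLR}/products/select", params={
    "q": "name:shoes",
    "fq": fq,
    "wt": "json",
}).json()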
You need to find the best compromise for your case.
Best Query-Time Performance
You add a multi-valued field to the product index: allowed_users (or forbidden_users), depending on which cardinality you want to minimise.
This would require a full reindex the first time, plus an index update for each user permission change.
In order to reduce network traffic and optimise the updates, you could take a look at atomic updates [1].
Best Index-Time Performance
If the previous approach is not feasible in your case or doesn't satisfy you, you could try to optimise the indexing side.
You can index a document in a separate collection:
<Id>
<product_id>
<user_id>
You can use a query-time join to filter that collection for the current user, and then get back the products to filter them in your query.
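A sketch of that query-time join, again with requests; the permissions core and the field names are hypothetical. Here each document in the permissions core pairs a user_id with a forbidden product_id, and the join excludes those products for the current user:

import requests

SOLR = "http://localhost:8983/solr"

# {!join} pulls product IDs out of the permissions core and maps them
# onto the products core's id field; the leading minus excludes them.
resp = requests.get(f"{SOLR}/products/select", params={
    "q": "name:shoes",
    "fq": '-_query_:"{!join from=product_id to=id fromIndex=permissions}user_id:12345"',
    "wt": "json",
}).json()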
So basically, you already thought about both ideas :)
[1] https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html
