Elasticsearch where in - elasticsearch

Can I express a query in Elasticsearch that is similar to the following SQL query?
select * from data where data.uid in(select d2.uid from data d2 where d2.colX='val1');

There are two possible solutions.
You can first run a query to get a list of IDs and then use those IDs to run a second query (a terms query). Use this approach if you know that the result of the first query will stay under 65,536 IDs/terms, which is Elasticsearch's default limit on the number of terms. You shouldn't increase this limit; it's there for a reason!
You can use nested documents or parent/child documents. The main difference is that nested documents are faster to query than parent/child, but changing a nested document requires reindexing the parent with all its children, while parent/child lets you reindex, add, or delete specific children. I don't have enough context to know which type of join will work best in your case.
If Elasticsearch is not a requirement, you might want to take a look at Clickhouse. It supports join queries out of the box (in an SQL way).
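As a rough illustration of the first approach (two queries), here is a minimal sketch that only builds the two request bodies as plain dictionaries. It assumes that uid and colX are keyword fields and that the first query's hits carry uid in _source; the helper names and the MAX_TERMS constant are made up for this example, and sending the bodies to a cluster is left to whatever client you use.

```python
from typing import Any, Dict, Iterable, List

# Default cap on terms mentioned above; kept here as a plain constant (assumption).
MAX_TERMS = 65_536


def inner_query(field: str, value: str) -> Dict[str, Any]:
    """Body for the first query: documents where `field` equals `value`."""
    return {"_source": ["uid"], "query": {"term": {field: value}}}


def outer_query(hits: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Body for the second query: a terms query over the uids found first."""
    # De-duplicate and sort so the request is deterministic.
    uids: List[str] = sorted({hit["_source"]["uid"] for hit in hits})
    if len(uids) > MAX_TERMS:
        raise ValueError(f"{len(uids)} terms exceeds the limit of {MAX_TERMS}")
    return {"query": {"terms": {"uid": uids}}}
```

The hits passed to outer_query would be the hits array from the first response; with more results than one page, you would have to page through them before building the second body.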

Related

How can I get information from 2 different ElasticSearch indexes?

So, I have 2 indexes in my Elasticsearch server.
I need to gather the results from the first index, and for each result I need to gather info from the second index.
How to do that? Tried the foreach processor, but no luck so far.
Thanks
I need to gather the results from the first index, and for each result I need to gather info from the second index.
Unless you create parent/child relationships, that's not possible in ElasticSearch.
However, note:
In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.
Handle reading from multiple indexes within your application or rethink your index mapping.
The foreach processor is for ingest pipelines, meaning, stuff that gets done at indexing time. So it won't help you when you are trying to gather the results.
In general, it's not going to be possible to query another index (whose shards might live on another node) from within a query.
In some cases, you can use a join field. It has performance implications and is only recommended in specific cases.
If you are not in the join field use case, and you can restructure your data to use nested objects, it will be more performant than join fields.
Otherwise, you'll be better off running multiple queries in the application code (maybe you can fetch all the "secondary" results using just one query, so you'd have 2 queries in total?)
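The "2 queries in total" idea can be sketched in application code roughly as follows. This is only an illustration under assumptions: the hits are shaped like Elasticsearch search hits (_id and _source), the first index stores the ID of the related document in a field whose name you pass as key, and both function names are hypothetical.

```python
from typing import Any, Dict, List


def secondary_ids_query(primary_hits: List[Dict[str, Any]], key: str) -> Dict[str, Any]:
    """Build one ids query that fetches every referenced secondary document."""
    ids = sorted({h["_source"][key] for h in primary_hits if key in h["_source"]})
    return {"query": {"ids": {"values": ids}}}


def merge_hits(
    primary_hits: List[Dict[str, Any]],
    secondary_hits: List[Dict[str, Any]],
    key: str,
) -> List[Dict[str, Any]]:
    """Attach the matching secondary document (or None) to each primary hit."""
    by_id = {h["_id"]: h["_source"] for h in secondary_hits}
    return [
        {**h["_source"], "joined": by_id.get(h["_source"].get(key))}
        for h in primary_hits
    ]
```

The first function's body goes to the second index; the second function then combines both responses in memory, so the join cost is paid in the application rather than in the cluster.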

How to create SQL style join on two Elasticsearch indexes?

I have two Elasticsearch indexes. I want to be able to search them in a similar way to an SQL join.
One index stores data for Lessons and contains a reference to the Locations index using the id of the location document.
What I'm trying to do in essence is a typical SQL join.
SELECT * FROM Lessons L JOIN Locations LC ON L.location_id = LC.id
My first solution would be to add the locations info into the Lesson index when I update a document. This would be the correct approach in the methodology of Elasticsearch - flat data. However the problem is that the two sets of data are maintained independently. So when a Location is updated all the relevant Lessons documents would need to be updated.
The second solution I've looked at is the joining queries in Elasticsearch https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html , however, from what I understand from the documentation, this is not possible between different indexes.
You might be able to use a terms lookup.
https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-terms-query.html#query-dsl-terms-lookup
But if too many terms are involved in the join, performance would be a concern. Also, Elasticsearch limits them to 65536 (in newer versions, I guess).
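For reference, a terms lookup request body has roughly the shape built below. Note that a terms lookup reads its values from a field of one existing document, so this is not a general join: the sketch assumes a hypothetical regions index whose documents hold a location_ids array, and the function name is made up as well.

```python
from typing import Any, Dict


def terms_lookup_query(
    field: str, lookup_index: str, lookup_id: str, lookup_path: str
) -> Dict[str, Any]:
    """Match documents whose `field` is one of the values stored at
    `lookup_path` in document `lookup_id` of `lookup_index`."""
    return {
        "query": {
            "terms": {
                field: {
                    "index": lookup_index,
                    "id": lookup_id,
                    "path": lookup_path,
                }
            }
        }
    }


# Hypothetical usage: lessons held at any location listed in one "region" document.
body = terms_lookup_query("location_id", "regions", "region-1", "location_ids")
```

The body would be sent as a search against the Lessons index; the lookup document has to be maintained separately, which is the price of avoiding a second round trip.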

How to manage SQL query with many of joins in a search engine like elastic?

I have a SQL query which has at least 6 joins. That query takes 10 minutes or more to execute.
Right now I'm using sphinx and I just set a source from that SQL query.
But I have a problem with reindexing.
One of the joins is a join to a dictionary table which updates really often.
I have to reindex the source after every dictionary update.
But I do not want to update the entire index.
For example:
this is SQL query:
SELECT m.col1, m.col2, m.col3, d.col1 FROM MainTable m JOIN
SupportTable t1 JOIN SupportTable t2 JOIN SupportTable t3 JOIN
DictionaryTable d
When someone updates DictionaryTable I want to update only that part of the index which depends on the updated row.
My target is a real-time interface for my customers.
The size of the database is very large.
What can I do to make my analytic query faster?
Should I use a search engine and build a reindexing mechanism, or should I use a more suitable technology?
Sounds like a Sphinx real-time (RT) index might be suitable.
http://sphinxsearch.com/docs/current.html#rt-indexes
You can just send updates for certain documents, rather than having to rebuild the whole index to update a few documents.
But you can only update all fields on a document. You can't just update d.col1 on lots of documents; you will need to provide all the data for all fields (and attributes) for all affected documents.
You can, however, update selected attributes of documents without touching the fields or other attributes.
Another idea is, instead of one big index, to break the index into pieces, i.e. 'shard' the index. You can even use a distributed index to make querying all the shards at once easy (i.e. the application sees just one index, and you don't need to search separate shards manually).
http://sphinxsearch.com/files/tutorials/sphinx_config_tips_and_tricks.pdf
... that way you can update the shards on a rolling basis, i.e. rather than one '10 minute' query, split into 4 shards and have much smaller updates.
(Ranged queries may even be used to break each of those down into lots of smaller queries, rather than one 2.5-minute query.)
http://sphinxsearch.com/docs/current.html#ranged-queries
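To make the ranged-query idea concrete, the sketch below shows how a span of document IDs can be cut into inclusive chunks of a fixed step, each of which would drive one smaller fetch query. This is plain Python illustrating the arithmetic only, not Sphinx configuration or its internals; the function name and the example numbers are made up.

```python
from typing import List, Tuple


def id_ranges(min_id: int, max_id: int, step: int) -> List[Tuple[int, int]]:
    """Split [min_id, max_id] into consecutive inclusive (start, end) ranges."""
    if step <= 0:
        raise ValueError("step must be positive")
    ranges = []
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)  # clamp the last range
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each pair would fill a "WHERE id BETWEEN start AND end" clause of the main query.
chunks = id_ranges(1, 10, 4)
```

With smaller ranges, each individual query holds locks and memory for less time, at the cost of running more queries overall.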

I have 2 indexes, which were 2 tables in SQL; how can I perform an inner join query?

I have 2 indexes, one called assignment and the other user. In SQL they were linked by a foreign key field, but I do not know how to perform an inner join in Elasticsearch. Can someone help me?
So you have a couple of options which might be useful. Without knowing your specific use case, I'm going to list some potentially useful links.
1)
Parent/child mapping, really useful when you want to return all documents associated with a specific document. To make the mapping process a bit easier, I typically index the data, then retrieve the mapping using the /_mapping endpoint, modify the mapping, delete the index, then reingest the data. Sometimes that isn't an option, as in the case of short-lived data.
https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-mapping.html
After updating the current mapping, it's possible to use one of the joining queries.
https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html
2)
When deleting the index and reingesting the data isn't an option, create a new index and modify the mapping as described above, but instead of deleting the index, use the reindex API to get the information into the new index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
3)
It might also be possible to use an ingest processor to join the tables
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-processors.html
4)
Possibly the quickest option, until you get your head wrapped around how Elasticsearch works, is to either join the information prior to ingesting or write a script joining the tables using one of the SDKs.
https://elasticsearch-py.readthedocs.io/en/master/
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/index.html
plus a lot more built by the community.
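Option 4, joining the information before ingesting, could look something like this minimal sketch. It assumes the two tables are available as lists of dictionaries and that assignment rows point at users through a user_id foreign key; those field names and the function name are assumptions for the example, not something taken from the question.

```python
from typing import Any, Dict, List


def denormalize(
    assignments: List[Dict[str, Any]],
    users: List[Dict[str, Any]],
    fk: str = "user_id",
) -> List[Dict[str, Any]]:
    """Inner-join assignments with users and embed the user in each document."""
    users_by_id = {u["id"]: u for u in users}
    docs = []
    for row in assignments:
        user = users_by_id.get(row.get(fk))
        if user is None:
            continue  # inner join: skip assignments without a matching user
        docs.append({**row, "user": user})
    return docs
```

The resulting documents can then be indexed into a single index, for example with a client's bulk helper, so that no join is needed at query time.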

getting complete tree from elastic search in single call

I want to store an employee hierarchy in Elasticsearch, where the CFO, CTO, COO etc. report to the CEO, and each employee can have their own reportees.
I think the above can be done using the Elasticsearch parent-child relationship. Can we write a query to get all the reportees (direct reportees and sub-reportees) in a single call?
For example, if we query for the CEO we should get all employees, and for the CFO we should get the employees in the finance dept.
Something similar exists in RDBMSs, like SQL Server's CTE.
Parent-child relations in ES work like this:
The parent knows nothing about its children.
Children must provide _parent to connect with it and to be routed accordingly.
Parent-child mapping is handled by ES via a mapping in memory.
Parent and child documents are independent in any other aspect.
So, there is no easy way to do it (there's no way to actually store the normal form of any relational data either, because ES is a non-relational DB). Workarounds for this:
query documents with has_parent/has_child queries (only 1 level of relation works for this)
store documents as nested objects (pay attention, that this model reindexes whole document if any of members changes)
denormalize data (most natural way for non-relational storages, IMO)
First and foremost, avoid thinking about ES in a relational database way. ES isn't well suited for joins/relations, though it can achieve a similar effect via parent/child relations. Don't even think about joins that might involve an undetermined depth. A CTE can handle that without much difficulty, but not all relational databases support CTEs AFAIK (MySQL before 8.0 being one).
The parent-child relation is more trouble than it's worth, IMO. Child docs are routed to the shards where their parents reside. In your case of a tree, all documents will eventually trace back to the root document, which will result in all your documents residing in a single shard. The depth of your tree could be quite large (more than 4 or 5 in a not-so-small organization). Also, if you go with this solution, it is quite inconvenient to retrieve (via the GET API) a particular child doc from ES based on its ID, because you have to specify its parent IDs all the way up to its root.
I think it's best to store the PATH from root up to but not including the current employee as a list of IDs. So each employee has a field like:
"superiors": [CEO_ID, CTO_ID, ... , HER_DIRECT_MANAGER_ID],
So it is completely denormalized, and your application has to prepare this list.
With this setup, to get all subordinates of an employee:
filter out the IDs in this employee's own superiors field plus her own ID, using either a filter agg or a filtered query;
do a terms agg on the superiors field, and you will have all subordinates of this employee.
I must admit that at least two queries are needed: first a GET request to retrieve the superiors field of this employee, and then a second query to do what I described above.
Also, don't worry about the duplications due to denormalization. ES can handle way more data than you can save here.
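Here is a minimal sketch of how an application might prepare the superiors list before indexing, assuming the hierarchy is available as a mapping from each employee ID to the direct manager's ID (the manager_of mapping and both function names are hypothetical). The second function is a simpler alternative to the aggregation route described above, not the same technique: it builds a single term query that matches every employee whose superiors list contains a given ID, assuming superiors is indexed as a keyword field.

```python
from typing import Any, Dict, List


def superiors_path(manager_of: Dict[str, str], employee_id: str) -> List[str]:
    """Return the chain of manager IDs from the root down to the direct manager."""
    path: List[str] = []
    seen = {employee_id}
    current = manager_of.get(employee_id)
    while current is not None:
        if current in seen:
            raise ValueError("cycle in reporting hierarchy")
        seen.add(current)
        path.append(current)
        current = manager_of.get(current)
    path.reverse()  # collected bottom-up, stored root-first
    return path


def subordinates_query(employee_id: str) -> Dict[str, Any]:
    """Term query matching all direct and indirect reportees of employee_id."""
    return {"query": {"term": {"superiors": employee_id}}}


manager_of = {"cfo": "ceo", "cto": "ceo", "dev": "cto"}
dev_path = superiors_path(manager_of, "dev")
```

Because each document carries its full path, moving an employee to a new manager means recomputing the superiors list for that employee and everyone below them.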

Resources