How to push down an ORDER BY clause in Presto Elasticsearch - elasticsearch

I am running a SQL query in Starburst Presto. It is connected to Elasticsearch through the Elasticsearch connector.
The SQL has an "order by" clause, and this clause is not being pushed down to Elasticsearch. Basically, I want Elasticsearch to sort the data on a specific field and return the result. The query with "order by" takes a long time in Presto. Is it possible to manage this somehow to get optimal performance?
SQL: select e.employee_id from elasticsearch.es."employee:id:""2390571"" && (doj_timestamp:(>=15965454 && <=15972366)) sort=employee_id:desc" e offset 0 limit 5;
The above query is returning random results.
Can anyone please help here?

Your query has both ORDER BY and LIMIT, so in Presto it is called a Top N query.
Presto currently does not provide Top N pushdown, but this feature is in the works.
Umbrella issue for connector pushdown: https://github.com/prestosql/presto/issues/18
A draft PR for Top N pushdown (engine & SPI support): https://github.com/prestosql/presto/pull/4784
Please file an issue for Elasticsearch connector Top N pushdown. We will implement it anyway, but direct user feedback helps us understand issue priorities.
You can learn more on the #pushdown channel on the Presto community Slack.
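Until Top N pushdown is available, and only if bypassing Presto is acceptable for this query, the sort and limit can be sent to Elasticsearch directly. The following is a minimal sketch of such a search body built as a plain Python dict. The helper name `build_top_n_search` is hypothetical, and the filter string is an adaptation of the raw query in the question, not a verified equivalent. The resulting body would be posted to the index's `_search` endpoint with whatever HTTP client you use.

```python
import json


def build_top_n_search(query_string, sort_field, limit, offset=0, descending=True):
    """Build an Elasticsearch search body that sorts and limits server-side.

    Hypothetical helper: it only assembles the JSON body, it does not send it.
    """
    return {
        "query": {"query_string": {"query": query_string}},
        "sort": [{sort_field: {"order": "desc" if descending else "asc"}}],
        "from": offset,
        "size": limit,
    }


# Assumed filter, adapted from the raw query in the question.
body = build_top_n_search(
    'id:"2390571" AND doj_timestamp:(>=15965454 AND <=15972366)',
    sort_field="employee_id",
    limit=5,
)
print(json.dumps(body, indent=2))
```

With this body Elasticsearch returns only the five highest `employee_id` hits, so nothing has to be sorted on the client side.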

Related

How can I design a search engine that can query very fast?

I'm designing the architecture of a search engine for condition-based search.
Each record contains multiple columns, which can be strings, numbers, dates, and so on. I want to query records as fast as possible using condition-based queries (a client will query to view the current records matching their filter).
For example:
query = (record.date > sysdate - 5 AND record.name like '%TEST%') OR (record.priority > 2 AND record.date > sysdate -2)...
What is the best way to do it?
I thought about using elasticsearch but will it be fast enough for the client?
It should be noted that the system is dynamic: records are constantly changed, added, and removed. There are also a lot of records stored in the system.
Based on the post and comments, I think Elasticsearch will work well for this.
How to make it fast:
Make sure your mappings and queries are optimised.
Provide adequate hardware resources for the cluster.
Monitor your cluster and queries.
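As an illustration of what an optimised query could look like, here is a hedged sketch of the example condition from the question expressed as an Elasticsearch bool query. The field names come from the question; the mapping types (a date field, a keyword field, a numeric field) and the function name `build_condition_query` are assumptions. Putting the conditions in `filter` clauses skips relevance scoring, which is usually what a pure yes/no filter wants.

```python
import json


def build_condition_query():
    """Bool query for:
    (date > now-5d AND name LIKE '%TEST%') OR (priority > 2 AND date > now-2d)

    Field names are taken from the question; mapping types are assumed.
    """
    recent_and_named = {
        "bool": {
            "filter": [
                {"range": {"date": {"gt": "now-5d"}}},
                # Leading wildcards are expensive; used here only to mirror LIKE '%TEST%'.
                {"wildcard": {"name": {"value": "*TEST*"}}},
            ]
        }
    }
    urgent_and_very_recent = {
        "bool": {
            "filter": [
                {"range": {"priority": {"gt": 2}}},
                {"range": {"date": {"gt": "now-2d"}}},
            ]
        }
    }
    return {
        "query": {
            "bool": {
                # OR between the two groups: at least one must match.
                "should": [recent_and_named, urgent_and_very_recent],
                "minimum_should_match": 1,
            }
        }
    }


condition_query = build_condition_query()
print(json.dumps(condition_query, indent=2))
```

If substring matching on `name` turns out to be the bottleneck, a different mapping for that field is probably worth evaluating before adding hardware.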

Can I add a calculated boolean column to an Elasticsearch Kibana query from data from another query?

Let's imagine that we have an Elasticsearch index, and we want to get all the documents of that index plus a calculated field holding the result of filtering a different index.
I'll explain the goal in SQL, even though Elasticsearch is NoSQL:
select id, name, (id IN (select customer_id from invoices where customer_id = 123)) as hasBought
from customers;
Elasticsearch doesn't support table joins. You'll need to denormalize your data one way or another, even if it results in data duplication. That's the "downside" of NoSQL stores like ES.
Quoting the docs:
Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. Instead, Elasticsearch offers two forms of join which are designed to scale horizontally.
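If denormalizing is not an option, the flag can also be computed in the application after running two separate searches. The sketch below shows only that combining step in plain Python. The function name `add_has_bought` and the sample records are made up for illustration; in practice both inputs would come from searches against the customers and invoices indices.

```python
def add_has_bought(customers, invoice_customer_ids):
    """Return copies of the customer docs with a computed `hasBought` flag."""
    buyers = set(invoice_customer_ids)  # set lookup keeps the check O(1) per customer
    return [{**customer, "hasBought": customer["id"] in buyers} for customer in customers]


# Stand-in data; real values would come from two separate searches.
customers = [{"id": 123, "name": "Ada"}, {"id": 456, "name": "Grace"}]
invoice_customer_ids = [123, 123]

result = add_has_bought(customers, invoice_customer_ids)
print(result)
```

This keeps the two indices independent, at the cost of a second round trip and of holding the matching IDs in memory.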

How to join two elasticsearch inserts?

I am very new to Elasticsearch and come from a SQL background. We are trying to use an ELK stack to monitor a Jenkins server. We use the Elasticsearch report plugin to send a bunch of information about each job, but we also have some custom information that we need to send. How can I join these two pieces of information in Kibana? In a SQL database, I would have two tables and join them on a key, but I don't know how to do that in Elasticsearch. Any suggestions?
Generally speaking, joins are the strong suit of relational (SQL) databases and the weak spot of NoSQL stores, Elasticsearch among them. Having said that, ES does support such operations, and if performance is not critical you can try them: see Elasticsearch's joining queries. In a nutshell:
Create a join-field mapping. This is the equivalent of a foreign key constraint in SQL. Since you have control over the Logstash part, I suggest you make it the parent and the ES report info the child.
Use the has_child query when you query the logs. This type of query acts like the join query in SQL.
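The two steps above can be sketched as request bodies, written here as plain Python dicts. The field name `job_join`, the relation names `custom_info` and `report`, and the sample document fields are all hypothetical. The shapes of the `join` field and the `has_child` query follow the Elasticsearch join documentation.

```python
import json

# Step 1: index mapping with a join field.
# Hypothetical names: `job_join`, parent relation `custom_info`, child relation `report`.
mapping = {
    "mappings": {
        "properties": {
            "job_join": {
                "type": "join",
                "relations": {"custom_info": "report"},
            }
        }
    }
}

# Parent document: only names its side of the relation.
parent_doc = {"job_name": "nightly-build", "job_join": {"name": "custom_info"}}

# Child document: names its relation and the parent's _id. It must be indexed
# with routing set to the parent id so both documents land on the same shard.
child_doc = {"status": "SUCCESS", "job_join": {"name": "report", "parent": "1"}}

# Step 2: return parent documents that have at least one matching child.
has_child_query = {
    "query": {
        "has_child": {
            "type": "report",
            "query": {"match": {"status": "SUCCESS"}},
        }
    }
}

print(json.dumps(has_child_query, indent=2))
```

Note that parent and child documents live in the same index, so both the plugin output and the custom information would have to be written to one index for this to work.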

Paging problems in Elasticsearch SQL API

My existing system has some SQL search procedures that return data based on some filters. To improve searches, we have decided to use Elasticsearch for all of them. We are currently building a prototype.
Below is what I have done so far:
De-normalize all the data from my RDBMS and store it in Elasticsearch using Logstash.
Query data from Elasticsearch based on the parameters using the Elasticsearch SQL API.
The main problem is pagination. Elasticsearch SQL supports a fetch_size parameter and returns a cursor for the next set of records.
A cursor is fine if you want to get the next page of results, but if a user wants to go from page 10 to page 100, how can we achieve that?
I also searched for offset and skip support in Elasticsearch SQL but could not find any references.
Has anyone faced such an issue? I would appreciate any help or suggestions.
I tried to follow the link https://www.elastic.co/guide/en/elasticsearch/reference/current/sql-pagination.html
{
"query" : "Select client_clientid, clientpolicy_policyname from client_paged_list group by client_clientid, clientpolicy_policyname",
"fetch_size": 5
}
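Since the cursor only points at the next page, one way to reach an arbitrary page is to walk the cursor forward and discard the intermediate pages. The sketch below assumes that behaviour and the documented response shape (`rows` plus an optional `cursor`). `fetch_page`, `post_sql`, and the in-memory fake endpoint are hypothetical names used only to exercise the loop; the fake's `start:size` cursor format is invented, and a real cursor is an opaque string.

```python
def fetch_page(post_sql, query, page_size, page_number):
    """Return the rows of a 1-based page by walking the cursor forward.

    `post_sql` is a hypothetical callable that POSTs a JSON body to the
    SQL endpoint and returns the decoded response as a dict.
    """
    response = post_sql({"query": query, "fetch_size": page_size})
    for _ in range(page_number - 1):
        cursor = response.get("cursor")
        if cursor is None:  # results ran out before the requested page
            return []
        response = post_sql({"cursor": cursor})
    return response.get("rows", [])


def make_fake_post_sql(all_rows):
    """In-memory stand-in for the SQL endpoint (invented cursor format)."""
    def post_sql(body):
        if "cursor" in body:
            start, size = (int(part) for part in body["cursor"].split(":"))
        else:
            start, size = 0, body["fetch_size"]
        response = {"rows": all_rows[start:start + size]}
        if start + size < len(all_rows):
            response["cursor"] = f"{start + size}:{size}"
        return response
    return post_sql


rows = [[i] for i in range(12)]
page3 = fetch_page(make_fake_post_sql(rows), "SELECT ...", page_size=5, page_number=3)
print(page3)
```

The cost grows with the page number, so a jump from page 10 to page 100 pays for every page in between. For deep jumps it may be worth checking whether the non-SQL search API fits the use case better.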

How to create SQL style join on two Elasticsearch indexes?

I have two Elasticsearch indexes. I want to be able to search them in a similar way to an SQL join.
One index stores data for Lessons and contains a reference to the Locations index using the id of the location document.
What I'm trying to do in essence is a typical SQL join.
SELECT * FROM Lessons L JOIN Locations LC ON L.location_id = LC.id
My first solution would be to add the location info to the Lessons index whenever I update a document. This would be the correct approach in the Elasticsearch methodology of flat data. However, the problem is that the two sets of data are maintained independently, so when a Location is updated, all the relevant Lessons documents would need to be updated as well.
The second solution I've looked at is joining queries in Elasticsearch https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html , but from what I understand of the documentation, this is not possible across different indexes.
You might be able to use a terms lookup.
https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-terms-query.html#query-dsl-terms-lookup
But if too many terms are involved in the join, performance would be a concern. Elasticsearch also limits the number of terms to 65,536 (in newer versions, I believe).
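A terms lookup reads the list of terms from a field of one existing document, possibly in another index, and uses it to filter the searched index. The sketch below builds such a query as a plain Python dict. The helper name `build_terms_lookup`, the index `location_groups`, the document id `london`, and the field `location_ids` are all hypothetical. Only `location_id` on the lesson documents comes from the question.

```python
import json


def build_terms_lookup(field, lookup_index, lookup_id, lookup_path):
    """Build a terms query whose terms are fetched from another document.

    Hypothetical helper: it only assembles the JSON body, it does not send it.
    """
    return {
        "query": {
            "terms": {
                field: {
                    "index": lookup_index,  # index holding the lookup document
                    "id": lookup_id,        # _id of the lookup document
                    "path": lookup_path,    # field containing the list of terms
                }
            }
        }
    }


# Assumed setup: document "london" in `location_groups` has an array field
# `location_ids` listing the locations of interest.
lessons_query = build_terms_lookup("location_id", "location_groups", "london", "location_ids")
print(json.dumps(lessons_query, indent=2))
```

Running this against the Lessons index would return lessons whose `location_id` appears in that list. It filters by the other index but does not merge location fields into the hits, so it is only a partial substitute for a SQL join.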
