Kibana composite query pagination - elasticsearch

I have a composite aggregation query doing exactly what I want (the details of said query should not matter). I would like very much to visualise the results in Vega as a nice time-based chart, but I've hit a very stupid roadblock: I cannot find how to ask Vega to fetch all results. Composite aggregation results are paged (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html#_pagination) and therefore, in order to get all results, multiple queries should be done. So I can display one page of data, which is not enough in my case.
Is there a way to fetch all pages with Vega or Vega-Lite? If not, perhaps in another graph module of Kibana? A quick search gave no definitive answers… And finally, I have the latest version of everything.
Thanks!

Yup, no. Dynamic Elasticsearch URLs (basing one query on the result of another) are not doable. I think I put in a feature request for this a while back, but unfortunately the Vega and Kibana integration gets pushed to the wayside in favour of improvements to Lens.
Hopefully this is something they do in the future, because it would greatly improve the Vega-Kibana capabilities. It really depends on what you are actually trying to do; my advice would be to find a way to get the data in a single search.
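For anyone fetching the pages outside Vega, here is a minimal sketch of how composite pagination works: each response carries an `after_key`, and the next request sends it back as `after`. The `search` callable, the function name and the aggregation name are assumptions for illustration; `search` stands in for whatever client call sends the request body and returns the parsed JSON response.

```python
import copy
from typing import Any, Callable, Dict, Iterator, Optional


def iter_composite_buckets(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    body: Dict[str, Any],
    agg_name: str,
) -> Iterator[Dict[str, Any]]:
    """Yield every bucket of one composite aggregation, page by page."""
    after: Optional[Dict[str, Any]] = None
    while True:
        # Copy the request so the caller's body is never mutated.
        request = copy.deepcopy(body)
        if after is not None:
            request["aggs"][agg_name]["composite"]["after"] = after
        agg = search(request)["aggregations"][agg_name]
        buckets = agg["buckets"]
        if not buckets:
            return  # an empty page means everything has been read
        yield from buckets
        # The response tells us where the next page starts.
        after = agg.get("after_key")
        if after is None:
            return
```

This only helps in a script or backend job; it does not change the fact that a Vega data source issues a single request.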

Related

Is it possible to know when some data is available for being searched in Elasticsearch?

I'm implementing software in which data is sent to a web server, stored in Elasticsearch and then queried right away. I know that Elasticsearch is a NoSQL store following BASE (Basically Available, Soft state, Eventual consistency) principles, which means there's no guarantee of when your data will be available for searching.
That's why when I query for the data just being added to Elasticsearch, I have to wait for some time before it is found. Right now all I can do is to implement a polling mechanism to detect when data is completely applied. It is worth mentioning that if I'm using _id to retrieve a document, it is found right away. But if I'm searching for it using some type of Elasticsearch query (like term or query_string), it will take a while before the document is found.
So my question is: Is there a cheaper way to detect when data is completely indexed in Elasticsearch?
Making newly indexed data searchable is the job of a refresh, and the Refresh API does not provide a way to know when a given document has become searchable. But the folks at Elastic are working on a way to let a request wait for a refresh.
I think it would be better to take a look here: https://www.elastic.co/blog/refreshing_news
The post gives a good overview of the issues and of what they are working on to improve them.
Hope it helps :D
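As far as I know, the work described in that post shipped as the `refresh=wait_for` query parameter on write requests (Elasticsearch 5.0 and later): the request only returns once a refresh has made the document searchable. A minimal sketch of building such a request URL follows; the helper name, the base URL and the `/_doc/` path are assumptions, and the path in particular differs on older versions.

```python
from urllib.parse import quote, urlencode


def index_doc_url(base_url: str, index: str, doc_id: str,
                  wait_for_refresh: bool = True) -> str:
    """Build the URL for indexing one document (hypothetical helper)."""
    path = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    if wait_for_refresh:
        # Ask Elasticsearch to hold the response until a refresh
        # has made this document visible to searches.
        return f"{path}?{urlencode({'refresh': 'wait_for'})}"
    return path
```

The trade-off is that the write request takes longer to return, since it waits for the next refresh instead of forcing one.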

How to make the search most efficient?

For a property sale/rent website, a search function should be provided. At the same time, users can use the filters to get the result they want most.
Normally, a property has many attributes, like the price, address, year built, area, and many amenities such as a balcony, washing machine and so on. There may be over 100 of them.
So how should the database (MySQL or some NoSQL store) and the architecture be designed to make search as efficient as possible?
Sounds like your application requires a lot more search queries than update queries, and that the search queries are quite diverse.
In this case, try ElasticSearch: You choose some database where you store and modify your data. Then, you should propagate any update to an ElasticSearch index, where you upload a denormalized view of the data, which is closer to what users will expect to get when searching.
https://www.quora.com/Whats-the-best-way-to-setup-MySQL-to-Elasticsearch-replication
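To make "denormalized view" a bit more concrete, here is a minimal sketch that flattens one property row and its amenities into a single document suitable for indexing. All field names are hypothetical, and it assumes the amenities arrive as a list of strings from a join or lookup table.

```python
from typing import Any, Dict, List


def build_search_doc(prop: Dict[str, Any], amenities: List[str]) -> Dict[str, Any]:
    """Flatten a property row plus its amenities into one search document."""
    return {
        "id": prop["id"],
        "price": prop["price"],
        "address": prop["address"],
        "year_built": prop["year_built"],
        "area": prop["area"],
        # One multi-valued field instead of 100+ boolean columns:
        # a filter then just matches values in this list.
        "amenities": sorted(set(amenities)),
    }
```

Keeping the amenities in one multi-valued field means adding a new amenity needs no schema change on the search side.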

Elasticsearch - Autocomplete return word/term/token suggestions instead of whole documents

I am trying to implement a simple auto completion for query terms.
There are many different approaches, but most of them return documents instead of terms,
or the authors simply stopped explaining at that point and I am not able to adapt them.
A user is typing in a query, e.g. phil
What I want is to provide a list of term completion suggestions like philipp, philius, philadelphia, ...
I am able to get document matches via (edge) n-grams, phrase_prefix and so on, but I am stuck at retrieving matching terms (completion suggestions).
Can someone give me a hint?
I have documents like this {"title":"...", "description":"...", "content":"..."}
All fields have larger string values but especially the field content contains fulltext content.
I do not want to suggest the whole title of a document containing e.g. Philadelphia. Just the word "Philadelphia".
Looking for something like that, myself.
In SOLR it was relatively simple to configure (although a pain to build and keep up-to-date) using solr.SpellCheckComponent. Somehow the same underlying Lucene functionality is used differently between SOLR and ElasticSearch, and in ElasticSearch it is geared towards finding whole documents (or whole field values, if you will) or so it seems...
Despite the profusion of "elasticsearch autocomplete" articles, none appears to deal with this particular issue. Like it doesn't exist. Maybe their use case is different and ElasticSearch works for them just fine, who knows?
At this point I think that preparing the exact field values to use with ElasticSearch autocomplete (yes, that's the input field values, not analyzer tokens) may be the only way to solve the problem. Which is terrible, because the performance is going to be very low.
Try term suggester:
The term suggester suggests terms based on edit distance. The provided
suggest text is analyzed before terms are suggested. The suggested
terms are provided per analyzed suggest text token. The term suggester
doesn’t take the query into account that is part of request.
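A minimal sketch of what a term suggester request and response look like, assuming the `content` field from the question; the suggestion name and both helper names are made up for illustration. Note that, per the description quoted above, suggestions are based on edit distance, so this may behave more like spelling correction than prefix completion.

```python
from typing import Any, Dict, List


def term_suggest_body(text: str, field: str,
                      name: str = "term-suggestions") -> Dict[str, Any]:
    """Build a search body that asks only for term suggestions."""
    return {"suggest": {name: {"text": text, "term": {"field": field}}}}


def suggested_terms(response: Dict[str, Any],
                    name: str = "term-suggestions") -> List[str]:
    """Collect the suggested terms, one entry per analyzed input token."""
    terms: List[str] = []
    for entry in response["suggest"][name]:
        for option in entry["options"]:
            if option["text"] not in terms:  # keep first occurrence only
                terms.append(option["text"])
    return terms
```

The response contains terms rather than whole documents, which is the shape the question asks for.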

How to handle pagination when the source data changes frequently

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.
Elasticsearch provides methods to paginate search results with handy from and size parameters.
So I run a query: get me the most recent data from result 1 to 10
This works great.
The user clicks "next page" and the query is:
get me the most recent data from result 11 to 20
The problem is that in the time between the two queries, 2 new records have been added to the backing database, which means the paginated results will overlap (the last 2 from the first page show up as first two on the second page).
What's the best solution to avoid this? Right now, I'm adding a filter to the query that tells it to only include results later than the last result of the previous query. But it just seems hackish.
A filter is not a bad option, if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries. You also have to know when to get rid of it. But those aren't insurmountable problems.
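One way to read the filter approach, sketched under assumptions: remember the time of the first request and only page through documents at or before it, so that records added later cannot shift the pages. The field name `created_at` and the helper name are hypothetical.

```python
from typing import Any, Dict


def frozen_page_body(snapshot_time: str, page: int, page_size: int = 10,
                     ts_field: str = "created_at") -> Dict[str, Any]:
    """Build a from/size request limited to documents up to snapshot_time."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "from": (page - 1) * page_size,
        "size": page_size,
        "sort": [{ts_field: "desc"}],  # most recent first
        "query": {
            "bool": {
                # Ignore anything indexed after the first page was served.
                "filter": [{"range": {ts_field: {"lte": snapshot_time}}}]
            }
        },
    }
```

The client keeps `snapshot_time` for the whole browsing session and drops it when the user starts a fresh search.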
The Scroll API is a solid option for this, because it effectively snapshots in time on the Elasticsearch side. The intent of the Scroll API is to provide a stable search query for deep pagination, which has to deal with the exact issue of change that you're experiencing.
You begin a Scrolling Search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which returns a page of results and a new scroll_id for the next request.
(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)
Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.
There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably these could accumulate in your cluster, depending on how heavily you rely on scrolling search. You'll want to test the performance implications there. And if I recall correctly, scrolling searches also do not persist through a node failure or restart.
The ES documentation for the Scroll API provides good details on all of the above.
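The scroll loop described above can be sketched roughly as follows. The two callables are assumptions standing in for real client calls: `start` sends the initial search with the scroll parameter, and `next_page` posts a scroll_id to /_search/scroll; both return the parsed JSON response.

```python
from typing import Any, Callable, Dict, Iterator


def iter_scroll_hits(
    start: Callable[[], Dict[str, Any]],
    next_page: Callable[[str], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield every hit of a scrolling search, one page at a time."""
    response = start()
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            return  # an empty page marks the end of the scroll
        yield from hits
        # Always use the most recent scroll_id for the next request.
        response = next_page(response["_scroll_id"])
```

In a web application the loop would be spread across requests, with the client holding the current scroll_id between pages.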
Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.
Realise this is a bit old but with ElasticSearch 6.3 there's now the search_after feature for the request body which allows for cursor type paging:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
It is very similar to the scroll API, but unlike scroll, the search_after parameter is stateless: it is always resolved against the latest version of the searcher.
You need to use the scan API for this. The scan and scroll APIs let you do point-in-time search and pagination.
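A minimal sketch of cursor-style paging with search_after: every hit carries its `sort` values, and the next request passes the last hit's values as `search_after`. The `search` callable is an assumption standing in for a real client call, and the request body is expected to define a `sort`, ideally ending in a unique tiebreaker field.

```python
import copy
from typing import Any, Callable, Dict, Iterator


def iter_search_after(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    body: Dict[str, Any],
) -> Iterator[Dict[str, Any]]:
    """Yield all hits by repeatedly resuming after the last hit's sort values."""
    if "sort" not in body:
        raise ValueError("search_after needs an explicit sort")
    request = copy.deepcopy(body)  # never mutate the caller's body
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume strictly after the last hit of this page.
        request = {**request, "search_after": hits[-1]["sort"]}
```

Because the cursor is just the last sort values, a record inserted between two requests cannot push already-seen results onto the next page.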
Scan API -

Does Elasticsearch stream results?

Does Elasticsearch stream the query results as they are "calculated" or does it calculate everything and then return the final response back to the client?
By default Elasticsearch will only return a limited set of results for a query (i.e. searching for * will only return the default number of results, regardless of the number of matches).
Generally, to implement "streaming", you make an initial search to get the total count of matching documents and then ask for the documents in ranges (i.e. first 10, next 10, etc.).
See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html
for how to request the number of documents returned.
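As a rough illustration of asking for documents in ranges, the sketch below turns a total count into successive from/size windows. The helper name is made up; each pair would go into the `from` and `size` parameters of one search request.

```python
from typing import Iterator, Tuple


def page_windows(total: int, page_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield (from, size) pairs that together cover `total` documents."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    for start in range(0, total, page_size):
        # The last window may be shorter than page_size.
        yield start, min(page_size, total - start)
```

For example, `list(page_windows(25, 10))` gives `[(0, 10), (10, 10), (20, 5)]`. Keep in mind that Elasticsearch limits how deep from + size can go (the `index.max_result_window` setting, 10000 by default), so this only suits moderately sized result sets.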
Have you tried scroll query?
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html much easier to deal with than pagination.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
Answer to the question in the comments:
So, would this be the right way to export large results for a
"report" type system? I'm not talking about the front end; I'm talking
about a back-end application that will execute a custom query and
build a file with 300000+ results
I'm sure there might be valid reasons for doing this, but to me it sounds like you're using a hammer to drive screws. Much of the point of using Elasticsearch is to use its aggregation features to do more of the computing in the data store.
Aggregations Documentation
If you really need the raw data of 300000 records, then that's what you need. However, if it's a report, that implies you're turning the data into metrics. Much of the point of ES is that it allows you to build "custom reports" on the fly. I suspect it will be much faster to put as much logic as you can into the query, rather than simply manipulating the raw data.
Without knowing more about the requirements, I can't come up with any better answer than that.
No, Elastic so far does not support this. The Elastic API uses a traditional request/response model: the query results are paginated, buffered on the server side, and sent back to the client. A true streaming read of the response body does not seem to be on the Elastic roadmap.
With that said, for big result sets the scroll API is no longer recommended for deep pagination, and it was never intended for real-time user queries. At the moment the best option is search_after, which can be seen as the equivalent of a cursor in a traditional RDBMS.

Resources