While horsing around with the CosmosDB .NET SDK, I found an intriguing issue.
I wanted to use the SDK LINQ support for the following query:
SELECT VALUE COUNT(1) FROM c WHERE <predicate>
Immediately after writing the LINQ query, I realized there might be no way to handle such queries as document queries. However, since document queries let you capture query metrics and unblock the thread between pages of results, I strongly prefer them. Here's the code:
client.CreateDocumentQuery<TEntity>(<collectionUri>).Where(<predicate>).Count()
Even though I understand that the result type of Count() isn't IQueryable, is there a way to handle "count" queries as document queries?
Of course there is a way to do that.
There is a CountAsync extension built into the SDK.
Simply construct your IQueryable and use .CountAsync() when you are ready to get the count.
Keep in mind, however, that no metrics will be collected, as the result is aggregated in the SDK.
If you really need metrics, you can use the usual DocumentQuery created from SQL rather than LINQ, run the while (HasMoreResults) / ExecuteNextAsync loop, and capture the metrics from each page of results, per iteration.
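The paging loop described above can be sketched in a language-neutral way. This is only an illustration in Python, not the Cosmos DB SDK: the Page class, fake_count_pages and count_with_metrics are hypothetical stand-ins for the pages returned by ExecuteNextAsync, and the partial counts per page are invented for the example (the real SDK may return the count in a single page).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Page:
    """Hypothetical shape of one result page: values plus the cost of that page."""
    values: List[int]
    request_charge: float


def fake_count_pages() -> Iterator[Page]:
    # Stand-in for calling ExecuteNextAsync while HasMoreResults is true.
    yield Page(values=[120], request_charge=2.5)
    yield Page(values=[80], request_charge=1.5)


def count_with_metrics(pages: Iterable[Page]) -> Tuple[int, float]:
    """Drain every page, summing the counts and the per-page metrics."""
    total = 0
    charge = 0.0
    for page in pages:
        total += sum(page.values)      # combine partial counts
        charge += page.request_charge  # capture metrics per iteration
    return total, charge


total, charge = count_with_metrics(fake_count_pages())
print(total, charge)
```

The point of the design is that the metrics are read inside the loop, once per page, which is exactly what CountAsync hides from you.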
I have to delete the queries I don't use any longer in Power Query. One way is to identify which query is not called by any other query (let's call it top query) and see if it is still needed. If it is not then I can delete it safely.
One way to achieve this is to go through each query and see whether it calls another query. If another query is called, then the called query is not a top query.
Besides this manual method, is there a better one?
If a query is being referenced by another query, Power Query won't let you delete it. That's the best way to trace things back, other than looking at the code.
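The manual check from the question (find the queries that no other query calls) is a small graph problem. Here is a minimal sketch in Python, assuming you have already written down by hand which queries each query references; the query names and the top_queries helper are hypothetical, and nothing here reads a Power Query workbook.

```python
from typing import Dict, List, Set


def top_queries(references: Dict[str, Set[str]]) -> List[str]:
    """Return the queries that are not referenced by any other query."""
    referenced: Set[str] = set()
    for name, deps in references.items():
        # A query that references itself does not count as being called.
        referenced.update(dep for dep in deps if dep != name)
    return sorted(name for name in references if name not in referenced)


# Hypothetical workbook: query name -> queries it calls.
example = {
    "Sales": {"RawSales", "Calendar"},
    "OldReport": {"RawSales"},
    "RawSales": set(),
    "Calendar": set(),
}
print(top_queries(example))
```

Each name in the result is a candidate for deletion if it is no longer needed; after deleting one, rerun the check, because the queries it called may have become top queries themselves.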
In Elasticsearch, can I associate each document with a (different) function that must be satisfied by parameters I supply on a search, in order to be returned on that search?
The functions I would particularly like to use involve a loop, some kind of simple branching (an if-statement or switch-statement), an array-like data structure, string comparisons, and simple boolean operators.
A couple of key notes here:
At query time:
- If you're looking to shape the relevancy function, meaning the actual relevancy score of each document, you could use a script score query.
- If you're only looking to filter out unwanted documents, you could use a script query that allows you to do just that.
Both of those solutions enable you to compare incoming query parameters against previously indexed values.
Take note that using scripts at query time can lead to increased memory usage and performance issues.
Elasticsearch can also apply a second batch of filtering rules to the actual query result, in the form of a post filter. That can come in handy if you're not in a position to stream-process the output at the API view level.
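To make the query-time options above concrete, here is a minimal sketch of a search body that combines a script query (as a filter) with a post filter, built as a plain Python dict. The field names price and color, the parameter min_price and the helper name are assumptions for illustration only; the body is not sent anywhere.

```python
import json
from typing import Any, Dict


def build_script_filter_search(min_price: float, color: str) -> Dict[str, Any]:
    """Build a search body: script query as a filter, plus a post_filter."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            # Compare an indexed value with an incoming parameter.
                            "source": "doc['price'].value >= params.min_price",
                            "params": {"min_price": min_price},
                        }
                    }
                }
            }
        },
        # Applied to the hits after the main query has run.
        "post_filter": {"term": {"color": color}},
    }


body = build_script_filter_search(min_price=10, color="red")
print(json.dumps(body, indent=2))
```

Passing values through params rather than concatenating them into the script source lets Elasticsearch reuse the compiled script across requests.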
At index time:
There is such a thing as script fields, which let you define a function that computes a result from other field values and incoming query parameters. Note, though, that they are evaluated for each hit at search time rather than stored with the document at index time. They can still be really powerful, and I think they might be what you are looking for.
I would not use those unless I needed the field values compared against query params. The reason is that I like my index process to be lean and fast, so I tend to compute those kinds of values at stream level, upstream of the actual bulk indexing request.
Although convenient, the results of those custom scripts are likely to be achievable with a combination of regular queries and filters. In each release, the Elasticsearch team adds new query and field types that let you do what you used to do via scripted queries without the risk of blowing out your memory. A good example of this is the rank feature datatype introduced in the 7.x release.
A piece of advice for you: think of your Elasticsearch service as a regular API in your data layer. As such, you can do query processing before the actual call to Elasticsearch, and you can do data processing on the results it returns. Only if you really can't fit your business rules in there should scripts be your last resort.
Feel free to contact me if you still have any questions. All the best.
Is it possible to use Slice via solrTemplate?
Actually, I am struggling to see if it will even make a difference, because even without using Spring there doesn't appear to be any way of telling Solr to exclude its "numFound" (total results) from a query.
And when I use a normal Spring Data Page<..> query and look under the hood, I only see one query issued to Solr, i.e. no extra one for the count. Or is the count simply done inside Solr somehow in an extra step?
confused
Total document count is part of the Solr query. No additional query is required. Therefore, there is no advantage to Slice vs. Page.
The only related concern is when somebody wants to export a significant amount of data, in which case built-in paging becomes slower the further into the result set you request. For that, Solr has exporting functionality.
Does Elasticsearch stream the query results as they are "calculated" or does it calculate everything and then return the final response back to the client?
By default Elasticsearch will only return a limited set of results for a query (e.g. searching for * returns only the default number of hits, regardless of how many documents match).
Generally, to implement "streaming", you make an initial search to get the total count of matching documents and then ask for documents in ranges (i.e. first 10, next 10, etc.).
See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html
for how to request the number of documents returned.
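The range-based approach described above can be sketched as a generator of request bodies using the from and size parameters. This is a minimal illustration in Python; page_bodies is a hypothetical helper, the total of 25 hits is made up, and no request is actually sent.

```python
from typing import Any, Dict, Iterator


def page_bodies(query: Dict[str, Any], total_hits: int, page_size: int) -> Iterator[Dict[str, Any]]:
    """Yield one search body per page: first N, next N, and so on."""
    for start in range(0, total_hits, page_size):
        yield {
            "query": query,
            "from": start,
            # The last page may be shorter than page_size.
            "size": min(page_size, total_hits - start),
        }


# total_hits would come from the initial search; 25 is an invented value.
pages = list(page_bodies({"match_all": {}}, total_hits=25, page_size=10))
for page in pages:
    print(page["from"], page["size"])
```

Keep in mind that the total can change between requests if documents are indexed or deleted while you page.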
Have you tried scroll query?
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html It is much easier to deal with than pagination.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
Answer to the question in the comments:
So the question is: would this be the right way to export large results for a
"report" type system? I'm not talking about the front end. I'm talking
about a back end application that will execute a custom query and
build a file with 300000+ results.
I'm sure there might be valid reasons for doing this, but to me it sounds like you're using a hammer to drive screws. Much of the point of using Elasticsearch is to use its aggregation features to do more of the computing in the data store.
Aggregations Documentation
If you really need the raw data of 300000 records, then that's what you need. However, if it's a report, that implies you're turning the data into metrics. Much of the point of ES is that it allows you to build "custom reports" on the fly. I suspect it will be much faster to put as much logic as you can into the query, rather than simply manipulating the raw data.
Without knowing more about the requirements, I can't come up with any better answer than that.
No, Elastic so far does not support this. The Elastic API uses a traditional request/response model. The query results are paginated, buffered on the server side, and sent back to the client. Truly reading the response body in a streaming fashion does not seem to be on the Elastic roadmap.
That said, for big result sets the scroll API is no longer recommended for deep pagination, and it was never intended for real-time user queries. At the moment the best option is search_after, which can be seen as the equivalent of a cursor in a traditional RDBMS.
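As a rough sketch of how search_after behaves like a cursor: each request repeats the same query and sort, and passes the sort values of the last hit from the previous page. The Python below only builds request bodies; the field names created_at and doc_id, the helper name and the fake hit are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional


def next_page_body(
    query: Dict[str, Any],
    sort: List[Dict[str, str]],
    size: int,
    last_hit: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the body for the next page, resuming after last_hit if given."""
    body: Dict[str, Any] = {"size": size, "query": query, "sort": sort}
    if last_hit is not None:
        # Each hit of a sorted search carries its sort values under "sort".
        body["search_after"] = last_hit["sort"]
    return body


# The sort needs a unique tiebreaker field so the cursor position is unambiguous.
sort = [{"created_at": "asc"}, {"doc_id": "asc"}]
first = next_page_body({"match_all": {}}, sort, size=100)

# Pretend this was the last hit returned for the first page.
fake_last_hit = {"_id": "a-42", "sort": [1700000000, "a-42"]}
second = next_page_body({"match_all": {}}, sort, size=100, last_hit=fake_last_hit)
print(second["search_after"])
```

A client would loop like this until a page comes back with no hits.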
Is there a guide to writing ES queries: what to do, what to avoid, this sort of stuff? The official site describes all the various ways to search, but provides little guidance as to when to select what.
In my particular instance I have a list of providers; each one has a name, an address, and a number of IDs. I want to give the user a box where he can type in anything he knows about the provider and run a search based on whatever is provided. Essentially, I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wild card matches, phonetic matches, synonyms (for names). Also some fuzziness should be included too.
The official site describes various ways to do that, but how do I combine them? For instance, to support wild card search, do I use a wild card query, or do I index with NGram and just do a text query?
With SQL queries, a sure way to get this sort of information is to check the execution plan for the query. If the SQL optimizer tells you that it will use a table scan against a table of considerable size, you know you should change your query or, maybe, add an index. AFAIK there is no equivalent of this powerful feature in ES, and I am not even sure it is possible to build one.
But at least some generic considerations...? Pretty please...
There is not a best way to go about doing things, because a lot of times it depends on what you are indexing, and how you map your data into variables within Elasticsearch.
Some rules of thumb that you should look out for:
a. Faceted Queries in Elasticsearch work in sequences:
{
  "query": {
    // data will be searched from this block first
  },
  "facets": {
    // after the data is received, it will be processed into facets
  }
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters work on a subset of your query results: Elasticsearch takes the entire result of your query and then filters in or out what you do or do not want.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you do a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries
Again, all these are rules of thumb and cannot be followed strictly, because what you index into specific variables will affect processing times. A string is different from a long, and strings with analyzers are different from strings without. What you probably need to do is experiment with your queries to get a better judgement.
One correction to the above: filters are cacheable by ES, and queries are not. Queries do the extra step of relevance scoring and full text search. So, wherever full text search is not needed, using a filter is advised.
Also, design your mappings with correct index values (not_analyzed, no, analyzed).
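The filter versus query distinction above can be sketched as a single bool query: the scored full text part goes under must, and the exact-match part that needs no scoring goes under filter. This is a minimal illustration in Python using the bool query of current Elasticsearch versions rather than the older filter syntax this thread was written against; the field names name and state and the helper name are assumptions based on the provider example in the question.

```python
from typing import Any, Dict


def provider_search(text: str, state: str) -> Dict[str, Any]:
    """Scored fuzzy match on the name, unscored exact filter on the state."""
    return {
        "query": {
            "bool": {
                # Full text search with relevance scoring and some fuzziness.
                "must": [
                    {"match": {"name": {"query": text, "fuzziness": "AUTO"}}}
                ],
                # Yes/no condition: no scoring needed, so use a filter.
                "filter": [
                    {"term": {"state": state}}
                ],
            }
        }
    }


body = provider_search("jon smith", "NY")
print(body["query"]["bool"]["filter"])
```

For the term filter to behave as an exact match, the filtered field should be mapped as a non-analyzed (keyword-style) value, which is where the mapping advice above comes in.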