GSA Search Result count difference for Sort by date Vs Sort by relevance - google-search-appliance

When we search for any term in GSA, the result counts for sort by date and sort by relevance are not the same. Is this the default behavior of GSA? Any help would be greatly appreciated.
Thanks,
Manju

The result count in GSA is an estimate rather than an actual count. You can force GSA to return an accurate count by adding the rc parameter to your query. With it you can get an accurate result count for up to 1M documents. You can read more about GSA result counting here
I am not sure why the result count differs between sort by date and sort by relevance. Try adding the rc parameter to both queries and see if that helps.
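As a rough sketch of that suggestion, the snippet below builds two search URLs that differ only in the sort parameter and both carry rc=1. The host name, the collection and front-end names, and the two sort values are assumptions for illustration; check them against your own appliance before relying on them.

```python
from urllib.parse import urlencode


def gsa_search_url(host, query, sort=None, accurate_count=True):
    """Build a GSA search URL, optionally requesting an accurate result count."""
    params = {
        "q": query,
        "site": "default_collection",   # assumed collection name
        "client": "default_frontend",   # assumed front end name
        "output": "xml_no_dtd",
    }
    if sort:
        params["sort"] = sort
    if accurate_count:
        params["rc"] = "1"  # ask for an accurate count instead of an estimate
    # Keep ':' unescaped so the sort value stays readable in the URL.
    return "http://{}/search?{}".format(host, urlencode(params, safe=":"))


# Hypothetical host; sort values assumed to mean "by date" and "by relevance".
by_date = gsa_search_url("gsa.example.com", "annual report", sort="date:D:S:d1")
by_relevance = gsa_search_url("gsa.example.com", "annual report", sort="date:D:L:d1")
print(by_date)
print(by_relevance)
```

Comparing the counts returned for the two URLs should show whether the difference comes from estimation alone.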
Regards,
Mohan

Related

Elasticsearch: why can't I do an accurate distinct count on a field?

I am trying to get a distinct count on an index in which a field value occurs more than once. I see differences between the SQL data and the Elasticsearch data. Why is that?
The Cardinality Aggregation gives a count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.
Refer to this to learn more about the Cardinality Aggregation.
You can also refer to this thread to learn more about how to count distinct values in Elasticsearch.
An accurate distinct count can be obtained from ES. You can refer to the article "Accurate Distinct Count and Values from Elasticsearch" for a complete solution and a comparison against Cardinality.
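One likely reason for the mismatch with SQL is that the cardinality aggregation returns an approximate count (it is based on the HyperLogLog++ algorithm), and raising precision_threshold trades memory for accuracy. A minimal sketch of such a request body follows; the field name user_id and the aggregation name distinct_values are hypothetical.

```python
import json


def distinct_count_body(field, precision_threshold=40000):
    """Request body for an approximate distinct count on `field`.

    Counts below precision_threshold are expected to be close to exact;
    40000 is the maximum value the aggregation supports.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "distinct_values": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


# "user_id" is a placeholder field name.
print(json.dumps(distinct_count_body("user_id"), indent=2))
```

If the number of distinct values is well above the threshold, some error is still to be expected, which is where the exact-count approaches mentioned above come in.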

Return documents with more terms in Elasticsearch

I'm working on a project that needs Elasticsearch to return the documents with more terms first. I know from the official guide that ES ranks the more exact documents first (that is, matching documents with fewer terms).
So is there any chance for me to sort in that way?
Thanks!
You can't sort by the number of terms in a document, so as you suggested, you'd need some sort of script to compute that and attach it to each document as a field.
Sorting on nested fields is not a problem, though. Just refer to sort fields using dot notation, e.g. outer_object.inner_object.number_of_terms
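A small sketch of what such a sort could look like, assuming the term count has already been computed and stored in each document under the field named above. The searched field body is a made-up name.

```python
import json


def search_sorted_by_term_count(text,
                                count_field="outer_object.inner_object.number_of_terms"):
    """Search body that sorts matches by a precomputed term-count field."""
    return {
        "query": {"match": {"body": text}},  # "body" is a placeholder field
        "sort": [
            {count_field: {"order": "desc"}},  # documents with more terms first
            "_score",                          # relevance breaks ties
        ],
    }


print(json.dumps(search_sorted_by_term_count("elastic search"), indent=2))
```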

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of either one. What is the difference between the two, and which one would give faster document retrieval and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don't have to calculate the relevance _score for each document: the answer is just a boolean "Yes, the document matches the filter" or "No, the document does not match the filter".
the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is whether you use the relevance score in any way. If not, filters are the way to go. If you do, filters may still be of use but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English and then the query could be applied to that subset.
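To make the combination concrete, here is a minimal sketch of a request body with a scored text query plus unscored filters for language and date range. It uses the bool query's filter clause, which took over from the filtered query in later Elasticsearch releases (the filtered query was deprecated in 2.0). The field names content and created_at are hypothetical, and language is assumed to be indexed as an exact, non-analyzed value.

```python
import json


def filtered_text_search(text, start, end, language="EN"):
    """Scored text match restricted by non-scoring language and date filters."""
    return {
        "query": {
            "bool": {
                # Scored part: contributes to the relevance _score.
                "must": [{"match": {"content": text}}],
                # Unscored part: plain yes/no matching, eligible for caching.
                "filter": [
                    {"term": {"language": language}},
                    {"range": {"created_at": {"gte": start, "lte": end}}},
                ],
            }
        }
    }


print(json.dumps(filtered_text_search("invoice", "2014-01-01", "2014-01-31"), indent=2))
```

If the relevance score is not needed at all, the must clause can be dropped and the range moved entirely into filter.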
I'm oversimplifying a bit, but those are the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html

Is there a way to remove/ignore matches that have a low score in elasticsearch?

I am getting results that have scores of over 40 for a search but I am also getting way more items with scores under 5. Is there a way to set a bottom threshold so I am not displaying these low-score results but only the high scoring ones? Or is there a better way of doing this? Thank you!
Have a look at the min_score option in the Search API.
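For illustration, a request body with min_score might look like the sketch below. The threshold of 5 comes from the question, while the field name content is a placeholder.

```python
import json


def search_with_min_score(text, min_score=5.0):
    """Search body that drops hits scoring below `min_score`."""
    return {
        "min_score": min_score,  # hits with a lower _score are excluded
        "query": {"match": {"content": text}},  # "content" is a placeholder field
    }


print(json.dumps(search_with_min_score("example search"), indent=2))
```

Scores are relative to each query, so a fixed cutoff that works for one search may be too strict or too loose for another.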
Hope this helps.

Sphinx - How to index only a limited number of words?

I have a limited number of industries (around 300), and I would like to create an index that gives the frequency of these keywords in the indexed documents. Is there any way to do this in Sphinx?
Not really.
But the --buildstops function of indexer will produce a list of the most common keywords in an index.
So you can just look at that output and compare it with your industry list. In theory, I would think your industries should be near the top of the list, so you don't have to make it too long.
There is a trick in Sphinx to get keyword statistics from the index. The BuildKeywords API call ( http://sphinxsearch.com/docs/current.html#api-func-buildkeywords ) with the hits flag set will return per-keyword frequencies from a given index.
Hope this helps
