Count distinct values in Elasticsearch

I have duplicate ids in my db and want to count distinct values only, similar to SELECT COUNT(DISTINCT column) FROM table in SQL.
public SearchSourceBuilder createQueryForCount(QueryBuilder queryBuilder, int start, boolean fetchSource, String field) {
    logger.info("Creating aggregation count ");
    QueryBuilder finalQuery = QueryBuilders.boolQuery().must(queryBuilder);
    AggregationBuilder aggregationCount = AggregationBuilders.terms("agg").field(USER_ID)
            .subAggregation(AggregationBuilders.topHits("top").explain(false).from(start))
            .subAggregation(AggregationBuilders.count("count").field(field));
    return new SearchSourceBuilder()
            .query(finalQuery)
            .fetchSource(fetchSource)
            .from(start)
            .aggregation(aggregationCount);
}
Is there a way to do a distinct count in Elasticsearch?

You should look at the cardinality aggregation. Javadocs are available here. Also, remember that Elasticsearch computes this count approximately, trading accuracy for performance. You can control the trade-off to some extent with precision_threshold. A good explanation is available here.

To get the count of distinct values in Elasticsearch, use the cardinality aggregation.
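As a minimal sketch of what such a request body might look like, the dependency-free Java snippet below builds the JSON for a cardinality aggregation. The aggregation name distinct_users, the field name user_id and the threshold 3000 are illustrative assumptions, not values from the question. With the Java client, the corresponding builder should be AggregationBuilders.cardinality(...).

```java
// Sketch: builds the JSON body of a search request that counts distinct values.
// "distinct_users", "user_id" and 3000 are hypothetical placeholders.
final class CardinalityRequest {

    // Returns a request body with a single cardinality aggregation.
    // "size": 0 skips the hits themselves, since only the count is wanted.
    // Assumes the names contain no characters that need JSON escaping.
    static String build(String aggName, String field, int precisionThreshold) {
        return "{\"size\":0,\"aggs\":{\"" + aggName + "\":{\"cardinality\":{"
                + "\"field\":\"" + field + "\","
                + "\"precision_threshold\":" + precisionThreshold + "}}}}";
    }

    public static void main(String[] args) {
        System.out.println(build("distinct_users", "user_id", 3000));
    }
}
```

In the response, the approximate distinct count would then be read from the value property of the named aggregation.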

Related

Single query to return documents sorted by distance based on one document's Id rather than its geopoint

I have an index in Elasticsearch which contains an Id field and a geopoint.
Right now, in order to get the nearest documents, I have to make two queries: one to get the original document by its id, and then another that uses its coordinates to do a geo sort. I was wondering if there is any way to execute this as a single query.
public IEnumerable<RestaurantSearchItem> GetNearbyRestaurants(double latitude, double longitude)
{
    var query = _elasticClient.Search<RestaurantSearchItem>(s =>
        s.Index(RestaurantSearchItem.IndexName)
            .Sort(
                ss => ss.GeoDistance(
                    g => g
                        .Field(p => p.Location)
                        .DistanceType(GeoDistanceType.Plane)
                        .Unit(DistanceUnit.Meters)
                        .Order(SortOrder.Ascending)
                        .Points(new GeoLocation(latitude, longitude)))));
    var nearByRestaurants = query.Documents;
    foreach (var restaurant in nearByRestaurants)
    {
        restaurant.Distance = Convert.ToDouble(query.Hits.Single(x => x.Id == restaurant.Id).Sorts.Single());
    }
    return nearByRestaurants;
}
I don't think it's possible to do this in one query; the latitude and longitude used for sorting can't be looked up from elsewhere in the data, so they need to be supplied in the request.
As far as I know, the only Elasticsearch query that accepts the id of a document as a parameter is the terms query, which can fetch the list of terms for the query from the given document.
But you want to find relevant documents based on location, not on exact terms.
This can be achieved by denormalizing your data, for example by storing the list of nearby restaurants in a nested field.
With denormalization, you have to pre-compute all nearby restaurants before inserting the document into the index.
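The pre-computation step could be sketched roughly as follows. This is only an illustration under stated assumptions: the Place class, the nearbyIds helper and the distance cutoff are hypothetical names, and the haversine formula is used here instead of the plane distance from the question's sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: compute the ids of nearby places before indexing a document.
final class NearbyPrecompute {

    // Hypothetical minimal restaurant record: id plus coordinates in degrees.
    static final class Place {
        final String id;
        final double lat;
        final double lon;

        Place(String id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }
    }

    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Great-circle distance in meters (haversine formula).
    static double distanceMeters(Place a, Place b) {
        double dLat = Math.toRadians(b.lat - a.lat);
        double dLon = Math.toRadians(b.lon - a.lon);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
    }

    // Ids of all other places within maxMeters of origin, nearest first.
    static List<String> nearbyIds(Place origin, List<Place> all, double maxMeters) {
        List<Place> near = new ArrayList<>();
        for (Place p : all) {
            if (!p.id.equals(origin.id) && distanceMeters(origin, p) <= maxMeters) {
                near.add(p);
            }
        }
        near.sort(Comparator.comparingDouble(p -> distanceMeters(origin, p)));
        List<String> ids = new ArrayList<>();
        for (Place p : near) {
            ids.add(p.id);
        }
        return ids;
    }
}
```

The resulting id list is what would be stored on the document. The cost of this design is that the list has to be recomputed whenever a restaurant is added, moved or removed.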

How to get the total documents count, containing a specific field, using aggregations?

I am moving from Elasticsearch 1.7 to 2.0. Previously, when calculating term facets, I also got the total count, which tells in how many documents the field exists. This is how I was doing it before:
TermsFacet termsFacet = (TermsFacet) facet;
termsFacet.getTotalCount();
It worked with multi-valued fields as well.
In the current version, the terms aggregation has nothing like a total count. I get the doc count inside each aggregation bucket, but summing those does not work for multi-valued fields, because a document with several values is counted in several buckets.
Terms termsAggr = (Terms) aggr;
for (Terms.Bucket bucket : termsAggr.getBuckets()) {
    String bucketKey = bucket.getKey();
    totalCount += bucket.getDocCount();
}
Is there any way to get the total count for the field from the terms aggregation?
I don't want to fire a separate exists filter query. I want the result in a single query.
I would use the exists query:
https://www.elastic.co/guide/en/elasticsearch/reference/2.x/query-dsl-exists-query.html
For instance to find the documents that contain the field user you can use:
{
"exists" : { "field" : "user" }
}
There is of course also a Java API:
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-term-level-queries.html#java-query-dsl-exists-query
QueryBuilder qb = existsQuery("name");
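One way the exists query might be folded into the same request as the terms aggregation is to wrap it in a filter aggregation. This goes beyond what the answer above states, so treat it as a sketch: the aggregation names by_value and has_field are hypothetical, and the snippet only builds the JSON body rather than using the client API.

```java
// Sketch: one request body with a terms aggregation plus a filter aggregation
// that counts the documents in which the field exists.
// "by_value" and "has_field" are hypothetical aggregation names.
final class FieldCountRequest {

    // Assumes the field name contains no characters that need JSON escaping.
    static String build(String field) {
        return "{\"size\":0,\"aggs\":{"
                + "\"by_value\":{\"terms\":{\"field\":\"" + field + "\"}},"
                + "\"has_field\":{\"filter\":{\"exists\":{\"field\":\"" + field + "\"}}}"
                + "}}";
    }

    public static void main(String[] args) {
        System.out.println(build("user"));
    }
}
```

The doc_count of the has_field aggregation would then give the number of documents containing the field, counting each document once even when the field is multi-valued.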

How can I multiply the score of two queries together in Elasticsearch?

In Solr I can use the query function to return a numerical score for a query, and I can use that in the context of a bf parameter, something like bf=product(query('cat'),query('dog')), to multiply two relevance scores together.
Elasticsearch has a search API that is generally more flexible to work with, but I can't figure out how to accomplish the same feat. I can use _score in a script_score function of a function_score query, but I can only use the _score of the main query. How can I incorporate the score of another query? How can I multiply the scores together?
You could script a TF*IDF scoring function using a function_score query. Something like this (ignoring Lucene's query and length normalization):
"script": "tf = _index[field][term].tf(); idf = (1 + log ( _index.numDocs() / (_index[field][term].df() + 1))); return sqrt(tf) * pow(idf,2)"
You'd take the product of those function results for 'cat' and 'dog' and add it to your original query score.
Here's the full query gist.
Alternatively, if something in that bf is heavyweight enough that you'd rather not run it across the entire set of matches, you could use rescore requests to modify the scores of the top N results of the original query, using subsequent scoring passes with your scoring queries (cat, dog, etc.).
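To make the arithmetic concrete, here is a small Java sketch of the per-term formula from the script above and of the "product added to the original score" combination. The class and method names are hypothetical, the term statistics are passed in as plain numbers, and floating-point division is assumed for numDocs / (df + 1).

```java
// Sketch of the scoring arithmetic only; nothing here talks to Elasticsearch.
final class TfIdfProduct {

    // Mirrors the script: sqrt(tf) * idf^2, with idf = 1 + ln(numDocs / (df + 1)).
    static double termScore(int tf, long df, long numDocs) {
        double idf = 1.0 + Math.log((double) numDocs / (df + 1));
        return Math.sqrt(tf) * Math.pow(idf, 2);
    }

    // Multiplies the two term scores and adds the product to the original score.
    static double combined(double originalScore, double catScore, double dogScore) {
        return originalScore + catScore * dogScore;
    }

    public static void main(String[] args) {
        // Made-up statistics for illustration.
        double cat = termScore(4, 0, 1);
        double dog = termScore(9, 0, 1);
        System.out.println(combined(1.5, cat, dog));
    }
}
```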

How to use the Elasticsearch Java API for dynamic searches?

So I'm trying to use elasticsearch for dynamic query building. Imagine that I can have a query like:
a = "something" AND b >= "other something" AND (c LIKE "stuff" OR c LIKE "stuff2" OR d BETWEEN "x" AND "y");
or like this:
(c>= 23 OR d<=43) AND (a LIKE "text" OR a LIKE "text2") AND f="text"
Should I use the QueryBuilder or the FilterBuilder, and how do you combine both? The official documentation says that for exact values we should use the filter approach, so I assume I should use filters for equality comparisons. What about dates and numbers? Should I use a filter or a query?
For the LIKE/equals problem on a number/number value I tried this:
@Field(type = String, index = FieldIndex.analyzed, pattern = "(\\d+\\/\\d+)|(\\d+\\/)|(\\d+)|(\\/\\d+)")
public String processNumber;
The pattern would deal with the structure number + slash + number, but also with a number alone and with number + slash.
But with either the term filter or the match query I can't restrict the hits to the exact structure, such as 20/2014; if I type 20 I still get hits with the term filter.
A query is the main component when you search for something; it takes ranking into account, along with features such as stemming and synonyms. A filter, on the other hand, just narrows down the result set you get from your query.
I suggest that if you don't care about ranking you use filters, because they are faster: they skip scoring and can be cached. Otherwise, use a query.
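As for combining AND/OR conditions dynamically, a bool query with nested must and should clauses is one common shape. The sketch below is only an illustration: it builds the JSON by hand with hypothetical helper names, approximates LIKE with a match query, and assumes the values need no JSON escaping. It encodes the first expression from the question.

```java
// Sketch: composing nested AND/OR conditions as a bool query in JSON.
final class BoolQuerySketch {

    // Exact-value condition, e.g. a = "something".
    static String term(String field, String value) {
        return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
    }

    // Analyzed text condition, used here as a rough stand-in for LIKE.
    static String match(String field, String text) {
        return "{\"match\":{\"" + field + "\":\"" + text + "\"}}";
    }

    // field >= value
    static String rangeGte(String field, String value) {
        return "{\"range\":{\"" + field + "\":{\"gte\":\"" + value + "\"}}}";
    }

    // from <= field <= to
    static String between(String field, String from, String to) {
        return "{\"range\":{\"" + field + "\":{\"gte\":\"" + from
                + "\",\"lte\":\"" + to + "\"}}}";
    }

    // occur is "must" for AND or "should" for OR.
    static String bool(String occur, String... clauses) {
        return "{\"bool\":{\"" + occur + "\":[" + String.join(",", clauses) + "]}}";
    }

    // a = "something" AND b >= "other something"
    //   AND (c LIKE "stuff" OR c LIKE "stuff2" OR d BETWEEN "x" AND "y")
    static String example() {
        return bool("must",
                term("a", "something"),
                rangeGte("b", "other something"),
                bool("should",
                        match("c", "stuff"),
                        match("c", "stuff2"),
                        between("d", "x", "y")));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Because each helper returns a complete clause, groups can be nested to any depth by passing the result of one bool call into another. On versions whose bool query supports a filter clause, the exact-value and range clauses could be passed with "filter" as the occur value when their ranking does not matter.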

Explain scoring in Lucene when sorting is involved

I am getting a null value in topDocs.scoreDocs for some documents while searching in a Lucene index.
Please explain the value in [ ] in topDocs.scoreDocs.
SortField sortFieldObj = new SortField(sortField, SortField.STRING, sortOrder);
Sort sort = new Sort(sortFieldObj);
TopDocs topDocs = searcher.search(query, null, sizeNeeded, sort);
Document docNew = searcher.doc(topDocs.scoreDocs[i].doc);
System.out.println(topDocs.scoreDocs[i]);
output:
doc=2 score=NaN[null]
doc=44 score=NaN[testString]
Well, the reason is that you indirectly told Lucene to ignore its document scores and use your own sort order. Scoring is normally used to pick the top docs, but you chose to retrieve docs in the sort order you specified, hence NaN. The value in [ ] is the sort field value for that hit.
If you want to force Lucene to give you scores when you specify your own sort order, use another overloaded method for search:
search(Query query, Filter filter, int n,
       Sort sort, boolean doDocScores, boolean doMaxScore)
If doDocScores is true, then the score of each hit will be computed and returned.
If doMaxScore is true, then the maximum score over all collected hits will be computed.
So you would do something like: searcher.search(query, null, sizeNeeded, sort, true, true);

Resources