Count query with PHP Elastica and Symfony2 FosElasticaBundle - elasticsearch

I'm on a Symfony 2.5.6 project using FosElasticaBundle (#dev).
In my project, i just need to get the total hits count of a request on Elastic Search. That is, i'm querying Elastic Search with a normal request, but through the special "count" URL:
localhost:9200/_search?search_type=count
Note the "search_type=count" URL param.
Here's the example query:
{
"query": {
"filtered": {
"query": {
"match_all": []
},
"filter": {
"bool": {
"must": [
{
"terms": {
"media.category.id": [
681
]
}
}
]
}
}
}
},
"sort": {
"published_at": {
"order": "desc"
}
},
"size": 1
}
The results contains a normal JSON response but without any documents in the hits part. From this response i easily get the total count:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 81,
"max_score": 0,
"hits": [ ]
}
}
Okay, hits.total == 81.
Now, i couldn't find any solution to do the same through FOSElasticaBundle, from a repository.
I tried this:
$query = (...) // building the Elastica query here
$count = $this->finder->findPaginated(
$query,
array(ES::OPTION_SEARCH_TYPE => ES::OPTION_SEARCH_TYPE_COUNT)
)->getNbResults();
But i get an Attempted to load class "Pagerfanta". I don't want Pagerfanta.
Then this:
$count = $this->finder->createPaginatorAdapter(
$query,
array(ES::OPTION_SEARCH_TYPE => ES::OPTION_SEARCH_TYPE_COUNT)
)->getTotalHits();
But it would always give me 0.
Would be easy if i had access to the Elastica Finder service from the repository (i could then get a ResultSet from the query search, and this ResultSet has a correct getTotalHits() method). But services from repository... you know.
Thank you for any help or clue!

I faced the same challenge, getting access to the searchable interface from inside the repo. Here's what I ended up with:
Create AcmeBundle\Elastica\ExtendedTransformedFinder. This just extends the TransformedFinder class and makes the searchable interface accessible.
<?php
namespace AcmeBundle\Elastica;
use FOS\ElasticaBundle\Finder\TransformedFinder;
class ExtendedTransformedFinder extends TransformedFinder
{
/**
* #return \Elastica\SearchableInterface
*/
public function getSearch()
{
return $this->searchable;
}
}
Make the bundle use our new class; in service.yml:
parameters:
fos_elastica.finder.class: AcmeBundle\Elastica\ExtendedTransformedFinder
Then in a repo use the getSearch method of our class and do what you want :)
class SomeSearchRepository extends Repository
{
public function search(/* ... */)
{
// create and set your query as you like
$query = Query::create();
// ...
// then run a count query
$count = $this->finder->getSearch()->count($query);
}
}
Heads up this works for me with version 3.1.x. Should work starting with 3.0.x.

Ok, so, here we go: it is not possible.
You cannot, as of version 3.1.x-dev (2d8903a), get the total matching document count returned by elastic search from FOSElasticaBundle, because this bundle does not expose this value.
The RawPaginatorAdapter::getTotalHits() method contains this code:
return $this->query->hasParam('size')
? min($this->totalHits, (integer) $this->query->getParam('size'))
: $this->totalHits;
which prevents to get the correct $this->totalHits without actually requiring any document. Indeed, if you set size to 0, to tell elasticsearch not to return any document, only meta information, RawPaginatorAdapter::getTotalHits() will return 0.
So FosElasticaBundle doesn't provide a way to know this total hits count, you could only do that through the Elastica library directly. Of course with the downisde that Elastica finders are natively available in \FOS\ElasticaBundle\Repository. You'd had to make a new service, do some injection, and inovke your service instead of the FOSElasticaBundle one for repositories... ouch.
I chose another path, i forked https://github.com/FriendsOfSymfony/FOSElasticaBundle and changed the method code as follow:
/**
* Returns the number of results.
*
* #param boolean $genuineTotal make the function return the `hits.total`
* value of the search result in all cases, instead of limiting it to the
* `size` request parameter.
* #return integer The number of results.
*/
public function getTotalHits($genuineTotal = false)
{
if ( ! isset($this->totalHits)) {
$this->totalHits = $this->searchable->search($this->query)->getTotalHits();
}
return $this->query->hasParam('size') && !$genuineTotal
? min($this->totalHits, (integer) $this->query->getParam('size'))
: $this->totalHits;
}
$genuineTotal boolean restores the elasticsearch behaviour, without introducing any BC break. I could also have named it $ignoreSize and use it the opposite way.
I opened a Pull Request: https://github.com/FriendsOfSymfony/FOSElasticaBundle/pull/748
We'll see! If that could help just one person i'd be happy already!

While, you can get the index instance as a service (fos_elastica.index.INDEX_NAME.TYPE_NAME) and ask for count() method.
Joan

Related

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": 123
}
}
}
]
}
},
"aggs": {
"my-global-agg" : {
"global" : {},
"aggs" : {
"my-filtering-all-but-children": {
"filter": {
"term": {
"my_parent_id": 123
}
},
"aggs": {
"my-returning-children": {
"top_hits": {
"_source": {
"includes": [
"my_child_field1_to_return",
"my_child_field2_to_return"
]
},
"size": 1000
}
}
}
}
}
}
}
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good, however, by default the maximum number of hits you can return in a top_hits aggregation is 100, if you try 1000 you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But, if there's a default, there's usually a good reason. I would take that as a hint that it might not be that great an idea to increase it too much.
So, if your parent documents have less than 100 children, why not, otherwise I'd seriously consider going another approach.

Elasticsearch From and Size on aggregation for pagination

First of all, I want to say that the requirement I want to achieve is working very well on SOLR 5.3.1 but not on ElasticSearch 6.2 as a service on AWS.
My actual query is very large and complex and it is working fine on kibana but not when I cross the from = 100 and size = 50 as it is showing error on kibana console,
What I know:
For normal search, the maximum from can be 10000 and
for aggregated search, the maximum from can be 100
If I cross that limit then I've to change the maximum limit which is not possible as I am using ES on AWS as a service OR I've use scroll API with scroll id feature to get paginated data.
The Scroll API works fine as I've used it to another part of my project but when I try the same Scroll with aggregation it is not working as expected.
Here with Scroll API, the first search gets the aggregated data but the second calling with scroll id not returns the Aggregated results only showing the Hits result
Query on Kibana
GET /properties/_search
{
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"published": true
}
},
{
"match": {
"country": "South Africa"
}
}
]
}
},
"aggs": {
"aggs_by_feed": {
"terms": {
"field": "feed",
"order": {
"_key": "desc"
}
},
"aggs": {
"tops": {
"top_hits": {
from: 100,
size: 50,
"_source": [
"id",
"feed_provider_id"
]
}
}
}
}
},
"sort": [
{
"instant_book": {
"order": "desc"
}
}
]
}
With Search on python: The problem I'm facing with this search, first time the search gets the Aggregated data along with Hits data but for next calling with scroll id it is not returning the Aggregated data only showing the Hits data.
if index_name is not None and doc_type is not None and body is not None:
es = init_es()
page = es.search(index_name,doc_type,scroll = '30s',size = 10, body = body)
sid = page['_scroll_id']
scroll_size = page['hits']['total']
# Start scrolling
while (scroll_size > 0):
print("Scrolling...")
page = es.scroll(scroll_id=sid, scroll='30s')
# Update the scroll ID
sid = page['_scroll_id']
print("scroll id: " + sid)
# Get the number of results that we returned in the last scroll
scroll_size = len(page['hits']['hits'])
print("scroll size: " + str(scroll_size))
print("scrolled data :" )
print(page['aggregations'])
With Elasticsearch-DSL on python: With this approach I'm struggling to select the _source fields names like id and feed_provider_id on the second aggs i.g tops->top_hits
es = init_es()
s = Search(using=es, index=index_name,doc_type=doc_type)
s.aggs.bucket('aggs_by_feed', 'terms', field='feed').metric('top','top_hits',field = 'id')
response = s.execute()
print('Hit........')
for hit in response:
print(hit.meta.score, hit.feed)
print(response.aggregations.aggs_by_feed)
print('AGG........')
for tag in response.aggregations.aggs_by_feed:
print(tag)
So my question is
Is it not possible to get data using from and size field on for the aggregated query above from=100?
if it is possible then please give me a hint with normal elasticsearch way or elasticsearch-dsl python way as I am not well known with elasticsearch-dsl and elasticsearch bucket, matric etc.
Some answer on SO told to use partition but I don't know how to use it on my scenario How to control the elasticsearch aggregation results with From / Size?
Some others says that this feature is not currently supported by ES (currently on feature request). If that's not possible, what else can be done in place of grouping in Solr?

"filtered query does not support sort" when using Hibernate Search

I'm trying to issue a query which includes sorting
from Hibernate Search 5.7.1.Final
to ElasticSearch 2.4.2.
When I'm using curl I get the results:
curl -XPOST 'localhost:9200/com.example.app.model.review/_search?pretty' -d '
{
"query": { "match" : { "authors.name" : "Puczel" } },
"sort": { "title": { "order": "asc" } }
}'
But when I issue the query from code:
protected static Session session;
public static void prepareSession()
{
SessionFactory sessionFactory = new Configuration().configure()
.buildSessionFactory();
session = sessionFactory.openSession();
}
...
protected static void testJSONQueryWithSort()
{
FullTextSession fullTextSession = Search.getFullTextSession(session);
QueryDescriptor query = ElasticsearchQueries.fromJson(
"{ 'query': { 'match' : { 'authors.name' : 'Puczel' } }, 'sort': { 'title': { 'order': 'asc' } } }");
List<?> result = fullTextSession.createFullTextQuery(query, Review.class).list();
System.out.println("\n\nSearch results for 'author.name:Puczel':");
for(Object object : result)
{
Review review = (Review) object;
System.out.println(review.toString());
}
}
I get an Exception:
"[filtered] query does not support [sort]"
I understand where it comes from, because the query
that Hibernate Search issues is different than my curl query
- specifying the type is realised differently:
{
"query":
{
"filtered":
{
"query":
{
"match":{"authors.name":"Puczel"}
},
"sort":{"title":{"order":"asc"}},
"filter":{"type":{"value":"com.example.app.model.Review"}}
}
}
}
But I don't know how to change it.
I tried using the sort example from Hibernate documentation:
https://docs.jboss.org/hibernate/search/5.7/reference/en-US/html_single/#__a_id_elasticsearch_query_sorting_a_sorting
But the example is not full. I don't know:
which imports to use (there are multiple matching),
what are the types of the undeclared variables, like s,
how to initalise the variable luceneQuery.
I will appreciate any remarks on this.
Yes, as mentioned in the javadoc of org.hibernate.search.elasticsearch.ElasticsearchQueries.fromJson(String):
Note that only the 'query' attribute is supported.
So you must use the Hibernate Search API to perform sorts.
which imports to use (there are multiple matching),
Sort is the one from Lucene (org.apache.lucene), List is from java.util, and all the other imports should be from Hibernate Search (org.hibernate.search).
what are the types of the undeclared variables, like s
s is a FullTextSession retrieved through org.hibernate.search.Search.getFullTextSession(Session). It will also work with a FullTextEntityManager retrieved through org.hibernate.search.jpa.Search.getFullTextEntityManager(EntityManager).
how to initalise the variable luceneQuery
You'll have to use the query builder (qb):
Query luceneQuery = qb.keyword().onField("authors.name").matching("Puczel").createQuery();
If you intend to use the Hibernate Search API, and you're not comfortable with it yet, I'd recommend reading the general documentation first (not just the Elasticsearch part, which only mentions Elasticsearch specifics): https://docs.jboss.org/hibernate/search/5.7/reference/en-US/html_single/#search-query

Different set of results for "significant terms" in Elasticsearch using REST Api or Transportclient

We use the new significant terms plugin in elasticsearch. Using the transport client I get less results compared to that when I use the REST API. I don't understand why. Using the node client is unfortunately not possible, since my service using ES is not in the same network. Why are the results different?
Here is the REST call:
POST /searchresults_sharded/article/_search
{
"query": {
"match": {
"titlebody": {
"query": "japanische hundenamen",
"operator": "and"
}
}
},
"aggregations": {
"searchresults": {
"significant_terms": {
"field": "titlebody",
"size": 100
}
}
}
}
and here the scala request building code:
val builder = reqBuilder.searchReqBuilder
builder.setIndices(indexCoords.indexName)
builder.setTypes(indexCoords.typeName)
builder.setQuery(QueryBuilders.matchQuery(indexCoords.field, keywords.mkString(" ")).operator(MatchQueryBuilder.Operator.AND))
val sigTermAggKey: String = "significant-term"
val sigTermBuilder = new SignificantTermsBuilder(sigTermAggKey)
sigTermBuilder.field(indexCoords.field)
sigTermBuilder.size(size)
builder.addAggregation(sigTermBuilder)
I used toString on the Builder and found out: Reason was the different size parameter of the two requests. Both request sizes (20 and 100) were bigger than the returned signif.term-aggregation bucket size (7 compared to 1) but it seems that the query size param has an impact on the returned size even if it's far below the query size parameter

Filter facet returns count of all documents and not range

I'm using Elasticsearch and Nest to create a query for documents within a specific time range as well as doing some filter facets. The query looks like this:
{
"facets": {
"notfound": {
"query": {
"term": {
"statusCode": {
"value": 404
}
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
In the specific case, the total hits of the search is 21 documents, which fits the documents within that time range in Elasticsearch. But the "notfound" facet returns 38, which fits the total number of ErrorDocuments with a StatusCode value of 404.
As I understand the documentation, facets collects data from withing the search. In this case, the "notfound" facet should never be able to return a count higher that 21.
What am I doing wrong here?
There's a distinct difference between filter/query/filtered_query/facet filter which is good to know.
Top level filter
{
filter: {}
}
This acts as a post-filter, meaning it will filter the results after the query phase has ended. Since facets are part of the query phase filters do not influence the documents that are facetted over. Filters do not alter score and are therefor very cacheable.
Top level query
{
query: {}
}
Queries influence the score of a document and are therefor less cacheable than filters. Queries run in the query phase and thus also influence the documents that are facetted over.
Filtered query
{
query: {
filtered: {
filter: {}
query: {}
}
}
}
This allows you to run filters in the query phase taking advantage of their better cacheability and have them influence the documents that are facetted over.
Facet filter
"facets" : {
"<FACET NAME>" : {
"<FACET TYPE>" : {
...
},
"facet_filter" : {
"term" : { "user" : "kimchy"}
}
}
}
this allows you to apply a filter to the documents that the facet is run over. Remember that the it'll be a combination of the queryphase/facetfilter unless you also specify global:true on the facet as well.
Query Facet/Filter Facet
{
"facets" : {
"wow_facet" : {
"query" : {
"term" : { "tag" : "wow" }
}
}
}
}
Which is the one that #thomasardal is using in this case which is perfectly fine, it's a facet type which returns a single value: the query hit count.
The fact that your Query Facet returns 38 and not 21 is because you use a filter for your time range.
You can fix this by either doing the filter in a filtered_query in the query phase or apply a facet filter(not a filter_facet) to your query_facet although because filters are cached better you better use facet filter inside you filter facet.
Confusingly Filter Facets are specified using .FacetFilter() on the search object. I will change this in 1.0 to avoid future confusion.
Sadly: .FacetFilter() and .FacetQuery() in NEST do not allow you to specify a facet filter like you can with other facets:
var results = typedClient.Search<object>(s => s
.FacetTerm(ft=>ft
.OnField("myfield")
.FacetFilter(f=>f.Term("filter_facet_on_this_field", "value"))
)
);
You issue here is that you are performing a Filter Facet and not a normal facet on your query (which will follow the restrictions applied via the query filter). In the JSON, the issue is because of the "query" between the facet name "notfound" and the "terms" entry. This is telling Elasticsearch to run this as a separate query and facet on the results of this separate query and not your main query with the date range filter. So your JSON should look like the following:
{
"facets": {
"notfound": {
"term": {
"statusCode": {
"value": 404
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
Since I see you have this tagged with NEST as well, in your call using NEST, you are probably using FacetFilter on your search request, switch this to just Facet to get the desired result.

Resources