Elasticsearch: get total of GROUP BY and COUNT

I'm using the Elasticsearch Java client and I want to work out how many distinct items match a combination. I'm currently using facets like this:
client.prepareSearch(indexName)
  .addFacet(Efb.termsFacet("hotels")
    .field("hotel.specialCode")
    .script("term + ';;;' + _source.hotel.name + ';;;' + _source.offer.destinationId")
    .facetFilter(filter)
    .size(50))
I then get the total using:
case tf: TermsFacet =>
  val entries = tf.getEntries
  val totalHotels = entries.size()
This is wrong because it counts the entries returned, which my query limits to 50, so this number can never exceed 50.
How can I get the total number of facet entries without pulling them all back into memory? Or is there a better way to do the equivalent of a SQL GROUP BY and COUNT in Elasticsearch?
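If you can move off facets, a cardinality aggregation returns the distinct count directly, without materializing the buckets. A minimal sketch against the Java client, assuming Elasticsearch 1.1+ (which introduced the aggregation); the aggregation names are illustrative and the count is approximate (HyperLogLog-based, tunable with precisionThreshold):
import org.elasticsearch.search.aggregations.AggregationBuilders
import org.elasticsearch.search.aggregations.bucket.filter.Filter
import org.elasticsearch.search.aggregations.metrics.cardinality.Cardinality

// Scope the count with the same filter the facet used, then count
// distinct keys without returning any buckets.
val response = client.prepareSearch(indexName)
  .setSize(0) // no hits needed, only the aggregation result
  .addAggregation(
    AggregationBuilders.filter("hotels_filter").filter(filter)
      .subAggregation(
        AggregationBuilders.cardinality("distinct_hotels")
          .field("hotel.specialCode")))
  .execute().actionGet()

val totalHotels = response.getAggregations.get[Filter]("hotels_filter")
  .getAggregations.get[Cardinality]("distinct_hotels").getValue
If the real grouping key is the scripted concatenation rather than the bare field, cardinality also accepts a script in place of a field.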

Related

Match query with relationship is taking too long to retrieve results. Does this mean we need to upgrade Neo4j or the allocated memory?

I'm trying to understand why the query below takes so long to retrieve results. I have mocked up the values used, but the query is otherwise accurate and returns 40 records (the a node has 8 distinct values and the z node has 5, so 40 combinations in total). It takes 2.5 minutes to return those 40 records. Please let me know what the issue is here; I suspect the Neo4j version and the infrastructure we're running in production.
After this query we run algo.kShortestPaths.stream, so the whole thing together takes more than 5 minutes. What do you suggest? Is there no other way to handle such combinations (more than 40 a and z node combinations) within 5 minutes?
Infrastructure details: Neo4j 3.5 community edition
2 separate datacenters, sync job; 64GB mem, 16GB CPU, 4 cores
Cypher Query:
MATCH (s:SiteNode {siteName: 'siteName1'})-[rl:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName AND toUpper(a.networkType) = 'networkType1' AND NOT (toUpper(a.equipmentTid) CONTAINS 'TEST')
WITH a.equipmentTid AS tid_A
MATCH pp = (a:EquipmentNode)-[rel:CONNECTED_TO]-(a1:EquipmentNode)
WHERE a.equipmentTid = tid_A AND ALL( t IN relationships(pp)
WHERE t.type IN ['Type1'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId1'] AND t.status IN ['status1', 'status2'] )
WITH a
MATCH (d:SiteNode {siteName: 'siteName2'})-[rl:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName AND toUpper(z.networkType) = 'networkType2' AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
WITH z.equipmentTid AS tid_Z, a
MATCH pp = (z:EquipmentNode)-[rel:CONNECTED_TO]-(z1:EquipmentNode)
WHERE z.equipmentTid=tid_Z AND ALL(t IN relationships(pp)
WHERE t.type IN ['Type2'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId2'] AND t.status IN ['status1', 'status2'])
WITH DISTINCT z, a
RETURN a.equipmentTid, z.equipmentTid
This query was built to handle small cases, up to 4 total a and z node combinations, but today we might have more than 10, 40, or 100 combinations, so it is timing out. I'm not sure if there's a better way to write the query to improve performance, assuming the community edition is good enough for our case.
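One thing that stands out in queries shaped like this: carrying only a.equipmentTid through WITH and then re-MATCHing (a:EquipmentNode) by that property forces a fresh lookup (or a label scan, if equipmentTid is not indexed) per row, instead of reusing the node that is already bound. Since each path pp here is a single hop, the ALL(...) predicate can also collapse into a plain relationship predicate. A sketch of the same logic keeping the nodes bound, untested against this data model and using the placeholder values from the question:
MATCH (s:SiteNode {siteName: 'siteName1'})-[:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName
  AND toUpper(a.networkType) = 'networkType1'
  AND NOT toUpper(a.equipmentTid) CONTAINS 'TEST'
MATCH (a)-[t:CONNECTED_TO]-(:EquipmentNode)
WHERE t.type = 'Type1' AND t.networkId = 'networkId1'
  AND t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels
  AND t.status IN ['status1', 'status2']
WITH DISTINCT a
MATCH (d:SiteNode {siteName: 'siteName2'})-[:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName
  AND toUpper(z.networkType) = 'networkType2'
  AND NOT toUpper(z.equipmentTid) CONTAINS 'TEST'
MATCH (z)-[t2:CONNECTED_TO]-(:EquipmentNode)
WHERE t2.type = 'Type2' AND t2.networkId = 'networkId2'
  AND t2.totalChannels > 0 AND t2.totalChannelsUsed < t2.totalChannels
  AND t2.status IN ['status1', 'status2']
RETURN DISTINCT a.equipmentTid, z.equipmentTid
Running the original and the rewrite under PROFILE should show whether the per-row lookups were the bottleneck; an index on :SiteNode(siteName) also matters for the two anchor matches.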

NRediSearch - Getting total documents matched count

Is there a way to get the total results count when calling the Aggregate function?
Note that I'm not using the Aggregate function to aggregate results, but as an advanced search query, because the Search function does not allow sorting by multiple fields.
RediSearch returns the total count of matched documents, but I can't find a way to get this number using the NRediSearch library.
With NRediSearch
Using NRediSearch, you need to build and execute an aggregation that runs a GROUPBY 0 with the COUNT reducer. Say you have a person-idx index and you want to count all the Person documents in Redis:
var client = new Client("person-idx", muxer.GetDatabase());
var result = await client.AggregateAsync(
    new AggregationBuilder()
        .GroupBy(new List<string>(), new List<Reducer> { Reducers.Count() }));
Console.WriteLine(result.GetResults().First().Values.First());
This will print the count you are looking for.
With Redis.OM
There's a newer library, Redis.OM, which makes these aggregations a bit simpler; the same operation looks like this:
var peopleAggregations = provider.AggregationSet<Person>();
Console.WriteLine(peopleAggregations.Count());

Elasticsearch 5 - Field distinct values without aggregation

I'm working with a time-based index storing syslog events.
All the data is coming from different sources (PCs).
Suppose I have this kind of events:
timestamp = 0, source = PC-1, event = event_type_1
timestamp = 1, source = PC-1, event = event_type_1
timestamp = 1, source = PC-2, event = event_type_1
I want to write a query that retrieves all the distinct values of the "source" field across documents matching event = event_type_1.
I expect exact values (no approximations).
To achieve this I have written a cardinality query plus an aggregation specifying the matching size, because I have no prior knowledge of the number of distinct sources. I think this is expensive work, as it consumes a lot of memory.
Is there any other alternative to get this done?
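One alternative on 5.x (5.2 and later) is to partition the terms aggregation: each request returns only one slice of the distinct values, so memory stays bounded, and you page through the partitions client-side. A sketch, assuming the source field is indexed as a keyword and using illustrative index and partition numbers:
GET /syslog-index/_search
{
  "size": 0,
  "query": { "term": { "event": "event_type_1" } },
  "aggs": {
    "distinct_sources": {
      "terms": {
        "field": "source",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 1000
      }
    }
  }
}
Repeating the request for partitions 0 through 19 and unioning the buckets yields the exact distinct values; each partition holds roughly a twentieth of them, so size only has to cover one slice.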

Slicing neo4j Cypher results in chunks

I want to slice Cypher results in chunks of 100 rows, and be able to retrieve a specific chunk.
At the moment, the only way to ensure that rows are not mixed up is to use ORDER BY, which makes the query very inefficient (3 sec is too much for me):
MATCH (p:Person) RETURN p.id ORDER BY p.id SKIP {chunk}*100 LIMIT 100
where {chunk} is an external parameter to identify a specific chunk.
Any suggestions?
PS: the property p.id is indexed.
You may try something like adding a label to the Person nodes before extracting chunks, and then using a query like:
MATCH (p:Chunk:Person)
WITH p LIMIT 100
REMOVE p:Chunk
RETURN *
If the p.id values are unique and dense (say, the value starts at 1 and increments, without any gaps), then this query will take advantage of the index on :Person(id) to efficiently get each hundred-Person chunk:
WITH (({chunk} - 1) * 100 + 1) AS startId
MATCH (p:Person)
WHERE p.id IN RANGE(startId, startId + 99)
RETURN p.id
ORDER BY p.id
Now, practically speaking, your id space will probably not remain dense, even if it started out that way. Person nodes will be deleted over time. In that case, the above query can return fewer than 100 rows. So, you can make your chunk size bigger than 100 and do some post-processing to get the 100 you need. In the worst case, you may need to make multiple requests to get the 100 you need, but each request will be fast. (Ideally, you would want to assign no-longer-used id values to new Person nodes, to fill the gaps in the id space, but this would require you to scan for the gaps.)
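If the id space does have gaps, another option that still uses the index is keyset pagination: instead of a chunk number, pass the highest id seen in the previous chunk and range-seek past it. A sketch, with {lastSeenId} as an illustrative parameter (0, or anything below the minimum id, for the first chunk):
MATCH (p:Person)
WHERE p.id > {lastSeenId}
RETURN p.id
ORDER BY p.id
LIMIT 100
This returns exactly 100 rows regardless of gaps (until the data runs out), at the cost of fetching chunks in order rather than addressing an arbitrary chunk directly.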

mongoDB geoNear command with count

I am using the geoNear command with mongoid in order to retrieve a document collection ordered by distance. I need the distance for each document in the collection, which is why I am having to resort to the geoNear command.
Given the following command:
category_ids = ["list", "of", "ids"]
cmd = Hash.new
cmd[:geoNear] = :poi
cmd[:near] = [params[:location][:x], params[:location][:y]]
cmd[:query] = {
  "$or" => [
    {primary_category_id: {"$in" => category_ids}},
    {category_ids: {"$in" => category_ids}}
  ]
}
cmd[:spherical] = true
cmd[:num] = num
res = Poi.collection.database.command cmd
My problem is that I require the total number of matching documents. Sure, I could run another query that just counts the documents satisfying the query part of the command, but that would be inefficient and not very extensible, as every change I make to the command would have to be mirrored in the count query. Just adding a maxDistance would land me in a whole heap of trouble.
Another option would be to go with find and calculate the distance manually, but again I would like to avoid that.
So my question is: is there a clever way of getting the number of documents matched by the command (ignoring num), without running a separate query or calculating the distance manually and going with find?
You can use $facet for this: run $geoNear first, then a $facet stage in which one sub-pipeline returns the documents and another groups by _id: null (or simply uses $count) to get the total number of matched documents.
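A sketch of that pipeline through the aggregation framework (MongoDB 3.4+, which introduced both $facet and $count), reusing the names from the question; the num cap inside $geoNear is an illustrative value:
category_filter = {
  "$or" => [
    {primary_category_id: {"$in" => category_ids}},
    {category_ids: {"$in" => category_ids}}
  ]
}

pipeline = [
  # $geoNear must be the first stage; it adds the computed distance field.
  {"$geoNear" => {
    "near" => [params[:location][:x], params[:location][:y]],
    "distanceField" => "distance",
    "spherical" => true,
    "query" => category_filter,
    "num" => 100_000 # pre-4.2, $geoNear returns at most 100 docs by default
  }},
  # Two sub-pipelines over the same $geoNear output:
  # the page of documents, and the total match count.
  {"$facet" => {
    "results" => [{"$limit" => num}],
    "total" => [{"$count" => "count"}]
  }}
]

res = Poi.collection.aggregate(pipeline).first
Here res["results"] holds the page and res["total"].first["count"] the overall match count (total is an empty array when nothing matches).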
