Using a single query for multiple searches in ElasticSearch - elasticsearch

I have a dataset with documents that are identifiable by three fields, let's say "name", "timestamp" and "country". I use elasticsearch-dsl-py, but I can read native Elasticsearch queries, so I can accept those as answers as well.
Here's my code to get a single document by the three fields:
def get(name, timestamp, country):
    search = Item.search()
    search = search.filter("term", name=name)
    search = search.filter("term", timestamp=timestamp)
    search = search.filter("term", country=country)
    search = search[:1]
    return search.execute()[0]
This is all good, but sometimes I'll need to get 200+ items and calling this function means 200 queries to ES.
What I'm looking for is a single query that will take a list of the three field-identifiers and return all the documents matching it, no matter the order.
I've tried using ORs + ANDs but unfortunately the performance is still poor, although at least I'm not making 200 round trips to the server.
from elasticsearch_dsl import Q

def get_batch(list_of_identifiers):
    search = Item.search()
    batch_query = None
    for ref in list_of_identifiers:
        # AND the three fields together for one item
        sub_query = Q("match", name=ref["name"])
        sub_query &= Q("match", timestamp=ref["timestamp"])
        sub_query &= Q("match", country=ref["country"])
        # OR the per-item queries together
        if batch_query is None:
            batch_query = sub_query
        else:
            batch_query |= sub_query
    search = search.filter(batch_query)
    return search.scan()
Is there a faster/better approach to this problem?
Is using a multi-search going to be the faster option than using should/musts (OR/ANDs) in a single query?
EDIT: I tried multi-search and there was virtually no difference in the time. We're talking about seconds here. For 6 items it takes 60ms to get the result, for 200 items we're talking about 4-5 seconds.
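For reference, the OR-of-ANDs batch lookup described above can also be written as a native query body. The sketch below is only an illustration: the function name `build_batch_query` is hypothetical, it assumes the three fields are mapped so that exact `term` matching works (as in the single-item `get` above), and it assumes each triple identifies at most one document when setting `size`. Whether it is faster than the `match`-based version would have to be measured.

```python
def build_batch_query(identifiers):
    """Build one request body: OR across items, AND across the three fields."""
    should = [
        {
            "bool": {
                # filter context: exact matching, no relevance scoring
                "filter": [
                    {"term": {"name": ref["name"]}},
                    {"term": {"timestamp": ref["timestamp"]}},
                    {"term": {"country": ref["country"]}},
                ]
            }
        }
        for ref in identifiers
    ]
    return {
        "query": {
            "bool": {
                "filter": [
                    {"bool": {"should": should, "minimum_should_match": 1}}
                ]
            }
        },
        # Assumption: one document per (name, timestamp, country) triple.
        "size": len(identifiers),
    }


refs = [
    {"name": "a", "timestamp": "2020-01-01T00:00:00", "country": "US"},
    {"name": "b", "timestamp": "2020-01-02T00:00:00", "country": "DE"},
]
body = build_batch_query(refs)
```

The resulting `body` is a plain dict, so it can be inspected or sent through whichever client is in use.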

Related

NRediSearch - Getting total documents matched count

Is there a way to get a total results count when calling the Aggregate function?
Note that I'm not using the Aggregate function to aggregate results, but as an advanced search query, because the Search function does not allow sorting by multiple fields.
RediSearch returns the total count of matched documents, but I can't find a way to get this number using the NRediSearch library.
With NRediSearch
Using NRediSearch, you need to build and execute an aggregation that runs a GROUPBY 0 with the COUNT reducer. Say you have a person-idx index and you want to count all the Person documents in Redis:
var client = new Client("person-idx", muxer.GetDatabase());
var result = await client.AggregateAsync(new AggregationBuilder().GroupBy(new List<string>(), new List<Reducer>{Reducers.Count()}));
Console.WriteLine(result.GetResults().First().Values.First());
This gets the count you are looking for.
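As a rough illustration of what the GROUPBY 0 plus COUNT aggregation above corresponds to at the command level, here is a sketch that only assembles the arguments of a raw FT.AGGREGATE call. The helper name `count_all_args` and the `total` alias are hypothetical, and `person-idx` is the index name from the example above. The list could be passed to a client's generic command method (for example redis-py's `execute_command`); nothing here talks to a server.

```python
def count_all_args(index_name, alias="total"):
    """Arguments for an FT.AGGREGATE call that counts every document in an index."""
    return [
        "FT.AGGREGATE", index_name, "*",  # "*" matches all documents
        "GROUPBY", "0",                   # zero group-by properties: one single group
        "REDUCE", "COUNT", "0",           # COUNT reducer takes zero arguments
        "AS", alias,                      # name of the count in the result row
    ]


args = count_all_args("person-idx")
```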
With Redis.OM
There's a newer library, Redis.OM, which makes these aggregations a bit simpler. The same operation would be done with the following:
var peopleAggregations = provider.AggregationSet<Person>();
Console.WriteLine(peopleAggregations.Count());

Result number for Boolean queries with Apache Lucene

When benchmarking Apache Lucene v7.5 I noticed a strange behavior:
I indexed the English Wikipedia dump (5,677,776 docs) using Lucene with the SimpleAnalyzer (no stopwords, no stemming).
Then I searched the index with the following queries:
the          totalHits=5,382,873
who          totalHits=1,687,254
the who      totalHits=5,411,305
"the who"    totalHits=8,827
The result count for the Boolean query the who is larger than both the result count for the single term the and the result count for the single term who, when it should be smaller than both.
Is there an explanation for that?
Code snippet:
Analyzer analyzer = new SimpleAnalyzer();
MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[]{"title", "content", "domain", "url"}, analyzer);
// Parse
Query q = parser.parse(querystr);
// top-10 results
int hitsPerPage = 10;
IndexReader indexReader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(indexReader);
// Ranker
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
// Search
searcher.search(q, collector);
// Retrieve the top-10 documents
TopDocs topDocs = collector.topDocs();
ScoreDoc[] hits = topDocs.scoreDocs;
long totalHits = topDocs.totalHits;
System.out.println("query: " + querystr + " " + hits.length + " " + String.format("%,d", totalHits));
The explanation is that the default operator is OR and not AND as you assume. Searching for the who returns documents that have either the or who or both.
the - 5,382,873
who - 1,687,254
the OR who - 5,411,305
That is, most documents that contain who also contain the, except for 28,432 documents, which are added to the result set when you retrieve both.
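The arithmetic behind these numbers can be checked with inclusion-exclusion, assuming the three hit counts above are document counts from the same index and parser (the variable names are just for illustration):

```python
the_hits = 5_382_873      # documents matching "the"
who_hits = 1_687_254      # documents matching "who"
either_hits = 5_411_305   # documents matching "the OR who"

# |A and B| = |A| + |B| - |A or B|
both_hits = the_hits + who_hits - either_hits
who_only = either_hits - the_hits   # contain "who" but not "the"
the_only = either_hits - who_hits   # contain "the" but not "who"

print(f"both terms: {both_hits:,}")   # both terms: 1,658,822
print(f"only who:   {who_only:,}")    # only who:   28,432
print(f"only the:   {the_only:,}")    # only the:   3,724,051
```

So an AND query over the same terms would be expected to return about 1,658,822 hits, which is indeed smaller than either single-term count.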
You can change this behavior by changing the default operator:
parser.setDefaultOperator(QueryParserBase.AND_OPERATOR);

Searching without duplication - aggregations and tophit

I am just starting with ElasticSearch and really like it; however, I am stuck on a quite simple scenario.
I am indexing the following Worker structure:
NAME SURENAME ID AGE SEX NAME_SURENAME BIRTH_DATE
NAME_SURENAME - not analyzed - this field is indexed for grouping purposes
NAME, SURENAME - analyzed
The task is simple: find 5 unique workers sorted by birth_date (unique by name and surname; workers with the same name and surname count as duplicates even if they have different ages and are different people).
I read about aggregation queries and, as I understand it, aggregations alone return only buckets, without documents. Unfortunately, I aggregate by name and surname, so the buckets won't contain other fields, not even the document ID. But I also read about the TopHit aggregation, which returns documents, and I tried it; that is the second idea below.
I have two ideas:
1) Don't use aggregations: search for 5 workers, filter duplicates in Java, then search again and filter duplicates in Java until I reach 5 unique results.
2) Use aggregations. I even tried it as shown below, and it works on test data, but since this is my first time, please advise whether it works by accident or is done correctly. Generally, I thought I could get 5 buckets with one TopHit document each. I have no idea how the TopHit document is chosen, but it seems to work. The code is below.
String searchString = "test";
BoolQueryBuilder query = boolQuery().minimumNumberShouldMatch(1)
        .should(matchQuery("name", searchString))
        .should(matchQuery("surename", searchString));
TermsBuilder terms = AggregationBuilders.terms("namesAgg").size(5);
terms.field("name_surename");
// Order buckets by their newest birth_date and keep one top hit per bucket
terms.order(Terms.Order.aggregation("birthAgg", false))
        .subAggregation(AggregationBuilders.max("birthAgg").field("birth_date"))
        .subAggregation(AggregationBuilders.topHits("topHit").setSize(1).addSort("birth_date", SortOrder.DESC));
SearchRequestBuilder searchRequestBuilder = client.prepareSearch("workers")
        .addAggregation(terms).setQuery(query).setSize(1)
        .addSort(SortBuilders.fieldSort("birth_date").order(SortOrder.DESC));
Terms aggregations = searchRequestBuilder.execute().actionGet().getAggregations().get("namesAgg");
List<Worker> results = new ArrayList<>();
for (Terms.Bucket bucket : aggregations.getBuckets()) {
    Optional<Aggregation> first = bucket.getAggregations().asList().stream().filter(aggregation -> aggregation instanceof TopHits).findFirst();
    SearchHit searchHitFields = ((TopHits) first.get()).getHits().getHits()[0];
    Transformer<SearchHit, Worker> transformer = transformers.get(Worker.class);
    Worker transform = transformer.transform(searchHitFields);
    results.add(transform);
}
return results;
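To make the structure of the second idea easier to review, here is a sketch of the request body that the Java builders above appear to describe, written as a plain Python dict. The function name `build_unique_workers_request` is hypothetical; the field and aggregation names are taken from the code above, and the top-level `size` and `sort` simply mirror `setSize(1)` and `addSort`. It is a reading aid under those assumptions, not a verified equivalent for any particular ElasticSearch version.

```python
def build_unique_workers_request(search_string, unique_count=5):
    """One bucket per name_surename, newest birth_date first, one document per bucket."""
    return {
        "size": 1,
        "sort": [{"birth_date": {"order": "desc"}}],
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": search_string}},
                    {"match": {"surename": search_string}},
                ],
                "minimum_should_match": 1,
            }
        },
        "aggs": {
            "namesAgg": {
                "terms": {
                    "field": "name_surename",
                    "size": unique_count,
                    # order buckets by the max birth_date sub-aggregation
                    "order": {"birthAgg": "desc"},
                },
                "aggs": {
                    "birthAgg": {"max": {"field": "birth_date"}},
                    # the single document returned for each bucket
                    "topHit": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"birth_date": {"order": "desc"}}],
                        }
                    },
                },
            }
        },
    }


request = build_unique_workers_request("test")
```

Read this way, the top hit in each bucket is the document with the latest birth_date for that name_surename, because the top_hits sort is explicit.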

Linq Query Where Contains

I'm attempting to make a LINQ where-contains query quicker.
The data set contains 256,999 clients. ids is just a simple list of GUIDs, and it might contain only 3 records.
The query below can take up to a minute to return the 3 records. This is because the logic goes through all 256,999 records to see whether each of them is in the list of 3.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like the query to check whether the three records are within the pool of 256,999 instead, which should be much quicker.
I don't want to loop, as the 3 records could be far more (thousands), and more loops mean more hits to the db.
I don't want to grab all the db records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all 256,999 from the DB, it takes a second. This is where the Ids come from (a filtered, small and simple list).
Any ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The Id query is quicker because only one field is returned and it's a single-table query.
The main query contains joins and aggregates (below). So I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
    [Address1], [Address2], [Address3], [Town], [County], [Postcode],
    Clients.Consent AS Consent,
    CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
    CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
    Convert(nvarchar(10), Max(Assessments.TestDate), 103) as LastVisit,
    CASE WHEN Max(Convert(integer, Assessments.Submitted)) = 1 Then 'true' ELSE 'false' END AS Submitted,
    CASE WHEN Max(Convert(integer, Assessments.GPSubmit)) = 1 Then 'true' ELSE 'false' END AS GPSubmit,
    CASE WHEN Max(Convert(integer, Assessments.QualForPay)) = 1 Then 'true' ELSE 'false' END AS QualForPay,
    Clients.UserIds AS LinkedUsers
FROM Clients
Left JOIN Assessments ON Clients.Id = Assessments.ClientId
Left JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1], [Address2], [Address3], [Town], [County], [Postcode], Clients.Consent, Clients.Dob, Clients.IsMale, Clients.UserIds -- ,Layouts.LayoutName, Layouts.SubmissionProcess
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains part, as the pool of Ids would be smaller than the main pool.
One way I've sped it up for now: I've done a String.Join on the list of Ids and added them as a WHERE clause within the main SQL. This has reduced the time to a second or so.
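The idea of pushing the Id filter into the SQL itself can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 with an in-memory table standing in for the real Clients table, the helper name `fetch_clients_by_ids` and the two-column schema are hypothetical, and it uses parameter placeholders rather than joining the Id values into the SQL text.

```python
import sqlite3


def fetch_clients_by_ids(conn, ids):
    """Return only the rows whose Id is in `ids`, filtering inside the database."""
    if not ids:
        return []
    # One "?" placeholder per id keeps the values out of the SQL string.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT Id, ClientRef FROM Clients "
        f"WHERE Id IN ({placeholders}) ORDER BY ClientRef"
    )
    return conn.execute(sql, list(ids)).fetchall()


# Tiny in-memory demo table standing in for the real Clients table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Clients (Id TEXT PRIMARY KEY, ClientRef TEXT)")
conn.executemany(
    "INSERT INTO Clients VALUES (?, ?)",
    [(f"id-{n}", f"REF{n:04d}") for n in range(1000)],
)

rows = fetch_clients_by_ids(conn, ["id-7", "id-42", "id-999"])
print(rows)  # [('id-7', 'REF0007'), ('id-42', 'REF0042'), ('id-999', 'REF0999')]
```

Databases cap the number of parameters per statement, so a list of thousands of Ids may need to be sent in batches.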

lucene.net, document boost not working

I am a beginner developing my very first project with Lucene.Net (3.0.3): an address search utility.
I am using the standard analyzer and the query parser (suppose I have a single field, stored and analyzed).
Sample data (every row is a document with a single field;
the postcode and street columns are concatenated):
UB6 9AH Greenford Road something
UB6 9AP Greenford Road something
UB1 3EB Greenford Road something
PR8 3JT Greenford Road something
HA1 3QD something Greenford Road
SM1 1JY something Greenford Road something
Searching
StringBuilder customQuery = new StringBuilder();
customQuery.Append(_searchFieldName + ":\"" + searchTerm + "\"^" + (wordsCount));
// this is for phrase matching
foreach (var word in words.Where(word => !string.IsNullOrEmpty(word)))
{
    customQuery.Append(" +" + _searchFieldName + ":" + word + "*");
}
// this is prefix match for each word
Query query = _parser.Parse(customQuery.ToString());
_searcher.Search(query, collector);
All of the above (searching) works fine.
Question
If I search for "Greenford road",
I may want the row that has 'SM1' to come up first (that is, I want to prioritise results by postcode).
I have tested query-time boosting and it works fine,
but I may sometimes have a long list of priority postcodes, so I don't want to loop over each postcode and set its priority at query time.
I want document-time boosting,
but whatever document boost I set at indexing time, it doesn't affect my search results.
doc.Add(new Field(SearchFieldName, SearchField, Field.Store.YES, Field.Index.ANALYZED));
if (condition == true)
{
    doc.Boost = 2; // or 5 or 200 etc (nothing works)
}
Please help.
I tried to understand similarity and scoring, but there is too much mathematics there.
I recently had this problem myself, and I think it might be due to wildcard queries (it was in my case, at least). There is another post here that explains the issue better and provides a possible solution:
Lucene .net Boost not working when using * wildcard
