NEST elasticsearch.NET search query not returning results (part 2) - elasticsearch

I'm using the object initializer syntax with NEST to form a search query. When I include the second pdfQuery with the logical OR operator, I get no results. If I exclude it, I get results.
QueryContainer titleQuery = new MatchQuery
{
Field = Property.Path<ElasticBook>(p => p.Title),
Query = query,
Boost = 50,
Slop = 2,
MinimumShouldMatch = "55%"
};
QueryContainer pdfQuery = new MatchQuery
{
Field = Property.Path<ElasticBook>(p => p.Pdf),
Query = query,
CutoffFrequency = 0.001
};
var result = _client.Search<ElasticBook>(new SearchRequest("bookswithstop", "en")
{
From = 0,
Size = 10,
Query = titleQuery || pdfQuery,
Timeout = "20000",
Fields = new []
{
Property.Path<ElasticBook>(p => p.Title)
}
});
If I debug and inspect the result variable, copying the value of its request property gives:
{
"timeout": "20000",
"from": 0,
"size": 10,
"fields": [
"title"
],
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "Proper Guide To Excel 2010",
"slop": 2,
"boost": 50.0,
"minimum_should_match": "55%"
}
}
},
{
"match": {
"pdf": {
"query": "Proper Guide To Excel 2010",
"cutoff_frequency": 0.001
}
}
}
]
}
}
}
The problem is that if I copy that query into Sense, it returns about 100 results (albeit slowly). I've checked the header info, and that seems to be correct from NEST as well:
ConnectionStatus = {StatusCode: 200,
Method: POST,
Url: http://elasticsearch-blablablamrfreeman/bookswithstop/en/_search,
Request: {
"timeout": "20000",
"from": 0,
"size": 10,
"fields": [
"title"
],
"query": {
"bool": {
"shoul...
The pdf field uses the Elasticsearch attachment plugin (located at https://github.com/elastic/elasticsearch-mapper-attachments), and I was getting Newtonsoft.Json System.OutOfMemoryException errors before (but not now, for some reason).
My only guess, therefore, is that there's some serialization issue between my query and NEST. If that were the case, though, I'm not sure why it would execute successfully with a 200 status code and give 0 documents in the Documents property.
Could anyone explain how I would go about troubleshooting this? It clearly doesn't like my second search query (pdfQuery), but I'm not sure why, and the resulting JSON request syntax seems to be correct as well!

I think this part is causing problems
Fields = new []
{
Property.Path<ElasticBook>(p => p.Title)
}
When you use the Fields option, Elasticsearch does not return the _source field, so you can't access results through result.Documents. Instead, you have to use result.FieldSelections, which is quite unpleasant.
If you want to return only specific fields from elasticsearch and still be able to use result.Documents you can take advantage of source includes / excludes. With NEST you can do this as follows:
var searchResponse = client.Search<Document>(s => s
.Source(source => source.Include(f => f.Number))
.Query(q => q.MatchAll()));
Hope this helps you.
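To make the difference concrete, here is a small Python sketch (not NEST) of how a client might read hits from the raw search response. The two sample responses are hypothetical and assume only the usual hits.hits layout: with "fields", each hit carries a fields object and no _source, whereas with source filtering each hit carries a trimmed _source.

```python
def documents_from_response(response):
    """Return one dict per hit, preferring _source and falling back to fields."""
    docs = []
    for hit in response.get("hits", {}).get("hits", []):
        if "_source" in hit:
            docs.append(hit["_source"])
        else:
            # "fields" values come back as arrays, so unwrap single-element lists.
            fields = hit.get("fields", {})
            docs.append({
                key: value[0] if isinstance(value, list) and len(value) == 1 else value
                for key, value in fields.items()
            })
    return docs


# Hypothetical responses, for illustration only.
fields_response = {"hits": {"hits": [{"_id": "1", "fields": {"title": ["Excel 2010"]}}]}}
source_response = {"hits": {"hits": [{"_id": "1", "_source": {"title": "Excel 2010"}}]}}

print(documents_from_response(fields_response))
print(documents_from_response(source_response))
```

A client that only looks at _source (as result.Documents does) would see nothing useful in the first response, which matches the symptom in the question.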

Related

Liferay Elastic Search: Query fields using StringQuery

I need to query two fields from elastic search using query_string: e.g. name and age.
It works if I do it via the following query directly to Liferay's Elastic Search:
{
"query": {
"bool": {
"must": [
{ "match": { "recordSetId": "123" }},
{ "query_string":
{
"query": "Maximil*",
"fields": ["name", "age"]
}
}
]
}
}
}
Does anyone know how I can send the above query from Liferay Java Code?
I would need something like:
StringQuery query = queries.string(searchTerms);
query.setFields(); // this does not exist...
SearchContext searchContext = new SearchContext();
searchContext.setStart(QueryUtil.ALL_POS);
searchContext.setEnd(QueryUtil.ALL_POS);
searchContext.setUserId(userId);
// Restrict the search to the entity class you want data from, e.g. JournalArticle
searchContext.setEntryClassNames(new String[] {JournalArticle.class.getName()});
BooleanQuery query = new BooleanQueryImpl();
query.add(new TermQueryImpl("name", "Maximil*"), BooleanClauseOccur.MUST);
query.add(new TermQueryImpl("age", "Maximil*"), BooleanClauseOccur.MUST);
Hits hits = IndexSearcherHelperUtil.search(searchContext, query);
List<Document> document = hits.toList();

ElasticSearch: How to query by multiple conditions in different locations?

I've been trying to build this ElasticSearch Query on the Danish CVR database API so far without success. Basically I'm trying to find companies where
The company has a relationship with a "deltager" (participant) whose "enhedsNummer" (ID) equals NUMBER
The relationship is still active, i.e. the "end of period" field is null
How do I construct a query that has multiple conditions like this?
'query': {
'bool': {
'must': [
{
'term': {'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': NUMBER},
AND
'term': {'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil': null}
},
],
},
},
}
FYI: database mapping may be found at http://distribution.virk.dk/cvr-permanent/_mapping
You can try:
GET /cvr-permanent/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": {
"value": "your_value_here"
}
}
}
],
"must_not": [
{
"exists": {
"field": "Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil"
}
}
]
}
}
}
The trick here is to combine must_not with exists to match null (missing) values.
P.S. I cannot check it because it requires authorisation.
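As a rough illustration of the same idea, the request body can be assembled programmatically. This Python sketch only builds and prints the JSON; the function name and the placeholder ID 12345 are made up for the example, and nothing here has been run against the CVR API.

```python
import json


def active_relation_query(participant_id, end_field):
    """Build a bool query: the participant ID must match and the end-of-period field must be absent."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": participant_id}}
                ],
                "must_not": [
                    {"exists": {"field": end_field}}
                ],
            }
        }
    }


END_FIELD = "Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil"
body = active_relation_query(12345, END_FIELD)  # 12345 is a placeholder ID
print(json.dumps(body, indent=2))
```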
It doesn't appear that Elasticsearch queries are as dynamic as I had wanted (or I don't know how to use them). Instead, the Python code below seems to be the best way to produce the desired outcome:
import requests
import pandas as pd
# Participant ID ("enhedsNummer") to search for. Placeholder: replace with the real value.
some_value = 0
# Creation of empty lists:
virksomhedsnavne = []
virksomhedscvr = []
relation_fra = []
relation_til = []
# Pulling data (apparently limited to 3000 elements at a time):
for i in range(20):
    # Page through the results by asking for CVR numbers above the highest one seen so far
    if i == 0:
        highestcvrnummer = 0
    else:
        highestcvrnummer = max(virksomhedscvr)
    headers = {
        'Content-Type': 'application/json',
    }
    json_data = {
        "_source": ["Vrvirksomhed.cvrNummer", "Vrvirksomhed.navne", "Vrvirksomhed.virksomhedMetadata.nyesteNavn.navn", "Vrvirksomhed.deltagerRelation"],
        "sort": [{"Vrvirksomhed.cvrNummer": {"order": "asc"}}],
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": some_value
                        }
                    },
                    {
                        "range": {
                            "Vrvirksomhed.cvrNummer": {
                                "gt": highestcvrnummer
                            }
                        }
                    }
                ]
            }
        },
        'size': 3000
    }
    response = requests.post('http://distribution.virk.dk/cvr-permanent/virksomhed/_search', headers=headers, json=json_data, auth=('USERNAME', 'PASSWORD'))
    json_data = response.json()['hits']['hits']
    # Aggregate and format data neatly
    for data in json_data:
        virksomhed_data = data['_source']['Vrvirksomhed']
        virksomhedscvr.append(virksomhed_data['cvrNummer'])
        try:
            virksomhedsnavne.append(virksomhed_data['virksomhedMetadata']['nyesteNavn']['navn'])
        except (KeyError, TypeError):
            # Fall back to the first registered name when the newest name is missing
            virksomhedsnavne.append(virksomhed_data['navne'][0]['navn'])
        # Loop through all "deltagere" and find match with value
        for relation in virksomhed_data['deltagerRelation']:
            # If match found
            if relation['deltager']['enhedsNummer'] == some_value:
                # Make sure most recent period is chosen
                antalopdateringer = len(relation['organisationer']) - 1
                relation_gyldig = relation['organisationer'][antalopdateringer]['medlemsData'][0]['attributter'][0]['vaerdier'][0]['periode']
                relation_fra.append(relation_gyldig['gyldigFra'])
                relation_til.append(relation_gyldig['gyldigTil'])
                break
# Export to Excel
output = {'CVR nummer': virksomhedscvr, 'navn': virksomhedsnavne, 'Relation fra': relation_fra, 'Relation til': relation_til}
df = pd.DataFrame(output)
df.to_excel("output.xlsx")
If anyone else is working with the Danish CVR register's API, I hope this helps!
Also, if you find a better solution, please let me know :)
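The paging approach used in the script (sort ascending by cvrNummer, then ask for values greater than the highest one already seen) can be isolated into a small loop that stops as soon as a page comes back empty, instead of running a fixed 20 iterations. The sketch below is only an illustration: fetch_all, fake_fetch_page and the in-memory FAKE_INDEX are hypothetical stand-ins for the real HTTP call.

```python
def fetch_all(fetch_page, page_size):
    """Collect every record by repeatedly asking for keys greater than the last one seen."""
    records = []
    last_key = 0
    while True:
        page = fetch_page(last_key, page_size)
        if not page:
            break  # an empty page means there is nothing left to fetch
        records.extend(page)
        last_key = max(record["cvrNummer"] for record in page)
    return records


# Stand-in for the HTTP call: an in-memory, sorted "index" (hypothetical data).
FAKE_INDEX = [{"cvrNummer": n} for n in range(1, 8)]


def fake_fetch_page(greater_than, size):
    """Return up to `size` records whose cvrNummer is above `greater_than`."""
    matching = [record for record in FAKE_INDEX if record["cvrNummer"] > greater_than]
    return matching[:size]


all_records = fetch_all(fake_fetch_page, page_size=3)
print(len(all_records))
```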

Use query result as parameter for another query in Elasticsearch DSL

I'm using Elasticsearch DSL, I'm trying to use a query result as a parameter for another query like below:
{
"query": {
"bool": {
"must_not": {
"terms": {
"request_id": {
"query": {
"match": {
"processing.message": "OUT Followup Synthesis"
}
},
"fields": [
"request_id"
],
"_source": false
}
}
}
}
}
}
As you can see above, I'm trying to search for documents whose request_id is not one of the request_ids that have processing.message equal to OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch data where processing.message equals 'IN Followup Sythesis' and whose request_id doesn't appear in data where processing.message equals 'OUT Followup Sythesis'. In SQL:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, neither SQL-style joins nor subqueries are supported in Elasticsearch.
So you'll have to run your first query, take the retrieved IDs and put them into a second query — ideally a terms query.
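A minimal sketch of that two-step approach in Python, assuming request_id is indexed as an exact-value (keyword-like) field. The helper names and the sample first_hits are hypothetical, and the actual search calls are left out so that only the glue logic is shown.

```python
def collect_request_ids(hits):
    """Pull the distinct request_id values out of the first query's hits, keeping order."""
    seen = []
    for hit in hits:
        req_id = hit["_source"]["request_id"]
        if req_id not in seen:
            seen.append(req_id)
    return seen


def build_second_query(excluded_ids):
    """Match the IN messages while excluding every request_id found by the first query."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"processing.message": "IN Followup Sythesis"}}],
                "must_not": [{"terms": {"request_id": excluded_ids}}],
            }
        }
    }


# Hypothetical hits from the first query (documents with the OUT message).
first_hits = [
    {"_source": {"request_id": "abc"}},
    {"_source": {"request_id": "abc"}},
]
second_query = build_second_query(collect_request_ids(first_hits))
print(second_query)
```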
Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
"size": 0,
"query": {
"match": {
"processing.message": "IN Followup Sythesis"
}
},
"aggs": {
"subquery_mock": {
"scripted_metric": {
"params": {
"disallowed_msg": "OUT Followup Synthesis"
},
"init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
"map_script": """
def req_id = params._source.request_id;
def msg = params._source.processing.message;
if (msg.contains(params.disallowed_msg)) {
state.disallowed_request_ids.add(req_id);
// won't need this particular doc so continue looping
return;
}
if (state.by_request_ids.containsKey(req_id)) {
// there may be multiple docs under the same ID
// so concatenate them
state.by_request_ids[req_id].add(params._source);
} else {
// initialize an appendable arraylist
state.by_request_ids[req_id] = [params._source];
}
""",
"combine_script": """
state.by_request_ids.entrySet()
.removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
return state.by_request_ids
""",
"reduce_script": "return states"
}
}
}
}
which'd return only the correct request:
"aggregations" : {
"subquery_mock" : {
"value" : [
{
"xyz" : [
{
"processing" : { "message" : "IN Followup Sythesis" },
"request_id" : "xyz"
}
]
}
]
}
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend to test this script on a smaller set of documents before letting it target your whole index — maybe restrict it through a date range query or similar.
FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.
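For readers who find the Painless scripts hard to follow, the same map and combine logic can be mirrored in plain Python. This is only a single-process illustration of the algorithm (it does not model per-shard execution), run over the three example documents from above; emulate_subquery is a made-up name.

```python
def emulate_subquery(docs, disallowed_msg):
    """Group docs by request_id, then drop every group whose ID also has a disallowed message."""
    by_request_id = {}
    disallowed_ids = set()
    for doc in docs:
        req_id = doc["request_id"]
        if disallowed_msg in doc["processing"]["message"]:
            # Remember the ID as disallowed and skip the document itself.
            disallowed_ids.add(req_id)
            continue
        by_request_id.setdefault(req_id, []).append(doc)
    return {
        req_id: group
        for req_id, group in by_request_id.items()
        if req_id not in disallowed_ids
    }


docs = [
    {"request_id": "abc", "processing": {"message": "OUT Followup Synthesis"}},
    {"request_id": "abc", "processing": {"message": "IN Followup Sythesis"}},
    {"request_id": "xyz", "processing": {"message": "IN Followup Sythesis"}},
]
result = emulate_subquery(docs, "OUT Followup Synthesis")
print(result)
```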

Trying to filter some Elasticsearch results where the field might not exist

I have some data and I'm trying to add an extra filter that will exclude any results where foo.IsMarried == true.
Now, there are heaps of documents that don't have this field. If the field doesn't exist, I'm assuming that the value is foo.IsMarried == false, so those documents should be included in the result set.
Can anyone provide any clues, please?
I'm also using the .NET 'NEST' nuget client library - so I'll be really appreciative if the answer could be targeting that, but just happy with any answer, really.
Generally, in Elasticsearch, the absence of a boolean field doesn't mean that its value is false; it may simply have no value.
But based on the assumption you are making in this case, we can check whether the field foo.IsMarried is explicitly false OR does not exist in the document.
The query presented by Rahul in the other answer does the job. However, since you wanted a NEST version, the query can be constructed with the snippet below.
// Notice the use of not exists here. If you do not want to check for the 'false' value,
// you can omit the first term filter here. 'T' is the type to which you are mapping your index.
// You should pass the field based on the structure of 'T'.
private static QueryContainer BuildNotExistsQuery()
{
var boolQuery = new QueryContainerDescriptor<T>().Bool(
b => b.Should(
s => s.Term(t => t.Field(f => f.foo.IsMarried).Value(false)),
s => !s.Exists(ne => ne.Field(f => f.foo.IsMarried))
)
);
return boolQuery;
}
You can trigger the search through the NEST client within your project as shown below.
var result = client.Search<T>(s => s
.From(0)
.Size(20)
.Query(q => BuildNotExistsQuery())
// other methods that you want to chain go here
);
You can use a should query with the following two conditions:
IsMarried is false
IsMarried must not exist
POST test/person/
{"name": "p1", "IsMarried": false}
POST test/person/
{"name": "p2", "IsMarried": true}
POST test/person/
{"name": "p3"}
Raw DSL query
POST test/person/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"IsMarried": false
}
},
{
"bool": {
"must_not": {
"exists": {
"field": "IsMarried"
}
}
}
}
]
}
}
}
I hope you can convert this raw DSL query to NEST!
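To sanity-check which of the three sample documents this query is meant to keep, here is a hedged Python sketch. It builds the same request body and applies an equivalent check locally; matches_locally is a hypothetical stand-in for what the should clause expresses, not something Elasticsearch runs.

```python
def not_married_query(field="IsMarried"):
    """Build the should query: the field is false, or the field does not exist."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: False}},
                    {"bool": {"must_not": {"exists": {"field": field}}}},
                ]
            }
        }
    }


def matches_locally(doc, field="IsMarried"):
    """Local stand-in for the query: keep docs where the field is false or missing/null."""
    value = doc.get(field)
    return value is None or value is False


people = [
    {"name": "p1", "IsMarried": False},
    {"name": "p2", "IsMarried": True},
    {"name": "p3"},
]
kept = [person["name"] for person in people if matches_locally(person)]
print(kept)
```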

ElasticSearch NEST Query

I'm trying to mimic a query that I wrote in Sense (chrome plugin) using NEST in C#. I can't figure out what the difference between the two queries is. The Sense query returns records while the nest query does not. The queries are as follows:
var searchResults = client.Search<File>(s => s.Query(q => q.Term(p => p.fileContents, "int")));
and
{
"query": {
"term": {
"fileContents": {
"value": "int"
}
}
}
}
What is the difference between these two queries? Why would one return records and the other not?
You can find out what query NEST uses with the following code:
var json = System.Text.Encoding.UTF8.GetString(searchResults.RequestInformation.Request);
Then you can compare the output.
I prefer this slightly simpler version, which I usually just type in .NET Immediate window:
searchResults.ConnectionStatus;
Besides being shorter, it also gives the url, which can be quite helpful.
? searchResults.ConnectionStatus;
{StatusCode: 200,
Method: POST,
Url: http://localhost:9200/_all/filecontent/_search,
Request: {
"query": {
"term": {
"fileContents": {
"value": "int"
}
}
}
}
Try this:
var searchResults2 = client.Search<File>(s => s
.Query(q => q
.Term(p => p.Field(r => r.fileContents).Value("int")
)
));
Followup:
RequestInformation is not available in newer versions of NEST.
I'd suggest breaking your code down into steps (don't build queries directly inside the client.Search() call).
client.Search() takes a Func<SearchDescriptor<T>, ISearchRequest> as its parameter.
My answer from a similar post:
SearchDescriptor<T> sd = new SearchDescriptor<T>()
.From(0).Size(100)
.Query(q => q
.Bool(t => t
.Must(u => u
.Bool(v => v
.Should(
...
)
)
)
)
);
And got the serialized JSON like this:
{
"from": 0,
"size": 100,
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
...
]
}
}
]
}
}
}
It was annoying; the NEST library should have something that spits out the JSON for a request. However, this worked for me:
using (MemoryStream mStream = new MemoryStream()) {
client.Serializer.Serialize(sd, mStream);
// Decode as UTF-8 so that non-ASCII characters in the query survive
Console.WriteLine(Encoding.UTF8.GetString(mStream.ToArray()));
}
NEST library version: 2.0.0.0.
Newer versions may (hopefully) have an easier way to get this.
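Once the request JSON has been captured with any of the approaches above, comparing it by eye with the Sense query is error-prone. One possible way to compare the two bodies independently of whitespace and key order is sketched below in Python; the three sample bodies are hypothetical, and other_body differs only in the casing of the field name.

```python
import json


def canonical(body):
    """Parse a JSON request body and re-serialize it with sorted keys and fixed separators."""
    return json.dumps(json.loads(body), sort_keys=True, separators=(",", ":"))


def same_request(first, second):
    """Return True when two JSON bodies are structurally identical."""
    return canonical(first) == canonical(second)


sense_body = '{"query": {"term": {"fileContents": {"value": "int"}}}}'
nest_body = '{\n  "query": {\n    "term": {\n      "fileContents": { "value": "int" }\n    }\n  }\n}'
other_body = '{"query": {"term": {"filecontents": {"value": "int"}}}}'

print(same_request(sense_body, nest_body))
print(same_request(sense_body, other_body))
```

If the bodies turn out to be identical, the next thing to compare is the URL (index and type), which ConnectionStatus also reports.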
