Elastic Search Suggestions Return Zero Results - elasticsearch

I'm trying to set up Elasticsearch using the elasticsearch_dsl Python library. I have been able to set up the index, and I can search using the .filter() method, but I cannot get the .suggest() method to work.
I am trying to use the completion mapping type and the suggest query method, since this will be used for an autocomplete field (as recommended in Elastic's docs).
I am new to elastic, so I am guessing I am missing something.
Any guidance will be greatly appreciated!
What I have done so far
I did not find a tutorial that had exactly what I wanted, but I read through the documentation on elastic.co and elasticsearch_dsl, and looked at some examples here and here.
PS: I am using Searchbox Elasticsearch on Heroku
Index / Mappings Setup:
# imports [...]

edge_ngram_analyzer = analyzer(
    'edge_ngram_analyzer',
    type='custom',
    tokenizer='standard',
    filter=[
        'lowercase',
        token_filter(
            'edge_ngram_filter', type='edgeNGram',
            min_gram=1, max_gram=20
        )
    ]
)
class DocumentIndex(ElasticDocument):
    title = Text()
    title_suggest = Completion(
        analyzer=edge_ngram_analyzer,
    )

    class Index:
        name = 'documents-index'

# [...] Initialize index
# [...] Upload documents (5,000 documents)
# DocumentIndex.init()
# [DocumentIndex(**doc).save() for doc in mydocs]
Mappings Output:
This is the mapping as shown in the web console:
{
  "documents-index": {
    "mappings": {
      "doc": {
        "properties": {
          "title": {
            "type": "text"
          },
          "title_suggest": {
            "type": "completion",
            "analyzer": "edge_ngram_analyzer",
            "search_analyzer": "standard",
            "preserve_separators": true,
            "preserve_position_increments": true,
            "max_input_length": 50
          }
        }
      }
    }
  }
}
Attempting to Search
Verify Index exists:
>>> search = Search(index='documents-index')
>>> search.count()  # returns the correct number of documents
5000
>>> [doc for doc in search.scan()][:3]
[<Hit(documents-index/doc/1): ...} ...
Test Search - Works:
>>> query = search.filter('match', title='class')
>>> result = query.execute()
>>> result.hits
<Response: [<Hit(documents-in [ ... ]
>>> len(result.hits)
10
>>> query.to_dict()  # see query payload
{
    "query": {
        "bool": {
            "filter": [
                {
                    "match": {
                        "title": "class"
                    }
                }
            ]
        }
    }
}
The part that fails
I cannot get any of the .suggest() methods to work.
Note:
* I am following the official library docs
Test Suggest:
>>> query = search.suggest(
        'title-suggestions',
        'class',
        completion={
            'field': 'title_suggest',
            'fuzzy': True
        })
>>> query.execute()
<Response: {}>
>>> query.to_dict()  # see query payload
{
    "suggest": {
        "title-suggestions": {
            "text": "class",
            "completion": { "field": "title_suggest" }
        }
    }
}
I also tried the code below, and many other query types and values, but the results were similar. (Note: with .filter() I always get the expected result.)
>>> query = search.suggest(
        'title-suggestions',
        'class',
        term=dict(field='title'))
>>> query.to_dict()  # see query payload
{
    "suggest": {
        "title-suggestions": {
            "text": "class",
            "term": {
                "field": "title"
            }
        }
    }
}
>>> query.execute()
<Response: {}>
Update
Per Honza's suggestion, I updated the title_suggest mapping to be a plain Completion field, with no custom analyzer. I also deleted the index and reindexed from scratch:
class DocumentIndex(ElasticDocument):
    title = Text()
    title_suggest = Completion()

    class Index:
        name = 'documents-index'
Unfortunately, the problem remains. Here are some more tests:
Verify title_suggest is being indexed properly:
>>> search = Search(index='documents-index')
>>> search.index('documents-index').count()
23369
>>> [d for d in search.scan()][0].title
'AnalyticalGrid Property'
>>> [d for d in search.scan()][0].title_suggest
'AnalyticalGrid Property'
Tried searching again:
>>> len(search.filter('term', title='class').execute().hits)
10
>>> search.filter('term', title_suggest='Class').execute().hits
[]
>>> search.suggest('suggestions', 'class', completion={'field': 'title_suggest'}).execute().hits
[]
Verify Mapping:
>>> pprint(index.get_mapping())
{
    "documents-index": {
        "mappings": {
            "doc": {
                "properties": {
                    "title": { "type": "text" },
                    "title_suggest": {
                        "analyzer": "simple",
                        "max_input_length": 50,
                        "preserve_position_increments": True,
                        "preserve_separators": True,
                        "type": "completion"
                    }
                }
            }
        }
    }
}

For completion fields you do not want to use ngram analyzers. The completion field automatically indexes all prefixes and optimizes for prefix queries, so you are doing the work twice and confusing the system. Start with an empty completion field and go from there.

I want to formalize the solution provided by Honza in a comment on another answer.
The problem was not the mapping, but simply the fact that results from the .suggest() method are not returned under hits.
The suggestions are instead visible in the dictionary returned by response.to_dict():
>>> response = query.execute()
>>> print(response)
<Response: {}>
>>> response.to_dict()
# output is
# {'query': {},
# 'suggest': {'title-suggestions': {'completion': {'field': 'title_suggest'},
# [...]
I have also found additional details in this GitHub issue:
HonzaKral commented 27 days ago
The Response object provides access to any and all fields that have
been returned by elasticsearch. For convenience there is a shortcut
that allow to iterate over the hits as that is both most common and
also easy to do. For other parts of the response, like aggregations or
suggestions, you need to access them explicitly like
response.suggest.foo.options.
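As a concrete illustration, here is a small helper that pulls the option texts out of a suggest response body. The response shape mirrors the one shown on this page, but the helper itself and the sample option texts are made up for the example; it is plain Python with no network calls:

```python
# Hypothetical helper: collect option texts for one named suggestion
# from the dict returned by response.to_dict().
def suggestion_texts(response_body, suggestion_name):
    texts = []
    for entry in response_body.get('suggest', {}).get(suggestion_name, []):
        for option in entry.get('options', []):
            texts.append(option['text'])
    return texts

# Trimmed sample body in the shape a completion suggester returns
# (the option texts here are invented for the example):
body = {
    'suggest': {
        'title-suggestions': [
            {'text': 'class', 'options': [
                {'text': 'Class Library'},
                {'text': 'Class Reference'},
            ]}
        ]
    }
}
print(suggestion_texts(body, 'title-suggestions'))
# → ['Class Library', 'Class Reference']
```

With elasticsearch_dsl the same data is reachable directly via attribute access, e.g. response.suggest['title-suggestions'][0].options.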

Related

ElasticSearch: How to query by multiple conditions in different locations?

I've been trying to build this Elasticsearch query against the Danish CVR database API, so far without success. Basically, I'm trying to find companies where:
* The company has a relationship with a "deltager" (participant) whose "enhedsNummer" (ID) equals NUMBER
* The relationship is still active, i.e. the "end of period" field is null
How do I construct a query with multiple conditions like this?
'query': {
    'bool': {
        'must': [
            {
                'term': {'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': NUMBER},
                AND
                'term': {'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil': null}
            },
        ],
    },
},
FYI: database mapping may be found at http://distribution.virk.dk/cvr-permanent/_mapping
You can try:
GET /cvr-permanent/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": {
              "value": "your_value_here"
            }
          }
        }
      ],
      "must_not": [
        {
          "exists": {
            "field": "Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil"
          }
        }
      ]
    }
  }
}
The trick here is to use must_not/exists for null values.
P.S. I cannot verify it because the API requires authorisation.
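The same body can be built in Python before sending it with requests; a minimal sketch, where the field names come from the question and the ID value is a placeholder:

```python
# Sketch: build the bool query with a term filter on the participant ID
# and must_not/exists for the null end-of-period field.
def active_relation_query(enheds_nummer):
    return {
        'query': {
            'bool': {
                'must': [
                    {'term': {
                        'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': enheds_nummer
                    }}
                ],
                'must_not': [
                    {'exists': {
                        'field': 'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil'
                    }}
                ],
            }
        }
    }

q = active_relation_query(12345678)  # placeholder ID
# q can then be POSTed to /cvr-permanent/_search with requests.
```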
It doesn't appear that Elasticsearch queries are as dynamic as I had hoped (or I don't know how to use them). Instead, the Python code below seems the best way to generate the desired outcome:
import requests
import pandas as pd

# creation of empty lists:
virksomhedsnavne = []
virksomhedscvr = []
relation_fra = []
relation_til = []

# Pulling data (apparently limited to 3000 elements at a time):
for i in range(20):
    if i == 0:
        highestcvrnummer = 0
    else:
        highestcvrnummer = max(virksomhedscvr)
    headers = {
        'Content-Type': 'application/json',
    }
    json_data = {
        "_source": ["Vrvirksomhed.cvrNummer", "Vrvirksomhed.navne", "Vrvirksomhed.virksomhedMetadata.nyesteNavn.navn", "Vrvirksomhed.deltagerRelation"],
        "sort": [{"Vrvirksomhed.cvrNummer": {"order": "asc"}}],
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": "some_value"
                        }
                    },
                    {
                        "range": {
                            "Vrvirksomhed.cvrNummer": {
                                "gt": highestcvrnummer
                            }
                        }
                    }
                ]
            }
        },
        "size": 3000
    }
    response = requests.post('http://distribution.virk.dk/cvr-permanent/virksomhed/_search',
                             headers=headers, json=json_data, auth=('USERNAME', 'PASSWORD'))
    json_data = response.json()['hits']['hits']

    # Aggregate and format data neatly
    for data in json_data:
        virksomhed_data = data['_source']['Vrvirksomhed']
        virksomhedscvr.append(virksomhed_data['cvrNummer'])
        try:
            virksomhedsnavne.append(virksomhed_data['virksomhedMetadata']['nyesteNavn']['navn'])
        except KeyError:
            virksomhedsnavne.append(virksomhed_data['navne'][0]['navn'])
        # Loop through all "deltagere" and find a match with the value
        for relation in virksomhed_data['deltagerRelation']:
            # If a match is found
            if relation['deltager']['enhedsNummer'] == some_value:
                # Make sure the most recent period is chosen
                antalopdateringer = len(relation['organisationer']) - 1
                relation_gyldig = relation['organisationer'][antalopdateringer]['medlemsData'][0]['attributter'][0]['vaerdier'][0]['periode']
                relation_fra.append(relation_gyldig['gyldigFra'])
                relation_til.append(relation_gyldig['gyldigTil'])
                break

# export to Excel
export_data = {'CVR nummer': virksomhedscvr, 'navn': virksomhedsnavne, 'Relation fra': relation_fra, 'Relation til': relation_til}
df = pd.DataFrame(export_data)
df.to_excel("output.xlsx")
If anyone else is working with the Danish CVR register's API, I hope this helps!
Also, if you find a better solution, please let me know :)

Use query result as parameter for another query in Elasticsearch DSL

I'm using Elasticsearch DSL, and I'm trying to use a query result as a parameter for another query, like below:
{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "request_id": {
            "query": {
              "match": {
                "processing.message": "OUT Followup Synthesis"
              }
            },
            "fields": [
              "request_id"
            ],
            "_source": false
          }
        }
      }
    }
  }
}
As you can see above, I'm trying to search for sources whose request_id is not among the request_ids with processing.message equal to OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch data with processing.message equal to 'IN Followup Sythesis' whose request_id does not appear in data with processing.message equal to 'OUT Followup Sythesis'. In SQL terms:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, neither application-side joins nor subqueries are supported in Elasticsearch.
So you'll have to run your first query, take the retrieved IDs and put them into a second query — ideally a terms query.
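As a sketch of that two-step approach (field names and messages taken from the question; the builder only produces the second request body):

```python
# Step 1 (not shown): query for processing.message = 'OUT Followup Synthesis'
# with _source limited to request_id, and collect the IDs client-side.
# Step 2: exclude those IDs with a terms query under must_not.
def exclusion_search_body(excluded_ids):
    return {
        'query': {
            'bool': {
                'must': [
                    {'match': {'processing.message': 'IN Followup Sythesis'}}
                ],
                'must_not': [
                    {'terms': {'request_id': excluded_ids}}
                ],
            }
        }
    }

body = exclusion_search_body(['abc'])  # IDs gathered from the first query
```

Note that a plain terms query takes an array of values, which is why the first query's results have to be collected client-side before building the second body.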
Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
  "size": 0,
  "query": {
    "match": {
      "processing.message": "IN Followup Sythesis"
    }
  },
  "aggs": {
    "subquery_mock": {
      "scripted_metric": {
        "params": {
          "disallowed_msg": "OUT Followup Synthesis"
        },
        "init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
        "map_script": """
          def req_id = params._source.request_id;
          def msg = params._source.processing.message;
          if (msg.contains(params.disallowed_msg)) {
            state.disallowed_request_ids.add(req_id);
            // won't need this particular doc so continue looping
            return;
          }
          if (state.by_request_ids.containsKey(req_id)) {
            // there may be multiple docs under the same ID
            // so concatenate them
            state.by_request_ids[req_id].add(params._source);
          } else {
            // initialize an appendable arraylist
            state.by_request_ids[req_id] = [params._source];
          }
        """,
        "combine_script": """
          state.by_request_ids.entrySet()
            .removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
          return state.by_request_ids;
        """,
        "reduce_script": "return states"
      }
    }
  }
}
which'd return only the correct request:
"aggregations" : {
  "subquery_mock" : {
    "value" : [
      {
        "xyz" : [
          {
            "processing" : { "message" : "IN Followup Sythesis" },
            "request_id" : "xyz"
          }
        ]
      }
    ]
  }
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend to test this script on a smaller set of documents before letting it target your whole index — maybe restrict it through a date range query or similar.
FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.

Query for sub-set of word in MongoDB Atlas full text search

My objective is to create an index + search pipeline, so I can find the following document by searching for "reprod":
{ name: "can you find this and reproduce?" }
What I have:
I'm using the default index.
My search pipeline looks like this:
$search: {
  text: {
    query: 'reprod',
    path: 'name',
  },
},
But this does not work: I only get a result when I provide the entire word in the query. I would like the document to be returned even if I provide only a subset of a word.
If you can change your index definition, I'd suggest using the autocomplete operator. Fields queried with this operator currently have to be statically defined in the index definition. For this example, your index definition could look something like this:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "type": "autocomplete"
      }
    }
  }
}
And then your query would look like this:
{
  $search: {
    "autocomplete": {
      "path": "name",
      "query": "reprod"
    }
  }
}
If you don't want a statically defined index definition, you can use the Wildcard operator instead, with Allow Analyzed Field set to true.
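For reference, both variants expressed as PyMongo aggregation stages; a sketch only, with the path and query taken from the question:

```python
# $search stage using the autocomplete operator (requires the static
# index definition shown above):
autocomplete_stage = {
    '$search': {
        'autocomplete': {'path': 'name', 'query': 'reprod'}
    }
}

# Fallback without a static definition: the wildcard operator with
# allowAnalyzedField, using * so a mid-word fragment can match:
wildcard_stage = {
    '$search': {
        'wildcard': {
            'path': 'name',
            'query': '*reprod*',
            'allowAnalyzedField': True,
        }
    }
}

# collection.aggregate([autocomplete_stage]) would then run the search
# against an Atlas cluster.
```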

How to create a join relation using the Elasticsearch Python client

I am looking for any examples that implement the parent-child relationship using the python interface.
I can define a mapping such as
es.indices.create(
    index="docpage",
    body={
        "mappings": {
            "properties": {
                "my_join_field": {
                    "type": "join",
                    "relations": {
                        "my_document": "my_page"
                    }
                }
            }
        }
    }
)
I am then indexing a document using
res = es.index(index="docpage", doc_type="_doc", id=1, body=jsonDict)
where jsonDict is a dict holding the document's text, with jsonDict['my_join_field'] = 'my_document' and other relevant info.
Reference example.
I tried adding pageDict, where page is a string containing the text of one page of a document:
pageDict['content'] = page
pageDict['my_join_field'] = {}
pageDict['my_join_field']['parent'] = "1"
pageDict['my_join_field']['name'] = "page"
res = es.index(index="docpage", doc_type="_doc", body=pageDict)
but I get a parser error:
RequestError(400, 'mapper_parsing_exception', 'failed to parse')
Any ideas?
This worked for me:
res = es.index(index="docpage", doc_type="_doc",
               body={"content": page,
                     "my_join_field": {
                         "name": "my_page",
                         "parent": "1"}})
The initial syntax can work if the parent is also repeated in the "routing" key of the main query body:
res = es.index(index="docpage",doc_type="_doc",body=pageDict, routing=1)
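Putting the two answers together, a sketch of the parent and child bodies (the es.index() calls are commented out since they need a live cluster, and the placeholder text is made up):

```python
# Parent document: the join field is just the parent relation name.
parent_doc = {
    'text': 'document text here',     # placeholder content
    'my_join_field': 'my_document',
}

# Child document: the join field names the child relation and the
# parent's _id; routing must point at the parent so both documents
# land on the same shard.
child_doc = {
    'content': 'page text here',      # placeholder content
    'my_join_field': {
        'name': 'my_page',
        'parent': '1',
    },
}

# es.index(index='docpage', id=1, body=parent_doc)
# es.index(index='docpage', body=child_doc, routing=1)
```

The child relation name must match the mapping ("my_page", not "page"), which is one likely cause of the mapper_parsing_exception in the question.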

How can I get unique suggestions without duplicates when I use the completion suggester?

I am using Elasticsearch 5.1.1 in my environment. I have chosen the completion suggester on a field named post_hashtags, which holds an array of strings to suggest on. For the prefix "inv" I get the response below.
Req:
POST hashtag/_search?pretty&filter_path=suggest.hash-suggest.options.text,suggest.hash-suggest.options._source
{
  "_source": ["post_hashtags"],
  "suggest": {
    "hash-suggest": {
      "prefix": "inv",
      "completion": {
        "field": "post_hashtags"
      }
    }
  }
}
Response :
{
  "suggest": {
    "hash-suggest": [
      {
        "options": [
          {
            "text": "invalid",
            "_source": {
              "post_hashtags": [
                "invalid"
              ]
            }
          },
          {
            "text": "invalid",
            "_source": {
              "post_hashtags": [
                "invalid",
                "coment_me",
                "daya"
              ]
            }
          }
        ]
      }
    ]
  }
}
Here "invalid" is returned twice because it is also an input string for the same field post_hashtags in another document.
The problem is that if the same "invalid" input string is present in 1000 documents in the same index, I would get 1000 duplicate suggestions, which is huge and not needed.
Can I apply an aggregation on a field of type completion?
Is there any way I can get unique suggestions instead of duplicated text, even if the same input string is given to a particular field in multiple documents of the same index?
Elasticsearch 6.1 introduced the skip_duplicates option. Example usage:
{
  "suggest": {
    "autocomplete": {
      "prefix": "MySearchTerm",
      "completion": {
        "field": "name",
        "skip_duplicates": true
      }
    }
  }
}
Edit: This answer only applies to Elasticsearch 5
No, you cannot de-duplicate suggestion results. The completion suggester is document-oriented in Elasticsearch 5 and will thus return suggestions for all documents that match.
In Elasticsearch 1 and 2, the completion suggester automatically de-duplicated suggestions. There is an open GitHub ticket to bring back this functionality, and it looks like it may be possible in a future version.
For now, you have two options:
Use Elasticsearch version 1 or 2.
Use a different suggestion implementation not based on the completion suggester. The only semi-official suggestion I have seen so far involves putting your suggestion strings in a separate index.
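On Elasticsearch 5 a further workaround is to over-fetch and collapse duplicates client-side; a minimal sketch assuming the response shape shown above (the 'inventory' option is invented for the example):

```python
# Drop duplicate suggestion texts while preserving order.
def unique_options(response_body, suggestion_name):
    seen = set()
    unique = []
    for entry in response_body.get('suggest', {}).get(suggestion_name, []):
        for option in entry.get('options', []):
            if option['text'] not in seen:
                seen.add(option['text'])
                unique.append(option)
    return unique

# Trimmed sample body ('inventory' is an invented extra option):
body = {'suggest': {'hash-suggest': [{'options': [
    {'text': 'invalid'}, {'text': 'invalid'}, {'text': 'inventory'},
]}]}}
print([o['text'] for o in unique_options(body, 'hash-suggest')])
# → ['invalid', 'inventory']
```

Remember to request a larger size on the suggester when doing this, since duplicates eat into the returned option count.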
