Find the keywords that matched the query in Elasticsearch

I'm using Elasticsearch to search several indices.
When the user performs a query, the matches are split between the one or two keywords that yield results. I'd like to know, for every hit, which keyword it originated from.
So if the user searched for "ventolin for asthma", I'd like to know which hits are for "ventolin" and which are for "asthma".
That is, for this query:
{
  'query': {
    'multi_match': {
      'query': 'ventolin for asthma',
      'fuzziness': 2,
      'prefix_length': 1,
      'type': 'best_fields',
      'fields': ['term*']
    }
  }
}
And these hits:
{
  ...
  'hits': {
    'total': {
      'value': 287,
      'relation': 'eq'
    },
    'max_score': 10.301256,
    'hits': [{
      '_index': 'normalized-term-mapping',
      '_type': '_doc',
      '_id': '194526',
      '_score': 10.301256,
      '_source': {
        'term': 'Ventolin Evohaler',
        ...
      }
    }, {
      '_index': 'normalized-term-mapping',
      '_type': '_doc',
      '_id': '194362',
      '_score': 8.529675,
      '_source': {
        'term': 'Childhood Asthma',
        ...
      }
    },
    ...
    ]
  }
}
I want to match the first hit with the keyword Ventolin and the second hit with Asthma.
Note that:
I use fuzziness == 2, so the keywords may not match the hit term exactly.
The indices use an analyzer (not a complex one, but not a trivial one either).
I could write code to match the terms against the query, but that would effectively mean reimplementing the Elasticsearch analysis in my own code, which is not a great solution.
Is there a way to get the matched term from the original query out of Elasticsearch?

Yes, there is a way to get the matched terms using the Highlight API.
You're using a multi_match query, so the default highlight options may be sufficient for you. You do need to specify the fields you want to highlight, with something like this:
{
  'query': {
    'multi_match': {
      'query': 'ventolin for asthma',
      'fuzziness': 2,
      'prefix_length': 1,
      'type': 'best_fields',
      'fields': ['term*']
    }
  },
  'highlight': {
    'fields': {
      'term*': {}
    }
  }
}
However, this won't return an array of matched items. Instead, you will get the fields with existing matches marked (usually with HTML, but you can customize it). You could use that markup to post-process and isolate the individual matches if you need them.
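For example, here is a minimal post-processing sketch in Python. It assumes the default <em>...</em> highlight tags, and the response variable stands for the parsed search reply; both are assumptions for illustration.

import re

def matched_terms(hit):
    # 'highlight' maps each highlighted field to a list of marked-up fragments.
    terms = set()
    for fragments in hit.get('highlight', {}).values():
        for fragment in fragments:
            terms.update(re.findall(r'<em>(.*?)</em>', fragment))
    return terms

for hit in response['hits']['hits']:  # response = parsed search reply (assumed)
    print(hit['_source']['term'], '->', sorted(matched_terms(hit)))
    # e.g. Ventolin Evohaler -> ['Ventolin']
    #      Childhood Asthma  -> ['Asthma']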

Related

Sort by date + show past results after upcoming results

In Elasticsearch I'd like to sort results by start_date ascending, but with past dates showing up after upcoming dates.
Example desired results:
[
  {id: 5, start_date: '3000-01-01'},
  {id: 7, start_date: '3001-01-01'},
  {id: 8, start_date: '3002-01-01'},
  {id: 1, start_date: '1990-01-01'},
  {id: 4, start_date: '1991-01-01'},
  {id: 3, start_date: '1992-01-01'},
]
Something like this would be possible in SQL:
ORDER BY (start_date > NOW()) DESC, start_date ASC
But I'm not sure how to accomplish this in Elastic. The only thing I can think of would be to set a boolean is_upcoming flag and reindex that every day.
Also, I may be limiting and paginating the number of search results, so fetching them in reverse start_date order and then rearranging them in my code isn't really doable.

It's perfectly possible using a sort script, provided your start_date is of type date and its format is yyyy-MM-dd (I found YYYY-... not to work properly).
GET future/_search
{
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "return doc['start_date'].value.millis > params.now ? (doc['start_date'].value.millis - params.now) : Long.MAX_VALUE",
          "params": {
            "now": 1594637988236
          }
        },
        "order": "asc"
      }
    },
    {
      "start_date": {
        "order": "asc"
      }
    }
  ]
}
The parametrized now is needed for synchronization reasons: every shard must compare against the same reference time, and the script itself has no reliable access to the wall clock. Upcoming dates sort by their distance from now, while all past dates share Long.MAX_VALUE and are then ordered among themselves by the secondary start_date sort.
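If you build the request from a client, you can compute the reference time there so every shard sees the same value. A short sketch in Python (the elasticsearch-py client is an assumption; the future index comes from the query above):

import time
from elasticsearch import Elasticsearch

es = Elasticsearch()
now_ms = int(time.time() * 1000)  # one epoch-millis reference shared by all shards
body = {
    "sort": [
        {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "painless",
                    "source": "return doc['start_date'].value.millis > params.now"
                              " ? (doc['start_date'].value.millis - params.now)"
                              " : Long.MAX_VALUE",
                    "params": {"now": now_ms},
                },
                "order": "asc",
            }
        },
        {"start_date": {"order": "asc"}},
    ]
}
results = es.search(index="future", body=body)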

RASA NLU: Can't extract entity

I've trained my Rasa NLU model so that it recognizes the content between square brackets as a pst entity. For the training data, I covered both scenarios with more than 50 examples.
There are two scenarios (the only difference is the spacing):
When I pass http://www.google.comm, 1283923, [9283911,9309212,9283238], it considers only the [ bracket as the pst entity.
When I pass http://www.google.comm, 1283923, [9283911, 9309212, 9283238], it works fine and recognizes [9283911, 9309212, 9283238] as the pst entity, as expected.
For scenario 1, I've tried all the possible pipelines, but it only recognizes the opening square bracket [ as the pst entity.
In the response, I am getting this output:
{
  'intent': {
    'name': None,
    'confidence': 0.0
  },
  'entities': [
    {
      'start': 0,
      'end': 22,
      'value': 'http://www.google.comm',
      'entity': 'url',
      'confidence': 0.8052099168500071,
      'extractor': 'ner_crf'
    },
    {
      'start': 24,
      'end': 31,
      'value': '1283923',
      'entity': 'defect_id',
      'confidence': 0.8334249141074151,
      'extractor': 'ner_crf'
    },
    {
      'start': 33,
      'end': 34,
      'value': '[',
      'entity': 'pst',
      'confidence': 0.5615805162522188,
      'extractor': 'ner_crf'
    }
  ],
  'intent_ranking': [],
  'text': 'http://www.google.comm, 1283923, [9283911,9309212,9283238]'
}
So, can anyone tell me what I am missing in the configuration? The problem is caused by the spacing alone, and my model should have learned the spacing, since I provide training data for both scenarios.

It is a good idea to use a regex for your purpose. Rasa NLU supports extracting entities with regular expressions. Normal NLU training data looks something like this:
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "Hi",
        "intent": "greet",
        "entities": []
      }
    ]
  }
}
You can provide regex data for training in the NLU JSON file like this:
{
  "rasa_nlu_data": {
    "regex_features": [
      {
        "name": "pst",
        "pattern": "\\[..*\\]"
      }
    ]
  }
}
Reference: Regular Expressions in Rasa NLU
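As a quick sanity check that the pattern covers both spacing variants, you can try it in plain Python (the two strings below are the inputs from the question):

import re

pattern = re.compile(r"\[..*\]")  # same pattern as the regex_features entry
tight = "http://www.google.comm, 1283923, [9283911,9309212,9283238]"
spaced = "http://www.google.comm, 1283923, [9283911, 9309212, 9283238]"
print(pattern.search(tight).group())   # [9283911,9309212,9283238]
print(pattern.search(spaced).group())  # [9283911, 9309212, 9283238]

Keep the annotated training examples as well: in classic Rasa NLU, regex_features act as extra features for the ner_crf extractor rather than extracting entities on their own.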

Elasticsearch Scripting: updating every array element

ES version 5.4.1
I have a document like this:
{
  '_id': 'AWUe4rSpgJ6eZRtHLwRC',
  '_index': 'test',
  '_score': 1,
  '_source': {'data': [1, 2, 3, 4]},
  '_type': 'test'
}
and I am trying to add a number to every element in data.
What I tried is:
{
  'query': {
    'match_all': {}
  },
  'script': {
    'inline': "for (int i=0;i<ctx._source.data.size();i++) {ctx._source.data[i]=ctx._source.data[i] + 1000000};",
    'lang': 'painless'
  }
}
but I got
{'reason': "unexpected token [';'] was expecting one of [<EOF>]."}
I am new to Painless. How can I make it correct?
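A minimal sketch of a likely fix, going only by the parser error above: Painless rejects the stray semicolon after the for loop's closing brace, so removing it should let the script compile. Written here as a Python dict, matching the request bodies elsewhere on this page:

body = {
    'query': {
        'match_all': {}
    },
    'script': {
        # Note: no trailing ';' after the for block's closing brace.
        'inline': "for (int i = 0; i < ctx._source.data.size(); i++) "
                  "{ ctx._source.data[i] = ctx._source.data[i] + 1000000 }",
        'lang': 'painless'
    }
}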

mgo with aggregation and grouping

I am trying to perform a query using the Go mgo driver to effectively get distinct values from a join. I understand that this might not be the best paradigm to work with in Mongo.
Something like this:
pipe := []bson.M{
    {
        "$group": bson.M{
            "_id": bson.M{"user": "$user"},
        },
    },
    {
        "$match": bson.M{
            "_id":  bson.M{"$exists": 1},
            "user": bson.M{"$exists": 1},
            "date_updated": bson.M{
                "$gt": durationDays,
            },
        },
    },
    {
        "$lookup": bson.M{
            "from":         "users",
            "localField":   "user",
            "foreignField": "_id",
            "as":           "user_details",
        },
    },
    {
        "$lookup": bson.M{
            "from":         "organizations",
            "localField":   "organization",
            "foreignField": "_id",
            "as":           "organization_details",
        },
    },
}
err := d.Pipe(pipe).All(&result)
If I comment out the $group section, the query returns the join as expected.
If I run it as is, I get NULL.
If I move the $group to the bottom of the pipe, I get an array response with null values.
Is it possible to do an aggregation with a $group (with the goal of simulating DISTINCT)?

The reason you're getting NULL is that your $match stage filters out all of the documents after the $group stage.
After your first $group stage, the documents look like this:
{"_id": { "user": "foo"}},
{"_id": { "user": "bar"}},
{"_id": { "user": "baz"}}
They no longer contain the other fields, i.e. user, date_updated, and organization. If you would like to keep their values, you can use a group accumulator operator. Depending on your use case, you may also benefit from aggregation expression variables.
As an example using the mongo shell, let's use the $first operator, which picks the first occurrence. This may make sense for organization but not for date_updated, so choose a more appropriate accumulator operator for that field.
{"$group": {
"_id":"$user",
"date_updated": {"$first":"$date_updated"},
"organization": {"$first":"$organization"}
}
}
Note that the above also replaces {"_id":{"user":"$user"}} with the simpler {"_id":"$user"}.
Next we'll add a $project stage to rename the _id field resulting from the group operation back to user, carrying the other fields along unmodified.
{"$project": {
"user": "$_id",
"date_updated": 1,
"organization": 1
}
}
Your $match stage can be simplified to just the date_updated filter. We can remove _id as it's no longer relevant at this point in the pipeline; also, if you want to make sure you only process documents that have a user value, you should place that $match before the $group. See Aggregation Pipeline Optimization for more.
So, all of those combined will look something like this:
[
  {"$group": {
    "_id": "$user",
    "date_updated": {"$first": "$date_updated"},
    "organization": {"$first": "$organization"}
  }},
  {"$project": {
    "user": "$_id",
    "date_updated": 1,
    "organization": 1
  }},
  {"$match": {
    "date_updated": {"$gt": durationDays}
  }},
  {"$lookup": {
    "from": "users",
    "localField": "user",
    "foreignField": "_id",
    "as": "user_details"
  }},
  {"$lookup": {
    "from": "organizations",
    "localField": "organization",
    "foreignField": "_id",
    "as": "organization_details"
  }}
]
(I know you're aware of it.) Lastly, based on the database schema above with users and organizations collections, depending on your application's use case you may reconsider embedding some values. You may find 6 Rules of Thumb for MongoDB Schema Design useful.

Elasticsearch: does not give back result when searching for a simple 'a' character

I want to store tags for messages in Elasticsearch. I've defined the tags field like this:
{
  'tags': {
    'type': 'string',
    'index_name': 'tag'
  }
}
For a message I've stored the following list in the tags field:
['a','b','c']
Now if I try to search for tag 'b' with the following query, it gives back the message and the tags:
{
  'filter': {
    'limit': {
      'value': 100
    }
  },
  'query': {
    'bool': {
      'should': [
        {
          'text': {
            'tags': 'b'
          }
        }
      ],
      'minimum_number_should_match': 1
    }
  }
}
The same goes for tag 'c'.
But if I search for tag 'a' with this:
{
  'filter': {
    'limit': {
      'value': 100
    }
  },
  'query': {
    'bool': {
      'should': [
        {
          'text': {
            'tags': 'a'
          }
        }
      ],
      'minimum_number_should_match': 1
    }
  }
}
It gives back no results at all!
The response is:
{
  'hits': {
    'hits': [],
    'total': 0,
    'max_score': None
  },
  '_shards': {
    'successful': 5,
    'failed': 0,
    'total': 5
  },
  'took': 1,
  'timed_out': False
}
What am I doing wrong? (It doesn't matter that 'a' is the first element of the list; the same happens with ['b','a','c']. It seems to have problems only with a single 'a' character.)

If you didn't set any analyzer and mapping for your index, Elasticsearch falls back to its defaults. Elasticsearch's default analyzer has a stopword filter that by default removes English stopwords such as:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
Before going further, check the Elasticsearch mapping and analyzer guides:
Analyzer Guide
Mapping Guide
There might be some stemming or stop word lists involved. Try making sure the field is not analyzed.
'tags': {'type': 'string', 'index_name': 'tag', 'index': 'not_analyzed'}
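Spelled out as a full mapping plus a follow-up search, that might look like this (a sketch against the pre-2.x string-type API used in the question; the message type name is hypothetical):

# Mapping: index the tags verbatim, skipping analysis entirely.
mapping = {
    'message': {  # hypothetical mapping type
        'properties': {
            'tags': {
                'type': 'string',
                'index_name': 'tag',
                'index': 'not_analyzed'
            }
        }
    }
}

# A term query skips analysis too, so it matches the stored value exactly.
query = {
    'query': {
        'term': {
            'tags': 'a'
        }
    }
}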
Similar: matching whole string with dashes in elasticsearch
