Elasticsearch "match_phrase" query and "fuzzy" query: can both be used in conjunction?

I need a query that uses match_phrase together with fuzzy matching. However, I can't find any documentation on how to construct such a query. Also, when I try combining the queries (one within another), it throws errors. Is it possible to construct such a query?

You need to use span queries.
The query below performs a phrase match combined with fuzzy matching for "champions league" on a sample field called name, which is of type text.
If you want to search multiple fields, add another must clause per field.
Note that slop: 0 and in_order: true enforce an exact phrase match on term positions, while the fuzzy behaviour comes from the fuzzy query placed inside the match clause of each span_multi.
Sample Documents
POST span-index/mydocs/1
{
"name": "chmpions leage"
}
POST span-index/mydocs/2
{
"name": "champions league"
}
POST span-index/mydocs/3
{
"name": "chompions leugue"
}
Span Query:
POST span-index/_search
{
"query":{
"bool":{
"must":[
{
"span_near":{
"clauses":[
{
"span_multi":{
"match":{
"fuzzy":{
"name":"champions"
}
}
}
},
{
"span_multi":{
"match":{
"fuzzy":{
"name":"league"
}
}
}
}
],
"slop":0,
"in_order":true
}
}
]
}
}
}
Response:
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.5753642,
"hits": [
{
"_index": "span-index",
"_type": "mydocs",
"_id": "2",
"_score": 0.5753642,
"_source": {
"name": "champions league"
}
},
{
"_index": "span-index",
"_type": "mydocs",
"_id": "1",
"_score": 0.5753642,
"_source": {
"name": "chmpions leage"
}
},
{
"_index": "span-index",
"_type": "mydocs",
"_id": "3",
"_score": 0.5753642,
"_source": {
"name": "chompions leugue"
}
}
]
}
}
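As a rough sketch of how the same request body could be generated for an arbitrary phrase, the helper below builds one fuzzy span_multi clause per word. The function name fuzzy_phrase_query and its parameters are made up for illustration, and the lowercasing plus whitespace split only approximates what an analyzer would do (fuzzy is a term-level query, so its input is not analyzed).

```python
import json


def fuzzy_phrase_query(field, phrase, fuzziness="AUTO", slop=0, in_order=True):
    """Build a span_near body whose clauses are fuzzy span_multi queries.

    Hypothetical helper: it only assembles a dict, it does not talk to a cluster.
    """
    # Assumption: lowercase + whitespace split stands in for the field's analyzer.
    terms = phrase.lower().split()
    clauses = [
        {"span_multi": {"match": {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}}}}
        for term in terms
    ]
    return {"query": {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}}


body = fuzzy_phrase_query("name", "Champions League")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a POST to span-index/_search.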
Let me know if this helps!

Related

Elasticsearch - Find documents missing two fields

I'm trying to create a query that returns information about how many documents that don't have data for two fields (date.new and date.old). I have tried the query below, but it works as OR-logic, where all documents missing either date.new or date.old are returned. Does anyone know how I can make this only return documents missing both fields?
{
"aggs":{
"Missing_field_count1":{
"missing":{
"field":"date.new"
}
},
"Missing_field_count2":{
"missing":{
"field":"date.old"
}
}
}
}
Aggregations are not the right feature for this. You need to use the exists query wrapped in a bool/must_not query, like this:
GET index/_count
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "date.new"
}
},
{
"exists": {
"field": "date.old"
}
}
]
}
}
}
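To make the AND logic concrete, here is a small in-memory sketch, not a call to Elasticsearch: a document matches only when none of the listed fields exists. The helper names and the three sample documents are hypothetical, and treating null and empty arrays as "missing" is a simplification of how the exists query behaves.

```python
def missing_all_fields_query(fields):
    """Build a bool/must_not body with one exists clause per field."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": f}} for f in fields]}}}


def field_exists(doc, path):
    """Follow a dotted path through nested dicts; None and [] count as missing."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node is not None and node != []


def missing_all(doc, fields):
    """True only when every listed field is absent (must_not applies to all clauses)."""
    return not any(field_exists(doc, f) for f in fields)


# Hypothetical sample documents for the illustration.
docs = [
    {"date": {"new": 1501, "old": 10}},
    {"title": "elasticsearch"},
    {"date": {"new": 1400}},
]
fields = ["date.new", "date.old"]
matches = [d for d in docs if missing_all(d, fields)]
print(matches)
```

Only the document that has neither date.new nor date.old is kept; the one that is missing just date.old is excluded.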
In a _search response, hits.total.value is the number of documents that match the request, and hits.total.relation indicates whether that value is exact (eq) or a lower bound (gte).
Index Data:
{
"data": {
"new": 1501,
"old": 10
}
}
{
"title": "elasticsearch"
}
{
"title": "elasticsearch-query"
}
{
"date": {
"new": 1400
}
}
The search query given by @Val shows how to achieve your use case.
Search Result:
"hits": {
"total": {
"value": 2, <-- note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "65112793",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"title": "elasticsearch"
}
},
{
"_index": "65112793",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"title": "elasticsearch-query"
}
}
]
}

Get specific fields from index in elasticsearch

I have an index in elastic-search.
Sample structure :
{
"Article": "Article7645674712",
"Genre": "Genre92231455",
"relationDesc": [
"Article",
"Genre"
],
"org": "user",
"dateCreated": {
"date": "08/05/2015",
"time": "16:22 IST"
},
"dateModified": "08/05/2015"
}
From this index I want to retrieve only selected fields: org and dateModified.
I want a result like this:
{
"took": 265,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 1,
"hits": [
{
"_index": "couchrecords",
"_type": "couchbaseDocument",
"_id": "3",
"_score": 1,
"_source": {
"doc": {
"org": "user",
"dateModified": "08/05/2015"
}
}
},
{
"_index": "couchrecords",
"_type": "couchbaseDocument",
"_id": "4",
"_score": 1,
"_source": {
"doc": {
"org": "user",
"dateModified": "10/05/2015"
}
}
}
]
}
}
How to query elastic-search to get only selected specific fields ?
You can retrieve only a specific set of fields in the result hits using the _source parameter like this:
curl -XGET 'localhost:9200/couchrecords/couchbaseDocument/_search?_source=org,dateModified'
Or in this format:
curl -XPOST localhost:9200/couchrecords/couchbaseDocument/_search -d '{
"_source": ["doc.org", "doc.dateModified"], <---- you just need to add this
"query": {
"match_all":{} <----- or whatever query you have
}
}'
That's easy. Consider any query of this format:
{
"query": {
...
}
}
You just need to add the fields parameter to your request, which in your case results in the following:
{
"query": {
...
},
"fields" : ["org","dateModified"]
}
Alternatively, use source filtering with the _source parameter:
{
"_source" : ["org","dateModified"],
"query": {
...
}
}
Check ElasticSearch source filtering.
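To illustrate what source filtering returns, here is a minimal in-memory sketch that keeps only the listed dotted paths from a document. The function filter_source is hypothetical and handles exact paths only (no wildcards or excludes); the sample document assumes the fields sit under a doc object, as in the desired result shown in the question.

```python
def filter_source(source, includes):
    """Return a copy of `source` containing only the given dotted paths."""
    result = {}
    for path in includes:
        keys = path.split(".")
        node = source
        found = True
        for key in keys:
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                found = False
                break
        if not found:
            continue  # paths that are absent from the document are skipped
        # Rebuild the nesting for the kept value.
        target = result
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = node
    return result


# Assumed document shape, based on the sample in the question.
stored = {
    "doc": {
        "Article": "Article7645674712",
        "Genre": "Genre92231455",
        "org": "user",
        "dateModified": "08/05/2015",
    }
}
filtered = filter_source(stored, ["doc.org", "doc.dateModified"])
print(filtered)
```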

Should I include spaces in fuzzy query fields?

I have this data:
name:
first: 'John'
last: 'Smith'
When I store it in ES, AFAICT it's better to make it one field. However, should this one field be:
name: 'John Smith'
or
name: 'JohnSmith'
?
I'm thinking that the query should be:
query:
match:
name:
query: searchTerm
fuzziness: 'AUTO'
operator: 'and'
Example search terms are what people might type in a search box, like
John
Jhon Smi
J Smith
Smith
etc.
You will probably want a combination of ngrams and a fuzzy match query. I wrote a blog post about ngrams for Qbox if you need a primer: https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch. I'll swipe the starter code at the end of the post to illustrate what I mean here.
Also, I don't think it matters much whether you use two fields for name, or just one. If you have some other reason you want two fields, you may want to use the _all field in your query. For simplicity I'll just use a single field here.
Here is a mapping that will get you the partial-word matching you want, assuming you only care about tokens that start at the beginning of words (otherwise use ngrams instead of edge ngrams). There are lots of nuances to using ngrams, so I'll refer you to the documentation and my primer if you want more info.
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"edge_ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"edge_ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "edge_ngram_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
One thing to note here, in particular: "min_gram": 1. This means that single-character tokens will be generated from indexed values. This will cast a pretty wide net when you query (lots of words begin with "j", for example), so you may get some unexpected results, especially when combined with fuzziness. But this is needed to get your "J Smith" query to work right. So there are some trade-offs to consider.
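A rough approximation of what this analyzer emits at index time may help here. The function below is a hypothetical stand-in: it lowercases, splits on whitespace (a simplification of the standard tokenizer), and produces prefixes between min_gram and max_gram characters long.

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate the edge_ngram_analyzer: lowercase, split, emit word prefixes."""
    tokens = []
    for word in text.lower().split():
        longest = min(len(word), max_gram)
        for n in range(min_gram, longest + 1):
            tokens.append(word[:n])
    return tokens


smith_tokens = edge_ngrams("John Smith")
jones_tokens = edge_ngrams("Bob Jones")
print(smith_tokens)
print(jones_tokens)
```

With min_gram set to 1, both names produce the single-character token "j", which is why a query term such as "J" can match either document.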
For illustration, I indexed four documents:
PUT /test_index/doc/_bulk
{"index":{"_id":1}}
{"name":"John Hancock"}
{"index":{"_id":2}}
{"name":"John Smith"}
{"index":{"_id":3}}
{"name":"Bob Smith"}
{"index":{"_id":4}}
{"name":"Bob Jones"}
Your query mostly works, with a couple of caveats.
POST /test_index/_search
{
"query": {
"match": {
"name": {
"query": "John",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
}
This query returns three documents because of ngrams plus fuzziness:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.90169895,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.90169895,
"_source": {
"name": "John Hancock"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.90169895,
"_source": {
"name": "John Smith"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "4",
"_score": 0.6235822,
"_source": {
"name": "Bob Jones"
}
}
]
}
}
That may not be what you want. Also, "AUTO" doesn't work with the "Jhon Smi" query, because "Jhon" is an edit distance of 2 from "John", and "AUTO" uses an edit distance of 1 for strings of 3-5 characters (see the docs for more info). So I have to use this query instead:
POST /test_index/_search
{
"query": {
"match": {
"name": {
"query": "Jhon Smi",
"fuzziness": 2,
"operator": "and"
}
}
}
}
...
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4219328,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1.4219328,
"_source": {
"name": "John Smith"
}
}
]
}
}
The other queries work as expected. So this solution isn't perfect, but it will get you close.
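The fuzziness reasoning above can be checked with a small sketch. auto_fuzziness mirrors the documented AUTO thresholds (0 edits for 1-2 characters, 1 for 3-5, 2 for longer terms), and levenshtein is a plain edit distance that counts an adjacent swap as two edits. That last point is an assumption of this sketch; a distance that treats a transposition as a single edit would score "jhon" differently.

```python
def auto_fuzziness(term):
    """Maximum edits allowed by fuzziness AUTO for a term of this length."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


print(auto_fuzziness("jhon"), levenshtein("jhon", "john"))
print(auto_fuzziness("john"), levenshtein("john", "jon"))
```

Under these assumptions "jhon" needs two edits while AUTO allows only one, and "john" is a single edit away from "jon", an edge ngram of "Jones", which is consistent with "Bob Jones" appearing in the first result set.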
Here's all the code I used:
http://sense.qbox.io/gist/ba5a6741090fd40c1bb20f5d36f3513b4b55ac77

Elasticsearch: get multiple specified documents in one request?

I am new to Elasticsearch and hope to know whether this is possible.
Basically, I have the values in the "code" property for multiple documents. Each document has a unique value in this property. Now I have the codes of multiple documents and hope to retrieve them in one request by supplying multiple codes.
Is this doable in Elasticsearch?
Regards.
Edit
This is the mapping of the field:
"code" : { "type" : "string", "store": "yes", "index": "not_analyzed"},
Two example values of this property:
0Qr7EjzE943Q
GsPVbMMbVr4s
What is the ES syntax to retrieve the two documents in ONE request?
First, you probably don't want "store":"yes" in your mapping, unless you have _source disabled (see this post).
So, I created a simple index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"code": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then I added the two docs with the bulk API:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"code":"0Qr7EjzE943Q"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"code":"GsPVbMMbVr4s"}
There are a number of ways I could retrieve those two documents. The most straightforward, especially since the field isn't analyzed, is probably with a terms query:
POST /test_index/_search
{
"query": {
"terms": {
"code": [
"0Qr7EjzE943Q",
"GsPVbMMbVr4s"
]
}
}
}
Both documents are returned:
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.04500804,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.04500804,
"_source": {
"code": "0Qr7EjzE943Q"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.04500804,
"_source": {
"code": "GsPVbMMbVr4s"
}
}
]
}
}
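If the list of codes comes from application code, the request body can be assembled programmatically. The sketch below uses a hypothetical helper, terms_query, which removes duplicate codes and sets size to the number of codes, since a search returns only 10 hits by default.

```python
import json


def terms_query(field, values):
    """Build a terms query body for an exact-match (not analyzed) field."""
    unique = list(dict.fromkeys(values))  # drop duplicates, keep first-seen order
    # One document per code is expected, so ask for as many hits as there are codes.
    return {"size": len(unique), "query": {"terms": {field: unique}}}


codes = ["0Qr7EjzE943Q", "GsPVbMMbVr4s", "0Qr7EjzE943Q"]
body = terms_query("code", codes)
print(json.dumps(body, indent=2))
```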
Here is the code I used:
http://sense.qbox.io/gist/a3e3e4f05753268086a530b06148c4552bfce324

How to use _timestamp in a scripted update

I was trying to come up with an elegant answer to this question and ran into an unexpected problem. The basic idea is to update a document based on its current timestamp. Seems straightforward enough, but I seem to be missing something. At the bottom of the Update API page, the ES docs say:
It also allows to update the ttl of a document using ctx._ttl and timestamp using ctx._timestamp. Note that if the timestamp is not updated and not extracted from the _source it will be set to the update date.
The ES documentation is often enigmatic at best, especially when it comes to scripting, but I took this to mean that I could use the _timestamp field in an update script.
So I set up a simple index with a timestamp:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"_timestamp": {
"enabled": true,
"store": true,
"path": "doc_date",
"format" : "YYYY-MM-dd"
},
"properties": {
"doc_date": {
"type": "date",
"format" : "YYYY-MM-dd"
},
"doc_text": {
"type": "string"
}
}
}
}
}
and added some docs:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"doc_text":"doc1", "doc_date":"2015-2-5"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"doc_text":"doc2", "doc_date":"2015-2-10"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"doc_text":"doc3", "doc_date":"2015-2-15"}
If I query for the first doc, I get back what I expect:
POST /test_index/_search
{
"query": {
"match": {
"doc_text": "doc1"
}
},
"fields": [
"_timestamp",
"_source"
]
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4054651,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1.4054651,
"_source": {
"doc_text": "doc1",
"doc_date": "2015-2-5"
},
"fields": {
"_timestamp": 1423094400000
}
}
]
}
}
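The _timestamp value in this response is the document's doc_date expressed as epoch milliseconds in UTC. A quick sketch to confirm the arithmetic (the helper name to_epoch_millis is made up for this example):

```python
from datetime import datetime, timezone


def to_epoch_millis(date_str):
    """Convert a 'YYYY-M-D' date string to epoch milliseconds at midnight UTC."""
    year, month, day = (int(part) for part in date_str.split("-"))
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)


doc1_ts = to_epoch_millis("2015-2-5")
new_ts = to_epoch_millis("2015-2-10")
print(doc1_ts, new_ts)
```

The first value matches the _timestamp returned for doc1, and the second matches the new_ts parameter used in the update scripts that follow.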
So far so good. Now I want to conditionally update the first doc, based on its timestamp. First I tried this, and got an error:
POST /test_index/doc/1/_update
{
"script": "if(ctx._timestamp < new_ts){ctx._source.doc_date=new_date;ctx._source.doc_text=new_text}",
"params": {
"new_ts": 1423526400000,
"new_date": "2015-2-10",
"new_text": "doc1-updated"
}
}
...
{
"error": "ElasticsearchIllegalArgumentException[failed to execute script]; nested: PropertyAccessException[[Error: could not access: _timestamp; in class: java.util.HashMap]\n[Near : {... if(ctx._timestamp < new_ts){ctx._ ....}]\n ^\n[Line: 1, Column: 4]]; ",
"status": 400
}
Then I tried this:
POST /test_index/doc/1/_update
{
"script": "if(ctx[\"_timestamp\"] < new_ts){ctx._source.doc_date=new_date;ctx._source.doc_text=new_text}",
"params": {
"new_ts": 1423526400000,
"new_date": "2015-2-10",
"new_text": "doc1-updated"
}
}
...
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_version": 2
}
I didn't get an error, but the update didn't happen:
POST /test_index/_search
{
"query": {
"match": {
"doc_text": "doc1"
}
},
"fields": [
"_timestamp",
"_source"
]
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.287682,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1.287682,
"_source": {
"doc_text": "doc1",
"doc_date": "2015-2-5"
},
"fields": {
"_timestamp": 1423094400000
}
}
]
}
}
Just out of curiosity, I inverted the conditional:
POST /test_index/doc/1/_update
{
"script": "if(ctx[\"_timestamp\"] > new_ts){ctx._source.doc_date=new_date;ctx._source.doc_text=new_text}",
"params": {
"new_ts": 1423526400000,
"new_date": "2015-2-10",
"new_text": "doc1-updated"
}
}
with the same result: no update.
Okay, so as a sanity check I tried to set the timestamp, and got an error:
POST /test_index/doc/1/_update
{
"script": "ctx._source.doc_date=new_date;ctx._source.doc_text=new_text;ctx._timestamp=new_ts",
"params": {
"new_ts": 1423526400000,
"new_date": "2015-2-10",
"new_text": "doc1-updated"
}
}
...
{
"error": "ClassCastException[java.lang.Long cannot be cast to java.lang.String]",
"status": 500
}
I also tried it with "ctx[\"_timestamp\"]=new_ts;", and got the same error.
So it seems that the _timestamp field is not available to the script, even though the documentation says it is. What am I doing wrong?
I also tried updating without the conditional or updating the timestamp, and it worked as expected.
I used Elasticsearch version 1.3.4 (with dynamic scripting enabled, obviously), running on an Ubuntu 12 VM.
Here is the code I used to set this up:
http://sense.qbox.io/gist/ca2b3c6b84572e5f87d57d22f8c38252fa4ee216
