How to apply default post filter with ElasticSearch? - elasticsearch

I would like to implement a backtesting engine using Elasticsearch. To do that, I need to filter the hits by excluding those dated after the testing date, and I would like this to happen by default, because the algorithm I want to backtest is not supposed to know about the backtesting.
In other words, is it possible to apply a default post filter to ElasticSearch queries?
For example, let's say that those documents are in ES:
{ name: 'Jean', weight: 70, date: 2012-01-01 }
{ name: 'Jules', weight: 70, date: 2010-01-01 }
{ name: 'David', weight: 80, date: 2010-01-01 }
I want to apply a default post filter that excludes documents dated after 2011, so that if I query for every person with a weight of 70, the only result is Jules.

You can do that with Filtered Aliases. When you query through the alias, the filter is automatically applied to your query, which hides it from your application:
// Insert the data
curl -XPOST "http://localhost:9200/people/data/" -d'
{ "name": "Jean", "weight" : 70, "date": "2012-01-01" }'
curl -XPOST "http://localhost:9200/people/ata" -d'
{ "name": "Jules", "weight" : 70, "date": "2010-01-01" }'
curl -XPOST "http://localhost:9200/people/data/" -d'
{ "name": "David", "weight" : 80, "date": "2010-01-01" }'
// Add a filtered alias
curl -XPOST "http://localhost:9200/_aliases" -d'
{
"actions" : [
{
"add" : {
"index" : "people",
"alias" : "filtered_people",
"filter" : {
"range" : {
"date" : { "lte" : "2011-01-01"}
}
}
}
}
]
}'
Now you execute the search against filtered_people instead of the underlying people index:
curl -XGET "http://localhost:9200/filtered_people/_search" -d'
{
"query": {
"filtered": {
"filter": {
"term": {
"weight": 70
}
}
}
}
}'
Which will return just the doc you are interested in:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "people",
"_type": "ata",
"_id": "AUudZPUfCSiheYJkTW-h",
"_score": 1,
"_source": {
"name": "Jules",
"weight": 70,
"date": "2010-01-01"
}
}
]
}
}
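For a quick sanity check without a cluster, the effect of the filtered alias can be imitated in plain Python. This is only a sketch: `alias_filter`, `term_query` and `search` are hypothetical helpers, not part of any Elasticsearch client, and the alias filter is assumed to keep documents dated on or before 2011-01-01, as the question asks.

```python
from datetime import date

# The three sample documents from the question.
DOCS = [
    {"name": "Jean", "weight": 70, "date": "2012-01-01"},
    {"name": "Jules", "weight": 70, "date": "2010-01-01"},
    {"name": "David", "weight": 80, "date": "2010-01-01"},
]

CUTOFF = date(2011, 1, 1)  # assumed backtesting date


def alias_filter(doc):
    """Stand-in for the alias's range filter: keep docs dated on or before CUTOFF."""
    return date.fromisoformat(doc["date"]) <= CUTOFF


def term_query(field, value):
    """Stand-in for a term filter on an exact value."""
    return lambda doc: doc.get(field) == value


def search(docs, query, default_filter=alias_filter):
    """Apply the caller's query, then the default filter the caller never sees."""
    return [d for d in docs if query(d) and default_filter(d)]


hits = search(DOCS, term_query("weight", 70))
print([d["name"] for d in hits])
```

The caller only passes the weight query; the date restriction lives in `search`, much as the alias keeps it out of the application code. Jean also weighs 70 but is dropped by the default filter, leaving Jules.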

Related

Query on Elastic Search on multiple criterias

I have this document in elastic search
{
"_index" : "master",
"_type" : "_doc",
"_id" : "q9IGdXABeXa7ITflapkV",
"_score" : 0.0,
"_source" : {
"customer_acct" : "64876457056",
"ssn_number" : "123456789",
"name" : "Julie",
"city" : "NY"
}
}
I want to query the master index with customer_acct and ssn_number to retrieve the entire document. I also want to disable scoring and relevance. I have used the query below:
curl -X GET "localhost/master/_search/?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"term": {
"customer_acct": {
"value":"64876457056"
}
}
}
}'
I need to include a second criterion in the term query as well, namely ssn_number. How would I fit it into the query above? I also want to turn off scoring and relevance; would that be possible? I am new to Elasticsearch.
First, you need to define a proper mapping for your index. Your customer_acct and ssn_number values are numeric, but you are storing them as strings. Judging by your sample, they need the long type. Then you can simply use filter context in your query, since you don't need score or relevance in your result. Read more about filter context in the official ES documentation; the snippet below is quoted from it:
In a filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated. Filter context is mostly used for
filtering structured data,
which is exactly your use-case.
1. Index Mapping
{
"mappings": {
"properties": {
"customer_acct": {
"type": "long"
},
"ssn_number" :{
"type": "long"
},
"name" : {
"type": "text"
},
"city" :{
"type": "text"
}
}
}
}
2. Index sample docs
{
"name": "Smithe John",
"city": "SF",
"customer_acct": 64876457065,
"ssn_number": 123456790
}
{
"name": "Julie",
"city": "NY",
"customer_acct": 64876457056,
"ssn_number": 123456789
}
3. Main search query to filter without the score
{
"query": {
"bool": {
"filter": [ --> only filter clause
{
"term": {
"customer_acct": 64876457056
}
},
{
"term": {
"ssn_number": 123456789
}
}
]
}
}
}
Above search query gives below result:
{
"took": 186,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "so-master",
"_type": "_doc",
"_id": "1",
"_score": 0.0, --> notice score is 0.
"_source": {
"name": "Julie",
"city": "NY",
"customer_acct": 64876457056,
"ssn_number": 123456789
}
}
]
}
}
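If the number of criteria varies, the request body can be generated instead of written by hand. The sketch below is an illustration only: `build_filter_query` and `matches` are hypothetical helpers, and `matches` is a local stand-in for the cluster that understands nothing beyond exact term clauses.

```python
import json


def build_filter_query(criteria):
    """Build a bool query whose clauses all run in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {field: value}} for field, value in criteria.items()]
            }
        }
    }


def matches(doc, body):
    """Local stand-in for the cluster: every term clause must match exactly."""
    clauses = body["query"]["bool"]["filter"]
    return all(doc.get(f) == v for c in clauses for f, v in c["term"].items())


# The two sample documents indexed above.
DOCS = [
    {"name": "Smithe John", "city": "SF", "customer_acct": 64876457065, "ssn_number": 123456790},
    {"name": "Julie", "city": "NY", "customer_acct": 64876457056, "ssn_number": 123456789},
]

body = build_filter_query({"customer_acct": 64876457056, "ssn_number": 123456789})
print(json.dumps(body, indent=2))

matched = [d["name"] for d in DOCS if matches(d, body)]
print(matched)
```

Every clause in the filter list must hold, so the two term clauses behave like a logical AND, and adding a third criterion is just one more dictionary entry.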

Retrieving top terms query in Elasticsearch

I am using Elasticsearch 1.1.0 and trying to retrieve the top 10 terms in a field called text
I've tried the following, but it instead returned all of the documents:
{
"query": {
"match_all": {}
},
"facets": {
"text": {
"terms": {
"field": "text",
"size": 10
}
}
}
}
EDIT
the following is an example of the result that is returned:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2747,
"max_score": 1,
"hits": [
{
"_index": "index_name",
"_type": "type_name",
"_id": "621637640908050432",
"_score": 1,
"_source": {
"metadata": {
"result_type": "recent",
"iso_language_code": "en"
},
"in_reply_to_status_id_str": null,
"in_reply_to_status_id": null,
"created_at": "Thu Jul 16 11:08:57 +0000 2015",
.
.
.
.
What am I doing wrong?
Thanks.
First of all, don't use facets. They are deprecated. Even though you are using an old version of Elasticsearch, switch to aggregations. Quoting the documentation:
Faceted search refers to a way to explore large amounts of data by
displaying summaries about various partitions of the data and later
allowing to narrow the navigation to a specific partition.
In Elasticsearch, facets are also the name of a feature that allowed
to compute these summaries. facets have been replaced by aggregations
in Elasticsearch 1.0, which are a superset of facets.
Use this query instead:
POST /your_index/your_type/_search?search_type=count
{
"aggs" : {
"text" : {
"terms" : {
"field" : "text",
"size" : 10
}
}
}
}
This will work fine
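To make concrete what a terms aggregation with a size reports, here is a rough Python imitation. It is a sketch under assumptions: the analysis is reduced to lowercasing and whitespace splitting, `top_terms` is a hypothetical helper, and the sample sentences are made up.

```python
from collections import Counter


def top_terms(texts, size=10):
    """Count, per term, how many documents contain it and return the top `size`."""
    counts = Counter()
    for text in texts:
        # Assumed analysis: lowercase + whitespace split. A set is used so a
        # term repeated inside one document is still counted once.
        counts.update(set(text.lower().split()))
    return counts.most_common(size)


sample = ["the quick fox", "the lazy dog", "the fox"]  # made-up documents
result = top_terms(sample, size=2)
print(result)
```

The response carries these buckets under the aggregations key, separate from the hits, which is why a plain search without a count-only option still lists the matching documents.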
Try this:
GET /index_name/type_name/_search?search_type=count
{
"query": {
"match_all": {}
},
"facets": {
"text": {
"terms": {
"field": "text",
"size": 10
}
}
}
}

Get specific fields from index in elasticsearch

I have an index in elastic-search.
Sample structure :
{
"Article": "Article7645674712",
"Genre": "Genre92231455",
"relationDesc": [
"Article",
"Genre"
],
"org": "user",
"dateCreated": {
"date": "08/05/2015",
"time": "16:22 IST"
},
"dateModified": "08/05/2015"
}
From this index I want to retrieve selected fields: org and dateModified.
I want result like this
{
"took": 265,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 1,
"hits": [
{
"_index": "couchrecords",
"_type": "couchbaseDocument",
"_id": "3",
"_score": 1,
"_source": {
"doc": {
"org": "user",
"dateModified": "08/05/2015"
}
}
},
{
"_index": "couchrecords",
"_type": "couchbaseDocument",
"_id": "4",
"_score": 1,
"_source": {
"doc": {
"org": "user",
"dateModified": "10/05/2015"
}
}
}
]
}
}
How to query elastic-search to get only selected specific fields ?
You can retrieve only a specific set of fields in the result hits using the _source parameter like this:
curl -XGET localhost:9200/couchrecords/couchbaseDocument/_search?_source=org,dateModified
Or in this format:
curl -XPOST localhost:9200/couchrecords/couchbaseDocument/_search -d '{
"_source": ["doc.org", "doc.dateModified"], <---- you just need to add this
"query": {
"match_all":{} <----- or whatever query you have
}
}'
That's easy. Considering any query of this format :
{
"query": {
...
}
}
You'll just need to add the fields field into your query which in your case will result in the following :
{
"query": {
...
},
"fields" : ["org","dateModified"]
}
Alternatively, use source filtering:
{
"_source" : ["org","dateModified"],
"query": {
...
}
}
Check ElasticSearch source filtering.
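The effect of source filtering on a single hit can be pictured with a small Python function. This is a simplified sketch: `filter_source` is a hypothetical helper that handles only exact dotted paths, not the wildcard patterns the real feature accepts.

```python
def filter_source(source, includes):
    """Return a copy of `source` holding only the dotted paths in `includes`."""
    result = {}
    for path in includes:
        keys = path.split(".")
        node = source
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                break  # path is absent from this document, skip it
            node = node[key]
        else:
            # Rebuild the nesting so the value sits at the same path.
            target = result
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = node
    return result


# The sample document from the question (relationDesc omitted for brevity).
doc = {
    "Article": "Article7645674712",
    "Genre": "Genre92231455",
    "org": "user",
    "dateCreated": {"date": "08/05/2015", "time": "16:22 IST"},
    "dateModified": "08/05/2015",
}

print(filter_source(doc, ["org", "dateModified"]))
print(filter_source(doc, ["dateCreated.date"]))
```

Nested fields keep their enclosing objects in the output, which matches the way the desired result in the question shows org and dateModified wrapped in doc.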

Elasticsearch return unique values for a field

I am trying to build an Elasticsearch query that will return only unique values for a particular field.
I do not want to return all the values for that field nor count them.
For example, if there are 50 different values currently contained by the field, and I do a search to return only 20 hits (size=20). I want each of the 20 results to have a unique result for that field, but I don't care about the 30 other values not represented in the result.
For example with the following search (pseudo code - not checked):
{
from: 0,
size: 20,
query: {
bool: {
must: {
range: { field1: { gte: 50 }},
term: { field2: 'salt' },
/**
* I want to return only unique values for "field3", but I
* don't want to return all of them or count them.
*
* How do I specify this in my query?
**/
unique: 'field3',
},
must_not: {
match: { field4: 'pepper'},
}
}
}
}
You should be able to do this pretty easily with a terms aggregation.
Here's an example. I defined a simple index, containing a field that has "index": "not_analyzed" so we can get the full text of each field as a unique value, rather than terms generated from tokenizing it, etc.
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then I add a few docs with the bulk API.
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"first doc"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"second doc"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"third doc"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"third doc"}
Now we can run our terms aggregation:
POST /test_index/_search?search_type=count
{
"aggs": {
"unique_vals": {
"terms": {
"field": "title"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"unique_vals": {
"buckets": [
{
"key": "third doc",
"doc_count": 2
},
{
"key": "first doc",
"doc_count": 1
},
{
"key": "second doc",
"doc_count": 1
}
]
}
}
}
I'm very surprised a filter aggregation hasn't been suggested. It goes back all the way to ES version 1.3.
The filter aggregation is similar to a regular filter query, but it can be nested into an aggregation chain. It filters out documents that don't meet the given criteria and gives you sub-aggregation results based only on the documents that do.
First, we'll put our mapping.
curl --request PUT \
--url http://localhost:9200/items \
--header 'content-type: application/json' \
--data '{
"mappings": {
"item": {
"properties": {
"field1" : { "type": "integer" },
"field2" : { "type": "keyword" },
"field3" : { "type": "keyword" },
"field4" : { "type": "keyword" }
}
}
}
}
'
Then let's load some data.
curl --request PUT \
--url http://localhost:9200/items/_bulk \
--header 'content-type: application/json' \
--data '{"index":{"_index":"items","_type":"item","_id":1}}
{"field1":50, "field2":["salt", "vinegar"], "field3":["garlic", "onion"], "field4":"paprika"}
{"index":{"_index":"items","_type":"item","_id":2}}
{"field1":40, "field2":["salt", "pepper"], "field3":["onion"]}
{"index":{"_index":"items","_type":"item","_id":3}}
{"field1":100, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"pepper"}
{"index":{"_index":"items","_type":"item","_id":4}}
{"field1":90, "field2":["vinegar"], "field3":["chives", "garlic"]}
{"index":{"_index":"items","_type":"item","_id":5}}
{"field1":900, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"paprika"}
'
Notice that only the documents with ids 1 and 5 pass the criteria, so we are left to aggregate on their two field3 arrays, ["garlic", "onion"] and ["garlic", "chives"], which hold four values in total. Also notice that field3 can be an array or a single value in the data; I'm making them arrays to illustrate how the counts work.
curl --request POST \
--url http://localhost:9200/items/item/_search \
--header 'content-type: application/json' \
--data '{
"size": 0,
"aggregations": {
"top_filter_agg" : {
"filter" : {
"bool": {
"must":[
{
"range" : { "field1" : { "gte":50} }
},
{
"term" : { "field2" : "salt" }
}
],
"must_not":[
{
"term" : { "field4" : "pepper" }
}
]
}
},
"aggs" : {
"field3_terms_agg" : { "terms" : { "field" : "field3" } }
}
}
}
}
'
After running the combined filter/terms aggregation, we have a total count of 4 terms on field3 and three unique terms altogether.
{
"took": 46,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"top_filter_agg": {
"doc_count": 2,
"field3_terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "garlic",
"doc_count": 2
},
{
"key": "chives",
"doc_count": 1
},
{
"key": "onion",
"doc_count": 1
}
]
}
}
}
}
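The bucket counts above can be reproduced offline with a few lines of Python, which is a handy way to check one's reading of the filter. This is a sketch, not a client call: `passes_filter` is a hypothetical function that mirrors the bool filter by hand.

```python
from collections import Counter

# The five documents loaded through the bulk request above.
DOCS = [
    {"field1": 50, "field2": ["salt", "vinegar"], "field3": ["garlic", "onion"], "field4": "paprika"},
    {"field1": 40, "field2": ["salt", "pepper"], "field3": ["onion"]},
    {"field1": 100, "field2": ["salt", "vinegar"], "field3": ["garlic", "chives"], "field4": "pepper"},
    {"field1": 90, "field2": ["vinegar"], "field3": ["chives", "garlic"]},
    {"field1": 900, "field2": ["salt", "vinegar"], "field3": ["garlic", "chives"], "field4": "paprika"},
]


def passes_filter(doc):
    """Mirror of the bool filter: field1 >= 50, field2 has salt, field4 is not pepper."""
    return (
        doc["field1"] >= 50
        and "salt" in doc["field2"]
        and doc.get("field4") != "pepper"
    )


kept = [d for d in DOCS if passes_filter(d)]

# Terms aggregation on field3: one count per document containing the value.
buckets = Counter(value for d in kept for value in set(d["field3"]))

print(len(kept))
print(dict(buckets))
```

Two documents survive the filter, and garlic appears in both of them while chives and onion appear in one each, in line with the doc_count values in the response.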

Search query for elasticsearch when child element is array of string

I created a document in Elasticsearch in the following format:
curl -XPUT "http://localhost:9200/my_base.main_candidate/" -d'
{
"specific_location": {
"location_name": "Mumbai",
"location_tags": [
"Mumbai"
],
"tags": [
"Mumbai"
]
}
}'
My requirement is to search for location_tags containing one of the given options like ["Mumbai", "Pune"]. How do I do this?
I tried:
curl -XGET "http://localhost:9200/my_base.main_candidate/_search" -d '
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"terms": {
"specific_location.location_tags" : ["Mumbai"]
}
}
}
}
}'
which didn't work.
I got this output :
{
"took": 72,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
There are several ways to solve this. Perhaps the most immediate one is to search for mumbai instead of Mumbai.
If I create the index with no mapping,
curl -XDELETE "http://localhost:9200/my_base.main_candidate/"
curl -XPUT "http://localhost:9200/my_base.main_candidate/"
then add a doc:
curl -XPUT "http://localhost:9200/my_base.main_candidate/doc/1" -d'
{
"specific_location": {
"location_name": "Mumbai",
"location_tags": [
"Mumbai"
],
"tags": [
"Mumbai"
]
}
}'
then run your query with the lower-case term
curl -XPOST "http://localhost:9200/my_base.main_candidate/_search" -d'
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"terms": {
"specific_location.location_tags": [
"mumbai"
]
}
}
}
}
}'
I get back the expected doc:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_base.main_candidate",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"specific_location": {
"location_name": "Mumbai",
"location_tags": [
"Mumbai"
],
"tags": [
"Mumbai"
]
}
}
}
]
}
}
This is because, since no explicit mapping was used, Elasticsearch uses defaults, which means the location_tags field will be analyzed with the standard analyzer, which will convert terms to lower-case. So the term Mumbai does not exist, but mumbai does.
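The mismatch can be illustrated with a toy model of indexing in Python. It is a deliberately rough sketch: `standard_like_analyze` only approximates the default analysis (split on non-word characters, lowercase), and all three function names are hypothetical.

```python
import re


def standard_like_analyze(value):
    """Very rough imitation of the default analysis: split on non-word chars, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", value)]


def not_analyzed(value):
    """Keep the whole value as a single, unchanged term."""
    return [value]


def index_terms(values, analyze):
    """Collect the set of terms an inverted index would hold for these values."""
    return {term for value in values for term in analyze(value)}


tags = ["Mumbai"]

analyzed_terms = index_terms(tags, standard_like_analyze)
raw_terms = index_terms(tags, not_analyzed)

# A terms filter looks the query value up as-is, without analyzing it.
print("Mumbai" in analyzed_terms, "mumbai" in analyzed_terms)
print("Mumbai" in raw_terms)
```

With analysis, the index holds only the lower-case term, so an exact lookup for the capitalised value finds nothing; without analysis, the original value is stored and matches.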
If you want to be able to use upper-case terms in your query, you will need to set up an explicit mapping that tells Elasticsearch not to analyze the location_tags field. Maybe something like this:
curl -XDELETE "http://localhost:9200/my_base.main_candidate/"
curl -XPUT "http://localhost:9200/my_base.main_candidate/" -d'
{
"mappings": {
"doc": {
"properties": {
"specific_location": {
"properties": {
"location_tags": {
"type": "string",
"index": "not_analyzed"
},
"tags": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
curl -XPUT "http://localhost:9200/my_base.main_candidate/doc/1" -d'
{
"specific_location": {
"location_name": "Mumbai",
"location_tags": [
"Mumbai"
],
"tags": [
"Mumbai"
]
}
}'
curl -XPOST "http://localhost:9200/my_base.main_candidate/_search" -d'
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"terms": {
"specific_location.location_tags": [
"Mumbai"
]
}
}
}
}
}'
Here is all the above code in a handy place:
http://sense.qbox.io/gist/74844f4d779f7c2b94a9ab65fd76eb0ffe294cbb
[EDIT: by the way, I used Elasticsearch 1.3.4 when testing the above code]
