How to specify ElasticSearch copy_to order?

ElasticSearch has the ability to copy values to other fields (at index time), enabling you to search on multiple fields as if they were one field (Core Types: copy_to).
However, there doesn't seem to be any way to specify the order in which these values should be copied. This could be important when phrase matching:
curl -XDELETE 'http://10.11.12.13:9200/helloworld'
curl -XPUT 'http://10.11.12.13:9200/helloworld'
# copy_to is ordered alphabetically!
curl -XPUT 'http://10.11.12.13:9200/helloworld/_mapping/people' -d '
{
  "people": {
    "properties": {
      "last_name": {
        "type": "string",
        "copy_to": "full_name"
      },
      "first_name": {
        "type": "string",
        "copy_to": "full_name"
      },
      "state": {
        "type": "string"
      },
      "city": {
        "type": "string"
      },
      "full_name": {
        "type": "string"
      }
    }
  }
}
'
curl -X POST "10.11.12.13:9200/helloworld/people/dork" -d '{"first_name": "Jim", "last_name": "Bob", "state": "California", "city": "San Jose"}'
curl -X POST "10.11.12.13:9200/helloworld/people/face" -d '{"first_name": "Bob", "last_name": "Jim", "state": "California", "city": "San Jose"}'
curl "http://10.11.12.13:9200/helloworld/people/_search" -d '
{
  "query": {
    "match_phrase": {
      "full_name": {
        "query": "Jim Bob"
      }
    }
  }
}
'
Only "Jim Bob" is returned; it seems that the fields are copied in field-name alphabetical order.
How would I switch the copy_to order such that the "Bob Jim" person would be returned?

This is more deterministically controlled by registering a transform script in your mapping.
Something like this:
"transform" : [
  {"script": "ctx._source['full_name'] = [ctx._source['first_name'] + ' ' + ctx._source['last_name'], ctx._source['last_name'] + ' ' + ctx._source['first_name']]"}
]
Also, transform scripts can be "native", i.e. Java code, made available to all nodes in the cluster by putting your custom classes on the Elasticsearch classpath and registering them as native scripts via this setting in elasticsearch.yml:
script.native.<name>.type=<fully.qualified.class.name>
in which case you'd register the native script as a transform in your mapping like so:
"transform" : [
{
"script" : "<name>",
"params" : {
"param1": "val1",
"param2": "val2"
},
"lang": "native"
}
],
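Putting the inline variant together with the question's mapping, here is a sketch (assuming an ES 1.3+ cluster where mapping transforms and Groovy scripting are available; easiest applied when creating the mapping fresh):
curl -XPUT 'http://10.11.12.13:9200/helloworld/_mapping/people' -d '
{
  "people": {
    "transform": {
      "script": "ctx._source[\"full_name\"] = [ctx._source[\"first_name\"] + \" \" + ctx._source[\"last_name\"], ctx._source[\"last_name\"] + \" \" + ctx._source[\"first_name\"]]",
      "lang": "groovy"
    },
    "properties": {
      "first_name": { "type": "string" },
      "last_name": { "type": "string" },
      "state": { "type": "string" },
      "city": { "type": "string" },
      "full_name": { "type": "string" }
    }
  }
}
'
Since full_name now holds both orderings, the match_phrase query from the question should match both documents. Note that the transform only affects what is indexed; the stored _source is returned unmodified.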

Related

ElasticSearch create an index with dynamic properties

Is it possible to create an index, restricting indexing a parent property?
For example,
$ curl -XPOST 'http://localhost:9200/actions/action/' -d '{
  "user": "kimchy",
  "message": "trying out Elasticsearch",
  "actionHistory": [
    { "timestamp": 123456789, "action": "foo" },
    { "timestamp": 123456790, "action": "bar" },
    { "timestamp": 123456791, "action": "buz" },
    ...
  ]
}'
I don't want actionHistory to be indexed at all. How can this be done?
For the above document, I believe the index would be created as:
$ curl -XPOST localhost:9200/actions -d '{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "action": {
      "properties": {
        "user": { "type": "string", "index": "analyzed" },
        "message": { "type": "string", "index": "analyzed" },
        "actionHistory": {
          "properties": {
            "timestamp": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
            },
            "action": { "type": "string", "index": "analyzed" }
          }
        }
      }
    }
  }
}'
Would removing properties from actionHistory and replacing it with "index": "no" be the proper solution?
This is an example; my actual situation involves documents with dynamic properties (i.e. actionHistory contains various custom, non-repeating properties across all documents), and my mapping definition for this particular type has over 2000 different properties, making searches extremely slow (i.e. worse than full-text search in the database).
You can probably get away with using dynamic templates: match on all actionHistory sub-fields and set "index": "no" for all of them.
PUT actions
{
  "mappings": {
    "action": {
      "dynamic_templates": [
        {
          "actionHistoryRule": {
            "path_match": "actionHistory.*",
            "mapping": {
              "type": "{dynamic_type}",
              "index": "no"
            }
          }
        }
      ]
    }
  }
}
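To sanity-check the template, index a document and pull the generated mapping back; every sub-field of actionHistory should come back with "index": "no". A sketch, with the document body taken from the question's example:
curl -XPOST 'http://localhost:9200/actions/action/' -d '{
  "user": "kimchy",
  "message": "trying out Elasticsearch",
  "actionHistory": [
    { "timestamp": 123456789, "action": "foo" }
  ]
}'

curl 'http://localhost:9200/actions/_mapping/action?pretty'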

How do I get Elasticsearch to ignore terms emptied by a char_filter?

I have a set of US street addresses that I've indexed. The source data is imperfect and sometimes fields contain junk. Specifically, I have zip5 and zip4 fields and a pattern_replace char_filter that strips any non-numeric characters. When that char_filter ends up replacing everything (yielding an empty string), matching still seems to look at that field. The same happens if the original field is just an empty string (as opposed to null). How could I set this up such that it'll just disregard fields that are empty strings (either by source or by the result of a char_filter)?
Example
First, let's create an index with a digits_only pattern replacer and an analyzer that uses it:
curl -XPUT "http://localhost:9200/address_bug" -d'
{
  "settings": {
    "index": {
      "number_of_shards": "4",
      "number_of_replicas": "1"
    },
    "analysis": {
      "char_filter": {
        "digits_only": {
          "type": "pattern_replace",
          "pattern": "([^0-9])",
          "replacement": ""
        }
      },
      "analyzer": {
        "zip": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "digits_only"
          ]
        }
      }
    }
  }
}'
Now, let's create a mapping that uses the analyzer (NB: I'm using with_positions_offsets for highlighting):
curl -XPUT "http://localhost:9200/address_bug/_mapping/address" -d'
{
  "address": {
    "properties": {
      "zip5": {
        "type": "string",
        "analyzer": "zip",
        "term_vector": "with_positions_offsets"
      },
      "zip4": {
        "type": "string",
        "analyzer": "zip",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}'
Now that our index and type are set up, let's index some imperfect data:
curl -XPUT "http://localhost:9200/address_bug/address/1234" -d'
{
  "zip5": "02144",
  "zip4": "ABCD"
}'
Alright, let's search for it and ask it to explain itself. In this case the search term is Street because in my actual application I have a single field for full address searching.
curl -XGET "http://localhost:9200/address_bug/address/_search?explain" -d'
{
  "query": {
    "match": {
      "zip4": "Street"
    }
  }
}'
And, here is the interesting part of the results:
"_explanation": {
"value": 0.30685282,
"description": "weight(zip4: in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.30685282,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 1,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
(Full response is in this gist.)
Expected Result
I wouldn't have expected any hits. If I instead index a document with "zip4" : null, it yields the expected result: no hits.
Help? Am I even taking the right approach here? In my full application, I'm using the same technique for a phone field and suspect I'd have the same issues with the results.
As @plmaheu mentioned, you can use the stop token filter to completely remove empty strings. For instance, this is a configuration that I tested and that works:
POST /myindex
{
  "settings": {
    "analysis": {
      "char_filter": {
        "digits_only": {
          "type": "pattern_replace",
          "pattern": "[^0-9]+",
          "replacement": ""
        }
      },
      "filter": {
        "remove_empty": {
          "type": "stop",
          "stopwords": [""]
        }
      },
      "analyzer": {
        "zip": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "digits_only"
          ],
          "filter": ["remove_empty"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "zip": {
          "type": "string",
          "analyzer": "zip"
        }
      }
    }
  }
}
Here the remove_empty filter removes the empty-string stopword "". If you use the analyze API on the string "abcd", you get back the response {"tokens":[]}, so no tokens will be indexed if the zip code is entirely invalid. I also tested that searching for "foo" finds no results.
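For reference, a quick way to verify the analyzer output yourself with the analyze API (using the myindex configuration above):
curl -XGET 'http://localhost:9200/myindex/_analyze?analyzer=zip' -d 'abcd'
# expected response: {"tokens":[]}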
You can use a length token filter like this:
"filter": {
"remove_empty": {
"type": "length",
"min": 1
}
}
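This drops any token shorter than one character, so it can be wired into the zip analyzer's filter chain exactly where remove_empty sits in the configuration above, with the same effect.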

Elasticsearch: what sample document would look like for this mapping?

I am new to Elasticsearch. I was given this mapping document:
{
  "book": {
    "properties": {
      "title": {
        "properties": {
          "de": {
            "type": "string",
            "fields": {
              "default": {
                "type": "string",
                "analyzer": "de_analyzer"
              }
            }
          },
          "fr": {
            "type": "string",
            "fields": {
              "default": {
                "type": "string",
                "analyzer": "fr_analyzer"
              }
            }
          }
        }
      }
    }
  }
}
I am wondering what a sample document conforming to this mapping would look like, so that I can generate the right JSON string.
Thanks and regards.
It should look something similar to:
{
  "book": {
    "title": {
      "de": "somestring",
      "fr": "somestring"
    }
  }
}
For more information read this.
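Given that mapping, the analyzed sub-fields are addressable as title.de.default and title.fr.default in queries. A sketch (the index and type names library/book are placeholders, not from the question):
curl -XGET 'http://localhost:9200/library/book/_search' -d '
{
  "query": {
    "match": {
      "title.fr.default": "somestring"
    }
  }
}'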
This is how your JSON will look. I see you are confused by:
"fr": {
"type": "string",
"fields": {
"fr": {
"type": "string",
"analyzer": "fr_analyzer"
}
}
}
Actually this is used for indexing a single field in multiple ways. Effectively, you are telling Elasticsearch that you want the same field indexed in two different ways. In your case I don't see why you are using the same names "fr" and "de" for the multi-fields. If you use the same name, ES will index it only as a single field. So, in your case, the following is equivalent:
"fr": {
"type": "string",
"analyzer": "fr_analyzer"
}
For more information go through these links:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/most-fields.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-fields.html
Hope this helps.
Well, I'm way slow here, but, as dark_shadow points out, this looks like a convoluted attempt at exposing multiple analyzers for a field (formerly known as multi-field mapping).
I've written up an example of mapping this in two ways: giving title two properties, fr and de, for which explicit strings need to be provided in each document; and having just a single title property with fr and de analyzer 'fields'. Note that I replaced the named analyzers with "standard" to keep from having to load those analyzers locally.
The difference between these is apparent. In example one, you can provide the translated titles on each book, with the index/search analyzer appropriate to the language being used on the respective properties.
In example two, you provide a single title string, which will be indexed using both analyzers, and you can then choose the analyzer to search with in your query.
#!/bin/sh
echo "--- delete index"
curl -X DELETE 'http://localhost:9200/so_multi_map/'
echo "--- create index and put mapping into place"
curl -XPUT http://localhost:9200/so_multi_map/?pretty -d '{
  "mappings": {
    "example_one": {
      "properties": {
        "title": {
          "properties": {
            "de": {
              "type": "string",
              "analyzer": "standard"
            },
            "fr": {
              "type": "string",
              "analyzer": "standard"
            }
          }
        }
      }
    },
    "example_two": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type": "string",
              "analyzer": "standard"
            },
            "fr": {
              "type": "string",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
echo "--- index a book by example_one by POSTing"
curl -XPOST http://localhost:9200/so_multi_map/example_one -d '{
  "title": {
    "de": "Auf der Suche nach der verlorenen Zeit",
    "fr": "À la recherche du temps perdu"
  }
}'
echo "--- index a book by example_two by POSTing"
curl -XPOST http://localhost:9200/so_multi_map/example_two -d '{
  "title": "À la recherche du temps perdu"
}'
echo "\n --- flush indices"
curl -XPOST 'http://localhost:9200/_flush'
echo "--- show the books"
curl -XGET http://localhost:9200/so_multi_map/_search?pretty -d '
{
  "query": {
    "match_all": {}
  }
}'
echo "--- how to query just the fr title in example one:"
curl -XGET http://localhost:9200/so_multi_map/example_one/_search?pretty -d '
{
  "query": {
    "match": {
      "title.fr": "recherche"
    }
  }
}'
echo "--- how to query just the fr title in example two:"
curl -XGET http://localhost:9200/so_multi_map/example_two/_search?pretty -d '
{
  "query": {
    "match": {
      "title.fr": "recherche"
    }
  }
}'
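As a footnote to example two: instead of addressing the sub-field, you can keep querying the top-level title and pick the analyzer explicitly in the match query, e.g.:
curl -XGET http://localhost:9200/so_multi_map/example_two/_search?pretty -d '
{
  "query": {
    "match": {
      "title": {
        "query": "recherche",
        "analyzer": "standard"
      }
    }
  }
}'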

How to add subdocument to an ElasticSearch index

In ElasticSearch, given the following document, is it possible to add items to the "lists" sub-document without passing the parent attributes (i.e. message and tags)?
I have several attributes in the parent document which I don't want to pass every time I want to add one item to the sub-document.
{
  "tweet": {
    "message": "some arrays in this tweet...",
    "tags": ["elasticsearch", "wow"],
    "lists": [
      {
        "name": "prog_list",
        "description": "programming list"
      },
      {
        "name": "cool_list",
        "description": "cool stuff list"
      }
    ]
  }
}
What you are looking for is how to insert nested documents.
In your case, you can use the Update API to append a nested document to your list.
curl -XPOST localhost:9200/index/tweets/1/_update -d '{
  "script": "ctx._source.tweet.lists += new_list",
  "params": {
    "new_list": {"name": "fun_list", "description": "funny list"}
  }
}'
To support nested documents, you have to define your mapping, which is described here.
Assuming your type is tweets, the following mapping should work:
curl -XDELETE http://localhost:9200/index
curl -XPUT http://localhost:9200/index -d'
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  },
  "mappings": {
    "tweets": {
      "properties": {
        "tweet": {
          "properties": {
            "lists": {
              "type": "nested",
              "properties": {
                "name": {
                  "type": "string"
                },
                "description": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  }
}'
Then add a first entry:
curl -XPOST http://localhost:9200/index/tweets/1 -d '
{
  "tweet": {
    "message": "some arrays in this tweet...",
    "tags": [
      "elasticsearch",
      "wow"
    ],
    "lists": [
      {
        "name": "prog_list",
        "description": "programming list"
      },
      {
        "name": "cool_list",
        "description": "cool stuff list"
      }
    ]
  }
}'
And then add your element with:
curl -XPOST http://localhost:9200/index/tweets/1/_update -d '
{
  "script": "ctx._source.tweet.lists += new_list",
  "params": {
    "new_list": {
      "name": "fun_list",
      "description": "funny list"
    }
  }
}'
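To confirm the append worked, fetch the document back; its _source should now contain three entries under tweet.lists. (Note that the += script above relies on dynamic scripting being enabled on the cluster.)
curl -XGET http://localhost:9200/index/tweets/1?pretty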

Elasticsearch - index only part of the object

Is it possible to index only some part of the object in Elasticsearch?
Example:
$ curl -XPUT 'http://localhost:9200/test/item/1' -d '
{
  "record": {
    "city": "London",
    "contact": "Some person name"
  }
}'
$ curl -XPUT 'http://localhost:9200/test/item/2' -d '
{
  "record": {
    "city": "London",
    "contact": { "phone": "some-phone-number", "name": "Other person name" }
  }
}'
$ curl -XPUT 'http://localhost:9200/test/item/3' -d '
{
  "record": {
    "city": "Oslo",
    "headquarters": {
      "phone": "some-other-phone-number",
      "address": "some address"
    }
  }
}'
I want only the city name to be searchable; the remaining part of the object should be left unindexed and completely arbitrary. For example, some fields can change their type from object to object.
Is it possible to write a mapping that allows such behaviour?
UPDATE
My final solution looks like this:
{
  "test": {
    "dynamic": "false",
    "properties": {
      "name": {
        "type": "string"
      }
    }
  }
}
I added "dynamic": "false" at the lowest level of my mapping and it works as expected.
You can achieve this by disabling dynamic mapping on the entire type or just the inner object record:
"mappings": {
"doc": {
"properties": {
"record": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"dynamic": false
}
}
}
}
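A quick way to see the effect, reusing the question's documents (with "dynamic": false, fields like contact and headquarters are kept in _source but never mapped or indexed):
curl -XPUT 'http://localhost:9200/test' -d '
{
  "mappings": {
    "doc": {
      "properties": {
        "record": {
          "type": "object",
          "properties": {
            "city": {"type": "string"}
          },
          "dynamic": false
        }
      }
    }
  }
}'
curl -XPUT 'http://localhost:9200/test/doc/1?refresh=true' -d '
{"record": {"city": "London", "contact": "Some person name"}}'

# Searching on the mapped field finds the document...
curl 'http://localhost:9200/test/doc/_search' -d '
{"query": {"match": {"record.city": "london"}}}'

# ...while searching on the unmapped contact field matches nothing.
curl 'http://localhost:9200/test/doc/_search' -d '
{"query": {"match": {"record.contact": "person"}}}'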
