Elasticsearch: _id based on document field?

I am new to Elasticsearch. I am having difficulty using a field of the document as the _id. Here is my mapping:
{
"product": {
"_id": {
"path": "id"
},
"properties": {
"id": {
"type": "long",
"index": "not_analyzed",
"store": "yes"
},
"title": {
"type": "string",
"analyzer": "snowball",
"store": "no",
"index": "not_analyzed"
}
}
}
}
Here is a sample document:
{
"id": 1,
"title": "All Quiet on the Western Front"
}
When indexing this document, I got something like:
{
"_index": "myindex",
"_type": "book",
"_id": "PZQu4rocRy60hO2seUEziQ",
"_version": 1,
"created": true
}
Did I do anything wrong? How should this work?

EDIT: _id.path was deprecated in v1.5 and removed in v2.0.
EDIT 2: on versions where this is supported, there is a performance penalty in that the coordinating node is forced to parse all requests (including bulk) in order to determine the correct primary shard for each document.
Provide an _id.path in your mapping, as described here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-id-field.html
Here is a full, working demonstration:
#!/bin/sh
echo "--- delete index"
curl -X DELETE 'http://localhost:9200/so_id_from_field/'
echo "--- create index and put mapping into place"
curl -XPUT http://localhost:9200/so_id_from_field/?pretty=true -d '{
"mappings": {
"tweet" : {
"_id" : {
"path" : "post_id"
},
"properties": {
"post_id": {
"type": "string"
},
"nickname": {
"type": "string"
}
}
}
},
"settings" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}'
echo "--- index some tweets by POSTing"
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
"post_id": "1305668",
"nickname": "Uncle of the month club"
}'
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
"post_id": "blarger",
"nickname": "Hurry up and spend my money"
}'
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
"post_id": "9",
"nickname": "Who is the guy with the shoe hat?"
}'
echo "--- get the tweets"
curl -XGET http://localhost:9200/so_id_from_field/tweet/1305668?pretty=true
curl -XGET http://localhost:9200/so_id_from_field/tweet/blarger?pretty=true
curl -XGET http://localhost:9200/so_id_from_field/tweet/9?pretty=true
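On Elasticsearch 2.0 and later, where _id.path has been removed, the client has to supply the document id itself. A minimal sketch, reusing the index and type names from the question:
curl -XPUT 'http://localhost:9200/myindex/book/1' -d '{
"id": 1,
"title": "All Quiet on the Western Front"
}'
Supplying the id in the URL also avoids the parsing overhead mentioned in EDIT 2, since the coordinating node can route on the id without inspecting the document body.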

Related

Elasticsearch not using synonyms from synonym file

I am new to Elasticsearch, so before downvoting or marking as duplicate, please read the question first.
I am testing synonyms in Elasticsearch (v 2.4.6), which I have installed on Ubuntu 16.04. I am supplying synonyms through a file named synonym.txt, which I have placed in the config directory. I have created an index synonym_test as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
The index contains two fields: id and some_text. I configured the field some_text with the custom analyzer as follows:
curl -XPUT localhost:9200/synonym_test/rulers/_mapping -d '{
"properties": {
"id": {
"type": "double"
},
"some_text": {
"type": "string",
"search_analyzer": "my_synonyms"
}
}
}'
Then I inserted some data:
curl -XPUT localhost:9200/synonym_test/external/5 -d '{
"id" : "5",
"some_text":"apple is a fruit"
}'
curl -XPUT localhost:9200/synonym_test/external/7 -d '{
"id" : "7",
"some_text":"english is spoken in england"
}'
curl -XPUT localhost:9200/synonym_test/external/8 -d '{
"id" : "8",
"some_text":"Scotland Yard is a popular game."
}'
curl -XPUT localhost:9200/synonym_test/external/9 -d '{
"id" : "9",
"some_text":"bananas contain potassium"
}'
The synonym.txt file contains the following:
"britain,england,scotland"
"fruit,bananas"
After doing all this, when I run a query for the term fruit (which should also return the text containing bananas, since they are synonyms in the file), I get only the text containing fruit.
{
"took":117,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":1,
"max_score":0.8465736,
"hits":[
{
"_index":"synonym_test",
"_type":"external",
"_id":"5",
"_score":0.8465736,
"_source":{
"id":"5",
"some_text":"apple is a fruit"
}
}
]
}
}
I have also tried the following links, but none seem to have helped me:
Synonym analyzer not working ,
Elasticsearch synonym analyzer not working , How to apply synonyms at query time instead of index time in Elasticsearch , how to configure the synonyms_path in elasticsearch and many other links.
So, can anyone please tell me if I am doing anything wrong? Is there anything wrong with the settings or synonym file? I want the synonyms to work (query time) so that when I search for a term, I get all documents related to that term.
Please refer to the following url: Custom Analyzer, on how custom analyzers should be configured.
If we follow the guidance from that documentation, the settings will be as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"type": "custom"
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
This configuration currently works on my Elasticsearch instance.
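As a quick way to verify that the synonym filter is actually being applied (a minimal sketch, assuming the index and analyzer names from above), you can run the analyzer directly:
curl -XGET 'localhost:9200/synonym_test/_analyze?analyzer=my_synonyms&text=fruit&pretty'
If the synonym file is being read, the token list should include bananas as well as fruit; if it only shows fruit, the filter or synonyms_path is not being picked up.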

How to specify ElasticSearch copy_to order?

ElasticSearch has the ability to copy values to other fields (at index time), enabling you to search on multiple fields as if it were one field (Core Types: copy_to).
However, there doesn't seem to be any way to specify the order in which these values should be copied. This could be important when phrase matching:
curl -XDELETE 'http://10.11.12.13:9200/helloworld'
curl -XPUT 'http://10.11.12.13:9200/helloworld'
# copy_to is ordered alphabetically!
curl -XPUT 'http://10.11.12.13:9200/helloworld/_mapping/people' -d '
{
"people": {
"properties": {
"last_name": {
"type": "string",
"copy_to": "full_name"
},
"first_name": {
"type": "string",
"copy_to": "full_name"
},
"state": {
"type": "string"
},
"city": {
"type": "string"
},
"full_name": {
"type": "string"
}
}
}
}
'
curl -X POST "10.11.12.13:9200/helloworld/people/dork" -d '{"first_name": "Jim", "last_name": "Bob", "state": "California", "city": "San Jose"}'
curl -X POST "10.11.12.13:9200/helloworld/people/face" -d '{"first_name": "Bob", "last_name": "Jim", "state": "California", "city": "San Jose"}'
curl "http://10.11.12.13:9200/helloworld/people/_search" -d '
{
"query": {
"match_phrase": {
"full_name": {
"query": "Jim Bob"
}
}
}
}
'
Only "Jim Bob" is returned; it seems that the fields are copied in field-name alphabetical order.
How would I switch the copy_to order such that the "Bob Jim" person would be returned?
This is controlled more deterministically by registering a transform script in your mapping.
Something like this:
"transform" : [
{"script": "ctx._source['full_name'] = [ctx._source['first_name'] + " " + ctx._source['last_name'], ctx._source['last_name'] + " " + ctx._source['first_name']]"}
]
Also, transform scripts can be "native", i.e. Java code, made available to all nodes in the cluster by putting your custom classes on the Elasticsearch classpath and registering them as native scripts with the setting:
script.native.<name>.type=<fully.qualified.class.name>
in which case in your mapping you'd register the native script as a transform like so:
"transform" : [
{
"script" : "<name>",
"params" : {
"param1": "val1",
"param2": "val2"
},
"lang": "native"
}
],
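For context, here is a minimal sketch (using the field names from the question; not part of the original answer) of where the transform block sits in the mapping, alongside properties at the type level:
"mappings": {
"people": {
"transform": [
{"script": "ctx._source['full_name'] = [ctx._source['first_name'] + ' ' + ctx._source['last_name'], ctx._source['last_name'] + ' ' + ctx._source['first_name']]"}
],
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"full_name": { "type": "string" }
}
}
}
Because full_name is then generated at index time as an array holding both orderings, a match_phrase query for either "Jim Bob" or "Bob Jim" can match, and copy_to is no longer needed for that field.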

How to create an Elasticsearch index without word-splitting?

I created my ES index as follows (MongoDB via River):
curl -XPUT 'http://localhost:9200/_river/mongodb/_meta' -d '{
"type": "mongodb",
"mongodb": {
"db": "testmongo",
"collection": "person"
},
"index": {
"name": "mongoindex",
"type": "my_type"
}
}'
I have some entries in my MongoDB which look like "MyString.70" or "Test-133", and when ES indexes those entries, it always splits them into "MyString", "70", "Test", "133".
How can I deactivate that?
When indexing, make use of, for example:
...
"mongodb": {
"db": "testmongo",
"collection": "person",
"index": "not_analyzed"
},
...
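Beyond the river snippet above, a more conventional way to stop values like "MyString.70" from being split is to map the field as not_analyzed before the river starts indexing. A minimal sketch, assuming the index and type names from the river config and a hypothetical field name myfield:
curl -XPUT 'http://localhost:9200/mongoindex' -d '{
"mappings": {
"my_type": {
"properties": {
"myfield": { "type": "string", "index": "not_analyzed" }
}
}
}
}'
With that mapping in place, each value is indexed as a single term, so "MyString.70" stays intact instead of being broken into "MyString" and "70".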

Treatment of special characters in elasticsearch

I use the following analyzer:
curl -XPUT 'http://localhost:9200/sample/' -d '
{
"settings" : {
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["trim", "lowercase"]}
}
}
}
}
}'
Then when I try to insert some documents which contain special characters like %, they get converted to hex.
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8 -> actual value
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8 -> stored value
Sample:
curl -XPUT 'http://localhost:9200/sample/strom/1' -d '{
"user" : "user1",
"message" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'
The problem started occurring only once the data crossed a few million documents. Earlier it used to store it as is.
Now if I try to search using,
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8
it is not able to retrieve the document. How do I deal with this? The behavior seems to be non-deterministic in converting special characters to hex.
I am unable to replicate the same issue on my local machine.
Can someone explain the mistake I am making?
That is not how the document is tokenized on my end with that analyzer:
curl -XGET localhost:9200/_analyze?tokenizer=keyword\&filters=trim,lowercase\&pretty -d '1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8'
{
"tokens" : [ {
"token" : "1%2fpjjp3jv2c24idfeu9xphbayxxh%2fdhtbmchb35sdznxo2g8vz4d7gtivy54imix_149c95f02a8",
"start_offset" : 0,
"end_offset" : 80,
"type" : "word",
"position" : 1
} ]
}
Reading the analyzer output above, your example text is converted into a single, lowercase-but-otherwise-identical token given the analyzer shown. Are you sure there is no character filter at play? That's the kind of thing that could be doing the encoding.
You should be able to run it as:
curl -XGET 'localhost:9200/sample/_analyze?field=message' -d 'text to analyze'
Since it was not reproducing with the analyzer directly, I tried to reproduce this on my end by creating an index to test it:
curl -XPUT localhost:9200/indexed-analysis -d '
{
"settings": {
"number_of_shards" : 1,
"number_of_replicas" : 0,
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["trim", "lowercase"]
}
}
}
}
},
"mappings": {
"indexed" : {
"properties": {
"text" : { "type" : "string" }
}
}
}
}'
curl -XPUT localhost:9200/indexed-analysis/indexed/1 -d '{
"text" :
"1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'
curl -XGET localhost:9200/indexed-analysis/indexed/1?pretty
This produced the correct, identical result:
{
"_index" : "indexed-analysis",
"_type" : "indexed",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source":{
"text" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}
}
So, I tried _searching for it, and I found it appropriately.
curl -XGET localhost:9200/indexed-analysis/_search -d '{
"query": {
"match": {
"text": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}
}
}'
Result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.30685282,
"hits": [
{
"_index": "indexed-analysis",
"_type": "indexed",
"_id": "1",
"_score": 0.30685282,
"_source": {
"text": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}
}
]
}
}
All of this leads back to three possibilities:
Your search analyzer is different from your index analyzer. This is almost always going to produce unexpected results.
Using default should force it to be used for both reading and writing, but you can/should verify that is actually being used (as opposed to default_index or default_search):
curl -XGET localhost:9200/sample/_settings
curl -XGET localhost:9200/sample/_mapping
If you see analyzers being configured in the mapping for the message field, then that should probably be a red flag.
You have a character filter messing with the indexed string (and it's probably not doing the same thing for your search string, thus pointing back to #1); see the sketch after this list.
There is a bug in the version of Elasticsearch that you are using (hopefully not, but you never know). All of the tests above were done against version 1.3.2.
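To illustrate possibility #2, here is a purely hypothetical sketch (not taken from your settings) of a mapping character filter that would percent-encode slashes in the indexed tokens while leaving search strings alone, producing exactly this kind of index/search mismatch:
curl -XPUT 'localhost:9200/charfilter-demo' -d '{
"settings": {
"analysis": {
"char_filter": {
"percent_encode_slash": {
"type": "mapping",
"mappings": ["/ => %2f"]
}
},
"analyzer": {
"default_index": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": ["percent_encode_slash"],
"filter": ["trim", "lowercase"]
}
}
}
}
}'
Here the char filter is attached only to default_index, so documents are indexed with %2f while queries are analyzed without it, and exact-match searches stop finding the documents.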

How to add subdocument to an ElasticSearch index

In ElasticSearch, given the following document, is it possible to add items to the "lists" sub-document without passing the parent attributes (i.e. message and tags)?
I have several attributes in the parent document which I don't want to pass every time I add one item to the sub-document.
{
"tweet" : {
"message" : "some arrays in this tweet...",
"tags" : ["elasticsearch", "wow"],
"lists" : [
{
"name" : "prog_list",
"description" : "programming list"
},
{
"name" : "cool_list",
"description" : "cool stuff list"
}
]
}
}
What you are looking for is how to insert nested documents.
In your case, you can use the Update API to append a nested document to your list.
curl -XPOST localhost:9200/index/tweets/1/_update -d '{
"script" : "ctx._source.tweet.lists += new_list",
"params" : {
"new_list" : {"name": "fun_list", "description": "funny list" }
}
}'
To support nested documents, you have to define your mapping, which is described here.
Assuming your type is tweets, the following mapping should work:
curl -XDELETE http://localhost:9200/index
curl -XPUT http://localhost:9200/index -d'
{
"settings": {
"index.number_of_shards": 1,
"index.number_of_replicas": 0
},
"mappings": {
"tweets": {
"properties": {
"tweet": {
"properties": {
"lists": {
"type": "nested",
"properties": {
"name": {
"type": "string"
},
"description": {
"type": "string"
}
}
}
}
}
}
}
}
}'
Then add a first entry:
curl -XPOST http://localhost:9200/index/tweets/1 -d '
{
"tweet": {
"message": "some arrays in this tweet...",
"tags": [
"elasticsearch",
"wow"
],
"lists": [
{
"name": "prog_list",
"description": "programming list"
},
{
"name": "cool_list",
"description": "cool stuff list"
}
]
}
}'
And then add your element with:
curl -XPOST http://localhost:9200/index/tweets/1/_update -d '
{
"script": "ctx._source.tweet.lists += new_list",
"params": {
"new_list": {
"name": "fun_list",
"description": "funny list"
}
}
}'
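To confirm the update was applied (reusing the same index, type, and id as above), fetch the document back; the lists array should now contain the fun_list entry alongside the original two:
curl -XGET 'http://localhost:9200/index/tweets/1?pretty'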
