How to create an Elasticsearch index without word-splitting? - elasticsearch

I created my ES index as follows (MongoDB via River) :
curl -XPUT 'http://localhost:9200/_river/mongodb/_meta' -d '{
"type": "mongodb",
"mongodb": {
"db": "testmongo",
"collection": "person"
},
"index": {
"name": "mongoindex",
"type": "my_type"
}
}'
I have some entries in my MongoDB which look like "MyString.70" oder "Test-133" and when ES is indexing those entries, it's always splitting them into "MyString", "70", "Test", "133".
How can I deactivate that?

By indexing make use of for example
...
"mongodb": {
"db": "testmongo",
"collection": "person",
"index": "not_analyzed"
},
...

Related

How do I alter the schema without destroying data in elasticsearch?

This is my current schema
{
"mappings": {
"historical_data": {
"properties": {
"continent": {
"type": "string",
"index": "not_analyzed"
},
"country": {
"type": "string",
"index": "not_analyzed"
},
"description": {
"type": "string"
},
"funding": {
"type": "long"
},
"year": {
"type": "integer"
},
"agency": {
"type": "string"
},
"misc": {
"type": "string"
},
"university": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
I have 700k records uploaded. Without destroying the data, how can I make the university index not "not_analysed" such that the change reflects in my existing data?
The mapping for an existing field cannot be modified.
However you can achieve the desired outcome in two ways .
Create another field. Adding fields is free using put _mapping API
curl -XPUT localhost:9200/YOUR_INDEX/_mapping -d '{
"properties": {
"new_university": {
"type": "string"
}
}
}'
Use multi-fields, add a sub-field to your not_analyzed field.
curl -XPUT localhost:9200/YOUR_INDEX/_mapping -d '{
"properties": {
"university": {
"type": "string",
"index": "not_analyzed",
"fields": {
"university_analyzed": {
"type": "string" // <-- ANALYZED sub field
}
}
}
}
}'
In both the case, you need to reindex in order to populate the new field. Use _reindex API
curl -XPUT localhost:9200/_reindex -d '{
"source": {
"index": "YOUR_INDEX"
},
"dest": {
"index": "YOUR_INDEX"
},
"script": {
"inline": "ctx._source.university = ctx._source.university"
}
}'
You are not exactly forced to "destroy" your data, what you can do is reindex your data as described in this article (I'm not gonna rip off the examples as they are particularly clear in the section Reindexing your data with zero downtime).
For reindexing, you can also take a look at the reindexing API, the simplest way being:
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
Of course it will take some resources to perform this operation, so I would suggest that you take a complete look at the changes you want to introduce in your mapping, and perform the operation when you have the least amount of activity on your servers (e.g. during the weekend, or at night...)

ElasticSearch create an index with dynamic properties

Is it possible to create an index, restricting indexing a parent property?
For example,
$ curl -XPOST 'http://localhost:9200/actions/action/' -d '{
"user": "kimchy",
"message": "trying out Elasticsearch",
"actionHistory": [
{ "timestamp": 123456789, "action": "foo" },
{ "timestamp": 123456790, "action": "bar" },
{ "timestamp": 123456791, "action": "buz" },
...
]
}'
I don't want actionHistory to be indexed at all. How can this be done?
For the above document, I believe the index would be created as
$ curl -XPOST localhost:9200/actions -d '{
"settings": {
"number_of_shards": 1
},
"mappings": {
"action": {
"properties" : {
"user": { "type": "string", "index" : "analyzed" },
"message": { "type": "string": "index": "analyzed" },
"actionHistory": {
"properties": {
"timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"action": { "type": "string", "index": "analyzed" }
}
}
}
}
}
}'
Would removing properties from actionHistory and replace it with "index": "no" be the proper solution?
This is an example, however my actual situation are documents with dynamic properties (i.e. actionHistory contains various custom, non-repeating properties across all documents) and my mapping definition for this particular type has over 2000 different properties, making searches extremely slow (i.e. worst than full text search from the database).
You can probably get away by using dynamic templates, match on all actionHistory sub-fields and set "index": "no" for all of them.
PUT actions
{
"mappings": {
"action": {
"dynamic_templates": [
{
"actionHistoryRule": {
"path_match": "actionHistory.*",
"mapping": {
"type": "{dynamic_type}",
"index": "no"
}
}
}
]
}
}
}

ElasticSearch - Upgraded indexed field to a Multi Field - new field is empty

Having noticed my sort on an indexed string field doesn't work properly, I've discovered that it's sorting analyzed strings so "bags of words" and if I want it to work properly I have to sort on the non-analyzed string. My plan was to just change the string field to a multi-field, using information I found in those two articles:
https://www.elastic.co/blog/changing-mapping-with-zero-downtime (Upgrade to a multi-field part)
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html
Using Sense I've created this field mapping
PUT myindex/_mapping/type
{
"properties": {
"Title": {
"type": "string",
"fields": {
"Raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
And then I try to sort my search results using the newly made field. I've put all of the name variations I could think of after reading the articles:
POST myindex/_search
{
"_source" : ["Title","titlemap.Raw","titlemap.Title","titlemap.Title.Raw","Title.Title","Raw","Title.Raw"],
"size": 6,
"query": {
"multi_match": {
"query": "title",
"fields": ["Title^5"
],
"fuzziness": "auto",
"type": "best_fields"
}
},
"sort": {
"Title.Raw": "asc"
}
}
And that's what I get in response:
{
"_index": "myindex_2015_11_26_12_22_38",
"_type": "type",
"_id": "1205",
"_score": null,
"_source": {
"Title": "The title of the item"
},
"sort": [
null
]
}
Only the Title field's value is shown in the response and the sort criterium is null for every result.
Am I doing something wrong, or is there another way to do that?
The index name is not the same after re-indexing and thus the default mapping gets installed... that's probably why.
I suggest using an index template instead, so you don't have to care when to create the index and ES will do it for you. The idea is to create a template with the proper mapping you need and then ES will create every new index whenever it deems necessary, add the myindex alias and apply the proper mapping to it.
curl -XPUT localhost:9200/_template/myindex_template -d '{
"template": "myindex_*",
"settings": {
"number_of_shards": 1
},
"aliases": {
"myindex": {}
},
"mappings": {
"type": {
"properties": {
"Title": {
"type": "string",
"fields": {
"Raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
Then whenever you launch your re-indexing process a new index with a new name will be created BUT with the proper mapping and the proper alias.

How to specify ElasticSearch copy_to order?

ElasticSearch has the ability to copy values to other fields (at index time), enabling you to search on multiple fields as if it were one field (Core Types: copy_to).
However, there doesn't seem to be any way to specify the order in which these values should be copied. This could be important when phrase matching:
curl -XDELETE 'http://10.11.12.13:9200/helloworld'
curl -XPUT 'http://10.11.12.13:9200/helloworld'
# copy_to is ordered alphabetically!
curl -XPUT 'http://10.11.12.13:9200/helloworld/_mapping/people' -d '
{
"people": {
"properties": {
"last_name": {
"type": "string",
"copy_to": "full_name"
},
"first_name": {
"type": "string",
"copy_to": "full_name"
},
"state": {
"type": "string"
},
"city": {
"type": "string"
},
"full_name": {
"type": "string"
}
}
}
}
'
curl -X POST "10.11.12.13:9200/helloworld/people/dork" -d '{"first_name": "Jim", "last_name": "Bob", "state": "California", "city": "San Jose"}'
curl -X POST "10.11.12.13:9200/helloworld/people/face" -d '{"first_name": "Bob", "last_name": "Jim", "state": "California", "city": "San Jose"}'
curl "http://10.11.12.13:9200/helloworld/people/_search" -d '
{
"query": {
"match_phrase": {
"full_name": {
"query": "Jim Bob"
}
}
}
}
'
Only "Jim Bob" is returned; it seems that the fields are copied in field-name alphabetical order.
How would I switch the copy_to order such that the "Bob Jim" person would be returned?
This is more deterministically controlled by registering a transform script in your mapping.
something like this:
"transform" : [
{"script": "ctx._source['full_name'] = [ctx._source['first_name'] + " " + ctx._source['last_name'], ctx._source['last_name'] + " " + ctx._source['first_name']]"}
]
Also, transform scripts can be "native", i.e. java code, made available to all nodes in the cluster by making your custom classes available in the elasticsearch classpath and registered as native scripts by the settings:
script.native.<name>.type=<fully.qualified.class.name>
in which case in your mapping you'd register the native script as a transform like so:
"transform" : [
{
"script" : "<name>",
"params" : {
"param1": "val1",
"param2": "val2"
},
"lang": "native"
}
],

Elasticsearch: _id based on document field?

I am new to Elasticsearch. I have difficulty in using a field of the document for _id. Here is my mapping:
{
"product": {
"_id": {
"path": "id"
},
"properties": {
"id": {
"type": "long",
"index": "not_analyzed",
"store": "yes"
},
"title": {
"type": "string",
"analyzer": "snowball",
"store": "no",
"index": "not_analyzed"
}
}
}
}
Here is a sample document:
{
"id": 1,
"title": "All Quiet on the Western Front"
}
When indexing this document, I got something like:
{
"_index": "myindex",
"_type": "book",
"_id": "PZQu4rocRy60hO2seUEziQ",
"_version": 1,
"created": true
}
Did I do anything wrong? How should this work?
EDIT: _id.path was deprecated in v1.5 and removed in v2.0.
EDIT 2: on versions where this is supported, there is a performance penalty in that the coordinating node is forced to parse all requests (including bulk) in order to determine the correct primary shard for each document.
Provide an _id.path in your mapping, as described here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-id-field.html
Here is a full, working demonstration:
#!/bin/sh
echo "--- delete index"
curl -X DELETE 'http://localhost:9200/so_id_from_field/'
echo "--- create index and put mapping into place"
curl -XPUT http://localhost:9200/so_id_from_field/?pretty=true -d '{
"mappings": {
"tweet" : {
"_id" : {
"path" : "post_id"
},
"properties": {
"post_id": {
"type": "string"
},
"nickname": {
"type": "string"
}
}
}
},
"settings" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}'
echo "--- index some tweets by POSTing"
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
"post_id": "1305668",
"nickname": "Uncle of the month club"
}'
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
"post_id": "blarger",
"nickname": "Hurry up and spend my money"
}'
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
"post_id": "9",
"nickname": "Who is the guy with the shoe hat?"
}'
echo "--- get the tweets"
curl -XGET http://localhost:9200/so_id_from_field/tweet/1305668?pretty=true
curl -XGET http://localhost:9200/so_id_from_field/tweet/blarger?pretty=true
curl -XGET http://localhost:9200/so_id_from_field/tweet/9?pretty=true

Resources