Issues mapping a document with ElasticSearch - elasticsearch

I have a document that I was hoping to store in ElasticSearch and run queries against, but I think the document structure is badly formed, and as such I won't be able to query it effectively.
The document is trying to be generic and as such, has a set of repeating structures.
For example:
description : [
{ type : "port", value : 1234 },
{ type : "ipaddress", value : "192.168.0.1" },
{ type : "path", value : "/app/index.jsp app/hello.jsp" },
{ type : "upsince", value : "2014-01-01 12:00:00" },
{ type : "location", value : "-40, 70" }
]
Note: I've simplified the example; in the real document the repeating structure has about 7 fields, 3 of which explicitly identify the "type".
From the above example I can't see how to write a mapping, as the "value" could be any of:
Integer
IP Address
A field that needs to be tokenized by only whitespace
A datetime
A GEO Point
The only solution I can see is that the document needs to be converted into another format that would map more easily to ElasticSearch?

This case is somewhat described here: http://www.found.no/foundation/beginner-troubleshooting/#keyvalue-woes
You can't have different kinds of values in the same field. What you can do is to have different fields like location_value, timestamp_value, and so on.
Here's a runnable example: https://www.found.no/play/gist/ad90fb9e5210d4aba0ee
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create indexes
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
"mappings": {
"type": {
"properties": {
"description": {
"type": "nested",
"properties": {
"integer_value": {
"type": "integer"
},
"type": {
"type": "string",
"index": "not_analyzed"
},
"timestamp_value": {
"type": "date"
}
}
}
}
}
}
}'
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"description":[{"type":"port","integer_value":1234},{"type":"upsince","timestamp_value":"2014-01-01T12:00:00"}]}
'
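The renaming of "value" into one field per kind of value could be done client-side before indexing. The following is a minimal sketch of that idea, not the answerer's code: integer_value and timestamp_value come from the mapping above and location_value from the text, while ip_value, string_value and the helper name typed_entry are hypothetical and would each need their own entry in the mapping. It also assumes the location string is "lat, lon".

```python
from datetime import datetime
import ipaddress


def typed_entry(entry):
    """Rename the generic "value" key to a key that encodes the kind of value."""
    kind, value = entry["type"], entry["value"]
    if kind == "port":
        return {"type": kind, "integer_value": int(value)}
    if kind == "ipaddress":
        ipaddress.ip_address(value)  # raises ValueError if the address is malformed
        return {"type": kind, "ip_value": value}  # hypothetical field name
    if kind == "upsince":
        ts = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return {"type": kind, "timestamp_value": ts.isoformat()}
    if kind == "location":
        # Assumption: the string is "lat, lon".
        lat, lon = (float(part) for part in value.split(","))
        return {"type": kind, "location_value": {"lat": lat, "lon": lon}}
    # Fallback for anything else, e.g. "path" (hypothetical field name).
    return {"type": kind, "string_value": str(value)}


description = [
    {"type": "port", "value": 1234},
    {"type": "ipaddress", "value": "192.168.0.1"},
    {"type": "path", "value": "/app/index.jsp app/hello.jsp"},
    {"type": "upsince", "value": "2014-01-01 12:00:00"},
    {"type": "location", "value": "-40, 70"},
]
converted = [typed_entry(e) for e in description]
```

Each converted entry then carries exactly one typed field, so every field in the nested mapping only ever sees one kind of value.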

You're going to save yourself a lot of headaches if you first convert the documents to something like this:
{
"port": 1234,
"ipaddress" : "192.168.0.1" ,
"path" : "/app/index.jsp app/hello.jsp",
"upsince" : "2014-01-01 12:00:00",
"location" : "-40, 70"
}
Elasticsearch is designed to be flexible when it comes to fields and values, so it can already deal with pretty much any key/value combination you throw at it.
Optionally you can include the original document in a field that's explicitly stored but not indexed, in case you need the original document returned in your queries.
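As a rough sketch of that conversion, assuming each "type" appears at most once per document (a repeated type would overwrite the earlier value here). The function names and the "original" field name are hypothetical, and the mapping that marks that field as stored but not indexed is not shown.

```python
import json


def flatten(description):
    """Collapse [{type, value}, ...] into a single {type: value} object."""
    flat = {}
    for entry in description:
        flat[entry["type"]] = entry["value"]
    return flat


def to_indexable(description):
    """Flatten the entries and keep the original structure alongside them."""
    doc = flatten(description)
    # "original" is a hypothetical field name for the untouched input.
    doc["original"] = json.dumps(description)
    return doc


description = [
    {"type": "port", "value": 1234},
    {"type": "ipaddress", "value": "192.168.0.1"},
    {"type": "upsince", "value": "2014-01-01 12:00:00"},
]
flat_doc = flatten(description)
```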

Related

Elasticsearch: match exact keywords with special characters

I am storing tags as an array of keywords:
...
Tags: {
type: "keyword"
},
...
Resulting in arrays like this:
Tags: [
"windows",
"opengl",
"unicode",
"c++",
"c",
"cross-platform",
"makefile",
"emacs"
]
I thought that as I am using the keyword type I could easily do exact search terms, as it is not supposed to be using any analyser.
Apparently I was wrong! This gives me results:
body.query.bool.must.push({term: {"_all": "c"}}); // 38 results
But this doesn't:
body.query.bool.must.push({term: {"_all": "c++"}}); // 0 results
Although there are obviously instances of this tag, as seen above.
If I use body.query.bool.must.push({match: {"_all": search}}); instead (using match instead of term) then "c" and "c++" return the exact same results, which is wrong as well.
The problem here is that you are querying the _all field, which uses an analyzer (standard by default). Run a small test against your data to be sure:
Test 1:
curl -X POST http://127.0.0.1:9200/script/test/_search \
-d '{
"query": {
"term" : { "_all": "c++"}
}
}'
Test 2:
curl -X POST http://127.0.0.1:9200/script/test/_search \
-d '{
"query": {
"term" : { "tags": "c++"}
}
}'
In my test, the second query returns documents; the first does not.
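The difference between the two tests can be modelled locally. This is only a rough approximation: the regex below is a stand-in for the standard analyzer (which follows Unicode word-segmentation rules), and all three function names are hypothetical. It is enough to show what happens to "c++", though.

```python
import re


def rough_standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def keyword_tokens(text):
    """A keyword field indexes the whole value as one unmodified token."""
    return [text]


def term_matches(indexed_tokens, term):
    """A term query looks up the exact term, without analyzing it first."""
    return term in indexed_tokens


all_field_tokens = rough_standard_tokens("c++")
tags_field_tokens = keyword_tokens("c++")
```

In this model the analyzed field only ever contains the token c, so a term query for c++ finds nothing there, while the keyword field still holds c++ verbatim. A match query analyzes the search text too, so "c++" is reduced to "c" before matching, which is consistent with "c" and "c++" returning the same results.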
Do you really need to search across multiple fields? If so, you can override the default analyzer of the _all field. For a quick test I created an index with settings like this:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"test" : {
"_all" : {"type" : "string", "index" : "not_analyzed", "analyzer" : "keyword"},
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
}
Or you can create a custom _all field.
Solutions like a multi-field query, which let you specify a list of fields to search over, would behave much like your example with body.query.bool.must.push({match: {"_all": search}});.

Elasticsearch mapping not working as expected

Having the following mapping:
curl -X PUT 'localhost:9200/cambio_indice?pretty=true' -d '{
"mappings" : {
"el_tipo" : {
"properties" : {
"name" : { "type" : "string" },
"age" : { "type" : "integer" },
"read" : { "type" : "integer" }
}}}}'
If I index the following document it works perfectly, even though it doesn't match the mapping (read is missing); ES doesn't complain.
curl -X PUT 'localhost:9200/cambio_indice/el_tipo/1?pretty=true' -d '{
"name" : "Eduardo Inda",
"age" : 23
}'
And if I add the following entry, it also works.
curl -X PUT 'localhost:9200/cambio_indice/el_tipo/2?pretty=true' -d '{
"jose" : "stuff",
"ramon" : 23,
"garcia" : 1
}'
It seems that the mapping is not taking effect on the elements I'm adding. Am I doing something wrong when I try to map my type?
This is the default behaviour of Elasticsearch and is desirable in most of the cases. But for your case, if you do not want to allow indexing of fields not defined in your mapping, you need to update the mapping and set its "dynamic" property to "strict". Basically, your mapping definition should look like below:
{
"mappings": {
"el_tipo": {
"dynamic": "strict",
"properties": {
"name": {
"type": "string"
},
"age": {
"type": "integer"
},
"read": {
"type": "integer"
}
}
}
}
}
Then if you try to index fields like "jose", "ramon" or "garcia", Elasticsearch will throw an exception with a message saying that the dynamic addition of these fields is prohibited.
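To see which of the two documents from the question a strict mapping would affect, the check can be mimicked locally. This is a toy model that only compares top-level keys against the mapped properties; it is not how Elasticsearch implements it, and unmapped_fields is a hypothetical helper.

```python
def unmapped_fields(properties, doc):
    """Return the document keys that are absent from the mapping's properties."""
    return sorted(key for key in doc if key not in properties)


# The properties from the mapping above.
properties = {
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "read": {"type": "integer"},
}

doc_1 = {"name": "Eduardo Inda", "age": 23}
doc_2 = {"jose": "stuff", "ramon": 23, "garcia": 1}

rejected_1 = unmapped_fields(properties, doc_1)
rejected_2 = unmapped_fields(properties, doc_2)
```

The first document introduces no new fields (a missing field such as read is not an error), while the second one introduces three, which is what a strict mapping refuses.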
As per the ES documentation:
By default, Elasticsearch provides automatic index and mapping when data is added under an index that has not been created before. In other words, data can be added into Elasticsearch without the index and the mappings being defined a priori. This is quite convenient since Elasticsearch automatically adapts to the data being fed to it - moreover, if certain entries have extra fields, Elasticsearch schema-less nature allows them to be indexed without any issues.
So new fields added by you will get automatically added to your mappings.
See this for more info

Unable to retrieve a geopoint data field when searching

I have an index with a field called loc which is correctly mapped as a geopoint.
When running a search like:
curl -XGET 'http://localhost:9200/DB/_search'
I get 10 or so results and all of them appear to have loc inside the _source object.
If I try:
curl -XGET 'http://localhost:9200/DB/_search?fields=name'
I get a fields object with the name field correctly set up (name exists, it is another field, it is a string). Thing is, if I try the same thing with the loc field, as in:
curl -XGET 'http://localhost:9200/DB/_search?fields=loc'
I don't get anything back, neither the _source nor the fields objects.
How may I return the loc field when running this query?
Bonus question: Is there a way to return the loc field as a geohash?
Update, here's the mapping:
{
"geonames": {
"mappings": {
"place": {
"properties": {
"ele": {
"type": "string"
},
"geoid": {
"type": "string"
},
"loc": {
"type": "geo_point"
},
"name": {
"type": "string"
},
"pop": {
"type": "string"
},
"tz": {
"type": "string"
}
}
}
}
}
}
You should use source filtering instead of fields and you'll get the loc field as you expect.
curl -XGET 'localhost:9200/DB/_search?_source=loc'
Quoting from the official documentation on fields (emphasis added):
The fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.
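If the request is built programmatically, the same _source parameter can be assembled with the standard library. A small sketch, where search_url is a hypothetical helper and the host and index name are taken from the question:

```python
from urllib.parse import urlencode


def search_url(base, index, source_fields):
    """Build a _search URL that asks only for the given fields of _source."""
    # safe="," keeps the comma-separated field list readable in the URL.
    query = urlencode({"_source": ",".join(source_fields)}, safe=",")
    return f"{base}/{index}/_search?{query}"


url = search_url("http://localhost:9200", "DB", ["loc"])
```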

copy_to and custom analyzer not working

(I'm doing this with a fresh copy of Elasticsearch 1.5.2)
I've defined a custom analyzer and it's working:
curl -XPUT 127.0.0.1:9200/test -d '{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"UrlTokenizer": {
"type": "pattern",
"pattern": "https?://([^/]+)",
"group": 1
}
},
"analyzer": {
"accesslogs": {
"tokenizer": "UrlTokenizer"
}
}
}
}
}
}'; echo
curl '127.0.0.1:9200/test/_analyze?analyzer=accesslogs&text=http://192.168.1.1/123?a=2#1111' | json_pp
Now I apply it to an index:
curl -XPUT 127.0.0.1:9200/test/accesslogs/_mapping -d '{
"accesslogs" : {
"properties" : {
"referer" : { "type" : "string", "copy_to" : "referer_domain" },
"referer_domain": {
"type": "string",
"analyzer": "accesslogs"
}
}
}
}'; echo
From the mapping I can see both of them are applied.
Now I try to insert some data,
curl 127.0.0.1:9200/test/accesslogs/ -d '{
"referer": "http://192.168.1.1/aaa.php",
"response": 100
}';echo
And the copy_to field, aka referer_domain, was not generated, and if I try to add a field with that name, the tokenizer is not applied either.
Any ideas?
copy_to works, but you are assuming that because you don't see the field being generated, it doesn't exist.
When you get your document back (with GET /test/accesslogs/1 for example), you don't see the field under _source. _source contains the original document that was indexed, and you didn't index any referer_domain field, just referer and response. This is the reason why you don't see it.
But Elasticsearch does create that field in the inverted index. You can use it to query, compute or retrieve if you stored it.
To illustrate:
You can query that field and you will get results back based on it. If you really want to see what has been stored in the inverted index, you can do this:
GET /test/accesslogs/_search
{
"fielddata_fields": ["referer","response","referer_domain"]
}
You can also retrieve that field if you stored it:
"referer_domain": {
"type": "string",
"analyzer": "accesslogs",
"store" : true
}
with this:
GET /test/accesslogs/_search
{
"fields": ["referer","response","referer_domain"]
}
In conclusion, copy_to modifies the indexed document, not the source document. You can query your documents having that field and it will work because the query looks at the inverted index. If you want to retrieve that field you need to store it, as well. But you will not see that field in the _source field because _source is the initial document that has been indexed. And the initial document doesn't contain referer_domain.
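The split between _source and the indexed terms can be pictured with a toy model. This is not Elasticsearch code: index_document is a hypothetical function, and the regex simply reuses the pattern and group from the UrlTokenizer defined in the question.

```python
import re

# Same pattern as the UrlTokenizer in the question; group 1 is the host part.
URL_HOST = re.compile(r"https?://([^/]+)")


def index_document(doc):
    """Toy model: return (source, indexed_terms) for one document."""
    # _source is the document exactly as it was sent.
    source = dict(doc)
    # copy_to feeds the referer value into referer_domain at index time only.
    referer = doc.get("referer", "")
    indexed_terms = {
        "referer_domain": [m.group(1) for m in URL_HOST.finditer(referer)],
    }
    return source, indexed_terms


source, indexed_terms = index_document(
    {"referer": "http://192.168.1.1/aaa.php", "response": 100}
)
```

In this model referer_domain appears only among the indexed terms, never in the source, which mirrors why the field is searchable but invisible when the document is fetched.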

Setting up a Kibana terms panel for an Elasticsearch field that is a list of strings

I have a Kibana dashboard that contains a terms panel to show the number of instances for a particular field (let's call it field1). Field1 is, effectively, a list of strings. Each string usually contains multiple words. Since it's analyzed, Elasticsearch breaks the terms up into separate columns. I need to keep the text together, so I need a not_analyzed version. Here's my attempt to do that with a template, located at ~\config\templates\doc_template.json on a Windows box, which does not seem to be working. Elasticsearch is running as a Windows service.
{
"doc_template": {
"template": "*",
"mappings": {
"Type-*": {
"properties": {
"Field1": {
"type": "multi_field",
"fields": {
"Field1": { "index": "analyzed" },
"RawField1": { "index": "not_analyzed" }
}
}
}
}
}
}
}
In the terms panel, I expect the necessary field to be either RawField1 or Field1.RawField1, but I've tried other variations including and excluding .raw, with no luck.
New indexes are created daily. Field1 exists in 4 separate types, each of which begins with "Type-". I suspect my attempt at using a wildcard there is problematic, but I'm not sure. All data is being sent to Elasticsearch via NEST in a C# .NET application. Here's the mapping for Field1 as it currently exists for one of the types:
{
"index-2014.12.08" : {
"mappings" : {
"Type-1" : {
"properties" : {
"Field1" : {
"type" : "string"
},
"Field2" : {
"type" : "string"
},
"Field3" : {
"type" : "string"
}
}
}
}
}
}
Obviously, the mapping doesn't look the way I expect. What's the best way to remedy this issue?
