copy_to and custom analyzer not working - elasticsearch

(I'm doing this with a fresh copy of Elasticsearch 1.5.2)
I've defined a custom analyzer and it's working:
curl -XPUT 127.0.0.1:9200/test -d '{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"UrlTokenizer": {
"type": "pattern",
"pattern": "https?://([^/]+)",
"group": 1
}
},
"analyzer": {
"accesslogs": {
"tokenizer": "UrlTokenizer"
}
}
}
}
}
}'; echo
curl '127.0.0.1:9200/test/_analyze?analyzer=accesslogs&text=http://192.168.1.1/123?a=2#1111' | json_pp
Now I apply it to an index:
curl -XPUT 127.0.0.1:9200/test/accesslogs/_mapping -d '{
"accesslogs" : {
"properties" : {
"referer" : { "type" : "string", "copy_to" : "referer_domain" },
"referer_domain": {
"type": "string",
"analyzer": "accesslogs"
}
}
}
}'; echo
From the mapping I can see both of them are applied.
Now I try to insert some data,
curl 127.0.0.1:9200/test/accesslogs/ -d '{
"referer": "http://192.168.1.1/aaa.php",
"response": 100
}';echo
But the copy_to target field, referer_domain, was not generated, and if I add a field with that name myself, the tokenizer is not applied to it either.
Any ideas?

copy_to works, but you are assuming that because you don't see the field being generated, it doesn't exist.
When you fetch your document (with GET /test/accesslogs/1, for example), you don't see the field under _source, because _source contains the original document as it was submitted for indexing. You indexed only referer and response, not referer_domain, and this is the reason why you don't see it.
But Elasticsearch does create that field in the inverted index. You can query it, compute on it, or retrieve it if you stored it.
Let me illustrate:
You can query that field and you will get results back based on it. If you really want to see what has been stored in the inverted index, you can do this:
GET /test/accesslogs/_search
{
"fielddata_fields": ["referer","response","referer_domain"]
}
You can also retrieve that field if you stored it:
"referer_domain": {
"type": "string",
"analyzer": "accesslogs",
"store" : true
}
with this:
GET /test/accesslogs/_search
{
"fields": ["referer","response","referer_domain"]
}
In conclusion, copy_to modifies the indexed document, not the source document. You can query your documents having that field and it will work because the query looks at the inverted index. If you want to retrieve that field you need to store it, as well. But you will not see that field in the _source field because _source is the initial document that has been indexed. And the initial document doesn't contain referer_domain.
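The distinction can be modelled outside Elasticsearch. The following Python sketch is a toy model, not how Elasticsearch is implemented: index_document is a hypothetical helper, and the regex mirrors the UrlTokenizer pattern from the question. It shows _source staying as submitted while the copied value is tokenized into a separate per-field term list.

```python
import re

# Same pattern as the UrlTokenizer in the question; group 1 is the host part.
URL_HOST = re.compile(r"https?://([^/]+)")


def index_document(doc):
    """Toy model of indexing with copy_to (hypothetical helper, not an ES API).

    Returns (source, inverted): source is the untouched input, inverted maps
    a field name to the list of terms indexed for it.
    """
    source = dict(doc)  # _source keeps exactly what was submitted
    inverted = {}
    referer = doc.get("referer")
    if referer is not None:
        # Simplification: the real default analyzer would split this further.
        inverted["referer"] = [referer]
        # copy_to: the value is analyzed with the target field's own analyzer.
        inverted["referer_domain"] = [
            m.group(1) for m in URL_HOST.finditer(referer)
        ]
    return source, inverted


source, inverted = index_document(
    {"referer": "http://192.168.1.1/aaa.php", "response": 100}
)
print("referer_domain" in source)  # False
print(inverted["referer_domain"])  # ['192.168.1.1']
```

In this model a search against referer_domain would consult inverted, which is why the query works even though the field never shows up in source.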

Related

How to denormalize hierarchy in ElasticSearch?

I am new to ElasticSearch and I have a tree that describes a path to a certain document (not real filesystem paths, just simple text fields categorizing articles, images, and documents as one). Each path entry has a type, such as Group Name, Assembly name, or even Unknown. The types could be used in queries, for example to skip certain entries in the path.
My source data is stored in SQL Server, the schema looks something like this:
Tree builds up by connecting the Tree.Id to Tree.ParentId, but each node must have a type. The Documents are connected to a leaf in the Tree.
I am not worried about querying the structure in SQL Server; however, I need to find an optimal approach to denormalize and search it in Elastic. If I flatten the paths and make a list of "descriptors" for a document, I can store each of the Document entries as an Elastic document:
{
"path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
"descriptors": [
{
"name": "NodeNameRoot",
"type": "type1"
},
{
"name": "NodeNameLevel_1",
"type": "type1"
},
{
"name": "NodeNameLevel_2",
"type": "type2"
},
{
"name": "NodeNameLevel_3",
"type": "type2"
},
{
"name": "NodeNameLevel_4",
"type": "type3"
}
],
"document": {
...
}
}
Can I query such a structure in ElasticSearch, or should I denormalize the paths in a different way?
My main questions:
Can I query them based on type or text value (regex matching, for example)? For example: give me all the type2->type3 paths (practically leaving type1 out) where the path contains X.
Is it possible to query based on levels? For example, I would like the paths where there are 4 descriptors.
Can I do the searching with the built-in functionality or do I need to write an extension?
Edit
Based on G Quintana's answer, I made an index like this:
curl -X PUT \
http://localhost:9200/test \
-H 'cache-control: no-cache' \
-H 'content-type: application/json' \
-d '{
"mappings": {
"path": {
"properties": {
"names": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"tokens": {
"type": "text",
"analyzer": "pathname_analyzer"
},
"depth": {
"type": "token_count",
"analyzer": "pathname_analyzer"
}
}
},
"types": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"tokens": {
"type": "text",
"analyzer": "pathname_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"pathname_analyzer": {
"type": "pattern",
"pattern": "#->>",
"lowercase": true
}
}
}
}
}'
And I could query the depth like this:
curl -X POST \
http://localhost:9200/test/path/_search \
-H 'content-type: application/json' \
-d '{
"query": {
"bool": {
"should": [
{"match": { "names.depth": 5 }}
]
}
}
}'
This returns correct results. I will test it a little more.
First of all you should identify all your query patterns to design how you will index your data.
From the example you gave, I would index documents of the form:
{
"path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
"types": "type1/type1/type2/type2/type3",
"document": {
...
}
}
Before indexing, you must configure mapping and analysis:
Field path:
use type text with an analyzer based on the pattern analyzer to split at / characters
use type token_count with the same analyzer to compute the path depth, as a multi-field (path.depth)
Field types:
use type text with an analyzer based on the pattern analyzer to split at / characters
With the path and types fields split this way, each requirement maps to a query:
"Give me all the type2->type3 paths": use a match_phrase query on the types field
"where the path contains X": use a match query on the path field
"where there are 4 descriptors": use a term query on the path.depth sub-field
Your descriptors field is not needed for these queries.
The Path tokenizer might be interesting for some use cases.
You can apply multiple analyzers to the same field using multi-fields and then query the sub-fields.
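To make the link between requirements and queries concrete, here is a small Python model of what the analysis produces. It is only an illustration under assumptions: analyze_path and has_phrase are hypothetical helpers, and real queries would run inside Elasticsearch via match_phrase, match, and term.

```python
def analyze_path(value, sep="/"):
    """Split at the separator and lowercase, roughly like a pattern analyzer."""
    return [tok.lower() for tok in value.split(sep) if tok]


def has_phrase(tokens, phrase):
    """True if `phrase` occurs as consecutive tokens (the match_phrase idea)."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


doc = {
    "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
    "types": "type1/type1/type2/type2/type3",
}

path_tokens = analyze_path(doc["path"])
type_tokens = analyze_path(doc["types"])

depth = len(path_tokens)  # what a token_count sub-field would hold
type2_then_type3 = has_phrase(type_tokens, ["type2", "type3"])  # match_phrase on types
contains_level_2 = "nodenamelevel_2" in path_tokens  # match on path

print(depth, type2_then_type3, contains_level_2)  # 5 True True
```

Keeping path and types as two parallel token lists is what lets position-based queries such as match_phrase express "type2 directly followed by type3".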

Unable to retrieve a geopoint data field when searching

I have an index with a field called loc which is correctly mapped as a geopoint.
When running a search like:
curl -XGET 'http://localhost:9200/DB/_search'
I get 10 or so results and all of them appear to have loc inside the _source object.
If I try:
curl -XGET 'http://localhost:9200/DB/_search?fields=name'
I get a fields object with the name field correctly set up (name exists, it is another field, it is a string). Thing is, if I try the same thing with the loc field, as in:
curl -XGET 'http://localhost:9200/DB/_search?fields=loc'
I don't get anything back, neither the _source nor the fields objects.
How may I return the loc field when running this query?
Bonus question: Is there a way to return the loc field as a geohash?
Update, here's the mapping:
{
"geonames": {
"mappings": {
"place": {
"properties": {
"ele": {
"type": "string"
},
"geoid": {
"type": "string"
},
"loc": {
"type": "geo_point"
},
"name": {
"type": "string"
},
"pop": {
"type": "string"
},
"tz": {
"type": "string"
}
}
}
}
}
}
You should use source filtering instead of fields and you'll get the loc field as you expect.
curl -XGET 'localhost:9200/DB/_search?_source=loc'
Quoting from the official documentation on fields (emphasis added):
The fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.

elasticsearch aggregations separated words

I simply ran an aggregation in a browser plugin (Marvel). As you can see in the picture below, only one doc matches the query, but the aggregation results are split on spaces, which doesn't make sense. I want to aggregate per document value, so in this scenario there should be only one group, with count 1 and key "Drow Ranger".
What is the right way to do this in Elasticsearch?
It's probably because your heroname field is analyzed and thus "Drow Ranger" gets tokenized and indexed as "drow" and "ranger".
One way to get around this is to transform your heroname field to a multi-field with an analyzed part (the one you search on with the wildcard query) and another not_analyzed part (the one you can aggregate on).
You should create your index like this and specify the proper mapping for your heroname field
curl -XPUT localhost:9200/dota2 -d '{
"mappings": {
"agust": {
"properties": {
"heroname": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
... your other fields go here
}
}
}
}'
Then you can run your aggregation on the heroname.raw field instead of the heroname field.
UPDATE
If you just want to try this on the heroname field, you can modify that field without recreating the whole index. The following command simply adds the new heroname.raw sub-field to your existing heroname field. Note that you still have to reindex your data, though:
curl -XPUT localhost:9200/dota2/_mapping/agust -d '{
"properties": {
"heroname": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Then you can keep using heroname in your wildcard query, but your aggregation will look like this:
{
"aggs": {
"asd": {
"terms": {
"field": "heroname.raw", <--- use the raw field here
"size": 0
}
}
}
}
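Why the buckets differ can be seen in a small Python model of a terms aggregation. This is a rough sketch under assumptions: analyze only approximates the default analyzer (lowercase, split on non-word characters), and terms_agg is a hypothetical stand-in for the real aggregation.

```python
import re
from collections import Counter


def analyze(text):
    """Rough approximation of the default analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def terms_agg(docs, field, analyzed):
    """Count one bucket per distinct indexed term, at most once per document."""
    buckets = Counter()
    for doc in docs:
        terms = analyze(doc[field]) if analyzed else [doc[field]]
        buckets.update(set(terms))
    return dict(buckets)


docs = [{"heroname": "Drow Ranger"}]

analyzed_buckets = terms_agg(docs, "heroname", analyzed=True)
raw_buckets = terms_agg(docs, "heroname", analyzed=False)

print(analyzed_buckets)  # two buckets: 'drow' and 'ranger', each with count 1
print(raw_buckets)       # {'Drow Ranger': 1}
```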

Issues mapping a document with ElasticSearch

I have a document that I was hoping to store in ElasticSearch and run queries against, but I think the document structure is possibly badly formed and, as such, I won't be able to do effective queries.
The document is trying to be generic and as such, has a set of repeating structures.
For example:
description : [
{ type : "port", value : 1234 },
{ type : "ipaddress", value : "192.168.0.1" },
{ type : "path", value : "/app/index.jsp app/hello.jsp" },
{ type : "upsince", value : "2014-01-01 12:00:00" },
{ type : "location", value : "-40, 70" }
]
Note: I've simplified the example, as in the real document the repeating structure has about 7 fields, of which 3 fields will explicitly identify the "type".
From the above example I can't see how I can write a mapping, as the "value" could be any of:
Integer
IP Address
A field that needs to be tokenized by only whitespace
A datetime
A GEO Point
The only solution I can see is to convert the document into another format that maps more easily onto ElasticSearch. Is that right?
This case is somewhat described here: http://www.found.no/foundation/beginner-troubleshooting/#keyvalue-woes
You can't have different kinds of values in the same field. What you can do is to have different fields like location_value, timestamp_value, and so on.
Here's a runnable example: https://www.found.no/play/gist/ad90fb9e5210d4aba0ee
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create indexes
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
"mappings": {
"type": {
"properties": {
"description": {
"type": "nested",
"properties": {
"integer_value": {
"type": "integer"
},
"type": {
"type": "string",
"index": "not_analyzed"
},
"timestamp_value": {
"type": "date"
}
}
}
}
}
}
}'
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"description":[{"type":"port","integer_value":1234},{"type":"upsince","timestamp_value":"2014-01-01T12:00:00"}]}
'
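Producing documents in that shape from the generic type/value pairs is a small preprocessing step. A Python sketch, assuming a hand-written lookup table: VALUE_FIELD_BY_TYPE and to_typed_entry are hypothetical names, only integer_value and timestamp_value come from the mapping above, and the timestamp is rewritten to the ISO form used in the bulk request.

```python
# Hypothetical lookup: which typed field each "type" should be written to.
VALUE_FIELD_BY_TYPE = {
    "port": "integer_value",
    "upsince": "timestamp_value",
}


def to_typed_entry(entry):
    """Move the generic "value" into a field whose name fixes its data type."""
    key = VALUE_FIELD_BY_TYPE[entry["type"]]
    value = entry["value"]
    if key == "timestamp_value":
        # "2014-01-01 12:00:00" -> "2014-01-01T12:00:00"
        value = value.replace(" ", "T")
    return {"type": entry["type"], key: value}


description = [
    {"type": "port", "value": 1234},
    {"type": "upsince", "value": "2014-01-01 12:00:00"},
]

typed = [to_typed_entry(e) for e in description]
print(typed)
```

An unknown type raises KeyError here, which is deliberate in this sketch: a new kind of value needs a mapping decision before it is indexed.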
You're going to save yourself a lot of headaches if you first convert the documents to a form like this:
{
"port": 1234,
"ipaddress" : "192.168.0.1" ,
"path" : "/app/index.jsp app/hello.jsp",
"upsince" : "2014-01-01 12:00:00",
"location" : "-40, 70"
}
Elasticsearch is designed to be flexible when it comes to fields and values, so it can already deal with pretty much any key/value combination you throw at it.
Optionally, you can include the original document in a field that's explicitly stored but not indexed, in case you need the original document returned in your queries.
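A conversion like that is a few lines in most languages. The Python sketch below assumes each type appears at most once per document (flatten_description is a hypothetical helper); repeated types would need a list or a different key scheme.

```python
def flatten_description(description):
    """Turn [{type, value}, ...] into one flat {type: value} document."""
    flat = {}
    for entry in description:
        key = entry["type"]
        if key in flat:
            # Assumption of this sketch: one value per type.
            raise ValueError(f"duplicate type: {key}")
        flat[key] = entry["value"]
    return flat


description = [
    {"type": "port", "value": 1234},
    {"type": "ipaddress", "value": "192.168.0.1"},
    {"type": "path", "value": "/app/index.jsp app/hello.jsp"},
    {"type": "upsince", "value": "2014-01-01 12:00:00"},
    {"type": "location", "value": "-40, 70"},
]

flat = flatten_description(description)
print(flat["port"], flat["location"])  # 1234 -40, 70
```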

Why Elasticsearch "not_analyzed" field is split into terms?

I have the following field in my mapping definition:
...
"my_field": {
"type": "string",
"index":"not_analyzed"
}
...
When I index a document with value of my_field = 'test-some-another' that value is split into 3 terms: test, some, another.
What am I doing wrong?
I created the following index:
curl -XPUT localhost:9200/my_index -d '{
"index": {
"settings": {
"number_of_shards": 5,
"number_of_replicas": 2
},
"mappings": {
"my_type": {
"_all": {
"enabled": false
},
"_source": {
"compressed": true
},
"properties": {
"my_field": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
Then I index the following document:
curl -XPOST localhost:9200/my_index/my_type -d '{
"my_field": "test-some-another"
}'
Then I use the plugin https://github.com/jprante/elasticsearch-index-termlist with the following API:
curl -XGET localhost:9200/my_index/_termlist
That gives me the following response:
{"ok":true,"_shards":{"total":5,"successful":5,"failed":0},"terms": ["test","some","another"]}
Verify that mapping is actually getting set by running:
curl localhost:9200/my_index/_mapping?pretty=true
The command that creates the index seems to be incorrect. It shouldn't contain "index" : { as a root element. Try this:
curl -XPUT localhost:9200/my_index -d '{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 2
},
"mappings": {
"my_type": {
"_all": {
"enabled": false
},
"_source": {
"compressed": true
},
"properties": {
"my_field": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
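A quick client-side check can catch this kind of wrapper before the request is sent. The Python sketch below is only an illustration: unwrap_index_root is a hypothetical helper, and it assumes the only mistake is a single top-level "index" key wrapping settings and mappings.

```python
def unwrap_index_root(body):
    """Return the body with a stray top-level "index" wrapper removed."""
    if set(body) == {"index"} and "mappings" in body["index"]:
        return body["index"]
    return body


wrong = {
    "index": {
        "settings": {"number_of_shards": 5, "number_of_replicas": 2},
        "mappings": {
            "my_type": {
                "properties": {
                    "my_field": {"type": "string", "index": "not_analyzed"}
                }
            }
        },
    }
}

fixed = unwrap_index_root(wrong)
print(sorted(fixed))  # ['mappings', 'settings']
```

Running the _mapping check from above after creating the index remains the authoritative test; this only guards against one specific shape of mistake.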
In ElasticSearch, a field is indexed when it goes into the inverted index, the data structure that Lucene uses to provide fast full-text search. If you want to search on a field, you have to index it. When you index a field, you can either index it as is or analyze it. Analyzing means choosing a tokenizer, which generates a list of tokens (words), and a list of token filters, which can modify the generated tokens (or even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only by searching for that exact text, whitespace included.
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but do want to retrieve and show: stored and not indexed.
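How analysis changes what can be searched is easy to imitate with the value from the question. This Python sketch is a simplified model: index_terms is a hypothetical helper, and the analyzed branch only approximates the default analyzer by lowercasing and splitting on non-word characters.

```python
import re


def index_terms(value, index="analyzed"):
    """Return the terms that would land in the inverted index for a string."""
    if index == "no":
        return []        # not indexed at all, so not searchable
    if index == "not_analyzed":
        return [value]   # the whole value becomes a single term
    return re.findall(r"\w+", value.lower())


print(index_terms("test-some-another"))                  # ['test', 'some', 'another']
print(index_terms("test-some-another", "not_analyzed"))  # ['test-some-another']
```

Seeing three terms for a field that was meant to be not_analyzed is therefore a sign that the mapping was never applied, which is what the _mapping check above verifies.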
