Elasticsearch: what analyzer should be used to search for both URL fragments and exact URL paths?

I want to store URIs in a mapping and make them searchable in the following ways:
Exact match: if I stored http://stackoverflow.com/questions, then searching for the term http://stackoverflow.com/questions retrieves the item.
A bit like the letter tokenizer, all "words" should be searchable. So searching for questions, stackoverflow, or maybe com brings back http://stackoverflow.com/questions as a hit.
URL fragments separated by '.' or '/' should still be searchable. So searching for stackoverflow.com brings back http://stackoverflow.com/questions as a hit.
Searches should be case insensitive (like lowercase).
The http://, https://, www. etc. is optional when searching. So searching for either http://stackoverflow.com or stackoverflow.com brings back http://stackoverflow.com/questions as a hit.
Maybe the solution is something like chaining tokenizers. I'm quite new to ES, so this may be a trivial question.
So what kind of analyzer should I use or build to achieve this functionality?
Any help would be greatly appreciated.

You are absolutely correct. You will want to set your field type to multi_field and then create an analyzer for each scenario. At its core, you can then run a multi_match query across the sub-fields:
=============type properties===============
{
  "fun_documents": {
    "properties": {
      "url": {
        "type": "multi_field",
        "fields": {
          "keyword": {
            "type": "string",
            "analyzer": "keyword"
          },
          "alphanum_only": {
            "type": "string",
            "analyzer": "my_custom_alpha_num_analyzer"
          },
          {
            "etc": "etc"
          }
        }
      }
    }
  }
}
==================query=====================
{
  "query": {
    "multi_match": {
      "query": "stackoverflow",
      "fields": [
        "url.keyword",
        "url.alphanum_only",
        "url.optional_fun"
      ]
    }
  }
}
Note that you can get fancy with multi_field aliases and reusing the same name, but this is the simple demonstration.
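To make the idea of one analyzer per scenario more concrete, here is a small Python emulation of two of the sub-fields. It does not call Elasticsearch: keyword_analyzer mimics the built-in keyword analyzer, while alpha_num_analyzer is only a hypothetical stand-in for my_custom_alpha_num_analyzer (assumed here to split on non-alphanumeric characters and lowercase). The matches helper is also an assumption and requires every query token to be present.

```python
import re


def keyword_analyzer(text):
    """Emulates the built-in `keyword` analyzer: the whole input is one token."""
    return [text]


def alpha_num_analyzer(text):
    """Hypothetical stand-in for `my_custom_alpha_num_analyzer`:
    split on anything that is not a letter or digit, then lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def matches(query, stored, analyzer):
    """True if every analyzed query token is among the stored tokens."""
    stored_tokens = set(analyzer(stored))
    return all(tok in stored_tokens for tok in analyzer(query))


url = "http://stackoverflow.com/questions"

# Exact match only works through the keyword-style sub-field.
exact_hit = matches(url, url, keyword_analyzer)
# Fragments only work through the alphanumeric sub-field.
fragment_hit = matches("StackOverflow.com", url, alpha_num_analyzer)
```

Under these assumptions, the keyword sub-field covers the exact-match requirement and the alphanumeric sub-field covers the word, fragment, and case-insensitivity requirements, which is why the query above searches several sub-fields at once.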

Related

Elasticsearch: Is there a way to exclude synonyms from highlighting?

I'm trying to exclude synonyms from highlighting. I created a copy of my current analyzer with a synonym filter. So for each field I now have an analyzer and a search_analyzer. The search analyzer is the new analyzer with all the same filters plus the synonym filter.
Any ideas? I am using elasticsearch 5.2
Mapping:
"mappings": {
  "doc": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "search_analyzer": "custom_analyzer_with_synonyms",
        "fields": {
          "plain": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
Search Query:
{
  "query": {
    "match": {
      "body": "something"
    }
  },
  "highlight": {
    "pre_tags": "<strong>",
    "post_tags": "</strong>",
    "fields" : {
      "body.plain" : {
        "number_of_fragments": 1,
        "require_field_match": false
      }
    }
  }
}
I am not sure what causes the problem. I would have thought that simply highlighting on a field that is not analyzed with synonyms would have done it, but according to the comments it still highlights the synonyms. I can think of two possible reasons (I haven't looked into the highlighter source code):
First, it could be the multi-word synonym problem described at this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html The link is old, so this may have been fixed by now. If not, it could be causing the highlighter to look at the wrong position offsets.
Second (and/or), it could be because the highlighted field is not used in the query. The highlighter might simply take the tokens emitted by the searched field's analyzer (which would contain synonyms) and look for those tokens in the highlighted field.
If the first reason applies, you could try changing your synonyms to use simple contraction. See: https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html#synonyms-contraction But that approach has its own problems with the frequencies of uncommon words, and it could be a lot of work.
The fix for the second case would be to use the "body.plain" field in the query, but you cannot do that because it would affect your scores. Instead, specifying a separate query for the highlighter on the non-synonym field does the trick without affecting scores. It works even if the first reason is the actual problem, since no synonyms are used on the highlighted field.
So your query should look something like this:
{
  "query": {
    "match": {
      "body": "something"
    }
  },
  "highlight": {
    "pre_tags": "<strong>",
    "post_tags": "</strong>",
    "fields" : {
      "body.plain" : {
        "number_of_fragments": 1,
        "highlight_query": {
          "match": {"body.plain": "something"}
        }
      }
    }
  }
}
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-highlighting.html#_highlight_query
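If the search text varies per request, the same body can be assembled programmatically. The following is only a sketch: build_search is a hypothetical helper name, and the field names default to the ones from the mapping in the question. It keeps the scoring query on body and puts a separate highlight_query on body.plain.

```python
def build_search(text, score_field="body", highlight_field="body.plain"):
    """Build a search body that scores on one field but highlights on another.

    `build_search` is a hypothetical helper; the default field names are
    taken from the mapping shown in the question.
    """
    return {
        # Scoring still uses the synonym-aware field.
        "query": {"match": {score_field: text}},
        "highlight": {
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"],
            "fields": {
                highlight_field: {
                    "number_of_fragments": 1,
                    # The highlighter runs its own query on the plain field,
                    # so synonym tokens never reach it.
                    "highlight_query": {"match": {highlight_field: text}},
                }
            },
        },
    }


body = build_search("something")
```

The resulting dictionary can be passed as the request body by whichever client you use.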

Mapping in elasticsearch

Good morning. In my code I can't search for data that contains several separate words; searching for a single word works fine. I think the problem is in the mapping. I use Postman: when I send a GET request to http://192.168.1.153:9200/sport_scouts/video/_mapping, I get:
{
  "sport_scouts": {
    "mappings": {
      "video": {
        "properties": {
          "hashtag": {
            "type": "string"
          },
          "id": {
            "type": "long"
          },
          "sharing_link": {
            "type": "string"
          },
          "source": {
            "type": "string"
          },
          "title": {
            "type": "string"
          },
          "type": {
            "type": "string"
          },
          "user_id": {
            "type": "long"
          },
          "video_preview": {
            "type": "string"
          }
        }
      }
    }
  }
}
All good: title has type string. But if I search for two or more words, I get an empty array. My code, in a trait:
public function search($data) {
    $this->client();
    $params['body']['query']['filtered']['filter']['or'][]['term']['title'] = $data;
    $search = $this->client->search($params)['hits']['hits'];
    dump($search);
}
Then I call it in my controller. Can you help me with this problem?
Your indexed data can't be found because of a mismatch between the analysis applied during indexing and the strict term filter used when querying.
With your mapping configuration you are using the default analysis, which (among other operations) tokenizes the input. So every multi-word value you insert is split at punctuation and whitespace. If you insert, for example, "some great sentence", Elasticsearch indexes the terms "some", "great" and "sentence" for your document, but not the term "great sentence". So a term filter on "great sentence", or on any other part of the original value that contains whitespace, returns no results.
Please see the Elasticsearch docs on how to configure your mapping for indexing without analysis (https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2), or consider using a match query instead of a term filter on the existing mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html).
Please be aware that if you switch to not_analyzed you lose much of the full-text query functionality, such as fuzzy matching. Of course, you can set up a mapping that does both, indexing the value as analyzed and as not_analyzed in different fields. Then it is up to you to decide which field to query.
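The mismatch can be reproduced without a cluster. The sketch below is a rough emulation, not Elasticsearch code: standard_like_tokens is a simplified stand-in for the default analyzer (lowercase, split on non-word characters), and the two helper functions are hypothetical names that model how a term filter and a match query treat the query string.

```python
import re


def standard_like_tokens(text):
    """Simplified stand-in for the default analyzer:
    lowercase the text and split it on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def term_filter_hits(indexed_value, term):
    """A term filter looks the query string up verbatim among the indexed terms."""
    return term in standard_like_tokens(indexed_value)


def match_query_hits(indexed_value, query):
    """A match query analyzes the query string first; any shared token is a hit
    (this models the default OR operator)."""
    indexed = set(standard_like_tokens(indexed_value))
    return any(tok in indexed for tok in standard_like_tokens(query))


title = "some great sentence"

single_word = term_filter_hits(title, "great")            # one indexed term
two_words = term_filter_hits(title, "great sentence")     # never indexed as one term
two_words_match = match_query_hits(title, "great sentence")
```

In this model the single-word term lookup succeeds, the two-word term lookup fails because no such term was ever indexed, and the match query succeeds because the query string is split the same way as the indexed value.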

Boost field on index in Elastic

I'm using Elastic 1.7.3 and I would like to boost some fields in an index with documents like this fictional example:
{
  title: "Mickey Mouse",
  content: "Mickey Mouse is a fictional ...",
  related_articles: [
    {"title": "Donald Duck"},
    {"title": "Goofy"}
  ]
}
Here, for example, title is really important, content too, and related_articles is a bit less important. My real documents have a lot of fields and nested objects.
I would like to give more weight to the title field than to content, and more to content than to related_articles.
I have seen the title^5 approach, but I would have to use it in every query and (I guess) list all my fields instead of querying "_all".
I have searched a lot, but I mostly found deprecated solutions (_boost, for example).
I used to work with Sphinx, so I am looking for something that works like its field weight option, where you can give more weight to the fields in your index that are really important.
You're right that the _boost meta-field that you could use at the type level has been deprecated.
But you can still use the boost property when defining each field in your mapping, which will boost your field at indexing time.
Your mapping would look like this:
{
  "my_type": {
    "properties": {
      "title": {
        "type": "string", "boost": 5
      },
      "content": {
        "type": "string", "boost": 4
      },
      "related_articles": {
        "type": "nested",
        "properties": {
          "title": {
            "type": "string", "boost": 3
          }
        }
      }
    }
  }
}
You have to be aware, though, that it's not necessarily a good idea to boost your field at index time, because once set, you cannot change it unless you are willing to re-index all of your documents, whereas using query-time boosting achieves the same effect and can be changed more easily.
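For comparison, query-time boosting can be kept in one place in application code so that it does not have to be repeated by hand in every query. This is a minimal sketch: boosted_query is a hypothetical helper, and the weights simply mirror the ones in the mapping above. The nested related_articles field is left out on purpose, because nested fields are queried through a nested query rather than a plain multi_match.

```python
def boosted_query(text, weights):
    """Build a multi_match body using the field^boost syntax.

    `boosted_query` is a hypothetical helper; `weights` maps a field name
    to its query-time boost.
    """
    fields = [f"{name}^{boost}" for name, boost in weights.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# The weights live in one dictionary and can be changed without re-indexing.
FIELD_WEIGHTS = {"title": 5, "content": 4}

body = boosted_query("mickey", FIELD_WEIGHTS)
```

Changing a weight then only means editing FIELD_WEIGHTS, whereas the index-time boost in the mapping requires re-indexing every document.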

Excluding field from _source causes aggregation to not work

We're using Elasticsearch 1.7.2 and trying to use the "include/exclude from _source" feature as it's described here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
We have a field types that's 'pretty' and that we would like to return to the client but it's not well suited to aggregations, and a field types_int (and also a types_string but that's not relevant now) that's 'ugly' but optimized for search/aggregations which we don't want to return to the client but that we want to aggregate/filter on.
The field types_int doesn't need to be stored anywhere, it just needs to be indexed. We don't want to waste bandwidth in returning it to the client either, so we don't want to include it in _source.
The mapping for it looks like this:
"types_int": {
  "type": "nested",
  "properties": {
    "name": {
      "type": "string",
      "index": "not_analyzed"
    },
    "value_int": {
      "type": "integer"
    }
  }
}
However, after we add the exclude, our filters/aggregations on it stop working.
The excludes looks like this:
"_source": {
  "excludes": [
    "types_int"
  ]
}
Without that in the mapping, everything works fine.
An example of a filter:
POST my_index/my_type/_search
{
  "filter": {
    "nested": {
      "path": "types_int",
      "filter": {
        "term": {
          "types_int.name": "<something>"
        }
      }
    }
  }
}
Again, removing the excludes and everything works fine.
Thinking it might have something to do with nested types, since they're separate documents and all and perhaps handled differently from normal fields, I added an exclude mapping for a 'normal' value type field and then my filter also stopped working.
"publication": {
  "type": "string",
  "index": "not_analyzed"
}
"_source": {
  "excludes": [
    "publication"
  ]
}
So my conclusion is that after you exclude something from _source, you can no longer filter on it? Which doesn't make sense to me, so I'm thinking there's something we're doing wrong here. The _source include/exclude is just a post-process action that manipulates the string data inside that field, right?
I understand that we can also use source filtering to request specific fields to not be included at query time, but it's simply unnecessary to store it. If anything, I would just like to understand why this doesn't work :)

How can I get a search term with a space to be one search term

I have an elasticsearch index, with a field called "name" with a mapping as follows:
"name": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
},
Now let's say I have a record "Brooklyn Technical High School".
I would like somebody searching for "brooklyn t*" to have that show up. For example: http://myserver/_search?q=name:brooklyn+t*
However, it seems to be tokenizing the search term and searching for both "brooklyn" and "t", because I get back results like "Ps 335 Granville T Woods".
I would like it to search the not_analyzed term using the whole term. Enclosing it in quotes doesn't seem to help either.
You need to use the term query. A term query does not analyze or tokenize the search string before running the search.
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
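Note that a term query on the not_analyzed sub-field matches only the complete stored value. For the trailing-wildcard case in the question, one option (an alternative to the plain term query, offered here as an assumption about what the asker needs) is a prefix query against name.raw. The sketch below builds such a body and emulates the comparison locally; prefix_on_raw and raw_prefix_hits are hypothetical helper names.

```python
def prefix_on_raw(prefix, field="name.raw"):
    """Build a prefix query body against the not_analyzed sub-field.

    The default field name comes from the mapping in the question.
    """
    return {"query": {"prefix": {field: prefix}}}


def raw_prefix_hits(stored_value, prefix):
    """Local emulation: a not_analyzed value is a single term that is
    compared as-is, so the prefix check is case-sensitive."""
    return stored_value.startswith(prefix)


body = prefix_on_raw("Brooklyn T")

wanted = raw_prefix_hits("Brooklyn Technical High School", "Brooklyn T")
unwanted = raw_prefix_hits("Ps 335 Granville T Woods", "Brooklyn T")
lowercase = raw_prefix_hits("Brooklyn Technical High School", "brooklyn t")
```

Because the raw value is stored unchanged, the lowercase prefix does not match in this emulation; a case-insensitive search would need a sub-field whose analyzer lowercases the whole value.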

Resources