Sorting differences between SQL Server & ElasticSearch - sorting

We are implementing a solution for search-only purpose with ElasticSearch to remove some load from the SQL Server database we have.
The issue I have here is that when I sort a document on SQL Server, its not the same order as in ElasticSearch.
Results:
SQL Server sort is : - ! # & # _ 0 1 a A b
ElasticSearch sorts : ! # & - 0 1 # _ A a b
The mapping of the field I sort on:
description":
{
"type": "nested",
"properties": {
"notAnalyzed":
{
"analyzer": "ipStrictAnalyzer",
"type": "string"
},
"analyzed":
{
"analyzer": "ipAnalyzer",
"type": "string"
}
}
The analyzer:
"ipStrictAnalyzer":
{
"filter": ["lowercase","trim","ipSynonym"],
"type": "custom",
"tokenizer": "ipStrictTokenizer"
}
The tokenizer:
"ipStrictTokenizer":
{
"type": "keyword"
}
The ipSynonym filter is really just a synonym filter with an empty file right now.
"ipSynonym":
{
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
The order by I send:
"sort":
[
{
"description.notAnalyzed": {
"order": "desc"
}
}
],
Can we customize that?
Thanks in advance,
Matthias.

Related

Elasticsearch Field Preference for result sequence

I have created the index in elasticsearch with the following mapping:
{
"test": {
"mappings": {
"documents": {
"properties": {
"fields": {
"type": "nested",
"properties": {
"uid": {
"type": "keyword"
},
"value": {
"type": "text",
"copy_to": [
"fulltext"
]
}
}
},
"fulltext": {
"type": "text"
},
"tags": {
"type": "text"
},
"title": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"url": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
}
While searching I want to set the preference of fields for example if search text found in title or url then that document comes first then other documents.
Can we set a field preference for search result sequence(in my case preference like title,url,tags,fields)?
Please help me into this?
This is called "boosting" . Prior to elasticsearch 5.0.0 - boosting could be applied in indexing phase or query phase( added as part of field mapping ). This feature is deprecated now and all mappings after 5.0 are applied in query time .
Current recommendation is to to use query time boosting.
Please read this documents to get details on how to use boosting:
1 - https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html
2 - https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with ElasticSearch language analyzer. I am working on Lithuanian language, so I am using Lithuanian language analyzer. Analyzer works fine and I got all word cases I need. For example, I index Lithuania city "Klaipėda":
PUT /cities/city/1
{
"name": "Klaipėda"
}
Problem is that I also need to get a result, when I am searching "Klaipėda" only in Latin alphabet ("Klaipeda") and in all Lithuanian cases:
Nomanitive case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" - works, but "Klaipeda", "Klaipedos", "Klaipedoje" - not.
My index:
PUT /cities
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"fields": {
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"md_folded_analyzer": {
"type": "lithuanian",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
and search query:
GET /cities/_search
{
"query": {
"multi_match" : {
"type": "most_fields",
"query": "klaipeda",
"fields": [ "name", "name.folded" ]
}
}
}
What I am doing wrong? Thanks for help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't perform search against it - you can perform only sorting by name.folded and aggregation.
To make a way round this I've come up with the following set-up:
Separate fields set-up (to eliminate duplicates - just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"copy_to": "folded",
},
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}'
Change the type of your analyzer to custom as it described here, because otherwise the asciifolding is not got into the config. And more important - asciifolding should go after all stemming / stop-words in Lithuanian language, because after folding the word can miss desired sense.
curl -XPUT http://localhost:9200/my_cities -d '
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"md_folded_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_stemmer",
"asciifolding"
]
}
}
}
}
}
Sorry I've eliminated lithuanian_keywords - it requires additional set-up, which I missed here. But I hope you've got the idea.

Elasticsearch: Merge result of aggregation by bucket key

I've indexed entities in Elasticsearch, which occur in my documents. The mapping for the entities looks like the following:
"Entities": {
"properties": {
"EntFrequency": {
"type": "long"
},
"EntId": {
"type": "long"
},
"EntType": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"Entname": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
},
[...]
Furthermore, I use this aggregation query to determine the most-occurring entities:
GET cable/document/_search
{
"size" :0,
"query": {
"match_all": {}
},
"aggs" : {
"entities_agg" : {
"terms" : {
"field" : "Entities.EntId"
}
}
}
}
}
Response
"buckets": [
{
"key": 323644,
"doc_count": 231038
},
[...]
However, some of those entity mentions refer to the same entity e.g. "USA" and "United States" and I do know their ids. How do I merge the buckets and the counts of these duplicates in ES?
I cannot use a client-side solution since there are too many entities and retrieving all of them and merging would be probably too slow for my application. The knowledge about duplicates is acquired through runtime. Thus, I cannot use this knowledge for the initial creation of my ES index.
Thanks for your help and comments!

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an ElasticSearch index using a custom analyzer which uses letter tokenizer and lower_case and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any result. When I tried the full-word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz and tried to search by sub-words again and it worked.
To try to understand what is going on, I thought I would check the terms generated for my documents using _termvector service, and the result was identical for both, the underscore-separated sub-words and the dash-separated sub-words, so really I expect the result of searching to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, this is the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}

ElasticSearch - Different results for searching same index with 5 or 1 shards

I am running a single node with ElasticSearch (1.4.4) and I have two indexes with the same documents indexed. The only difference is in the number of shards:
Index 1: 5 shards, 0 replicas
Index 2: 1 shard, 0 replicas
When I run a test of 100 different queries using the Java API, I get different results for each index. I have tried using DFS Query Then Fetch for both of them, but I still get different results.
I am wondering what's happening. I am not sure how elasticsearch is performing the queries in order to get different results.
UPDATE:
These are the settings and mappings for my indexes:
"settings": {
"number_of_shards" : 1, // or 5 for the other index
"number_of_replicas" : 0,
"analysis": {
"filter": {
"worddelimiter": {
"type": "word_delimiter"
}
},
"analyzer": {
"custom_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding",
"worddelimiter"
],
"tokenizer": "standard"
}
}
}
}
"mappings": {
"faq": {
"properties": {
"answer": {
"type": "string",
"analyzer": "custom_analyzer"
},
"answer_id": {
"type": "long"
},
"path": {
"type": "string",
"analyzer": "custom_analyzer"
},
"question": {
"type": "string",
"analyzer": "custom_analyzer"
}
}
}
}
And this is the query:
client.prepareSearch(index)
.setTypes(type)
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.setQuery(QueryBuilders.multiMatchQuery(inputText, "questions","answer^2","paths")
.setSize(BUFFER_SIZE)
.setFrom(i * BUFFER_SIZE)
.execute()
.actionGet();

Resources