Elasticsearch: Need offset of exact matching string

I have HTML files and I need to find the section around an exact matching string, say "ANNUAL REPORT PURSUANT". I am using the latest version of Elasticsearch, 5.4.0, and I am new to Elasticsearch. For indexing I have defined an analyzer as below:
{
  "index_name": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "index_name",
        "creation_date": "1496927173220",
        "analysis": {
          "analyzer": {
            "contact_section_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "pattern": "(ANNUAL REPORT PURSUANT)",
              "type": "pattern",
              "group": "1"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "vF3cAe-STJW-GrVxc7N8ww",
        "version": {
          "created": "5040099"
        }
      }
    }
  }
}
Now I am trying to get the offsets using _analyze as below:
POST localhost:9200/sag_sec_items6/_analyze?pretty
{
  "analyzer": "contact_section_analyzer",
  "text": "my_html_file_contents_already_indexed"
}
It returns:
{
"tokens": []
}
I checked the HTML files and they do contain that text.
Using a _search query with individual _ids I get the whole HTML file back.
How can I get the offsets, or the HTML tags containing that text?

I redefined my analyzer settings as below:
"settings": {
"analysis": {
"analyzer": {
"contact_section_start_analyzer": {
"char_filter": "html_strip",
"tokenizer": "contact_section_start_tokenizer"
}
},
"tokenizer": {
"contact_section_start_tokenizer": {
"flags": "CASE_INSENSITIVE|DOTALL",
"pattern": "\\b(annual\\s+report\\s+pursuant)\\b",
"type": "pattern",
"group": "1"
}
}
}
}
With this change to the regex pattern, and with the CASE_INSENSITIVE|DOTALL flags on the pattern tokenizer, I am able to get the offsets.
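To verify, the _analyze API now returns the matched token together with its start_offset and end_offset; a quick check against the index (the index name and sample text below are illustrative):
POST localhost:9200/index_name/_analyze?pretty
{
  "analyzer": "contact_section_start_analyzer",
  "text": "<html><body><p>ANNUAL REPORT PURSUANT to Section 13</p></body></html>"
}
The html_strip char filter is applied first, and the offsets in the returned token are corrected to point back into the original text.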

Related

Configure highlighted part in the elasticsearch

Main question
The user is looking for a name and enters part of it, let's say au, and the document with the text paul is found.
I would like to have the doc highlighted like p<em>au</em>l.
How can I achieve it if I have a complex search query (combination of match, prefix, wildcard to rule relevance)?
Sub question
When do the highlight settings from the documentation for type, boundary_scanner and boundary_chars come into play? As per my tests described below, these settings don't change the highlighted part.
Try 1: Wildcard query with default analyzer
PUT myindex
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
POST myindex/_doc/1
{
  "name": "paul"
}
GET myindex/_search
{
  "query": {
    "wildcard": {"name": "*au*"}
  },
  "highlight": {
    "fields": {
      "name": {}
    },
    "type": "fvh",
    "boundary_scanner": "chars",
    "boundary_chars": "abcdefghijklmnopqrstuvwxyz.,!? \t\n"
  }
}
This kind of search returns the highlight <em>paul</em>, but I need to get p<em>au</em>l.
Try 2: Match query with NGRAM analyzer
This one works as described in SO question: Highlighting part of word in elasticsearch
PUT myindexngram
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "index_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        },
        "search_term_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "index_ngram_analyzer",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
POST myindexngram/_doc/1
{
  "name": "paul"
}
GET myindexngram/_search
{
  "query": {
    "match": {"name": "au"}
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}
This highlights p<em>au</em>l as desired but:
Highlighting depends on the query type, so combining match and wildcard will again result in <em>paul</em>.
Highlighting is not affected at all by the type, boundary_scanner and boundary_chars settings.
Elastic version 7.13.4
Response from Elasticsearch team:
A highlighter works on terms, so only full terms can be highlighted - whatever the terms in your index are. In your second example, au could be highlighted, because it is a term in the index, which is not the case for your first example.
There is also an option to define your own highlight_query that could be different from the main query, but this could lead to unpredictable highlights.
https://discuss.elastic.co/t/configure-highlighted-part/295164
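Following up on the highlight_query option mentioned in that response, a sketch (reusing myindexngram from above, so this is illustrative rather than a confirmed recipe): the main query can combine match and wildcard while highlighting runs against a plain match, so the ngram terms still drive the highlighted part:
GET myindexngram/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"name": "au"}},
        {"wildcard": {"name": "*au*"}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "highlight_query": {
          "match": {"name": "au"}
        }
      }
    }
  }
}
As the Elasticsearch team notes, decoupling the highlight query from the main query like this can produce unpredictable highlights.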

is filter supported in elasticsearch 7 in analysis?

I have, roughly, the following settings in my Elasticsearch index:
{
  "pro_product_nanco_202111": {
    "settings": {
      "index": {
        "max_ngram_diff": "49",
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "pro_product_nanco_202111",
        "max_shingle_diff": "4",
        "creation_date": "1645513903046",
        "analysis": {
          "filter": {
            "searchkick_suggest_shingle": {
              "max_shingle_size": "5",
              "type": "shingle"
            }
          },
          "analyzer": {
            "searchkick_search": {
              "filter": [
                "lowercase",
                "asciifolding",
                "searchkick_search_shingle"
              ]
            }
          },
          "char_filter": {
            "ampersand": {
              "type": "mapping",
              "mappings": [
                "&=> and "
              ]
            }
          }
        },
        ....
      }
    }
  }
}
I am, however, unable to find any equivalent filter option in Elasticsearch, i.e. a filter option that is a direct descendant of analysis. Is this a property that has been deprecated in ES or replaced with normalizer? The above settings were created in ES 5 or 6, and I am currently migrating to ES 7. I basically just want to know if filter is still a valid param in analysis when creating settings.
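For what it's worth, a filter block under analysis (for token filters) is still accepted when creating an ES 7 index; a minimal sketch, with illustrative names:
PUT my_index_7
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "max_shingle_size": 5
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_shingle"]
        }
      }
    }
  }
}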

How to create and add values to a standard lowercase analyzer in elastic search

I've been around the houses with this for the past few days, trying things in various orders, but can't figure out why it's not working.
I am trying to create an index in Elasticsearch with an analyzer which is the same as the "standard" analyzer but retains upper case characters when records are stored.
I create my analyzer and index as follows:
PUT /upper
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "rebuilt_standard": {
            "tokenizer": "standard",
            "filter": [
              "standard"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "rebuilt_standard"
        }
      }
    }
  }
}
Then add two records to test like this...
POST /upper/doc
{
  "text" : "TEST"
}
Add a second record...
POST /upper/doc
{
  "text" : "test"
}
Using /upper/_settings gives the following:
{
  "upper": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "upper",
        "creation_date": "1537788581060",
        "analysis": {
          "analyzer": {
            "rebuilt_standard": {
              "filter": [
                "standard"
              ],
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "s4oDgdsFTxOwsdRuPAWEkg",
        "version": {
          "created": "6030299"
        }
      }
    }
  }
}
But when I search with the following query I still get two matches, both the upper and lower case records, which must mean the analyser is not applied when I store the records.
Search like so...
GET /upper/_search
{
  "query": {
    "term": {
      "text": {
        "value": "test"
      }
    }
  }
}
Thanks in advance!
First things first: you set your analyzer on the title field instead of on the text field (your search is on the text property, and you are indexing docs with only a text property):
"properties": {
"title": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
try
"properties": {
"text": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
and keep us posted ;)
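A quick way to check what actually gets indexed is to run the analyzer directly (assuming the corrected mapping above). Note that rebuilt_standard has no lowercase filter, so the token keeps its case:
GET /upper/_analyze
{
  "analyzer": "rebuilt_standard",
  "text": "TEST"
}
This returns the single token TEST, so a term query for test should only match the document that was indexed from lower case input.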

Elasticsearch index analyzers seem to do nothing after being added

New to ES and following the docs (https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html) on using different analyzers to deal with human language. After following some of the examples, it appears as though the added analyzers are having no effect on searches at all. E.g.
## init some index for testing
PUT /testindex
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 3,
    "analysis": {},
    "refresh_interval": "1s"
  },
  "mappings": {
    "testtype": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}
## adding some analyzers for...
POST /testindex/_close
##... simple lowercase tokenization, ...(https://www.elastic.co/guide/en/elasticsearch/guide/current/lowercase-token-filter.html#lowercase-token-filter)
PUT /testindex/_settings
{
  "analysis": {
    "analyzer": {
      "my_lowercaser": {
        "tokenizer": "standard",
        "filter": [ "lowercase" ]
      }
    }
  }
}
## ... normalization (https://www.elastic.co/guide/en/elasticsearch/guide/current/algorithmic-stemmers.html#_using_an_algorithmic_stemmer), ...
PUT testindex/_settings
{
  "analysis": {
    "filter": {
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "light_english_stemmer": {
        "type": "stemmer",
        "language": "light_english"
      },
      "english_possessive_stemmer": {
        "type": "stemmer",
        "language": "possessive_english"
      }
    },
    "analyzer": {
      "english": {
        "tokenizer": "standard",
        "filter": [
          "english_possessive_stemmer",
          "lowercase",
          "english_stop",
          "light_english_stemmer",
          "asciifolding"
        ]
      }
    }
  }
}
## ... and using a hunspell dictionary (https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html#hunspell)
PUT testindex/_settings
{
  "analysis": {
    "filter": {
      "en_US": {
        "type": "hunspell",
        "language": "en_US"
      }
    },
    "analyzer": {
      "en_US": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "en_US"
        ]
      }
    }
  }
}
POST /testindex/_open
GET testindex/_settings
## it appears as though the analyzers have been added without problem
## adding some testing data
POST /testindex/testtype
{
  "title": "Will the root word of movement be found?"
}
POST /testindex/testtype
{
  "title": "That's why I never want to hear you say, ehhh I waant it thaaat away."
}
## expecting to match against root word of movement (move)
GET /testindex/testtype/_search
{
  "query": {
    "match": {
      "title": "moving"
    }
  }
}
## which returns 0 hits, as shown below
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
## ... yet I can see that the record expected does in fact exist in the index when using...
GET /testindex/testtype/_search
{
  "query": {
    "match_all": {}
  }
}
Thinking then that I need to actually "add" the analyzer to a (new) field, I do the following (which still shows negative results)
# adding the analyzers to a new field
POST /testindex/testtype
{
  "mappings": {
    "properties": {
      "title2": {
        "type": "text",
        "analyzer": [
          "my_lowercaser",
          "english",
          "en_US"
        ]
      }
    }
  }
}
# looking at the tokens I'd expect to be able to find
GET /testindex/_analyze
{
  "analyzer": "en_US",
  "text": "Moving between directories"
}
# moving, move, between, directory
# what I actually see
GET /testindex/_analyze
{
  "field": "title2",
  "text": "Moving between directories"
}
# moving, between, directories
Even trying something simpler like
POST /testindex/testtype
{
  "mappings": {
    "properties": {
      "title2": {
        "type": "text",
        "analyzer": "en_US"
      }
    }
  }
}
does not help at all.
So this seems very messed up. Am I missing something here about how these analyzers are supposed to work? Should these analyzers be working properly (based on the provided info) and I am simply misusing them here? If so, could someone please provide an example query that would actually work/hit?
Is there other debugging information that should be added here?
The title2 field lists 3 analyzers, but according to your output from the _analyze endpoint it seems that only my_lowercaser is applied.
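(As an aside, a text field takes a single analyzer, not a list, and mappings are changed through the _mapping endpoint rather than by POSTing a document, which is what the snippets above actually do; a sketch for this pre-7 index, keeping the testtype type name:)
PUT /testindex/_mapping/testtype
{
  "properties": {
    "title2": {
      "type": "text",
      "analyzer": "en_US"
    }
  }
}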
Finally, the config that worked for me with hunspell is:
"settings": {
"analysis": {
"filter": {
"en_US": {
"type": "hunspell",
"language": "en_US"
}
},
"analyzer": {
"en_US": {
"tokenizer": "standard",
"filter": [ "lowercase", "en_US" ]
}
}
}
}
"mappings": {
"_doc": {
"properties": {
"title-en-us": {
"type": "text",
"analyzer": "en_US"
}
}
}
}
movement is not resolved to move while moving is (probably related to the hunspell dictionary). Querying with move resulted in docs containing moving only, but not movement.
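In other words, with that mapping a match query for move (field name taken from the config above) returns the moving document but not the movement one:
GET testindex/_search
{
  "query": {
    "match": { "title-en-us": "move" }
  }
}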

ElasticSearch Reverse Wildcard Search

In ElasticSearch v5.2.2 I can search for "Jo*" using a wildcard query and it will match the indexed value containing "Joseph".
But what if my index also has the values "Joseph", "Jo", "Jos", "Jose" and "Josep", and I want to reverse the query?
How can I find "Jo", "Jos", "Jose" and "Josep" in the index using the string "Joseph" as the search criteria?
That's possible, but you need to create an edgeNGram search analyzer in your index settings.
First create the settings like this. The name field will be indexed with the standard analyzer but searched with your custom prefix_search analyzer instead.
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "prefix_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "prefix"
          ]
        }
      },
      "filter": {
        "prefix": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",
          "search_analyzer": "prefix_search"
        }
      }
    }
  }
}
Then if you create a document like this:
PUT test/doc/1
{
  "name": "Jos"
}
You can find it with a query like this one:
POST /test/doc/_search
{
  "query": {
    "match": {
      "name": "Joseph"
    }
  }
}
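To see why this matches, you can run the query string through the search analyzer; the edge n-grams of joseph include jos, which is exactly the term indexed for the document:
GET test/_analyze
{
  "analyzer": "prefix_search",
  "text": "Joseph"
}
This returns the tokens j, jo, jos, jose, josep and joseph.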
