How to store what is generated by the analyser? - elasticsearch

Let's say that I use this mapping:
PUT test
{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
},
"mappings": {
"testtype": {
"properties": {
"content": {
"type": "text",
"analyzer": "english",
"store": true
}
}
}
}
}
Now I can index a document:
PUT test/testtype/0
{
"content": "The Quick Brown Box"
}
And I can retrieve it:
GET test/testtype/0
Which will return me:
{
"_index": "test",
"_type": "testtype",
"_id": "0",
"_version": 1,
"found": true,
"_source": {
"content": "The Quick brown Fox"
}
}
I know that in the source field you are supposed to only have the document that you inserted this is why I specified in my mapping that I want to store my content field. So by querying my store field I would expect to have in it what is generated my the analyser so something like this:
"quick brown fox"
But when I query the stored field:
GET test/testtype/_search
{
"stored_fields": "content"
}
I have exactly what I wrote in my document:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "testtype",
"_id": "0",
"_score": 1,
"fields": {
"content": [
"The Quick brown Fox"
]
}
}
]
}
}
So my question is how can I store in my elasticsearch the result of what is generated by my analyser?

You question is about the difference between the stored text and the generated tokens:
the store attribute of a lucene field
A stored field contains exactly the same as the corresponding field in the "_source"-JSON.
The generated token are in a lucene internal representation. But you can use the _analyze or _termvectors endpoint to have see the token
or you can use the term-aggregation

You can set index or query time analyzer. If you are using index time analyzer, then the analyzed text will be stored.
More details: https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html
Another way is using multifields. That means you have the original, and the processed text as well.
More details: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

Related

Hi im storing a entire document data in to a document and searching the document data to match boolean search i.e java and angular and (html or css)

I'm storing an entire document data into a document and searching the document data to match boolean search i.e java and angular and (HTML or CSS). using query string query in elastic search it returning the relevant documents also but I need the documents only matches the boolean query string
{
"query": {
"query_string" : {
"query" : "(java or elasticsearch) and spring",
"default_field" : "author"
}
}
}
{
"took": 92,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.030296518,
"hits": [
{
"_index": "ramesh",
"_type": "books",
"_id": "0",
"_score": 0.030296518,
"_source": {
"id": "0",
"title": null,
"author": "Java\n Angular\n React\n elasticsearch \nblocchain\njavascript\nspring boot\n\n",
"releaseDate": null
}
}'
{
"_index": "ramesh",
"_type": "books",
"_id": "1",
"_score": 0.010253567,
"_source": {
"id": "1",
"title": null,
"author": "Java angular react elasticsearch blocchain\n",
"releaseDate": null
}
}
]
}
}
** I require to get the documents which contain 'spring' and one from (java or elastic search)**
I don't want the relevant search results
I tried with your query, only issue i see is that OR and AND words should be capital.
{
"query": {
"query_string" : {
"query" : "(java OR elasticsearch) AND spring",
"default_field" : "author"
}
}}

Elastic Search Case Insensitive query with prefix query

I am new to elastic search. I have below query
GET deals2/_search
{
"size": 200,
"_source": ["acquireInfo"],
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": ["acquireInfo.company_name.keyword"],
"query": "az*"
}
}
]
}
}
}
Here I want Elastic should gives results like case insensitive Like string start with below like
"Az"
"AZ"
"az"
"aZ"
"Az"
But I am not getting all results like this way. So Anyone can please help me on that.
Example:- I have 4 documents
1)Aziia Avto Ust-Kamenogorsk OOO
2)AZ Infotech Inc
3)AZURE Midstream Partners LP
4)State Oil Fund of the Republic of Azerbaijan
Now searching on az , should return only first 3 docs as they start with az ignoring case here and not the 4th one, which also has az but not at the beginning.
This is happening as you are using the keyword field to index the company_name in your application.
The keyword analyzer is a “noop” analyzer which returns the entire input string as a single token for example, company name, consist of foo, Foo, fOo will be stored with case only and searching for foo, will only match foo as elastic search ultimately works on tokens match(which is case sensitive).
What you need is to use a standard analyzer or some other custom analyzer which solves your other use-cases as well and uses lowercase token filter on the field and use the match query which is analyzed, and uses the same analyzer which is used to index the field, this way your search query will generate the same tokens, which is stored in the index and your search will become case-insensitive.
Edit: Had a discussion with the user in chat and updating the answer to suit his requirements, which are below:-
Step 1:- Define settings and mapping for index.
Endpoint :- http://{{hostname}}:{{port}}/{{index}}
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": [],
"filter": "lowercase"
}
}
}
},
"mappings": {
"properties": {
"company_name": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
Step 2: Index all the documents
Endpoint: http://{{hostname}}:{{port}}/{{index}}/_doc/ --> 1,2,3,4 etc
{
"company_name" : "State Oil Fund of the Republic of Azerbaijan"
}
Step3 :- Search query
Endpoint:- http://{{hostname}}:{{port}}/{{index}}/_search
{ "query": {
"prefix" : { "company_name" : "az" }
}
}
This would bring the below expected results:-
{
"took": 870,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "prerfixsearch",
"_type": "_doc",
"_id": "2ec9df0fc-dc04-47bb-914f-91a9f20d09efd15f2506-293f-4fb2-bdc3-925684a930b5",
"_score": 1,
"_source": {
"company_name": "AZ Infotech Inc"
}
},
{
"_index": "prerfixsearch",
"_type": "_doc",
"_id": "160d01183-a308-4408-8ac1-a85da950f285edefaca2-0b68-41c6-ba34-21bbef57f84f",
"_score": 1,
"_source": {
"company_name": "Aziia Avto Ust-Kamenogorsk OOO"
}
},
{
"_index": "prerfixsearch",
"_type": "_doc",
"_id": "1da878175-7db5-4332-baa7-ac47bd39b646f81c1770-7ae1-4536-baed-0a4f6b20fa38",
"_score": 1,
"_source": {
"company_name": "AZURE Midstream Partners LP"
}
}
]
}
}
Explanation:, As earlier OP didn;t mention the exclusion of 4th doc in the search result, that's the reason I suggested creating a text field, so that individuals tokens are generated but now as requirement is only the prefix search, we don't need the individual tokens and we would want only 1 token but it should be lowercased to support the case insensitive search, that's the reason I applied the custom normalizer on company_name field.

Elasticsearch query with fuzziness AUTO not working as expected

From the Elasticsearch documentation regarding fuzziness:
AUTO
Generates an edit distance based on the length of the term. Low and high distance arguments may be optionally provided AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6 that make for lengths:
0..2
Must match exactly
3..5
One edit allowed
>5
Two edits allowed
However, when I am trying to specify low and high distance arguments in the search query the result is not what I am expecting.
I am using Elasticsearch 6.6.0 with the following index mapping:
{
"fuzzy_test": {
"mappings": {
"_doc": {
"properties": {
"description": {
"type": "text"
},
"id": {
"type": "keyword"
}
}
}
}
}
}
Inserting a simple document:
{
"id": "1",
"description": "hello world"
}
And the following search query:
{
"size": 10,
"timeout": "30s",
"query": {
"match": {
"description": {
"query": "helqo",
"fuzziness": "AUTO:7,10"
}
}
}
}
I assumed that fuzziness:AUTO:7,10 would mean that for the input term with length <= 6 only documents with the exact match will be returned. However, here is a result of my query:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.23014566,
"hits": [
{
"_index": "fuzzy_test",
"_type": "_doc",
"_id": "OQtUu2oBABnEwrgM3Ejr",
"_score": 0.23014566,
"_source": {
"id": "1",
"description": "hello world"
}
}
]
}
}
This is strange but seems like that bug exists only in version the Elasticsearch 6.6.0. I've tried 6.4.2 and 6.6.2 and both of them work just fine.

Elasticsearch: get multiple specified documents in one request?

I am new to Elasticsearch and hope to know whether this is possible.
Basically, I have the values in the "code" property for multiple documents. Each document has a unique value in this property. Now I have the codes of multiple documents and hope to retrieve them in one request by supplying multiple codes.
Is this doable in Elasticsearch?
Regards.
Edit
This is the mapping of the field:
"code" : { "type" : "string", "store": "yes", "index": "not_analyzed"},
Two example values of this property:
0Qr7EjzE943Q
GsPVbMMbVr4s
What is the ES syntax to retrieve the two documents in ONE request?
First, you probably don't want "store":"yes" in your mapping, unless you have _source disabled (see this post).
So, I created a simple index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"code": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
added the two docs with the bulk API:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"code":"0Qr7EjzE943Q"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"code":"GsPVbMMbVr4s"}
There are a number of ways I could retrieve those two documents. The most straightforward, especially since the field isn't analyzed, is probably a with terms query:
POST /test_index/_search
{
"query": {
"terms": {
"code": [
"0Qr7EjzE943Q",
"GsPVbMMbVr4s"
]
}
}
}
both documents are returned:
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.04500804,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.04500804,
"_source": {
"code": "0Qr7EjzE943Q"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.04500804,
"_source": {
"code": "GsPVbMMbVr4s"
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/a3e3e4f05753268086a530b06148c4552bfce324

Does an empty field in a document take up space in elasticsearch?

Does an empty field in a document take up space in elasticsearch?
For example, in the case below, is the total amount of space used to store the document the same in Case A as in Case B (assuming the field "colors" is defined in the mapping).
Case A
{"features":
"price": 1,
"colors":[]
}
Case B
{"features":
"price": 1,
}
If you keep the default settings, the original document is stored in the _source field, there will be a difference as the original document of case A is bigger than case B.
Otherwise, there should be no difference : for case A, no term is added in the index for the colors field as it's empty.
You can use the _size field to see the size of the original document indexed, which is the size of the _source field :
POST stack
{
"mappings":{
"features":{
"_size": {"enabled":true, "store":true},
"properties":{
"price":{
"type":"byte"
},
"colors":{
"type":"string"
}
}
}
}
}
PUT stack/features/1
{
"price": 1
}
PUT stack/features/2
{
"price": 1,
"colors": []
}
POST stack/features/_search
{
"fields": [
"_size"
]
}
The last query will output this result, which shows than document 2 takes more space than 1:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "features",
"_id": "1",
"_score": 1,
"fields": {
"_size": 16
}
},
{
"_index": "stack",
"_type": "features",
"_id": "2",
"_score": 1,
"fields": {
"_size": 32
}
}
]
}
}

Resources