I have an index full of keywords, and I want to use it to extract those keywords from an input text.
Below is a sample of the keyword index. Note that a keyword can consist of multiple words; essentially, they are unique tags.
{
"hits": {
"total": 2000,
"hits": [
{
"id": 1,
"keyword": "thousand eyes"
},
{
"id": 2,
"keyword": "facebook"
},
{
"id": 3,
"keyword": "superdoc"
},
{
"id": 4,
"keyword": "quora"
},
{
"id": 5,
"keyword": "your story"
},
{
"id": 6,
"keyword": "Surgery"
},
{
"id": 7,
"keyword": "lending club"
},
{
"id": 8,
"keyword": "ad roll"
},
{
"id": 9,
"keyword": "the honest company"
},
{
"id": 10,
"keyword": "Draft kings"
}
]
}
}
Now, if I input the text "I saw the news of lending club on facebook, your story and quora", the output of the search should be ["lending club", "facebook", "your story", "quora"]. The search should also be case-insensitive.
There's just one real way to do this. You'll have to index your data as keywords and search it analyzed with shingles.
See this reproduction:
First, we'll create two custom analyzers: keyword and shingles:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
},
"my_analyzer_shingle": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"shingle"
]
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"keyword": {
"type": "string",
"index_analyzer": "my_analyzer_keyword",
"search_analyzer": "my_analyzer_shingle"
}
}
}
}
}
Now let's create some sample data using what you gave us:
POST /test/your_type/1
{
"id": 1,
"keyword": "thousand eyes"
}
POST /test/your_type/2
{
"id": 2,
"keyword": "facebook"
}
POST /test/your_type/3
{
"id": 3,
"keyword": "superdoc"
}
POST /test/your_type/4
{
"id": 4,
"keyword": "quora"
}
POST /test/your_type/5
{
"id": 5,
"keyword": "your story"
}
POST /test/your_type/6
{
"id": 6,
"keyword": "Surgery"
}
POST /test/your_type/7
{
"id": 7,
"keyword": "lending club"
}
POST /test/your_type/8
{
"id": 8,
"keyword": "ad roll"
}
POST /test/your_type/9
{
"id": 9,
"keyword": "the honest company"
}
POST /test/your_type/10
{
"id": 10,
"keyword": "Draft kings"
}
And finally, the query to run the search:
POST /test/your_type/_search
{
"query": {
"match": {
"keyword": "I saw the news of lending club on facebook, your story and quora"
}
}
}
And this is the result:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.009332742,
"hits": [
{
"_index": "test",
"_type": "your_type",
"_id": "2",
"_score": 0.009332742,
"_source": {
"id": 2,
"keyword": "facebook"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "7",
"_score": 0.009332742,
"_source": {
"id": 7,
"keyword": "lending club"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "4",
"_score": 0.009207102,
"_source": {
"id": 4,
"keyword": "quora"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "5",
"_score": 0.0014755741,
"_source": {
"id": 5,
"keyword": "your story"
}
}
]
}
}
So what does it do behind the scenes?
It indexes your documents as whole keywords (the keyword tokenizer emits the whole string as a single token). I've also added the asciifolding filter, which normalizes letters (e.g. é becomes e), and the lowercase filter, which makes the search case-insensitive. So, for instance, Draft kings is indexed as draft kings.
The search analyzer uses the same filters, except that its tokenizer emits individual word tokens, and on top of that the shingle filter creates shingles (combinations of adjacent tokens), which match the keywords indexed in the first step.
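To make the mechanism concrete, here is a minimal Python sketch that imitates the two analyzers outside Elasticsearch. It is only an approximation: `fold`, `shingles` and `extract_keywords` are hypothetical helper names, the regex split is a rough stand-in for the standard tokenizer, and `max_size=2` assumes the shingle filter's default of two-word shingles with single tokens kept.

```python
import re
import unicodedata


def fold(text):
    """Lowercase and strip accents (rough stand-in for lowercase + asciifolding)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def shingles(tokens, max_size=2):
    """Return single tokens plus every run of up to max_size adjacent tokens."""
    out = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def extract_keywords(text, keywords, max_size=2):
    """Match shingles of the input text against whole, normalized keywords."""
    # "Index time": each keyword is kept as one normalized token.
    indexed = {fold(keyword): keyword for keyword in keywords}
    # "Search time": split into words, then build shingles from them.
    tokens = re.findall(r"\w+", fold(text))
    found = []
    for term in shingles(tokens, max_size):
        original = indexed.get(term)
        if original is not None and original not in found:
            found.append(original)
    return found


KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]
TEXT = "I saw the news of lending club on facebook, your story and quora"

print(extract_keywords(TEXT, KEYWORDS))
```

One thing the sketch makes visible: with two-word shingles, a three-word keyword such as "the honest company" is never produced on the search side, so it can only match if the maximum shingle size is raised (`max_size=3` here, `max_shingle_size` on a custom shingle filter in Elasticsearch).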
I'm trying to set up a basic Elasticsearch index locally, using Kibana. I am able to get all results when I do a match_all search, but I've tried many variations of a simple match query and none of them work.
My mapping:
{
"recipes-v1": {
"mappings": {
"dynamic": "false",
"properties": {
"description": {
"type": "keyword"
},
"ingredients": {
"type": "text"
},
"instructions": {
"type": "keyword"
},
"title": {
"type": "keyword"
}
}
}
}
}
Results from a match_all query:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "recipes-v1",
"_id": "0",
"_score": 1,
"_source": {
"Name": "Alfredo Sauce",
"Description": "Cheesy alfredo sauce that is delicious. Definitely not vegan",
"Ingredients": [
"1/2 cup butter",
"3 cloves garlic"
],
"Instructions": [
"Melt butter in a saucepan then add heavy cream and combine on medium low heat",
"Let the mixture simmer for 2 minutes then add garlic, salt, pepper, and italian seasoning to taste. Let simmer until fragrent (about 1 minute)"
]
}
},
{
"_index": "recipes-v1",
"_id": "1",
"_score": 1,
"_source": {
"Name": "Shrimp Scampi",
"Description": "Definitely not just Gordon Ramsay's shrimp scampi minus capers",
"Ingredients": [
"1 lb shrimp",
"2 lemons"
],
"Instructions": [
"Do things",
"Do more things"
]
}
}
]
}
}
I've tried deleting the index and recreating it, and every variation of Alfredo, alfredo, alfredo sauce, AlfredoSauce, etc., and none have worked. Please help.
All variations of these queries yield no hits, though:
POST recipes-v1/_search
{
"query": {
"match": {
"title": {
"query": "alfredo"
}
}
}
}
POST recipes-v1/_search
{
"query": {
"bool": {
"should": {
"match": {
"name": "alfredo"
}
}
}
}
}
EDIT/UPDATE:
I changed the document fields to be all lowercase and the problem persists. However, if I set dynamic mapping to true with a new index, everything works. The mapping is now the following and works, but I would still like to know why my static mapping did not work, as eventually I'd want to make this static.
{
"recipes-v1": {
"mappings": {
"properties": {
"description": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"ingredients": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"instructions": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Your documents contain field names that are capitalized, i.e. Description, Ingredients, etc.
Your mapping contains the same field names but lowercased, i.e. description, ingredients, etc., and has dynamic mapping disabled (dynamic: false), so new fields will not be created and indexed dynamically.
You need to change either your mapping or your documents so that both have exactly the same field names.
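A quick way to spot this kind of mismatch is to compare the document's field names against the mapping's properties with an exact, case-sensitive comparison. The sketch below does that in plain Python; `unmapped_fields` is a hypothetical helper, and the two dictionaries are trimmed copies of the mapping and the first document from the question.

```python
def unmapped_fields(mapping_properties, document):
    """Return the document fields that have no exact (case-sensitive) match in the mapping."""
    return sorted(field for field in document if field not in mapping_properties)


# Properties from the static mapping in the question (dynamic: false).
MAPPING_PROPERTIES = {
    "description": {"type": "keyword"},
    "ingredients": {"type": "text"},
    "instructions": {"type": "keyword"},
    "title": {"type": "keyword"},
}

# First document from the match_all result, shortened.
DOCUMENT = {
    "Name": "Alfredo Sauce",
    "Description": "Cheesy alfredo sauce that is delicious. Definitely not vegan",
    "Ingredients": ["1/2 cup butter", "3 cloves garlic"],
    "Instructions": ["Melt butter in a saucepan"],
}

# Every field listed here stays in _source but is not indexed, so it is not searchable.
print(unmapped_fields(MAPPING_PROPERTIES, DOCUMENT))
```

Here all four document fields come back as unmapped, which is consistent with match_all returning the documents while every match query finds nothing. As a side note, title is mapped as keyword in the static mapping, so even with matching names a match on alfredo would need the whole exact value; the dynamic mapping that worked uses text.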
I have a lot of keywords that I want to extract from a query, and I also want to know the position (offset) of each keyword in that text.
This is my progress so far. I created two custom analyzers, keyword and shingle:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
},
"my_analyzer_shingle": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"shingle"
]
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"keyword": {
"type": "string",
"index_analyzer": "my_analyzer_keyword",
"search_analyzer": "my_analyzer_shingle"
}
}
}
}
}
And here are the keywords that I have:
{
"hits": {
"total": 2000,
"hits": [
{
"id": 1,
"keyword": "python programming"
},
{
"id": 2,
"keyword": "facebook"
},
{
"id": 3,
"keyword": "Microsoft"
},
{
"id": 4,
"keyword": "NLTK"
},
{
"id": 5,
"keyword": "Natural language processing"
}
]
}
}
And I run a query like this:
{
"query": {
"match": {
"keyword": "I post a lot of things on Facebook and quora"
}
}
}
So with the code above I get:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.009332742,
"hits": [
{
"_index": "test",
"_type": "your_type",
"_id": "2",
"_score": 0.009332742,
"_source": {
"id": 2,
"keyword": "facebook"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "4",
"_score": 0.009207102,
"_source": {
"id": 4,
"keyword": "quora"
}
}
]
}
}
But I don't know where in the text those words are, i.e. the offset of each word.
I want to know that quora starts at index 40, but without highlighting the words between tags or anything like that.
I want to mention that my post is based on this post:
Extract keywords (multi word) from text using elastic search
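To pin down what is being asked for, here is a small Python sketch that computes such offsets on the client side, once the matching keywords are known. It is not part of the linked post and does not use Elasticsearch at all; `keyword_offsets` is a hypothetical helper, and it assumes a case-insensitive whole-word match is what is wanted.

```python
import re


def keyword_offsets(text, keywords):
    """Return (keyword, start, end) for each case-insensitive whole-word occurrence."""
    results = []
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            # start is 0-based, end is exclusive (Python slicing convention)
            results.append((keyword, match.start(), match.end()))
    return sorted(results, key=lambda item: item[1])


TEXT = "I post a lot of things on Facebook and quora"
print(keyword_offsets(TEXT, ["facebook", "quora"]))
```

The offsets are 0-based, so quora is reported as starting at 39, which corresponds to position 40 when counting characters from 1.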
I have a weird problem with Elasticsearch 6.0.
I have an index with the following mapping:
{
"cities": {
"mappings": {
"cities": {
"properties": {
"city": {
"properties": {
"id": {
"type": "long"
},
"name": {
"properties": {
"en": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"it": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"slug": {
"properties": {
"en": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"it": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
},
"doctype": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"suggest": {
"type": "completion",
"analyzer": "accents",
"search_analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": false,
"max_input_length": 50
},
"weight": {
"type": "long"
}
}
}
}
}
}
I have these documents in my index:
{
"_index": "cities",
"_type": "cities",
"_id": "991-city",
"_version": 128,
"found": true,
"_source": {
"doctype": "city",
"suggest": {
"input": [
"nazaré",
"nazare",
"나자레",
"najare",
"najale",
"ナザレ",
"Ναζαρέ"
],
"weight": 1807
},
"weight": 3012,
"city": {
"id": 991,
"name": {
"en": "Nazaré",
"it": "Nazaré"
},
"slug": {
"en": "nazare",
"it": "nazare"
}
}
}
}
{
"_index": "cities",
"_type": "cities",
"_id": "1085-city",
"_version": 128,
"found": true,
"_source": {
"doctype": "city",
"suggest": {
"input": [
"nazareth",
"nazaret",
"拿撒勒",
"na sa le",
"sa le",
"le",
"na-sa-lei",
"나사렛",
"nasares",
"nasales",
"ナザレス",
"nazaresu",
"नज़ारेथ",
"nj'aareth",
"aareth",
"najaratha",
"Назарет",
"Ναζαρέτ",
"názáret",
"nazaretas"
],
"weight": 1809
},
"weight": 3015,
"city": {
"id": 1085,
"name": {
"en": "Nazareth",
"it": "Nazareth"
},
"slug": {
"en": "nazareth",
"it": "nazareth"
}
}
}
}
Now, when I search using the suggester, with the following query:
POST /cities/_search
{
"suggest":{
"suggest":{
"prefix":"nazare",
"completion":{
"field":"suggest"
}
}
}
}
I expect to have both documents in my results, but I only get the second one (nazareth) back:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0.0,
"hits": []
},
"suggest": {
"suggest": [
{
"text": "nazare",
"offset": 0,
"length": 6,
"options": [
{
"text": "nazaresu",
"_index": "cities",
"_type": "cities",
"_id": "1085-city",
"_score": 1809.0,
"_source": {
"doctype": "city",
"suggest": {
"input": [
"nazareth",
"nazaret",
"拿撒勒",
"na sa le",
"sa le",
"le",
"na-sa-lei",
"나사렛",
"nasares",
"nasales",
"ナザレス",
"nazaresu",
"नज़ारेथ",
"nj'aareth",
"aareth",
"najaratha",
"Назарет",
"Ναζαρέτ",
"názáret",
"nazaretas"
],
"weight": 1809
},
"weight": 3015,
"city": {
"id": 1085,
"name": {
"en": "Nazareth",
"it": "Nazareth"
},
"slug": {
"en": "nazareth",
"it": "nazareth"
}
}
}
}
]
}
]
}
}
This is unexpected, because in the suggester input for the first document, the term that I searched for, "nazare", appears exactly as I typed it.
Another fun fact is that if I search for "najare" instead of "nazare" I get the correct results.
Any hint will be really appreciated!
For a quick solution, use the size parameter in the completion object of your query.
GET /cities/_search
{
"suggest":{
"suggest":{
"prefix":"nazare",
"completion":{
"field":"suggest",
"size": 100 <- HERE
}
}
}
}
The size parameter defaults to 5, so once Elasticsearch has found 5 terms (not documents) with the correct prefix, it stops looking for more terms (and consequently for more documents).
This limit is per term, not per document. So if one document contains 5 terms with the correct prefix and you use the default value of 5, the other documents may not be returned.
I strongly believe that this is what's happening in your case. The returned document has at least 5 suggest terms with the prefix nazare, so only this one is returned.
As for your fun fact: when you search for najare, only one term has the correct prefix, so you get the correct result.
The tricky thing is that the results depend on the order in which Elasticsearch retrieves the documents. If the first document had been retrieved first, it would not have reached the size threshold (only 2 or 3 prefix occurrences), the next document would also have been retrieved, and you would have gotten the correct result.
Also, unless necessary, avoid using a very high value (e.g. > 1000) for the size parameter. It might impact performance, particularly for short or common prefixes.
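The explanation above can be pictured with a toy model in Python. This is only an illustration of the reasoning in this answer, not how Elasticsearch is implemented: `suggest` is a hypothetical function that counts one entry per matching input term, keeps the top `size` entries by weight, and only then collapses them into documents. The accent stripping in `fold` is an assumed stand-in for the accents analyzer, and the input lists are shortened.

```python
import unicodedata


def fold(text):
    """Lowercase and strip accents (assumed stand-in for the 'accents' analyzer)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(docs, prefix, size=5):
    """Toy model: the size limit applies to matching terms, not to documents."""
    entries = [
        (doc["weight"], doc_id)
        for doc_id, doc in docs.items()
        for term in doc["input"]
        if fold(term).startswith(prefix)
    ]
    # Highest weight first; keep only the first `size` term entries.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    documents = []
    for _, doc_id in entries[:size]:
        if doc_id not in documents:
            documents.append(doc_id)
    return documents


DOCS = {
    "991-city": {"weight": 1807, "input": ["nazaré", "nazare", "najare", "najale"]},
    "1085-city": {
        "weight": 1809,
        "input": ["nazareth", "nazaret", "nasares", "nazaresu", "názáret", "nazaretas"],
    },
}

print(suggest(DOCS, "nazare"))            # default size of 5
print(suggest(DOCS, "nazare", size=100))  # larger size
print(suggest(DOCS, "najare"))
```

In this model the five matching terms of 1085-city fill the default limit on their own, so 991-city only appears once size is raised, and the najare search returns 991-city because only one term matches.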
I'm running into strange problems following the shingles example at https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html
When I try to index the three documents from that tutorial, only two of them get indexed; the document with ID 3 is never indexed.
The request POSTed to http://elastic:9200/myIndex/page/_bulk is:
{ "index": { "_id": 1 }}
{ "text": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "text": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "text": "Sue never goes anywhere without her alligator skin purse" }
But the response is:
{
"took": 18,
"errors": false,
"items": [
{
"index": {
"_index": "myIndex",
"_type": "page",
"_id": "1",
"_version": 1,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"status": 201
}
},
{
"index": {
"_index": "myIndex",
"_type": "page",
"_id": "2",
"_version": 1,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"status": 201
}
}
]}
Index and mappings definition:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 5,
"min_shingle_size": 2,
"output_unigrams": "false"
},
"filter_stop": {
"type": "stop"
}
},
"analyzer": {
"analyzer_shingle": {
"tokenizer": "standard",
"filter": ["standard", "lowercase", "filter_stop", "filter_shingle"]
}
}
}
},
"mappings": {
"page": {
"properties": {
"text": {
"type": "string",
"index_options": "offsets",
"analyzer": "standard",
"fields": {
"shingles": {
"search_analyzer": "analyzer_shingle",
"analyzer": "analyzer_shingle",
"type": "string"
}
}
},
"title": {
"type": "string",
"index_options": "offsets",
"analyzer": "standard",
"search_analyzer": "standard"
}
}
}
}}
When posting documents in bulk, you need to make sure to include a newline character after the last line, as explained in the official docs:
curl -XPOST http://elastic:9200/myIndex/page/_bulk -d '
{ "index": { "_id": 1 }}
{ "text": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "text": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "text": "Sue never goes anywhere without her alligator skin purse" }
' <--- new line
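If the bulk body is built programmatically, the same rule applies: every line, including the last one, has to end with a newline. A minimal Python sketch, where `bulk_body` is a hypothetical helper and only the standard json module is used:

```python
import json


def bulk_body(docs):
    """Build a newline-delimited bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # The trailing "\n" is the important part: without it the last
    # document line is not terminated.
    return "\n".join(lines) + "\n"


DOCS = [
    (1, {"text": "Sue ate the alligator"}),
    (2, {"text": "The alligator ate Sue"}),
    (3, {"text": "Sue never goes anywhere without her alligator skin purse"}),
]

body = bulk_body(DOCS)
print(repr(body[-20:]))
```

The resulting string can then be sent as the request body with whatever HTTP client is in use.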
I am using Elasticsearch version 1.2.1 and trying to sort results on an integer field, but it's not working the way I believe it should: I can't sort in either asc or desc order. I am using the Sense extension for Chrome, and Elasticsearch listens on port 9200 at localhost.
Here's how I defined index:
PUT keyword_test
Then I added mapping for keyword_test index:
PUT /keyword_test/_mapping/keyword
{
"keyword": {
"properties" : {
"id": {
"type": "string"
},
"search_date": {
"type": "string"
},
"keyword": {
"type": "string",
"index": "analyzed"
},
"count": {
"type": "integer"
}
}
}
}
Then I added different keywords with different counts and tried to search among them with the query below:
GET _search
{
"sort": {
"count": {
"order": "asc",
"ignore_unmapped": true
}
},
"query": {
"fuzzy": {
"keyword": "iphone"
}
}
}
I get the result below:
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "keyword_test",
"_type": "keyword",
"_id": "8",
"_score": null,
"_source": {
"id": 8,
"count": 9000,
"keyword": "iphone 5s",
"search_date": "2015-05-05"
},
"sort": [
9000
]
},
{
"_index": "keyword_test",
"_type": "keyword",
"_id": "10",
"_score": null,
"_source": {
"id": 10,
"count": 9500,
"keyword": "iphone 6 plus",
"search_date": "2015-05-05"
},
"sort": [
9500
]
},
{
"_index": "keyword_test",
"_type": "keyword",
"_id": "9",
"_score": null,
"_source": {
"id": 9,
"count": 9100,
"keyword": "iphone 6",
"search_date": "2015-05-05"
},
"sort": [
9100
]
}
]
}
The results should be in the order 9000, 9100, 9500, but they come back as 9000, 9500, 9100. I also get a SearchParseException if I remove ignore_unmapped. What should I do? Am I missing some mapping for the count field? Any help would be appreciated, thanks.