Extract keywords from fields

I want to write a query that analyzes one or more fields.
The current analyzers require text as input; instead of passing text, I want to pass a field value.
If I have a document like this:
{
"desc": "A document description",
"name": "This name is not original",
"amount": 3000
}
I would like to return something like the below
{
"desc": ["document", "description"],
"name": ["name", "original"],
"amount": 3000
}

You can use Term Vectors or Multi Term Vectors to achieve what you're looking for:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html
You have to specify the IDs of the documents you want, as well as the fields, and it will return the analyzed tokens for each of those documents, along with some other information that you can easily disable.
GET /exampleindex/_doc/_mtermvectors
{
"ids": [
"1","2"
],
"parameters": {
"fields": [
"*"
]
}
}
Will return something along the lines of:
"docs": [
{
"_index": "exampleindex",
"_type": "_doc",
"_id": "1",
"_version": 2,
"found": true,
"took": 0,
"term_vectors": {
"desc": {
"field_statistics": {
"sum_doc_freq": 5,
"doc_count": 2,
"sum_ttf": 5
},
"terms": {
"amazing": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 3,
"end_offset": 10
}
]
},
"an": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 2
}
]
}
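
To get from that response to the shape asked for in the question, the term names can be collected per field on the client side. The following is a minimal sketch, assuming the response has already been parsed into a Python dict; the function name terms_by_field and the sample data are illustrative, not part of Elasticsearch. Stop-word removal depends on the analyzer configured for the field, and fields for which no term vectors are returned (probably including numeric fields such as amount) would simply not appear.

```python
def terms_by_field(mtermvectors_response):
    """Map each document ID to {field name: sorted list of analyzed terms}."""
    result = {}
    for doc in mtermvectors_response.get("docs", []):
        if not doc.get("found"):
            continue  # skip IDs that do not exist in the index
        fields = doc.get("term_vectors", {})
        result[doc["_id"]] = {
            field: sorted(vector.get("terms", {}))
            for field, vector in fields.items()
        }
    return result


# Hand-written sample shaped like the response above (not real server output).
sample = {
    "docs": [
        {
            "_id": "1",
            "found": True,
            "term_vectors": {
                "desc": {
                    "terms": {
                        "amazing": {"term_freq": 1},
                        "an": {"term_freq": 1},
                    }
                }
            },
        }
    ]
}

print(terms_by_field(sample))  # {'1': {'desc': ['amazing', 'an']}}
```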

Related

Elasticsearch : how to return the document with the exact word searched and not all documents that contain that word in an sentence?

I have a field (type text) named 'description'.
I have 3 documents:
doc1 description = "test"
doc2 description = "test dsc"
doc3 description = "2021 test desc"
CASE 1: if I search "test", I want only doc1
CASE 2: if I search "test dsc", I want only doc2
CASE 3: if I search "2021 test desc", I want only doc3
But right now only CASE 3 works.
For example, CASE 1 does not work. If I try this query, I get all 3 documents:
GET /myindex/_search
{
"query": {
"match" : {
"Description" : "test"
}
}
}
thanks

You are getting all three documents in your search because, by default, Elasticsearch uses the standard analyzer for text fields. It tokenizes "2021 test desc" into:
{
"tokens": [
{
"token": "2021",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 0
},
{
"token": "test",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "desc",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Therefore, a match query returns every document that contains any of the query's tokens; since all three descriptions contain the token test, all three match.
If you want to search for the exact term, you need to update your index mapping.
You can do that by indexing the same field in multiple ways, i.e. by using multi-fields.
PUT /_mapping
{
"properties": {
"description": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
Then reindex the data. After this, you will be able to query "description" as a text field and "description.raw" as a keyword field.
Search Query:
{
"query": {
"match": {
"description.raw": "test dsc"
}
}
}
Search Result:
"hits": [
{
"_index": "67777521",
"_type": "_doc",
"_id": "2",
"_score": 0.9808291,
"_source": {
"description": "test dsc"
}
}
]
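
The difference between the two fields can be imitated in a few lines of plain Python. This is only a sketch: analyze is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, lowercase), and match_text / match_keyword are hypothetical helpers that mimic the matching rule, not Elasticsearch calls.

```python
import re

# The three documents from the question, keyed by document ID.
docs = {"1": "test", "2": "test dsc", "3": "2021 test desc"}


def analyze(text):
    """Rough approximation of the standard analyzer for simple ASCII input."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def match_text(query):
    """match on the text field: hit if the document shares any token with the query."""
    wanted = set(analyze(query))
    return [doc_id for doc_id, text in docs.items() if wanted & set(analyze(text))]


def match_keyword(query):
    """Query on the keyword sub-field: the whole stored value must be equal."""
    return [doc_id for doc_id, text in docs.items() if text == query]


print(match_text("test"))         # ['1', '2', '3']
print(match_keyword("test"))      # ['1']
print(match_keyword("test dsc"))  # ['2']
```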

Elasticsearch match vs. term in filter

I don't see any difference between term and match in filter:
POST /admin/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber": "j1knd"
}
}
]
}
}
}
And the result also contains part numbers that do not match exactly, e.g. "52527.J1KND-H".
Why?

Term queries are not analyzed, which means whatever you send is used as-is to match the tokens in the inverted index. Match queries are analyzed: by default, the same analyzer that was applied to the field at index time is applied to the query text, and documents are matched against the resulting tokens.
Read more about term query and match query. As mentioned in the match query:
Returns documents that match a provided text, number, date or boolean
value. The provided text is analyzed before matching.
You can also use the analyze API to see the tokens generated for a particular field.
Tokens generated by the standard analyzer for the text 52527.J1KND-H:
POST /_analyze
{
"text": "52527.J1KND-H",
"analyzer" : "standard"
}
{
"tokens": [
{
"token": "52527",
"start_offset": 0,
"end_offset": 5,
"type": "<NUM>",
"position": 0
},
{
"token": "j1knd",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "h",
"start_offset": 12,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
The above explains why you also get part numbers that do not match exactly, e.g. "52527.J1KND-H". Here is how you can make your example work.
Index mapping
{
"mappings": {
"properties": {
"partnumber": {
"type": "text",
"fields": {
"raw": {
"type": "keyword" --> note this
}
}
}
}
}
}
Index docs
{
"partnumber" : "j1knd"
}
{
"partnumber" : "52527.J1KND-H"
}
Search query to return only the exact match
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber.raw": "j1knd" --> note `.raw` in field
}
}
]
}
}
}
Result
"hits": [
{
"_index": "so_match_term",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"partnumber": "j1knd"
}
}
]
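
To make the term/match distinction concrete, here is a toy inverted index in Python. It is a sketch under simplifying assumptions: analyze only approximates the standard analyzer for this kind of input, and build_index, term_query and match_query are illustrative names, not Elasticsearch APIs. The text index stores analyzed tokens, while the raw index stores the whole value, as a keyword sub-field would.

```python
import re
from collections import defaultdict


def analyze(text):
    """Approximate the standard analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, analyzed):
    """Build token -> set of document IDs; keyword-style when analyzed is False."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        tokens = analyze(value) if analyzed else [value]
        for token in tokens:
            index[token].add(doc_id)
    return index


def term_query(index, value):
    """Look the value up as-is, without analysis."""
    return sorted(index.get(value, set()))


def match_query(index, value):
    """Analyze the value first, then look up every resulting token."""
    hits = set()
    for token in analyze(value):
        hits |= index.get(token, set())
    return sorted(hits)


docs = {"1": "j1knd", "2": "52527.J1KND-H"}
text_index = build_index(docs, analyzed=True)   # like the text field
raw_index = build_index(docs, analyzed=False)   # like the .raw keyword field

print(term_query(text_index, "j1knd"))   # ['1', '2']
print(term_query(text_index, "J1KND"))   # []
print(match_query(text_index, "J1KND"))  # ['1', '2']
print(term_query(raw_index, "j1knd"))    # ['1']
```

The second call shows the other side of the same rule: because a term query is not analyzed, an upper-case value finds nothing in a field whose tokens were lowercased at index time.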

Elasticsearch: Count terms in document

I'm fairly new to Elasticsearch and use version 6.5. My database contains website pages and their content, like this:
Url Content
abc.com There is some content about cars here. Lots of cars!
def.com This page is all about cars.
ghi.com Here it tells us something about insurances.
jkl.com Another page about cars and how to buy cars.
I have been able to perform a simple query that returns all documents that contain the word "cars" in their content (using Python):
current_app.elasticsearch.search(index=index, doc_type=index,
body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}},
"from": 0, "size": 100})
Result looks something like this:
{'took': 2521,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index':
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571,
'_source': {'content': '....'}}]}}
The "_id"s are referring to a domain, so I basically get back:
abc.com
def.com
jkl.com
But I now want to know how often the search term ("cars") occurs in each document, like:
abc.com: 2
def.com: 1
jkl.com: 2
I found several solutions for obtaining the number of documents that contain the search term, but none that show how to get the number of occurrences of a term within a document. I also couldn't find anything in the official documentation, although I'm pretty sure it is in there somewhere and I'm just not recognizing it as the solution to my problem.
Update:
As suggested by @Curious_MInd, I tried a terms aggregation:
current_app.elasticsearch.search(index=index, doc_type=index,
body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content"
}}}})
Result:
{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful':
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0,
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252',
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations':
{'skala_count': {'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0, 'buckets': []}}}
I don't see where it would display the counts per document here, but I'm assuming that's because "buckets" is empty? On another note: the results found by the terms aggregation are significantly worse than those of the multi_match query. Is there any way to combine the two?

What you are trying to achieve can't be done in a single query. The first query filters the documents and gets the IDs of those for which term counts are required.
Let's assume you have the following mapping:
{
"test": {
"mappings": {
"_doc": {
"properties": {
"details": {
"type": "text",
"store": true,
"term_vector": "with_positions_offsets_payloads"
},
"name": {
"type": "keyword"
}
}
}
}
}
}
Assuming your query returns the following two docs:
{
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"details": "There is some content about cars here. Lots of cars!",
"name": "n1"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 1,
"_source": {
"details": "This page is all about cars",
"name": "n2"
}
}
]
}
}
From the above response you can get all the document IDs that matched your query. Here we have "_id": "1" and "_id": "2".
Now we use the _mtermvectors API to get the frequency (count) of each term in a given field:
POST test/_doc/_mtermvectors
{
"docs": [
{
"_id": "1",
"fields": [
"details"
]
},
{
"_id": "2",
"fields": [
"details"
]
}
]
}
The above returns the following result:
{
"docs": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_version": 1,
"found": true,
"took": 8,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 2,
"tokens": [
{
"position": 5,
"start_offset": 28,
"end_offset": 32
},
{
"position": 9,
"start_offset": 47,
"end_offset": 51
}
]
},
....
}
}
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 1,
"tokens": [
{
"position": 5,
"start_offset": 23,
"end_offset": 27
}
]
},
....
}
}
}
}
]
}
Note that I have used .... to denote the data of the other terms in the field, since the term vectors API returns details for all terms.
You can extract the information about the required term from the above response. Here I have shown it for cars, and the value you are interested in is term_freq.
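
Pulling term_freq out of that response is a small client-side step. The sketch below assumes the _mtermvectors response has been parsed into a Python dict; term_counts is an illustrative helper name, and the sample is a trimmed, hand-written version of the response above. The term has to be given in its analyzed form (lowercase cars here), because that is how it appears under "terms".

```python
def term_counts(mtermvectors_response, field, term):
    """Return {document ID: term_freq of `term` in `field`}, 0 if the term is absent."""
    counts = {}
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed, hand-written version of the response shown above.
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

print(term_counts(response, "details", "cars"))  # {'1': 2, '2': 1}
```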

I guess you need a terms aggregation here, like below:
GET /_search
{
"aggs" : {
"cars_count" : {
"terms" : { "field" : "Content" }
}
}
}

Elasticsearch: total term frequency and doc count from given set of documents

I am trying to get the total term frequency and document count for a given set of documents, but _termvectors in Elasticsearch returns ttf and doc_count computed over all documents in the index. Is there any way to specify a list of documents (document IDs) so that the result is based on those documents only?
Below are the document details and the query to get the total term frequency:
Index details:
PUT /twitter
{ "mappings": {
"tweets": {
"properties": {
"name": {
"type": "text",
"analyzer":"english"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}
}
Document Details:
PUT /twitter/tweets/1
{
"name":"Hello bar"
}
PUT /twitter/tweets/2
{
"name":"Hello foo"
}
PUT /twitter/tweets/3
{
"name":"Hello foo bar"
}
This creates three documents with IDs 1, 2 and 3. Now suppose the tweets with IDs 1 and 2 belong to user1 and tweet 3 belongs to another user, and I want to get the term vectors for user1.
Query to get this result:
GET /twitter/tweets/_mtermvectors
{
"ids" : ["1", "2"],
"parameters": {
"fields": ["name"],
"term_statistics": true,
"offsets":false,
"payloads":false,
"positions":false
}
}
Response:
{
"docs": [
{
"_index": "twitter",
"_type": "tweets",
"_id": "1",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"name": {
"field_statistics": {
"sum_doc_freq": 7,
"doc_count": 3,
"sum_ttf": 7
},
"terms": {
"bar": {
"doc_freq": 2,
"ttf": 2,
"term_freq": 1
},
"hello": {
"doc_freq": 3,
"ttf": 3,
"term_freq": 1
}
}
}
}
},
{
"_index": "twitter",
"_type": "tweets",
"_id": "2",
"_version": 1,
"found": true,
"took": 1,
"term_vectors": {
"name": {
"field_statistics": {
"sum_doc_freq": 7,
"doc_count": 3,
"sum_ttf": 7
},
"terms": {
"foo": {
"doc_freq": 2,
"ttf": 2,
"term_freq": 1
},
"hello": {
"doc_freq": 3,
"ttf": 3,
"term_freq": 1
}
}
}
}
}
]
}
Here we can see that hello has doc_freq 3 and ttf 3. How can I make it consider only the documents with the given IDs?
One approach I am considering is to create a separate index for each user, but I am not sure this is the right approach, since the number of indices would grow with the number of users. Is there another solution?

To obtain the term doc count on a subset of documents, you can try using simple aggregations.
You will have to enable fielddata in the mapping of the field (though it can be heavy on memory; check the documentation page about fielddata for more details):
PUT /twitter
{
"mappings": {
"tweets": {
"properties": {
"name": {
"type": "text",
"analyzer":"english",
"fielddata": true,
"term_vector": "yes"
}
}
}
}
}
Then use terms aggregation:
POST /twitter/tweets/_search
{
"size": 0,
"query": {
"terms": {
"_id": [
"1",
"2"
]
}
},
"aggs": {
"my_term_doc_count": {
"terms": {
"field": "name"
}
}
}
}
The response will be:
{
"hits": ...,
"aggregations": {
"my_term_doc_count": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hello",
"doc_count": 2
},
{
"key": "bar",
"doc_count": 1
},
{
"key": "foo",
"doc_count": 1
}
]
}
}
}
I couldn't find a way to calculate the total term frequency on a subset of documents, though; I'm afraid it can't be done.
I would suggest computing term vectors offline with the _analyze API and storing them explicitly in a separate index. That way you will be able to use simple aggregations to compute the total term frequency as well. Here is an example usage of the _analyze API:
POST twitter/_analyze
{
"text": "Hello foo bar"
}
{
"tokens": [
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "foo",
"start_offset": 6,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "bar",
"start_offset": 10,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Hope that helps!
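
If the subset is small enough to fetch, another option is to sum the per-document term_freq values on the client. This is a workaround rather than an Elasticsearch feature: the sketch assumes you request only the subset's IDs through _mtermvectors, as in the question, and parse the response into a dict. subset_statistics is a hypothetical helper name, and the sample is a trimmed copy of the response from the question.

```python
from collections import Counter


def subset_statistics(mtermvectors_response, field):
    """Compute doc_freq and ttf per term over only the documents in the response."""
    ttf = Counter()
    doc_freq = Counter()
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, info in terms.items():
            ttf[term] += info.get("term_freq", 0)  # occurrences within this document
            doc_freq[term] += 1                    # this document contains the term
    return {
        term: {"doc_freq": doc_freq[term], "ttf": ttf[term]}
        for term in sorted(ttf)
    }


# Trimmed copy of the response for IDs 1 and 2 from the question.
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"name": {"terms": {
            "bar": {"term_freq": 1}, "hello": {"term_freq": 1}}}}},
        {"_id": "2", "term_vectors": {"name": {"terms": {
            "foo": {"term_freq": 1}, "hello": {"term_freq": 1}}}}},
    ]
}

print(subset_statistics(response, "name"))
```

For these two documents, hello ends up with doc_freq 2 and ttf 2 instead of the index-wide 3 and 3. The cost is that every term vector in the subset travels to the client, so this may not scale to large subsets.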

es query of suggest in elasticsearch 5.0.1

I have a question: I want to search for results using suggest.
My type schema looks like this:
{
"name": {
"input": [
"user1"
]
},
"usertype": 1
}
{
"name": {
"input": [
"user2"
]
},
"usertype": 2
}
I want to search the data with suggest, using a query like this:
{
"suggest": {
"person_suggest": {
"text": "us",
"completion": {
"field": "name"
}
}
}
}
And the result looks like this:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"person_suggest": [
{
"text": "word",
"offset": 0,
"length": 4,
"options": [
{
"name": "user1",
"usertype": 1,
"score": 1
},
{
"text": "user2",
"usertype": 2,
"score": 1
}
]
}
]
}
But I only want the results where usertype = 1, like adding a WHERE condition in MySQL. Can anybody help me? I want a DSL query. Thanks a lot.

You can't filter in completion suggest queries. One solution is to create a separate completion field for each usertype, or to use standard queries with nGram analyzers.
