Get top 100 most used three word phrases in all documents - elasticsearch

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most used three-word phrases across all these texts, something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors, but they appear to apply to single documents. So I feel it will be some combination of term vectors and aggregations with n-gram analysis of sorts, but I have no idea how to implement this. Any pointers would be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles:
Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
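For reference, a complete analysis block along those lines could look like this (a sketch only; the filter and analyzer names are illustrative, not taken from the blog post):
"analysis": {
  "filter": {
    "filter_shingle": {
      "type": "shingle",
      "max_shingle_size": 3,
      "min_shingle_size": 3,
      "output_unigrams": "false"
    }
  },
  "analyzer": {
    "analyzer_shingle": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "filter_shingle"]
    }
  }
}
You would then point the body field (or a dedicated sub-field, if you still want the original analyzer for normal search) at analyzer_shingle in the mapping.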
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a simple terms aggregation on your body field to see the top one hundred three-word phrases:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "three-word-phrases": {
      "terms": {
        "field": "body",
        "size": 100
      }
    }
  }
}

Related

Elasticsearch's minimumShouldMatch for each member of an array

Consider an Elasticsearch entity:
{
  "id": 123456,
  "keywords": ["apples", "bananas"]
}
Now, imagine I would like to find this entity by searching for apple.
{
  "match": {
    "keywords": {
      "query": "apple",
      "operator": "AND",
      "minimum_should_match": "75%"
    }
  }
}
The problem is that the 75% minimum for matching would be required for both of the strings in the array, so nothing will be found. Is there a way to say something like minimumShouldMatch: "75% of any array field"?
Note that I need to use AND, as each item of keywords may consist of longer text.
EDIT:
I tried the proposed solutions, but none of them gave the expected results. I guess the problem is that the text might be quite long, e.g.:
["national gallery in prague", "narodni galerie v praze"]
I guess the fuzzy expansion is just not able to expand such long strings if you just start searching by "national g".
Would this maybe be possible somehow via nested objects?
{ "keywords": [{"keyword": "apples"}, {"keyword": "bananas"}] }
and then have minimumShouldMatch=1 on keywords and then 75% on each keyword?
As per the docs:
The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator parameter can be set to or or and to control the boolean clauses (defaults to or). The minimum number of optional should clauses to match can be set using the minimum_should_match parameter.
If you are searching for multiple tokens, for example "apples mangoes", and set the minimum to 100%, it means both tokens must be present in the document. If you set it to 50%, at least one of them must be present.
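For illustration, a sketch of a match query (keywords field as in the question) where both tokens must be present:
{
  "query": {
    "match": {
      "keywords": {
        "query": "apples mangoes",
        "minimum_should_match": "100%"
      }
    }
  }
}
Dropping minimum_should_match to 50% would make a document containing only one of the two tokens match as well.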
If you want to match tokens partially, you can use the fuzziness parameter. With fuzziness you can set the maximum edit distance allowed for matching:
{
  "query": {
    "match": {
      "keywords": {
        "query": "apple",
        "fuzziness": "auto"
      }
    }
  }
}
If you are trying to match a word to its root form, you can use a "stemmer" token filter:
PUT index-name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated:
GET index-name/_analyze
{
  "text": ["apples", "bananas"],
  "analyzer": "my_analyzer"
}
"tokens" : [
{
"token" : "appl",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "banana",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 101
}
]
Stemming reduces words to their root form.
You can also explore n-grams and edge n-grams for partial matching.
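For instance, a minimal edge n-gram setup might look like the following (a sketch; the filter/analyzer names and gram sizes are illustrative):
PUT index-name
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "edge_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_edge_ngram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "edge_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
Because the edge n-grams are generated at index time, a partial prefix such as "nation" will match documents containing "national"; the search_analyzer is left as standard so the query text itself is not n-grammed.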

prefix autocomplete suggestion elasticsearch

I am trying to implement a prefix autocomplete feature using Elasticsearch. Here is my mapping for the suggest field:
PUT vdpinfo
{
  "mappings": {
    "details": {
      "properties": {
        "suggest": {
          "type": "completion"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
And I indexed some data with both single words and two-word phrases (bigrams), such as:
{"suggest": "leather"}
And also:
{"suggest": "leather seats"}
{"suggest": "2 leather"}
And my search query is like this:
GET /vdpinfo/details/_search
{
  "suggest": {
    "feature-suggest": {
      "prefix": "leather",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
But the result returns both {"suggest": "leather"} and {"suggest": "2 leather"}, and more importantly, {"suggest": "2 leather"} is ranked higher than leather.
My question is: why does 2 leather get returned at all? Why doesn't it just do prefix autocompletion as specified in the query (prefix: leather)?
This is because the default analyzer used for your data is the simple analyzer, which breaks text into terms whenever it encounters a character that is not a letter, so 2 leather is actually indexed as leather, which is why that result is showing (and also why it is showing first).
The reason the simple analyzer is used by default instead of the standard one is to avoid providing suggestions based on stop words (explanation here).
So if you use the standard analyzer instead, you won't get any suggestions for 2 leather:
PUT vdpinfo
{
  "mappings": {
    "details": {
      "properties": {
        "suggest": {
          "type": "completion",
          "analyzer": "standard"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
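You can see the difference with the _analyze API (a quick sketch against the index above):
GET vdpinfo/_analyze
{
  "analyzer": "simple",
  "text": "2 leather"
}
GET vdpinfo/_analyze
{
  "analyzer": "standard",
  "text": "2 leather"
}
The simple analyzer drops the digit and emits only the token leather, so the completion input for "2 leather" effectively starts with leather and matches the prefix. The standard analyzer keeps 2 as the first token, so that suggestion no longer starts with leather and is not returned.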

Asking for significant terms but returns nothing

I am having an issue with Elasticsearch (version 2.0): I am trying to get the significant terms from a bunch of documents, but the aggregation always returns nothing.
Here is the schema of my index:
{
  "documents": {
    "warmers": {},
    "mappings": {
      "document": {
        "properties": {
          "text": {
            "index": "not_analyzed",
            "type": "string"
          },
          "entities": {
            "properties": {
              "text": {
                "index": "not_analyzed",
                "type": "string"
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1447410095617",
        "uuid": "h2m2J9sJQaCpxvGDI591zg",
        "number_of_replicas": "1",
        "version": {
          "created": "2000099"
        },
        "number_of_shards": "5"
      }
    },
    "aliases": {}
  }
}
So it's a simple index that contains the field text, which is not analyzed, and an array entities that contains dictionaries with a single field, text, which is not analyzed either.
What I want to do is match some of the documents and extract the most significant terms from the associated entities. For that, I use a wildcard query and then an aggregation.
Here is the request I am sending through curl:
curl -XGET 'http://localhost:9200/documents/_search' -d '{
  "query": {
    "bool": {
      "must": {"wildcard": {"text": "*test*"}}
    }
  },
  "aggregations": {
    "my_significant_terms": {
      "significant_terms": { "field": "entities.text" }
    }
  }
}'
Unfortunately, even though Elasticsearch hits some documents, the buckets of the significant terms aggregation are always empty.
I also tried analyzed instead of not_analyzed, but I got the same empty results.
So first, is it relevant to do it this way?
I am a complete beginner with Elasticsearch, so can you explain how the significant terms aggregation works?
And finally, if this approach is relevant, why isn't my query working?
EDIT: I just saw in the Elasticsearch documentation that the significant terms aggregation needs a certain amount of data to be effective, and I only have 163 documents in my index. Could that be the cause?
Not sure if it will help, but try specifying:
"min_doc_count": 1
Regarding "the significant terms aggregation needs a certain amount of data to be effective, and I only have 163 documents in my index. Could that be the cause?": using 1 shard rather than 5 will help if you have a small number of docs.
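For example, the original request with min_doc_count added to the aggregation would look like this (a sketch; it only lowers the reporting threshold, it does not change how significance is scored):
curl -XGET 'http://localhost:9200/documents/_search' -d '{
  "query": {
    "bool": {
      "must": {"wildcard": {"text": "*test*"}}
    }
  },
  "aggregations": {
    "my_significant_terms": {
      "significant_terms": {
        "field": "entities.text",
        "min_doc_count": 1
      }
    }
  }
}'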

Indexing a comma-separated value field in Elastic Search

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta tags, some of them containing comma-separated lists of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15" (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in schema.xml. I can't figure out how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create a custom analyzer that splits the indexed text into tokens on commas. Then you can search it.
If you don't care about relevance, you can use a filter to search through your documents; my example shows how to search with a term filter.
Below you can find how to do this with the Sense plugin.
DELETE testindex

PUT testindex
{
  "index": {
    "analysis": {
      "tokenizer": {
        "comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma": {
          "type": "custom",
          "tokenizer": "comma"
        }
      }
    }
  }
}

PUT /testindex/_mapping/yourtype
{
  "properties": {
    "contentType": {
      "type": "string",
      "analyzer": "comma"
    }
  }
}

PUT /testindex/yourtype/1
{
  "contentType": "1,2,3"
}

PUT /testindex/yourtype/2
{
  "contentType": "3,4"
}

PUT /testindex/yourtype/3
{
  "contentType": "1,6"
}

GET /testindex/_search
{
  "query": {"match_all": {}}
}

GET /testindex/_search
{
  "filter": {
    "term": {
      "contentType": "6"
    }
  }
}
Hope it helps.
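Note that the standalone top-level filter element in the last search only works on older Elasticsearch versions; on current versions the equivalent would be a bool query with a filter clause (a sketch, not tied to a specific version):
GET /testindex/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "contentType": "6"
        }
      }
    }
  }
}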
Another option on more recent Elasticsearch versions is the char_group tokenizer, which splits on a configurable set of characters (commas included) without needing a pattern tokenizer:
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n",
      ","
    ]
  },
  "text": "QUICK,brown, fox"
}
This emits the tokens QUICK, brown, and fox.

Is it possible to rank span_near queries with unique results higher than duplicate results?

Assume I have two documents that have a "catField" containing the following information:
Document one:
happy cat
sad cat
meh cat
Document two:
happy cat
happy cat
happy cat
I am attempting to write a query that fulfils two requirements:
Find any word with a length of at least three followed by the word "cat".
The query should also rank documents with more unique types of cats (document one) higher than those that have the same types of cats (document two).
Here is my initial solution that uses span_near with regexp that fulfils the first requirement:
"span_near": {
"clauses": [
{
"span_multi": {
"match": {
"regexp": {
"catField": "[a-z]{3,}"
}
}
}
},
{
"span_multi": {
"match": {
"regexp": {
"catField": "cat"
}
}
}
}
],
"slop": 0,
"in_order": true
}
This works great for finding documents with lists of cats, but it will rank Document one and Document two (above) the same. How can I fulfil that second requirement of ranking unique cat lists higher than non-unique ones?
So here is an approach using some indexing magic to get what you want. I'm not entirely certain of your requirements (since you are probably working with data more complicated than just "happy cat"), but it should get you started in the index-time direction.
This may or may not be the right approach for your setup. Depending on index size and query load, phrase queries/span queries/bool combinations may work better. Your requirements are tricky though, since they depend on order, size of preceding token, and number of variations.
The advantage of this is that much of your complex logic is baked into the index, gaining speed at query time. It does make your data a bit more rigid however.
curl -XDELETE localhost:9200/cats
curl -XPUT localhost:9200/cats -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "catalyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["cat_pattern", "unique", "cat_replace"]
          }
        },
        "filter": {
          "cat_pattern": {
            "type": "pattern_capture",
            "preserve_original": false,
            "patterns": [
              "([a-z]{3,} cat)"
            ]
          },
          "cat_replace": {
            "type": "pattern_replace",
            "preserve_original": false,
            "pattern": "([a-z]{3,} cat)",
            "replacement": "cat"
          }
        }
      }
    }
  },
  "mappings": {
    "cats": {
      "properties": {
        "catField": {
          "type": "multi_field",
          "fields": {
            "catField": {
              "type": "string",
              "analyzer": "standard"
            },
            "catalyzed": {
              "type": "string",
              "index_analyzer": "catalyzer",
              "search_analyzer": "whitespace"
            }
          }
        }
      }
    }
  }
}'
First we create an index with a bunch of custom analysis. We tokenize with the keyword tokenizer (which doesn't actually tokenize, it just emits a single token). Then we use a pattern_capture filter to find all "cats" that are preceded by a word of at least three characters. We then use a unique filter to get rid of duplicates (e.g. "happy cat" three times in a row). Finally, we use a pattern_replace filter to change our "happy cat" into just "cat".
The final tokens for a field will just be "cat", but there will be more occurrences of "cat" if there are multiple types of cats.
At search time, we can simply search for "cat" and the docs that mention "cat" more often are boosted higher. More mentions means more unique types due to our analysis, so we get the boosting behavior "for free".
I used a multi-field, so you can still query the original field (e.g. if you want to search for "happy cat").
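If you want to sanity-check the analysis chain itself, an _analyze call along these lines (1.x-style syntax to match the curl examples, so treat it as a sketch) should collapse any "<something> cat" value into the single token cat:
curl -XGET 'localhost:9200/cats/_analyze?analyzer=catalyzer' -d 'meh cat'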
Demonstration using the above mappings:
curl -XPOST localhost:9200/cats/cats/1 -d '
{
  "catField": ["sad cat", "happy cat", "meh cat"]
}'
curl -XPOST localhost:9200/cats/cats/2 -d '
{
  "catField": ["happy cat", "happy cat", "happy cat"]
}'
curl -XPOST localhost:9200/cats/cats/3 -d '
{
  "catField": ["a cat", "x cat", "y cat"]
}'
curl -XPOST localhost:9200/cats/cats/_search -d '
{
  "query": {
    "match": {
      "catField.catalyzed": "cat"
    }
  }
}'
Notice that the third document isn't returned by the search, since it doesn't have a cat that is preceded by a type at least three characters long.
