How to find the most used phrases in Elasticsearch?

I know that you can find the most used terms in an index using facets.
For example, on the following inputs:
"A B C"
"AA BB CC"
"A AA B BB"
"AA B"
a terms facet returns this:
B:3
AA:3
A:2
BB:2
CC:1
C:1
But I'm wondering whether it is possible to list the following:
AA B:2
A B:1
BB CC:1
....etc...
Is there such a feature in ElasticSearch?

As mentioned in ramseykhalaf's comment, a shingle filter would produce tokens of length "n" words.
"settings" : {
"analysis" : {
"filter" : {
"shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"filter_stop":{
"type":"stop",
"enable_position_increments":"false"
}
},
"analyzer" : {
"shingle_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["standard," "lowercase", "shingle", "filter_stop"]
}
}
}
},
"mappings" : {
"type" : {
"properties" : {
"letters" : {
"type" : "string",
"analyzer" : "shingle_analyzer"
}
}
}
}
See this blog post for full details.
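Once documents are indexed through this analyzer, the phrase counts you're after can be pulled back with the same kind of terms facet (or a terms aggregation on newer versions) that you used for single terms. A minimal sketch, assuming the letters field from the mapping above:

{
    "query" : { "match_all" : {} },
    "facets" : {
        "phrases" : {
            "terms" : {
                "field" : "letters",
                "size" : 10
            }
        }
    }
}

Because the shingle filter above also emits unigrams (output_unigrams is true), this will list single words alongside the two-to-five word phrases; set output_unigrams to false if you only want the phrases.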

I'm not sure if elasticsearch will let you do this natively, but you might be interested in checking out Carrot2 - http://search.carrot2.org - to accomplish what you want (and probably more).

Related

How to build an Elasticsearch phrase query that matches text with special characters?

Over the last few days I've been playing around with Elasticsearch indexing and searching, and I've managed to build the different queries that I intended to. My problem right now is building a query that can match text with special characters even if I don't type them in the "search bar". I'll give an example to explain what I mean.
Imagine you have a document indexed that contains a field called page content. Inside this field, you can have a part of the text such as
"O carro do João é preto." (means João's car is black in portuguese)
What I want to be able to do is type something like:
O carro do joao e preto
and still be able to get the proper match.
What I've tried so far:
I've been using the match_phrase query from the Elasticsearch documentation (here), as in the example below:
GET _search
{
    "query": {
        "match_phrase": {
            "page content": {
                "query": "o carro do joao e preto"
            }
        }
    }
}
The result of this query is 0 hits, which is perfectly acceptable given that the query text differs from what has been stored in that document.
I've tried setting up the ASCII Folding Token Filter (here), but I'm not sure how to use it. What I've basically done is create a new index with this query:
PUT /newindex '
{
    "page content": "O carro do João é preto",
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
    }
}'
Then if I try to query, using the match_phrase query provided above, like this:
O carro do joao e preto
I expected it to show me the correct result, but it isn't working for me. Am I forgetting something? I've been at this for the last two days without success, and I feel like it's something small that I'm missing.
So question: What do I have to do to get the desired matching?
Managed to find the answer to my own question. I had to change the analyzer a little bit when I created the index. Further details in this previous answer:
My code now:
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "asciifolding"]
                },
                "text" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase"],
                    "char_filter" : "html_strip"
                },
                "sortable" : {
                    "tokenizer" : "keyword",
                    "filter" : ["lowercase"],
                    "char_filter" : "html_strip"
                }
            }
        }
    }
}
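To sanity-check the new default analyzer, you can run the original sentence through the _analyze API. A sketch, assuming the settings above were applied to an index named newindex (the name used in the question):

curl -XGET 'http://localhost:9200/newindex/_analyze?analyzer=default' -d 'O carro do João é preto'

The returned tokens should be o, carro, do, joao, e and preto, so the accent-free match_phrase query from the question produces the same tokens and now matches.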

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most used three-word phrases across all these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
    "mappings": {
        "items": {
            "properties": {
                "body": {
                    "type": "string",
                    "term_vector": "with_positions_offsets_payloads",
                    "store" : true,
                    "analyzer" : "fulltext_analyzer"
                }
            }
        }
    },
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 0
        },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "type_as_payload"
                    ]
                }
            }
        }
    }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles:
Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
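For reference, a simplified settings and mapping sketch along those lines (the analyzer name loosely follows the blog post, the stop-word filter from the post is omitted here, and the items type matches the mapping in the question):

{
    "settings" : {
        "analysis" : {
            "filter" : {
                "filter_shingle" : {
                    "type" : "shingle",
                    "max_shingle_size" : 3,
                    "min_shingle_size" : 3,
                    "output_unigrams" : "false"
                }
            },
            "analyzer" : {
                "analyzer_shingle" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "filter_shingle"]
                }
            }
        }
    },
    "mappings" : {
        "items" : {
            "properties" : {
                "body" : {
                    "type" : "string",
                    "analyzer" : "analyzer_shingle"
                }
            }
        }
    }
}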
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
    "size" : 0,
    "query" : {
        "match_all" : {}
    },
    "aggs" : {
        "three-word-phrases" : {
            "terms" : {
                "field" : "body",
                "size" : 100
            }
        }
    }
}

Indexing a comma-separated value field in Elastic Search

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta tags, some of which contain a comma-separated list of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create a custom analyzer which will split the indexed text into tokens on commas.
Then you can search. If you don't care about relevance, you can use a filter to search through your documents; my example shows how to search with a term filter.
Below you can see how to do this with the Sense plugin.
DELETE testindex

PUT testindex
{
    "index" : {
        "analysis" : {
            "tokenizer" : {
                "comma" : {
                    "type" : "pattern",
                    "pattern" : ","
                }
            },
            "analyzer" : {
                "comma" : {
                    "type" : "custom",
                    "tokenizer" : "comma"
                }
            }
        }
    }
}

PUT /testindex/_mapping/yourtype
{
    "properties" : {
        "contentType" : {
            "type" : "string",
            "analyzer" : "comma"
        }
    }
}

PUT /testindex/yourtype/1
{
    "contentType" : "1,2,3"
}

PUT /testindex/yourtype/2
{
    "contentType" : "3,4"
}

PUT /testindex/yourtype/3
{
    "contentType" : "1,6"
}

GET /testindex/_search
{
    "query": {"match_all": {}}
}

GET /testindex/_search
{
    "filter": {
        "term": {
            "contentType": "6"
        }
    }
}
Hope it helps.
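On newer Elasticsearch versions (the char_group tokenizer was added in 6.4), you can split on commas directly without defining a pattern tokenizer; for instance, the _analyze request below should break "QUICK,brown, fox" into the tokens QUICK, brown and fox: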
POST _analyze
{
    "tokenizer": {
        "type": "char_group",
        "tokenize_on_chars": [
            "whitespace",
            "-",
            "\n",
            ","
        ]
    },
    "text": "QUICK,brown, fox"
}

Elasticsearch: search for words containing the '#' character

For example, I am right now searching like this:
http://localhost:9200/posts/post/_search?q=content:%23sachin
But I am getting all the results with 'sachin' and not '#sachin'. Also, I am using a regular expression to get the count of terms. The facet looks like this:
"facets": {
"content": {
"terms": {
"field": "content",
"size": 1000,
"all_terms": false,
"regex": "#sachin",
"regex_flags": [
"DOTALL",
"CASE_INSENSITIVE"
]
}
}
}
This is not returning any values. I think it has something to do with escaping the '#' inside the regular expression, but I am not sure how to do it. I have tried escaping it with \ and \\, but it did not work. Can anyone help me in this regard?
This article explains how to preserve # and @ using custom analyzers:
https://web.archive.org/web/20160304014858/http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html
curl -XPUT 'http://localhost:9200/twitter' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        },
        "analysis" : {
            "filter" : {
                "tweet_filter" : {
                    "type" : "word_delimiter",
                    "type_table": ["# => ALPHA", "@ => ALPHA"]
                }
            },
            "analyzer" : {
                "tweet_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase", "tweet_filter"]
                }
            }
        }
    },
    "mappings" : {
        "tweet" : {
            "properties" : {
                "msg" : {
                    "type" : "string",
                    "analyzer" : "tweet_analyzer"
                }
            }
        }
    }
}'
This doesn't deal with the facets directly, but redefining the type of those special characters in the analyzer could help.
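A quick way to check the analyzer is to run some text through the _analyze API; a sketch, assuming the twitter index was created as above:

curl -XGET 'http://localhost:9200/twitter/_analyze?analyzer=tweet_analyzer' -d '#sachin is batting'

The first token returned should be #sachin (lowercased, with the # preserved) rather than sachin, so both the indexed terms and any query analyzed with the same analyzer keep the hash.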
Another approach worth considering is to index a special (i.e. "reserved") word instead of the hash symbol, for example HASHSYMBOLCHAR. Make sure that you replace the '#' characters in the query as well.
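One way to handle both indexing and querying at once is a mapping char filter that rewrites # to the reserved word before tokenizing, with the same analyzer applied at index and search time. A minimal sketch (the filter and analyzer names here are made up for illustration):

"analysis" : {
    "char_filter" : {
        "hash_to_word" : {
            "type" : "mapping",
            "mappings" : ["# => HASHSYMBOLCHAR"]
        }
    },
    "analyzer" : {
        "hash_analyzer" : {
            "type" : "custom",
            "char_filter" : ["hash_to_word"],
            "tokenizer" : "whitespace",
            "filter" : ["lowercase"]
        }
    }
}

With this in place, "#sachin" is indexed and searched as hashsymbolcharsachin, so a query for "#sachin" matches it while a query for plain "sachin" does not.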

Index fields with hyphens in Elasticsearch

I'm trying to work out how to configure elasticsearch so that I can make query string searches with wildcards on fields that include hyphens.
I have documents that look like this:
{
    "tags" : [
        "deck-clothing-blue",
        "crew-clothing",
        "medium"
    ],
    "name" : "Crew t-shirt navy large",
    "description" : "This is a t-shirt",
    "images" : [
        {
            "id" : "ba4a024c96aa6846f289486dfd0223b1",
            "type" : "Image"
        },
        {
            "id" : "ba4a024c96aa6846f289486dfd022503",
            "type" : "Image"
        }
    ],
    "type" : "InventoryType",
    "header" : {
    }
}
I have tried to use a word_delimiter filter and a whitespace tokenizer:
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        },
        "analysis" : {
            "filter" : {
                "tags_filter" : {
                    "type" : "word_delimiter",
                    "type_table": ["- => ALPHA"]
                }
            },
            "analyzer" : {
                "tags_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : ["tags_filter"]
                }
            }
        }
    },
    "mappings" : {
        "yacht1" : {
            "properties" : {
                "tags" : {
                    "type" : "string",
                    "analyzer" : "tags_analyzer"
                }
            }
        }
    }
}
But these are the searches (for tags) and their results:
deck* -> match
deck-* -> no match
deck-clo* -> no match
Can anyone see where I'm going wrong?
Thanks :)
The analyzer is fine (though I'd lose the filter), but your search analyzer isn't specified, so the standard analyzer is used to search the tags field. It strips out the hyphen and then queries against what's left (run curl "localhost:9200/_analyze?analyzer=standard" -d "deck-*" to see what I mean).
Basically, "deck-*" is being searched for as "deck *"; there is no term that is just "deck", so it fails.
"deck-clo*" is being searched for as "deck clo*"; again, there is no term that is just "deck" or starts with "clo", so the query fails.
I'd make the following modifications:
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "whitespace",
"filter" : ["lowercase"] <--- you don't need this, just thought it was a nice touch
}
}
}
Then get rid of the special analyzer on the tags field:
"mappings" : {
"yacht1" : {
"properties" : {
"tags" : {
"type" : "string"
}
}
}
}
let me know how it goes.
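As a quick check of the difference, you can compare how the two analyzers tokenize one of the tags (the commands below assume a local node on port 9200):

curl "localhost:9200/_analyze?analyzer=standard" -d "deck-clothing-blue"
curl "localhost:9200/_analyze?analyzer=whitespace" -d "deck-clothing-blue"

The standard analyzer should return the tokens deck, clothing and blue, while the whitespace analyzer keeps deck-clothing-blue as a single token, which is what the wildcard queries above need to match against.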
