Indexing a comma-separated value field in Elasticsearch

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta tags, some of which contain a comma-separated list of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!

Create a custom analyzer that splits the indexed text into tokens on commas.
Then you can search against it. If you don't care about relevance scoring, you can use a filter to search through your documents; my example shows how to search with a term filter.
Below is how to do this with the Sense plugin.
DELETE testindex
PUT testindex
{
  "index": {
    "analysis": {
      "tokenizer": {
        "comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma": {
          "type": "custom",
          "tokenizer": "comma"
        }
      }
    }
  }
}
PUT /testindex/_mapping/yourtype
{
  "properties": {
    "contentType": {
      "type": "string",
      "analyzer": "comma"
    }
  }
}
PUT /testindex/yourtype/1
{
  "contentType": "1,2,3"
}
PUT /testindex/yourtype/2
{
  "contentType": "3,4"
}
PUT /testindex/yourtype/3
{
  "contentType": "1,6"
}
GET /testindex/_search
{
  "query": { "match_all": {} }
}
GET /testindex/_search
{
  "filter": {
    "term": {
      "contentType": "6"
    }
  }
}
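Note that the standalone top-level filter element shown above only works on older Elasticsearch versions; on newer versions the equivalent is a term query wrapped in a bool filter clause, along these lines:
GET /testindex/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "contentType": "6"
        }
      }
    }
  }
}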
Hope it helps.
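On newer Elasticsearch versions (6.4 and later), another option is the char_group tokenizer, which can split on commas among other characters; the _analyze request below shows how it tokenizes a sample string.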

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n",
      ","
    ]
  },
  "text": "QUICK,brown, fox"
}

Related

Can only use wildcard queries on keyword, text and wildcard fields - not on [id] which is of type [long]

Elasticsearch version 7.13.1
GET test/_mapping
{
  "test": {
    "mappings": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text"
        }
      }
    }
  }
}
POST test/_doc/101
{
  "id": 101,
  "name": "hello"
}
POST test/_doc/102
{
  "id": 102,
  "name": "hi"
}
Wildcard search pattern:
GET test/_search
{
  "query": {
    "query_string": {
      "query": "*101* *hello*",
      "default_operator": "AND",
      "fields": [
        "id",
        "name"
      ]
    }
  }
}
The error is: "reason" : "Can only use wildcard queries on keyword, text and wildcard fields - not on [id] which is of type [long]"
It was working fine in version 7.6.0.
What has changed in the newer ES version, and what is the resolution for this issue?
It's not directly possible to run wildcard queries on numeric data types; it's better to index those integers as strings.
You need to modify your index mapping so that the field is mapped as a string type, for example:
PUT /my-index
{
  "mappings": {
    "properties": {
      "code": {
        "type": "text"
      }
    }
  }
}
Otherwise, if you want to perform a partial search, you can use an edge n-gram tokenizer.
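A minimal sketch of such a setup, assuming the numeric id is indexed as a string field analyzed with an edge_ngram tokenizer (the index name, analyzer name, and gram sizes below are illustrative, not from the original answer):
# illustrative sketch, not from the original answer
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_match": {
          "type": "custom",
          "tokenizer": "partial_match_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "partial_match_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "analyzer": "partial_match",
        "search_analyzer": "standard"
      }
    }
  }
}
With this mapping, a match query for 10 would find the document whose id is 101, because the edge n-grams 1, 10, and 101 are indexed for it.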

ElasticSearch - Fuzzy search in list elements

I've got some documents stored in Elasticsearch like this:
{
  "tag" : ["tag1", "tag2", "tag3"]
  ...
}
I want to search through the "tag" field. I know it should work with a query like:
{
  "query": {
    "match": { "tag": "tag1" }
  }
}
But I don't want to use match; I want to use a fuzzy search through the list, for example something like:
{
  "query": {
    "fuzzy": { "tag": "tagg1" }
  }
}
The problem is, the above query doesn't return anything. What should I use instead?
What is the type of the tag field in your Elasticsearch mapping?
I tried with the following type for the tag field (Elasticsearch version 7.2):
"tag" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
And it's working well for me.
The fuzzy query would be:
{
  "query": {
    "fuzzy": { "tag": "tagg1" }
  }
}
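If you'd rather keep using a match query, you can also pass a fuzziness parameter; this works the same way on array fields, since every element of tag is indexed as a separate value:
{
  "query": {
    "match": {
      "tag": {
        "query": "tagg1",
        "fuzziness": "AUTO"
      }
    }
  }
}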

Prefix autocomplete suggestion in Elasticsearch

I am trying to implement a prefix autocomplete feature using Elasticsearch. Here is my mapping for the suggest field:
PUT vdpinfo
{
  "mappings": {
    "details": {
      "properties": {
        "suggest": {
          "type": "completion"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
And I indexed some data with both single words and two-word phrases (bigrams), such as:
{"suggest": "leather"}
And also:
{"suggest": "leather seats"}
{"suggest": "2 leather"}
And my search query is like this:
GET /vdpinfo/details/_search
{
  "suggest": {
    "feature-suggest": {
      "prefix": "leather",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
But the result returns both {"suggest": "leather"} and {"suggest": "2 leather"}, and more importantly, {"suggest": "2 leather"} is ranked higher than leather.
My question is why 2 leather gets returned at all. Why doesn't it just do prefix autocomplete on the query prefix leather?
This is because the default analyzer used for completion fields is the simple analyzer, which breaks text into terms whenever it encounters a character that is not a letter. So 2 leather is actually indexed as just leather, which is why that result shows up (and also why it shows up first).
The reason the simple analyzer is used by default instead of the standard one is to avoid providing suggestions based on stop words.
So if you use the standard analyzer instead, you won't get any suggestion for 2 leather:
PUT vdpinfo
{
  "mappings": {
    "details": {
      "properties": {
        "suggest": {
          "type": "completion",
          "analyzer": "standard"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
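You can check the difference with the _analyze API: the simple analyzer drops the digit and emits only leather, while the standard analyzer keeps 2 as a separate token:
POST _analyze
{
  "analyzer": "simple",
  "text": "2 leather"
}
POST _analyze
{
  "analyzer": "standard",
  "text": "2 leather"
}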

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most used three-word phrases across all of these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles:
Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
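For reference, a settings and mapping sketch along those lines (adapted from the blog post; the analyzer name and index layout below are illustrative and kept in the same string-type style as the question's mapping):
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": "false"
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "filter_shingle"]
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "analyzer_shingle"
        }
      }
    }
  }
}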
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "three-word-phrases": {
      "terms": {
        "field": "body",
        "size": 100
      }
    }
  }
}

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and put an analyzer that reverses a string (say, reverse_analyzer).
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use a mapping like this (create a type and add the mapping for the phone field):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field, which stores both the string and its reverse. If you index
phone_no: 911220
then in Elasticsearch there will be two fields:
phone_no: 911220 and phone_no.reverse: 022119, so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard), plus also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
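For example, a search that analyzes the input against both the original field and the reversed sub-field (a sketch, not from the original answer; the index name is illustrative) might look like:
# illustrative sketch, not from the original answer
GET /my_index/_search
{
  "query": {
    "query_string": {
      "query": "911220",
      "fields": ["phone_number", "phone_number.reverse"]
    }
  }
}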
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is the ICU Analysis plugin, supported by the Elasticsearch team, that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it captures all the numeric characters, then explicitly rebuilds the string followed by its own (numeric-only!) reverse. The space in the middle of the replacement pattern then causes the tokenizer to split it into two tokens: the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
