Elasticsearch: search for domains in URLs

We index HTML documents which may include links to other documents. We're using elasticsearch and things are pretty smooth for most keyword searches, which is great.
Now we're adding more complex searches, similar to Google's site: or link: searches: basically, we want to retrieve documents which point to either specific URLs or even whole domains. (If document A has a link to http://a.site.tld/path/, the search link:http://a.site.tld should yield it.)
We're now trying to work out the best way to achieve this.
So far, we have extracted the links from the documents and added a links field to our documents, set up as not analyzed. We can then run searches that match an exact URL, such as link:http://a.site.tld/path/, but of course link:http://a.site.tld does not yield anything.
Our initial idea is to create a new field linkedDomains which would work similarly... but is there a better solution?

You could try the Path Hierarchy Tokenizer:
Define a mapping as follows:
PUT /link-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "link": {
          "type": "string",
          "index_analyzer": "path-analyzer"
        }
      }
    }
  }
}
Index a doc:
POST /link-demo/doc
{
  "link": "http://a.site.tld/path/"
}
The following term query returns the indexed doc:
POST /link-demo/_search?pretty
{
  "query": {
    "term": {
      "link": {
        "value": "http://a.site.tld"
      }
    }
  }
}
To get a feel for how this is being indexed:
GET link-demo/_analyze?analyzer=path-analyzer&text="http://a.site.tld/path"&pretty
Shows the following (note that the literal double quotes from the query string end up inside the tokens):
{
  "tokens" : [
    { "token" : "\"http:",                    "start_offset" : 0, "end_offset" : 6,  "type" : "word", "position" : 1 },
    { "token" : "\"http:/",                   "start_offset" : 0, "end_offset" : 7,  "type" : "word", "position" : 1 },
    { "token" : "\"http://a.site.tld",        "start_offset" : 0, "end_offset" : 18, "type" : "word", "position" : 1 },
    { "token" : "\"http://a.site.tld/path\"", "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 }
  ]
}
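To make the mechanics concrete, here's a rough Python sketch of what the path_hierarchy tokenizer emits for a URL (a simplification: the real tokenizer also supports delimiter replacement, reverse mode, and buffer limits):

```python
def path_hierarchy_tokens(text, delimiter="/"):
    """Rough emulation of Elasticsearch's path_hierarchy tokenizer:
    emit the prefix ending before each delimiter, plus the full input."""
    tokens = []
    pos = text.find(delimiter)
    while pos != -1:
        if pos:                        # skip the empty prefix of a leading "/"
            tokens.append(text[:pos])  # prefix up to (not including) this "/"
        pos = text.find(delimiter, pos + 1)
    tokens.append(text)                # the whole value is a token too
    return tokens

print(path_hierarchy_tokens("http://a.site.tld/path"))
# ['http:', 'http:/', 'http://a.site.tld', 'http://a.site.tld/path']
```

Since http://a.site.tld is one of the indexed tokens, an exact term query for it now matches the document.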

Related

Tokens in Index Time vs Query Time are not the same when using common_gram filter ElasticSearch

I want to use common_gram token filter based on this link.
My elasticsearch version is: 7.17.8
Here is the setting of my index in ElasticSearch.
I have defined a filter named "common_grams" that uses "common_grams" as type.
I have defined a custom analyzer named "index_grams" that uses "whitespace" as tokenizer and the above filter as a token filter.
I have just one field named as "title_fa" and I have used my custom analyzer for this field.
PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}
It works fine at index time and the tokens are what I expect them to be. Here I get the tokens via the Kibana Dev Tools console.
GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text": "brown is the"
}
Here is the result of the tokens for the text.
{
  "tokens" : [
    { "token" : "brown",    "start_offset" : 0, "end_offset" : 5,  "type" : "word", "position" : 0 },
    { "token" : "brown_is", "start_offset" : 0, "end_offset" : 8,  "type" : "gram", "position" : 0, "positionLength" : 2 },
    { "token" : "is",       "start_offset" : 6, "end_offset" : 8,  "type" : "word", "position" : 1 },
    { "token" : "is_the",   "start_offset" : 6, "end_offset" : 12, "type" : "gram", "position" : 1, "positionLength" : 2 },
    { "token" : "the",      "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 2 }
  ]
}
When I search the query "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details (screenshot: Query Time Tokens).
UPDATE:
I have added a sample document like this:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search "brown coat"
GET /my-index-000007/_search
{
  "query": {
    "query_string": {
      "query": "brown coat",
      "default_field": "title_fa"
    }
  }
}
it returns the document because it searches:
["brown", "coat"]
When I search "brown is coat", it can't find the document because it is searching for
["brown is coat", "brown_is coat", "brown is_coat"]
Clearly, when the query contains a common word, it behaves differently, and I guess that's because the index-time and query-time tokens don't match.
Do you know where I am getting this wrong? Why is it acting differently?
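For reference, the index-time behavior described above can be sketched in Python (a simplification of the real common_grams filter, which also tracks positions and offsets and has a query_mode option):

```python
def common_grams(tokens, common_words):
    """Emit each token, plus a joined bigram whenever the token or its
    successor is a common word (index-time common_grams behavior)."""
    common = set(common_words)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common or nxt in common):
            out.append(tok + "_" + nxt)   # e.g. "brown" + "is" -> "brown_is"
    return out

print(common_grams("brown is the".split(), ["the", "is"]))
# ['brown', 'brown_is', 'is', 'is_the', 'the']
```

These are the index-time tokens that the question expects to also be produced at query time.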

Simple elasticsearch regexp

I'm trying to write a query that will give me all the documents where the field "id" is of the form "SOMETHING-SOMETHING-4SOMETHING-SOMETHING-SOMETHING".
For instance, ab-ba-4a-b-a is a valid id.
I wrote this query
"query": {
  "regexp": {
    "id": {
      "value": ".*-.*-4.*-.*-.*"
    }
  }
}
It gets no hits. What's wrong with this? I can see many ids of this form.
If the id field is of type keyword, the regexp should work fine.
However, if it is of type text, notice how Elasticsearch stores the tokens internally:
POST /_analyze
{
  "text": "abc-abc-4bc-abc-abc",
  "analyzer": "standard"
}
Response:
{
  "tokens" : [
    { "token" : "abc", "start_offset" : 0,  "end_offset" : 3,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "abc", "start_offset" : 4,  "end_offset" : 7,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "4bc", "start_offset" : 8,  "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "abc", "start_offset" : 12, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "abc", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 }
  ]
}
Notice that it breaks the string abc-abc-4bc-abc-abc down into 5 separate tokens. Take a look at what Analysis and Analyzers are and how they are only applied to text fields.
The keyword datatype, on the other hand, exists precisely for cases where you do not want your text analyzed (i.e. broken into tokens and stored in an inverted index); it stores the string value as-is internally.
If your mapping is dynamic, ES by default creates two fields for string values: a text field and a keyword sub-field, something like below:
{
  "mappings" : {
    "properties" : {
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}
In that case, just apply your query to the id.keyword field:
POST <your_index_name>/_search
{
  "query": {
    "regexp": {
      "id.keyword": ".*-.*-4.*-.*-.*"
    }
  }
}
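The effect is easy to reproduce outside of Elasticsearch; the sketch below approximates the standard tokenizer with a simple split, and uses fullmatch because Lucene regexes are anchored to the whole term:

```python
import re

# Approximation of the standard tokenizer for this input: split on
# non-alphanumeric characters, so the hyphens disappear.
tokens = [t for t in re.split(r"[^0-9A-Za-z]+", "abc-abc-4bc-abc-abc") if t]
print(tokens)  # ['abc', 'abc', '4bc', 'abc', 'abc']

pattern = re.compile(r".*-.*-4.*-.*-.*")

# No individual token contains a hyphen, so the regexp finds nothing...
print(any(pattern.fullmatch(t) for t in tokens))       # False

# ...but the untouched keyword value matches fine.
print(bool(pattern.fullmatch("abc-abc-4bc-abc-abc")))  # True
```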
Hope that helps!

Emails not being searched properly in elasticsearch

I have indexed a few documents in elasticsearch which have email ids as a field. But when I query for a specific email id, the search results are showing all the documents without filtering.
This is the query I have used
{
  "query": {
    "match": {
      "mail-id": "abc@gmail.com"
    }
  }
}
By default, your mail-id field is analyzed by the standard analyzer, which will tokenize the email abc@gmail.com into the following two tokens:
{
  "tokens" : [
    { "token" : "abc",       "start_offset" : 0, "end_offset" : 3,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "gmail.com", "start_offset" : 4, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
What you need instead is to create a custom analyzer using the UAX URL email tokenizer, which will tokenize email addresses as a single token.
So you need to define your index as follows:
curl -XPUT localhost:9200/people -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email"
        }
      }
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "mail-id": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
After creating that index, you can see that the email abc@gmail.com is tokenized as a single token, and your search will work as expected.
curl -XGET 'localhost:9200/people/_analyze?analyzer=my_analyzer&pretty' -d 'abc@gmail.com'
{
  "tokens" : [
    { "token" : "abc@gmail.com", "start_offset" : 0, "end_offset" : 13, "type" : "<EMAIL>", "position" : 1 }
  ]
}
This happens when you use the default mappings. Elasticsearch has the uax_url_email tokenizer, which identifies URLs and emails as a single entity/token.
You can read more about this here and here
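The difference between the two tokenizers can be sketched in plain Python (an approximation, not Lucene's actual grammar; the email regex below is a deliberately simple stand-in):

```python
import re

# Simple stand-in for an email pattern (an assumption for illustration).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def standard_style_tokens(text):
    # Standard-analyzer-like behavior: '@' is a break, while dots
    # between word characters survive.
    return re.findall(r"[\w.]+", text)

def uax_style_tokens(text):
    # Email-aware behavior: match whole addresses first, then
    # tokenize whatever text remains around them.
    out, pos = [], 0
    for m in EMAIL.finditer(text):
        out += standard_style_tokens(text[pos:m.start()])
        out.append(m.group())          # the address stays whole
        pos = m.end()
    out += standard_style_tokens(text[pos:])
    return out

print(standard_style_tokens("abc@gmail.com"))  # ['abc', 'gmail.com']
print(uax_style_tokens("abc@gmail.com"))       # ['abc@gmail.com']
```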

Searching for hyphened text in Elasticsearch

I am storing a 'Payment Reference Number' in elasticsearch.
The layout of it is e.g.: 2-4-3-635844569819109531 or 2-4-2-635844533758635433 etc
I want to be able to search for documents by their payment reference number either by:
Searching with the 'whole' reference number, e.g. putting in 2-4-2-635844533758635433
Searching with any 'part' of the reference number from the 'start', e.g. 2-4-2-63 (so only the second one in the example is returned)
Note: I do not want to search 'in the middle' or 'at the end', only from the beginning.
Anyway, the hyphens are confusing me.
Questions
1) I am not sure if I should remove them in the mapping like
"char_filter": {
  "removeHyphen": {
    "type": "mapping",
    "mappings": ["-=>"]
  }
},
or not. I have never used mappings in that way, so I'm not sure if this is necessary.
2) I think I need an 'ngrams' filter because I want to be able to search on part of the reference number from the beginning. Something like
"partial_word": {
  "filter": [
    "standard",
    "lowercase",
    "name_ngrams"
  ],
  "type": "custom",
  "tokenizer": "whitespace"
},
and the filter
"name_ngrams": {
  "side": "front",
  "max_gram": 50,
  "min_gram": 2,
  "type": "edgeNGram"
},
I am not sure how to put it all together, but:
"paymentReference": {
  "type": "string",
  "analyzer": "??",
  "fields": {
    "partial": {
      "search_analyzer": "???",
      "index_analyzer": "partial_word",
      "type": "string"
    }
  }
}
Everything that I have tried always seems to 'break' in the second search case.
If I do localhost:9200/orders/_analyze?field=paymentReference&pretty=1 -d "2-4-2-635844533758635433", it always breaks the hyphen out as its own token and returns e.g. all documents containing 2-, which is a lot, and not what I want when searching for 2-4-2-6.
Can someone tell me how to map this field for the two types of searches I am trying to achieve?
Update - Answer
Effectively what Val said below. I just changed the mapping slightly to be more specific re the analyzers and also I don't need the main string indexed because I just query the partial.
Mapping
"paymentReference": {
  "type": "string",
  "index": "not_analyzed",
  "fields": {
    "partial": {
      "search_analyzer": "payment_ref",
      "index_analyzer": "payment_ref",
      "type": "string"
    }
  }
}
Analyzer
"payment_ref": {
  "type": "custom",
  "filter": [
    "lowercase",
    "name_ngrams"
  ],
  "tokenizer": "keyword"
}
Filter
"name_ngrams": {
  "side": "front",
  "max_gram": 50,
  "min_gram": 2,
  "type": "edgeNGram"
}
You don't need to use the mapping char filter for this.
You're on the right track using the Edge NGram token filter since you need to be able to search for prefixes only. I would use a keyword tokenizer instead to make sure the term is taken as a whole. So the way to set this up is like this:
curl -XPUT localhost:9200/orders -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_word": {
          "type": "custom",
          "filter": [
            "lowercase",
            "ngram_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "ngram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  },
  "mappings": {
    "order": {
      "properties": {
        "paymentReference": {
          "type": "string",
          "fields": {
            "partial": {
              "analyzer": "partial_word",
              "type": "string"
            }
          }
        }
      }
    }
  }
}'
Then you can analyze what is going to be indexed into your paymentReference.partial field:
curl -XGET 'localhost:9200/orders/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"
And you get exactly what you want, i.e. all the prefixes:
{
  "tokens" : [
    { "token" : "2-",          "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4",         "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-",        "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2",       "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2-",      "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2-6",     "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2-63",    "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2-635",   "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2-6358",  "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    { "token" : "2-4-2-63584", "start_offset" : 0, "end_offset" : 24, "type" : "word", "position" : 1 },
    ...
Finally you can search for any prefix:
curl -XGET localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3
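The keyword tokenizer plus edgeNGram combination is easy to sketch in Python (a simplification; the real filter also carries offsets, and the analyzer's lowercase step is applied here with .lower()):

```python
def edge_ngrams(term, min_gram=2, max_gram=50):
    """Emulate an edgeNGram token filter applied to a single keyword
    token: emit every prefix between min_gram and max_gram chars."""
    top = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, top + 1)]

prefixes = edge_ngrams("2-4-2-635844533758635433".lower())
print(prefixes[:6])
# ['2-', '2-4', '2-4-', '2-4-2', '2-4-2-', '2-4-2-6']
```

A search for 2-4-2-63 matches because that exact prefix is one of the indexed terms, while nothing is ever emitted for fragments taken from the middle of the reference.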
Not sure whether a wildcard search matches your needs. I defined a custom filter and set preserve_original to true and generate_number_parts to false. Here is the sample code:
PUT test1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "dont_split_on_numerics" ]
        }
      },
      "filter": {
        "dont_split_on_numerics": {
          "type": "word_delimiter",
          "preserve_original": true,
          "generate_number_parts": false
        }
      }
    }
  },
  "mappings": {
    "type_one": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    },
    "type_two": {
      "properties": {
        "raw": {
          "type": "text",
          "analyzer": "myAnalyzer"
        }
      }
    }
  }
}
POST test1/type_two/1
{
  "raw": "2-345-6789"
}
GET test1/type_two/_search
{
  "query": {
    "wildcard": {
      "raw": "2-345-67*"
    }
  }
}
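The reason this works is that preserve_original keeps the whole "2-345-6789" as a single indexed token, which the wildcard query can then match, roughly like shell-style globbing (a loose analogy, not Lucene's matcher; the helper below is a very rough sketch for this one case):

```python
from fnmatch import fnmatch

def word_delimiter(token, preserve_original=True, generate_number_parts=False):
    """Very rough sketch of word_delimiter for an all-numeric-parts
    token: optionally split on '-', optionally keep the original."""
    parts = [p for p in token.split("-") if p] if generate_number_parts else []
    if preserve_original:
        parts.insert(0, token)
    return parts

tokens = word_delimiter("2-345-6789")
print(tokens)  # ['2-345-6789']

# The preserved original is a single term a trailing-* wildcard can match:
print(any(fnmatch(t, "2-345-67*") for t in tokens))  # True
```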

Combine search terms automatically with Elasticsearch?

Using Elasticsearch to search our documents, we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of "term combining". Is there a simple way to do this in ES?
Found a good solution: create a custom analyzer with a shingle filter using "" as the token separator, and use that in a query (use a bool query to combine it with standard queries).
To do this at analysis time, you can also use what is known as a "decompounding" token filter. Here is an example to decompound the text "catdogmouse" into the tokens "cat", "dog", and "mouse":
POST /decom
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["cat", "dog", "mouse"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
  "tokens" : [
    { "token" : "racecatthings", "start_offset" : 1, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "cat",           "start_offset" : 1, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }
  ]
}
And another: (you should be able to extrapolate this to separate "waveboard"
into "wave" and "board")
POST /decom/_analyze?field=body&pretty
catdogmouse
{
  "tokens" : [
    { "token" : "catdogmouse", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "cat",         "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "dog",         "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "mouse",       "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }
  ]
}
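The dictionary_decompounder's behavior can be sketched in a few lines of Python (a simplification: the real filter also offers min/max word length and only_longest_match controls):

```python
def dictionary_decompound(token, word_list):
    """Keep the original token and also emit every dictionary word
    found as a substring of it."""
    subtokens = [token]
    for word in word_list:
        if word in token.lower():
            subtokens.append(word)
    return subtokens

print(dictionary_decompound("waveboard", ["wave", "board"]))
# ['waveboard', 'wave', 'board']
```

Because both the compound and its parts end up in the index, a query for "wave board" now matches documents that only contain "waveboard".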