Elasticsearch Analyzer for Dynamically Defined Regular Expression Searches

We have lots of documents in an Elasticsearch index and are doing full-text searches on them at the moment. My next requirement in this project is to find all credit card data in the documents. Users will also be able to define their own regular-expression search rules dynamically in the future. But with the standard analyzer it is not possible to search for credit card info or any user-defined rule.
For instance, let's say a document contains credit card info such as 4321-4321-4321-4321 or 4321 4321 4321 4321. Elasticsearch indexes this data as four separate tokens, as seen below:
"tokens" : [
{
"token" : "4321",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "4321",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "4321",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "4321",
"start_offset" : 15,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 3
}
]
I am ignoring the Luhn algorithm for now. If I do a basic regular-expression search for a credit card with the pattern "([0-9]{4}[- ]){3}[0-9]{4}", it returns nothing, because the data is not analyzed and indexed for that. I thought I would need to define a custom analyzer for regular-expression searches and store another version of the data in a separate field or index. But, as I said, in the future users will define their own rule patterns for searching.

How should I define the custom analyzer? Should I define an ngram tokenizer (min: 2, max: 20) for that? With an ngram tokenizer I think I could search for all the defined regular-expression rules, but is that reasonable? The project has to work with huge amounts of data without performance problems (a company's whole file system will be indexed). Do you have any other suggestions for this type of data-discovery problem? My main purpose is finding credit cards at the moment. Thanks for helping.
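For reference, the ngram-based index settings I have in mind would look roughly like this (just a sketch; the index and analyzer names are placeholders, and index.max_ngram_diff has to be raised to allow such a wide min/max range):
PUT ngram-test
{
  "settings": {
    "index": {
      "max_ngram_diff": 18
    },
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [ "letter", "digit", "punctuation", "whitespace" ]
        }
      }
    }
  }
}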

OK, here is a pair of custom analyzers that can help you detect credit card numbers and social security numbers. Feel free to adapt the regular expressions as you see fit (by adding/removing other character separators that you find in your data).
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"card_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"card_number"
]
},
"ssn_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"social_number"
]
}
},
"filter": {
"card_number": {
"type": "pattern_replace",
"preserve_original": false,
"pattern": """.*(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4}).*""",
"replacement": "$1$2$3$4"
},
"social_number": {
"type": "pattern_replace",
"preserve_original": false,
"pattern": """.*(\d{3})[\s\.\-]+(\d{2})[\s\.\-]+(\d{4}).*""",
"replacement": "$1$2$3"
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"fields": {
"card": {
"type": "text",
"analyzer": "card_analyzer"
},
"ssn": {
"type": "text",
"analyzer": "ssn_analyzer"
}
}
}
}
}
}
Let's test this.
POST test/_analyze
{
"analyzer": "card_analyzer",
"text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}
Will yield a nice digit-only credit card number:
{
"tokens" : [
{
"token" : "3526472847236374",
"start_offset" : 0,
"end_offset" : 86,
"type" : "word",
"position" : 0
}
]
}
Similarly for SSN:
POST test/_analyze
{
"analyzer": "ssn_analyzer",
"text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}
Will yield a nice digit-only social security number:
{
"tokens" : [
{
"token" : "442231452",
"start_offset" : 0,
"end_offset" : 86,
"type" : "word",
"position" : 0
}
]
}
And now we can search for either a credit card or an SSN. Let's say we have the following two documents. The SSN and credit card numbers are the same, yet they use different character separators:
POST test/_doc
{ "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374" }
POST test/_doc
{ "text": "SSN is 442.23.1452 belongs to Mr. XYZ. He paid $20 via credit card number 3526-4728-4723-6374" }
You can now find both documents by looking for the credit card number and/or SSN in any format:
POST test/_search
{
"query": {
"match": {
"text.card": "3526 4728 4723 6374"
}
}
}
POST test/_search
{
"query": {
"match": {
"text.card": "3526 4728 4723-6374"
}
}
}
POST test/_search
{
"query": {
"match": {
"text.ssn": "442 23-1452"
}
}
}
All the above queries will match and return both documents.
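Since users will define new rules later, note that additional pattern filters and analyzers can be added to an existing index without recreating it: close the index, update the analysis settings, reopen it, add a new sub-field, and then reindex the existing documents. Here is a rough sketch under those assumptions (the phone_number rule, its pattern, and the phone_analyzer/text.phone names are hypothetical placeholders, not part of the original setup):
POST test/_close

PUT test/_settings
{
  "analysis": {
    "analyzer": {
      "phone_analyzer": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": [
          "lowercase",
          "phone_number"
        ]
      }
    },
    "filter": {
      "phone_number": {
        "type": "pattern_replace",
        "pattern": """.*(\d{3})[\s\.\-]+(\d{3})[\s\.\-]+(\d{4}).*""",
        "replacement": "$1$2$3"
      }
    }
  }
}

POST test/_open

PUT test/_mapping
{
  "properties": {
    "text": {
      "type": "text",
      "fields": {
        "phone": {
          "type": "text",
          "analyzer": "phone_analyzer"
        }
      }
    }
  }
}

POST test/_update_by_query
The final _update_by_query re-indexes the existing documents so the new text.phone sub-field gets populated; documents indexed afterwards pick it up automatically.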

Related

Tokens at index time vs. query time are not the same when using the common_grams filter in Elasticsearch

I want to use the common_grams token filter based on this link.
My Elasticsearch version is 7.17.8.
Here are the settings of my index in Elasticsearch.
I have defined a filter named "common_grams" that uses "common_grams" as its type.
I have defined a custom analyzer named "index_grams" that uses "whitespace" as its tokenizer and the above filter as a token filter.
I have just one field, named "title_fa", and I have used my custom analyzer for this field.
PUT /my-index-000007
{
"settings": {
"analysis": {
"analyzer": {
"index_grams": {
"tokenizer": "whitespace",
"filter": [ "common_grams" ]
}
},
"filter": {
"common_grams": {
"type": "common_grams",
"common_words": [ "the","is" ]
}
}
}
},
"mappings": {
"properties": {
"title_fa": {
"type": "text",
"analyzer": "index_grams",
"boost": 40
}
}
}
}
It works fine at index time and the tokens are what I expect them to be. Here I get the tokens via the Kibana dev tools.
GET /my-index-000007/_analyze
{
"analyzer": "index_grams",
"text" : "brown is the"
}
Here is the result of the tokens for the text.
{
"tokens" : [
{
"token" : "brown",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "brown_is",
"start_offset" : 0,
"end_offset" : 8,
"type" : "gram",
"position" : 0,
"positionLength" : 2
},
{
"token" : "is",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "is_the",
"start_offset" : 6,
"end_offset" : 12,
"type" : "gram",
"position" : 1,
"positionLength" : 2
},
{
"token" : "the",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 2
}
]
}
When I search the query "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details (screenshot: Query Time Tokens).
UPDATE:
I have added a sample document like this:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search "brown coat"
GET /my-index-000007/_search
{
"query": {
"query_string": {
"query": "brown is coat",
"default_field": "title_fa"
}
}
}
it returns the document because it searches:
["brown", "coat"]
When I search "brown is coat", it can't find the document because it is searching for
["brown is coat", "brown_is coat", "brown is_coat"]
Clearly, when it gets a query that contains a common word, it acts differently, and I guess it's because of the difference between the index-time tokens and the query-time tokens.
Do you know where I am going wrong? Why is it acting differently?
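One way to inspect the tokens a query is actually rewritten to at search time is the validate API with explain enabled; a quick sketch against the index above:
GET /my-index-000007/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "brown is the",
      "default_field": "title_fa"
    }
  }
}
The "explanation" field of the response shows the low-level query (and thus the analyzed terms) that Elasticsearch will execute.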

How do synonyms work internally in Elasticsearch?

I came across Elasticsearch some time ago and started exploring it. I got to know about the synonyms feature, which is amazing! Can someone explain how this whole synonyms process works internally? How do index-time synonym analysis and search-time synonym analysis differ in terms of internal structure?
Thanks :)
Elastic Doc:
Typically, the same analyzer should be applied at index time and at
search time, to ensure that the terms in the query are in the same format as
the terms in the inverted index.
When you use a search_analyzer with synonyms, you generate the synonym tokens for the search term only at search time.
When you use synonyms at indexing time, you expand each term to the other terms of the synonym set, that is, everything is already there in the inverted index. This can increase your storage, as you are indexing more terms.
IndexTime example:
PUT synonym_index_time
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonyms_filter"
]
}
},
"filter": {
"synonyms_filter": {
"type": "synonym",
"lenient": true,
"synonyms": [
"laptop, notebook"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
}
Test:
GET synonym_index_time/_analyze
{
"field": "name",
"text": ["laptop"]
}
Results:
{
"tokens" : [
{
"token" : "laptop",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "notebook",
"start_offset" : 0,
"end_offset" : 6,
"type" : "SYNONYM",
"position" : 0
}
]
}
Notice that both terms, laptop and notebook, have been indexed, even though only laptop appeared in the text; notebook was added as a synonym.
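For comparison, a rough sketch of the search-time variant (the index name, analyzer name, and the choice of synonym_graph are my own, not from the question): the field is indexed with the plain standard analyzer, and the synonyms are only expanded when the query is analyzed.
PUT synonym_search_time
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        },
        "filter": {
          "synonyms_filter": {
            "type": "synonym_graph",
            "lenient": true,
            "synonyms": [
              "laptop, notebook"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_search_analyzer"
      }
    }
  }
}
With this setup only the literal term from the document is stored in the inverted index, and a match query for notebook is expanded to laptop OR notebook at query time, so the index stays smaller at the cost of a slightly more expensive query.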

Elasticsearch - Apply the appropriate analyzer to get accurate results

I am new to Elasticsearch. I would like to apply an analyzer that satisfies the search behavior below.
Let's take an example.
Suppose I have entered the text below in a document:
I am walking now
I walked to Ahmedabad
Everyday I walk in the morning
Anil walks in the evening.
I am hiring candidates
I hired candidates
Everyday I hire candidates
He hires candidates
Now when I search with
text "walking"
result should be [walking, walked, walk, walks]
text "walked"
result should be [walking, walked, walk, walks]
text "walk"
result should be [walking, walked, walk, walks]
text "walks"
result should be [walking, walked, walk, walks]
The same results should also apply for hire.
text "hiring"
result should be [hiring, hired, hire, hires]
text "hired"
result should be [hiring, hired, hire, hires]
text "hire"
result should be [hiring, hired, hire, hires]
text "hires"
result should be [hiring, hired, hire, hires]
Thank You,
You need to use a stemmer token filter.
Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search.
For example, walking and walked can be stemmed to the same root word:
walk. Once stemmed, an occurrence of either word would match the other
in a search.
Mapping
PUT index36
{
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [ "stemmer" ,"lowercase"]
}
}
}
}
}
Analyze
GET index36/_analyze
{
"text": ["walking", "walked", "walk", "walks"],
"analyzer": "my_analyzer"
}
Result
{
"tokens" : [
{
"token" : "walk",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "walk",
"start_offset" : 8,
"end_offset" : 14,
"type" : "word",
"position" : 101
},
{
"token" : "walk",
"start_offset" : 15,
"end_offset" : 19,
"type" : "word",
"position" : 202
},
{
"token" : "walk",
"start_offset" : 20,
"end_offset" : 25,
"type" : "word",
"position" : 303
}
]
}
All four words produce the same token, "walk", so any of these words will match the others in a search.
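To see it end to end, here is a small sketch using the index36 mapping above (the sample sentence is taken from the question; refresh=true is only there so the document is immediately searchable):
POST index36/_doc?refresh=true
{ "title": "Everyday I walk in the morning" }

GET index36/_search
{
  "query": {
    "match": {
      "title": "walking"
    }
  }
}
The match query analyzes "walking" with the same analyzer, so it is reduced to walk and the document is returned.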
What you are searching for is a language analyzer; see the documentation here.
A language analyzer always consists of a tokenizer and a set of token filters, as the example below shows.
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
You can now use the analyzer in your index-mapping like this:
{
  "mappings": {
    "properties": {
      "myField": {
        "type": "text",
        "analyzer": "rebuilt_english"
      }
    }
  }
}
Remember to use a match query in order to query full-text.
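For example, a quick sketch assuming the mapping above was added to the english_example index (the document text and query terms are made up for illustration):
POST english_example/_doc?refresh=true
{ "myField": "The foxes jumped over the lazy dog" }

POST english_example/_search
{
  "query": {
    "match": {
      "myField": "fox jumping"
    }
  }
}
Both the indexed text and the query go through rebuilt_english, so foxes/fox are stemmed to fox and jumped/jumping to jump, and the document matches.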

ElasticSearch catenate_words -- only keep concatenated value

Following examples here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
Specifically the catenate_words option.
I would like to use this to concatenate words that I can then use in a phrase query together with the words before and after the concatenated word, but the emitted word parts prevent this.
For example, their example is this:
super-duper-xl → [ superduperxl, super, duper, xl ]
Now if my actual phrase was "what a great super-duper-xl" that would turn into a sequence:
[what,a,great,superduperxl,super,duper,xl]
That matches the phrase "great superduperxl" which is fine.
However, if the phrase was "the super-duper-xl emerged" the sequence would be:
[the,superduperxl,super,duper,xl,emerged]
This does not phrase-match "superduperxl emerged"; however, it would if the part tokens (super, duper, xl) were not emitted.
Is there any way I can concatenate words keeping only the concatenated word and filtering out the word parts?
A pattern_replace character filter can be used here:
"-" is replaced with "" before tokenization, so the hyphenated parts are emitted as a single concatenated token.
Query
PUT my-index1
{
"settings": {
"analysis": {
"analyzer": {
"remove_hyphen_analyzer": {
"tokenizer": "standard",
"char_filter": [
"remove_hyphen_filter"
]
}
},
"char_filter": {
"remove_hyphen_filter": {
"type": "pattern_replace",
"pattern": "-",
"replacement": ""
}
}
}
},
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "remove_hyphen_analyzer"
}
}
}
}
POST my-index1/_analyze
{
"analyzer": "remove_hyphen_analyzer",
"text": "the super-duper-xl emerged"
}
Result
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "superduperxl",
"start_offset" : 4,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "emerged",
"start_offset" : 19,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
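With that analyzer in place, the phrase query from the question now matches; a quick sketch (refresh=true is just for immediate searchability):
POST my-index1/_doc?refresh=true
{ "title": "the super-duper-xl emerged" }

POST my-index1/_search
{
  "query": {
    "match_phrase": {
      "title": "superduperxl emerged"
    }
  }
}
Because only the single token superduperxl is indexed at position 1, followed by emerged at position 2, the phrase matches, which is exactly what the word-part tokens were preventing.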

Elasticsearch, search for domains in URLs

We index HTML documents which may include links to other documents. We're using elasticsearch and things are pretty smooth for most keyword searches, which is great.
Now, we're adding more complex searches similar to Google's site: or link: searches: basically we want to retrieve documents which point to either specific URLs or even domains. (If document A has a link to http://a.site.tld/path/, the search link:http://a.site.tld should yield it.)
And we're now trying to figure out the best way to achieve this.
So far, we have extracted the links from the documents and added a links field to our documents. We set the links field to be not analyzed. We can then do searches that match the exact URL, link:http://a.site.tld/path/, but of course link:http://a.site.tld does not yield anything.
Our initial idea would be to create a new field linkedDomains which would work similarly... but there may be better solutions?
You could try the Path Hierarchy Tokenizer:
Define a mapping as follows:
PUT /link-demo
{
"settings": {
"analysis": {
"analyzer": {
"path-analyzer": {
"type": "custom",
"tokenizer": "path_hierarchy"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"link": {
"type": "string",
"index_analyzer": "path-analyzer"
}
}
}
}
}
Index a doc:
POST /link-demo/doc
{
link: "http://a.site.tld/path/"
}
The following term query returns the indexed doc:
POST /link-demo/_search?pretty
{
"query": {
"term": {
"link": {
"value": "http://a.site.tld"
}
}
}
}
To get a feel for how this is being indexed:
GET link-demo/_analyze?analyzer=path-analyzer&text="http://a.site.tld/path"&pretty
Shows the following:
{
"tokens" : [ {
"token" : "\"http:",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "\"http:/",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
}, {
"token" : "\"http://a.site.tld",
"start_offset" : 0,
"end_offset" : 18,
"type" : "word",
"position" : 1
}, {
"token" : "\"http://a.site.tld/path\"",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
} ]
}
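Note that the mapping above uses the legacy string type and index_analyzer parameter from older Elasticsearch versions. For current versions a rough equivalent would be a text field analyzed with path_hierarchy at index time and left untouched at search time (the search_analyzer choice here is my assumption, not part of the original answer):
PUT /link-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "link": {
        "type": "text",
        "analyzer": "path-analyzer",
        "search_analyzer": "keyword"
      }
    }
  }
}
A term or match query for http://a.site.tld then matches any document whose link starts with that prefix, because the path_hierarchy tokenizer emitted that prefix as a token at index time.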
