Elasticsearch email search mismatch with match query using "com"

GET candidates1/candidate/_search
{
  "fields": ["contactInfo.emails.main"],
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "contactInfo.emails.main": "com"
          }
        }
      ]
    }
  }
}
GET candidates1/candidate/_search
{
  "size": 5,
  "fields": ["contactInfo.emails.main"],
  "query": {
    "match": {
      "contactInfo.emails.main": "com"
    }
  }
}
Hi,
When I use the above query I get results like ['nraheem@dbtech1.com', 'arelysf456@gmai1.com', 'ron@rgb52.com'], but I do not get emails like ['pavann.aryasomayajulu@gmail.com', 'kumar@gmail.com', 'raj@yahoo.com'].
However, when I query for "gmail.com", I do get results that contain gmail.com.
So my question is: when I use "com" in the first query, I expect results that include gmail.com, since "com" is present in gmail.com. But that is not happening.
Note: we have almost 2 million email IDs, and most of them are gmail.com, yahoo.com or hotmail; only a few are of other types.

"contactInfo.emails.main" fields seem to be an analyzed field.
In elasticsearch all string fields are analyed using Standard Analyzer and are converted into tokens.You can see how your text is getting analyzed using analyze api. Email Ids mentioned by you ending in number before com are getting analyzed as nraheem , dbtech1 , com. Use following query to see the tokens.
curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer": "standard",
  "text": "nraheem@dbtech1.com"
}'
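The response will contain something like the following (abridged; offsets and token types omitted), confirming that a standalone com term is produced:

{
  "tokens": [
    { "token": "nraheem" },
    { "token": "dbtech1" },
    { "token": "com" }
  ]
}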
As you can see, a separate term com is created. If you analyze kumar@gmail.com instead, you get tokens like kumar and gmail.com; no separate com token is created in that case.
This is because the Standard Analyzer splits terms on special characters such as @, and also on a dot that sits between a digit and a letter, which is why dbtech1.com becomes dbtech1 and com while gmail.com stays a single token. You can create a custom analyzer to meet your requirement.
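For example, if the requirement is that searching for "com" should match every email, a custom analyzer built on the letter tokenizer splits all addresses at non-letter characters, so gmail.com and dbtech1.com both produce a com token. A minimal sketch in ES 2.x-style syntax to match the candidate type in the question (candidates2 and email_parts are made-up names, and you would need to reindex into the new index):

PUT candidates2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_parts": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "candidate": {
      "properties": {
        "contactInfo": {
          "properties": {
            "emails": {
              "properties": {
                "main": { "type": "string", "analyzer": "email_parts" }
              }
            }
          }
        }
      }
    }
  }
}

With this mapping, kumar@gmail.com is indexed as [ kumar, gmail, com ], so the original match query on "com" finds it too.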
Hope this helps!!


Searching documents indexed via ingest-attachment in Elasticsearch on the basis of attributes
I'm new to Elasticsearch and want to index files with attributes like Author, Title, Subject, Category, Community etc.
How far I got:
I was able to create an attachment pipeline and ingest documents with attributes into Elasticsearch. Here is how I did it:
1) Created the pipeline with the following request:
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
2) Uploaded an attachment with the following request:
{
  "filename": "Presentations-Tips.ppt",
  "Author": "Jaspreet",
  "Category": "uploading ppt",
  "Subject": "testing ppt",
  "fileUrl": "",
  "attributes": {
    "attr11": "attr11value",
    "attr22": "attr22value",
    "attr33": "attr33value"
  },
  "data": "here_base64_string_of_file"
}
3) Then I was able to search freely on all the above attributes and on the file content as well:
{
  "query": {
    "query_string": {
      "query": "*test*"
    }
  }
}
Now what I want:
I want to narrow the searches down with filters, for example:
search all documents whose Author must be "Rohan"
then all documents whose Author must be "Rohan" and whose Category must be "Education"
then all documents whose Author contains the letters "han" and whose Category contains the letters "Tech"
then all documents whose Author is "Rohan", combined with a full-text search for "progress" across all fields, i.e. first narrow down by author, then run the full-text search on that result set
Please help me with the proper query syntax and call URL; for the full-text search above I used 'GET /my_index/_search'.
After spending some time I finally got the answer:
curl -X POST \
  http://localhost:9200/my_index/_search \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "progress"
          }
        },
        {
          "wildcard": {
            "Author": "Rohan"
          }
        },
        {
          "wildcard": {
            "Title": "q*"
          }
        }
      ]
    }
  }
}'
In the query above you can add or remove objects in the must array as needed.
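For the "contains" style filters from the question (Author containing "han", Category containing "Tech"), wildcard patterns with leading and trailing * might look like the sketch below. Note this makes an assumption about your mapping: wildcard queries are not analyzed, so against standard-analyzed fields the pattern has to match the lowercased tokens (rohan, tech):

curl -X POST \
  http://localhost:9200/my_index/_search \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "Author": "*han*"
          }
        },
        {
          "wildcard": {
            "Category": "*tech*"
          }
        }
      ]
    }
  }
}'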

How can I achieve this type of query in Elasticsearch?

I have added a document like this to my index
POST /analyzer3/books
{
  "title": "The other day I went with my mom to the pool and had a lot of fun"
}
And then I do queries like this
GET /analyzer3/_analyze
{
  "analyzer": "english",
  "text": "\"The * day I went with my * to the\""
}
And it successfully returns the previously added document.
My idea is to use quotes so that the query becomes exact, but also wildcards that can replace any word. Google has exactly this functionality: you can search for queries like "I'm * the university" and it returns page results containing text like I'm studying in the university right now, etc.
However, I want to know if there's another way to do this.
My main concern is that this doesn't seem to work with other languages like Japanese and Chinese. I've tried with many analyzers and tokenizers to no avail.
Any answer is appreciated.
Exact matches on tokenized fields are not that straightforward. It is better to save your field as keyword if you have such requirements.
Additionally, the keyword data type supports the wildcard query, which can help with your wildcard searches.
So just create a keyword-type sub-field, then use the wildcard query on it.
Your search query will look something like this:
GET /_search
{
  "query": {
    "wildcard": {
      "title.keyword": "The * day I went with my * to the"
    }
  }
}
In the above query, it is assumed that the title field has a sub-field named keyword of data type keyword.
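If your mapping does not already define it, a multi-field along these lines creates that sub-field (a sketch, assuming Elasticsearch 7+ where mapping types are gone; analyzer3 and title are taken from the question):

PUT /analyzer3
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Note that dynamic mapping already creates a .keyword sub-field for strings by default, so it may exist in your index without any extra work.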
More on wildcard query can be found here.
If you still want to do exact searches on text data type, then read this
Elasticsearch doesn't have Google-like search out of the box, but you can build something similar.
Let's assume that when someone quotes a search text, what they want is a match_phrase query: basically, remove the \" and search for the remaining string as a phrase.
PUT test/_doc/1
{
  "title": "The other day I went with my mom to the pool and had a lot of fun"
}
GET test/_search
{
  "query": {
    "match_phrase": {
      "title": "The other day I went with my mom to the pool and had a lot of fun"
    }
  }
}
For the * it gets a little more interesting. You could split the query into multiple phrase searches and combine them. Example:
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "title": "The"
          }
        },
        {
          "match_phrase": {
            "title": "day I went with my"
          }
        },
        {
          "match_phrase": {
            "title": "to the"
          }
        }
      ]
    }
  }
}
Or you could use slop in the phrase search. All the terms in your search query still have to be present (unless they are removed by the tokenizer or as stop words), but the matched phrase may contain additional words. Here each * can be replaced by one other word, so a slop of 2 in total. If you want more than one word in place of each *, pick a higher slop:
GET test/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "The * day I went with my * to the",
        "slop": 2
      }
    }
  }
}
Another alternative might be shingles, but this is a more advanced concept and I would start off with the basics for now.

Elasticsearch: like query not working when the string contains & (ampersand)?

I have a query which is not working. I think it is because the searched value contains an & (ampersand) symbol. I am talking about the following query.
GET <path>/_search
{
  "from": 0,
  "size": 10000,
  "query": {
    "query_string": {
      "query": "(Department_3037.Department_3037_analyzed:*P&C*)"
    }
  }
}
Why is this query not working, and how can I overcome this issue? I need a "like" query for strings containing &, such as P&C, L&T etc.
Let me know how this can be fixed.
Thank you.
The field you are searching in is analyzed. The text (Department_3037.Department_3037_analyzed:*P&C*) is tokenized as: Department_3037, Department_3037_analyzed, P, C.
You can verify this with:
curl -XGET "http://localhost:9200/_analyze?tokenizer=standard" -d "(Department_3037.Department_3037_analyzed:*P&C*)"
You will get the following tokens:
{
  "tokens": [
    { "token": "Department_3037", "start_offset": 1, "end_offset": 16, "type": "<ALPHANUM>", "position": 1 },
    { "token": "Department_3037_analyzed", "start_offset": 17, "end_offset": 41, "type": "<ALPHANUM>", "position": 2 },
    { "token": "P", "start_offset": 43, "end_offset": 44, "type": "<ALPHANUM>", "position": 3 },
    { "token": "C", "start_offset": 45, "end_offset": 46, "type": "<ALPHANUM>", "position": 4 }
  ]
}
If you want to retrieve the documents, you will have to escape the special characters:
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "\\(Department_3037\\.Department_3037_analyzed\\:\\*P\\&C\\*\\)"
    }
  }
}
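Alternatively, if you control the mapping, a not-analyzed variant of the field sidesteps tokenization entirely, since a keyword field preserves the & as-is. A sketch, assuming a keyword sub-field named raw exists on the field (the sub-field name is made up):

{
  "query": {
    "wildcard": {
      "Department_3037.Department_3037_analyzed.raw": "*P&C*"
    }
  }
}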
Hope it helps.

Product search with Elasticsearch

I am relatively new to Elasticsearch and I want to implement a search for products by brand and type names.
I have already experimented a bit, but I think I am missing something important for a solid search. Here is my approach:
A product looks e.g. like this:
{
  brandName: "Samsung",
  typeName: "PS-50Q7HX",
  ...
}
I will have a single input field. The user can search for a brand or type only, or for a brand in combination with a type name, e.g.
Samsung | Samsung PS-50Q7HX | PS-50Q7HX
To compensate for typos in the typeName field I use an ngram tokenizer, which works great when I search for types only, but gets me into trouble in combination with the brandName field. Using something like this does not work well (especially when I use an ngram tokenizer on the brandName field too):
{
  "query": {
    "multi_match": {
      "query": "Samsung PS 50Q 7HX",
      "type": "cross_fields",
      "fields": ["brandName", "typeName"]
    }
  }
}
Of course I know why this does not work well with two ngram tokenizers and a mixed field, but I am not sure how to solve it best.
I think the main problem is that I do not know whether the user entered a brand name or not. I thought about using a second index filled with all available brands, which I would use for a "pre-search" to detect a brand name in the query string. If I find a match, I can split the search string into type and brand name and perform a more specific search, like this one:
{
"query": {
"bool": {
"must": [
{ "match": { "brandName": "Samsung" } },
{ "match": { "typeName": "PS-50Q7HX" } }
]
}
}
}
Does this sound like a good approach? Or does anyone see a better way?
Any help is appreciated!
Thank you very much and best regards,
Stefan
To compensate for user typos you used an ngram analyzer, which is a costly one. You could use a stemming analyzer, which provides some flexible options for tolerating such mistakes.
In my opinion, instead of indexing this in two different fields, you could index it as a single field,
e.g.: "FIELD_NAME": "Samsung|PS-50Q7HX"
i.e. brand name and product name joined with some delimiter; I used |. Analyze this field's values on the delimiter, so your content is indexed as the following tokens:
Samsung
PS-50Q7HX
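A sketch of such a field definition (the index, analyzer and tokenizer names are made up, and an ES 5+ text field is assumed): the pattern tokenizer splits the stored value on |, while a separate whitespace-based search analyzer splits the user's query, which is space-separated rather than pipe-separated:

PUT products
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "pipe_tokenizer": {
          "type": "pattern",
          "pattern": "\\|"
        }
      },
      "analyzer": {
        "pipe_split": {
          "type": "custom",
          "tokenizer": "pipe_tokenizer",
          "filter": ["lowercase"]
        },
        "space_split": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "FIELD_NAME": {
        "type": "text",
        "analyzer": "pipe_split",
        "search_analyzer": "space_split"
      }
    }
  }
}

This indexes "Samsung|PS-50Q7HX" as [ samsung, ps-50q7hx ] and analyzes the query "Samsung PS-50Q7HX" into the same two tokens.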
Then you could search with the following query:
{
  "query": {
    "query_string": {
      "query": "Samsung PS-50Q7HX",
      "default_operator": "or",
      "fields": [
        "FIELD_NAME"
      ]
    }
  }
}
This will retrieve documents whose brand name is Samsung or whose product name is PS-50Q7HX. You could also use a prefix search, and if you set default_operator to and, your search will be more precise.

Elasticsearch bulk or search

Background
I am working on an API that allows the user to pass in a list of details about a member (name, email addresses, ...). I want to use this information to match against account records in my Elasticsearch database and return a list of potential matches.
I thought this would be as simple as a bool query on the fields I want, but I seem to be getting no hits.
I'm relatively new to Elasticsearch; my current _search request looks like this.
Example Query
POST /member/account/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "should": [
            {
              "term": {
                "email": "jon.smith@gmail.com"
              }
            },
            {
              "term": {
                "email": "samy@gmail.com"
              }
            },
            {
              "term": {
                "email": "bo.blog@gmail.com"
              }
            }
          ]
        }
      }
    }
  }
}
Question
How should I update this query to return records that match any of the email addresses?
Am I able to prioritise records that also match another field, for example "family_name"?
Will this be a problem if I need to do this against a few hundred email addresses?
Well, you need to make the change on the index side rather than the query side.
By default your email ID is broken up while indexing:
jon.smith@gmail.com => [ jon , smith , gmail , com ]
Now when you search with a term query, no analyzer is applied; it tries to find an exact match for jon.smith@gmail.com which, as you can see, won't work.
Even if you use a match query, you will end up getting all documents as matches.
Hence you need to change the mapping to index the email ID as a single token rather than tokenizing it, so using not_analyzed is the best solution here.
When you define the email field as not_analyzed, the following happens while indexing:
jon.smith@gmail.com => [ jon.smith@gmail.com ]
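A sketch of that mapping, in ES 2.x syntax (the filtered query in your example suggests a 2.x cluster); note that an existing field cannot be switched in place, so this means creating a new index and reindexing:

PUT /member
{
  "mappings": {
    "account": {
      "properties": {
        "email": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}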
After changing the mapping and reindexing all your documents, you can freely run the above query.
I would suggest using a terms query, as follows:
{
  "query": {
    "terms": {
      "email": [
        "jon.smith@gmail.com",
        "samy@gmail.com",
        "bo.blog@gmail.com"
      ]
    }
  }
}
To answer the second part of your question: you are looking for boosting, and I recommend reading up on the function score query.
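As a starting point, a plain bool query can already do simple prioritisation without a full function score setup: keep the terms query as the required clause and add an optional should clause on family_name, so records matching both rank higher (a sketch; the value "smith" is made up):

POST /member/account/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "email": [
              "jon.smith@gmail.com",
              "samy@gmail.com",
              "bo.blog@gmail.com"
            ]
          }
        }
      ],
      "should": [
        {
          "match": {
            "family_name": {
              "query": "smith",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}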
