ElasticSearch / Lucene query strict matching child fields

Say I have an Elasticsearch index of songs, and the artist field can contain multiple artists.
I want to find Michael Jackson songs, so I might use a query like this:
artist.first_name: Michael AND artist.last_name: Jackson
However I recently noticed that might return me a result like this:
{
  title: 'Some Janet Jackson Song feat. Michael Bublé',
  artist: [
    {first_name: 'Michael', last_name: 'Bublé'},
    {first_name: 'Janet', last_name: 'Jackson'}
  ]
}
Note I have one artist with the first name "Michael" and another with the last name "Jackson", so technically this song matches my query.
I don't know the right words to search for this issue. Is this a problem with how my search index is structured? Can I formulate my query in a way that avoids this? Ideally I don't want a full_name field with these values concatenated or anything like that.
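For reference, this cross-matching happens because Elasticsearch flattens arrays of objects into a single set of field values by default. One common fix, assuming you can reindex, is to map artist as a nested type and use a nested query so that first_name and last_name must match within the same artist object. A minimal sketch (the index name songs is hypothetical; syntax for recent ES versions):
PUT /songs
{
  "mappings": {
    "properties": {
      "artist": { "type": "nested" }
    }
  }
}
GET /songs/_search
{
  "query": {
    "nested": {
      "path": "artist",
      "query": {
        "bool": {
          "must": [
            { "match": { "artist.first_name": "Michael" } },
            { "match": { "artist.last_name": "Jackson" } }
          ]
        }
      }
    }
  }
}
With this mapping, the Janet Jackson / Michael Bublé document no longer matches, because no single artist object satisfies both clauses.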

Related

Elasticsearch - query primary and secondary attribute with different terms

I'm using Elasticsearch to query data that was originally exported out of several relational databases that had a lot of redundancies. I now want to perform queries where I have a primary attribute and one or more secondary attributes that should match. I tried using a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:
Example:
I have a document with the full name and street name of a user and I want to search for similar users in different indices. The best match for my query should be the best match on the fullname field and the best match on the street field. But since the original data has a lot of redundancies and inconsistencies, the fullname field (which I manually created out of the fields name1, name2, name3) may contain the same name multiple times, and it seems that Elasticsearch ranks a double match in a must field higher than a match in a should field.
That means, I want to query for John Doe Back Street with the following sample data:
{
"fullname" : "John Doe John and Jane",
"street" : "Main Street"
}
{
"fullname" : "John Doe",
"street" : "Back Street"
}
Long story short, I want to query for the main attribute fullname - John Doe and the secondary attribute street - Back Street, and I want the second document to be the best match, not the first, which currently wins because it contains John multiple times.
Manipulating relevance in Elasticsearch is not the easiest part. Score calculation is based on three main parts:
Term frequency
Inverse document frequency
Field-length norm
In short:
the more often the term occurs in the field, the MORE relevant the document is
the more often the term occurs in the entire index, the LESS relevant it is
the longer the field is, the LESS relevant a match in it is
I recommend reading the following materials:
What Is Relevance?
Theory Behind Relevance Scoring
Controlling Relevance and subpages
If, in your case, a match on fullname is generally more important than a match on street, you can boost the importance of the former. Below is example code based on my working code:
{
  "query": {
    "multi_match": {
      "query": "john doe",
      "fields": [
        "fullname^10",
        "street"
      ]
    }
  }
}
In this example a match on fullname counts ten times (^10) more than a match on street. You can try to manipulate the boost or use other ways to control relevance, but as I mentioned at the beginning, it is not easy and everything depends on your particular situation, mostly because of the "inverse document frequency" part, which considers terms from the entire index: each document added to the index will probably change the score of the same search query.
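If the duplicated names are the core problem, one option (a hedged sketch, reusing the question's field names and query terms) is to wrap each match in a constant_score query, so a clause scores the same whether "John" occurs once or three times in fullname, and to express the primary/secondary weighting purely through boost:
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": { "match": { "fullname": "John Doe" } },
            "boost": 10
          }
        },
        {
          "constant_score": {
            "filter": { "match": { "street": "Back Street" } }
          }
        }
      ]
    }
  }
}
This trades fine-grained TF/IDF ranking for predictability, which may or may not suit your data.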
I know that I did not answer directly, but I hope this helped you understand how scoring works.

ES reversed filtering

Pardon the title; I'm not sure how better to describe the problem.
Anyway, I have a table with entries(id, name, age, ..dynamic fields) and filter_groups(id, filters[]).
Each filter group has a list of filters of the form {filter, field, value} that is used client-side to filter an HTML table of entries.
Examples:
[{ filter: 'less_than', field: 'age', value: 10 },
 { filter: 'is', field: 'name', value: "john doe" }]
If I wanted to fetch all the entries matching a particular filter group, it seems fairly straightforward to construct the query and send it to ES using the filters.
However, if you reverse the situation and, given an entry, want to fetch all the filter_groups whose filters match that entry, how would you go about doing this?
Thanks
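One name for this "reverse search" is percolation: Elasticsearch's percolator lets you index each filter group as a stored query and then match a document against all stored queries. A minimal sketch (index and field names are hypothetical; syntax for recent ES versions with a percolator-typed field):
PUT /filter_groups
{
  "mappings": {
    "properties": {
      "query": { "type": "percolator" },
      "age": { "type": "integer" },
      "name": { "type": "keyword" }
    }
  }
}
PUT /filter_groups/_doc/1
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "age": { "lt": 10 } } },
        { "term": { "name": "john doe" } }
      ]
    }
  }
}
GET /filter_groups/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": { "name": "john doe", "age": 7 }
    }
  }
}
The search returns every stored filter group whose query matches the given entry.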

couchdb view query based on multiple fields

I'm new to couchdb and stuck with one scenario. I have the following data.
{
_id:"1",
firstName: "John",
lastName: "John"
}
I am writing a view to return documents where firstName="John" or lastName="John" and have the following map. So, the query will be /view/byName?key="John"
function (doc) { emit(doc.firstName, doc); emit(doc.lastName, doc); }
I can filter out the duplicates in reduce, however I am searching for a way to filter the documents in map.
If by filter you mean getting all unique values, then reduce is the right way to do it; CouchDB: The Definitive Guide suggests this as well. Just create a dummy reduce
function (keys, values) { return true; }
and call your view with ?group=true, and you will get all the unique results.
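For example (the database and design document names are hypothetical):
curl 'http://localhost:5984/mydb/_design/users/_view/byName?key="John"&group=true'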
If I understand you correctly, you want to get both documents in the case of "John Smith" and "John Black", but "John John" should be reported once.
CouchDB gives you the unique set of rows with respect to the key ("John" in your example). Just emit the pair of name and document id ([doc.firstName, doc._id] and [doc.lastName, doc._id]) and reduce will do what you want.
["John", "ID_OF_SMITH"] != ["John", "ID_OF_BLACK"]
["John", "ID_OF_JOHNJOHN"] == ["John", "ID_OF_JOHNJOHN"]

elasticsearch find posts by comments

I'm trying to write a query that finds articles based on their comments.
So say a user is searching for "chocolates":
{
type: "article",
id: "myArticle1",
title: "something about brown food"
}
{
body: "I love chocolates!",
type:"comment",
commentOf: "myArticle1"
}
In this example I have both documents in the same index and I'm trying to get the "myArticle1" document via the comment matching chocolates in body. How do I do this? Is it with the top_children query?
You can use the parent-child relationship in ES to achieve this:
Define the parent (article) and child (comment)
Index the data. Indexing child documents differs from normal indexing: you need to specify the parent in the index request
Use a has_child query to find articles that match some fields in their comments
I wrote a full working sample script for it: https://gist.github.com/dqduc/efa66047358dac66461b
You can run it to test and send me your feedback. I guess you're new to ES and its parent-child relationships.
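A minimal sketch of the idea in the classic pre-6.x _parent syntax (the index name myindex is hypothetical; type and field names come from the question):
PUT /myindex
{
  "mappings": {
    "article": {},
    "comment": {
      "_parent": { "type": "article" }
    }
  }
}
PUT /myindex/comment/1?parent=myArticle1
{ "body": "I love chocolates!" }
GET /myindex/article/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": { "match": { "body": "chocolates" } }
    }
  }
}
The has_child query returns parent articles whose child comments match, which replaces the older top_children query mentioned in the question.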

How do I model my document in MongoDB to make it paginable for nested attributes?

I'm trying to cache my tweets and show them based on the keywords a user saves. However, as tweets grow over time, I need to paginate them.
I'm using Ruby and Mongoid, and this is what I have come up with so far.
class SavedTweet
include Mongoid::Document
field :saved_id, :type => String
field :slug, :type => String
field :tweets, :type => Array
end
And the tweets array would be like this
{id: "id", text: "text", created_at: "created_at"}
So it's like a bucket for each keyword that you can save. My first problem is that MongoDB cannot sort a second-level array of a document, in this case tweets, which makes pagination much harder because I cannot use skip and limit. I would have to load all the tweets, put them in a cache, and paginate from that.
The question is how I should model my data to make it paginatable in MongoDB rather than in memory; I'm assuming that doing it in MongoDB would be faster. Right now I'm in the early stages of my application, so it's easier to change the model now than later. I'd really appreciate any suggestions or opinions.
An option could be to save tweets in a different collection and link them with your SavedTweet class. It will be easy to query and you could use skip and limit without problems.
{id: "id", text: "text", created_at: "created_at", saved_tweet:"_id"}
EDIT: a better explanation, with two additional options.
As far as I can see, if I understand your requirements correctly, you have three options:
Use the same schema that you are already using. You would have two problems: you cannot use skip and limit with a usual query, and you have a limit of 16 MB per document. I think the first one could be resolved with an Aggregation Framework query ($unwind, $skip and $limit could be helpful; see the sketch after this list). The second one could be a problem if you have a lot of tweet documents in the array, because one document cannot be more than 16 MB in size.
Use two collections to store your tweets. One collection would have the same structure that you already have. For example:
{
save_id:"23232",
slug:"adkj"
}
And the other collection would have one document per tweet.
{
id: "id",
text: "text",
created_at: "created_at",
saved_tweet:"_id"
}
With the saved_tweet field you are linking saved tweets to tweets in a 1-to-N relation. This way you can run queries over the tweet collection and still be able to use the limit and skip operators.
Save all info in the same document. If your saved_tweet collection only has those fields, you can save all info in a single document (one document for each tweet). Something like this:
{
  save_id: "23232",
  slug: "adkj",
  tweet: {
    tweet_id: "id",
    text: "text",
    created_at: "created_at"
  }
}
With this solution you are duplicating fields, because save_id and slug would be the same in other documents of the same saved tweet, but it could be an option if you have a small number of fields and those fields are not subdocuments or arrays.
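To make options 1 and 2 concrete, here are hedged mongo-shell sketches (collection and field names follow the discussion; the page size of 20 is arbitrary):
// Option 1: paginate the embedded array with the Aggregation Framework
db.saved_tweets.aggregate([
  { $match: { slug: "adkj" } },
  { $unwind: "$tweets" },
  { $sort: { "tweets.created_at": -1 } },
  { $skip: 20 },
  { $limit: 20 }
])
// Option 2: tweets in their own collection, linked by saved_tweet
db.tweets.find({ saved_tweet: savedTweetId })
  .sort({ created_at: -1 })
  .skip(20)
  .limit(20)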
I hope it is clear now.
