How to break down search results with Elasticsearch? - elasticsearch

I have documents in my Elasticsearch index that represent suppliers. Each document is a supplier, and each supplier has branches as well. It looks like this:
{
  "id": 1,
  "supplierName": "John Flower Shop",
  "supplierAddress": "107 main st, Los Angeles",
  "branches": [
    {
      "branchId": 11,
      "branchName": "John Flower Shop New York",
      "branchAddress": "34 5th Ave, New York"
    },
    {
      "branchId": 12,
      "branchName": "John Flower Shop Miami",
      "branchAddress": "56 ragnar st, Miami"
    }
  ]
}
Currently I expose an API that allows searching in the fields supplierName, supplierAddress, branchName and branchAddress.
The use case is a search box on my website that calls the backend and puts the results in a dropdown for the user to choose the supplier.
My issue is that, given the example document above, if you search for "John Flower Shop Miami", the answer will be the whole document, and what will be presented is the top-level supplier name.
What I want is to present "John Flower Shop Miami", and I'm not sure how to tell which part of the result matched the search.
Has anyone had to do something like this before?

Handling relationships in Elasticsearch takes a bit of work, but it can be done. I recommend reading the ES guide's chapter on handling relationships to get the big picture.
Then my advice is to index your branches as nested documents, so that they are stored as distinct documents in your index.
This will require you to change your query syntax to use nested queries, which can be a pain in the a..., but in exchange you get the inner_hits functionality.
It lets you know which subdocument (nested document) matched your query.
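A minimal sketch of how this could look for the supplier documents in the question, written as Python dicts that mirror the request bodies. Only the nested type, the nested query and inner_hits come from the answer above; the mapping layout, the use of multi_match, and the helper names build_search_body and display_names are assumptions for illustration, and the exact mapping syntax varies between Elasticsearch versions.

```python
# Mapping sketch: "branches" is declared as nested so that each branch
# is indexed as its own hidden document (field names from the question).
MAPPING = {
    "mappings": {
        "properties": {
            "supplierName": {"type": "text"},
            "supplierAddress": {"type": "text"},
            "branches": {
                "type": "nested",
                "properties": {
                    "branchId": {"type": "integer"},
                    "branchName": {"type": "text"},
                    "branchAddress": {"type": "text"},
                },
            },
        }
    }
}


def build_search_body(text):
    """Search supplier fields and nested branch fields, requesting inner_hits."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {
                        "query": text,
                        "fields": ["supplierName", "supplierAddress"],
                    }},
                    {"nested": {
                        "path": "branches",
                        "query": {"multi_match": {
                            "query": text,
                            "fields": ["branches.branchName",
                                       "branches.branchAddress"],
                        }},
                        # Without an explicit name, the inner hits are
                        # returned under the nested path ("branches").
                        "inner_hits": {},
                    }},
                ]
            }
        }
    }


def display_names(response):
    """Pick one label per hit: the best matching branch, else the supplier."""
    names = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get("branches", {})
        branch_hits = inner.get("hits", {}).get("hits", [])
        if branch_hits:
            # Inner hits are sorted by score by default, so take the first.
            names.append(branch_hits[0]["_source"]["branchName"])
        else:
            names.append(hit["_source"]["supplierName"])
    return names
```

The dropdown would then show whatever display_names returns for the search response; whether to prefer the branch over the supplier when both match is a design choice that depends on the site.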

Related

Elasticsearch - query primary and secondary attribute with different terms

I'm using Elasticsearch to query data that was originally exported from several relational databases that had a lot of redundancies. I now want to perform queries where I have a primary attribute and one or more secondary attributes that should match. I tried using a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:
Example:
I have a document with the full name and street name of a user, and I want to search for similar users in different indices. So the best match for my query should be the best match on the fullname field and the best match on the street field. But since the original data has a lot of redundancies and inconsistencies, the field fullname (which I manually created out of the fields name1, name2, name3) may contain the same name multiple times, and it seems that Elasticsearch ranks a double match in a must field higher than a match in a should attribute.
That means, I want to query for John Doe Back Street with the following sample data:
{
  "fullname" : "John Doe John and Jane",
  "street" : "Main Street"
}
{
  "fullname" : "John Doe",
  "street" : "Back Street"
}
Long story short, I want to query for a main attribute fullname - John Doe and secondary attribute street - Back Street and want the second document to be the best match and not the first because it contains John multiple times.
Manipulating relevance in Elasticsearch is not easy. The score calculation is based on three main factors:
Term frequency
Inverse document frequency
Field-length norm
In short:
the more often the term occurs in the field, the MORE relevant the document is
the more often the term occurs in the entire index, the LESS relevant the term is
the longer the field is, the LESS relevant a match in it is
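To make the direction of these three factors concrete, here is a small sketch using the formulas of Lucene's classic TF/IDF similarity as described in the Elasticsearch guide. Treat it as an illustration only: the function names are mine, newer Elasticsearch versions use BM25 by default, and real scores also involve query normalization and lossy norm storage.

```python
import math


def term_frequency(freq):
    """tf: grows with the number of times the term occurs in the field."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """idf: shrinks as more documents in the index contain the term."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """norm: shrinks as the field gets longer (more terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Combine the three factors for one term in one field."""
    return (term_frequency(freq)
            * inverse_document_frequency(num_docs, doc_freq)
            * field_length_norm(num_terms))
```

Because num_docs and doc_freq describe the whole index, the same query against the same document can score differently after more documents are added.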
I recommend you to read below materials:
What Is Relevance?
Theory Behind Relevance Scoring
Controlling Relevance and subpages
If, in general, a match on fullname is more important in your case than a match on street, you can boost the importance of the former. Below is example code based on my working code:
{
  "query": {
    "multi_match": {
      "query": "john doe",
      "fields": [
        "fullname^10",
        "street"
      ]
    }
  }
}
In this example a match on fullname is ten times (^10) more important than a match on street. You can try adjusting the boost or use other ways to control relevance, but as I mentioned at the beginning, it is not easy and everything depends on your particular situation. This is mostly because of the "inverse document frequency" part, which considers terms from the entire index: each document added to the index will probably change the score of the same search query.
I know that I did not answer directly, but I hope this helped you understand how it works.

Spring Data MongoDB with text index: difference between matchingany and matchingphrase

I am using MongoDB and Spring for an application
I am using a text index on my collection.
I found two methods:
matchingany
matchingphrase
But I am unable to understand the difference.
Please help me to understand them.
If you want a match on multiple words forming a phrase, use matchingPhrase; if you want a match on at least one word in a list of words, use matchingAny.
For example, given these documents (and assuming the title attribute is text-indexed):
{ "id": 1, "title": "The days of the week"}
{ "id": 2, "title": "Once a week"}
{ "id": 3, "title": "Once a month"}
matchingAny("Once") will match the documents with id=2 and id=3
matchingAny("month", "foo", "bar") will match the document with id=3
matchingPhrase("The days of the week") will match the document with id=1
More details in the docs.
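For intuition, the difference can be expressed in terms of MongoDB's own $text operator, where space-separated terms are OR-ed together and a quoted string is treated as a phrase. The sketch below builds the corresponding filter documents in Python; the helper names are hypothetical, and that Spring Data generates exactly these filters is my assumption, not something stated above.

```python
def matching_any(*words):
    """Build a $text filter matching documents that contain any of the words."""
    return {"$text": {"$search": " ".join(words)}}


def matching_phrase(phrase):
    """Build a $text filter matching the phrase (quoted inside $search)."""
    return {"$text": {"$search": '"' + phrase + '"'}}
```

With a driver such as pymongo, a filter like matching_any("month", "foo", "bar") could be passed to find() on a collection that has a text index on title.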

How to find related songs or artists using Freebase MQL?

I have a Freebase mid such as /m/0mgcr, which is The Offspring.
What's the best way to use MQL to find related artists?
Or, if I have a song mid such as /m/0l_f7f, which is Original Prankster by The Offspring:
what's the best way to use MQL to find related songs?
So, the revised question is, given a musical artist, find all other musical artists who share all of the same genres assigned to the first artist.
MQL doesn't have any operators that can work across parts of the query tree, so this can't be done in a single query, but given that you're likely doing this from a programming language, it can be done pretty simply in two steps.
First, we'll get all genres for our subject artist, sorted by the number of artists that they contain using this query (although the last part isn't strictly necessary):
[{
  "id": "/m/0mgcr",
  "name": null,
  "/music/artist/genre": [{
    "name": null,
    "id": null,
    "artists": {
      "return": "count"
    },
    "sort": "artists.count"
  }]
}]
Then, using the genre with the smallest number of artists for maximum selectivity, we'll add in the other genres to make it even more specific. Here's a version of the query with the artists that match on the three most specific genres (the base genre plus two more):
[{
  "id": "/m/0mgcr",
  "name": null,
  "/music/artist/genre": [{
    "name": null,
    "id": null,
    "artists": {
      "return": "count"
    },
    "sort": "artists.count",
    "limit": 1,
    "a:artists": [{
      "name": null,
      "id": null,
      "a:genre": {
        "id": "/en/ska_punk"
      },
      "b:genre": {
        "id": "/en/melodic_hardcore"
      }
    }]
  }]
}]
Which gives us: Authority Zero, Millencolin, Michael John Burkett, NOFX, Bigwig, Huelga de Hambre, Freygolo, The Vandals
The things to note about this query are that this fragment:
  "sort": "artists.count",
  "limit": 1,
limits our initial genre selection to the single genre with the fewest artists (i.e. Skate Punk), while the prefix notation:
  "a:genre": {"id": "/en/ska_punk"},
  "b:genre": {"id": "/en/melodic_hardcore"}
is there to get around the JSON limitation of not having more than one key with the same name. The prefixes are ignored and just need to be unique (this is the same reason for the a:artists elsewhere in the query).
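Since the second query depends on the genres returned by the first, it would normally be generated in code. Here is a rough sketch of that generation step in Python; the function name build_related_artists_query is hypothetical, and it only builds the query structure shown above (it does not call any Freebase API).

```python
import json
import string


def build_related_artists_query(artist_mid, extra_genre_ids):
    """Build the second-step MQL query as a Python structure."""
    if len(extra_genre_ids) > len(string.ascii_lowercase):
        raise ValueError("too many genres for single-letter prefixes")
    related = {"name": None, "id": None}
    # Each extra genre gets its own unique prefix ("a:genre", "b:genre", ...)
    # because a JSON object cannot repeat the key "genre".
    for prefix, genre_id in zip(string.ascii_lowercase, extra_genre_ids):
        related[prefix + ":genre"] = {"id": genre_id}
    return [{
        "id": artist_mid,
        "name": None,
        "/music/artist/genre": [{
            "name": None,
            "id": None,
            "artists": {"return": "count"},
            "sort": "artists.count",
            "limit": 1,  # keep only the genre with the fewest artists
            "a:artists": [related],
        }],
    }]


query = build_related_artists_query(
    "/m/0mgcr", ["/en/ska_punk", "/en/melodic_hardcore"])
query_json = json.dumps(query, indent=2)  # None is serialized as null
```

The extra genre ids would come from the result of the first query, skipping the most selective genre that the "limit": 1 clause already picks.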
So, having worked through that whole little exercise, I'll close by saying that there are probably better ways of doing this. Instead of an absolute match, you may get better results with a scoring function that looks at % overlap for the most specific genres or some other metric. Things like common band members, collaborations, contemporaneous recording history, etc, etc, could also be factored into your scoring. Of course this is all beyond the capabilities of raw MQL and you'd probably want to load the Freebase data for the music domain (or some subset) into a graph database to run these scoring algorithms.
In point of fact, both last.fm and Google think a better list would include bands like Sum 41, blink-182, Bad Religion, Green Day, etc.

How can I query/filter an elasticsearch index by an array of values?

I have an elasticsearch index with numeric category ids like this:
{
  "id": "50958",
  "name": "product name",
  "description": "product description",
  "upc": "00302590602108",
  "categories": [
    "26",
    "39"
  ],
  "price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing if there would be a benefit to using a filter or facet.
As you already discovered, a terms query works. I would suggest a terms filter though, since filters are faster and cacheable.
Facets won't limit the results, but they are excellent tools. They count the hits for specific terms within your total results and can be used for faceted navigation.
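As a rough sketch, a terms clause in filter context for the categories field from the question could be built like this. The bool/filter wrapper is the syntax of later Elasticsearch versions (older releases used a filtered query instead), and the helper name is hypothetical.

```python
def build_category_filter(category_ids):
    """Return a search body matching products in any of the given categories."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # The sample document stores category ids as strings,
                    # so the ids are converted to strings here as well.
                    {"terms": {"categories": [str(c) for c in category_ids]}}
                ]
            }
        }
    }


body = build_category_filter([26, 39])
```

A document matches when at least one of its categories is in the list, which covers the "parent id with all of its children" case as long as the caller passes all the child ids.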

How to remove luis entity marker from utterance

I am using LUIS to determine which state a customer lives in. I have set up a list entity called "state" that has the 50 states with their two-letter abbreviations as synonyms as described in the documentation. LUIS is returning certain two letter words, such as "hi" or "in" as state entities.
I have set up an intent with phrases such as "My state is Oregon", "I am from WA", etc. Inside the intent, if the word "in" is included in the utterance, for example in the utterance "I live in Kentucky", the word "in" is marked automatically by LUIS as a state entity and I am unable to remove that marker.
Below is a snippet of the LUIS JSON response to the utterance "I live in Kentucky". As you can see, the response includes both Indiana and Kentucky as entities when there should only be Kentucky.
"query": "I live in Kentucky",
"topScoringIntent": {
  "intent": "STATE_INQUIRY",
  "score": 0.9338141
},
....
"entities": [
  ....
  {
    "entity": "in",
    "type": "state",
    "startIndex": 7,
    "endIndex": 8,
    "resolution": {
      "values": [
        "indiana"
      ]
    }
  },
  {
    "entity": "kentucky",
    "type": "state",
    "startIndex": 10,
    "endIndex": 17,
    "resolution": {
      "values": [
        "kentucky"
      ]
    }
  }
], ....
How do I train LUIS not to mark the words "in" and "hi" in this context as states if I can't remove the entity marker from the utterance?
In this particular case (populating a list entity with state abbreviations/names), you would be better served by the geographyV2 prebuilt entity or the Places.AbsoluteLocation prebuilt domain entity. (Please note that at the time of this writing, the geographyV2 prebuilt entity has a slight bug, so using the prebuilt domain entity would be the better option.)
The reason for this is two-fold:
One, geographic locations are already baked into LUIS and they don't collide with regular syntactic words like "in", "hi", or "me". I tested this in reverse by creating a [Medical] list that contained "ct" as the normalized value and "ct scan" as a synonym. When I typed "get me a ct in CT" it resulted in "get me a [Medical] in [Medical]". To fix, I selected the second "CT" value and re-assigned it to the Places.AbsoluteLocation entity. After retraining, I tested "when in CT show me ct options" which correctly resulted in "when in [Places.AbsoluteLocation] show me [Medical] options". Further examples and training will refine the results.
Two, lists work well when several disparate words can all refer to one canonical value. This tutorial shows a simple example where loosely associated words are assigned as synonyms to a canonical name (normalized value).
Hope this helps!
@StevenKanberg's answer was very helpful but unfortunately not complete for my situation. I tried to implement both geographyV2 and Places.AbsoluteLocation (separately). Neither one works entirely in the way I need it to (recognizing states and their two-letter abbreviations in a way that can be queried from the entities in the response).
So my choices are:
1. Create my own list of states, using the state name and the two-letter abbrev as synonyms, as described in the list description itself. This works except for two-letter abbrevs that are also words, such as "in", "hi" and "me".
2. Use the geographyV2 prebuilt entity, which does not allow synonyms and does not recognize two-letter abbrevs at all, or
3. Use Places.AbsoluteLocation, which does recognize two-letter abbrevs for states and does not confuse them with words, but also grabs all locations including cities, countries and addresses and does not differentiate between them, so I have no way of parsing which entity is the state in an utterance like "I live in Lake Stevens, Snohomish County, WA".
Solution: If I combine 1 with 3, I can query for entities that have both of those types. If LUIS marks the word "in" as a state (Indiana), I can then check to see if that word has also been flagged as an AbsoluteLocation. If it has not, then I can safely discard that entity. It's not ideal but is a workaround that solves the problem.
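A minimal sketch of that post-processing step in Python, assuming the entities list has the shape shown in the question (entity, type, startIndex, endIndex). The type string "Places.AbsoluteLocation", the function name, and the choice to accept a state whose span lies inside a location span (rather than requiring identical spans) are my assumptions.

```python
STATE_TYPE = "state"
LOCATION_TYPE = "Places.AbsoluteLocation"  # assumed type name in the response


def filter_state_entities(entities):
    """Keep state entities only if a location entity covers the same text."""
    location_spans = [
        (e["startIndex"], e["endIndex"])
        for e in entities
        if e["type"] == LOCATION_TYPE
    ]
    kept = []
    for e in entities:
        if e["type"] != STATE_TYPE:
            continue
        # A state is trusted when its span falls within some location span.
        if any(start <= e["startIndex"] and e["endIndex"] <= end
               for start, end in location_spans):
            kept.append(e)
    return kept
```

For the "I live in Kentucky" response, "in" (7 to 8) would be dropped as long as LUIS does not also flag it as a location, while "kentucky" (10 to 17) would be kept.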
