Phrase matching with Sitecore ContentSearch API

Phrase matching with Sitecore ContentSearch API - linq

I am using Sitecore 7.2 with a custom Lucene index and Linq. I need to give additional (maximum) weight to exact matches.
Example:
A user searches for "somewhere over the rainbow"
Results should include items which contain the word "rainbow", but items containing the exact and entire term "somewhere over the rainbow" should be given maximum weight. They will displayed to users as the top results. i.e. An item containing the entire phrase should weigh more heavily than an item which contains the word "rainbow" 100 times.
I may need to handle ranking logic outside of the ContentSearch API by collecting "phrase matches" separately from "wildcard matches", and that's fine.
Here's my existing code, truncated for brevity. The code works, but exact phrase matches are not treated as I described.
using (var context = ContentSearchManager.GetIndex("sitesearch-index").CreateSearchContext())
{
var pred = PredicateBuilder.False<SearchResultItem>();
pred = pred
.Or(i => i.Name.Contains(term)).Boost(1)
.Or(i => i["Field 1"].Contains(term)).Boost(3)
.Or(i => i["Field 2"].Contains(term)).Boost(1);
IQueryable<SearchResultItem> query = context.GetQueryable<SearchResultItem>().Where(pred);
var hits = query.GetResults().Hits;
...
}
How can I perform exact phrase matching and is it possible with the Sitecore.ContentSearch.Linq API?

Answering my own question. The problem was with the parenthesis syntax. It should be
.Or(i => i.Name.Contains(term).Boost(1))
rather than
.Or(i => i.Name.Contains(term)).Boost(1)
The boosts were not being observed.

I think if you do the following it will solve this:
Split your search string on space
Create a predicate for each split with an equal boost value,
Create an additional predicate with the complete search string and
with higher boost value
combine all these predicates in one "OR" predicate.
Also I recommend you to check the following:
Sitecore Solr Search Score Value
http://sitecoreinfo.blogspot.com/2015/10/sitecore-solr-search-result-items.html

Related

How to get documents that contain sub-string in FaunaDB

I'm trying to retrieve all the tasks documents that have the string first in their name.
I currently have the following code, but it only works if I pass the exact name:
res, err := db.client.Query(
f.Map(
f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
f.Lambda("ref", f.Get(f.Var("ref"))),
),
)
I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.
Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and it messes up with the pages

FaunaDB provides a lot of constructs, this makes it powerful but you have a lot to choose from. With great power comes a small learning curve :).
How to read the code samples
To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:
const faunadb = require('faunadb')
const q = faunadb.query
const {
Not,
Abort,
...
} = q
You do have to be careful to export Map like that since it will conflict with JavaScripts map. In that case, you could just use q.Map.
Option 1: using ContainsStr() & Filter
Basic usage according to the docs
ContainsStr('Fauna', 'a')
Of course, this works on a specific value so in order to make it work you need Filter and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:
q.Map(
Paginate(Documents(Collection('tasks'))),
Lambda(['ref'], Get(Var('ref')))
)
But we can do that more efficiently since one get === one read and we don't need the docs, we'll be filtering out a lot of them. It's interesting to know that one index page is also one read so we can define an index as follows:
{
name: "tasks_name_and_ref",
unique: false,
serialized: true,
source: "tasks",
terms: [],
values: [
{
field: ["data", "name"]
},
{
field: ["ref"]
}
]
}
And since we added name and ref to the values, the index will return pages of name and ref which we can then use to filter. We can, for example, do something similar with indexes, map over them and this will return us an array of booleans.
Map(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)
Since Filter also works on arrays, we can actually simple replace Map with filter. We'll also add a to lowercase to ignore casing and we have what we need:
Filter(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
In my case, the result is:
{
"data": [
[
"Firstly, we'll have to go and refactor this!",
Ref(Collection("tasks"), "267120709035098631")
],
[
"go to a big rock-concert abroad, but let's not dive in headfirst",
Ref(Collection("tasks"), "267120846106001926")
],
[
"The first thing to do is dance!",
Ref(Collection("tasks"), "267120677201379847")
]
]
}
Filter and reduced page sizes
As you mentioned, this is not exactly what you want since it also means that if you request pages of 500 in size, they might be filtered out and you might end up with a page of size 3, then one of 7. You might think, why can't I just get my filtered elements in pages? Well, it's a good idea for performance reasons since it basically checks each value. Imagine you have a massive collection and filter out 99.99 percent. You might have to loop over many elements to get to 500 which all cost reads. We want pricing to be predictable :).
Option 2: indexes!
Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).
Bindings
In Index bindings, you can transform the attributes of your document and in our first attempt we will split the string into words (I'll implement multiple since I'm not entirely sure which kind of matching you want)
We do not have a string split function but since FQL is easily extended, we can write it ourselves bind to a variable in our host language (in this case javascript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib
function StringSplit(string: ExprArg, delimiter = " "){
return If(
Not(IsString(string)),
Abort("SplitString only accept strings"),
q.Map(
FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
Lambda("res", LowerCase(Select(["data"], Var("res"))))
)
)
)
And use it in our binding.
CreateIndex({
name: 'tasks_by_words',
source: [
{
collection: Collection('tasks'),
fields: {
words: Query(Lambda('task', StringSplit(Select(['data', 'name']))))
}
}
],
terms: [
{
binding: 'words'
}
]
})
Hint, if you are not sure whether you have got it right, you can always throw the binding in values instead of terms and then you'll see in the fauna dashboard whether your index actually contains values:
What did we do? We just wrote a binding that will transform the value into an array of values at the time a document is written. When you index the array of a document in FaunaDB, these values are indexes separately yet point all to the same document which will be very useful for our search implementation.
We can now find tasks that contain the string 'first' as one of their words by using the following query:
q.Map(
Paginate(Match(Index('tasks_by_words'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Which will give me the document with name:
"The first thing to do is dance!"
The other two documents didn't contain the exact words, so how do we do that?
Option 3: indexes and Ngram (exact contains matching)
To get exact contains matching efficient, you need to use a (still undocumented function since we'll make it easier in the future) function called 'NGram'. Dividing a string in ngrams is a search technique that is often used underneath the hood in other search engines. In FaunaDB we can easily apply it as due to the power of the indexes and bindings. The Fwitter example has an example in it's source code that does autocompletion. This example won't work for your use-case but I do reference it for other users since it's meant for autocompleting short strings, not to search a short string in a longer string like a task.
We'll adapt it though for your use-case. When it comes to searching it's all a tradeoff of performance and storage and in FaunaDB users can choose their tradeoff. Note that in the previous approach, we stored each word separately, with Ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, hence why they let you define different algorithms).
What NGram essentially does is get substrings of a string of a certain length.
For example:
NGram('lalala', 3, 3)
Will return:
If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff, increasing the size will increase the storage requirements but allow you to do query for longer strings), you can write the following Ngram generator.
function GenerateNgrams(Phrase) {
return Distinct(
Union(
Let(
{
// Reduce this array if you want less ngrams per word.
indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
indexesFiltered: Filter(
Var('indexes'),
// filter out the ones below 0
Lambda('l', GT(Var('l'), 0))
),
ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(Var('Phrase')), Var('l'), Var('l'))))
},
Var('ngramsArray')
)
)
)
}
You can then write your index as followed:
CreateIndex({
name: 'tasks_by_ngrams_exact',
// we actually want to sort to get the shortest word that matches first
source: [
{
// If your collections have the same property tht you want to access you can pass a list to the collection
collection: [Collection('tasks')],
fields: {
wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
}
}
],
terms: [
{
binding: 'wordparts'
}
]
})
And you have an index backed search where your pages are the size you requested.
q.Map(
Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)
If you want fuzzy searching, often trigrams are used, in this case our index will be easy so we're not going to use an external function.
CreateIndex({
name: 'tasks_by_ngrams',
source: {
collection: Collection('tasks'),
fields: {
ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
}
},
terms: [
{
binding: 'ngrams'
}
]
})
If we would place the binding in values again to see what comes out we'll see something like this:
In this approach, we use both trigrams on the indexing side as on the querying side. On the querying side, that means that the 'first' word which we search for will also be divided in Trigrams as follows:
For example, we can now do a fuzzy search as follows:
q.Map(
Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
Lambda('ref', Get(Var('ref')))
)
In this case, we do actually 3 searches, we are searching for all of the trigrams and union the results. Which will return us all sentences that contain first.
But if we would have miss-spelled it and would have written frst we would still match all three since there is a trigram (rst) that matches.

Generating Boolean Searches Against an Array of Sentences to Group Results into n Results or Fewer

I feel this is a strange one. It comes from nowhere specific but it's a problem I've started trying to solve and now just want to know the answer or at least a starting place.
I have an array of x number of sentences,
I have a count of how many sentences each word appears in,
I have a count of how many sentences each word appears in with every other word,
I can search for a sentence using typical case insensitive boolean search clauses (AND +/- Word)
My data structure looks like this:
{ words: [{ word: '', count: x, concurrentWords: [{ word: '', count: x }] }] }
I need to generate an array of searches which will group the sentences into arrays of n size or less.
I don't know if it's even possible to do this in a predictable way so approximations are cool. The solution doesn't have to use the fact that I have my array of words and their counts. I'm doing this in JavaScript, not that that should matter.
Thanks in advance

Explain the below Linq Query?

results.Where(x=>x.Members.Any(y=>members.Contains(y.Name.ToLower())
I happened to see this query in internet. Can anyone explain this query please.
suggest me a good LINQ tutorial for this newbie.
thank you all.
Edited:
what is this x and y stands for?

x is a single result, of the type of the elements in the results sequence.
y is a single member, of the type of the elements in the x.Members sequence.
These are lambda expressions (x => x.whatever) that were introduced into the language with C# 3, where x is the input, and the right side (x.whatever) is the output (in this particular usage scenario).
An easier example
var list = new List<int> { 1, 2, 3 };
var oddNumbers = list.Where(i => i % 2 != 0);
Here, i is a single int item that is an input into the expression. i % 2 != 0 is a boolean expression evaluating whether the input is even or odd. The entire expression (i => i % 2 != 0) is a predicate, a Func<int, bool>, where the input is an integer and the output is a boolean. Follow? As you iterate over the query oddNumbers, each element in the list sequence is evaluated against the predicate. Those that pass then become part of your output.
foreach (var item in oddNumbers)
Console.WriteLine(item);
// writes 1, 3

Its a lambda expression. Here is a great LINQ tutorial

Interesting query, but I don't like it.
I'll answer your second question first. x and y are parameters to the lambda methods that are defined in the calls to Where() and Any(). You could easy change the names to be more meaningful:
results.Where(result =>
result.Members.Any(member => members.Contains(member.Name.ToLower());
And to answer your first question, this query will return each item in results where the Members collection has at least one item that is also contained in the Members collection as a lower case string.
The logic there doesn't make a whole lot of sense to me with knowing what the Members collection is or what it holds.

x will be every instance of the results collection. The query uses lambda syntax, so x=>x.somemember means "invoke somemember on each x passed in. Where is an extension method for IEnumerables that expects a function that will take an argument and return a boolean. Lambda syntax creates delegates under the covers, but is far more expressive for carrying out certain types of operation (and saves a lot of typing).
Without knowing the type of objects held in the results collection (results will be something that implements IEnumerable), it is hard to know exactly what the code above will do. But an educated guess is that it will check all the members of all the x's in the above collection, and return you an IEnumerable of only those that have members with all lower-case names.

Querying with multiple predicates on sub-objects in a collection with MongoDB (official c# driver)

I've the following data structure:
A
_id
B[]
_id
C[]
_id
UserId
I'm trying to run the following query:
where a.B._id == 'some-id' and a.B.C.UserId=='some-user-id'.
That means I need to find a B document that has a C document within with the relevant UserId, something like:
Query.And(Query.EQ("B._id", id), Query.EQ("B.C.UserId", userId));
This is not good, of course, as it may find B with that id and another different B that has C with that UserId. Not good.
How can I write it with the official driver?

If the problem is only that your two predicates on B are evaluated on different B instances ("it may find B with that ID and another different B that has C with that UserId"), then the solution is to use an operator that says "find me an item in the collection that satisfies both these predicates together".
Seems like the $elemMatch operator does exactly that.
From the docs:
Use $elemMatch to check if an element in an array matches the specified match expression. [...]
Note that a single array element must
match all the criteria specified; [...]
You only need to use this when more than 1 field must be matched in the array element.
Try this:
Query.ElemMatch("B", Query.And(
Query.EQ("_id", id),
Query.EQ("C.UserId", userId)
));
Here's a good explanation of $elemMatch and dot notation, which matches this scenario exactly.

Recursively (?) compose LINQ predicates into a single predicate

(EDIT: I have asked the wrong question. The real problem I'm having is over at Compose LINQ-to-SQL predicates into a single predicate - but this one got some good answers so I've left it up!)
Given the following search text:
"keyword1 keyword2 keyword3 ... keywordN"
I want to end up with the following SQL:
SELECT [columns] FROM Customer
WHERE
(Customer.Forenames LIKE '%keyword1%' OR Customer.Surname LIKE '%keyword1%')
AND
(Customer.Forenames LIKE '%keyword2%' OR Customer.Surname LIKE '%keyword2%')
AND
(Customer.Forenames LIKE '%keyword3%' OR Customer.Surname LIKE '%keyword3%')
AND
...
AND
(Customer.Forenames LIKE '%keywordN%' OR Customer.Surname LIKE '%keywordN%')
Effectively, we're splitting the search text on spaces, trimming each token, constructing a multi-part OR clause based on each token, and then AND'ing the clauses together.
I'm doing this in Linq-to-SQL, and I have no idea how to dynamically compose a predicate based on an arbitrarily-long list of subpredicates. For a known number of clauses, it's easy to compose the predicates manually:
dataContext.Customers.Where(
(Customer.Forenames.Contains("keyword1") || Customer.Surname.Contains("keyword1")
&&
(Customer.Forenames.Contains("keyword2") || Customer.Surname.Contains("keyword2")
&&
(Customer.Forenames.Contains("keyword3") || Customer.Surname.Contains("keyword3")
);
but I want to handle an arbitrary list of search terms. I got as far as
Func<Customer, bool> predicate = /* predicate */;
foreach(var token in tokens) {
predicate = (customer
=> predicate(customer)
&&
(customer.Forenames.Contains(token) || customer.Surname.Contains(token));
}
That produces a StackOverflowException - presumably because the predicate() on the RHS of the assignment isn't actually evaluated until runtime, at which point it ends up calling itself... or something.
In short, I need a technique that, given two predicates, will return a single predicate composing the two source predicates with a supplied operator, but restricted to the operators explicitly supported by Linq-to-SQL. Any ideas?

I would suggest another technique
you can do:
var query = dataContext.Customers;
and then, inside a cycle do
foreach(string keyword in keywordlist)
{
query = query.Where(Customer.Forenames.Contains(keyword) || Customer.Surname.Contains(keyword));
}

If you want a more succinct and declarative way of writing this, you could also use Aggregate extension method instead of foreach loop and mutable variable:
var query = keywordlist.Aggregate(dataContext.Customers, (q, keyword) =>
q.Where(Customer.Forenames.Contains(keyword) ||
Customer.Surname.Contains(keyword));
This takes dataContext.Customers as the initial state and then updates this state (query) for every keyword in the list using the given aggregation function (which just calls Where as Gnomo suggests.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Phrase matching with Sitecore ContentSearch API - linq

Answering my own question. The problem was with the parenthesis syntax. It should be .Or(i => i.Name.Contains(term).Boost(1)) rather than .Or(i => i.Name.Contains(term)).Boost(1) The boosts were not being observed.

Related

How to get documents that contain sub-string in FaunaDB

Generating Boolean Searches Against an Array of Sentences to Group Results into n Results or Fewer

Explain the below Linq Query?

Querying with multiple predicates on sub-objects in a collection with MongoDB (official c# driver)

Recursively (?) compose LINQ predicates into a single predicate

Categories

Resources