How to get documents that contain a substring in FaunaDB - Go

I'm trying to retrieve all the tasks documents that have the string 'first' in their name.
I currently have the following code, but it only works if I pass the exact name:
res, err := db.client.Query(
    f.Map(
        f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
        f.Lambda("ref", f.Get(f.Var("ref"))),
    ),
)
I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.
Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and that messes up the pages.

FaunaDB provides a lot of constructs; this makes it powerful, but it also gives you a lot to choose from. With great power comes a small learning curve :).
How to read the code samples
To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:
const faunadb = require('faunadb')
const q = faunadb.query
const {
  Not,
  Abort,
  ...
} = q
You do have to be careful when you export Map like that, since it will conflict with JavaScript's built-in map. In that case, you can just use q.Map.
Option 1: using ContainsStr() & Filter
Basic usage according to the docs:
ContainsStr('Fauna', 'a')
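ContainsStr(value, search) returns a boolean, so the call above evaluates to true since 'Fauna' contains an 'a'. A couple more illustrative calls (expected results as comments; a quick sketch worth verifying in the shell):
ContainsStr('Fauna', 'fauna')            // false, matching is case sensitive
ContainsStr(LowerCase('Fauna'), 'fauna') // true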
Of course, this works on a specific value, so to make it work for our search we need Filter, and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:
q.Map(
  Paginate(Documents(Collection('tasks'))),
  Lambda(['ref'], Get(Var('ref')))
)
But we can do that more efficiently: one Get === one read, and we don't need the full documents since we'll be filtering a lot of them out. It's useful to know that one index page is also just one read, so we can define an index as follows:
{
  name: "tasks_name_and_ref",
  unique: false,
  serialized: true,
  source: "tasks",
  terms: [],
  values: [
    { field: ["data", "name"] },
    { field: ["ref"] }
  ]
}
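For reference, creating the same index from the shell or the JS driver would look roughly like this (a sketch; unique and serialized are the defaults shown above, so they are omitted):
CreateIndex({
  name: 'tasks_name_and_ref',
  source: Collection('tasks'),
  terms: [],
  values: [
    { field: ['data', 'name'] },
    { field: ['ref'] }
  ]
})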
And since we added name and ref to the values, the index will return pages of [name, ref] pairs which we can then use to filter. For example, we can map over the index page, and this will return an array of booleans:
Map(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)
Since Filter also works on arrays, we can simply replace Map with Filter. We'll also add a LowerCase to ignore casing, and we have what we need:
Filter(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
In my case, the result is:
{
  "data": [
    [
      "Firstly, we'll have to go and refactor this!",
      Ref(Collection("tasks"), "267120709035098631")
    ],
    [
      "go to a big rock-concert abroad, but let's not dive in headfirst",
      Ref(Collection("tasks"), "267120846106001926")
    ],
    [
      "The first thing to do is dance!",
      Ref(Collection("tasks"), "267120677201379847")
    ]
  ]
}
Filter and reduced page sizes
As you mentioned, this is not exactly what you want, since it also means that if you request pages of size 500, they might be filtered down and you could end up with a page of size 3, then one of 7. You might think: why can't I just get my filtered elements in pages? Well, not doing that is deliberate, for performance reasons, since Filter basically has to check each value. Imagine you have a massive collection and filter out 99.99 percent; you might have to loop over an enormous number of elements to fill a page of 500, and all of those checks cost reads. We want pricing to be predictable :).
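For reference, the page size you request is the size option of Paginate; with this Filter approach the returned page can still come back smaller than that size (a sketch using the index from above):
Filter(
  Paginate(Match(Index('tasks_name_and_ref')), { size: 500 }),
  Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)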
Option 2: indexes!
Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).
Bindings
With index bindings, you can transform the attributes of your document. In our first attempt we will split the string into words (I'll implement multiple approaches since I'm not entirely sure which kind of matching you want).
FQL does not have a string split function, but since FQL is easily extended, we can write one ourselves, bind it to a variable in our host language (in this case JavaScript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib
function StringSplit(string, delimiter = " ") {
  return If(
    Not(IsString(string)),
    Abort("StringSplit only accepts strings"),
    q.Map(
      FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
      Lambda("res", LowerCase(Select(["data"], Var("res"))))
    )
  )
}
And use it in our binding:
CreateIndex({
  name: 'tasks_by_words',
  source: [
    {
      collection: Collection('tasks'),
      fields: {
        words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'words'
    }
  ]
})
Hint: if you are not sure whether you got it right, you can always put the binding in values instead of terms, and then you'll see in the Fauna dashboard whether your index actually contains values.
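For example, a throwaway debug variant of the same index (hypothetical name, same StringSplit binding) that exposes the binding as a value instead of a term could look like this sketch:
CreateIndex({
  name: 'tasks_by_words_debug',
  source: [
    {
      collection: Collection('tasks'),
      fields: {
        words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  values: [
    { binding: 'words' },
    { field: ['ref'] }
  ]
})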
What did we do? We just wrote a binding that will transform the value into an array of values at the time a document is written. When you index the array of a document in FaunaDB, these values are indexed separately yet all point to the same document, which will be very useful for our search implementation.
We can now find tasks that contain the string 'first' as one of their words by using the following query:
q.Map(
  Paginate(Match(Index('tasks_by_words'), 'first')),
  Lambda('ref', Get(Var('ref')))
)
Which will give me the document with name:
"The first thing to do is dance!"
The other two documents didn't contain the exact word, so how do we match those as well?
Option 3: indexes and Ngram (exact contains matching)
To get exact contains matching efficiently, you need to use a (still undocumented, since we'll make it easier in the future) function called NGram. Dividing a string into ngrams is a search technique that is often used under the hood in other search engines. In FaunaDB we can easily apply it thanks to the power of indexes and bindings. The Fwitter example has code in its source that does autocompletion. That example won't work for your use case, but I reference it for other readers since it's meant for autocompleting short strings, not for searching a short string within a longer string like a task name.
We'll adapt it for your use case, though. When it comes to searching, it's all a tradeoff between performance and storage, and in FaunaDB users can choose their own tradeoff. Note that in the previous approach we stored each word separately; with ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, which is why they let you define different algorithms).
What NGram essentially does is take all substrings of a string between a given minimum and maximum length.
For example:
NGram('lalala', 3, 3)
Will return the overlapping three-character substrings of 'lalala': 'lal', 'ala', and so on.
If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff: increasing the size will increase the storage requirements but allow you to query for longer strings), you can write the following ngram generator:
function GenerateNgrams(Phrase) {
  return Distinct(
    Union(
      Let(
        {
          // Reduce this array if you want fewer ngrams per word.
          indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
          indexesFiltered: Filter(
            Var('indexes'),
            // filter out zero-length ngrams
            Lambda('l', GT(Var('l'), 0))
          ),
          ngramsArray: q.Map(
            Var('indexesFiltered'),
            Lambda('l', NGram(LowerCase(Phrase), Var('l'), Var('l')))
          )
        },
        Var('ngramsArray')
      )
    )
  )
}
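To get a feel for what this produces, calling it on a short string such as 'first' builds an FQL expression that evaluates to roughly the distinct substrings of length 1 up to 9 (order and exact shape not guaranteed; this is only an illustration of why every substring of an indexed name ends up as a searchable term):
GenerateNgrams('first')
// => roughly ['f', 'i', 'r', 's', 't', 'fi', 'ir', 'rs', 'st',
//             'fir', 'irs', 'rst', 'firs', 'irst', 'first']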
You can then write your index as follows:
CreateIndex({
  name: 'tasks_by_ngrams_exact',
  // we actually want to sort to get the shortest word that matches first
  source: [
    {
      // If your collections share the property that you want to access, you can pass a list of collections
      collection: [Collection('tasks')],
      fields: {
        wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'wordparts'
    }
  ]
})
And you have an index-backed search where your pages are the size you requested:
q.Map(
  Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
  Lambda('ref', Get(Var('ref')))
)
Option 4: indexes and ngrams of size 3, or trigrams (fuzzy matching)
If you want fuzzy searching, trigrams are often used. In this case our index will be easy, so we're not going to use an external function:
CreateIndex({
  name: 'tasks_by_ngrams',
  source: {
    collection: Collection('tasks'),
    fields: {
      ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
    }
  },
  terms: [
    {
      binding: 'ngrams'
    }
  ]
})
If we were to place the binding in values again to see what comes out, we would see each document's set of trigrams.
In this approach, we use trigrams both on the indexing side and on the querying side. On the querying side, that means the word 'first' which we search for will also be divided into trigrams: 'fir', 'irs', and 'rst'. We can now do a fuzzy search as follows:
q.Map(
  Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)
In this case, we actually do three searches: we search for each of the trigrams and union the results, which returns all sentences that contain 'first'.
But if we had misspelled it and written 'frst', we would still match all three sentences, since there is a trigram (rst) that matches.
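To make that concrete, the same fuzzy query with the misspelled term only changes the search string; the shared 'rst' trigram still hits the index:
q.Map(
  Paginate(Union(q.Map(NGram('frst', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)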

Related

JSONata prevent array flattening

Q: How do I prevent JSONata from "auto-flattening" arrays in an array constructor?
Given JSON data:
{
"w" : true,
"x":["a", "b"],
"y":[1, 2, 3],
"z": 9
}
the JSONata query seems to select 4 values:
[$.w, $.x, $.y, $.z]
The nested arrays at $.x and $.y are getting flattened/inlined into my outer wrapper, resulting in more than 4 values:
[ true, "a", "b", 1, 2, 3, 9 ]
The results I would like to achieve are
[ true, ["a", "b"], [1, 2, 3], 9 ]
I can achieve this by using
[$.w, [$.x], [$.y], $.z]
But this requires me to know a priori that $.x and $.y are arrays.
I would like to select 4 values and have the resulting array contain exactly 4 values, independent of the types of values that are selected.
There are clearly some things about the interactions between JSONata sequences and arrays that I can't get my head around.
In common with XPath/XQuery sequences, JSONata will flatten the results of a path expression into the output array. It is possible to avoid this in your example by using the $each higher-order function to iterate over an object's key/value pairs. The following expression will get what you want without any flattening of results:
$each($, function($v) {
$v
})
This just returns the value for each property in the object.
UPDATE: Extending this answer for your updated question:
I think this is related to a previous GitHub question on how to combine several independent queries into the same expression. That approach uses an object to hold all the queries, in a similar manner to the one you arrived at. Perhaps a slightly clearer expression would be this:
{
  "1": t,
  "2": u.i,
  "3": u.j,
  "4": u.k,
  "5": u.l,
  "6": v
} ~> $each(λ($v){$v})
The λ is just a shorthand for function, if you can find it on your keyboard (F12 in the JSONata Exerciser).
I am struggling to rephrase my question in such as way as to describe the difficulties I am having with JSONata's sequence-like treatment of arrays.
I need to run several queries to extract several values from the same JSON tree. I would like to construct one JSONata query expression which extracts n data items (or runs n subqueries) and returns exactly n values in an ordered array.
This example queries 6 values, but because of array flattening the result array does not have 6 values.
This example explicitly wraps each query in an array constructor so that the result has 6 values. However, the values which are not arrays are wrapped in an extraneous & undesirable array. In addition one cannot determine what the original type was ...
This example shows the result that I am trying to accomplish ... I asked for 6 things and I got 6 values back. However, I must know the datatypes of the values I am fetching and explicitly wrap the arrays in an array constructor to work-around the sequence flattening.
This example shows what I want. I queried 6 things and got back 6 answers without knowing the datatypes. But I have to introduce an object as a temporary container in order to work around the array flattening behavior.
I have not found any predicates that allow me to test the type of a value in a query ... which might have let me use the ?: operator to dynamically decide whether or not to wrap arrays in an array constructor. e.g. $isArray($.foo) ? [$.foo] : $.foo
Q: Is there an easier way for me to (effectively) submit 6 "path" queries and get back 6 values in an ordered array without knowing the data types of the values I am querying?
Building on the example from Acoleman, here is a way to pass in n "query" strings (that represent paths):
(['t', 'u.i', 'u.j', 'u.k', 'u.l', 'v'] {
$: $eval('$$.' & $)
}).$each(function($o) {$o})
and get back an array of n results with their original data format:
[
  12345,
  [
    "i",
    "ii",
    "iii"
  ],
  [],
  "K",
  {
    "L": "LL"
  },
  null
]
It seems that using $each is the only way to avoid any flattening...
Granted, probably not the most efficient of expressions, since each has to be evaluated from a path string starting at the root of the data structure -- but there ya go.

Ruby storing data for queries

I have a string
"4813243948,1234433948,1.3,Type2
1234433948,4813243948,1.3,Type1
1234433948,6345635414,1.3,Type1
4813243948,2435677524,1.3,Type2
4813243948,5245654367,1.3,Type2
2345243524,6754846756,1.3,Type1
1234512345,2345124354,1.3,Type1
1342534332,4565346546,1.3,Type1"
This is telephone outbound call data where each new line represents a new phone call.
(Call From, Call To, Duration, Line Type)
I want to save this data in a way that allows me to query a specific number and get a string output of the number, its type, its total minutes used, and all the calls that it made (outbound calls). I just want to do this in a single ruby file.
Thus typing in this
4813243948
Returns
4813243948, Type 2, 3.9 Minutes total
1234433948, 1.3
2435677524, 1.3
5245654367, 1.3
I am wondering if I should try to store values in arrays, or create a custom class and make each number an object of that class, then append the calls to each number... I'm not sure how to do the class approach. Having a different array for each number seems like it would get cluttered, as there are thousands of numbers and millions of calls. Of course, the provided input string is a very small portion of the real source.
I have a string
"4813243948,1234433948,1.3,Type2
1234433948,4813243948,1.3,Type1
This looks like a CSV. If you slap some headers on top, you can parse it into an array of hashes.
str = "4813243948,1234433948,1.3,Type2
1234433948,4813243948,1.3,Type1"
require 'csv'
calls = CSV.parse(str, headers: %w[from to length type], header_converters: :symbol).map(&:to_h)
# => [{:from=>"4813243948", :to=>"1234433948", :length=>"1.3", :type=>"Type2"},
# {:from=>"1234433948", :to=>"4813243948", :length=>"1.3", :type=>"Type1"}]
This is essentially the same as your original string, only it trades some memory for ease of access. You can now "query" this dataset like this:
calls.select{ |c| c[:from] == '4813243948' }
And then aggregate for presentation however you wish.
Naturally, searching through this array takes linear time, so if you have millions of calls you might want to organize them in a more efficient search structure (like a B-Tree) or move the whole dataset to a real database.
If you only want to make queries for the number the call originated from, you could store the data in a hash where the keys are the "call from" numbers and the value is an array, or another hash, containing the rest of the data. For example:
{ '4813243948': { call_to: 1234433948, duration: 1.3, line_type: 'Type2' }, ... }
If the dataset is very large, or you need more complex queries, it might be better to store it in a database and just query it directly.

Phrase matching with Sitecore ContentSearch API

I am using Sitecore 7.2 with a custom Lucene index and Linq. I need to give additional (maximum) weight to exact matches.
Example:
A user searches for "somewhere over the rainbow"
Results should include items which contain the word "rainbow", but items containing the exact and entire term "somewhere over the rainbow" should be given maximum weight. They will be displayed to users as the top results, i.e. an item containing the entire phrase should weigh more heavily than an item which contains the word "rainbow" 100 times.
I may need to handle ranking logic outside of the ContentSearch API by collecting "phrase matches" separately from "wildcard matches", and that's fine.
Here's my existing code, truncated for brevity. The code works, but exact phrase matches are not treated as I described.
using (var context = ContentSearchManager.GetIndex("sitesearch-index").CreateSearchContext())
{
    var pred = PredicateBuilder.False<SearchResultItem>();
    pred = pred
        .Or(i => i.Name.Contains(term)).Boost(1)
        .Or(i => i["Field 1"].Contains(term)).Boost(3)
        .Or(i => i["Field 2"].Contains(term)).Boost(1);
    IQueryable<SearchResultItem> query = context.GetQueryable<SearchResultItem>().Where(pred);
    var hits = query.GetResults().Hits;
    // ...
}
How can I perform exact phrase matching and is it possible with the Sitecore.ContentSearch.Linq API?
Answering my own question. The problem was with the parenthesis syntax. It should be
.Or(i => i.Name.Contains(term).Boost(1))
rather than
.Or(i => i.Name.Contains(term)).Boost(1)
The boosts were not being observed.
I think if you do the following it will solve this:
1. Split your search string on spaces.
2. Create a predicate for each split with an equal boost value.
3. Create an additional predicate with the complete search string and a higher boost value.
4. Combine all these predicates in one "OR" predicate.
Also, I recommend checking the following:
Sitecore Solr Search Score Value
http://sitecoreinfo.blogspot.com/2015/10/sitecore-solr-search-result-items.html

Generating Boolean Searches Against an Array of Sentences to Group Results into n Results or Fewer

I feel this is a strange one. It comes from nowhere specific but it's a problem I've started trying to solve and now just want to know the answer or at least a starting place.
I have an array of x number of sentences,
I have a count of how many sentences each word appears in,
I have a count of how many sentences each word appears in with every other word,
I can search for a sentence using typical case insensitive boolean search clauses (AND +/- Word)
My data structure looks like this:
{ words: [{ word: '', count: x, concurrentWords: [{ word: '', count: x }] }] }
I need to generate an array of searches which will group the sentences into arrays of n size or less.
I don't know if it's even possible to do this in a predictable way so approximations are cool. The solution doesn't have to use the fact that I have my array of words and their counts. I'm doing this in JavaScript, not that that should matter.
Thanks in advance

Ruby - Check if intersect exists

I'm trying to speed up a search function in a RoR app with a Postgres DB. I won't explain how it works currently... just go with a "what I want to achieve" approach!
I have x number of records (potentially a substantial number) which each have an associated array of Facebook ID numbers...potentially up to 5k. I need to search against this with an individual's list of friend IDs to ascertain if an intersect between the search array and any (and which) of the records' arrays exists.
I don't need to know the result of the intersection, just whether it's true or false.
Any bright ideas?!
Thanks!
Just using pure ruby since you don't mention your datastore:
friend_ids = user.friend_ids
results = records.select { |record| !(record.friend_ids & friend_ids).empty? }
results will contain all records that have at least 1 friend_id in common. This will not be very fast if you have to check a very large number of records.
& is the array intersection operator, which is implemented in C, you can see it here: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-26
A probably faster version of #ctcherry's answer, especially when user.friend_ids has high cardinality:
require 'set'
user_friend_ids = Set.new(user.friend_ids)
results = records.select { |record|
  record.friend_ids.any? { |friend_id| user_friend_ids.include? friend_id }
}
Since this constructs the test set (hash) for user.friend_ids only once, it's probably also faster than the Array#memory_efficient_intersect linked by #Tass.
This may also be faster performed in the db, but without more info on the models, it's hard to compose an approach.

Resources