Azure Search Scoring Profile Magnitude by Downloads - ranking

I am new to Azure Search, so I want to run this by you before I try to implement it. We have search set up on items and want to rank the results based on each item's initial score and how many times the item has been used/downloaded. We want the most-downloaded items to appear at the top of the result list.
We have a separate field in the search index that contains the used/download count (itemCount).
I know I have to set up a scoring profile with a Magnitude function, but I am not sure what to use for the range, as itemCount can be anywhere from 0 to N. Do I just set the range end to some large number, e.g. 100,000,000, or is there a best practice?
var functionRankByDownload = new MagnitudeFunction()
{
    Boost = 1000,
    BoostingRangeStart = 0,
    BoostingRangeEnd = 100000000,
    ConstantBoostBeyondRange = true,
    FieldName = "itemCount",
    Interpolation = InterpolationTypes.Linear
};
scoringProfile1.Functions = new List<ScoringFunction>() { functionRankByDownload };
I found the score calculation is as follows:
((initialScore * boost * itemCount) - min) / (max-min)
So it seems like it should work OK with a large value for the max, but again I just want to know the best practice.
Thanks!

That seems reasonable. BoostingRangeEnd can be any sensible upper bound for your range, depending on the scenario. Since you are using ConstantBoostBeyondRange, values above the end of the range still receive the full boost.
With a range this large, you might also want to experiment with the boost value and see whether a bigger boost is more helpful for your scenario.
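To get a feel for why the boost value matters with such a wide range, here is a minimal sketch of linear interpolation over a magnitude range. It is a simplified model for intuition only, not Azure Search's exact scoring formula; the function name magnitude_boost and the idea of a multiplier that starts at 1.0 are assumptions made for illustration.

```python
def magnitude_boost(value, range_start, range_end, boost, constant_beyond_range=True):
    """Simplified linear magnitude boost: 1.0 at range_start, `boost` at range_end."""
    if value <= range_start:
        return 1.0
    if value >= range_end:
        # With a constant boost beyond the range, values past the end keep the
        # full boost; otherwise this model applies no boost to them.
        return float(boost) if constant_beyond_range else 1.0
    fraction = (value - range_start) / (range_end - range_start)
    return 1.0 + (boost - 1.0) * fraction


# Compare a very wide range with a tighter one for a few download counts.
for downloads in (10, 1_000, 100_000):
    wide = magnitude_boost(downloads, 0, 100_000_000, boost=1000)
    tight = magnitude_boost(downloads, 0, 100_000, boost=1000)
    print(downloads, round(wide, 4), round(tight, 2))
```

In this model, an item with 1,000 downloads barely moves when the range ends at 100,000,000, which is why it can be worth choosing the range end close to the counts you actually see.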

I need to find a faster solution to iterate rows in Google App Script

I'm trying to save row values for multiple columns across multiple tabs in GAS, but it takes a long time and I'd like to find a faster way of doing this, if there is one.
Each project, e.g. 'Project1', is a key whose value is the column where that project's data is stored; each tab is 600+ rows long.
The script first opens a tab called 'Person1' and goes through all the rows of the column that corresponds to each project in the 'projects' dictionary (the format is the same for every tab, but more projects will be added in the future).
Right now I iterate through the 'members' dictionary (length m), then through the 'projects' dictionary (length p), and finally through the rows (length r); in the meantime the script accesses the other spreadsheet where I want to save all those rows.
This means the current time complexity of my algorithm is O(m*p*r), and it's WAY too slow.
For 15 people with 6 projects each, that is 15 * 6 * 600 = 54,000 iterations at least (more people, projects, and rows will be added).
Is there any way to make my algorithm faster?
const members = {'Person1':'P1', 'Person2':'P2'};
const projects = {'Project1':'L','Project2':'R'}
function saveRowValue() {
  let sourceSpreadsheet = SpreadsheetApp.getActiveSpreadsheet();
  let targetSpreadsheet = SpreadsheetApp.openById('-SPREADSHEET-');
  let targetSheet = targetSpreadsheet.getSheetByName('Tracking time');
  let rowsToWrite = [];
  rowsToWrite.push(['Project', 'Initials', 'Date', 'Tracking time'])
  var rowsToSave = 1;
  for(m in members){
    Logger.log(m +' initials:'+ members[m]);
    let sourceSheet = sourceSpreadsheet.getSheetByName(m);
    for(p in projects){
      let values = sourceSheet.getRange(projects[p]+"1:"+projects[p]).getValues();
      Logger.log(values)
      let list = [null, 0,''];
      for(var i=0; i<values.length; i++){
        try{
          let date = sourceSheet.getRange('B'+i).getValue();
          let val = sourceSheet.getRange(projects[p]+i)
          val = Utilities.formatDate(val.getValue(), "GMT", val.getNumberFormat())
          Logger.log(val);
          if(!(list.includes(val)) && date instanceof Date){
            //rowsToWrite.push();
            rowsToSave++;
            targetSheet.getRange(rowsToSave,1,1,4).setValues([[p, members[m], date, val]]);
          }
        }catch(e){
          Logger.log(e)
        }
      }
    }
  }
  Logger.log(rowsToWrite);
}
[Here you can see how much time it takes to iterate 600 rows for a single project and a single member after changing what Yuri Khristich told me to change][1]
[1]: https://i.stack.imgur.com/CnRZY.png
The first step is to get rid of getValue() and setValue() calls inside loops. All data should be read at once as 2D arrays in one step and written to the sheet in one step as well. No single-cell or single-row operations.
The next trick depends on your workflow. It's unlikely that all 54,000+ cells need to be checked every time; there are probably ranges that have not changed. You can find some way to flag the changes and process only the changed ranges. The flagging could probably be done with an onChange() trigger. For example, you can append * to the names of the sheets and columns where changes have occurred, and remove these * whenever you run your script.
Reference:
Use batch operations
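The in-memory part of that advice can be sketched independently of the spreadsheet API. The Python below is only a model: the two lists stand in for columns that would each be read once (in Apps Script, one getValues() call per column), and the resulting rows would be written with a single setValues() call. The function name collect_rows and the sample data are made up for illustration.

```python
from datetime import date


def collect_rows(project, initials, date_column, project_column, skip=(None, 0, "")):
    """Build all output rows in memory from two columns that were each read once."""
    rows = []
    for day, tracked in zip(date_column, project_column):
        # Keep a row only if it has a real date and a non-empty tracked value.
        if isinstance(day, date) and tracked not in skip:
            rows.append([project, initials, day, tracked])
    return rows


# Stand-ins for the date column and one project column of a single tab.
dates = ["Date", date(2021, 5, 3), date(2021, 5, 4), None]
hours = ["Project1", "01:30", "", "02:00"]

rows = [["Project", "Initials", "Date", "Tracking time"]]
rows += collect_rows("Project1", "P1", dates, hours)
print(rows)  # this whole list would be written to the target sheet in one call
```

The loop count stays the same, but every iteration now touches only arrays in memory, so the number of slow spreadsheet calls no longer grows with the number of rows.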

Custom scoring in ElasticSearch

How do I use the following function? (For Elastica in PHP, with respect to the Function Score query.)
addScriptScoreFunction($script, $filter)
Does the filter remove results, or does the script only score the documents that pass the filter? How efficient is the scoring?
Also, can I add more than one script score function to a function score query?
$keyword = 'foo';
$field = 'name';
$inner_query = new Elastica\Query\Match();
$inner_query->setFieldQuery($field, $keyword);
// Wrap the function_score around the initial query
$scorefunction = new Elastica\Query\FunctionScore();
$scorefunction->setQuery($inner_query);
$scorefunction->setBoostMode('replace'); // Otherwise it will be multiplied with _score
// Make the custom score function: boost max 20% of initial _score, depending on popularity
$script = new Elastica\Script("_score + (doc['popularity'].value * 0.2 * _score)/100");
$scorefunction->addScriptScoreFunction($script);
// Last step: put that all in Elastica\Query and execute with Elastica\Search
There are some possible pitfalls:
Without ->setBoostMode('replace'); the original _score is multiplied by the result of the script. In my case addition was desired, hence 'replace'.
It seems that divisions are rounded down. The popularity I used in my formula is always between 1 and 100, so popularity/100 alone was always rounded down to 0 and the formula seemed to have no effect.
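The rounding pitfall is easy to reproduce outside Elasticsearch. The sketch below uses Python only to model the arithmetic; how the script actually behaves depends on the scripting language Elasticsearch is configured with, and both function names are made up for illustration.

```python
def boosted_integer_math(score, popularity):
    # popularity // 100 truncates to 0 for any popularity from 1 to 99,
    # so the boost term vanishes.
    return score + score * 0.2 * (popularity // 100)


def boosted_float_math(score, popularity):
    # Multiplying by the float 0.2 before dividing keeps the fraction,
    # which mirrors the order of operations in the script above.
    return score + (popularity * 0.2 * score) / 100


print(boosted_integer_math(2.0, 50))  # the boost term is lost
print(boosted_float_math(2.0, 50))    # the boost term survives
```

Ordering the expression so that a floating-point factor is applied before the division is what keeps the popularity term from collapsing to zero.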

Size of a filtered dimension in Crossfilter?

I've read through the Crossfilter API docs several times but can't see how to do the following.
Suppose I have set up
crossfilter(event);
and a dimension foo:
var foo = event.dimension(function(d) { return d.foo; }),
foos = foo.group(function(d) { return Math.floor(d) ; });
Then, before any filters are applied, event.size() will give me the number of records in the event, and foos.size() will give me the number of distinct records in the foo dimension.
Great! Now I apply some filters by sliding brushes around. event.groupAll().value() now gives me the current number of records in event that are selected. Great again.
Now how do I get the current number of distinct records in the foo dimension? I've tried many different combinations of the API primitives, but none seem to work.
Any ideas?
This should do the trick
var n = foo.top(Number.POSITIVE_INFINITY).length;
I do not have enough reputation to comment on the solution proposed by Reno:
This should do the trick
var n = foo.top(Number.POSITIVE_INFINITY).length;
The problem with this solution is that it is not efficient, because the top function sorts the data.
I have the same problem as you, and I keep a counter in the filter to know how many entries the dimension has.
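If the goal is the number of distinct group keys among the currently selected records, one pass with a set is enough and no sorting is involved. The following is a language-neutral sketch in Python rather than Crossfilter code; the records, the bar field, and the filter predicate are invented for illustration.

```python
import math


def count_distinct_groups(records, is_selected, key=lambda r: math.floor(r["foo"])):
    """Count distinct group keys among selected records in one pass, without sorting."""
    return len({key(r) for r in records if is_selected(r)})


records = [
    {"foo": 1.2, "bar": 5},
    {"foo": 1.9, "bar": 7},
    {"foo": 3.4, "bar": 7},
    {"foo": 4.0, "bar": 1},
]

print(count_distinct_groups(records, lambda r: True))           # no filter applied
print(count_distinct_groups(records, lambda r: r["bar"] >= 5))  # filter on another field
```

In Crossfilter itself, a comparable number can probably be read from the group by counting the entries of foos.all() whose value is non-zero, keeping in mind that a group does not observe the filter on its own dimension.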

mongoDB geoNear command with count

I am using the geoNear command with Mongoid in order to retrieve a document collection ordered by distance. I need the distance for each document in the collection, which is why I am having to resort to the geoNear command.
Given the following command:
category_ids = ["list", "of", "ids"]
cmd = Hash.new
cmd[:geoNear] = :poi
cmd[:near] = [params[:location][:x], params[:location][:y]]
cmd[:query] = {
  "$or" => [
    {primary_category_id: {"$in" => category_ids}},
    {category_ids: {"$in" => category_ids}}
  ]
}
cmd[:spherical] = true
cmd[:num] = num
res = Poi.collection.database.command cmd
My problem is that I require the total number of results in the collection. Sure, I could run another query that just counts the number of items that satisfy the query part of the command, but that would be pretty inefficient and also not very extensible, as every change I make to the command would have to be reflected in the count query. Just adding a maxDistance would land me in a whole heap of trouble.
Another option would be to go with find and calculate the distance manually, but again I would like to avoid that.
So my question is: is there a clever way of getting the number of documents matched by the command (ignoring the num limit) without having to run a separate query or calculate the distance manually with find?
You can use $facet for this. After the $geoNear stage, add a $facet stage with two sub-pipelines: one projects the documents, and the other uses $group with _id: null and a count accumulator to count the total number of documents.
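A possible shape for such a pipeline is sketched below as plain Python data, so it can be inspected without a database. The field name distance, the facet names results and total, and the helper name build_geo_count_pipeline are assumptions; the query part mirrors the command from the question.

```python
def build_geo_count_pipeline(x, y, category_ids, num):
    """Build a $geoNear + $facet aggregation that returns a page and a total count."""
    match = {
        "$or": [
            {"primary_category_id": {"$in": category_ids}},
            {"category_ids": {"$in": category_ids}},
        ]
    }
    return [
        # $geoNear has to be the first stage of the pipeline.
        {"$geoNear": {
            "near": [x, y],
            "distanceField": "distance",  # assumed name for the computed distance
            "spherical": True,
            "query": match,
        }},
        {"$facet": {
            # First sub-pipeline: the documents themselves, limited to `num`.
            "results": [{"$limit": num}],
            # Second sub-pipeline: one document holding the total match count.
            "total": [{"$group": {"_id": None, "count": {"$sum": 1}}}],
        }},
    ]


pipeline = build_geo_count_pipeline(13.4, 52.5, ["list", "of", "ids"], num=20)
print(pipeline)
# With PyMongo this could be run as db.poi.aggregate(pipeline).
```

Two caveats worth checking against your server version: older releases apply a default limit inside $geoNear, and the $facet result is a single document, so it is subject to the document size limit.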

Best clustering algorithm? (simply explained)

Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();
function check($str1, $str2){
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
    $square = pow($number, 2);
    return $square;
}
function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words
    // document 1 start
    $document1 = array();
    foreach ($words as $word) {
        if (in_array($word, $words)) {
            if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
        }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1+square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end
    // document 2 start
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2+square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end
    // scalar product: multiply the counts of the same word in both documents
    $skalarprodukt = 0;
    foreach ($document1 as $word => $count) {
        if (isset($document2[$word])) {
            $skalarprodukt = $skalarprodukt+($count*$document2[$word]);
        }
    }
    // no shared vocabulary: not related ("continue" is invalid outside a loop)
    if (($rating1*$rating2) == 0) { return FALSE; }
    $kosinusmass = $skalarprodukt/($rating1*$rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    }
    else {
        return TRUE;
    }
}
?>
I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like yours is the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say that across all your articles you only have 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where each entry is the number of times the corresponding word appears in the article. So if the first spot in the vector represents the word 'asked', and that word appears 5 times in the article, vector[0] would be 5:
for word in article.text:
    article.histogram[indexLookup[word]] += 1
Now, comparing any two articles is pretty straightforward. We simply take the dot product of the two vectors:
def check(articleA, articleB):
    rtn = 0
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a*b
    return rtn > threshold
(Sorry for using Python instead of PHP; my PHP is rusty and the use of zip makes that bit easier.)
This is the basic idea. Notice that the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost certainly have to factor in the article length somewhere) and decide what you consider 'related'.
Also, you should not just put every word into your histogram. In general, you'll want to include the ones that are used semi-frequently: not in every article, nor in only one article. This saves you a bit of overhead on your histogram and increases the value of your relations.
By the way, this technique is described in more detail here
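One common way to normalize the dot product is cosine similarity, which divides by the lengths of both histogram vectors so that long articles do not automatically score higher. The sketch below is one possible version, not the answer's exact method; the tokenizer regex is an assumption, and the 0.7 threshold is borrowed from Approach 2 in the question.

```python
import math
import re
from collections import Counter


def histogram(text):
    """Word-count histogram; a sparse stand-in for the fixed-size vector above."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(hist_a, hist_b):
    # Dot product over the words the two histograms share.
    dot = sum(count * hist_b[word] for word, count in hist_a.items() if word in hist_b)
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)


def check(text_a, text_b, threshold=0.7):
    return cosine_similarity(histogram(text_a), histogram(text_b)) >= threshold


print(check("the cat sat on the mat", "the cat sat on a mat"))
print(check("the cat sat on the mat", "stock markets fell sharply today"))
```

Because the result is always between 0 and 1 for word counts, the threshold no longer depends on article length, which makes it easier to tune.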
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this and not fast enough. But you can probably access a Lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K*number of documents per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.
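The K-means loop described above fits in a few lines. This is a bare-bones sketch on small dense vectors, assuming squared Euclidean distance and a fixed random seed for the initial centers; real document vectors would be much larger and sparse, and the function names are made up for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Choose K of the documents as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = [None] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its closest center: K * N distance computations.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Recalculate each center as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean_vector(members)
    return assignment, centers


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(points, k=2)
print(labels)
```

Each pass compares every document with K centers instead of with every other document, which is where the saving over all-pairs comparison comes from.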
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on the White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but seems too arbitrary to catch all useful clustering of texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform a clustering that identifies the nearest neighbors for each article, which you could then load into the related table. I would do the clustering in R - there's a nice tutorial on clustering data in a file by running R from PHP.
