Could anyone point me in the right direction when it comes to loading terms from a vocabulary in ascending order based on the terms' weight?
This is what I have now:
$taxonomy = taxonomy_vocabulary_machine_name_load('tool_type');
$terms = entity_load('taxonomy_term', FALSE, array('vid' => $taxonomy->vid));
From what I can see, my terms are now sorted in alphabetical order, which is not what I need.
I also find it weird that alphabetical order seems to be the default when you can drag and drop the terms in the CMS to change their weight.
It would make much more sense if the terms were sorted by weight by default.
Can anyone explain to me why it is implemented this way?
You could sort them using the weight property of taxonomy terms:
$taxonomy = taxonomy_vocabulary_machine_name_load('tool_type');
$terms = entity_load('taxonomy_term', FALSE, array('vid' => $taxonomy->vid));
uasort($terms, function($a, $b) {
  if ($a->weight == $b->weight) return strcmp($a->name, $b->name);
  return ($a->weight < $b->weight) ? -1 : 1; // uasort expects an integer, not a boolean
});
This will sort the terms by weight; if two weights are equal, those terms are sorted by name.
I am new to Azure Search, so I just want to run this by you before I try to implement it. We have a search set up on items, and we want to score/rank the results based on their initial score and how many times each item has been used/downloaded. We want the items downloaded the most to appear at the top of the result list.
We have a separate field in the search index that contains the used/download count (itemCount).
I know I have to set up a Magnitude profile, but I am not sure what to use for the range, as the itemCount can be anywhere from 0 to N. So do I just set the range to some large number, i.e. 100,000,000, or what is the best practice?
var functionRankByDownload = new MagnitudeFunction()
{
Boost = 1000,
BoostingRangeStart = 0,
BoostingRangeEnd = 100000000,
ConstantBoostBeyondRange = true,
FieldName = "itemCount",
Interpolation = InterpolationTypes.Linear
};
scoringProfile1.Functions = new List() { functionRankByDownload };
I found the score calculation is as follows:
((initialScore * boost * itemCount) - min) / (max-min)
So it seems like it should work OK having a large value for the max, but again, I just want to know the best practice.
Thanks!
That seems reasonable. The BoostingRangeEnd can be any reasonable bound for your range, depending on the scenario. Since you are using ConstantBoostBeyondRange, it will also take care of boosting values outside the range appropriately.
You might also want to experiment with the boost value for a large range like this and see if a bigger boost value is more helpful for your scenario.
Importing this dataset as a table:
https://data.cityofnewyork.us/Housing-Development/Registration-Contacts/feu5-w2e2#revert
I use the following query to perform an aggregation and then attempt to sort in descending order based on the reduction field. My intention is to sort based on the count of that field, or to have the aggregation create a second field called count and sort the grouped results in descending order of the reduction array's count or length. How can this be done in RethinkDB?
query:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"}).hasFields("CorporationName").group("CorporationName").ungroup().orderBy(r.desc('reduction'))
I don't quite understand what you're going for, but does this do what you want? If not, what do you want to be different in the output?
r.table("contacts")
.filter({"Type": "Agent","ContactDescription" : "CONDO"})
.hasFields("CorporationName")
.group("CorporationName")
.ungroup()
.merge(function(row){ return {count: row('reduction').count()}; })
.orderBy(r.desc('count'))
You are almost there:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"}).hasFields("CorporationName").group("CorporationName").count().ungroup().orderBy(r.desc('reduction'))
See that .count()? That is a map-reduce operation to get the count of each group.
I haven't tested the query on your dataset. Please comment in case you have problems with it.
EDIT:
If you want to add a count field and preserve the original document, you need to use map and reduce. In your case, it should be something like:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"})
.hasFields("CorporationName")
.group("CorporationName")
.map(r.row.merge({count:1}))
.reduce(function(left, right){
return {
count: left('count').add(right('count')),
<YOUR_OTHER_FIELDS>: left('<YOUR_OTHER_FIELDS>'),
...
};
})
.ungroup().orderBy(r.desc(r.row('reduction')('count')))
EDIT:
I am not sure if this can do the trick, but it is worth a try:
.reduce(function(left, right){
return left.merge({count: left('count').add(right('count'))})
})
I want to order documents randomly in RethinkDB. The reason is that I return n groups of documents, and each group must appear in order in the results (so all documents belonging to a group are placed together); I then need to randomly pick a document belonging to the first group in the results (you don't know in advance which group will be first - the leading groups could be empty, so no documents are retrieved for them).
The solution I found is to randomly order each of the groups before concatenating them into the result, and then always pick the first document from the results (as it will be random). But I'm having a hard time ordering these groups randomly. I would appreciate any hint, or even a better solution if there is one.
If you want to order a selection of documents randomly, you can just use .orderBy and return a random number using r.random.
r.db('test').table('table')
.orderBy(function (row) { return r.random(); })
If these documents are in a group and you want to randomize them inside the group, you can just call orderBy after the group statement.
r.db('test').table('table')
.groupBy('property')
.orderBy(function (row) { return r.random(); })
If you want to randomize the order of the groups, you can just call orderBy after calling .ungroup
r.db('test').table('table')
.groupBy('property')
.ungroup()
.orderBy(function (row) { return r.random(); })
The accepted answer here should not be possible: as John mentioned, the sorting function must be deterministic, which r.random() is not.
The r.sample() function could be used to return a random order of the elements:
If the sequence has less than the requested number of elements (i.e., calling sample(10) on a sequence with only five elements), sample will return the entire sequence in a random order.
So, count the number of elements you have, set that number as the sample size, and you'll get a randomized response.
Example:
var res = r.db("population").table("europeans")
.filter(function(row) {
return row('age').gt(18)
});
var num = res.count();
res.sample(num)
I'm not getting this to work. I tried to sort a table randomly and I'm getting the following error:
e: Sorting by a non-deterministic function is not supported in:
r.db("db").table("table").orderBy(function(var_33) { return r.random(); })
I have also read in the RethinkDB documentation that this is not supported. This is from the RethinkDB orderBy documentation:
Sorting functions passed to orderBy must be deterministic. You cannot, for instance, order rows using the random command. Using a non-deterministic function with orderBy will raise a ReqlQueryLogicError.
Any suggestions on how to get this to work?
One simple solution would be to give each document a random number:
r.db('db').table('table')
  .merge(doc => ({
    random: r.random(1, 10)
  }))
  .orderBy('random')
Say there's a list. Each item in the list has a unique id.
List [5, 2, 4, 3, 1]
When I remove an item from this list, the unique id from the item goes with it.
List [5, 2, 3, 1]
Now say I want to add another item to the list, and give it the lowest available unique id.
What's the easiest way to get the lowest unique id when adding a new item to the list?
Here's the restriction though: I'd prefer it if I didn't reassign the unique id of another item when deleting an item.
I realise it would be easy to find the unique id if I reassigned unique id 5 to unique id 4 when I deleted 4. Then I could get the length of the list (5) and create the new item with that number as its unique id.
So is there another way, that doesn't involve iterating through the entire list?
EDIT:
Language is java, but I suppose I'm looking for a generic algorithm.
An easy, fast way is to put your deleted ids in a priority queue and pick the next id from there when you insert new ones (or use size() + 1 of the list as the id when the queue is empty). This does, however, require another data structure.
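A minimal sketch of that idea in Java (the question only states the language is Java; the class and method names here are my own invention): recycled ids sit in a PriorityQueue so the smallest one is handed back first, and a counter supplies fresh ids when nothing has been recycled.
import java.util.PriorityQueue;

// Illustrative sketch only: names are invented and ids are assumed to start at 1.
class IdAllocator {
    private final PriorityQueue<Integer> freedIds = new PriorityQueue<>();
    private int nextFreshId = 1; // next id that has never been handed out

    // Lowest available id: a recycled one if any, otherwise a fresh one.
    int acquire() {
        Integer recycled = freedIds.poll();
        return recycled != null ? recycled : nextFreshId++;
    }

    // Call this when an item is deleted; its id becomes available again.
    void release(int id) {
        freedIds.offer(id);
    }
}
Deleting the item with id 4 from [5, 2, 4, 3, 1] would call release(4), and the next acquire() then returns 4 rather than 6.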
You could maintain a list of available ID's.
Declare a boolean array (pseudo code):
boolean register[3];
register[0] = false;
register[1] = false;
register[2] = false;
When you add an element, loop from the bottom of the register until a false value is found. Set the false value to true, assign that index as the unique identifier.
removeObject(index)
{
    register[index] = false;
}
getAndSetLowestIndex()
{
    for (i = 0; i < register.size; i++)
    {
        if (register[i] == false)
        {
            register[i] = true;
            return i;
        }
    }
    // Array is full: grow the register by one slot, mark it used and return its index
    newIndex = register.size;
    register.size = register.size + 1;
    register[newIndex] = true;
    return newIndex;
}
When you remove an element, simply set the index to false.
You can optimise this for larger lists by keeping continuity markers so you don't need to loop over the entire thing.
This would work best for your example where the indexes are in no particular order, so you skip the need to sort them first.
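One way to read the "continuity marker" optimisation (this is my interpretation of that remark, not a spelled-out part of the answer): remember the lowest index that could possibly be free, so acquisition rarely has to scan from zero. A rough Java sketch:
import java.util.ArrayList;
import java.util.List;

class Register {
    private final List<Boolean> used = new ArrayList<>();
    private int lowestMaybeFree = 0; // no index below this can be free

    int acquire() {
        int i = lowestMaybeFree;
        while (i < used.size() && used.get(i)) i++;   // scan only from the marker onward
        if (i == used.size()) used.add(Boolean.TRUE); // register is full, grow by one slot
        else used.set(i, Boolean.TRUE);
        lowestMaybeFree = i + 1;
        return i;
    }

    void release(int index) {
        used.set(index, Boolean.FALSE);
        lowestMaybeFree = Math.min(lowestMaybeFree, index); // the marker may move back down
    }
}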
It's equivalent to a search, just this time you search for a missing number. If your IDs are sorted integers, you can go from bottom to top, checking whether the gap between two consecutive IDs is greater than 1.
If you know how many items are in the list and it's sorted, you can implement a binary search.
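For completeness, a small Java sketch of that binary search (assuming the ids are sorted, start at 1, and contain no duplicates; the class and method names are made up): the first position where the stored id no longer equals index + 1 marks the lowest missing id.
import java.util.Arrays;
import java.util.List;

class LowestFreeId {
    // Assumes a sorted, duplicate-free list of ids that would ideally be 1, 2, 3, ...
    static int find(List<Integer> sortedIds) {
        int lo = 0, hi = sortedIds.size(); // search window, hi is exclusive
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (sortedIds.get(mid) == mid + 1) {
                lo = mid + 1; // everything up to mid is where it should be
            } else {
                hi = mid;     // a gap exists at or before mid
            }
        }
        return lo + 1;        // lowest id that is not present
    }

    public static void main(String[] args) {
        System.out.println(find(Arrays.asList(1, 2, 3, 5))); // prints 4
    }
}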
I don't think you can do this without iterating through the list.
When you say
'Now say I want to add another item to the list, and give it the lowest available unique id.'
I assume you mean you want to assign the lowest available ID that has not been used elsewhere.
You can do this:
private int GetLowestFreeID(List<int> list){
    for (int idx = 0; idx < list.Count; ++idx){
        if ( list[idx] == idx ) continue;
        else return idx;
    }
    return list.Count; // every slot matches its index, so the next free id is the length
}
this returns the lowest free index.
This assumes your list is sorted, and is in C# but you get the idea.
The data structure that would be used to do this is a priority binary heap that only allows unique values.
How about keeping the list sorted? Then you can remove it from one end easily.
Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which returns TRUE if the two input articles are related and FALSE if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve that as well, but the main point that is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();
function check($str1, $str2){
$minprozent = 60;
similar_text($str1, $str2, $prozent);
$prozent = sprintf("%01.2f", $prozent);
if ($prozent > $minprozent) {
return TRUE;
}
else {
return FALSE;
}
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
$rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
$rel2 = mysql_query($rel1);
$rel2a = mysql_num_rows($rel2);
if ($rel2a > 0) {
while ($rel3 = mysql_fetch_assoc($rel2)) {
if (check($sql3['text'], $rel3['text']) == TRUE) {
$id_a = $sql3['id'];
$id_b = $rel3['id'];
$rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
$rein2 = mysql_query($rein1);
$rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
$rein4 = mysql_query($rein3);
}
}
}
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
$square = pow($number, 2);
return $square;
}
function check($text1, $text2) {
$words_sub = text_splitter($text2); // splits the text into single words
$words = text_splitter($text1); // splits the text into single words
// document 1 start
$document1 = array();
foreach ($words as $word) {
if (in_array($word, $words)) {
if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
}
}
$rating1 = 0;
foreach ($document1 as $temp) {
$rating1 = $rating1+square($temp);
}
$rating1 = sqrt($rating1);
// document 1 end
// document 2 start
$document2 = array();
foreach ($words_sub as $word_sub) {
if (in_array($word_sub, $words)) {
if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
}
}
$rating2 = 0;
foreach ($document2 as $temp) {
$rating2 = $rating2+square($temp);
}
$rating2 = sqrt($rating2);
// document 2 end
$skalarprodukt = 0;
for ($m=0; $m<count($words)-1; $m++) {
$skalarprodukt = $skalarprodukt+(array_shift($document1)*array_shift($document2));
}
if (($rating1*$rating2) == 0) { return FALSE; } // "continue" is invalid outside a loop; a zero-length vector cannot be related
$kosinusmass = $skalarprodukt/($rating1*$rating2);
if ($kosinusmass < 0.7) {
return FALSE;
}
else {
return TRUE;
}
}
?>
I would also like to say that I know there are lots of algorithms for clustering, but on every site there is only the mathematical description, which is a bit difficult for me to understand. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like yours is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say that across all your articles there are only 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:
for word in article.text
article.histogram[indexLookup[word]]++
Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:
def check(articleA, articleB)
rtn = 0
for a,b in zip(articleA.histogram, articleB.histogram)
rtn += a*b
return rtn > threshold
(Sorry for using Python instead of PHP; my PHP is rusty and the use of zip makes that bit easier.)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost certainly have to factor in the article length somewhere) and decide what you consider 'related'.
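One concrete option for that normalization is cosine similarity; here is a hedged sketch in Java rather than PHP (the names are invented): dividing the dot product by the product of the two vector lengths keeps the score between 0 and 1 regardless of article length, so a single threshold becomes meaningful.
class Similarity {
    // histA and histB are word-count histograms built over the same word index.
    static double cosine(int[] histA, int[] histB) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < histA.length; i++) {
            dot   += (double) histA[i] * histB[i];
            normA += (double) histA[i] * histA[i];
            normB += (double) histB[i] * histB[i];
        }
        if (normA == 0 || normB == 0) return 0; // an empty histogram matches nothing
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
Two articles would then count as 'related' when cosine(...) exceeds a threshold chosen empirically, much like the 0.7 cutoff in Approach 2.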
Also, you should not just put every word into your histogram. In general, you'll want to include the ones that are used semi-frequently: not in every article, but not in only one article either. This saves you a bit of overhead on your histogram and increases the value of your relations.
By the way, this technique is described in more detail here
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K*number of documents per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.
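To make that loop concrete, here is a rough Java sketch of the K-means iteration described above, operating on dense document vectors (the feature extraction, the choice of K, and the fixed iteration count are all assumptions you would tune):
import java.util.Random;

class KMeansSketch {
    // docs[i] is the feature vector of article i; returns a cluster index per article.
    static int[] cluster(double[][] docs, int k, int iterations, Random rnd) {
        int n = docs.length, dim = docs[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = docs[rnd.nextInt(n)].clone(); // initial centers: random documents
        int[] assignment = new int[n];
        for (int it = 0; it < iterations; it++) {
            // 1. assign every document to the closest center
            for (int i = 0; i < n; i++) assignment[i] = closest(docs[i], centers);
            // 2. recompute each center as the mean of its assigned documents
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += docs[i][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++) centers[c][d] = sums[c][d] / counts[c];
        }
        return assignment;
    }

    static int closest(double[] doc, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int d = 0; d < doc.length; d++) {
                double diff = doc[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
Each iteration touches every document once per center, which is the K * number-of-documents cost mentioned above.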
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but it seems too arbitrary to catch all useful clustering of the texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform the clustering, identifying nearest neighbors for each article, which you could then load into the related table. I would do the clustering in R - there's a nice tutorial on clustering data in a file by running R from PHP.