Related
Code example:
import { API, graphqlOperation } from "aws-amplify"
const fetchInitialAds = () => {
const { filterAttributes } = useContext(FilterContext)
const {
query: { slug },
} = useRouter()
const fetchAds = async ({ pageParam = null }) => {
const {
deal_type,
priceFrom,
priceTo,
} = filterAttributes
const currTimeInSeconds = Math.ceil(Date.now() / 1000)
const options = {
subcategoryID: `SUBCATEGORY#${slug}`,
sortDirection: "DESC",
limit: 24,
nextToken: pageParam,
filter: {
and: [
{
or: [
{
list_price: {
between: [
Number(priceFrom) || 0,
Number(priceTo) || 100000000,
],
},
},
{
new_price: {
between: [
Number(priceFrom) || 0,
Number(priceTo) || 100000000,
],
},
},
],
},
],
},
}
const adsResult = await API.graphql(
graphqlOperation(adsBySubcategoryByExpdate, options)
)
const data = await adsResult.data.adsBySubcategoryByExpdate
return data
}
return useInfiniteQuery(["ads", slug], fetchAds, {
getNextPageParam: (currentPage, allPages) => currentPage.nextToken,
})
}
Problem with 'Query' with 'limit' + 'filter' on a big database for example with 5000 items in it. If you limit down to show 24 items at a time (for pagination), the 'filter' expresion will filter only the first 24 items in your table and return matched items. If you get filter-match for 5 out of 24 first items it will return only these 5 items... <-- this means your first page you want to display for client will be not full with 24 items but only 5 items... and then u need to filter again next 24 items with 'nextToken'. This time maybe you get 10 matched items. So now you are displaying only 16 items for the client (still not full page of 24 items...). And lets say the third run on next 24 items u get 0 matched items <-- you will not get any new items to display your client. And it will feel wierd that - You clicked on Button to fetch next 24 items but got 0 items and you have to continue clicking that button till it hits some new matched items and till you have went throught the whole Table....
So my question is: Is there a solution where if you put a limit of 24 items, so that dynamo db collects full list of 24 matched items and only then return a full page? I dont want to get halffull pages and spam my 'Fetch more' button and see if i havnt missed any other filter-matched items......
The fact that a Query operation first reads a page of data (1MB of data, or Limit rows if you specified that parameter), and only then does filtering on it, is deliberate and documented:
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression. ... A Query operation can return an empty result set and a LastEvaluatedKey if all the items read for the page of results are filtered out.
The reason for this design is that it limits the latency and the cost of each request: Imagine a situation that in a 1 TB database you run a query with a filter that only matches 5 items. If the query were to continue until it can return 5 items, it would potentially read 1 TB of data before returning anything - which will cause huge latency (the client would most likely assume the connection broke and disconnect it before getting a response...), as well as huge cost. By returning an empty page each time the database reads 1 MB of data, the client remains aware that the query is progressing and doesn't time out - and also remains aware of the cost of this query and given an opportunity to stop it.
One could imagine a better API, which includes both a limit on the number of items to read, and a limit on the items to return, with a page being ended as soon as either one of these limits is reached. Unfortunately, DynamoDB doesn't have such an API. As you noticed you are forced to use a Limit and retry if not enough results were returned.
If you believe that the matching results are uniformly distributed across items, you can attempt to "guess" how big Limit should be to produce roughly the right number of results - and you can improve this guess as you go along. Guessing a too-low Limit just means you'll need to issue a second query, and guessing a too-high Limit just means you will have read a bit too much (and paid a bit too much), but in either case it's (usually) not a disaster. In any case you don't need the user to click to get more results - you can do it internally in code.
I have data in the form:
data = [..., {id:X,..., turnover:[[2015,2017,2018],[2000000,3000000,2800000]]}, ...];
My goal is to plot the year in the x-axis, against the average turnover for all companies currently selected via crossfilter in the y-axis.
The years recorded per company are inconsistent, but there should always be three years.
If it would help, I can reorganise the data to be in the form:
data = [..., {id:X,..., turnover:{2015:2000000, 2017:3000000, 2018:2800000}}, ...];
Had I been able to reorganise the data further to look like:
[...{id:X, ..., year:2015, turnover:2000000},{id:X,...,year:2017,turnover:3000000},{id:X,...,year:2018,turnover:2800000}];
Then this question would provide a solution.
But splitting the companies into separate rows doesn't make sense with everything else I'm doing.
Unless I'm mistaken, you have what I call a "tag dimension", aka a dimension with array keys.
You want each row to be recorded once for each year it contains, but you only want it to affect this dimension. You don't want to observe the row multiple times in the other dimensions, which is why you don't want to flatten.
With your original data format, your dimension definition would look something like:
var yearsDimension = cf.dimension(d => d.turnover[0], true);
The key function for a tag dimension should return an array, here of years.
This feature is still fairly new, as crossfilter goes, and a couple of minor bugs were found this year. These bugs should be easy to avoid. The feature has gotten a lot of use and no major bugs have been found.
Always beware with tag dimensions, since any aggregations will add up to more than 100% - in your case 300%. But if you are doing averages across companies for a year, this should not be a problem.
pairs of tags and values
What's unique about your problem is that you not only have multiple keys per row, you also have multiple values associated with those keys.
Although the crossfilter tag dimension feature is handy, it gives you no way to know which tag you are looking at when you reduce. Further, the most powerful and general group reduction method, group.reduce(), doesn't tell you which key you are reducing..
But there is one even more powerful way to reduce across the entire crossfilter at once: dimension.groupAll()
A groupAll object behaves like a group, except that it is fed all of the rows, and it returns only one bin. If you use dimension.groupAll() you get a groupAll object that observes all filters except those on that dimension. You can also use crossfilter.groupAll if you want a groupAll that observes all filters.
Here is a solution (using ES6 syntax for brevity) of reduction functions for groupAll.reduce() that reduces all of the rows into an object of year => {count, total}.
function avg_paired_tag_reduction(idTag, valTag) {
return {
add(p, v) {
v[idTag].forEach((id, i) => {
p[id] = p[id] || {count: 0, total: 0};
++p[id].count;
p[id].total += v[valTag][i];
});
return p;
},
remove(p, v) {
v[idTag].forEach((id, i) => {
console.assert(p[id]);
--p[id].count;
p[id].total -= v[valTag][i];
})
return p;
},
init() {
return {};
}
};
}
It will be fed every row and it will loop over the keys and values in the row, producing a count and total for every key. It assumes that the length of the key array and the value array are the same.
Then we can use a "fake group" to turn the object on demand into the array of {key,value} pairs that dc.js charts expect:
function groupall_map_to_group(groupAll) {
return {
all() {
return Object.entries(groupAll.value())
.map(([key, value]) => ({key,value}));
}
};
}
Use these functions like this:
const red = avg_paired_tag_reduction('id', 'val');
const avgPairedTagGroup = turnoverYearsDim.groupAll().reduce(
red.add, red.remove, red.init
);
console.log(groupall_map_to_group(avgPairedTagGroup).all());
Although it's possible to compute a running average, it's more efficient to instead calculate a count and total, as above, and then tell the chart how to compute the average in the value accessor:
chart.dimension(turnoverYearsDim)
.group(groupall_map_to_group(avgPairedTagGroup))
.valueAccessor(kv => kv.value.total / kv.value.count)
Demo fiddle.
Summary
I want to pull out Year over Year stats in a Crossfilter-DC driven dashboard
Year over Year (YoY) Definition
2017 YoY is the total units in 2017 divided by the total units in 2016.
Details
I'm using DC.js (and therefore D3.js & Crossfilter) to create an interactive Dashboard that can also be used to change the data it's rendering.
I have data, that though wider (has ~6 other attributes in addition to date and quantity: size, color, etc...sales data), boils down to objects like:
[
{ date: 2017-12-7, quantity: 56, color: blue ...},
{ date: 2017-2-17, quantity: 104, color: red ...},
{ date: 2016-12-7, quantity: 60, color: red ...},
{ date: 2016-4-15, quantity: 6, color: blue ...},
{ date: 2017-2-17, quantity: 10, color: green ...},
{ date: 2016-12-7, quantity: 12, color: green ...}
...
]
I'm displaying one rowchart per attribuet such that you can see the totals by color, size, etc. People would use each of these charts to be able to see the totals by that attribute and drill into the data by filtering by just a color, or a color and a size, or a size, etc. This setup is all (relatively) straight forward and kind of what DC is made for.
However, now I'd like to add some YoY stats such that I can show a barchart with x-axis as the years, and the y-axis as the YoY values (ex. YoY-2019 = Units-2019 / Units-2018). I'd also like to do the same by quarter and month such that I could see YoY Mar-2019 = Units-Mar-2019 / Units-Mar-2018 (and the same for quarter).
I have a year dimension and sum quantity
var yearDim = crossfilterObject.dimension(_ => _.date.getFullYear());
var quantityGroup = yearDim.group.reduceSum(_ => _.quantity);
I can't figure out how to do the Year over Year calc though in the nice, beautiful DC.js-way.
Attempted Solutions
Year+1
Add another dimension that's year + 1. I didn't' really get any further though because all I get out of it are two dimensions whose year groups I want to divide ... but am not sure how.
var yearPlusOneDim = crossfilterObject.dimension(_ => _.date.getFullYear() + 1);
Visually I can graph the two separately and I know, conceptually, what I want to do: which is divide the 2017 number in yearDim by the 2017 number in YearPlusOneDim (which, in reality, is the 2016 number). But "as a concept is as far as I got on this one.
Abandon DC Graphing
I could always use the yearDim's quantity group to get the array of values, which I could then feed into a normal D3.js graph.
var annualValues = quantityGroup.all();
console.log(annualValues);
// output = [{key: 2016, value: 78}, {key: 2017, value: 170}]
// example data from the limited rows listed above
But this feels like a hacky solution that's bound to fail and not benefit from all the rapid and dynamic DC updating.
I'd use a fake group, in order to solve this in one pass.
As #Ethan says, you could also use a value accessor, but then you'd have to look up the previous year each time a value is accessed - so you'd probably have to keep an extra table around. With a fake group, you only need this table in the body of your .all() function.
Here's a quick sketch of what the fake group might look like:
function yoy_group(group) {
return {
all: function() {
// index all values by date
var bydate = group.all().reduce(function(p, kv) {
p[kv.key.getTime()] = kv.value;
return p;
}, {});
// for any key/value pair which had a value one year earlier,
// produce a new pair with the ratio between this year and last
return group.all().reduce(function(p, kv) {
var date = d3.timeYear.offset(kv.key, -1);
if(bydate[date.getTime()])
p.push({key: kv.key, value: kv.value / bydate[date.getTime()]});
return p;
}, []);
}
};
}
The idea is simple: first index all the values by date. Then when producing the array of key/value pairs, look each one up to see if it had a value one year earlier. If so, push a pair to the result (otherwise drop it).
This should work for any date-keyed group where the dates have been rounded.
Note the use of Array.reduce in a couple of places. This is the spiritual ancestor of crossfilter's group.reduce - it takes a function which has the same signature as the reduce-add function, and an initial value (not a function) and produces a single value. Instead of reacting to changes like the crossfilter one does, it just loops over the array once. It's useful when you want to produce an object from an array, or produce an array of different size from the original.
Also, when indexing an object by a date, I use Date.getTime() to fetch the numeric representation of the date. Otherwise the date coerces to a string representation which may not be exact. Probably for this application it would be okay to skip .getTime() but I'm in the habit of always comparing dates exactly.
Demo fiddle of YOY trade volume in the data set used by the stock example on the main dc.js page.
I've rewritten #Gordon 's code below. All the credit is his for the solution (answered above) and I've just wirtten down my own version (far longer and likely only useful for beginners like me) of the code (much more verbose!) and the explanation (also much more verbose) to replicate my thinking in bridging my near-nothing starting point up to #Gordon 's really clever answer.
yoyGroup = function(group) {
return { all: function() {
// For every key-value pair in the group, iterate across it, indexing it by it's time-value
var valuesByDate = group.all().reduce(function(outputArray, thisKeyValuePair) {
outputArray[thisKeyValuePair.key.getTime()] = thisKeyValuePair.value;
return outputArray;
}, []);
return group.all().reduce(function(newAllArray, thisKeyValuePair) {
var dateLastYear = d3.timeYear.offset(thisKeyValuePair.key, -1);
if (valuesByDate[dateLastYear.getTime()]) {
newAllArray.push({
key: thisKeyValuePair.key,
value: thisKeyValuePair.value / valuesByDate[dateLastYear.getTime()] - 1
});
}
return newAllArray;
}, []); // closing reduce() and a function(...)
}}; // closing the return object & a function
};
¿Why are we overwritting the all() function?
When DC.js goes to create a graph based on a grouping, the only function from Crossfilter it uses is the all() function. So if we want to do something custom to a grouping to affect a DC graph, we only have to overwrite that one function: all().
¿What does the all() function need to return?
A group's all function must return an array of objects and each object must have two properties: key & value.
¿So what exactly are we doing here?
We're starting with an existing group which shows some values over time (Important Assumption: keys are date objects) and then creating a wrapper around it so that we can take advantage of the work that crossfilter has already done to aggregate at a certain level (ex. year, month, etc.).
We start by using reduce to manipulate the array of objects into a more simple array where the keys and values that were in the objects are now directly in the array. We do this to make it easier to look up values by keys.
before / output structure of group.all()
[ {key: k1, value: v1},
{key: k2, value: v2},
{key: k3, value: v3}
]
after
[ k1: v1,
k2: v2,
k3: v3
]
Then we move on to creating the correct all() structure again: an array of objects each of which has a key & value property. We start with the existing group's all() array (once again), but this time we have the advantage of our valuesByDate array which will make it easy to look up other dates.
So we iterate (via reduce) over the original group.all() output and lookup in the array we generated earlier (valuesByDate), if there's an entry from one year ago (valuesByDate[dateLastYear.getTime()]). (We use getTime() so it's simple integers rather than objects we're indexing off of.) If there is an element of the array from one year ago, then we add a key-value object-pair to our soon-to-be-returned array with the current key (date) and for the value we divide the "now" value (thisKeyValuePair.value) by the value 1 year ago: valuesByDate[dateLastYear.getTime()]. Lastly we subtract 1 so that it's (the most traditional definition of) YoY. Ex. This year = 110 and last year = 100 ... YoY = +10% = 110/100 - 1.
I am new to Azure Search so I just want to run this by before I try to implement it. We have a search setup on items and we want to score/rank the results based on its initial score and how many times the item has been used/downloaded. We want the items downloaded the most to appear at the top of the result list.
We have a separate field in the search index that contains the used/download count (itemCount).
I know I have to set up a Magnitude profile but I am not sure what to use for the range as the itemCount can contain 0 - N So do I just set the range to be some large number i.e. 100,000,000 or what is the best practice?
var functionRankByDownload = new MagnitudeFunction()
{
Boost = 1000,
BoostingRangeStart = 0,
BoostingRangeEnd = 100000000,
ConstantBoostBeyondRange = true,
FieldName = "itemCount",
Interpolation = InterpolationTypes.Linear
};
scoringProfile1.Functions = new List() { functionRankByDownload };
I found the score calculation is as follows:
((initialScore * boost * itemCount) - min) / (max-min)
So it seems like it should work ok having a large value for the max but again just wanting to know the best practice.
Thanks!
That seems reasonable. The BoostingRangeEnd can be any reasonable bound to your range depending on the scenario. Since, you are using ConstantBoostBeyondRange, it would also take care of boosting values outside ranges appropriately.
You might also want to experiment with the boost value for a large range like this and see if a bigger boost value is more helpful for your scenario.
Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();
function check($str1, $str2){
$minprozent = 60;
similar_text($str1, $str2, $prozent);
$prozent = sprintf("%01.2f", $prozent);
if ($prozent > $minprozent) {
return TRUE;
}
else {
return FALSE;
}
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
$rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
$rel2 = mysql_query($rel1);
$rel2a = mysql_num_rows($rel2);
if ($rel2a > 0) {
while ($rel3 = mysql_fetch_assoc($rel2)) {
if (check($sql3['text'], $rel3['text']) == TRUE) {
$id_a = $sql3['id'];
$id_b = $rel3['id'];
$rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
$rein2 = mysql_query($rein1);
$rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
$rein4 = mysql_query($rein3);
}
}
}
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
$square = pow($number, 2);
return $square;
}
function check($text1, $text2) {
$words_sub = text_splitter($text2); // splits the text into single words
$words = text_splitter($text1); // splits the text into single words
// document 1 start
$document1 = array();
foreach ($words as $word) {
if (in_array($word, $words)) {
if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
}
}
$rating1 = 0;
foreach ($document1 as $temp) {
$rating1 = $rating1+square($temp);
}
$rating1 = sqrt($rating1);
// document 1 end
// document 2 start
$document2 = array();
foreach ($words_sub as $word_sub) {
if (in_array($word_sub, $words)) {
if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
}
}
$rating2 = 0;
foreach ($document2 as $temp) {
$rating2 = $rating2+square($temp);
}
$rating2 = sqrt($rating2);
// document 2 end
$skalarprodukt = 0;
for ($m=0; $m<count($words)-1; $m++) {
$skalarprodukt = $skalarprodukt+(array_shift($document1)*array_shift($document2));
}
if (($rating1*$rating2) == 0) { continue; }
$kosinusmass = $skalarprodukt/($rating1*$rating2);
if ($kosinusmass < 0.7) {
return FALSE;
}
else {
return TRUE;
}
}
?>
I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like you have, is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Lets say between all your articles, you only have 500 unique words between them. Then this histogram is going to be a vector(Array, List, Whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:
for word in article.text
article.histogram[indexLookup[word]]++
Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:
def check(articleA, articleB)
rtn = 0
for a,b in zip(articleA.histogram, articleB.histogram)
rtn += a*b
return rtn > threshold
(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.
By the way, this technique is described in more detail here
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K*number of documents per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on the White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but seems too arbitrary to catch all useful clustering of texts,
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform a clustering to identify nearest neighbors for each article, which you could load into the related table. I would do the clustering in R - there's a nice tutorial for clustering data in a file running R from php.