Need Opinions / Perspectives - Pagination approach for large data set - performance

We have a requirement where we need to show a lot of data in multiple grids & also provide the option to sort at the UI side. There are 2 approaches:
1. Load everything into the UI and do pagination and sorting client-side.
2. Load server-side paginated data into the UI; if the user sorts on another column, call the API again to re-sort the data on that column and send the results back, again paginated.
The general feeling is that with approach 1, the UI would be unnecessarily loaded with extreme volumes of data (around 10k records across grids, 1-2 MB) and might suffer performance issues, not to mention the servers serving those requests for a user base close to a million. With approach 2, every sort click triggers an API call, and server resources are wasted re-sorting the huge data set even though the user will only ever look at a few dozen records.
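For concreteness, approach 2 usually boils down to a server-side query like the sketch below (Python; the table name, column names, and page size are made up for illustration):

```python
# Hypothetical sketch of approach 2: the server sorts and pages, and the UI only
# ever receives one page. The table and column names are made up.
import sqlite3

SORTABLE_COLUMNS = {"name", "created_at", "amount"}   # whitelist, avoids SQL injection

def fetch_page(conn, sort_column, descending=False, page=0, page_size=50):
    if sort_column not in SORTABLE_COLUMNS:
        raise ValueError("unsupported sort column")
    order = "DESC" if descending else "ASC"
    return conn.execute(
        f"SELECT * FROM records ORDER BY {sort_column} {order} LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()
```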
What is the best way to handle this kind of scenario?
Is there any industry standard practices where we can refer?
How do we quantify the UI performance?

There's a third approach:
The server has a different index for each possible sort order. When new data is added, it's inserted in the right place in each index. The UI for each user asks the server for entries N*K to (N+1)*K of the index that corresponds to whichever sort order the user selected. There is no sorting. There is no need to load everything into each UI.
Note 1: You can probably cheat a little - e.g. if you have an index for "sorted alphabetically in ascending order" then you can use the same index for "sorted alphabetically in descending order". In this way you might only need 4 indexes for 8 possible sort orders.
Note 2: You can probably cheat more. Rather than having one index for each sort order, you can split the data into "buckets" and have an index for each bucket for each sort order. E.g. instead of one index for "sorted alphabetically in ascending order" you could have one index for "starts with A", another index for "starts with B", ... In the same way, instead of one index for "sorted chronologically" you could have one index for this year, one index for last year, ... This helps to speed up insertion costs (when new data is added), and could allow you to improve the UI (e.g. little "skip to bucket" buttons users can use).
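A minimal sketch of the precomputed-index idea in Python; the record fields, index names, and page API are all assumptions made for illustration:

```python
# Hypothetical sketch: keep one precomputed index (a sorted list of ids) per
# sort order and serve a page by slicing it. Fields and names are made up.
import bisect

records = {}        # record_id -> record dict
indexes = {         # ascending indexes; descending orders reuse them
    "name": [],     # list of (sort_key, record_id)
    "created": [],
}

def insert_record(record_id, record):
    """Insert the record into every index at the right place (no re-sorting)."""
    records[record_id] = record
    for field, index in indexes.items():
        bisect.insort(index, (record[field], record_id))

def get_page(field, page_n, page_size, descending=False):
    """Return entries N*K to (N+1)*K of the chosen index; a descending order
    simply reads the same ascending index from the other end."""
    index = indexes[field]
    ordered = list(reversed(index)) if descending else index
    start = page_n * page_size
    return [records[rid] for _, rid in ordered[start:start + page_size]]
```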
Is there any industry standard practices where we can refer?
The industry standard practice depends on which industry. Far too many things are shifting to "web apps", where the industry standard practice is to get incompetent developers working for below minimum wage to slap together a piece of trash using extremely inefficient frameworks.
How do we quantify the UI performance?
I'd use response times (the time it takes for the app to start and show the user data, the time it takes to show data after scrolling/moving to a different page, the time between the user clicking a different "sort order" button and the screen showing the data in the new sort order, etc.).

Related

VW contextual bandits: historical data and online learning

I'd like to test CB for an e-commerce task: personal offer recommendations (like "last chance to buy", "similar positions", "consumers recommend", "bestsellers", etc.). My task is to order them (a more relevant offer is higher in the list of recommendations).
So, there are 5 possible offers.
I have some historical data collected without using any model: context (user and web-session features), action id (one of my 5 offers), and reward (1 if the user clicked the offer, 0 if not). So I have N users and 5 offers with known reward, 5*N rows in total in my historical data.
Ex:
1:1:1 | user_id:1 f1:... f2:...
2:-1:1 | user_id:1 f1:... f2:...
3:-1:1 | user_id:1 f1:... f2:...
This means that user 1 has seen 3 offers (1, 2, 3); the cost of offer 1 is 1 (the user didn't click), and the user clicked on offers 2 and 3 (negative cost -> positive reward). The probabilities are all equal to 1, since every offer was shown and we know the rewards.
The global task is to increase CTR. I'd like to use this data for training CB and then improve the model with exploration/exploitation policies. I set the probabilities equal to 1 in this data (is that right?). Next I'd like to set the order of offers according to the rewards.
Should I use VW CB's warm start for this? Will it work correctly with data collected without using CB? Maybe you can advise more relevant CB methods for this data and task?
Thanks a lot.
If there are only 5 possible offers and if you (as indicated) have data of the form "I have N users and 5 offers with known reward, totally 5*N rows in my historical data." then your historical data is supervised multilabel data and the warm-start functionality would apply; make sure you use the cost-sensitive version to accommodate the multilabel aspect of your historical data (i.e., there is more than one offer that would result in a click).
Will this work correctly with data collected without using CB?
Because every action's reward is specified for every user in the data set, you only have to ensure that the sample of users is representative of the population you care about.
Maybe you can advise more relevant methods in CB for this data and task?
The first paragraph started with "if" because the more typical case is 1) there are many possible offers and 2) users have only seen a few of them historically.
In such case what you have is a combination of a degenerate logging policy and multiple rewards being revealed. If there are k possible actions but each user has only seen n<=k historically then you could try and make n lines for each user as you did. Theoretically this does not necessarily work but in practice it might help.
Out of the box: change the data
If the data you have was collected as the result of running an existing policy, then an alternative would be to start randomizing the decisions made by that system in order to collect a dataset which conforms to CB. For example, use your current system to pick the "best" action 96% of the time, and one of the other 4 actions at random 4% of the time; log the probability along with the reward (either 0.96 or 0.01, depending upon whether the action was the considered-best one); and then set up a proper CB-style training set for vw. With this you can also counterfactually estimate the value of both your current policy and the policy vw generates, and only switch to vw when it is winning.
The fastest way to implement the last paragraph is to just start using APS.
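A rough Python sketch of the randomized logging described above; the 96%/4% split and the cost convention (click = -1, no click = 1) follow the question, while the function names and feature handling are assumptions:

```python
import random

ACTIONS = [1, 2, 3, 4, 5]      # the 5 offers
EPSILON = 0.04                  # explore 4% of the time

def choose_and_log(best_action):
    """Pick the current system's best action 96% of the time, otherwise one of
    the other 4 actions uniformly at random, and return the logged probability."""
    if random.random() < EPSILON:
        action = random.choice([a for a in ACTIONS if a != best_action])
        prob = EPSILON / (len(ACTIONS) - 1)     # 0.01
    else:
        action = best_action
        prob = 1.0 - EPSILON                    # 0.96
    return action, prob

def to_vw_cb_line(action, clicked, prob, features):
    """Format one example in VW's contextual-bandit format: action:cost:probability.
    Cost is 1 for no click and -1 for a click (negative cost = positive reward)."""
    cost = -1 if clicked else 1
    feats = " ".join(f"{k}:{v}" for k, v in features.items())
    return f"{action}:{cost}:{prob} | {feats}"

# to_vw_cb_line(2, True, 0.96, {"user_id": 1, "f1": 0.3})
# -> "2:-1:0.96 | user_id:1 f1:0.3"
```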

Nested sorting in dimension hierarchies (Tableau)

I am working on a visualization in Tableau that has a dimension hierarchy (product category, product sub-category, product type, etc.) sorted descending by number of orders. I want my viz to show by default only the first product level (product category) sorted the same way, but give an option to drill down (using "+" on the dimension) to the detailed product levels using nested sorting (again, descending by number of orders).
superstore data sample
I tried using the nested sorting option for each product level, but when I drill up and down again, the sorting is wrong again, as if it gets cleared. I cannot find an option to keep it fixed unless I keep all product levels visible in the viz (without the drill-down option).
Does anyone know how I can do this? I also tried different indexing and ranking calculations, but nothing seems to work. I know there is an option to combine the hierarchy dimensions and sort on the combination, but it makes the viz really untidy.
Thanks in advance!
Tableau will always sort based on the left-most column. With the newer nested sorting you can more easily do a secondary sort. However, when you expand/collapse hierarchies as you are doing, that sorting might not be retained.
The "classic" way to do this is to create a Rank by number of orders (sounds like you were close on this one). rank(COUNT([Order ID]),'desc'). Make this a discrete measure and put it to the left of all the other dimensions.
To clean it up, you can uncheck "Show Header" on the rank pill.
And if you expand/collapse the hierarchy, it keeps the sorting.
EDIT: Here is another way to try to accomplish this. It seems to work for 3 levels but starts to break down after that. (It also didn't seem to work well on grouped dimensions.)
1. Expand the hierarchy to all three levels.
2. On each dimension, enforce a sort order of Count of Order ID, descending.

Paging elasticsearch aggregation results

Imagine I have two kinds of records: a bucket and an item, where an item is contained in a bucket, and a bucket may have a relatively small number of items (normally not more than 4, never more than 10). Those records are squashed into one (an item with extra bucket information) and placed inside Elasticsearch.
The task I am trying to solve is to find up to 500 buckets with all related items at once, via a filtered query that relies on the items' attributes, and I'm stuck on limiting/offsetting aggregations. How do I perform this kind of task? I see the top_hits aggregation, which lets me control the number of related items returned, but I can't find a clue as to how I can control the number of returned buckets.
Update: okay, I'm terribly stupid. The size parameter of the terms aggregation provides the limiting. Is there any way to perform the offset part? I don't need 100% precision and probably won't ever page those results, but I'd still like to see this functionality.
I don't think we'll be seeing this feature any time soon, see relevant discussion at GitHub.
Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen, like the first term of the 2nd page having a higher count than the last element of the first page, etc.
An interesting approach is mentioned there: you request, say, the top 20 results on the 1st page; then on the 2nd page you run the same aggregation but exclude those 20 terms you already saw on the previous page, and so forth. But this doesn't allow you "random" access to an arbitrary page; you must go through the pages in order.
...if you only have a limited number of unique values compared to the number of matched documents, doing the paging on the client side would be more efficient. On the other hand, on high-cardinality fields, your first approach based on an exclude would probably be better.
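A rough sketch of the exclude-based paging as a query body you might send from a Python client; the index name, field names, and sizes are assumptions:

```python
# Hypothetical sketch of exclude-based paging: page 2 re-runs the same terms
# aggregation but excludes the bucket keys already shown on page 1.
# Index name, field names, and sizes are made up for illustration.
seen_buckets = ["bucket-17", "bucket-42"]   # keys returned on the previous page

query = {
    "size": 0,
    "query": {"term": {"item_attr": "some-filter-value"}},
    "aggs": {
        "buckets": {
            "terms": {
                "field": "bucket_id",
                "size": 20,                  # buckets per "page"
                "exclude": seen_buckets,     # skip buckets from earlier pages
            },
            "aggs": {
                "items": {"top_hits": {"size": 10}}   # related items per bucket
            },
        }
    },
}
# es.search(index="items", body=query)   # with an elasticsearch-py client
```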

optimal algorithm for adding chosen table rows to the database

I am trying to write a save method in a backing Java bean which will take the table rows that are selected and save them in the database. However, let's say the user changes his choices a little (changes 1 out of his 5 choices). I am wondering whether the algorithm I apply matters for efficiency in the long term or not.
Here are the options:
1. Every time the user clicks the save button, delete all of his previous choices and insert all of the current choices into the database.
2. Once the button is clicked, work out which rows the user de-selected and delete those from the database, then add the newly selected ones.
Is option 2 better than option 1, or does it not really matter for a number of choices that will not exceed 15?
Thanks
I would definitely go for option 2: try to figure out the minimum number of operations you need to perform.
It is, however, fairly normal to fall back to option 1 in times of deadlines etc. since it is a bit easier to implement.
It shouldn't, however, be that much harder to figure out what the changes are, since it doesn't seem that you're changing the rows themselves: either you delete rows that had their checkmark cleared, or you insert rows that had their checkmark set.
Simply store a list of the primary key values of whatever is in the database, then compare against that list as you iterate through the new selection when the user wants to persist the changes.
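A minimal sketch of that comparison in Python (the original context is a Java backing bean, so treat this purely as pseudocode for the diff); the key values are placeholders:

```python
# Hypothetical sketch: diff the previously saved selection against the new one
# and only touch the rows that actually changed.
saved_ids = {3, 7, 12, 19, 25}      # primary keys currently in the database
selected_ids = {3, 7, 12, 21, 25}   # primary keys the user has now checked

to_insert = selected_ids - saved_ids    # newly checked rows       -> {21}
to_delete = saved_ids - selected_ids    # rows that were unchecked -> {19}

# Then issue one INSERT per id in to_insert and one DELETE per id in to_delete,
# instead of deleting and re-inserting all 15-or-fewer rows every time.
```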
A minimal-work solution here would also mean you would be a bit more future-proof in terms of refactoring, changes, or additions. For instance, what if, in the future, there is data attached to any of those rows? You would need to keep that as well. Generally I'm a bit opposed to writing code just for the sake of "what if", but here it feels more like "why wouldn't you ..." than that.
So my advice is go for option 2. Not much more work.

Amortizing the calculation of distribution (and percentile), applicable on App Engine?

This is applicable to Google App Engine, but not necessarily constrained for it.
On Google App Engine, the datastore isn't relational, so aggregate functions (such as sum, average, etc.) aren't available. Each row is independent of the others. To calculate a sum or an average, the app has to amortize the calculation by updating it on each individual write to the database so that it's always up to date.
How would one go about calculating a percentile and a frequency distribution (i.e. density)? I'd like to make a graph of the density of a field of values, and this set of values is probably on the order of millions. It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned) and calculate based on that, but I'd rather find a smarter approach.
Is there some algorithm to calculate or approximate density/frequency/percentile distribution that can be calculated over a period of time?
By the way, the data is indeterminate in that the maximum and minimum may be all over the place. So the distribution would have to take approximately 95% of the data and estimate the density based only on that.
Getting whole rows (with that limit of 1000 at a time...) over and over again in order to get a single number per row is surely unappealing. So denormalize the data by recording that single number in a separate entity that holds a list of numbers (up to the limit of, I believe, 1 MB per entity, so with 4-byte numbers no more than about 250,000 numbers per list).
So when adding a number, also fetch the latest "added data values" list entity; if it's full, make a new one instead; append the new number; and save it. There is probably no need to be transactional if a tiny error in the statistics is no killer, as you appear to imply.
If the data for an item can be changed, have separate entities of the same kind recording the "deleted" data values; to change one item's value from 23 to 45, add 23 to the latest "deleted values" list and 45 to the latest "added values" one. This covers item deletion as well.
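A rough sketch of those list entities using the (legacy) Python NDB client; the kind names, the per-entity cap, and the update flow are assumptions based on the description above:

```python
from google.appengine.ext import ndb

MAX_VALUES_PER_ENTITY = 250000          # keep each entity safely under ~1 MB

class AddedValues(ndb.Model):
    """One chunk of the denormalized "added data values" list."""
    created = ndb.DateTimeProperty(auto_now_add=True)
    values = ndb.IntegerProperty(repeated=True)

class DeletedValues(ndb.Model):
    """Values removed (or replaced) so the statistics can subtract them."""
    created = ndb.DateTimeProperty(auto_now_add=True)
    values = ndb.IntegerProperty(repeated=True)

def record_value(kind, number):
    """Append a number to the latest list entity of the given kind,
    starting a new entity when the current one is full."""
    latest = kind.query().order(-kind.created).get()
    if latest is None or len(latest.values) >= MAX_VALUES_PER_ENTITY:
        latest = kind()
    latest.values.append(number)
    latest.put()

# Changing an item's value from 23 to 45:
# record_value(DeletedValues, 23)
# record_value(AddedValues, 45)
```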
It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
This is the most obvious approach to me; why are you trying to avoid it?
