Approach to handle data that keeps changing - algorithm

I am working on a website, and need to present something like 'x% of users who viewed this page bought this product.'
Setting aside the discussion of business value, I want to know what would be an acceptable approach to computing x%.
I currently have two approaches. Both require saving the number of users who viewed the page and the number of users who bought the product.
One approach is to calculate the percentage on the fly. The advantage is that it presents accurate data; the disadvantage is that the user waits longer because of the calculation.
The other approach is to recalculate x% every time a user views the page or buys the product, and persist the result to the database. The advantage is that users get the info quickly; the disadvantages are a lot of extra calculations, and the data may not be as accurate.
Assuming we expect hundreds of page views per hour, I wonder which is the better approach? Or maybe a third approach would work better?
Thanks!

I think your best bet would be to find a balance between the cost of calculating the exact value on every user visit and the accuracy of the number you display.
You could log every user visit and every purchase in a database, then on every 100th visit or so, perform the calculation. Also log the result of that calculation in your database and have your site pull the information from there, rather than calculate it on every visit.
And depending on how accurate you need to be, and how performance heavy the operation is, you can adjust your interval for calculating the value.
So, in all, each user's visit increments a counter in the database. On the back end, that counter is checked to see whether it has crossed another interval (so if your interval is 100, you check whether counter % 100 == 0). The calculation then runs on only 1 in 100 visits, and the displayed value is still accurate to within the hour (given your estimate of hundreds of views per hour).
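A minimal sketch of that idea, assuming the counters live in memory rather than in a database (the class and field names here are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class ConversionStats:
    """Counters for one product page, plus a cached percentage."""
    views: int = 0
    purchases: int = 0
    cached_percent: float = 0.0
    interval: int = 100  # recompute on every 100th view (tunable)

    def record_view(self) -> None:
        self.views += 1
        # Only recompute when the view counter crosses an interval boundary.
        if self.views % self.interval == 0:
            self.recompute()

    def record_purchase(self) -> None:
        self.purchases += 1

    def recompute(self) -> None:
        if self.views:
            self.cached_percent = 100.0 * self.purchases / self.views
        else:
            self.cached_percent = 0.0
```

The page would display cached_percent, which lags the true value by at most interval views. In a real deployment the increments and the cached value would be database columns, and the increments would need to be atomic.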
Having said this, I agree with Jim Garrison's comment about premature optimization. I don't think the operation will have a noticeable impact on your site's performance, and if you want to be as accurate as possible, you can run the calculation every time a user visits the site or purchases an item.

Related

Implementing Trending in Elasticsearch

I'm building a project that indexes celebrity-related content across sites (tmz, people, etc.) because I always thought it would be funny to "bet" on people (and maybe shows, directors, etc.) like horse racing or the stock market, only, you know, not with real money. The value of a person would change day to day, hour to hour, and even minute to minute, if we can figure this out together, Stack Overflow denizens.
I assign traffic values to users based on mentions in social media. I have some scrapers (probably violating some TOSes) and access to Twitter's API to get relative counts for search results for a time, so I have known "numbers" to associate w/ users outside of elasticsearch for periods of time to build the trends. Now to be clear, I am not looking to implement trending based on the number of documents in the system, that actually stays pretty consistent, but I need to rank documents that already exist based on trends.
So that's what I've got: a few hundred thousand articles with pre-determined associations to individual celebrities. Data for on-the-minute associations of a score to those celebrities which are then merged and applied to each article so that each article has a few scores associated (there's some complexity here that does not matter, but the bottom line is that I have 10 or so values that I want to assign to content to sort it when you're on the market page and I want to sort those w/ a function or script score).
So the question: How the heck do I assign these values without making elasticsearch go crazy with re-indexing? I need to use these values to sort dozens of requests per second coming from feeds on the site, but I am running this on a raspberry pi... literally, I've maxed the poor thing out for memory.
We're real write heavy, but if for some reason celebrity stock markets take off, we're also real read heavy at the same time. I swear I remember a plugin that had metadata associated with content, but I cannot find it.
I've tried the enabled: false and index: false mapping settings, but the updates still seem to thrash the read times while they are being written. The best I've gotten to is slowing down the refresh_interval, but that's still pretty expensive and starts to affect the "real-time" nature of the app.
I believe that this is impossible as you've laid it out. Any updates to a field will update _source and fire the full update process.
There are some alternatives that you might consider:
Replication, if another cluster is available
A separate write index on the same cluster, space allowing

Real-time trending algorithm

I'm developing a system that has to return the most trending 'articles' in real time based on the no. of hits that article has.
My first thought was storing the no. of hits vs. time for each article. Then I would normalize this function and calculate its first derivative, which gives the growth rate. With the second derivative I'd be able to tell whether that growth is accelerating, and if it reaches a certain threshold -> tag the article as trending.
The problem is: I can do that "offline", for example at the end of the day, but I don't know how to do it continuously...
I know that there exist such things as Storm but I'm looking for something as specific as this (a written algorithm in pseudocode or an article approaching this problem and not the generic one).
A simple way to keep track of the hottest articles would be to use an LRU-cache. This efficiently keeps track of its most recently accessed elements. Frequent accesses will keep hot articles in the LRU while infrequently accessed articles drop out. You could modify the LRU to keep track of the number of hits an article has received if you want to refine the ranking.
Sure, there are more complicated approaches, but this is easy to implement, has nice computational properties, and will likely provide a similar response.
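Here is a rough sketch of the LRU-with-hit-counts idea, assuming Python's collections.OrderedDict as the backing store (the class and method names are invented for this example):

```python
from collections import OrderedDict


class HotArticles:
    """LRU cache of article ids that also counts hits per article."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._hits = OrderedDict()  # article_id -> hit count, oldest first

    def record_hit(self, article_id: str) -> None:
        # Remove and re-insert so the article moves to the most-recent end.
        count = self._hits.pop(article_id, 0) + 1
        self._hits[article_id] = count
        if len(self._hits) > self.capacity:
            self._hits.popitem(last=False)  # evict the least recently hit

    def hottest(self, n: int) -> list:
        # Rank the articles still in the cache by their hit counts.
        return sorted(self._hits, key=self._hits.get, reverse=True)[:n]
```

Note that an evicted article loses its count, so an article with many hits long ago ranks below one with a few hits just now. For trending that is arguably the behaviour you want.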
You could also try using a circular buffer for each article. Once the buffer fills, each access to the article can be used to update a variable indicating the oldest hit on the article. Since you know the newest and oldest hit, you can estimate the hits per time unit. Downside: lots of memory usage, inexact statistics for frequently-accessed articles.
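A small sketch of the circular buffer variant, assuming a collections.deque with maxlen stands in for the buffer and that timestamps are passed in as seconds (names are illustrative only):

```python
from collections import deque


class HitRate:
    """Keeps the timestamps of the last `size` hits on one article."""

    def __init__(self, size: int = 100):
        # A deque with maxlen drops the oldest timestamp automatically.
        self._times = deque(maxlen=size)

    def record_hit(self, timestamp: float) -> None:
        self._times.append(timestamp)

    def hits_per_second(self) -> float:
        if len(self._times) < 2:
            return 0.0
        span = self._times[-1] - self._times[0]
        if span <= 0:
            return float("inf")  # all buffered hits share one timestamp
        # n timestamps cover n - 1 intervals between oldest and newest hit.
        return (len(self._times) - 1) / span
```

The estimate only covers the window the buffer spans, which is where the inexactness for very busy articles comes from: a full buffer may hold only the last fraction of a second.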
Similarly, you could associate each article with a linked list containing <Time,Count> pairs. Every time a hit happens, add to the count at the head of the linked list if less than a second (or what have you) has passed since Time. If too much time has passed, add a new pair at the front of the list. You can then treat the linked list as a data series for calculating derivatives. Discard elements that are too old when you walk the list.
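The <Time,Count> list could be sketched like this. It assumes a deque in place of a hand-rolled linked list, timestamps that never go backwards, and a finite difference of the two newest buckets as the "derivative" (all names are made up):

```python
from collections import deque


class HitSeries:
    """Time-bucketed hit counts for one article, newest bucket first."""

    def __init__(self, bucket_seconds: float = 1.0, max_age: float = 3600.0):
        self.bucket_seconds = bucket_seconds
        self.max_age = max_age
        self._buckets = deque()  # each entry is [start_time, count]

    def record_hit(self, now: float) -> None:
        if self._buckets and now - self._buckets[0][0] < self.bucket_seconds:
            self._buckets[0][1] += 1  # still inside the newest bucket
        else:
            self._buckets.appendleft([now, 1])  # start a new bucket
        # Drop buckets that are too old to matter.
        while self._buckets and now - self._buckets[-1][0] > self.max_age:
            self._buckets.pop()

    def growth(self) -> float:
        """Change in count per second between the two newest buckets."""
        if len(self._buckets) < 2:
            return 0.0
        (t1, c1), (t0, c0) = self._buckets[0], self._buckets[1]
        return (c1 - c0) / (t1 - t0)
```

Because the newest bucket is usually still filling up, growth() tends to underestimate until that bucket closes. Comparing the second and third newest buckets instead would avoid that at the cost of a little latency.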

How to manage transactions, debt, interest and penalty?

I am making a BI system for a bank-like institution. This system should manage credit contracts, invoices, payments, penalties and interest.
Now, I need to make a method that builds an invoice. I have to calculate how much the customer has to pay right now. He has a debt, which he has to pay for. He also has to pay for the interest. If he was ever late with due payment, penalties are applied for each day he's late.
I thought there were 2 ways of doing this:
By having only 1 original state - the contract's original state. Each time the monthly payment the customer has to make is computed, all the payments actually made are taken into account.
By constantly making intermediary states, going from the last intermediary state, and considering only the events that took place between the times of these 2 intermediary states. This means having a job that runs periodically (daily, monthly), takes the last saved state, applies the changes (due payments, actual payments, changes in global constants like the penalty rate, which is controlled by the Central Bank), and saves the resulting state.
The benefits of the first variant:
Always up to date. If changes were made with a date from the past (a guy came with a paid invoice 5 days after he made the payment to the bank), they will be correctly reflected in the results.
The flaws of the first variant:
Takes long to compute
Documents already printed may differ from later results if the underlying data changes because of operations entered with a back date.
The benefits of the second variant:
Works fast, and aggregated data is always available for search and reports.
Simpler to compute
The flaws of the second variant:
Vulnerable to failed jobs.
Errors in the past propagate until the end, to the final results.
An intermediary result cannot be changed if new data from past transactions arrives (it can, but it's hard, and with many implications, so I'd rather mark it as taboo)
Jobs cannot be performed successfully and without problems if an unfinished transaction exists (an issued invoice that wasn't yet paid)
Is there any other way? Can I combine the benefits from these two? Which one is used in other similar systems you've encountered? Please share any experience.
Problems of this nature are always more complicated than they first appear. This is a consequence of what I like to call the Rumsfeldian problem of the unknown unknown.
Basically, whatever you do now, be prepared to make adjustments for arbitrary future rules.
This is a tough proposition. Some future possibilities that may have a significant impact on your calculation model are back-dated payments, adjustments and charges. Forgiven interest periods may also become an issue (particularly if back-dated). You may also be required to provide various point-in-time (PIT) calculations based either on what was "known" at that PIT (past view of the past) or on transactions that occurred after the reference PIT but were back-dated to before it (current view of the past). Calculations of this nature can be a real pain in the head.
My advice would be to calculate from "scratch" (i.e. the first variant). Implement optimizations (e.g. the second variant) only when necessary to meet performance constraints. Doing calculations from the beginning is a compute-intensive model but is generally more flexible with respect to accommodating unexpected left turns.
If performance is a problem but the frequency of complicating factors (e.g. back-dated transactions) is relatively low, you could explore a hybrid model employing the best of both variants. Here you store the current state and calculate forward, using only those transactions that posted since the last stored state, to create a new current state. If you hit a "complication", redo the entire account from the beginning to reestablish the current state.
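To make the hybrid model concrete, here is a deliberately simplified sketch. It assumes the "state" is just a running balance (real interest and penalty rules are left out), and every name in it is invented for the example:

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    effective: date  # the date the transaction applies to
    amount: Decimal  # positive = charge, negative = payment


def replay(balance, transactions):
    """Apply transactions to a starting balance (stand-in for real rules)."""
    for tx in transactions:
        balance += tx.amount
    return balance


class Account:
    def __init__(self):
        self.history = []  # full history, never discarded
        self.snapshot_date = date.min
        self.snapshot_balance = Decimal("0")
        self.full_recomputes = 0

    def post(self, tx):
        self.history.append(tx)
        if tx.effective <= self.snapshot_date:
            # Back-dated: the stored state is wrong, so redo from scratch.
            self.full_recomputes += 1
            covered = [t for t in self.history
                       if t.effective <= self.snapshot_date]
            self.snapshot_balance = replay(Decimal("0"), covered)

    def close_period(self, as_of):
        # Normal path: only replay what posted since the last snapshot.
        new = [t for t in self.history
               if self.snapshot_date < t.effective <= as_of]
        self.snapshot_balance = replay(self.snapshot_balance, new)
        self.snapshot_date = as_of
```

Keeping the full history is what makes the fallback possible: the snapshot is only a cache, and the first variant remains the source of truth.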
Being able to accommodate the unexpected without triggering a re-write is probably more important in the long run than shaving calculation time right now. Do not place restrictions on your computation model until you have to. Saving current state often brings with it a number of built-in assumptions and restrictions that reduce wiggle room for accommodating future requirements.

How to continually filter interesting data to the user?

Take an example of a question/answer site with a 'browse' slideshow that will show one question/answer page at a time. The user clicks the 'next' button and a new question/answer is presented to him.
I need to decide which pages should be returned each time the user clicks 'next'. Some things I don't want and reasons why:
Showing 'newest' questions in descending order:
Say 100 questions get entered; no user is going to click through to the 100th item, so it'll never get any responses. It also means that if no new questions were asked recently, every time the user visits the site, he'll see the same stale data.
Showing 'most active' questions, judged by a lot of suggested answers/comments:
This won't return those questions that have low activity, which are exactly the ones that need more visibility
Showing 'low activity' questions, judged by not a lot of answers/comments:
Once a question starts getting activity, it'll stop being shown. This will stymie the activity on a question, when I'd really like to encourage discussion.
I feel that a mix of these would work well, but I'm unsure of how to judge which pages should be returned. I'll stress that I don't want the user to have to choose which category of items to view (like how SO has the unanswered/active/newest filters).
Are there any common practices for doing this, or any ideas for how it might be done?
Thanks!
Edit:
Here's what I'm leaning towards so far, with much thanks to Tim's comment:
So far I'm thinking of ranking pages by Activity Count / View Count, where activity is incremented each time a user performs an action on a page, like a vote, comment, answer, etc. View will get incremented for each page every time a person views the page.
I'll then rank all pages by their activity/view ratio and show pages with a high ratio more often. This way pages with low activity and high views will be shown the least, while ones with high activity and low views will be shown most frequently. Low activity/low views and high activity/high views will be somewhere in the middle I imagine, but I'll have to keep a close eye on this in the beta release. I also plan on storing which pages the user has viewed in the past 24 hours so they won't see any repeats in the slideshow in a given day.
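A minimal sketch of that ranking, assuming the counts are available as a dict of page id to (activity count, view count) and that pages with zero views are treated as having one view to avoid dividing by zero (that guard, like the names, is an assumption of this example):

```python
def rank_pages(pages, seen):
    """Return unseen page ids ordered by activity/view ratio, highest first.

    pages: dict mapping page_id -> (activity_count, view_count)
    seen:  set of page ids the user viewed in the past 24 hours
    """
    def ratio(item):
        _, (activity, views) = item
        return activity / max(views, 1)  # guard against zero views

    unseen = [(pid, counts) for pid, counts in pages.items()
              if pid not in seen]
    return [pid for pid, _ in sorted(unseen, key=ratio, reverse=True)]


pages = {"a": (10, 100), "b": (5, 10), "c": (0, 50), "d": (8, 8)}
order = rank_pages(pages, seen={"d"})
```

Here order comes out as b, a, c: page d has the best ratio but is skipped because the user already saw it.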
Some ideas for preventing 'stale' data (if all the above doesn't seem to prevent it): Perhaps run a cron job which will periodically check for pages that haven't been viewed recently and boost their ratio to put them at the top.
As I see it, you are touching upon two interesting questions:
How to define that a post is interesting to a user: Here you could take a weighted combination of various factors that contribute to the interestingness of a post: amount of activity, how fresh the entry is, whether the item matches the user's interests (if you have a way of knowing that), and so on. You could pick the weights based on intuition and see how well the result matches your expectation. If you have the time and inclination, you could collect data on how well your users respond to the entries and try to learn the optimum weights for each factor using machine learning techniques.
How to give new posts a chance, otherwise known as the exploration-exploitation tradeoff.
Basically, if you just keep going to known interesting entries then you will maximize instantaneous user happiness, but you will never learn about new interesting stuff, hence overall your users are unhappy.
This is a very well-studied problem, and depending on how much you want to get into it, you can read the literature on things like k-armed (multi-armed) bandit problems.
But a simple solution would be to not pick the entry with the highest score, but to pick the entry based on a probability distribution such that high-score entries have a higher probability of showing up. This way you show interesting stuff most of the time, but every post has a chance to show up occasionally.
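As a sketch of that simple solution, assuming scores are non-negative numbers and using random.choices from the standard library for the weighted draw (the small floor that gives zero-score entries a chance is my own addition):

```python
import random


def pick_entry(scores, rng=random, floor=0.1):
    """Pick one entry id with probability proportional to score + floor.

    scores: dict mapping entry_id -> non-negative interestingness score
    floor:  small constant so even zero-score entries can be picked
    """
    ids = list(scores)
    weights = [max(scores[i], 0.0) + floor for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the example is repeatable
scores = {"hot": 9.0, "new": 1.0}
counts = {"hot": 0, "new": 0}
for _ in range(10_000):
    counts[pick_entry(scores, rng)] += 1
```

Over many draws "hot" is shown far more often than "new", but "new" still shows up regularly. Raising floor shifts the balance towards exploration.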

How to ensure correctness of data gathered via crowdsourcing?

I have a site where users are entering data of some products they buy.
How do I ensure correctness of data entered via crowdsourcing (enabling users to vote/edit products) minimizing amount of work that needs to be done by administrator? I'm looking for some how-tos, best practices, etc.
What sort of data are you collecting?
You're talking about crowd-sourcing, and thus (I assume) aggregating data across this crowd. As they're talking about products they buy, I suspect you're going to be gathering product attributes and prices.
Some possible approaches. If your users are entering non-numerical data (e.g. colours), just record the mode, i.e. the most commonly entered value.
If they're entering numeric data, discard outliers, i.e. bin the lowest and highest results, and average the rest. You could do this for prices, say. (This is the approach that electronic exchanges use for resolving closing prices out of many trades.)
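Both aggregations are short in Python. This sketch assumes the entries for one product attribute have already been collected into a list, and the function names are just for illustration:

```python
from collections import Counter
from statistics import mean


def most_common_entry(values):
    """Mode of non-numerical entries, e.g. the colour most users entered."""
    return Counter(values).most_common(1)[0][0]


def trimmed_mean(values, trim=1):
    """Average after discarding the `trim` lowest and highest values."""
    ordered = sorted(values)
    if len(ordered) <= 2 * trim:
        return mean(ordered)  # too few entries to discard any
    return mean(ordered[trim:len(ordered) - trim])


colour = most_common_entry(["red", "blue", "red"])
price = trimmed_mean([1, 10, 11, 12, 100])
```

With these inputs colour is "red" and price is 11: the entries 1 and 100 are thrown away before averaging.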
Depending on your application, you may want to have a historical bias towards the most recent entries.
But this all depends on your application, and how much storage and crunching of data you're prepared to do.
Make sure you keep a log of IP addresses with every action made, since malicious users or bots can get around session data or cookies. Doing this makes it harder for a single entity to skew results or do anything drastic by appearing to be multiple users.
At a high level, data can be gathered from the 'crowd' with an associated correctness value. Looking at SO, an answer or response from someone with 1000+ rep has more weight than one from a casual user. Look for validation and triangulation: if it's a single voice in the crowd that you're listening to, then it's probably not worth that much. If other voices join, then you know you're onto something; again, in SO terms, we all get a chance to upvote questions.
I've recently seen some really good iPhone apps which rely on crowd-sourcing for their data, and then validate it by asking other users if it's correct.

Resources