Caching expensive SQL query in memory or in the database? - asp.net-mvc-3

Let me start by describing the scenario. I have an MVC 3 application with SQL Server 2008. In one of the pages we display a list of Products that is returned from the database and is UNIQUE per logged in user.
The SQL query (actually a VIEW) used to return the list of products is VERY expensive.
It is based on very complex business requirements which cannot be changed at this stage.
The database schema cannot be changed or redesigned as it is used by other applications.
There are 50k products and 5k users (each user may have access to 1 up to 50k products).
In order to display the Products page for the logged in user we use:
SELECT TOP X * FROM [VIEW] WHERE UserID = @UserId -- where 'X' is the size of the page
The query above returns a maximum of 50 rows (maximum page size). The WHERE clause restricts the number of rows to a maximum of 50k (products that the user has access to).
The page is taking about 5 to 7 seconds to load and that is exactly the time the SQL query above takes to run in SQL.
Problem:
The user goes to the Products page and very likely uses paging, re-sorts the results, goes to the details page, etc and then goes back to the list. And every time it takes 5-7s to display the results.
That is unacceptable, but at the same time the business team has accepted that the first time the Products page is loaded it can take 5-7s. Therefore, we thought about CACHING.
We now have two options to choose from; the most "obvious" one, at least to me, is using .Net caching (in memory / in-proc). (Please note that a distributed cache is not an option at the moment due to technical constraints with our provider / hosting partner.)
But I'm not very comfortable with this. We could end up with lots of products in memory (when there are 50 or 100 users logged in simultaneously) which could cause other issues on the server, like .Net constantly removing cache items to free up space while our code inserts new items.
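For what it's worth, here is a minimal sketch of the in-proc option using System.Runtime.Caching's MemoryCache (available since .NET 4, so it fits an MVC 3 app). The IProductRepository and Product types are placeholders for your actual data access code, and the ten-minute sliding window is an arbitrary assumption:

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public class Product { /* Id, Name, etc. */ }

public interface IProductRepository
{
    IList<Product> GetProductsFromView(int userId); // placeholder: wraps the expensive VIEW query
}

public class CachedProductService
{
    private static readonly MemoryCache Cache = MemoryCache.Default;
    private readonly IProductRepository _repository;

    public CachedProductService(IProductRepository repository)
    {
        _repository = repository;
    }

    public IList<Product> GetProductsForUser(int userId)
    {
        string key = "products:" + userId;
        var products = Cache.Get(key) as IList<Product>;
        if (products == null)
        {
            // The 5-7s query runs only on a cache miss.
            products = _repository.GetProductsFromView(userId);
            Cache.Set(key, products, new CacheItemPolicy
            {
                // A sliding window keeps active users fast while letting
                // idle users' lists fall out of memory.
                SlidingExpiration = TimeSpan.FromMinutes(10)
            });
        }
        return products;
    }
}

Note that MemoryCache already trims entries under memory pressure, which is precisely the churn described above, so a short expiration window (and perhaps caching only product ids rather than full Product objects) would keep the footprint in check.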
The SECOND option:
The main problem here is that it is very EXPENSIVE to generate the User x Product x Access view, so we thought we could create a flat table (or in other words a CACHE of all products x users in the database). This table would be exactly the result of the view.
However the results can change at any time if new products are added, user permissions are changed, etc. So we would need to constantly refresh the table (which could take a few seconds) and this started to get a little bit complex.
Similarly, we thought we could implement some sort of cache provider: upon request from a user, we would run the original SQL query and select the products from the view (5-7s, acceptable only once) and save the result in a flat table called ProductUserAccessCache in SQL. On the next request we would get the values from this cached table (we can easily identify whether results were cached for that particular user) with a fast query without calculations in SQL.
Any time a product was added or a permission changed, we would truncate the cached-table and upon a new request the table would be repopulated for the requested user.
It doesn't seem too complex to me, but what we are doing here basically is creating a NEW cache "provider".
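To make the idea concrete, here is a rough ADO.NET sketch of the populate-on-miss step, using the ProductUserAccessCache table named above. The column list is a placeholder, and the IF NOT EXISTS guard is an assumption about how a miss would be detected:

using System.Data.SqlClient;

public static class ProductAccessCache
{
    public static void EnsurePopulatedForUser(SqlConnection conn, int userId)
    {
        // Populate the flat table from the expensive view only when this
        // user's rows are missing (first visit, or after a truncate).
        const string sql = @"
IF NOT EXISTS (SELECT 1 FROM ProductUserAccessCache WHERE UserId = @UserId)
    INSERT INTO ProductUserAccessCache (UserId, ProductId /* , ... */)
    SELECT UserId, ProductId /* , ... */
    FROM [VIEW]              -- the expensive 5-7s query, paid once per user
    WHERE UserId = @UserId;";

        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@UserId", userId);
            cmd.CommandTimeout = 60; // allow for the slow initial population
            cmd.ExecuteNonQuery();
        }
    }
}

Page requests would then run the cheap SELECT TOP X ... FROM ProductUserAccessCache WHERE UserId = @UserId instead of hitting the view. One caveat: two simultaneous requests from the same user could both pay the 5-7s cost (or insert duplicate rows) unless the population step is serialized, e.g. with sp_getapplock.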
Does anyone have experience with this kind of issue?
Would it be better to use .Net Caching (in proc)?
Any suggestions?

We were facing a similar issue some time ago and considered using EF caching to avoid the delay in retrieving the information. Our problem was a 1-2 second delay. Here is some info that might help on how to cache a table by extending EF. One of the drawbacks of caching is how fresh you need the information to be, so you set your cache expiration accordingly. Depending on that expiration, users might have to wait longer than they would like for fresh info, but if your users can accept seeing slightly outdated info in order to avoid the delay, the tradeoff is worth it.
In our scenario, we decided it was better to have fresh info than quick responses, but as I said before, our waiting period wasn't that long.
Hope it helps

Related

Simulating server-side group and sort in Azure table storage

I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields and then using Insert-or-Update to maintain a table of pointers to the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read the excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from the Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch the most recent entries for Resources A and B, here's what your entity structure would look like:
PartitionKey: Date/time (in ticks), reversed, i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reversed ticks are required so that the most recent entries show up at the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
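For illustration, here is how the reversed-ticks key might be computed, assuming the classic Azure storage client library's TableEntity base class; the entity shape mirrors the list above, but the class and factory method names are made up:

using System;
using Microsoft.WindowsAzure.Storage.Table;

public class ResourceAccessEntity : TableEntity
{
    public DateTime AccessDate { get; set; }
    public string User { get; set; }

    public static ResourceAccessEntity Create(string resource, string user, DateTime accessedUtc)
    {
        return new ResourceAccessEntity
        {
            // Later access times yield smaller numbers, so a plain scan
            // from the top of the table returns the most recent entries.
            // "d19" zero-pads so lexicographic order matches numeric order.
            PartitionKey = (DateTime.MaxValue.Ticks - accessedUtc.Ticks).ToString("d19"),
            RowKey = resource,
            AccessDate = accessedUtc,
            User = user
        };
    }
}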
In short, your data storage approach should be primarily governed by how you want to fetch the data. It would even mean you will have to save the same data multiple times.
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. As a resource is accessed, you add a message to a queue containing the resource name and the date/time it was accessed. A background process polls this queue and fetches the messages. For each message, you first get the current count and last access date/time for that resource. If no record is found, you simply insert a record in this table with a count of 1. If a record is found, you compare the date/time in the table with the date/time in the message: if the table's date/time is older, you update both the count (increase it by 1) and the last access date/time; if the table's date/time is newer, you update only the count.
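Here is a sketch of that merge step in isolation. The queue plumbing is omitted and the store interface is a placeholder, not an Azure SDK type; only the insert-or-update logic follows the description above:

using System;

// Placeholder message shape: what the web tier enqueues on each access.
public class AccessMessage
{
    public string Resource { get; set; }
    public DateTime AccessedUtc { get; set; }
}

public class AccessCountRow
{
    public string PartitionKey { get; set; }   // the access date at day precision
    public string Resource { get; set; }       // doubles as the RowKey
    public int AccessCount { get; set; }
    public DateTime LastAccessDateTime { get; set; }
}

// Placeholder store: a thin wrapper over the counts table described above.
public interface IAccessCountStore
{
    AccessCountRow Find(string partitionKey, string resource); // null if absent
    void Upsert(AccessCountRow row);
}

public static class AccessCountProcessor
{
    public static void Apply(IAccessCountStore store, AccessMessage msg)
    {
        string partition = msg.AccessedUtc.ToString("yyyyMMdd"); // day-level precision
        var row = store.Find(partition, msg.Resource);
        if (row == null)
        {
            // First access in this period: start the count at 1.
            row = new AccessCountRow
            {
                PartitionKey = partition,
                Resource = msg.Resource,
                AccessCount = 1,
                LastAccessDateTime = msg.AccessedUtc
            };
        }
        else
        {
            row.AccessCount++;
            // Messages can arrive out of order; only move the timestamp forward.
            if (row.LastAccessDateTime < msg.AccessedUtc)
                row.LastAccessDateTime = msg.AccessedUtc;
        }
        store.Upsert(row);
    }
}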
Now, to find the most accessed resources in a time span, you simply query this table. Assuming there is a limited number of resources (say in the 100s), you can get this information in as little as one request. Since you're dealing with a small amount of data, you can simply download it to the client side and order it any way you see fit. To see the access details for a particular resource, however, you would have to fetch the detailed data (1,000 entities at a time).
Part of your brain might still be unconsciously trapped in relational-table design paradigms; I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know the total $ amount of these transactions. Because Azure table storage doesn't (yet?) offer aggregate functions, I can't simply call .Sum(). To get around that I'm going to:
Sum the values of the transactions in my app before I pass them to Azure.
I'll then pass the result of that sum to Azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by incrementing the value of RunningTotal each time I get new transactions (see the sketch below).
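The fold itself is a one-liner; it's sketched here only to make the bookkeeping explicit (the names are illustrative):

using System.Linq;

public static class RunningTotals
{
    // Table storage has no SUM(), so the app does the arithmetic and
    // persists only the result alongside the raw transaction rows.
    public static decimal Add(decimal currentRunningTotal, decimal[] newAmounts)
    {
        return currentRunningTotal + newAmounts.Sum();
    }
}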
Of course there are risks to this but the app is a personal one so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. I'd almost be using table storage as a long-term cache rather than a database.

What would the performance and cost be of storing every GET request in a "views" table?

I'm thinking about tracking page views for dynamic pages on my website for pages like the url below:
example.com/things/12456
I'm currently using Ruby on Rails with postgresql, on Heroku.
If I store EVERY GET request in a table, every time a user views a page, the database could grow extremely large very quickly. Ideally, I'd like to track the timestamp, user id and user role of each request as well, so each view would have to be a row in the table, as opposed to having a "count" column for each resource.
I'd also like to make aggregate queries on this large table, for things like, total count per resource over a time period.
In terms of performance and cost, would this make sense to do? Are there better alternatives out there?
EDIT: Let's say I have 1,000 views a day, with each user viewing 10 pages, and I'm making 500 aggregate requests a day.
Would this be expensive or non-scalable?
(I'd also need to store POST, PUT and DELETE requests as well, into an actions table, which fits into this very same problem)

Caching strategy suggestions needed

We have a fantasy football application that uses memcached and the classic memcached-object-read-with-sql-server-fallback. This works fairly well, but recently I've been contemplating the overhead involved and whether or not this is the best approach.
Case in point - we need to generate a drop-down list of the user's teams, so we follow this pattern (sketched in code after the list):
Get the list of the user's teams from memcached.
If not available, get the list from SQL Server and store it in memcached.
Do a multiget to get the team objects.
Fall back to loading any missing objects from SQL and store them in memcached.
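In code, the pattern looks roughly like this; Team, ITeamRepository and ICache are placeholders standing in for the real domain types and memcached client (the ICache surface is loosely modeled on typical .NET memcached clients, not any specific library):

using System.Collections.Generic;
using System.Linq;

public class Team { public int Id; public string Name; /* ...the other hundred bytes... */ }

public interface ITeamRepository
{
    IList<int> GetTeamIdsForUser(int userId);
    Team GetTeam(int teamId);
}

public interface ICache
{
    object Get(string key);
    IDictionary<string, object> Get(IEnumerable<string> keys); // multiget
    void Store(string key, object value);
    void Delete(string key);
}

public static class TeamListLoader
{
    public static IList<Team> GetUserTeams(ICache cache, ITeamRepository db, int userId)
    {
        // Steps 1-2: the list of team ids for this user.
        var ids = cache.Get("user:" + userId + ":teams") as IList<int>;
        if (ids == null)
        {
            ids = db.GetTeamIdsForUser(userId);
            cache.Store("user:" + userId + ":teams", ids);
        }

        // Step 3: multiget the team objects themselves.
        var found = cache.Get(ids.Select(id => "team:" + id));

        // Step 4: fall back to SQL for any misses and re-cache them.
        var teams = new List<Team>();
        foreach (var id in ids)
        {
            object value;
            if (found.TryGetValue("team:" + id, out value) && value is Team)
            {
                teams.Add((Team)value);
            }
            else
            {
                var team = db.GetTeam(id); // the source of the N extra round trips
                cache.Store("team:" + id, team);
                teams.Add(team);
            }
        }
        return teams;
    }
}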
This is all very well - each piece of data is relatively easy to cache and invalidate - but there are two major downsides to this:
1) Because we are operating on whole objects we incur a rather large overhead - a single team occupies a few hundred bytes in memcached, and all we really need for this case is a list of team names and ids - not all the other stuff in the team objects.
2) Due to the fallback to loading individual objects, the number of SQL queries generated on an empty cache or when the items expire can be massive:
1 x memcached multiget (which misses, and causes the following)
1 x SELECT ... FROM Team WHERE Id IN (...)
20 x Store in memcached
So that's 21 network requests just for this one query, and the IN query is also slower than a specific join.
Obviously we could just do a simple
SELECT Id, Name FROM Teams WHERE UserId = XYZ
And cache that result, but this would mean the data would need to be specifically invalidated whenever the user creates a new team. In this case that might seem relatively simple, but we have many of these types of queries, and many of them operate on axes that are not easily invalidated (like a list of the ids and names of the teams that your friends have created in a specific game).
Sooo... my question is: do any of you have ideas for resolving the mentioned drawbacks, or should I just accept that there is an overhead, that cache misses are bad, and live with it?
First, cache what you need: maybe just those two fields, not the complete record.
Second, cache what you need again: break the result set into records and cache them separately (see the sketch below).
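A sketch of both suggestions applied to the drop-down example from the question, reusing the placeholder ICache from the earlier sketch; the DTO, repository and invalidation hook are all illustrative:

using System.Collections.Generic;

// Just the two fields the drop-down needs, not the whole Team object.
public class TeamName
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public interface ITeamNameRepository
{
    IList<TeamName> GetTeamNames(int userId); // SELECT Id, Name FROM Teams WHERE UserId = @UserId
}

public static class TeamNameCache
{
    public static IList<TeamName> GetTeamNames(ICache cache, ITeamNameRepository db, int userId)
    {
        string key = "user:" + userId + ":teamnames";
        var names = cache.Get(key) as IList<TeamName>;
        if (names == null)
        {
            names = db.GetTeamNames(userId);
            cache.Store(key, names);
        }
        return names;
    }

    // The invalidation burden from the question: every write path that
    // can change the result set has to know this key exists.
    public static void OnTeamCreated(ICache cache, int userId)
    {
        cache.Delete("user:" + userId + ":teamnames");
    }
}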
about caching:
You generally use caching to offload the slower disk-based storage, in this case MySQL. The memory cache scales up rather easily; MySQL scales less easily.
Given that, even if you double the CPU/network/memory usage of the cache by splitting data up and putting it back together again, it will still offload the db. Adding another Node.js instance or another memcached server is easy.
back to your question
You say it's a user's teams: you could fetch the list when the user logs in and keep it updated in the cache as the user changes it throughout the session.
I presume the team members' names do not change; if so, you can load all team members by id and name and store those in the cache, or even locally in Node.js, using the same fallback strategy you use now. Only steps 1, 2 and 4 will be left then.
Personally, I usually try to split the SQL results into smaller ready-made pieces and cache those, keeping the cache updated as long as possible and ultimately trying to use MySQL only as storage, never reading from it.
You will usually run some logic on the rows returned from MySQL anyway; there's no need to keep repeating that.

optimizing large selects in hibernate/jpa with 2nd level cache

I have a user object represented in JPA which has specific sub-types. Eg, think of User and then a subclass Admin, and another subclass Power User.
Let's say I have 100k users. I have successfully implemented the second level cache using Ehcache in order to increase performance and have validated that it's working.
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html#performance-cache
I know it does work (i.e., you load the object from the cache rather than invoking an SQL query) when you call the load method. I've verified this via logging at the Hibernate level and by confirming that it's quicker.
However, I actually want to select a subset of all the users...for example, let's say I want to do a count of how many Power Users there are.
Furthermore, my users have an associated ZipCode object... the ZipCode objects are also second-level cached... what I'd like to do is ask queries like: how many Power Users do I have in New York state?
However, my question is: how do I write a query to do this that will hit the second level cache and not the database? Note that my second level cache is configured to be read/write, so as new users are added to the system they should automatically be added to the cache. Also, I have investigated the query cache briefly, but I'm not sure it's applicable, as that is for queries that are run multiple times. My problem is more a case of: the data should be in the second level cache anyway, so what do I have to do so that the database doesn't get hit when I write my query?
cheers,
Brian
(...) the data should be in the second level cache anyway so what do I have to do so that the database doesn't get hit when I write my query.
If the entities returned by your query are cached, have a look at Query#iterate(). This will trigger a first query to retrieve a list of IDs and then subsequent queries for each ID... that would hit the L2 cache.

(ASP.NET) How would you go about creating a real-time counter which tracks database changes?

Here is the issue.
On a site I've recently taken over it tracks "miles" you ran in a day. So a user can log into the site, add that they ran 5 miles. This is then added to the database.
At the end of the day, around 1am, a service runs which calculates all the miles, all the users ran in the day and outputs a text file to App_Data. That text file is then displayed in flash on the home page.
I think this is kind of ridiculous. I was told they had to do this due to massive performance issues. They won't tell me exactly how they were doing it before or what the major performance issue was.
So what approach would you guys take? The first thing that popped into my mind was a web service which gets the data via an AJAX call. Perhaps every time a new "mile" entry is added, a trigger is fired and updates the "GlobalMiles" table.
I'd appreciate any info or tips on this.
Thanks so much!
Answering this question is a bit difficult since we don't know all of your requirements or what exactly didn't work before. So here are some different ideas.
First, revisit your assumptions. Generating a static report once a day is a perfectly valid solution if all you need is daily reports. Why hit the database multiple times throughout the day if all that's needed is a snapshot? (For instance, lots of blog software used to write HTML files when an entry was posted rather than serving it from the database each time; many still do, as an optimization.) Is the "real-time" feature something you are adding?
I wouldn't jump to AJAX right away. Use the same input method; just move the report from static to dynamic. Doing too much at once is a good way to get yourself buried. When changing existing code I try to find areas that I can change in isolation with the least amount of impact on the rest of the application. Once you have the dynamic report, then you can add AJAX (and please use progressive enhancement).
As for the dynamic report itself you have a few options.
Of course you can just SELECT SUM(), but it sounds like that would cause performance problems if each user has a large number of entries.
If your database supports it, I would look at using an indexed view (sometimes called a materialized view). SQL Server maintains the aggregate automatically as rows change, which gives you fast reads of the real-time sum data:
CREATE VIEW vw_Miles WITH SCHEMABINDING AS
SELECT SUM([Count]) AS TotalMiles,
       COUNT_BIG(*) AS [EntryCount],
       UserId
FROM dbo.Miles   -- SCHEMABINDING requires a two-part table name
GROUP BY UserId
GO
CREATE UNIQUE CLUSTERED INDEX ix_Miles ON vw_Miles (UserId)
If the overhead of that is too much, @jn29098's solution is a good one. Roll it up using a scheduled task. If there are a lot of entries for each user, you could add only the delta from the last time the task ran:
UPDATE GlobalMiles
SET [TotalMiles] = [TotalMiles] +
    (SELECT SUM([Count])
     FROM Miles
     WHERE UserId = @id
       AND EntryDate > @lastTaskRun)
WHERE UserId = @id
If you don't care about storing the individual entries but only the total you can update the count on the fly:
UPDATE Miles SET [Count] = [Count] + @newCount WHERE UserId = @id
You could use this method in conjunction with the sproc that adds the entry and have the best of both worlds.
Finally, your trigger method would work as well. It's an alternative to the indexed view where you do the update yourself on a table instead of SQL Server doing it automatically. It's also similar to the previous option in that you move the global update out of the sproc and into a trigger.
The last three options make it more difficult to handle the situation when an entry is removed, although if that's not a feature of your application then you may not need to worry about that.
Now that you've got materialized, real-time data in your database, you can dynamically generate your report. Then you can add the AJAX fanciness.
If they are truly having performance issues due to too many hits on the database, then I suggest you take all the input and push it into a message queue (MSMQ). Then you can have a service on the other end that picks up the messages and does a bulk insert of the data. This way you have fewer db hits. Then you can output to the text file on the update too.
I would create a summary table that's rolled up once an hour or nightly and calculates total miles run. For individual requests you could pull from the nightly summary table plus any miles logged between the last rollup and when the user views the page, to get that user's total.
How many users are you talking about and how many log records per day?
