Is implementing a cache considered a difficult problem? - caching

There are many questions on here about caches that don't work properly, or about how to implement caches correctly, for all sorts of things from HTTP to SQL queries, L1/L2 memory caching, and so on.
Is it generally held to be a difficult problem in computer science terms?

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

I don't know whether implementing a cache is really considered a difficult thing, but I suspect that most people who implement one don't test it well enough to be sure that it works and that the cache actually does the job it is supposed to do. The thing about implementing a cache is that if you get it wrong, you won't even know unless you are monitoring cache misses and comparing performance, since a cache failure just "fails dumb" down to the backing store. So what if the cache has a 100% miss rate; it only makes the default case 1% slower anyway.

I think the issue is deciding how you want your cache to operate. What strategy do you use for eviction (least recently used, etc.)? Is it going to be distributed? Do you want to optimise for cache hits? The complexity comes from these decisions.
Simple in-process caches can be trivial. Generally I've not had to implement these, since off-the-shelf implementations exist for most platforms, and I would hesitate to reinvent the wheel (I presume your question is about the difficulty of implementation, rather than about whether you should create one yourself).
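For a sense of what "trivial" means here, a minimal single-threaded LRU sketch (the class name and shape are mine, not from any particular library; no locking, no expiry) might look like this:

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Minimal LRU cache sketch: a doubly-linked list keeps usage order,
// and a hash map gives O(1) lookup into the list. Not thread-safe.
template <typename K, typename V>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    std::optional<V> get(const K& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;        // miss
        items_.splice(items_.begin(), items_, it->second);  // mark as most recent
        return it->second->second;
    }

    void put(const K& key, V value) {
        auto it = index_.find(key);
        if (it != index_.end()) {                           // update in place
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() == capacity_) {                   // evict least recent
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

private:
    std::size_t capacity_;
    std::list<std::pair<K, V>> items_;  // front = most recently used
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
};
```

Even this toy version shows where the real decisions hide: thread safety, expiry, and capacity tuning are all deliberately left out, and those are exactly the parts an off-the-shelf implementation has already tested.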

I think the difficulty many programmers have is in restraining themselves from writing caching software for their application. It's good to remember that modern operating systems already do a raft of caching for you, and that implementing it yourself may well be a pessimization.

Not until you try it in a nontrivial case.

I think that no matter the task, if there is a need for concurrent access, the difficult part is exactly that: controlling the concurrent access. That alone can cause serious problems such as deadlock, serialized access, and limits on concurrency, among others.

Related

Is it reasonable to read data on disk in parallel?

In an application, one may need to read data/files on disk and load them into memory. Many programming languages have support for using multiple CPUs to do the work. I am wondering whether reading from disk in parallel is a reasonable option. Won't parallel/concurrent routines harm the disk?
Could you please provide some advice on how to design this kind of system? Thanks in advance.
If you are after performance, then reading data in parallel is the best thing you can do. The more requests you can give a disk, the faster it can complete the aggregate set of operations.
The only problem with reading data concurrently is that you need to be able to handle it correctly in your application. Typically this means using threads, although you can find OS-specific solutions that may help, such as AIO on Linux.
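As a rough sketch of the threaded approach (the file names here are hypothetical), each read runs as its own task and the caller gathers the results:

```cpp
#include <fstream>
#include <future>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Read one file into memory; each call runs on its own thread below.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

int main() {
    std::vector<std::string> paths = {"a.dat", "b.dat", "c.dat"};  // hypothetical
    std::vector<std::future<std::string>> pending;

    // Issue all reads concurrently: giving the OS several outstanding
    // requests lets the disk scheduler reorder and merge them.
    for (const auto& p : paths)
        pending.push_back(std::async(std::launch::async, read_file, p));

    for (auto& f : pending)
        std::cout << "read " << f.get().size() << " bytes\n";
}
```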
Lastly, the term "reasonable" is somewhat loaded. While it may be faster to read data concurrently, is there a good use case, does it improve the user experience, and is it worth the extra code complexity? In most cases, the answer would be no.

How to avoid bottleneck performance?

A distributed system is described as scalable if it remains effective when there is a significant increase in the number of resources and the number of users. However, these systems sometimes face performance bottlenecks. How can these be avoided?
The question is pretty broad, and depends entirely on what the system is doing.
Here are some things I've seen in systems to reduce bottlenecks.
Use caches to reduce network and disk bottlenecks. But remember that knowing when to evict from a cache can be a hard problem in some situations.
Use message queues to decouple components in the system. This way you can add more hardware to the specific parts of the system that need it (a minimal sketch follows at the end of this answer).
Delay computation when possible (often by using message queues). This takes the heat off the system during peak processing times.
Of course, design the system for parallel processing wherever possible. One host doing all the processing is NOT scalable. Note: most relational databases fall into the one-host bucket, which is why NoSQL databases have suddenly become popular, though they are not always appropriate.
Use eventual consistency if possible. Strong consistency is much harder to scale.
Some people are proponents of CQRS and DDD. Though I have never seen or designed a "CQRS system" or a "DDD system," those ideas have definitely affected the way I design systems.
There is a lot of overlap in the points above; some of the techniques build on the others.
But experience (your own and others') eventually teaches you about scalable systems. I keep up to date by reading about designs from Google, Amazon, Twitter, Facebook, and the like. Another good starting point is the High Scalability blog.
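To make the message-queue point above concrete, here is a minimal in-process producer/consumer sketch; a real system would put a broker such as RabbitMQ or Kafka between separate hosts, but the decoupling works the same way:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Minimal thread-safe queue: producers enqueue work and move on;
// consumers drain it at their own pace. This is the in-process
// analogue of a message broker decoupling two components.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    BlockingQueue<int> jobs;
    std::thread consumer([&] {
        for (int i = 0; i < 3; ++i)
            std::cout << "processed job " << jobs.pop() << "\n";
    });
    for (int i = 1; i <= 3; ++i) jobs.push(i);  // producer returns immediately
    consumer.join();
}
```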
Just to build on a point discussed in the post above, I would like to add that what you need for your distributed system is a distributed cache. When you intend to scale your application, the distributed cache acts like an "elastic" data fabric, meaning that you can increase the storage capacity of the cache without compromising performance, while at the same time getting a reliable platform that is accessible to multiple applications.
One such distributed caching solution is NCache. Do take a look!

LAMP stack performance under heavy traffic loads

I know the title of my question is rather vague, so I'll try to clarify as much as I can. Please feel free to moderate this question to make it more useful for the community.
Given a standard LAMP stack with more or less default settings (a bit of tuning is allowed, client-side and server-side caching turned on), running on modern hardware (16 GB RAM, 8-core CPU, unlimited disk space, etc.), deploying a reasonably complicated CMS service (a Drupal or WordPress project, for argument's sake) - what amounts of traffic, SQL queries, and user requests can I reasonably expect to accommodate before I have to start thinking about performance?
NOTE: I know that specifics will greatly depend on the details of the project, i.e. optimizing MySQL queries, indexing, minimizing filesystem hits - assuming the web developers did a professional job - I'm really looking for a very rough figure in terms of visits per day, traffic during peak visiting times, how many records before (transactional) MySQL fumbles, and so on.
I know the only way to really answer my question is to run load testing on a real project, and I'm concerned that my question may be treated as partly off-topic.
I would like to get a set of figures from people with first-hand experience, e.g. "we ran such-and-such a set-up and it handled at least this much load [problems started surfacing after such-and-such]". I'm also greatly interested in any condensed (I'm short on time atm) reading I can do to get a better understanding of the matter.
P.S. I'm meeting a client tomorrow to talk about his project, and I want to be prepared to reason about performance if his project turns out to be akin to FourSquare.
Very tricky to answer without specifics, as you have noted. If I were tasked with what you have to do, I would take each component in turn (network interface, CPU/memory, physical IO load, SMP locking, etc.), work out the maximum capacity available, and divide by a rough estimate of use per request.
For example, network IO. You might have one 1 Gbit card, which might achieve maybe 100 MB/sec (I tend to use 80% of the theoretical maximum). How big will a typical "hit" be? Perhaps 3 KB on average, for HTML, images, etc. That means you can achieve around 33k requests per second before you bottleneck at the physical level. These numbers are absolute maximums; depending on tools and skills you might not get anywhere near them, but nobody can exceed them.
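Spelled out, that back-of-envelope arithmetic looks like this (the inputs are the rough estimates above, not measurements):

```cpp
#include <iostream>

int main() {
    // Rough estimates from the paragraph above, not measurements.
    double link_bytes_per_sec = 1e9 / 8 * 0.8;  // 1 Gbit NIC at ~80% of theoretical max
    double bytes_per_request  = 3e3;            // ~3 KB average response
    std::cout << "upper bound: "
              << link_bytes_per_sec / bytes_per_request
              << " requests/sec\n";             // ~33,000
}
```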
Repeat the above for every component, perhaps varying your numbers a little, and you will quickly build a picture of what is likely to be a concern. Then consider how you can quickly get more capacity in each component: can you just chuck $$ at it and gain more performance (e.g. use SSD drives instead of HDs)? Or will you hit a limit that cannot be moved without rearchitecting? Also take into account what resources you have available: do you have lots of skilled programmer time, DBAs, or wads of cash? If you have plenty of a resource, you can relax the corresponding constraints more easily and quickly as you move along the experience curve.
Do not forget external components either: firewalls may have lower limits than expected for sustained traffic.
Sorry I cannot give you real numbers; our workloads use custom servers, heavy in-memory caching, and other tricks, and not all the products you list. However, I would concentrate most on IO/SQL queries and possibly network IO, as these tend to be harder limits than CPU/memory, although I'm sure others will have a different opinion.
Obviously, the question is one that does not have a "proper" answer, but I'd like to close it and give some feedback. The client meeting has taken place, performance was indeed a biggie, and their hosting platform turned out to be on the Amazon cloud :)
From research I've done independently:
Memcache is a must;
MySQL (or whatever persistent storage you're running) is usually the first to go. Solutions include running multiple virtual instances and replicating data between them, distributing the load;
http://highscalability.com/ is a good read :)

How can caches be defeated?

I have this question on my assignment this week, and I don't understand how caches can be defeated, or how I can show it with an assembly program. Can someone point me in the right direction?
Show, with assembly program examples, how the two different caches (associative and direct-mapped) can be defeated. Explain why this occurs and how it can be fixed. Are the programs used to defeat the two caches the same?
Note: This is homework. Don't just answer the question for me, it won't help me to understand the material.
A cache is there to increase performance, so defeating a cache means finding a pattern of memory accesses that decreases performance (in the presence of the cache) rather than increasing it.
Bear in mind that the cache is limited in size (smaller than main memory, for instance) so typically defeating the cache involves filling it up so that it throws away the data you're just about to access, just before you access it.
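To make the "fill it up just before you access it" idea concrete without handing over the assignment, here is a sketch in C++ rather than assembly (the access pattern is what matters; the cache and line sizes are assumed parameters):

```cpp
#include <cstddef>

// Assumed parameters for illustration: a 32 KB direct-mapped cache
// with 64-byte lines. Two arrays aligned to the cache size map to
// the same cache lines, so alternating between them evicts each
// other's data on every access: close to a 100% miss rate.
constexpr std::size_t kCacheSize = 32 * 1024;
constexpr std::size_t kLineSize  = 64;

alignas(kCacheSize) char a[kCacheSize];
alignas(kCacheSize) char b[kCacheSize];

long thrash() {
    long sum = 0;
    for (std::size_t i = 0; i < kCacheSize; i += kLineSize) {
        sum += a[i];  // loads a's line for this index
        sum += b[i];  // same index bits: conflict miss in a direct-mapped cache
    }
    return sum;
}
```

An N-way set-associative cache tolerates up to N blocks per set, so it needs more than N conflicting arrays before it degrades the same way; that asymmetry is why the programs for the two cache types are not quite the same.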
If you're looking for a hint, think about splitting a data word across 2 cache lines.
(In case you're also looking for the answer, a similar problem was encountered by the x264 developers -- more information available here and here. The links are highly informative, and I really suggest you read them even after you've found your answer.)
Another thing to keep in mind is whether the caches you are dealing with are virtually or physically indexed/tagged. In some variants, cache aliasing forces line replacements even if the cache as such isn't completely full. In other variants, cache/page colouring collisions may cause evictions. Finally, in multiprocessor systems under certain workloads, cache-line migrations (between the caches of different CPUs) may limit the usefulness of CPU caches.

How to detect high contention critical sections?

My application uses many critical sections, and I want to know which of them might cause high contention. I want to avoid bottlenecks, to ensure scalability, especially on multi-core, multi-processor systems.
I already found one accidentally when I noticed many threads hanging while waiting to enter a critical section when the application was under heavy load. That one was rather easy to fix, but how do you detect such high-contention critical sections before they become a real problem?
I know there is a way to create a full dump and get that info from it (somehow?), but that is a rather intrusive approach. Are there methods an application can use on the fly to diagnose itself for such issues?
I could use data from the _RTL_CRITICAL_SECTION_DEBUG structure, but there are notes that this could be unsafe across different Windows versions: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/01/434648.aspx
Can someone suggest a reliable and not too complex method to get such info?
What you're talking about makes perfect sense during testing, but isn't really feasible in production code.
I mean.. you CAN do things in production code, such as read the LockCount and RecursionCount values (these are documented), subtract RecursionCount from LockCount, and presto, you have the number of threads waiting to get their hands on the CRITICAL_SECTION object.
You may even want to go deeper. The RTL_CRITICAL_SECTION_DEBUG structure IS documented in the SDK. The only thing that ever changed regarding this structure is that some reserved fields were given names and put to use. I mean.. it's in the SDK headers (winnt.h), and documented fields do NOT change. You misunderstood Raymond's story. (He's partially at fault; he likes a sensation as much as the next guy.)
My general point is: if there's heavy lock contention in your application, you should, by all means, ferret it out. But don't ever make the code inside a critical section bigger if you can avoid it. And reading the debug structure (or even LockCount/RecursionCount) should only ever happen while you're holding the object. It's fine in a debug/testing build, but it should not go into production.
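A debug-build-only sketch of the LockCount/RecursionCount heuristic described above (the exact semantics of these fields have varied across Windows versions, so treat the number as approximate, never as production telemetry):

```cpp
#include <windows.h>
#include <cstdio>

// Debug-only heuristic: while holding the critical section,
// LockCount minus RecursionCount approximates the number of threads
// currently waiting on it. Field semantics have varied across Windows
// versions, so this belongs in testing builds only.
void LogContention(CRITICAL_SECTION* cs, const char* name)
{
#ifdef _DEBUG
    EnterCriticalSection(cs);  // only inspect while holding the object
    LONG waiters = cs->LockCount - cs->RecursionCount;
    if (waiters > 0)
        std::printf("%s: ~%ld waiting thread(s)\n", name, waiters);
    LeaveCriticalSection(cs);
#endif
}
```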
There are other ways to handle concurrency besides critical sections (e.g. semaphores). One of the best is non-blocking synchronization, which means structuring your code so that it does not require blocking even with shared resources. You should read up on concurrency. Also, you can post a code snippet here and someone can give you advice on ways to improve your concurrency code.
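As a small illustration of the non-blocking idea, here is a shared counter maintained with an atomic increment instead of a critical section; no thread ever blocks waiting for another:

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Lock-free shared counter: fetch_add is a single atomic operation,
// so there is no critical section for threads to contend on.
std::atomic<long> hits{0};

void worker() {
    for (int i = 0; i < 100000; ++i)
        hits.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    std::cout << hits.load() << "\n";  // always 400000, no lock required
}
```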
Take a look at Intel Thread Profiler. It should be able to help to spot such problems.
Also, you may want to instrument your code by wrapping critical sections in a proxy that dumps data to disk for analysis. It really depends on the app itself, but it could at least record how long a thread has been waiting for the CS.
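One hypothetical shape for such a proxy is a RAII wrapper that times the acquisition; the class name, threshold, and logging below are placeholders, not any existing library:

```cpp
#include <windows.h>
#include <chrono>
#include <cstdio>

// Hypothetical instrumentation proxy: measures how long this thread
// waited to acquire the critical section. A real version would
// aggregate samples per section and dump them to disk for analysis.
class TimedLock {
public:
    TimedLock(CRITICAL_SECTION& cs, const char* name) : cs_(cs) {
        auto start = std::chrono::steady_clock::now();
        EnterCriticalSection(&cs_);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        if (us > 1000)  // placeholder threshold: report waits over 1 ms
            std::printf("%s: waited %lld us\n", name, static_cast<long long>(us));
    }
    ~TimedLock() { LeaveCriticalSection(&cs_); }
private:
    CRITICAL_SECTION& cs_;
};
```

Replacing an EnterCriticalSection/LeaveCriticalSection pair with TimedLock guard(cs, "name"); then yields per-section wait samples under real load.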
