Performance gain by removing already loaded DOM-nodes? - performance

Assume a heavy website consists of about 5000 DOM nodes, but 20% of them are used only for a limited part of the page visit, for example just the first 10 s, after which they are hidden automatically. How will removing those nodes improve the performance of this page in the browser?
I know that a large number of DOM nodes has implications for network efficiency and load performance. It also affects runtime performance when elements are interacted with or otherwise need to be recomputed. I'm also aware of potential memory performance implications, for example when general query selectors have to search or traverse for longer.
The question is: apart from this, practically or theoretically, how will removing these DOM nodes (instead of just hiding them) improve the performance of the page?

This depends on the specific content: a thousand hidden <br> elements will probably not have a measurable effect on anything, while a thousand document.createTextNode("foobar".repeat(10000000)) nodes will take up roughly 60 GB of memory, and I imagine one could come up with a creative pathological test case that stresses other subsystems.
Realistically, there probably won't be a noticeable difference between a page with a thousand display: none nodes and a page without them.
But why don't you measure?
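The 60 GB figure for the pathological text-node case can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the bytes-per-character values are assumptions about how an engine might store the string, and it ignores per-node overhead as well as any lazy or shared string representation.

```python
# Back-of-the-envelope estimate for a thousand huge text nodes.
CHARS_PER_UNIT = len("foobar")   # 6 characters
REPEATS = 10_000_000             # "foobar".repeat(10000000)
NODE_COUNT = 1_000               # a thousand text nodes


def text_payload_bytes(bytes_per_char: int) -> int:
    """Total bytes of string data across all nodes (no per-node overhead)."""
    return CHARS_PER_UNIT * REPEATS * NODE_COUNT * bytes_per_char


# Assumption: ASCII-only strings stored with one byte per character.
one_byte = text_payload_bytes(1)
# Assumption: strings stored as UTF-16 with two bytes per character.
two_byte = text_payload_bytes(2)

print(one_byte / 1e9, "GB at 1 byte per character")
print(two_byte / 1e9, "GB at 2 bytes per character")
```

Either way the payload is tens of gigabytes, which is why the content of the nodes matters far more than their count.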

Related

Elasticsearch performance with heavy usage of search_after and PIT

We are planning to integrate search_after queries with UI pagination. If we use a PIT and keep the search context alive between UI page fetches, which could be minutes apart depending on user think time, would that scale well? We have to support at least a few hundred concurrent users.
Also, how would that affect other searches and the background ingestion process? The documentation says Lucene segment merging is impacted by open PIT contexts. We have a large, fairly rapidly changing index.
Answering my own question with a response from Elastic:
For scroll requests we have a limit of 500 on the number of open scroll contexts. Because PIT contexts are much more lightweight, we don’t have any limit on the number of PIT contexts, so you can open as many PIT contexts as you like. We probably need to introduce some limitation though. In the worst-case scenario, when you constantly open PIT contexts with a very long keep_alive parameter and constantly update your indices, you may run out of file descriptors or heap memory because, as you rightly noticed, segments used by PIT contexts are kept and not deleted by merges.
On the other hand, if you use a relatively small keep_alive, say 10-15 minutes, use a refresh_interval high enough not to create many segments, and regularly monitor the number of PIT contexts with GET /_nodes/stats/indices/search, then it will probably work fine.
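As a rough illustration of the pagination flow being discussed, the sketch below only builds the JSON bodies for PIT-based search_after requests; the HTTP calls are left out. The function names, the @timestamp sort field, the match_all query, and the page size are assumptions made for this example, not part of the original question.

```python
from typing import Any, Optional


def pit_search_body(
    pit_id: str,
    keep_alive: str = "10m",
    page_size: int = 20,
    search_after: Optional[list] = None,
) -> dict:
    """Build the body of a search request that runs against an open PIT."""
    body: dict[str, Any] = {
        "size": page_size,
        "query": {"match_all": {}},  # assumption: replace with the real query
        # Each request extends the PIT lifetime by keep_alive.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        # Assumption: documents have an @timestamp field; _shard_doc is the
        # tiebreaker Elasticsearch provides for PIT searches.
        "sort": [{"@timestamp": "desc"}, {"_shard_doc": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def next_page_cursor(response: dict, current_pit_id: str) -> tuple:
    """Return (pit_id, search_after) to use for the following page."""
    hits = response["hits"]["hits"]
    # The response may carry an updated PIT id; prefer it when present.
    pit_id = response.get("pit_id", current_pit_id)
    cursor = hits[-1]["sort"] if hits else None
    return pit_id, cursor
```

The UI would store the returned PIT id and sort values with the user's session and pass them back for the next page, so a short keep_alive only has to cover the think time between two fetches.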

What is a viable strategy to reach a particular cache hit ratio?

Our team is building a cache layer for a key-value lookup service, with a general guideline to use a two-level cache: an in-host layer and a distributed layer. There is a requirement of a 70% cache hit ratio, so only 30% of traffic is expected to reach the downstream NoSQL store. At the beginning, we identified some factors that influence the hit ratio:
TTL
Cache size
The query pattern, e.g. 15% of the keys are queried more often than the others.
... other?
We also have some initial ideas on how to achieve it, such as prefetching data into the cache, e.g. 70% of the data. But at the end of the day I realize that it's more complicated than we thought and we need a stronger rationale.
Are there any resources, research, or papers related to this issue? Or what is the proper approach to test it or run a spike?
There are 3 main factors that influence your hit ratio:
Access pattern
Caching strategy
Working set size to cache size relation
The access pattern is generally out of your control because it depends on how users access your service. You do have control over the caching strategy, but it is generally not straightforward to see how to change it to improve your hit ratio. The working set is generally not in your control because it depends on the access pattern, but you do have control over your cache size.
I would approach your situation as follows:
Make sure the working set fits into your cache (easy to do)
Improve the cache strategy (more complex and time consuming)
To find out your working set size and make sure it fits in the cache you can start with a small cache and gradually (every couple of days for example) increase the cache size and see how much the hit ratio increases. The hit rate increase will become smaller and smaller the bigger the cache gets and once you hit the point of diminishing returns you know your working set size. The hit rate you get at this point is the maximum you will get for your caching strategy.
If your working set fits into your cache and you hit your 70% requirement, you are done. If not, you will need to tweak your caching strategy. This basically requires clever engineering. Simulation, as Ben Manes suggests, is definitely a very useful tool for such engineering.
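The procedure of growing the cache until the hit ratio stops improving can be tried offline before touching production. The following is a minimal simulation sketch, assuming an LRU eviction policy and a synthetic Zipf-like access pattern; both are stand-ins chosen for illustration, and a recorded trace of real keys would be a better input.

```python
import random
from collections import OrderedDict


def make_trace(num_keys: int, num_requests: int, skew: float = 1.0, seed: int = 42) -> list:
    """Synthetic access trace where key popularity falls off as 1 / rank**skew."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=num_requests)


def lru_hit_ratio(trace: list, cache_size: int) -> float:
    """Replay the trace through an LRU cache and return the fraction of hits."""
    cache: OrderedDict = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(trace)


trace = make_trace(num_keys=10_000, num_requests=50_000)
for size in (100, 500, 1_000, 2_000, 5_000, 10_000):
    print(size, round(lru_hit_ratio(trace, size), 3))
```

The printed ratios flatten out as the cache approaches the working set size; if the plateau sits below the 70% target, a bigger cache will not help and the strategy itself has to change.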

Elasticsearch shard allocation for small indices

I have an Elasticsearch setup with 192 active indices ranging from a few hundred MB to possibly 5 GB each. I read that for a Logstash use case with 1 GB indices you should only use 1 shard. The difference with my setup is that I will have more users (an estimated 100 at most) expecting a quick response time. I intend to have 1 replica for reliability.
Will having 1 shard per index still be appropriate for my use case?
In a word: yes.
The need to create multiple primary shards derives from the need to isolate documents, to handle extreme counts (e.g., when you have billions of documents), or to improve write throughput (writing documents across more places, thereby reducing the individual burden).
In practice, you want to shard based on your use case, unless you're in one of those first two scenarios (isolation or extreme counts).
Are you read heavy?
Are you write heavy? (Less common, but it does happen)
If you're read heavy, as most use cases are, then having fewer shards will help you by limiting the request size (fewer places to look). Given that your shard sizes are also relatively small (I'd consider anything under 5 GB to be relatively small), you can easily get away with a single primary shard, and doing so should benefit your search performance.
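To put numbers on the setup from the question, here is a small sketch that counts the shards the cluster would hold and shows what the matching index settings body could look like. The helper name is made up for this example; number_of_shards and number_of_replicas are the standard index settings.

```python
def total_shards(index_count: int, primaries_per_index: int, replicas_per_primary: int) -> int:
    """Every primary shard is stored once plus once per replica."""
    return index_count * primaries_per_index * (1 + replicas_per_primary)


# Settings body for creating an index with one primary shard and one replica.
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
    }
}

# 192 indices from the question, one primary and one replica each.
print(total_shards(192, 1, 1))
# The same indices with five primaries each, for comparison.
print(total_shards(192, 5, 1))
```

With a single primary per index, the total shard count stays at twice the number of indices instead of growing by a further factor for every extra primary.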
Indexes that share the same mappings, but are also tiny ("few hundred MBs"), should likely be combined if you search across them. If they're independent, then it really makes no difference and the isolation sounds like good practice at the expense of slightly bloating your cluster state (with each index).
Have a look at this blog: https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index. It has a lot of good pointers on sharding and shard sizing.
However, the question you really should be asking yourself is: how easy is it to change? When it comes to sizing and scalability, the answer often is "it depends", and the real question is: how quickly can you reconfigure?
This could mean, for example, that you design your application in a way that allows quick re-spooling of data into a new index, that you use aliases so that you can in fact change these things, that you think about where your data lives (not just in Elastic, I hope), and so on.
Building a system, from the start, so that you can quickly rebuild indices enables you to experiment with sizes and, more importantly, to change them as your needs change.
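One way to keep that flexibility is to have the application talk to an alias and repoint the alias once a rebuilt index is ready. The sketch below only builds the body for a POST /_aliases request that swaps the alias in a single atomic step; the index and alias names are hypothetical.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves an alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application always queries the "products" alias.
body = alias_swap_actions("products", "products-v1", "products-v2")
print(body)
```

Because searches go through the alias, the new index can be created with a different shard count, filled from the primary data store, and switched in without changing the application.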

Prioritizing Erlang nodes

Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN, while others may be connected using a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, latency induced) and b) nodes with differing computational power (or even memory constraints for that matter)?
In other words, how do I prioritize local nodes that have lots of computational power over those that have a high latency and may be less powerful? And how would I ideally prioritize high-performance remote nodes with high transmission latencies so that they specifically run those processes with a relatively high computation-to-transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking in terms of benchmarking each node in a cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be calculated, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node finishes a given task).
Probably, something like that would have to be done repeatedly, on the one hand in order to get representative data (that is, averaging data) and on the other hand it might possibly even be useful at runtime in order to be able to dynamically adjust to changing runtime conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines)
This would be meant to hopefully optimize internal job dispatch so that specific nodes handle specific jobs.
We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
1. The overhead in simply making a remote call over a local (internal) call
2. The network latency to the node (as a function of the request/result payload)
3. The performance of the remote node
4. The compute power needed to execute the function
5. Whether batching of calls provides any performance improvement if there was a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others)
For 2. we actively measured it using probe messages to measure round trip time, and we collated information from actual calls made
For 3. we measured it on the node and had them broadcast that information (this changed depending on the load currently active on the node)
For 4 and 5. we worked it out empirically for the given batch
Then the caller solved for the assignment that minimized the total cost of a batch of calls (in our case pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
We got much better utilization of our calculation "grid" using this technique but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment so we had a lot more control. Adding in an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity with possible diminishing returns...
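The factors listed above can be combined into a simple cost estimate per node. The sketch below is not the solver described in this answer; it is a greedy heuristic, assuming each node is characterised by a measured round-trip time and a measured speed, and all names in it are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rtt_s: float              # measured round-trip time per call, in seconds
    speed: float              # measured throughput, in work units per second
    busy_until: float = 0.0   # estimated time at which the node becomes idle


def dispatch(jobs: dict, nodes: list) -> dict:
    """Assign each job (job id -> work units) to the node expected to finish it first."""
    assignment = {}
    # Place the largest jobs first so small jobs can fill the remaining gaps.
    for job_id, work in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        best = min(nodes, key=lambda n: n.busy_until + n.rtt_s + work / n.speed)
        best.busy_until += best.rtt_s + work / best.speed
        assignment[job_id] = best.name
    return assignment


nodes = [
    Node("local", rtt_s=0.001, speed=10.0),    # close by, but slow
    Node("remote", rtt_s=0.2, speed=100.0),    # far away, but fast
]
plan = dispatch({"big": 1000.0, "small": 1.0}, nodes)
print(plan)
```

In this toy run the large job is worth the trip to the fast remote node, while the small job stays local because the round-trip time would dominate its cost. Note that dispatch updates busy_until on the nodes it is given.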
The problem you are talking about has been tackled in many different ways in the context of Grid computing (e.g., see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e., is there unexpected external load?], etc.).
Implementing an adaptive job dispatcher will usually also require adjusting the frequency with which you probe the available resources (otherwise the overhead due to probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).
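As a much simpler stand-in for the predictive models mentioned above, actual execution times can be fed back into a running estimate of each node's speed. The sketch below uses an exponentially weighted moving average; the class name and the smoothing factor are assumptions for illustration, not a recommendation from the answer.

```python
class SpeedEstimate:
    """Running estimate of a node's speed, updated from observed executions."""

    def __init__(self, initial: float, alpha: float = 0.2):
        self.value = initial   # work units per second, e.g. from a benchmark run
        self.alpha = alpha     # weight of the newest observation (0 < alpha <= 1)

    def update(self, work_units: float, elapsed_s: float) -> float:
        """Blend the speed observed for one finished job into the estimate."""
        observed = work_units / elapsed_s
        self.value = (1 - self.alpha) * self.value + self.alpha * observed
        return self.value


estimate = SpeedEstimate(initial=100.0, alpha=0.5)
estimate.update(work_units=50.0, elapsed_s=1.0)   # node turned out slower than benchmarked
print(estimate.value)
```

Because every completed job doubles as a measurement, this kind of feedback reduces how often dedicated probe jobs are needed.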

Performance Optimization For Highly Interactive Websites

I recently completed development of a mid-trafficked website (peak 60k hits/hour). However, the site only needs to be updated once a minute, and achieving the required performance can be summed up by a single word: "caching".
For a site like SO where the data feeding the site changes all the time, I would imagine a different approach is required.
Page cache times presumably need to be short or non-existent, and updates need to be propagated across all the web servers very rapidly to keep all users up to date.
My guess is that you'd need a distributed cache to control the serving of data and pages that is updated on the order of a few seconds, with perhaps a distributed cache above the database to mediate writes?
Can those more experienced than I am outline some of the key architectural/design principles they employ to ensure highly interactive websites like SO are performant?
The vast majority of sites have many more reads than writes. It's not uncommon to have thousands or even millions of reads to every write.
Therefore, any scaling solution depends on separating the scaling of the reads from the scaling of the writes. Typically scaling reads is really cheap and easy, scaling the writes is complicated and costly.
The most straightforward way to scale reads is to cache entire pages at a time and expire them after a certain number of seconds. If you look at the popular website Slashdot, you can see that this is the way they scale their site. Unfortunately, this caching strategy can result in counter-intuitive behaviour for the end user.
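A minimal sketch of this whole-page caching with a fixed expiry is shown below, assuming a render function that produces the HTML for a path. The class and parameter names are made up for illustration. It also shows where the counter-intuitive behaviour comes from: until an entry expires, every visitor gets the stored copy even if the underlying data has changed.

```python
import time
from typing import Callable


class TtlPageCache:
    """Cache whole rendered pages and re-render them after ttl_s seconds."""

    def __init__(self, ttl_s: float, render: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.render = render
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self.entries: dict = {}  # path -> (time stored, html)

    def get(self, path: str) -> str:
        now = self.clock()
        entry = self.entries.get(path)
        if entry is not None and now - entry[0] < self.ttl_s:
            return entry[1]      # possibly stale, but cheap to serve
        html = self.render(path)  # expensive path: hits the database
        self.entries[path] = (now, html)
        return html
```

A short time to live narrows the window in which users see stale pages, at the price of more renders.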
I'm assuming from your question that you don't want this primitive sort of caching. Like you mention, you'll need to update the cache in place.
This is not as scary as it sounds. The key thing to realise is that, from the server's point of view, Stack Overflow does not update all the time. It updates fairly rarely, maybe once or twice per second. To a computer, a second is nearly an eternity.
Moreover, updates tend to occur to items in the cache that do not depend on each other. Consider Stack Overflow as example. I imagine that each question page is cached separately. Most questions probably have an update per minute on average for the first fifteen minutes and then probably once an hour after that.
Thus, in most applications you barely need to scale your writes. They're so few and far between that you can have one server doing the writes; updating the cache in place is actually a perfectly viable solution. Unless you have extremely high traffic, you're going to get very few concurrent updates to the same cached item.
So how do you set this up? My preferred solution is to cache each page individually to disk and then have many web-heads delivering these static pages from some mutually accessible space.
When a write needs to be done, it is done from exactly one server, which updates that particular cached HTML page. Each server owns its own subset of the cache, so there isn't a single point of failure. The update process is carefully crafted so that a transaction ensures that no two requests write to the same file at the same time.
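The answer does not say how its update transaction is implemented. One common way to get a similar guarantee on a single writer host is to serialise writers with a lock and publish each page through an atomic rename, so that readers never see a half-written file. The sketch below assumes exactly that; the function name and the in-process lock are illustrative choices.

```python
import os
import tempfile
import threading

# Assumption: all writes for this cache subset happen in one process,
# so an in-process lock is enough to serialise them.
_write_lock = threading.Lock()


def write_cached_page(cache_dir: str, page_name: str, html: str) -> str:
    """Replace a cached page on disk without exposing a partial file."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, page_name)
    with _write_lock:
        # Write to a temporary file in the same directory first.
        fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(html)
            # Atomically swap the new file in; readers see old or new, never a mix.
            os.replace(tmp_path, final_path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
    return final_path
```

The web heads can keep serving the old file right up to the rename, which fits the idea of static pages delivered from a shared location.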
I've found this design has met all the scaling requirements we have so far required. But it will depend on the nature of the site and the nature of the load as to whether this is the right thing to do for your project.
You might be interested in this article, which describes how Wikimedia's servers are structured. Very enlightening!
The article links to this PDF; be sure not to miss it.
