Cassandra key cache hit rate differs between nodetool and OpsCenter

I checked my key cache hit rate via nodetool and OpsCenter. nodetool shows a recent hit rate of 0.907:
Key Cache : entries 1152104, size 96.73 MB, capacity 100 MB, 52543777 hits, 57954469 requests, 0.907 recent hit rate, 14400 save period in seconds
but in OpsCenter the graph shows 100%.
Does anyone understand why they differ?

Cassandra arguably has a bug (or at least a misleading label) here: nodetool reports this as the "recent" hit rate, but it is actually the all-time value:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/nodetool/Info.java#L95
It's grabbing the value of the "total" hit rate:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/metrics/CacheMetrics.java#L66
So although you may be getting a 100% hit rate for the last 19 minutes according to OpsCenter, it wasn't always 100%: the total number of hits divided by the total number of requests over all time is ~90%.
You can see this directly from the numbers in your output:
52543777 hits, 57954469 requests
52543777 / 57954469 = 0.907
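For illustration, here is a small sketch (not Cassandra code; the cumulative counters are hard-coded from the nodetool output above, and the "earlier" counters are made up) showing how an all-time ratio of ~0.907 and a 100% windowed rate can coexist:
using System;

class CacheHitRate
{
    static void Main()
    {
        // All-time counters taken from the nodetool output above.
        long totalHits = 52543777;
        long totalRequests = 57954469;
        Console.WriteLine($"All-time hit rate: {(double)totalHits / totalRequests:F3}"); // ~0.907

        // Hypothetical counters sampled a few minutes earlier. If every request since
        // then was a hit, a windowed graph (OpsCenter-style) shows 100% even though
        // the all-time ratio barely moves.
        long earlierHits = 52543000;
        long earlierRequests = 57953692;
        double windowed = (double)(totalHits - earlierHits)
                        / (totalRequests - earlierRequests);
        Console.WriteLine($"Windowed hit rate: {windowed:F3}"); // 1.000
    }
}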

Related

Azure Table Increased Latency

I'm trying to create an app which can efficiently write data into Azure Table storage. To test storage performance, I created a simple console app that sends hard-coded entities in a loop. Each entity is 0.1 kB. Data is sent in batches (100 items per batch, 10 kB per batch). For every batch, I prepare entities with the same partition key, which is generated by incrementing a global counter, so I never send more than one request to the same partition. I also control the degree of parallelism by increasing or decreasing the number of threads. Each thread sends batches synchronously (no request overlapping).
If I use 1 thread, I see 5 requests per second (5 batches, 500 entities). At that point the Azure portal metrics show table latency below 100 ms, which is quite good.
If I increase the number of threads up to 12, I see a 12x increase in outgoing requests. This rate stays stable for a few minutes, but then, for some reason, I start being throttled: I see latency increase and the request rate drop.
Below you can see the account metrics; the highlighted point shows 2.31K (2,310) transactions (batches) per minute, which is 3,850 entities per second. If the thread count is increased up to 50, latency increases to 4 seconds and the transaction rate drops to 700 requests per second.
According to the documentation, I should be able to send up to 20K transactions per second within one account (my test account is used only for this performance test). 20K batches would mean 2M entities per second. So the question is: why am I being throttled at fewer than 4K entities per second?
Test details:
Azure Datacenter: West US 2.
My location: Los Angeles.
The app is written in C# and uses the CosmosDB.Table NuGet package with the following configuration: ServicePointManager.DefaultConnectionLimit = 250, and Nagle's algorithm is disabled.
The host machine is quite powerful with a 1 Gb internet link (i7, 8 cores; no high CPU or memory usage is observed during the test).
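For reference, here is a minimal sketch of the kind of test loop described above, assuming the Microsoft.Azure.Cosmos.Table SDK; the connection string, table name, payload and thread count are placeholders, not the actual test code:
using System;
using System.Net;
using System.Threading;
using Microsoft.Azure.Cosmos.Table;

class TablePerfTest
{
    static long partitionCounter = 0;

    static void Main()
    {
        ServicePointManager.DefaultConnectionLimit = 250;
        ServicePointManager.UseNagleAlgorithm = false;

        var account = CloudStorageAccount.Parse("<connection string>"); // placeholder
        var table = account.CreateCloudTableClient().GetTableReference("perftest");
        table.CreateIfNotExists();

        // Each worker sends batches synchronously; every batch gets a fresh partition key.
        for (int t = 0; t < 12; t++)
            new Thread(() => { while (true) SendBatch(table); }).Start();
    }

    static void SendBatch(CloudTable table)
    {
        string pk = Interlocked.Increment(ref partitionCounter).ToString();
        var batch = new TableBatchOperation();
        for (int i = 0; i < 100; i++)
        {
            var entity = new DynamicTableEntity(pk, i.ToString());
            entity.Properties["Payload"] = new EntityProperty(new string('x', 100)); // ~0.1 kB
            batch.Insert(entity);
        }
        table.ExecuteBatch(batch); // one round-trip per 100 entities
    }
}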
P.S. I've read the docs:
The system's ability to handle a sudden burst of traffic to a partition is limited by the scalability of a single partition server until the load balancing operation kicks-in and rebalances the partition key range.
and waited for 30 mins, but the situation didn't change.
EDIT
I got a comment that E2E latency doesn't necessarily indicate a server-side problem.
So below is a new graph which shows not only the E2E latency but also the server latency. As you can see, they are almost identical, which makes me think the source of the problem is not on the client side.

Observing frequent log file switches despite increasing the redo log size

We had redo logs sized 256 MB, then bumped them up to 512 MB and eventually 1024 MB, and we currently have 8 log groups. Despite that, we are observing a log switch happening every minute, and it is eating into our performance.
A snapshot from the AWR report:
Load Profile
                 Per Second   Per Transaction   Per Exec   Per Call
  DB Time(s):           1.0               0.1       0.00       0.01
  DB CPU(s):            0.6               0.1       0.00       0.01
  Redo size:       34,893.0           4,609.0

Instance Activity Stats - Thread Activity
Statistics identified by '(derived)' come from sources other than SYSSTAT
  Statistic                  Total   per Hour
  log switches (derived)        82      59.88
Any suggestions on how to reduce the number of log file switches? I have read that ideally there should be about one switch every 15-20 minutes.
34,893 bytes of redo per second = 125,614,800 bytes per hour, which is about 120 MB, nowhere near the size of one redo log group.
Based on this and the size of the redo logs, I would say something is forcing log switches periodically. The built-in parameter archive_lag_target forces a log switch after the specified number of seconds elapses; that is the first thing I would check. Other than that, it could be something else logging in to the database and forcing a log switch manually, e.g. a cron job (60 log switches per 60 minutes is quite suspicious).
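As a quick back-of-the-envelope check (a sketch; the figures come from the AWR snapshot above and the 1024 MB log size mentioned in the question), redo volume alone cannot explain one switch per minute:
using System;

class RedoSwitchCheck
{
    static void Main()
    {
        double redoBytesPerSecond = 34893.0;          // from the AWR Load Profile
        double logGroupBytes = 1024.0 * 1024 * 1024;  // 1024 MB redo log groups
        double observedSwitchesPerHour = 59.88;       // from the AWR thread activity

        double redoPerHour = redoBytesPerSecond * 3600;                  // ~120 MB/hour
        double switchesExplainedByVolume = redoPerHour / logGroupBytes;  // ~0.12/hour

        Console.WriteLine($"Redo generated per hour: {redoPerHour / (1024 * 1024):F1} MB");
        Console.WriteLine($"Switches explained by volume: {switchesExplainedByVolume:F2}/hour");
        Console.WriteLine($"Observed switches:            {observedSwitchesPerHour:F2}/hour");
        // ~0.12 vs ~60: something other than redo volume (e.g. archive_lag_target
        // or a scheduled manual switch) must be forcing the log switches.
    }
}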

Cassandra key cache optimization

I want to optimize the key cache in Cassandra. I know about key_cache_size_in_mb: the capacity in megabytes of all key caches on the node. While increasing it, which stats do I need to look at in order to determine whether the increase is actually benefiting the system?
Currently, with the default settings, I am getting:
Key Cache : entries 20342, size 17.51 MB, capacity 100 MB, 4806 hits, 29618 requests, 0.162 recent hit rate, 14400 save period in seconds.
I have OpsCenter up and running too.
Thanks
Look at the number of hits and the recent hit rate.
http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra
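One way to tell whether a larger key cache is actually helping is to compare the hit rate over an interval before and after the change, rather than the cumulative numbers. A minimal sketch, with the first snapshot's counters copied from the question and the second snapshot's values made up for illustration:
using System;

class KeyCacheDelta
{
    // Hit rate over the interval between two nodetool info snapshots.
    static double IntervalHitRate(long hitsBefore, long requestsBefore,
                                  long hitsAfter, long requestsAfter)
        => (double)(hitsAfter - hitsBefore) / (requestsAfter - requestsBefore);

    static void Main()
    {
        // First snapshot: the numbers from the question.
        long hits1 = 4806, requests1 = 29618;

        // Second snapshot taken some time later (hypothetical values).
        long hits2 = 9000, requests2 = 40000;

        Console.WriteLine($"Cumulative hit rate: {(double)hits2 / requests2:F3}");
        Console.WriteLine($"Interval hit rate:   {IntervalHitRate(hits1, requests1, hits2, requests2):F3}");
        // If the interval hit rate climbs after raising key_cache_size_in_mb
        // (and "size" stays below "capacity"), the extra memory is paying off.
    }
}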

Elasticsearch indexing performance issues

We have been facing some performance issues with Elasticsearch over the last couple of days. As you can see on the screenshot, the indexing rate drops significantly after the index reaches a certain size. At normal speed, we index around 3,000 logs per second. When the index we write to reaches a size of about ~10 GB, the rate drops.
We are using time-based indices, and around 00:00, when a new index is created by Logstash, the rate climbs back to ~3,000 logs per second (that's why we think it's somehow related to the size of the index).
Server stats show nothing unusual for CPU or memory (they are the same during the drop phases), but one of the servers shows a lot of I/O wait. Our Elasticsearch config is quite standard, with some adjustments for indexing performance (taken from the ES guide):
# If your index is on spinning platter drives, decrease this to one
# Reference / index-modules-merge
index.merge.scheduler.max_thread_count: 1
# allows larger segments to flush and decrease merge pressure
index.refresh_interval: 5s
# increase threshold_size from default when you are > ES 1.3.2
index.translog.flush_threshold_size: 1000mb
# JVM settings (ES_HEAP_SIZE is 50% of RAM)
bootstrap.mlockall: true
We use two nodes, both with 8 GB of RAM, 2 CPU cores and 300 GB HDDs (dev environment).
I have already seen clusters with much bigger indices than ours. Do you have any idea what we could do to fix the issue?
BR
Edit:
Just ran into the performance issues again. top sometimes shows around 60% wa (I/O wait), but iotop only reports about 1,000 kB/s read and write at most. I have no idea where these waits are coming from.

Windows Server Appfabric Caching Timeouts

We have an application that uses Windows Server AppFabric Caching. The cache host is on the local machine; local cache is not enabled. Here is the configuration in code (there is none in .config):
DataCacheFactoryConfiguration configuration = new DataCacheFactoryConfiguration();
configuration.Servers = servers;                                 // cache host on the local machine
configuration.MaxConnectionsToServer = 100;                      // 100 is the maximum
configuration.RequestTimeout = TimeSpan.FromMilliseconds(1000);
Object expiration on PutAndUnLock is two minutes.
Here are some typical performance monitor values:
Total Data Size Bytes: 700 MB
Total GetAndLock Requests/sec: average 4
Total Eviction Runs: 0
Total Evicted Objects: 0
Total Object Count: either 0 or 1.8447e+019 (suspicious, eh?); I think the active object count should be about 500.
This is running on a virtual machine; I don't think we are hardware-constrained at all.
The problem: every few minutes (the interval varies from 1 to 20 minutes), for a period of one second or so, all requests (Get, GetAndLock, Put, PutAndLock) time out.
The only remedy I've seen online is to increase RequestTimeout. If we increase it to 2 seconds, the problem seems to happen somewhat less frequently, but it still occurs. We can't increase the timeout any further, because we need the remaining time to create the object from scratch when the cache call times out.
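For what it's worth, one possible mitigation for short timeout spikes like this is to retry the cache call once or twice with a brief back-off and fall back to rebuilding the object only if the retries also fail, instead of raising RequestTimeout further. A minimal sketch; the helper name, retry count and back-off are illustrative, not from the question:
using System;
using System.Threading;
using Microsoft.ApplicationServer.Caching;

static class CacheRetry
{
    // Try the cache a few times before giving up and rebuilding the value.
    public static object GetWithRetry(DataCache cache, string key,
                                      Func<object> rebuild, int attempts = 3)
    {
        for (int i = 0; i < attempts; i++)
        {
            try
            {
                return cache.Get(key);
            }
            catch (DataCacheException)
            {
                // The observed outages last about a second, so a short pause
                // before retrying is usually enough to ride out the spike.
                Thread.Sleep(200);
            }
        }
        return rebuild(); // cache unavailable: build the object from scratch
    }
}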
