How can I improve Cassandra read/write performance?

I am working on a single-node Cassandra setup. The machine I am using has a 4-core CPU with 8 GB of RAM.
The properties of the column family I am using are:
Keyspace: keyspace1:
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Durable Writes: true
Options: [datacenter1:1]
Column Families:
ColumnFamily: colfamily (Super)
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds / keys to save : 100000.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 200000.0/14400
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Replicate on write: true
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
I tried to insert 1 million rows into the column family. Write throughput is around 2,500 operations per second, while read throughput is only around 380 per second.
How can I improve both the read and write throughput?

380 reads per second means that you are reading data from the hard drive with a low cache hit rate, or that the OS is swapping. Check the Cassandra statistics to find out the cache usage:
./nodetool -host <IP> cfstats
You have enabled both the row cache and the key cache. The row cache reads the whole row into RAM, meaning all columns for a given row key, so in this case you can disable the key cache. But make sure that you have enough free RAM to handle row caching.
If you are running Cassandra with the off-heap cache (the default from 1.x), it is possible that the row cache is very large and the OS has started swapping. Check the swap size; swapping can decrease performance significantly.
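To confirm the swapping hypothesis, standard Linux tools are enough (a quick check, nothing Cassandra-specific):
free -m     # swap usage should stay near zero
vmstat 1 5  # non-zero si/so columns mean the box is actively swapping
In the cfstats output, look at the key cache and row cache hit rates reported for the column family (the exact field names vary slightly between Cassandra versions); a hit rate close to zero means the caches are being filled but not reused.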


How to obtain database info and statistics using mgconsole?

When I use Memgraph Lab I can see the database statistics at the top of the window.
How can I obtain info such as the Memgraph version, number of nodes, relationships, etc. when I'm using mgconsole?
To get the Memgraph version that is being used, run the SHOW VERSION; query.
To get information about the storage of the current instance, use SHOW STORAGE INFO;. This query will give you the following info (an example invocation follows the list):
vertex_count - Number of vertices stored
edge_count - Number of edges stored
average_degree - Average number of relationships of a single node
memory_usage - Amount of RAM used, as reported by the OS (in bytes)
disk_usage - Amount of disk space used by the data directory (in bytes)
memory_allocated - Number of bytes allocated by the instance
allocation_limit - Current allocation limit in bytes set for this instance
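For example, the queries can be typed into an interactive mgconsole session, or piped in from a shell (the host and port below are the default Bolt endpoint and may differ in your setup):
echo "SHOW STORAGE INFO;" | mgconsole --host 127.0.0.1 --port 7687
SHOW VERSION; can be run the same way and simply prints the version string of the running instance.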

ClickHouse disk-based dictionaries

Problem description:
I will have regular inserts into a table (let's call it 'daily'). I am planning to build a materialized view (MV) on top of it, and I need data from another static table ('metadata').
Option 1: Do a join, but that would load the 'metadata' table into memory.
Option 2: Use a dictionary.
How do I restrict a dictionary to disk, with only 'X' bytes loaded in memory?
The dictionary has 600 million rows and takes around 100 GB of memory. I am using low-end machines and do not want to use that much RAM.
I am okay with the extra latency.
How do I solve this? Is there any setting for it?
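One commonly suggested direction (a sketch only, not a tested answer for this data size; the dictionary name, columns, and connection parameters below are hypothetical) is a cache-type dictionary layout, which keeps only a bounded number of recently used entries in memory and fetches misses from the source table on demand; ClickHouse also has an SSD_CACHE layout that spills the cache to local disk:
CREATE DICTIONARY metadata_dict
(
    id UInt64,
    value String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' DB 'default' TABLE 'metadata'))
LIFETIME(MIN 300 MAX 600)
LAYOUT(CACHE(SIZE_IN_CELLS 1000000));
The MV can then call dictGet('metadata_dict', 'value', id) instead of joining 'metadata', so only the cached cells stay resident in RAM.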

Fetching data (on the order of 600 million rows) from a Greenplum table in Apache NiFi is giving 'GC overhead limit exceeded'

I am trying to fetch data from a Greenplum table using Apache NiFi's QueryDatabaseTableRecord processor. I am seeing a 'GC overhead limit exceeded' error and the NiFi web page becomes unresponsive.
I have set the 'Fetch Size' property to 10000, but it seems the property is not being used in this case.
Other settings:
Database Type : Generic
Max Rows Per Flow File : 1000000
Output Batch Size : 2
JVM min/max memory allocation is 4 GB/8 GB
Is there an alternative way to avoid the GC errors for this task?
This is a clear case of the 'Fetch Size' parameter not being applied; see the processor documentation on this.
Try testing the JDBC setFetchSize on its own to see whether the driver honors it.
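A minimal standalone sketch of such a test (connection URL, credentials, and table name are placeholders; Greenplum is usually reached through the PostgreSQL JDBC driver, which only honors the fetch size when autocommit is disabled):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; the PostgreSQL JDBC driver works for Greenplum.
        String url = "jdbc:postgresql://gp-master:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            conn.setAutoCommit(false);        // required for cursor-based fetching in pgjdbc
            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(10000);     // same value as NiFi's 'Fetch Size' property
                long rows = 0;
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                    while (rs.next()) {
                        rows++;               // rows should stream in 10k batches, keeping the heap flat
                    }
                }
                System.out.println("Rows read: " + rows);
            }
        }
    }
}
If heap usage stays flat here, the problem is more likely in how NiFi is batching the result set than in the driver itself.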

HBase table size decreases after a period of time

We have a problem with storing data in HBase. We've taken the following steps:
A big CSV file (size: 20 GB) is processed by a Spark application, producing HFiles (resulting data size: 180 GB).
The table is created with the command: create 'TABLE_NAME', {'NAME'=>'cf', 'COMPRESSION'=>'SNAPPY'}
Data from the created HFiles is bulk-loaded with the command: hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024 hdfs://ip:8020/path TABLE_NAME
Right after loading, the table size is 180 GB; however, after some period of time (yesterday it ran at 8 pm, two days ago around 8 am) a process is launched that compacts the data down to 14 GB.
My question is: what is the name of this process? Is it a major compaction? I ask because I am trying to trigger compaction manually (major_compact and compact), but this is the output from the command run on the uncompacted table:
hbase(main):001:0> major_compact 'TEST_TYMEK_CRM_ACTION_HISTORY'
0 row(s) in 1.5120 seconds
This is the compaction process. I can suggest the following reason for such a big difference in table size: the Spark application does not apply a compression codec to the HFiles it writes, because the table's compression setting is applied when HBase writes files, not retroactively to files created outside of it. Attaching the HFiles to the table does not change their format (all files in HDFS are immutable), so the data is compressed only once the compaction process runs. You can monitor compactions via the HBase UI, which usually runs on port 60010 (16010 in HBase 1.0 and later).
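For reference, major_compact in the shell only schedules the compaction and returns immediately, which is why the output above shows '0 row(s)' after about a second even though nothing visible has happened yet. A minimal sketch of issuing and polling the same request from the Java Admin API (the configuration source and sleep interval are placeholder assumptions):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        TableName table = TableName.valueOf("TEST_TYMEK_CRM_ACTION_HISTORY");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            admin.majorCompact(table);                       // asynchronous: only queues the request
            Thread.sleep(5000);                              // give the region servers a moment to start
            System.out.println("Compaction state: " + admin.getCompactionState(table));
        }
    }
}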

How does the Cassandra key cache work when the same key exists in more than one SSTable?

1) As per the DataStax documentation, the key cache stores the primary key index entry for a row key.
2) In our case we have enough memory allocated for the key cache, and the same key is present in multiple SSTables with different columns.
3) If a number of calls are made that access this same key across multiple SSTables, how are the index entries stored in the key cache? Will it store entries for all the SSTables, or just for the SSTable from which the key was most recently accessed?
From the documentation:
The key cache holds the location of keys in memory on a per-column-family basis.
The key cache serves as an index for a key in every SSTable in which that key is present.
The key cache is maintained per SSTable, so it can save at least one disk seek per SSTable. Every key lookup ends up hitting at least the bloom filter of every SSTable; on a bloom filter hit, the key cache is checked so that the SSTable index lookup (pointers to a sample of keys, every 128th key by default) can be skipped.
Cassandra's read path goes like this:
Memtable -> Row cache (off heap) -> Bloom filter -> Key cache -> SSTable index [on key cache miss] -> Disk
The memtable, row cache, bloom filter, and key cache are all maintained in memory (either on heap or off heap), so they do not add a disk seek.
Every SSTable should be maintaining its own key cache entries. Source: slide 101; Source2: slide 23.
In case of a key cache miss, the SSTable index is used; it narrows down the 128-key range in which the key may lie. From there the disk seeks for the key start [anywhere from 1 to many].
I'll update the answer again if I get any clue about how Cassandra decides the key cache size of each SSTable; maybe [key_cache_conf / no_of_sstables]?
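For reference, a hedged note on the sizing question: from Cassandra 1.1 onward the key cache is a single global cache shared by all column families and SSTables, and its capacity is set in cassandra.yaml rather than per SSTable (the values below are the documented defaults; verify against your version):
key_cache_size_in_mb:          # empty means auto: the smaller of 5% of the heap or 100 MB
key_cache_save_period: 14400   # seconds between saves of cached keys to disk
Per-column-family hit rates can still be inspected with nodetool cfstats.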
