I have installed DataStax Cassandra on 2 independent machines (one with 16 GB RAM and the other with 32 GB RAM) and am going with mostly the default configuration.
I have created a table with some 700 columns. When I try to insert records using Java, it only manages about 1000 records per 30 seconds, which seems very low to me; as per the DataStax benchmarks it should be around 18000+. To my surprise, performance is the same on both the 32 GB and the 16 GB machines.
I am new to Cassandra; can anyone help me in this regard? I feel I am doing something wrong with the cassandra.yaml configuration.
I did a benchmarking and tuning exercise on Cassandra some time ago and found some useful settings, which are mentioned below.
In Cassandra, data distribution is driven by policies. The default is a combination of a round-robin and a token-aware policy, which works best in almost all cases (see the sketch after these points). If you want to customize data distribution, it is possible to write a new distribution strategy, e.g. distribute the data based on a location or on an attribute, whichever best fits your requirement.
Cassandra uses Bloom filters to determine whether an SSTable has data for a particular row. We used a bloom_filter_fp_chance of 0.1 to maintain a balance between efficiency and overhead.
Consistency level is a key parameter in NoSQL databases. Try QUORUM or ONE (also shown in the sketch after these points).
Other JVM tuning options, like heap size and survivor ratio, should be set to optimal values to achieve maximum performance.
If a lot of memory is available, the memtable size can be increased so that it fits into memory, which will improve performance. The interval for flushing memtables to disk should be high enough that it doesn't perform unnecessary I/O operations.
Concurrency settings in Cassandra are important for scaling. Based on our tests and observations, we found that Cassandra performs better with concurrency set to number of cores * 5 and native_transport_max_threads set to 256.
Follow the additional tuning recommended for Cassandra, like disabling swap, ulimit settings, and compaction settings.
The replication factor in Cassandra should be equal to the number of nodes in the cluster to achieve maximum throughput of the system.
These are mostly for insertion, with a small impact on reads.
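To make the driver-side points above (load balancing policy and consistency level) concrete, here is a minimal sketch assuming the DataStax Java driver 3.x; the contact point, keyspace, table and column names are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    // Token-aware routing wrapped around DC-aware round-robin (this is also the driver default).
    Cluster cluster = Cluster.builder()
            .addContactPoint("10.0.0.1")
            .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
            .build();
    Session session = cluster.connect("my_keyspace");

    // Prepare once and pick an explicit consistency level: ONE favors write throughput,
    // QUORUM favors stronger consistency.
    PreparedStatement insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")
            .setConsistencyLevel(ConsistencyLevel.ONE);
    session.execute(insert.bind(1, "some value"));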
I hope this will help you :)
Are you using async writes?
Try running cassandra-stress; that way you can isolate client-side issues.
Another option is Brian's cassandra-loader:
https://github.com/brianmhess/cassandra-loader
Since you are writing in Java, use Brian's code as a best practice example.
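As a rough sketch of throttled asynchronous inserts with the DataStax Java driver 3.x (it assumes a connected Session named session; the table, columns, dummy data loop and batch size of 128 are all placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;

    PreparedStatement insert = session.prepare(
            "INSERT INTO my_keyspace.my_table (id, value) VALUES (?, ?)");

    // Fire inserts asynchronously, draining every 128 requests so the number of
    // writes in flight stays bounded.
    List<ResultSetFuture> inFlight = new ArrayList<>();
    for (int i = 0; i < 100000; i++) {                                  // stand-in for your real data loop
        inFlight.add(session.executeAsync(insert.bind(i, "value-" + i)));
        if (inFlight.size() >= 128) {
            for (ResultSetFuture f : inFlight) f.getUninterruptibly();  // wait for this batch to land
            inFlight.clear();
        }
    }
    for (ResultSetFuture f : inFlight) f.getUninterruptibly();          // wait for the tail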
Related
I am starting a project where I want to load a lot of data into an Apache Ignite cache to perform certain computations. My original data load will be about 40 GB, and that may grow 4- or 5-fold at certain times. I looked through the Ignite documentation and didn't find anything regarding cache size limitations. So, would it be fair to assume that as long as I have enough resources (CPUs and RAM) I can add as many nodes as necessary without compromising performance, which in my case is the speed of computations?
Yes, Ignite scales horizontally. There's no explicit limit on the capacity of a table.
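For instance (a sketch only; the cache name and value type are placeholders), every additional JVM that starts an Ignite node with the same discovery settings joins the cluster, and a partitioned cache is automatically spread across whatever nodes are present:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    // Start (or join) a cluster node with the default configuration.
    Ignite ignite = Ignition.start();

    // The cache is partitioned across all server nodes currently in the cluster.
    IgniteCache<Long, byte[]> cache = ignite.getOrCreateCache("computeData");
    cache.put(1L, new byte[] { 42 });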
We have a 5-node DSE Cassandra cluster and an application whose job is to write asynchronously to keyspace A (which is on an HDD), and read synchronously from keyspace B (which is on an SSD). Reads from the table on keyspace B are latency-sensitive.
Additional info:
The table on A is using TWCS with 48h windows, while the table on keyspace B is using LCS with default settings
Spark jobs partition reads in chunks of 20h at most
Both tables are using TDE with AES256 keys and 1KB chunks
Azul Zing is being used as the JVM with default settings apart from heap sizing and GC logging
With this scenario alone, the read latencies from keyspace B are fine throughout the day, but every day we have a Spark job that reads from keyspace A and writes to B. The moment the Spark executors "attack" keyspace A, read latencies from keyspace B suffer a bit (the 99th percentile goes from 8-12 ms to 130 ms for a few seconds).
My question is: which cassandra.yaml properties would likely help the most in reducing the read latencies on keyspace B just for the moment the Spark job starts? I've been trying different memtable/commitlog settings, but haven't been able to lower the read latency to acceptable levels.
It's hard to generalize without knowing why your latency hurts; if we could, we'd bake those defaults into the database.
However, I'll try to guess:
Throttle down concurrent reads so there are fewer concurrent requests - this will trade throughput for more consistent performance
if your disk is busy, consider smaller compression chunk sizes
if you're seeing GC pauses, consider tuning your JVM - the CASSANDRA-8150 JIRA ticket has some good suggestions
if your sstables-per-read is more than a few, reconsider your data model to keep your partitions from spanning multiple TWCS windows
make sure your key cache is enabled. If you can spare the heap, raise its size - it may help.
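On the key cache point: it is enabled per table via the caching option, while the overall cache size is governed by key_cache_size_in_mb in cassandra.yaml. A sketch through the Java driver, assuming a connected Session named session and a placeholder table name:

    // Keep the key cache on for the read-heavy table; leave the row cache off.
    session.execute(
        "ALTER TABLE keyspace_b.my_table WITH caching = " +
        "{'keys': 'ALL', 'rows_per_partition': 'NONE'}");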
Jeff's answer should be your starting point, but if that doesn't solve it, consider moving your Spark job to an off-peak time. Keep in mind that LCS is optimized for read-heavy tables, but from the moment Spark starts to "migrate" the data, the table using LCS will for some time (until the Spark job finishes) become a write-heavy table. That would be an anti-pattern for LCS. I can't know for sure without looking at server details, but I would say that, due to the sheer number of SSTables created during the Spark job, LCS is not able to keep up with compaction and maintain the usual read latency.
If you can't schedule the Spark job at an off-peak time, then you should consider changing the compaction strategy of the table on keyspace B to STCS.
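If you do go that route, a sketch of the change through the Java driver (a connected Session named session is assumed; the keyspace/table names are placeholders) - compaction strategy is a per-table setting, so only the affected table needs the ALTER:

    session.execute(
        "ALTER TABLE keyspace_b.my_table WITH compaction = " +
        "{'class': 'SizeTieredCompactionStrategy'}");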
I am planning to spin up my development cluster for trend analysis for an infrastructure monitoring application, which I am planning to build using Spark for analysing failure trends and Cassandra for storing the incoming data and the analysed data.
Consider collecting performance metrics from around 25000 machines/servers (probably a set of the same applications on different servers). I am expecting performance metrics of about 2 MB/sec from each machine, which I am planning to push into a Cassandra table with timestamp and server as the primary key, and application along with some important metrics as clustering keys. I will be running Spark jobs on top of this stored information for failure trend analysis of the performance metrics.
Coming to the question: how many nodes (machines), and of what configuration in terms of CPU and memory, do I need to kick-start my cluster considering the above scenario?
Cassandra needs a well planned out data model for things to run well. It is very much worth spending time planning things out at this stage before you have a large data set and find out you probably would have done better re-arranging the data model!
The "general" rule of thumb is you shape your model to the queries, while paying attention to avoiding things like really large rows, large deletes, batches and such the like which can have big performance penalties.
The docs give a good start on planning and testing that you would probably find useful. I would also recommend the cassandra-stress tool. You can use it to push performance tests against your Cassandra cluster to check latencies and any performance problems. You can use your own schema too, which I personally think is super useful!
If you are using cloud-based hardware like AWS, then it's relatively easy to scale up/down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically.
I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side too, so those nodes would be running plain Cassandra (lower hardware specs). If, however, you are using the DataStax Enterprise version (where you can run nodes in Spark "mode"), then you will need beefier hardware to cope with the additional load of the Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.
In several sources on the internet, it's explained that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general, once we go beyond 1 TB we must start thinking Hadoop (HDFS) and not NoSQL.
Besides the architecture and the fact that HDFS supports batch processing and that most NoSQL technologies (e.g. Cassandra) perform random I/O, and besides the schema design differences, why can't NoSQL Solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a Data Lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1TB/node for spinning disk and 3TB/node when using SSD.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore there are a few JIRA tickets to improve server density in the future.
There is a limit right now for server density in Cassandra because of:
repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely to compute the Merkle tree, a binary tree of digests). But the issue of repair is mostly solved with incremental repair introduced in Cassandra 2.1
compaction. With an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of deprecated or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has some tuning knobs to stop compacting data after a time threshold (a sketch follows this list). A few people are using DateTiered compaction in production with densities up to 10 TB/node
node rebuild. Imagine one node crashes and is completely lost; you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer it takes to rebuild the node
load distribution. The more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This will greatly impact the node latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch scenario that takes 10h to complete, it is critical for a real-time database/application subject to a tight SLA
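As an illustration of the DateTieredCompactionStrategy knobs mentioned in the compaction point above (the table name and option values are placeholders, and a connected Java driver Session named session is assumed):

    // After max_sstable_age_days, SSTables are no longer considered for compaction,
    // which bounds the compaction work on dense time-series nodes.
    session.execute(
        "ALTER TABLE my_keyspace.my_timeseries WITH compaction = " +
        "{'class': 'DateTieredCompactionStrategy', " +
        "'base_time_seconds': 3600, 'max_sstable_age_days': 30}");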
I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), and I am running my queries through Spark SQL using hive context.
After checking the performance tuning documents on the Spark page and some clips from the latest Spark Summit, I decided to set the following configs in my spark-env:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
(As my tasks tend to be long, the overhead of starting multiple JVMs, one per worker, is much less than the total query time.) As I monitored the job progress, I realized that while the worker memory is 2.5 GB, the executors (one per worker) have a max memory of 512 MB (which is the default). I enlarged this value in my application as:
conf.set("spark.executor.memory", "2.5g");
I was trying to give the maximum available memory on each worker to its only executor, but I observed that my queries run slower than in the previous case (the default 512 MB). Changing 2.5g to 1g improved the performance; the time is close to, but still worse than, the 512 MB case. I guess what I am missing here is the relationship between WORKER_MEMORY and executor.memory.
Isn't it the case that the worker tries to split this memory among its executors (in my case, its only executor)? Or is there other work being done by the worker that needs memory?
What other important parameters do I need to look into and tune at this point to get the best response time out of my hardware? (I have read about the Kryo serializer and am about to try it - I am mainly concerned about memory-related settings and also the knobs related to the parallelism of my jobs.) As an example, for a simple scan-only query, Spark is worse than Hive (almost 3 times slower) while both are scanning the exact same table and file format. That is why I believe I am missing some parameters by leaving them at their defaults.
Any hint/suggestion would be highly appreciated.
SPARK_WORKER_CORES is shared across the instances. Increase the cores to, say, 8 - then you should see the kind of behavior (and performance) that you had anticipated.
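In spark-env terms that would be something like the following (same layout as the settings above, with only the core count changed per this suggestion):

    SPARK_WORKER_INSTANCES=4
    SPARK_WORKER_CORES=8
    SPARK_WORKER_MEMORY=2500M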