Cassandra - Optimizing hardware in cluster - parallel-processing

I have been able to get Cassandra working on a MacBook cluster (for fun). Now I am trying to operationalize this for research.
Currently, I have a single Linux machine running an Intel 3770K (LGA 1150). I would like to create a cluster for the purpose of running Cassandra. Can I use cheap machines (2-3 nodes with an Intel i5, 4 TB HDD, and 8 GB RAM)? What is the best configuration to do this right the first time?
Is it possible to use the new nodes to run Cassandra and have the current machine just use the data for analysis?

8 GB of RAM is pretty low. I'd recommend a minimum of 16 GB (the more the better) so you can safely allocate an 8 GB heap while leaving room for the off-heap structures. Especially if you want to store multiple TB of data per node, you want more than 8 GB. Some data models are worse than others. If you're using non-SSDs, be sure to dedicate a drive to the commitlog so it isn't competing with the data directories. It will work with what you listed, but you won't get good performance once there's a decent amount of data.
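As a rough illustration of the two knobs involved (the paths and sizes below are assumptions, not from the question), the heap is set in cassandra-env.sh and the commitlog location in cassandra.yaml:
# cassandra-env.sh - pin the heap explicitly instead of relying on auto-sizing
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
# cassandra.yaml - keep the commitlog on its own drive so it doesn't
# compete with the data directories for I/O (paths are hypothetical)
commitlog_directory: /mnt/disk2/cassandra/commitlog
data_file_directories:
    - /mnt/disk1/cassandra/data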
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAbout_c.html
You can create multiple data centers to separate your different workloads; the DSE workload snitch will do that for you if you're using DataStax Enterprise.
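For example (the data center names and replication factors here are illustrative, not from the original answer), a keyspace can then replicate across both workloads:
CREATE KEYSPACE research
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'Cassandra': 2,
        'Analytics': 1
    };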

Related

Can we give more than 32 GB of memory to a dedicated machine learning node in Elasticsearch?

As per the documentation, the Elasticsearch team recommends that every Elasticsearch node have slightly less than 32 GB of memory. My question is: does this apply to a dedicated machine learning node as well? And if we do give more than 32 GB of memory to a dedicated machine learning node, what might the repercussions be?
No more than 32 GB of heap, not RAM; the right amount is between 26 GB and 30 GB (it differs between systems, but the default cap is 31 GB).
For dedicated ML nodes it's no different than for data nodes: the same default heap computation applies, i.e. a 31 GB max heap.
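As a concrete sketch (the 30 GB figure is just an example kept below the compressed-oops cutoff; adjust for your machine), the heap is pinned in config/jvm.options with matching minimum and maximum:
# config/jvm.options - set min and max heap to the same value
-Xms30g
-Xmx30g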

Elastic Cloud vs Elasticsearch on a local server

Do Elastic Cloud and an Elasticsearch setup on a local machine provide the same speed for data gathering?
Well, it depends on your local setup:
How many machines/nodes you have
How much CPU and memory per node, and, most importantly, whether the nodes have SSDs
What your network looks like
In Elastic Cloud you can choose the amount of memory and storage, but not the number of nodes (the node count follows from the memory, because it's better to have one node with 32 GB of memory than three nodes with 10 GB each, for instance). It's also important that the Elastic Cloud setup uses SSDs.
So again, it all depends: a local setup will give you more flexibility and control, but the cloud could simplify your life.
One more option would be to go with AWS or Azure, since you'd be able to add and remove nodes on demand, which makes it a bit easier to experiment and see which setup works better for you.
To sum up: if you have the same setup locally and in the cloud, there will be no difference in terms of performance; the only thing that differs is latency.

MemSQL performance issues

I have a single-node MemSQL install with one master aggregator and two leaves (all on a single box). The machine has 2 cores, 16 GB RAM, and the MemSQL columnstore data is ~7 GB (coming from a 21 GB CSV). When running queries on the data, memory usage caps at ~2150 MB (11 GB sitting free). I've configured both leaves to have maximum_memory = 7000 in the memsql.cnf files for both nodes (memsql-optimize does similar). During query execution, the master aggregator sits at 100% CPU, with the leaves at 0-8% CPU.
This does not seem like an efficient use of system resources, but I'm not sure what I can do to configure the system or MemSQL to make more efficient use of CPU or memory. Any help would be greatly appreciated!
If during query execution your machine is at 100% CPU (on all cores), it doesn't really matter which MemSQL node is using it; your workload throughput is still bottlenecked on CPU. However, for most queries you wouldn't expect most of the CPU use to be on the aggregator, so you may want to take a look at the EXPLAIN or PROFILE of your queries.
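For example (the table and columns are hypothetical), in the MemSQL client:
-- see the chosen query plan
EXPLAIN SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
-- run the query with per-operator timings, then display them
PROFILE SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
SHOW PROFILE;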
Columnstore data is cached in memory as part of the OS file cache - it isn't counted as memory reserved by MemSQL, which is why your memory usage is less than the size of the columnstore data.
It turned out my database was coming from somewhere other than the current MemSQL install (perhaps an older cluster configuration), despite there being only a single MemSQL cluster on the machine. The Databases section in the Web UI displayed no databases/tables, yet my queries succeeded with the expected answers.
Dropping the database and reloading from CSV remedied the situation. All cores are now used during query execution.
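A sketch of that drop-and-reload (database, table, and file names are placeholders):
DROP DATABASE research;
CREATE DATABASE research;
USE research;
-- recreate the columnstore table, then reload the CSV
LOAD DATA INFILE '/data/export.csv'
INTO TABLE events
FIELDS TERMINATED BY ',';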

How to increase AeroSpark read performance?

I am using the latest AeroSpark connector to work with Aerospike and Spark ML. But after inserting around 60M records into Aerospike, read operations take a very long time. For example, fetching around 500K records from a set that contains 60M records takes AeroSpark ~30 minutes. Looking at the htop output, Aerospike uses only 7% of the CPU.
Each record contains about 1 KB of data. Aerospike and Spark are hosted on the same node. The data is filtered by a secondary index.
How can I speed up read performance? It seems AeroSpark is working with only one thread; how can I parallelize this job? Any suggestions?
Aerospike conf:
memory-size 8G
default-ttl 30d
storage-engine device {
    file /vol/rmla.data
    filesize 900G
}
Without knowing anything about your server, and with just a snippet of config, I'll stick to some generic recommendations that should improve your experience.
Disk IO
You are clearly bound by the read speed from your storage media, which you declared to be a file. If you're storing the data on disk, you can either use file or device in the storage-engine device config block.
There is a big difference in read and write latency between a file on an HDD versus raw device access to an SSD. Typically Aerospike is used with data stored on enterprise-grade SSD devices. Read the section in the operations manual about initializing and setting up the drive. Declaring multiple devices for the namespace will give you a linear performance boost (two drives will have double the read and write throughput of one of the same kind).
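For instance, the namespace's storage engine can be pointed at raw SSD devices instead of a file (the device paths and block size below are hypothetical):
storage-engine device {
    device /dev/nvme0n1
    device /dev/nvme1n1
    write-block-size 128K
}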
In Amazon EC2 you could use the c3, i2, r3, or i3 instance families for this purpose. The ephemeral SSD devices of EC2 instances don't need to be over-provisioned, have their RAID turned off, etc. They only need to be initialized before they're first used. Do not use EBS drives for primary storage, as they're too slow.
Cluster Configuration
The Spark connector uses lots of scan operations. Make sure that you've configured scan-threads under your service config block to the number of cores. If you don't know how many cores you have, do cat /proc/cpuinfo. If Spark is the only client using the Aerospike cluster, you can tune the scan threads higher.
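A sketch of that setting in aerospike.conf, assuming an 8-core box:
service {
    scan-threads 8
}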
Connector Configuration
You can modify the connector config options for lower write latency. Optionally set aerospike.commitLevel to CommitLevel.COMMIT_MASTER.
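A hypothetical sketch of passing connector options through SparkConf (option names and the accepted value string should be verified against your aerospark version; the host, port, and namespace are assumptions):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("aerospike.seedhost", "127.0.0.1")
  .set("aerospike.port", "3000")
  .set("aerospike.namespace", "test")
  .set("aerospike.commitLevel", "CommitLevel.COMMIT_MASTER")  // master-only commit for lower write latency
val sc = new SparkContext(conf)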
Upgrade Version
As of November 28, 2016, aerospike/aerospark supports Spark 2.0. Make sure you're using the latest code.
Note: See the new tutorial for Aerospark on the Aerospike website.

How to set up Apache Spark to use the local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset which doesn't fit into the 8 GB of RAM on my work computer, but that computer has a 1 TB local hard disk.
The link below from the official documentation mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages.
For me, computation time is not a priority at all; fitting the data onto a single computer's RAM/hard disk for processing is more important, due to the lack of alternative options.
Note:
I am looking for a solution which doesn't include the below items
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLLIB to build machine learning models.
I am looking for real-life, practical solutions where people have successfully used Spark to operate on data that doesn't fit in RAM, in standalone/local mode on a single computer. Has someone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building, etc. Can Spark be made to work the same way when the data is larger than RAM?
SAS persists the complete dataset to the hard disk in the ".sas7bdat" format; can Spark do similar persistence to hard disk?
If this is possible, how to install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence levels as per your need. MEMORY_AND_DISK is what will solve your problem. If you want better performance, use MEMORY_AND_DISK_SER, which stores data in serialized form.
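A minimal sketch in Scala (the spill directory and input path are assumptions), showing a local-mode session that spills partitions which don't fit in RAM onto the 1 TB disk:
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // where shuffle and spill files go
  .getOrCreate()

// Build an RDD from the 50 GB file and persist it with disk fallback;
// partitions that don't fit in memory are written to local disk.
val data = spark.sparkContext.textFile("/mnt/bigdisk/dataset.csv")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

println(data.count())  // first action materializes (and spills) the cached partitions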
