Cassandra Amazon EC2, Read Performance experiments - amazon-ec2

I need some help improving Cassandra read performance. I am concerned about degradation of read performance as the size of the column family increases. We have the following stats on single-node Cassandra.
Operating System: Linux - CentOS release 5.4 (Final)
Cassandra version: apache-cassandra-1.1.0
Java version: "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)
Cassandra Configuration: (cassandra.yaml)
rpc_server_type: hsha
disk_access_mode: mmap
concurrent_reads: 64
concurrent_writes: 32
Platform: Amazon-ec2/Rightscale m1.Xlarge instance with 4 ephemeral disks with raid0. (15 GB Total Memory, 4 Virtual Cores, 2 ECU , Total ECU = 8)
Experiment configurations:
I have tried to do some experiments with GC
Cassandra config:
10 GB RAM is allocated to Cassandra Heap, 3500MB is Heap NEW size.
JVM Config:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=1000"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=0"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=40"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops"
Result stats from OpsCenter community 2.0:
Read Requests 208 to 240 per second
Write Requests 18 to 28 per second
OS Load 24.5 to 25.85
Write Request Latency 127 to 160 micros
Read Request Latency 82202 to 94612 micros
OS Sent Network Traffic 44646 KB avg per second
OS Recieved Network Traffic 4338 KB avg per second
OS Disk Queue Size 13 to 15 requests
Read Requests Pending 25 to 32
OS Disk latency 48 to 56 ms
OS Disk Read Throughput 4.6 Mb per second
Disk IOPs Reads 420 per second
IOWait 80 % CPU avg
Idle 13 % CPU avg
Rowcache is disabled.
The Column Family
One of the column family i am only reading from is created through CLI
create column family XColFam
with column_type='Standard'
and comparator = CompositeType(BytesType,IntegerType)';"
Column family SSTable Size = 7.10 GB, SSTable Count = 2
XColFam column family has 59499904 no. of estimated row keys (most are utf8 literal with varying length, estimated through mx4jtools) with columns like thin in nature, with the value 0 bytes.....now.
Most of the rows should have very small number of columns, maybe 1 to 10, so with approx 20 to 30 bytes of 1st component of column name and 2nd is of 8 bytes integer....2nd component of composite column is dynamic could repeat but probability is low.......1st component repeats in varieties but number of columns in rows could be different.
I have tried SnappyCompression to compress the column family but there was no change in size.
I have a scheduled service that run for hours with 20 threads and make random read requests for multiple keys (for now its 2 keys per request) to this column family and read full rows, no column slice or etc.
I think it is not performing good now because it is processing too few request per minute. It was working better before when the column family size was not that big. It was around 3 to 4 GB.
I am afraid read performance degrade too fast with the increase in size of the column family.
I have also tried to tweak some GC and memory stuff, because before that I was having lots GC and CPU usage. When data size was smaller and there was very small iowait in wave form.
How can I increase the Cassandra performance. Your suggestions will be appreciated.

Look cassandra is relative I/O dependent.EC instances have "insuficient" I/O by design (Xen virtualization)
And my first recomendation is to use Cassandra on real hardware, where you have a control. e.g u can use SSD disk for CommitLog. Look at Cassandra hardware proposals.
However, switching to own hardware is a bit a radical option. To stay with Amazon try EBS
Amazon Elastic Block Store (EBS) provides block level storage volumes
for use with Amazon EC2 instances. Amazon EBS volumes are
network-attached, and persist independently from the life of an
instance. Amazon EBS provides highly available, highly reliable,
predictable storage volumes that can be attached to a running Amazon
EC2 instance and exposed as a device within the instance. Amazon EBS
is particularly suited for applications that require a database, file
system, or access to raw block level storage.
Amazon EBS allows you to create storage volumes from 1 GB to 1 TB that can be mounted as devices by Amazon EC2 instances. Multiple volumes can be mounted to the same instance. Amazon EBS enables you to provision a specific level of I/O performance if desired, by choosing a Provisioned IOPS volume. This allows you to predictably scale to thousands of IOPS per Amazon EC2 instance.
Also check out Cassandra Performance Testing on EC2

Short Answer: Row Cache and Key Caches.
If your data contains subsets that will be frequently read like most systems try to use row caches and key caches.
Row caches is a in memory cache, which stores the frequently read rows completely in memory. Please keep in mind, that this may have not a desired effect if you are data is spread out.
Key caches are generally more suited as it only stores the partition keys and their offsets on disk. This generally will help skip a lookup by Cassandra(no need to use partition indexes and partition summaries).
Try enabling key cache with the keyspace and table and check out your performance.

Related

High CPU usage on elasticsearch nodes

we have been using a 3 node Elasticsearch(7.6v) cluster running in docker container. I have been experiencing very high cpu usage on 2 nodes(97%) and moderate CPU load on the other node(55%). Hardware used are m5 xlarge servers.
There are 5 indices with 6 shards and 1 replica. The update operations take around 10 seconds even for updating a single field. similar case is with delete. however querying is quite fast. Is this because of high CPU load?
2 out of 5 indices, continuously undergo a update and write operations as they listen from a kafka stream. size of the indices are 15GB, 2Gb and the rest are around 100MB.
You need to provide more information to find the root cause:
All the ES nodes are running on different docker containers on the same host or different host?
Do you have resource limit on your ES docker containers?
How much heap size of ES and is it 50% of host machine RAM?
Node which have high CPU, holds the 2 write heavy indices which you mentioned?
what is the refresh interval of your indices which receives high indexing requests.
what is the segment size of your 15 GB indices, use https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-segments.html to get this info.
What all you have debugged so far and is there is any interesting info you want to share to find the issue?

Performance Tuning for large queries on Apache Drill

Am trying to execute large queries on apache drill and its taking more than 1 hour to run, even though CPUs, memory and I/O are all under utilized.
Size of underlying data in parquet format: 30 GB
Cluster Size: Single node
RAM: 512 GB, 300GB assigned to drill.
CPU: 48
File System: MapR
What are the possible tuning parameters that one can check to improve the performance of apache drill?

How to increase AeroSpark read performance?

I am using latest AeroSpark connector to work with AeroSpike and Spark ML. But when i have inserted round 60M records to AeroSpike, i got too big time amount in read operations. For example for fetch round 500K records from set that contains 60M records, AeroSpark spend ~30 mins. When i look at htop cmd output, AeroSpike use only 7% of CPU.
Each record round contains 1k of data. The AeroSpike and Spark hosted on the same node. The data filtered by secondary index.
How can i speed up performance in read operations? Seems AeroSpark is working only by one thread, how i can parallelize this job? Any suggestions?
AeroSpike conf:
memory-size 8G
default-ttl 30d
storage-engine device {
file /vol/rmla.data
filesize 900G
}
Without knowing anything about your server, and with just a snippet of config, I'll stick to some generic recommendations that should improve your experience.
Disk IO
You are clearly bound by the read speed from your storage media, which you declared to be a file. If you're storing the data on disk, you can either use file or device in the storage-engine device config block.
There is a big difference in the read and write latency between a file on a HDD versus raw device access to an SSD. Typically Aerospike is used with data stored on enterprise-grade SSD devices. Read the section in the operations manual about initializing and setting up the drive. Declaring multiple devices for the namespace with give you a linear performance boost (two drives will have double the read and write throughput of one of the same kind).
In Amazon EC2 you could use the c3, i2, r3, or i3 instance families for this purpose. The ephemeral SSD devices of EC2 instances don't need to be over-provisioned, have their RAID turned off, etc. They only need to be initialized before they're first used. Do not use EBS drives for primary storage, as they're too slow.
Cluster Configuration
The Spark connector uses lots of scan operations. Make sure that you've configured scan-threads under your service config block to the number of cores. If you don't know how many cores you have, do cat /proc/cpuinfo. If Spark is the only client using the Aerospike cluster, you can tune the scan threads higher.
Connector Configuration
You can modify the connector config options for lower write latency. Optionally set aerospike.commitLevel to CommitLevel.COMMIT_MASTER.
Upgrade Version
As of November 28 2016 aerospike/aerospark supports Spark 2.0. Make sure you're using the latest code.
Note: See the new tutorial for Aerospark on the Aerospike website.

Hadoop machine configuration

I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500GB, but to analyze 500GB data I don't need to go through 7TB of data again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReducer and HBase to process and store the data.
At the moment I have 5 machines of following configuration:
Data Node Server Configuration: 2-2.5 Ghz hexa core CPU, 48 GB RAM, 1 TB -7200 RPM (X 8)
Number of data nodes: 5
Name Node Server: Enterprise class server configuration (X 2) (1 additional for secondary
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Cpount * 1.2) /Comp Ratio
Assuming default vars
repl_count == 3 (default)
comp_ration = 3-4 (default)
Intermediate data size = 30%-50% of raw data size .-
1,2 factor - temp space
So for your first year, you will need 16.9 TB. You have 8TB*5 == 40. So space is not the topic.
Performance
5 Datanodes. Reading 1 TB takes in average 2.5 hours (source Hadoop - The definitive guide) on a single drive. 600 GB with one drive would be 1.5 hours. Estimating that you have replicated so that you can use all 5 nodes in parallel, it means reading the whole data with 5 nodes can get up to 18 minutes.
You may have to add some more time time depending on what you do with your queries and how have configured your data processing.
Memory consumution
48 GB is not much. The default RAM for many data nodes is starting from 128 GB. If you use the cluster only for processing, it might work out. Depending also a bit, how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you might run into heap errors.
To sum it up:
It depends much what you want to do with you cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes processing time for 600 GB data (as a baseline - real values depend on much factors unknown answering that questions) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give tremendous speed boost by switching to a columnar compressed format, like ORC or Parquet. We're talking about potential x30-x40 times improvements in queries performance. With latest Hive you can leverage streaming data ingest on ORC files.
You can leave things as you planned (HBase + Hive) and just rely on brute force 5 x (6 Core, 48GB, 7200 RPM) but you don't have to. A bit of work can get you into interactive ad-hoc query time territory, which will open up data analysis.

ScaleOut Software In Memory DataGrid Using Hadoop

I have been doing some reading on real time processing using hadoop and stumbled upon this http://www.scaleoutsoftware.com/hserver/
From what the documentation says, it looks like they implemented an in memory data grid using the hadoop worker/slave nodes. I have couple of questions here
From my understanding, if i have a data of size 100 GB, i would atleast need 100GB of ram across all nodes on my cluster just for the data + additional ram for task tracker, data node daemons + additional ram for the hServer service that would run on all these nodes. Is my understanding correct?
The software claims they can do real-time data processing by improving the latency issues in hadoop. Is it because, it allows us to write data to the in-memory grid instead of HDFS?
I am new to Big Data technologies. Apologize if some of the questions are naive.
[Full disclosure: I work at ScaleOut Software, the company which created ScaleOut hServer.]
In-memory data grids create a replica for every object to ensure high availability in case of failures.The aggregate amount of memory that is required is the memory used to store the objects with the addition of the memory used to store object replicas. In your example, you will need 200 GB of total memory: 100 GB for objects and 100 GB for replicas. For example, in a four-server cluster, each server needs 50 GB of memory available to the ScaleOut hServer service.
With the current release, ScaleOut hServer takes the first step in enabling real-time analytics by speeding up data access. It does this in two ways, which are implemented using different input/output formats. The first mode of operation uses the grid as a cache for HDFS, and the second uses the grid as the primary storage for a data set, providing support for fast-changing, memory-based data. Accessing data using an in-memory data grid reduces latency by eliminating disk I/O and minimizing network overhead. Also, caching HDFS data provides an additional performance boost by storing keys and values generated by the record reader instead of raw HDFS files in the grid.

Resources