Regarding Amazon EC2 cluster compute instances - amazon-ec2

I have a question regarding EC2 clusters. Can anyone tell me exactly how many instances are in an EC2 cluster? Let me explain what I mean by this question. I want to benchmark an application by running it on one instance while the benchmarker runs on different instances. I will create a virtual private network between the instances so that when the benchmarker sends packets to the application, the application responds to them and the benchmarker counts the number of responses (this is how the application and the benchmarker work). This can be done on multiple instances (I haven't tried it yet, but I believe so). How exactly can this be done on a cluster?

Cluster compute instances are an instance type, with names like large, xlarge, etc. Each one is a single virtual machine, and you can add as many of them as you need (and can afford).
Here are the specs of the cluster compute instances:
Cluster Compute Quadruple Extra Large 23 GB memory, 33.5 EC2 Compute Units, 1690 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet
Cluster Compute Eight Extra Large 60.5 GB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet
http://aws.amazon.com/ec2/

Related

Suggestions required for increasing utilization of YARN containers on our discovery cluster

Current Setup
We have a 10-node discovery cluster.
Each node of this cluster has 24 cores and 264 GB of RAM. Keeping some memory and CPU aside for background processes, we are planning to use 240 GB of memory.
Now, when it comes to container setup, since each container may need 1 core, we can have at most 24 containers, each with 10 GB of memory.
Usually clusters have containers with 1-2 GB of memory, but we are restricted by the number of cores available to us, or maybe I am missing something.
Problem statement
As our cluster is extensively used by data scientists and analysts, having just 24 containers does not suffice. This leads to heavy resource contention.
Is there any way we can increase the number of containers?
Options we are considering
If we ask the team to run many Tez queries together in one file (rather than separately), then at most one container is held at a time.
Requests
Is there any other way to manage our discovery cluster?
Is there any possibility of reducing the container size?
Can a vcore (as it's a logical concept) be shared by multiple containers?
Vcores are just a logical unit and are not in any way tied to a physical CPU core unless you are using YARN with cgroups and have yarn.nodemanager.resource.percentage-physical-cpu-limit enabled. Most tasks are rarely CPU-bound; they are more typically network I/O bound. So if you look at your cluster's overall CPU utilization and memory utilization, you should be able to resize your containers based on the wasted (spare) capacity.
You can measure utilization with a host of tools; sar, Ganglia, and Grafana are the obvious ones, but you can also look at Brendan Gregg's Linux performance tools for more ideas.
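
To make that concrete, here is a minimal yarn-site.xml sketch built on the assumption that most containers are I/O-bound; every value is illustrative and should be tuned against your measured spare capacity, not taken as a recommendation:

<configuration>
    <property>
        <!-- Advertise the planned 240 GB per node to YARN -->
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>245760</value>
    </property>
    <property>
        <!-- Vcores are logical: advertising 48 on a 24-core node lets two
             containers share each core, which is reasonable when tasks are
             network/disk-bound rather than CPU-bound -->
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>48</value>
    </property>
    <property>
        <!-- Lower the allocation floor so 4-5 GB containers can be requested,
             giving roughly 48-60 containers per node instead of 24 -->
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>4096</value>
    </property>
</configuration>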

Elastic Cloud vs Elasticsearch on a local server

Do Elastic Cloud and an Elasticsearch setup on a local machine achieve the same speed for data gathering?
Well, it depends on your local setup:
How many machines/nodes you have.
How much CPU and memory each node has; one of the most important factors is whether the nodes have SSDs.
It also depends on the network.
In Elastic Cloud you can choose the amount of memory and storage, but not the number of nodes (the node count follows from the amount of memory, because it's better to have one node with 32 GB of memory than three nodes with 10 GB each, for instance). It is also important that the Elastic Cloud setup uses SSDs.
So again, it all depends: a local setup will give you more flexibility and control, but the cloud could simplify your life.
One more option would be to go with AWS or Azure, because you would be able to add and remove nodes on demand, so it would be a bit easier to experiment and see which setup works better for you.
To sum up: if you have the same setup locally and in the cloud, there will be no difference in performance; the only thing that differs is latency.

How to increase AeroSpark read performance?

I am using the latest AeroSpark connector to work with Aerospike and Spark ML. But after I inserted around 60M records into Aerospike, read operations became extremely slow. For example, fetching around 500K records from a set that contains 60M records takes AeroSpark ~30 minutes, and htop shows Aerospike using only 7% of the CPU.
Each record contains around 1 KB of data. Aerospike and Spark are hosted on the same node. The data is filtered by a secondary index.
How can I speed up read operations? It seems AeroSpark works on only one thread; how can I parallelize this job? Any suggestions?
AeroSpike conf:
memory-size 8G
default-ttl 30d
storage-engine device {
    file /vol/rmla.data
    filesize 900G
}
Without knowing anything about your server, and with just a snippet of config, I'll stick to some generic recommendations that should improve your experience.
Disk IO
You are clearly bound by the read speed of your storage media, which you declared to be a file. If you're storing the data on disk, you can use either file or device in the storage-engine device config block.
There is a big difference in read and write latency between a file on an HDD and raw device access to an SSD. Typically Aerospike is used with data stored on enterprise-grade SSD devices. Read the section in the operations manual about initializing and setting up the drives. Declaring multiple devices for the namespace will give you a linear performance boost (two drives have double the read and write throughput of one of the same kind).
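As a sketch, that storage block rewritten to use raw devices instead of a file (the device paths are placeholders for your own SSDs):

storage-engine device {
    # Two raw SSDs instead of a single file on disk; Aerospike spreads
    # records across them, roughly doubling read/write throughput
    device /dev/sdb
    device /dev/sdc
}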
In Amazon EC2 you could use the c3, i2, r3, or i3 instance families for this purpose. The ephemeral SSD devices of EC2 instances don't need to be over-provisioned, have their RAID turned off, etc.; they only need to be initialized before they're first used. Do not use EBS drives for primary storage, as they're too slow.
Cluster Configuration
The Spark connector uses lots of scan operations. Make sure that you've configured scan-threads under your service config block to match the number of cores. If you don't know how many cores you have, run cat /proc/cpuinfo. If Spark is the only client using the Aerospike cluster, you can tune the scan threads higher.
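For example, a minimal service block along those lines (8 is an assumed core count; substitute whatever /proc/cpuinfo reports):

service {
    # One scan thread per core as a baseline; raise this if Spark is
    # the only client hitting the cluster
    scan-threads 8
}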
Connector Configuration
You can modify the connector config options for lower write latency. Optionally set aerospike.commitLevel to CommitLevel.COMMIT_MASTER.
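As a hypothetical Scala sketch of wiring that through the SparkConf: only aerospike.commitLevel is named in this answer; the seed-host and port keys below follow the connector's aerospike.* naming convention and are assumptions.

import org.apache.spark.SparkConf

val conf = new SparkConf()
    .setAppName("aerospark-read")
    .set("aerospike.seedhost", "127.0.0.1")                    // assumed key: a cluster seed node
    .set("aerospike.port", "3000")                             // assumed key: Aerospike's default port
    .set("aerospike.commitLevel", "CommitLevel.COMMIT_MASTER") // ack as soon as the master write lands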
Upgrade Version
As of November 28, 2016, aerospike/aerospark supports Spark 2.0. Make sure you're using the latest code.
Note: See the new tutorial for Aerospark on the Aerospike website.

Cassandra - Optimizing hardware in a cluster

I have been able to get Cassandra working on a MacBook cluster (for fun). Now I am trying to operationalize this for research.
Currently, I have a single Linux machine running an Intel 3770K (LGA 1150). I would like to create a cluster for the purpose of running Cassandra. Can I use cheap machines (2-3 nodes with an Intel i5, a 4 TB HDD, and 8 GB of RAM)? What is the best configuration to do this right the first time?
Is it possible to use the new nodes to run Cassandra while the current machine just uses the data for analysis?
8 GB of RAM is pretty low. I'd recommend a minimum of 16 GB (the more the better) so you can safely allocate an 8 GB heap while leaving room for the off-heap stuff. Especially if you want to store multiple TB of data per node, you want more than 8 GB. Some data models are worse than others. If using non-SSDs, be sure to dedicate a drive to the commitlog so it's not competing with the data, as sketched below. It will work with what you listed, but you won't get good performance once there's a decent amount of data.
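In cassandra.yaml terms, that separation looks like the following sketch (the mount points are hypothetical):

# Keep the commitlog on its own physical drive so its sequential writes
# never compete with data reads/writes (paths are hypothetical)
data_file_directories:
    - /mnt/data/cassandra/data
commitlog_directory: /mnt/commitlog/cassandra/commitlog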
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAbout_c.html
You can create multiple data centers to separate your different workloads. The DSE workload snitch will do that for you if you are using DataStax Enterprise.

Cassandra Amazon EC2, Read Performance experiments

I need some help improving Cassandra read performance. I am concerned about degradation of read performance as the size of the column family increases. We have the following stats for single-node Cassandra.
Operating System: Linux - CentOS release 5.4 (Final)
Cassandra version: apache-cassandra-1.1.0
Java version: "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)
Cassandra Configuration: (cassandra.yaml)
rpc_server_type: hsha
disk_access_mode: mmap
concurrent_reads: 64
concurrent_writes: 32
Platform: Amazon EC2/RightScale m1.xlarge instance with 4 ephemeral disks in RAID 0 (15 GB total memory, 4 virtual cores of 2 ECU each, total ECU = 8)
Experiment configurations:
I have tried to do some experiments with GC
Cassandra config:
10 GB of RAM is allocated to the Cassandra heap; the heap NEW size is 3500 MB.
JVM Config:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=1000"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=0"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=40"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops"
Result stats from OpsCenter community 2.0:
Read Requests 208 to 240 per second
Write Requests 18 to 28 per second
OS Load 24.5 to 25.85
Write Request Latency 127 to 160 micros
Read Request Latency 82202 to 94612 micros
OS Sent Network Traffic 44646 KB avg per second
OS Received Network Traffic 4338 KB avg per second
OS Disk Queue Size 13 to 15 requests
Read Requests Pending 25 to 32
OS Disk latency 48 to 56 ms
OS Disk Read Throughput 4.6 Mb per second
Disk IOPs Reads 420 per second
IOWait 80 % CPU avg
Idle 13 % CPU avg
The row cache is disabled.
The Column Family
The column family I am reading from was created through the CLI:
create column family XColFam
  with column_type = 'Standard'
  and comparator = 'CompositeType(BytesType,IntegerType)';
Column family SSTable Size = 7.10 GB, SSTable Count = 2
The XColFam column family has an estimated 59,499,904 row keys (most are UTF-8 literals of varying length, estimated through mx4jtools), and its rows are thin in nature, with column values of 0 bytes.
Most rows should have a very small number of columns, maybe 1 to 10. The first component of the column name is about 20 to 30 bytes and the second is an 8-byte integer. The second component of the composite column is dynamic and could repeat, but the probability is low; the first component repeats in many varieties, and the number of columns per row can differ.
I tried SnappyCompression on the column family, but there was no change in size.
I have a scheduled service that runs for hours with 20 threads and makes random read requests for multiple keys (for now, 2 keys per request) to this column family, reading full rows with no column slices, etc.
I think it is not performing well now because it processes too few requests per minute. It worked better before, when the column family was not that big, around 3 to 4 GB.
I am afraid read performance degrades too fast as the column family grows.
I have also tried to tweak some GC and memory settings, because before that I was seeing lots of GC activity and CPU usage; when the data size was smaller, there was very little iowait, appearing in a wave pattern.
How can I increase Cassandra's read performance? Your suggestions will be appreciated.
Cassandra is relatively I/O dependent, and EC2 instances have "insufficient" I/O by design (Xen virtualization).
My first recommendation is to run Cassandra on real hardware, where you have control; e.g., you can use an SSD for the commitlog. Look at the Cassandra hardware proposals.
However, switching to your own hardware is a somewhat radical option. To stay with Amazon, try EBS:
Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are network-attached, and persist independently from the life of an instance. Amazon EBS provides highly available, highly reliable, predictable storage volumes that can be attached to a running Amazon EC2 instance and exposed as a device within the instance. Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage.
Amazon EBS allows you to create storage volumes from 1 GB to 1 TB that can be mounted as devices by Amazon EC2 instances. Multiple volumes can be mounted to the same instance. Amazon EBS enables you to provision a specific level of I/O performance if desired, by choosing a Provisioned IOPS volume. This allows you to predictably scale to thousands of IOPS per Amazon EC2 instance.
Also check out Cassandra Performance Testing on EC2
Short answer: row cache and key cache.
If your data contains subsets that are frequently read, as in most systems, try to use the row cache and key cache.
The row cache is an in-memory cache that stores frequently read rows completely in memory. Keep in mind that this may not have the desired effect if your reads are spread out across the data.
The key cache is generally better suited, as it stores only the partition keys and their offsets on disk. It generally lets Cassandra skip a lookup (no need to use the partition indexes and partition summaries).
Try enabling the key cache on the keyspace and table and check your performance.
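With the Cassandra 1.1 CLI used elsewhere in this question, a minimal sketch would look like this (verify the attribute name and values against your exact version before relying on them):

update column family XColFam with caching = 'keys_only';

Optionally, in cassandra.yaml, the global key cache size can be set explicitly with key_cache_size_in_mb (e.g. 512; the figure is an assumption to tune, not a recommendation).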
