Performance Tuning for large queries on Apache Drill - parquet

I am trying to execute large queries on Apache Drill and they take more than an hour to run, even though CPU, memory and I/O are all underutilized.
Size of underlying data in Parquet format: 30 GB
Cluster size: single node
RAM: 512 GB, 300 GB assigned to Drill
CPU: 48 cores
File system: MapR
What are the possible tuning parameters one can check to improve the performance of Apache Drill?
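For reference, Drill exposes such knobs as system/session options; below is a hedged sketch of the two most often checked for per-node parallelism and per-query memory (the values are illustrative only, not a recommendation):

ALTER SYSTEM SET `planner.width.max_per_node` = 24;
ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 8589934592;

The first caps how many parallel minor fragments Drill runs on the node; the second sets the per-node memory budget for a query, in bytes.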

Related

How to increase Hive concurrent mappers to more than 4?

Summary
When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.
Details
I am using a somewhat large cluster (tens of nodes, each with more than 200 GB RAM) running HDFS and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
Hive creates hundreds of map tasks, but only 4 are running simultaneously.
This means that my cluster is mostly idle during the query which seems wasteful. I have tried ssh'ing to the nodes in use and they are not utilizing CPU or memory fully. Our cluster is backed by Infiniband networking and Isilon file storage neither of which seems very loaded at all.
We are using mapreduce as the engine. I have tried removing any limits to resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there a way to get Hive/MapReduce to use more of my cluster?
How would I go about figuring out the bottleneck?
Could it be that YARN is not assigning tasks fast enough?
I guess that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed ATM).
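For what it's worth, YARN's own view of node capacity and running applications can be checked from the command line; a short sketch (the application id below is just a placeholder):

yarn node -list
yarn application -list
yarn application -status application_1234567890123_0001

The node report shows how many containers and how much memory each node manager is actually offering, which helps separate a scheduler limit from a node-side limit.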
The number of parallel tasks depends on your memory settings in YARN.
For example, if you have 4 data nodes and your YARN memory properties are defined as below:
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
According to these settings you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb across the cluster is 4 GB, which is what you can use to launch containers.
Since each container takes 1 GB of memory, at any given point in time you can launch 4 containers. One will be used by the application master, so at most 3 mapper or reducer tasks can run at any given time, since the application master, mappers and reducers each use 1 GB of memory.
So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks (illustrated below).
P.S. - Here we are talking about the maximum number of tasks that can be launched; the actual number may also be somewhat lower.
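As a hedged illustration only (the 4 GB node size and the container sizes below are made up for the example, not a recommendation): raising the node manager total relative to the per-task allocation is what raises the container count per node.

yarn.nodemanager.resource.memory-mb 4096 MB
yarn.scheduler.minimum-allocation-mb 1024 MB
yarn.app.mapreduce.am.resource.mb 1024 MB
mapreduce.map.memory.mb 1024 MB
mapreduce.reduce.memory.mb 1024 MB

With these values each node can hold 4096/1024 = 4 containers, so 4 data nodes give 16 containers, 15 of which can run map or reduce tasks once one is taken by the application master.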

How can I evaluate my Spark application?

Hello, I just finished creating my first Spark application. Now I have access to a cluster (12 nodes, where each node has 2 Intel(R) Xeon(R) E5-2650 2.00GHz processors and each processor has 8 cores), and I want to know what criteria help me tune my application and observe its performance.
I have already visited the official Spark website; it talks about data serialization, but I couldn't work out what exactly it is or how to specify it.
It also talks about "memory management" and "level of parallelism", but I didn't understand how to control these.
One more thing: I know that the size of the data has an effect, but all the .csv files I have are small. How can I get large files (10 GB, 20 GB, 30 GB, 50 GB, 100 GB, 300 GB, 500 GB)?
Please try to explain this well for me, because cluster computing is new to me.
For tuning your application you need to know a few things:
1) You need to monitor your application: whether your cluster is underutilized or not, and how many resources are used by the application you have created.
Monitoring can be done using various tools, e.g. Ganglia.
From Ganglia you can find CPU, memory and network usage.
2) Based on observations of CPU and memory usage you can get a better idea of what kind of tuning your application needs.
From Spark's point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
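As a hedged sketch (the main class and jar name are placeholders, not from the question), the same settings can also be passed on the command line when submitting, instead of editing spark-defaults.conf:

spark-submit --class com.example.MyApp \
  --master yarn \
  --driver-memory 5g \
  --executor-memory 3g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  my-app.jar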
For more details refer to http://spark.apache.org/docs/latest/tuning.html
Hope this Helps!!

Hadoop machine configuration

I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500GB, but to analyze 500GB data I don't need to go through 7TB of data again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReduce and HBase to process and store the data.
At the moment I have 5 machines of following configuration:
Data Node Server Configuration: 2-2.5 GHz hexa-core CPU, 48 GB RAM, 1 TB 7200 RPM disks (x8)
Number of data nodes: 5
Name Node Server: Enterprise-class server configuration (x2) (1 additional for the secondary name node)
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Count * 1.2) / Comp Ratio
Assuming default values:
repl_count == 3 (default)
comp_ratio = 3-4 (default)
intermediate data size = 30%-50% of raw data size
1.2 factor - temp space
So for your first year you will need about 16.9 TB. You have 8 TB * 5 == 40 TB, so space is not the issue.
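One hedged way those numbers work out (assuming 30% intermediate data and a compression ratio of about 3.6, so that replication (x3) and temp space (x1.2) roughly cancel against compression):

initial size             = 7 TB
YOY growth               = 0.5 TB/month * 12 = 6 TB
intermediate data (30%)  = 0.3 * 13 TB ≈ 3.9 TB
required capacity        ≈ (7 + 6 + 3.9) * 3 * 1.2 / 3.6 ≈ 16.9 TB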
Performance
5 data nodes. Reading 1 TB takes on average 2.5 hours (source: Hadoop - The Definitive Guide) on a single drive, so 600 GB on one drive would be about 1.5 hours. Assuming the data is replicated so that you can use all 5 nodes in parallel, reading the whole data set with 5 nodes can come down to roughly 18 minutes.
You may have to add some more time depending on what your queries do and how you have configured your data processing.
Memory consumption
48 GB is not much. The default RAM for many data nodes starts from 128 GB. If you use the cluster only for processing, it might work out. It also depends a bit on how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you will run into heap errors.
To sum it up:
It depends a lot on what you want to do with your cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes of processing time for 600 GB of data (as a baseline - real values depend on many factors unknown when answering this question) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give a tremendous speed boost by switching to a columnar compressed format like ORC or Parquet. We're talking about potential 30x-40x improvements in query performance. With recent Hive versions you can leverage streaming data ingest into ORC files.
You can leave things as you planned (HBase + Hive) and just rely on brute force, 5 x (6 cores, 48 GB, 7200 RPM), but you don't have to. A bit of work can get you into interactive ad-hoc query time territory, which will open up data analysis.
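As a hedged sketch of that transformation (the table and column names are placeholders, not from the question): an ORC table is declared with STORED AS ORC and populated with an INSERT ... SELECT from the raw staging table.

CREATE TABLE events_orc (
  event_time TIMESTAMP,
  user_id    BIGINT,
  payload    STRING
)
STORED AS ORC;

-- convert from the raw (e.g. CSV-backed) staging table
INSERT INTO TABLE events_orc
SELECT event_time, user_id, payload
FROM events_raw;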

HDFS sequence file performance tuning

I'm trying to use Hadoop to process a lot of small files which are stored in sequence files. My program is highly IO-bound, so I want to make sure that IO throughput is high enough.
I wrote an MR program that reads small sample files from the sequence files and writes them to a RAM disk (/dev/shm/test/). There is another standalone program that deletes the files written to the RAM disk without doing any computation, so the test should be almost purely IO-bound. However, the IO throughput is not as good as I expected.
I have 5 datanodes and each datanode has 5 data disks. Each disk can provide about 100 MB/s throughput. Theoretically this cluster should be able to provide 100 MB/s * 5 (disks) * 5 (machines) = 2500 MB/s. However, I get only about 600 MB/s. I ran "iostat -d -x 1" on the 5 machines and found that the IO load is not well balanced. Usually only a few of the disks have 100% utilization, some disks have very low utilization (10% or less), and some machines even have no IO load at times. Here's the screenshot. (Of course the load on each disk/machine varies quickly.)
Here's another screenshot that shows CPU usage from the "top -cd1" command:
Here is some more detailed configuration for my case:
Hadoop cluster hardware: 5 Dell R620 machines equipped with 128 GB RAM and a 32-core CPU (actually 2x Xeon E5-2650). 2 HDDs form a RAID 1 volume for CentOS and 5 data disks are used for HDFS, so you can see 6 disks in the above screenshot.
Hadoop settings: block size 128 MB; datanode handler count is 8; 15 maps per task tracker; 2 GB MapReduce child heap.
Testing file set: about 400,000 small files, total size 320 GB, stored in 160 sequence files, each about 2 GB in size. I tried storing the files in sequence files of many different sizes (1 GB, 512 MB, 256 MB, 128 MB), but the performance didn't change much.
I don't expect the whole system to reach 100% IO throughput (2500 MB/s), but I think 40% (1000 MB/s) or more should be reasonable. Can anyone provide some guidance for performance tuning?
I solved the problem myself. Hint: the high CPU usage.
It's very abnormal that the CPU usage is so high since it's an almost pure IO job.
The root cause is that each task node gets about 500 maps and each map uses exactly one JVM. By default, Hadoop MapReduce is configured to create a new JVM for each new map task.
Solution: Modify the value of "mapred.job.reuse.jvm.num.tasks" from 1 to -1, which indicates that the JVM will be reused without limitation.
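A hedged sketch of that change in mapred-site.xml (this is the classic MR1-era property named above; it can also be passed per job with -D on the command line):

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>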

Cassandra Amazon EC2, Read Performance experiments

I need some help improving Cassandra read performance. I am concerned about degradation of read performance as the size of the column family increases. We have the following stats on single-node Cassandra.
Operating System: Linux - CentOS release 5.4 (Final)
Cassandra version: apache-cassandra-1.1.0
Java version: "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)
Cassandra Configuration: (cassandra.yaml)
rpc_server_type: hsha
disk_access_mode: mmap
concurrent_reads: 64
concurrent_writes: 32
Platform: Amazon-ec2/Rightscale m1.Xlarge instance with 4 ephemeral disks with raid0. (15 GB Total Memory, 4 Virtual Cores, 2 ECU , Total ECU = 8)
Experiment configurations:
I have tried to do some experiments with GC
Cassandra config:
10 GB RAM is allocated to Cassandra Heap, 3500MB is Heap NEW size.
JVM Config:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=1000"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=0"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=40"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops"
Result stats from OpsCenter community 2.0:
Read Requests 208 to 240 per second
Write Requests 18 to 28 per second
OS Load 24.5 to 25.85
Write Request Latency 127 to 160 micros
Read Request Latency 82202 to 94612 micros
OS Sent Network Traffic 44646 KB avg per second
OS Received Network Traffic 4338 KB avg per second
OS Disk Queue Size 13 to 15 requests
Read Requests Pending 25 to 32
OS Disk latency 48 to 56 ms
OS Disk Read Throughput 4.6 Mb per second
Disk IOPs Reads 420 per second
IOWait 80 % CPU avg
Idle 13 % CPU avg
Rowcache is disabled.
The Column Family
One of the column families I am only reading from was created through the CLI:
create column family XColFam
  with column_type = 'Standard'
  and comparator = 'CompositeType(BytesType,IntegerType)';
Column family SSTable Size = 7.10 GB, SSTable Count = 2
The XColFam column family has an estimated 59,499,904 row keys (most are UTF-8 literals of varying length, estimated through mx4jtools), with columns that are thin in nature and currently have 0-byte values.
Most rows should have a very small number of columns, maybe 1 to 10. The 1st component of the column name is approximately 20 to 30 bytes and the 2nd is an 8-byte integer. The 2nd component of the composite column is dynamic and could repeat, but the probability is low; the 1st component repeats in many varieties, and the number of columns per row can differ.
I have tried SnappyCompression to compress the column family but there was no change in size.
I have a scheduled service that runs for hours with 20 threads and makes random read requests for multiple keys (for now it's 2 keys per request) to this column family, reading full rows, no column slices etc.
I think it is not performing well now because it is processing too few requests per minute. It was working better before, when the column family was not that big, around 3 to 4 GB.
I am afraid read performance degrades too quickly as the column family grows in size.
I have also tried to tweak some GC and memory settings, because before that I was seeing a lot of GC activity and CPU usage; when the data size was smaller there was very little iowait, in a wave pattern.
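For reference, per-column-family latency and cache statistics can be pulled with nodetool; a short sketch (the keyspace name below is a placeholder):

nodetool cfstats
nodetool cfhistograms MyKeyspace XColFam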
How can I increase Cassandra's read performance? Your suggestions will be appreciated.
Look, Cassandra is relatively I/O dependent, and EC2 instances have "insufficient" I/O by design (Xen virtualization).
My first recommendation is to use Cassandra on real hardware, where you have control; e.g. you can use an SSD for the CommitLog. Look at the Cassandra hardware recommendations.
However, switching to your own hardware is a bit of a radical option. To stay with Amazon, try EBS:
Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are network-attached, and persist independently from the life of an instance. Amazon EBS provides highly available, highly reliable, predictable storage volumes that can be attached to a running Amazon EC2 instance and exposed as a device within the instance. Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage.
Amazon EBS allows you to create storage volumes from 1 GB to 1 TB that can be mounted as devices by Amazon EC2 instances. Multiple volumes can be mounted to the same instance. Amazon EBS enables you to provision a specific level of I/O performance if desired, by choosing a Provisioned IOPS volume. This allows you to predictably scale to thousands of IOPS per Amazon EC2 instance.
Also check out Cassandra Performance Testing on EC2
Short Answer: Row Cache and Key Caches.
If your data contains subsets that will be read frequently, as in most systems, try to use the row cache and key cache.
The row cache is an in-memory cache which stores frequently read rows completely in memory. Keep in mind that this may not have the desired effect if your reads are spread out across the data.
The key cache is generally better suited, as it only stores partition keys and their on-disk offsets. This generally helps Cassandra skip a lookup (no need to use the partition index and partition summary).
Try enabling the key cache for the keyspace and table and check your performance.
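A hedged sketch of what enabling the caches can look like on the 1.1 line, where the caches are global and sized in cassandra.yaml while a per-column-family attribute selects what gets cached (the values are illustrative, and the exact CLI syntax may vary slightly by version):

In cassandra.yaml:
key_cache_size_in_mb: 512
row_cache_size_in_mb: 0

In cassandra-cli:
update column family XColFam with caching = 'keys_only';

nodetool info and nodetool cfstats report cache statistics afterwards, which makes it easy to verify whether the cache is actually being hit.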
