what is the volume of Cloudera CDH3 for 50 nodes - hadoop

The free version only supports up to 50 nodes.
If I use 10 x 2 TB hard disks per computer, that means 10 * 2 * 50 = 1000 TB.
I could store 1000 TB of data, right?
Thanks

If you don't replicate your data, this is true.
Usually in a 50-node environment your replication factor is set to 3 or 4,
which reduces the amount of unique data you can store to 1000 TB / 3 ≈ 333 TB or 1000 TB / 4 = 250 TB.
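As a quick sanity check, here is a minimal Python sketch of that calculation; the disk counts and replication factors are simply the numbers from the question and answer above.

def unique_capacity_tb(nodes, disks_per_node, disk_tb, replication):
    """Unique (un-replicated) data a cluster can hold, ignoring any overhead."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replication

print(unique_capacity_tb(50, 10, 2, 3))  # ~333 TB of unique data at replication 3
print(unique_capacity_tb(50, 10, 2, 4))  # 250 TB of unique data at replication 4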

Related

How to calculate Hadoop Storage?

I'm not sure if I'm calculating this right, but for example I'm using the Hadoop default settings and I want to calculate how much data I can store in my cluster. Say I have 12 nodes and 8 TB of total disk space per node allocated to HDFS storage.
Do I just multiply 12 * 8 = 96 TB?
You're not including the replication factor or the overhead for processing any of that data. Plus, Hadoop won't run if all the disks are close to full.
Therefore, the 8 TB per node would first be divided by 3 (without the new Erasure Coding enabled), and then multiplied by the number of nodes.
However, you can't technically hit 100% of HDFS usage, because services will start failing once you go above roughly 85% usage, so really your starting number per node should be about 7 TB.
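Putting the answer's numbers together, a rough sketch of the same estimate (the 85% usage ceiling and replication factor of 3 are the assumptions stated above):

def hdfs_usable_tb(nodes, disk_tb_per_node, replication=3, safe_usage=0.85):
    """Rough unique HDFS capacity, leaving headroom so disks never approach 100% usage."""
    safe_per_node = disk_tb_per_node * safe_usage   # ~7 TB of the 8 TB per node
    return safe_per_node * nodes / replication

print(hdfs_usable_tb(12, 8))  # ~27 TB of unique data for 12 nodes x 8 TB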

How much data can my Hadoop cluster handle?

I have a 4-node cluster configured with 1 NameNode and 3 DataNodes. I'm running a TPC-H benchmark and I would like to know how much data you think my cluster can handle without affecting query response times. My total available HD size is about 700 GB, and each node has an 8-core CPU and 16 GB of RAM.
I saw some calculations that can be done to find the volume limit, but I didn't understand them. If someone could explain in a simple way how to calculate the data volume a cluster can handle, it would be very helpful.
Thank you
You can use 70 to 80% of the space in your cluster to store data; the rest will be used for processing and for storing intermediate results.
This way performance will not be impacted.
As you mentioned, you have already configured your 4-node cluster. You can check the NameNode web UI --> Configured Capacity section to find the storage details. Let me know if you run into any difficulties.
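For a ballpark figure, here is the 70-80% rule applied to the 700 GB from the question (the percentages are just the rule of thumb from this answer):

total_gb = 700  # total HD size from the question
for frac in (0.70, 0.80):
    data_gb = total_gb * frac
    print(f"{frac:.0%} for data -> {data_gb:.0f} GB stored, "
          f"{total_gb - data_gb:.0f} GB left for processing and intermediate results")
# 70% for data -> 490 GB stored, 210 GB left ...
# 80% for data -> 560 GB stored, 140 GB left ...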

virtual segment memory/core allocation in Apache Hawq

I am trying to tweak the HAWQ configurations below at the session level for a query:
SET hawq_rm_stmt_nvseg = 40;
SET hawq_rm_stmt_vseg_memory = '4gb';
HAWQ is running on the YARN resource manager with:
Minimum HAWQ queue used capacity: 5%
hawq_rm_nvseg_perquery_perseg_limit = 6
hawq_rm_min_resource_perseg = 4
When running my query I see only 30 containers being launched. Should it not be 40 containers (1 core per virtual segment)? Please help me understand how virtual segment memory and cores are allocated.
hawq_rm_stmt_nvseg is a quota limit. By default, this is 0. So setting this to 40 won't increase the number of vsegs but instead, limit it.
hawq_rm_nvseg_perquery_perseg_limit controls how many vsegs can be created and you are using the default of 6. So the number of vsegs should be 6 * number of nodes. If you see 30, then you probably have 5 nodes.
If you are using randomly distributed tables, you can increase hawq_rm_nvseg_perquery_perseg_limit to get more vsegs to work on your query.
If you are using hash distributed tables, you can recreate the table with a larger bucketnum value which will give you more vsegs when you query it.
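A minimal sketch of the arithmetic in this answer (the node count of 5 is inferred from the 30 observed containers, not stated in the question):

def expected_vsegs(perquery_perseg_limit, num_nodes):
    """Upper bound on virtual segments per query for randomly distributed tables."""
    return perquery_perseg_limit * num_nodes

hawq_rm_nvseg_perquery_perseg_limit = 6  # default, as in the question
num_nodes = 5                            # assumption: 30 containers / limit of 6
print(expected_vsegs(hawq_rm_nvseg_perquery_perseg_limit, num_nodes))  # 30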

Hadoop machine configuration

I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500GB, but to analyze 500GB data I don't need to go through 7TB of data again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReduce and HBase to process and store the data.
At the moment I have 5 machines of following configuration:
DataNode server configuration: 2-2.5 GHz hexa-core CPU, 48 GB RAM, 1 TB 7200 RPM disks (x8)
Number of DataNodes: 5
NameNode server: enterprise-class server configuration (x2) (1 additional for the secondary NameNode)
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing:
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Count * 1.2) / Comp Ratio
Assuming default values:
repl_count = 3 (default)
comp_ratio = 3-4 (default)
intermediate data size = 30%-50% of raw data size
1.2 factor = temp space
So for your first year you will need roughly 16.9 TB. You have 8 TB * 5 nodes = 40 TB, so space is not the issue.
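For reference, a small sketch of that formula. The intermediate-data fraction and compression ratio below are plug-in values from the stated defaults, not the exact ones this answer used, so the result comes out as a range rather than the single 16.9 TB figure:

def sizing_tb(initial_tb, yearly_growth_tb, intermediate_fraction=0.3,
              repl_count=3, comp_ratio=3.0, temp_factor=1.2):
    """Raw disk needed per the Hortonworks-style sizing formula quoted above."""
    raw = initial_tb + yearly_growth_tb
    intermediate = raw * intermediate_fraction
    return (raw + intermediate) * repl_count * temp_factor / comp_ratio

# Question's numbers: 7 TB initial, 500 GB/month -> ~6 TB growth in the first year
for comp_ratio in (3.0, 4.0):
    print(comp_ratio, round(sizing_tb(7, 6, comp_ratio=comp_ratio), 1))
# comp_ratio 3 -> ~20.3 TB, comp_ratio 4 -> ~15.2 TB; either way well under the 40 TB available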
Performance
5 DataNodes. Reading 1 TB takes on average 2.5 hours on a single drive (source: Hadoop: The Definitive Guide), so 600 GB on one drive would take about 1.5 hours. Assuming the data is distributed so that all 5 nodes can read in parallel, reading the whole data set can get down to about 18 minutes.
You may have to add some more time depending on what you do in your queries and how you have configured your data processing.
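A back-of-the-envelope version of that estimate (the 1 TB ≈ 2.5 hours single-drive figure is the one quoted above from Hadoop: The Definitive Guide):

def scan_minutes(data_gb, nodes, hours_per_tb_per_drive=2.5):
    """Time to scan data_gb once, assuming a fully parallel read across all nodes."""
    single_drive_hours = (data_gb / 1000) * hours_per_tb_per_drive
    return single_drive_hours / nodes * 60

print(round(scan_minutes(600, 5)))  # ~18 minutes for 600 GB across 5 nodes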
Memory consumption
48 GB is not much. The default RAM for many DataNodes starts at 128 GB. If you use the cluster only for processing, it might work out. It also depends a bit on how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you will run into heap errors.
To sum it up:
It depends very much on what you want to do with your cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes of processing time for 600 GB of data (as a baseline; real values depend on many factors unknown when answering this question) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give a tremendous speed boost by switching to a columnar compressed format like ORC or Parquet. We're talking about potential 30x-40x improvements in query performance. With the latest Hive you can leverage streaming data ingest into ORC files.
You can leave things as you planned (HBase + Hive) and just rely on brute force 5 x (6 Core, 48GB, 7200 RPM) but you don't have to. A bit of work can get you into interactive ad-hoc query time territory, which will open up data analysis.

HDFS Replication - Data Stored

I am a relative newbie to hadoop and want to get a better understanding of how replication works in HDFS.
Say that I have a 10-node system (1 TB per node), giving me a total capacity of 10 TB. If I have a replication factor of 3, then I have 1 original copy and 3 replicas for each file. So, in essence, only 25% of my storage is original data. So my 10 TB cluster in effect holds only 2.5 TB of original (un-replicated) data.
Please let me know if my train of thought is correct.
Your thinking is a little off. A replication factor of 3 means that you have 3 total copies of your data. More specifically, there will be 3 copies of each block for your file, so if your file is made up of 10 blocks there will be 30 total blocks across your 10 nodes, or about 3 blocks per node.
You are correct in thinking that a 10 x 1 TB cluster has less than 10 TB of capacity: with a replication factor of 3, it actually has a functional capacity of about 3.3 TB, with a little less actual capacity because of the space needed for processing, holding temporary files, etc.
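A small sketch of the block-level view described in this answer (the 128 MB block size is HDFS's common default, assumed here rather than stated in the question):

import math

def total_blocks(file_size_mb, replication=3, block_size_mb=128):
    """Total HDFS blocks stored cluster-wide for one file, counting all replicas."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks * replication

print(total_blocks(10 * 128))  # a 10-block file -> 30 blocks across the cluster, ~3 per node
print(round(10 * 1 / 3, 1))    # ~3.3 TB functional capacity for 10 x 1 TB at replication 3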
