Do Cassandra vnodes increase disk seek time? - cassandra-2.0

Cassandra 1.2 added a new feature, virtual nodes (vnodes), which divides one physical node into multiple virtual nodes. Does this increase disk seek time? My reasoning is that different virtual nodes have different commit logs, so writing to different commit logs would increase disk seek time.

Commit log writes are sequential and do not require a disk seek to write the data.

Virtual nodes do not have individual commit logs. There is only one commit log for each physical node.

Related

Does ClickHouse take the amount of free disk space into account when scheduling background merges?

I have a ClickHouse cluster (three nodes) that contains a MergeTree table, an AggregatingMergeTree table, and a materialized view that fills the AggregatingMergeTree with the data we insert into the MergeTree. All tables are present on each node (see the full schema in this gist here).
I recently increased the storage size (from 4TB per node to 4.5TB), and right after that ClickHouse seemed to become more aggressive at running background merges. It seems to run longer merges with a higher rows-merged-per-second rate, to the point that some merges saturate the servers' IO bandwidth and hurt the insertion rate.
I noticed this setting here. It mentions that ClickHouse schedules a merge if there are enough free resources in the background pool.
Does anybody know if that takes the amount of free disk space into account? More space -> more likely to run merges that would create bigger partitions? The value we use for that parameter is the default one. I did notice that the biggest active partitions we have are around 150GB, though I cannot say how big they were before adding storage.
Please let me know if there is any additional context needed.
Thanks
Yes, the ClickHouse merge scheduler takes the amount of free disk space into account.
A 150GB merge can start only if 300GB or more of free disk space is available.
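The rule in this answer can be sketched as a simple check. This is only an illustration of the 150GB / 300GB figure quoted above, not ClickHouse's actual scheduler logic; the function name `can_start_merge` and the `headroom_factor` of 2 are assumptions derived from that figure.

```python
def can_start_merge(merge_size_gb: float, free_disk_gb: float,
                    headroom_factor: float = 2.0) -> bool:
    """Return True if there is enough free disk space to start a merge.

    headroom_factor=2.0 is an assumption taken from the answer's example
    (a 150GB merge needs 300GB+ free); the real scheduler may differ.
    """
    required_gb = merge_size_gb * headroom_factor
    return free_disk_gb >= required_gb


# A 150GB merge with exactly 300GB free is allowed, with less it is not.
print(can_start_merge(150, 300))
print(can_start_merge(150, 250))
```

Under this reading, adding 500GB of disk per node raises the size of the largest merge that can be scheduled, which would be consistent with the heavier merges observed after the storage increase.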

What is the ideal NameNode memory size when there are a lot of files in HDFS?

I will have 200 million files in my HDFS cluster. We know each file occupies 150 bytes of NameNode memory, plus 3 blocks at 150 bytes each, so each file takes 600 bytes in total in the NN.
So I set my NN memory to 250GB to comfortably handle 200 million files. My question is: will such a large memory size of 250GB put too much pressure on GC? Is it feasible to give the NN 250GB of memory?
Can someone please comment? Why has nobody answered?
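The arithmetic in the question can be checked with a short sketch. It assumes the figures stated in the question (150 bytes per NameNode object, one file object plus 3 block objects per file); the function name is hypothetical and the result covers only metadata objects, not GC or processing headroom.

```python
BYTES_PER_OBJECT = 150  # per-object figure quoted in the question
BLOCKS_PER_FILE = 3     # blocks per file assumed in the question


def namenode_metadata_bytes(num_files: int,
                            blocks_per_file: int = BLOCKS_PER_FILE,
                            bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Estimate the memory held by file and block objects in the NameNode."""
    objects_per_file = 1 + blocks_per_file  # 1 file object + its blocks
    return num_files * objects_per_file * bytes_per_object


total_bytes = namenode_metadata_bytes(200_000_000)
print(f"{total_bytes / 10**9:.0f} GB")  # 200M files * 600 bytes = 120 GB
```

By this estimate the raw metadata is about 120GB, so a 250GB allocation leaves roughly half of the memory as headroom.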
The ideal NameNode memory size is roughly the total space used by the metadata of the data, plus the OS, plus the daemons, plus 20-30% for processing-related data.
You should also consider the rate at which data comes into your cluster. If data arrives at 1TB/day, you must plan for more memory or you will soon run out.
It is always advised to keep at least 20% of memory free at any point in time. This helps avoid the NameNode going into a full garbage collection.
As Marco specified earlier, you may refer to NameNode Garbage Collection Configuration: Best Practices and Rationale for GC config.
In your case 256GB looks good if you aren't going to get a lot of data and aren't going to do lots of operations on the existing data.
Refer: How to Plan Capacity for Hadoop Cluster?
Also refer: Select the Right Hardware for Your New Hadoop Cluster
You can have 256GB of physical memory in your NameNode. If your data grows to huge volumes, consider HDFS federation. I assume you already have multiple cores (with or without hyperthreading) in the NameNode host. The link below should address your GC concerns:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html

Why does Spark choose to send data over the network in the shuffle phase instead of writing to some location on HDFS?

As far as I understand, Spark tries to send data over the network to another node's in-memory buffer and spills to disk if it doesn't fit in memory. Why can't Spark just write to HDFS, from where any node can read?
Writing it to disk is a much slower transfer.
On top of that, you guarantee that you incur the overhead of synchronizing disk access among the interested nodes.

Why is RAID not recommended for Hadoop HDFS setups?

Various websites (like Hortonworks) recommend not configuring RAID for HDFS setups, mainly for two reasons:
Speed is limited to that of the slowest disk (JBOD performs better).
Reliability
It is recommended to use RAID on the NameNode.
But what about implementing RAID on each DataNode storage disk?
RAID is used for two purposes. Depending on the RAID configuration you can get:
Better performance: reading a file can be spread over multiple disks or different disks can be transparently used to read multiple files from the same file system.
Fault-tolerance: Data is replicated or stored using parity bits on multiple disks. If a disk fails, it can be recovered from another replica or recomputed using the parity bits.
HDFS has similar mechanisms built in software. HDFS splits files into chunks (so-called file blocks) which are replicated across multiple datanodes and stored on their local filesystems. Usually, datanodes have multiple disks which are individually mounted (JBOD). A datanode should distribute its file blocks across all its disks / local filesystems.
This ensures:
Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
High sequential read/write performance: By splitting a file into multiple chunks and storing them on different nodes (and different disks), a file can be read in parallel by concurrently accessing multiple disks (on different nodes). Each disk can read data with its full bandwidth and its read operations do not interfere with other disks. If the cluster is well utilized all disks will be spinning at full speed delivering the maximum sequential read performance.
Since HDFS is taking care of fault-tolerance and "striped" reading, there is no need to use RAID underneath an HDFS. Using RAID will only be more expensive, offer less storage, and also be slower (depending on the concrete RAID config).
Since the namenode is a single-point-of-failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes.
RAID0 on an enterprise server is a huge mistake. I sure would like to meet the person who designed this. It makes no sense to an IT operations manager. If you configure any of your local server disks with RAID0, you risk a long and painful RAID0 recovery. If a single disk in a RAID0 fails, that RAID partition is destroyed, and it doesn't magically recover when the disk is replaced. Someone has to log on to the server, delete the old RAID partition, and create a new one. This creates a lot of overhead at a time when man-hours and work cycles are at an all-time high. An IT operations manager is either going to delay doing this because of higher-priority workload or refuse to do it because they don't have enough cycles to take people away from more important work. Then it's going to get pushed off to another team. Then the politics begin, and wham, it gets pushed back to the server owner/customer. If you made a RAID1 or SAN drive available instead, you could avoid that entire scenario.

What is the memory consumption of Hadoop's NameNode?

Can anyone give a detailed analysis of the memory consumption of the namenode? Or is there some reference material? I cannot find any material online. Thank you!
I suppose the memory consumption depends on your HDFS setup, that is, on the overall size of the HDFS, and is relative to the block size.
From the Hadoop NameNode wiki:
Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
From https://twiki.opensciencegrid.org/bin/view/Documentation/HadoopUnderstanding:
Namenode: The core metadata server of Hadoop. This is the most critical piece of the system, and there can only be one of these. This stores both the file system image and the file system journal. The namenode keeps all of the filesystem layout information (files, blocks, directories, permissions, etc) and the block locations. The filesystem layout is persisted on disk and the block locations are kept solely in memory. When a client opens a file, the namenode tells the client the locations of all the blocks in the file; the client then no longer needs to communicate with the namenode for data transfer.
The same site recommends the following:
Namenode: We recommend at least 8GB of RAM (minimum is 2GB RAM), preferably 16GB or more. A rough rule of thumb is 1GB per 100TB of raw disk space; the actual requirements is around 1GB per million objects (files, directories, and blocks). The CPU requirements are any modern multi-core server CPU. Typically, the namenode will only use 2-5% of your CPU.
As this is a single point of failure, the most important requirement is reliable hardware rather than high performance hardware. We suggest a node with redundant power supplies and at least 2 hard drives.
For a more detailed analysis of memory usage, check this link out:
https://issues.apache.org/jira/browse/HADOOP-1687
You also might find this question interesting: Hadoop namenode memory usage
There are several technical limits to the NameNode (NN), and facing any of them will limit your scalability.
Memory. The NN consumes about 150 bytes per block. From this you can calculate how much RAM you need for your data. There is a good discussion: Namenode file quantity limit.
IO. The NN does 1 IO for each change to the filesystem (such as creating or deleting a block). So your local IO should allow for enough of them. It is harder to estimate how much you need. Given that the number of blocks is already limited by memory, you will not hit this limit unless your cluster is very big. If it is, consider SSD.
CPU. The NameNode has considerable load keeping track of the health of all blocks on all datanodes. Each datanode periodically reports the state of all its blocks. Again, unless the cluster is very big, this should not be a problem.
Example calculation
200 node cluster
24TB/node
128MB block size
Replication factor = 3
How much memory is required?
# blocks = 200*24*2^20/(128*3)
~13 million blocks
~13,000 MB memory (at about 1GB per million blocks).
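The example calculation can be reproduced with a small sketch. Evaluating the expression exactly (with 1TB = 2^20 MB) gives about 13.1 million blocks. The function name is hypothetical, and the conversion to heap size assumes the rule of thumb of roughly 1GB (taken here as 1000MB) per million blocks quoted elsewhere in this thread.

```python
def estimate_blocks(nodes: int, tb_per_node: int,
                    block_size_mb: int, replication: int) -> int:
    """Number of unique HDFS blocks if the raw capacity were completely full."""
    raw_capacity_mb = nodes * tb_per_node * 2**20  # 1 TB = 2^20 MB
    # Each unique block occupies block_size_mb * replication of raw capacity.
    return raw_capacity_mb // (block_size_mb * replication)


blocks = estimate_blocks(nodes=200, tb_per_node=24,
                         block_size_mb=128, replication=3)

# Assumed rule of thumb: about 1000 MB of heap per million blocks.
heap_mb = blocks / 1_000_000 * 1000

print(f"{blocks:,} blocks")       # 13,107,200 blocks
print(f"~{heap_mb:,.0f} MB heap")
```

This counts only full blocks on a full cluster; many small files would raise the object count per TB and therefore the memory needed.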
I guess we should make the distinction between how namenode memory is consumed by each namenode object and general recommendations for sizing the namenode heap.
For the first case (consumption), AFAIK, each namenode object takes on average 150 bytes of memory. Namenode objects are files, blocks (not counting the replicated copies) and directories. So a file taking 3 blocks uses 4 (1 file and 3 blocks) x 150 bytes = 600 bytes.
For the second case, the recommended heap size for a namenode, it is generally recommended that you reserve 1GB per 1 million blocks. If you calculate this (150 bytes per block) you get 150MB of memory consumption. You can see this is much less than the 1GB per 1 million blocks, but you should also take into account the number of files and directories.
I guess it is a safe side recommendation. Check the following two links for a more general discussion and examples:
Sizing NameNode Heap Memory - Cloudera
Configuring NameNode Heap Size - Hortonworks
Namenode Memory Structure Internals
