Can I Increase Storage on an Existing Greenplum Data Node (VM)? - greenplum

Our Greenplum storage usage is over 80%, and we do not want to add more nodes. Can we increase the Greenplum storage capacity on the existing Greenplum nodes (the nodes are VMs)? Thanks!

If you don't have logical volumes, you can add a VMDK, move the segment data over, then symlink it back to the original segment directory, as sketched in the commands after the steps below.
1) Shut down the GPDB cluster
2) Mount the new VMDK drives
3) Move /data/primary/segN to the new mount points
4) Symlink each segment back into place: ln -s <new_mount>/segN /data/primary/segN
5) Restart the cluster
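
A minimal sketch of those steps, assuming the new VMDK shows up as /dev/sdb and is mounted at /data2 (both names are placeholders for your environment):

# 1) stop the cluster
gpstop -a
# 2) format and mount the new VMDK (add a matching /etc/fstab entry as well)
mkfs.xfs /dev/sdb
mkdir -p /data2
mount /dev/sdb /data2
# 3) move a segment directory to the new disk (repeat per segment)
mv /data/primary/seg0 /data2/seg0
# 4) link it back to the original path
ln -s /data2/seg0 /data/primary/seg0
# 5) restart the cluster
gpstart -a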

Related

Elasticsearch Backup - 800 GB index, 200 GB of disk space left

I have an Elasticsearch index that is currently 800 GB.
I want to create a backup snapshot and store it offsite in the cloud. However, I only have 200 GB of local disk space remaining.
Is there a way I can back up the whole index to the cloud, perhaps in batches, since I don't have enough local disk space remaining to store it entirely? What is the best solution here?
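
A common route is to snapshot straight to a cloud repository, since snapshot data is streamed to the repository rather than staged on local disk. A rough sketch using the snapshot API, assuming S3 repository support (repository-s3) is available and that my_s3_repo, my-es-backups and my_index are placeholder names:

# register an S3-backed snapshot repository (bucket name is a placeholder)
curl -X PUT "localhost:9200/_snapshot/my_s3_repo" -H 'Content-Type: application/json' -d '{"type": "s3", "settings": {"bucket": "my-es-backups"}}'
# snapshot the index; the data is uploaded to the repository, not written locally first
curl -X PUT "localhost:9200/_snapshot/my_s3_repo/snapshot_1?wait_for_completion=false" -H 'Content-Type: application/json' -d '{"indices": "my_index"}'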

How to move dfs data to a new disk

[newbie question on Hadoop]
I currently have a single-node installation of Hadoop 2.7.2.
The machine is running out of disk space:
df -h gives
Filesystem Size Used Avail Use% Mounted on
/dev/vdb 50G 39G 12G 78% /app
As soon as the usage percentage goes up to 80%, the cluster hangs. Therefore, I should add more disk to the machine.
What would be the best way to increase the disk space?
Approach A:
Add a new disk (/dev/vdc)
Mount it on a new folder (e.g. /hadoop_data)
update hdfs-site.xml to add the mount point to the dfs.datanode.data.dir property
Downsides of approach A:
does not prevent the first configured folder from getting full
kind of 'messy' since all the data are scattered across several mount points
Approach B:
stop hadoop
Add a new disk (/dev/vdc)
Mount this new disk as /app_new
rsync the data from /app to /app_new
swap the mount points between the two disks
start hadoop
Downside of Approach B:
if Hadoop keeps any reference to the disk ID, this will probably not work
What would be the 'cleanest' option?
Is there a third way?
Follow Approach A.
Just add a few more steps:
Add the new directory to the datanode directories setting (dfs.datanode.data.dir), so that your cluster is aware that you have added a new datanode directory.
Now simply run the HDFS balancer command, and the data will be spread evenly across both datanode directories and your error will be gone.
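
As a rough sketch of those extra steps, assuming the existing data directory is /app/hadoop/dfs/data and the new disk is mounted at /hadoop_data (both paths are assumptions about your layout):

# In hdfs-site.xml, list both data directories, comma-separated:
#   <property>
#     <name>dfs.datanode.data.dir</name>
#     <value>/app/hadoop/dfs/data,/hadoop_data/dfs/data</value>
#   </property>
# Restart the datanode so it picks up the new directory
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
# Run the balancer; -threshold is the allowed deviation from average utilisation, in percent
hdfs balancer -threshold 10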

Increase disk space on Hadoop worker nodes

I have a Hadoop cluster set up and I'm starting to run out of disk space. I have an iSCSI LUN presented to all my servers, and it is formatted with ext4 running on LVM. I want to know: if I present a new iSCSI LUN and add it to the ext4 filesystem, will HDFS see the new space, or is there something else I have to do? Could I just increase the LUN from the storage side?
In the cluster, if any of the nodes reaches its limit, we can relieve the disk space issue by moving some of the files to the other nodes.
If there is no space left on the entire cluster, then a new LUN should be added.
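
If you instead grow the existing LUN on the storage side, the usual sequence for ext4 on LVM looks roughly like this; device and volume names (/dev/sdb, vg_hadoop/lv_data) are placeholders, and the exact rescan step depends on your iSCSI setup:

# rescan the iSCSI session so the OS sees the larger LUN
iscsiadm -m session --rescan
# grow the physical volume, the logical volume, then the filesystem (ext4 resizes online)
pvresize /dev/sdb
lvextend -l +100%FREE /dev/vg_hadoop/lv_data
resize2fs /dev/vg_hadoop/lv_data
# HDFS capacity follows the mounted filesystem, so the extra space
# should show up afterwards (check with: hdfs dfsadmin -report)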

Hadoop adding datanode with smaller hard drives

We are planning to add two new datanodes to our Hadoop cluster. After googling for a day, I still cannot answer this question:
What will happen if the hard drives are smaller on the new datanodes?
Will this result in a smaller total HDFS size?
Here is an example
Datanode1 with 1TB
Datanode2 with 1TB
Total storage = 2TB
Adding one more node with 500GB disk
Datanode3 with 500GB
What will be the total HDFS storage? 2.5 TB or 1.5 TB?
If it will be 2.5 TB (I hope so), how does Hadoop balance the storage across datanodes with different hard drive sizes?
The total HDFS capacity will be 2.5 TB. The existing blocks will stay where they are and won't be moved to the new node once it is added to the cluster. To move some of the blocks from overloaded to underloaded nodes, use the bin/start-balancer.sh and bin/stop-balancer.sh scripts in the Hadoop installation.
The block placement policy will determine where the blocks go. Since the new node's HDD is empty, there is a higher probability that blocks of new files put into HDFS will land there.
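
For reference, a minimal sketch of running the balancer (the scripts live under sbin/ in Hadoop 2.x, bin/ in older releases; $HADOOP_HOME is a placeholder for your installation path):

# start the balancer; -threshold 10 allows each node's utilisation to deviate
# up to 10 percentage points from the cluster average
$HADOOP_HOME/sbin/start-balancer.sh -threshold 10
# stop it early if needed
$HADOOP_HOME/sbin/stop-balancer.sh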

Transferring whole HDFS from one Cluster to another

I have a lot of Hive tables stored in HDFS on a test cluster with 5 nodes. The data should be around 70 GB * 3 (replication). Now I want to transfer the whole setup to a different environment with many more nodes. A network connection between the two clusters is not possible.
The thing is that I don't have much time with the new cluster and also no possibility to test the transfer with another test environment. Therefore I need a solid plan. :)
What options do I have?
How can I transfer the Hive setup with a minimum of configuration effort on the new cluster?
Is it possible to just copy the HDFS directories of the 5 nodes to 5 nodes of the new cluster, then add the rest of the nodes to the new cluster and start the balancer?
Without a network connection, it will be tricky!
I would
Copy the files out of HDFS onto some kind of removable storage (USB stick, external HDD, etc.)
Move the storage to the new cluster
Copy the files back into HDFS
Note that this won't preserve metadata like file creation/last access time, and, more importantly, ownership and permissions.
Small-scale testing of this process should be pretty simple.
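
A rough sketch of the copy-out/copy-in steps, assuming the removable drive is mounted at /mnt/usb and the tables live under the default Hive warehouse path (both paths are assumptions):

# on the old cluster: copy the warehouse out of HDFS onto the removable drive
hdfs dfs -copyToLocal /user/hive/warehouse /mnt/usb/warehouse
# on the new cluster: copy it back into HDFS
hdfs dfs -copyFromLocal /mnt/usb/warehouse /user/hive/warehouse
# ownership and permissions are not preserved, so reset them afterwards, e.g.
hdfs dfs -chown -R hive:hive /user/hive/warehouse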
If you can get (even temporarily) network connectivity between the two clusters, then distcp would be the way to go. It uses MapReduce to parallelise the transfers, potentially resulting in massive time savings.
You can copy directories and files from one cluster to another using the hadoop distcp command.
Here is a small example that describes its usage:
http://souravgulati.webs.com/apps/forums/topics/show/8534378-hadoop-copy-files-from-one-hadoop-cluster-to-other-hadoop-cluster
You can copy data by using this command:
sudo -u hdfs hadoop --config {PathtotheVpcCluster}/vpcCluster distcp hdfs://SourceIP:8020/user/hdfs/WholeData hdfs://DestinationIP:8020/user/hdfs/WholeData
