We are using Cloudera CDH 5.3. I am facing a problem where the size of "/dfs/dn/current/Bp-12345-IpAddress-123456789/dncp-block-verification.log.curr" and "dncp-block-verification.log.prev" grows to terabytes within hours. Some blog posts mention this is an HDFS bug. A temporary workaround is to stop the DataNode service and delete these files. However, we have observed that the log file grows again on any of the DataNodes (even on the same node after deleting it), so it requires continuous monitoring.
Does anyone have a permanent solution to this problem?
One solution, although slightly drastic, is to disable the block scanner entirely by setting the HDFS DataNode configuration key dfs.datanode.scan.period.hours to 0 (the default is 504 hours, i.e. three weeks). The negative effect of this is that your DNs may not auto-detect corrupted block files (and would instead have to wait for a future client read of the block to detect them); this isn't a big deal if your average replication factor is around 3, but you can consider the change a short-term one until you upgrade to a release that fixes the issue.
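As a sketch, the property can be set in hdfs-site.xml on each DataNode (or via the equivalent safety-valve override in Cloudera Manager); a DataNode restart is needed for it to take effect:

```xml
<!-- hdfs-site.xml on each DataNode: 0 disables the periodic block scanner -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```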
Note that this problem will not occur if you upgrade to CDH 5.4.x or a higher release, which includes the HDFS-7430 rewrite and its associated bug fixes. Those changes did away with the use of such a local file, thereby removing the problem.
I am using Greenplum 6.x and facing issues while performing backup and recovery. Do we have any tool to take a physical backup of the whole cluster, like pgBackRest for Postgres? Further, how can we purge the WAL of the master and each segment, given that we can't take a pg_basebackup of the whole cluster?
Are you using open source Greenplum 6 or a paid version? If paid, you can download the gpbackup/gprestore parallel backup utility (separate from the database software itself), which will back up the whole cluster with a wide variety of options. If using open source, your options are pretty much limited to pg_dump/pg_dumpall.
There is no way to purge the WAL logs that I am aware of. In Greenplum 6, the WAL logs are used to keep all the individual postgres engines in sync throughout the cluster. You would not want to purge these individually.
Jim McCann
VMware Tanzu Data Engineer
I would like to better understand the issues you are facing when performing your backup and recovery.
For Open Source users of the Greenplum Database, the gpbackup/gprestore utilities can be downloaded from the Releases page of the GitHub repo:
https://github.com/greenplum-db/gpbackup/releases
v1.19.0 is the latest.
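As a rough sketch of how these utilities are driven (database name and backup directory are placeholders, not from the original question):

```shell
# Back up the whole cluster in parallel; --backup-dir is a placeholder path
gpbackup --dbname mydb --backup-dir /backups/gpbackup

# Restore from a given backup timestamp (reported by gpbackup on completion)
gprestore --timestamp 20210101120000 --backup-dir /backups/gpbackup
```

These must run on the master host as the Greenplum administrative user.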
There currently isn't a pg_basebackup / WAL-based backup/restore solution for Greenplum Database 6.x.
WAL logs are periodically purged from the master and segments individually (as they are replicated to the mirror and flushed), so no manual purging is required. Have you looked into why the WAL logs are not getting purged? One possible reason is that a mirror in the cluster is down; if that happens, WAL will continue to accumulate on the primary and won't be purged. Run select * from pg_replication_slots; on the master or segment for which WAL is building up to learn more.
If the WAL build-up is caused by a replication slot (for example because a mirror is down), you can use the GUC max_slot_wal_keep_size to cap the disk space WAL may consume; beyond that limit the replication slot is invalidated and stops retaining further WAL.
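A minimal sketch of that check (hostname and port are placeholders; utility mode lets you connect to one segment directly, and availability of max_slot_wal_keep_size depends on your Greenplum build, as noted above):

```shell
# Inspect replication slots on the segment whose WAL keeps growing
PGOPTIONS='-c gp_session_role=utility' psql -h sdw1 -p 6000 -d postgres \
  -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Cap WAL retained for slots cluster-wide, then reload the configuration
gpconfig -c max_slot_wal_keep_size -v '10GB'
gpstop -u
```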
I have a single namenode HDFS cluster with multiple datanodes that store many terabytes of data. I want to enable high availability on that cluster and add another namenode. What is the most efficient and least error-prone way to achieve that? Ideally that would work without any downtime or with a simple restart.
The two options that came to mind are:
Edit the configuration of the namenode to facilitate the HA features and restart it. Afterwards add the second namenode and reconfigure and restart the datanodes, so that they are aware that the cluster is HA now.
Create an identical cluster in terms of data, but with two namenodes. Then migrate the data from the old datanodes to the new datanodes and finally adjust the pointers of all HDFS clients.
The first approach seems easier, but requires some downtime and I am not sure if it is even possible. The second one is somewhat cleaner, but there are potential problems with the data migration and the pointer adjustments.
You won't be able to do this in-place without any down time; a non-HA setup is exactly that, not highly available, and thus any code/configuration changes require downtime.
To incur the least amount of downtime while doing this in-place, you probably want to:
Set up configurations for an HA setup. This includes things like a shared edits directory or journal nodes - see https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html or https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html.
Create a new fsImage using the hdfs dfsadmin command. This will ensure that the NameNode is able to restart quickly (on startup, the NN will read the most recent fsImage, then apply all edits from the EditLog that were created after that fsImage).
Restart your current NameNode and put it into active mode.
Start the new NameNode in standby.
Update configurations on DataNodes and restart them to apply.
Update configurations on other clients and restart to apply.
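The fsImage step in the list above can be sketched with hdfs dfsadmin, assuming you run it as the HDFS superuser while the current NameNode is still up:

```shell
# Enter safe mode so the namespace is read-only while checkpointing
hdfs dfsadmin -safemode enter

# Write a fresh fsImage so the restarted NameNode has few edits to replay
hdfs dfsadmin -saveNamespace

# Leave safe mode before restarting into the HA configuration
hdfs dfsadmin -safemode leave
```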
At this point everything will be HA-aware, and the only downtime incurred was a quick restart of the active NN - equivalent to what you would experience during any code/configuration change in a non-HA setup.
Your second approach should work, but remember that you will need twice as much hardware, and maintaining consistency across the two clusters during the migration may be difficult.
I'm new to Hadoop and learning to use it by working with a small cluster where each node is an Ubuntu Server VM. The cluster consists of 1 name node and 3 data nodes with a replication factor of 3. After a power loss on the machine hosting the VMs, all files stored in the cluster were corrupted, with the blocks backing those files missing. No queries were running at the time power was lost and no files were being written to or read from the cluster.
If I shut down the VMs correctly (even without first stopping the Hadoop cluster), then the data is preserved and I don't run into any issues with missing or corrupted blocks.
The only information I've been able to find suggested setting dfs.datanode.sync.behind.writes to true, but this did not resolve the issue (killing the VMs from the host causes the same issue as a power failure). The information I found here seems to indicate this property will only have an effect when writing data to the disk.
I also tried running hdfs namenode -recover, but this did not resolve the issue. Ultimately I had to remove the data stored in the dfs.namenode.name.dir directory, reboot each VM in the cluster to remove any Hadoop files in /tmp, and reformat the name node before copying the data back into the cluster from local file storage.
I understand that having all nodes in the cluster running on the same hardware and only 3 data nodes to go with a replication factor of 3 is not an ideal configuration, but I'd like a way to ensure that any data that is already written to disk is not corrupted by a power loss. Is there a property or other configuration I need to implement to avoid this in the future (besides separate hardware, more nodes, power backup, etc.)?
EDIT: To clarify further, the issue I'm trying to resolve is data corruption, not cluster availability. I understand I need to make changes to the overall cluster architecture to improve reliability, but I'd like a way to ensure data is not lost even in the event of a cluster-wide power failure.
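Not a full fix for the architecture, but one durability-related setting worth trying alongside dfs.datanode.sync.behind.writes is dfs.datanode.synconclose, which makes DataNodes fsync block files to disk when each block is closed, at some cost in write throughput; a sketch for hdfs-site.xml:

```xml
<!-- hdfs-site.xml on each DataNode: fsync block data and metadata on close,
     trading write throughput for durability across power failures -->
<property>
  <name>dfs.datanode.synconclose</name>
  <value>true</value>
</property>
```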
I'm running into a few issues when writing to HDFS (through Flume's HDFS sink). I think these are caused mostly by IO timeouts, but I'm not sure.
I end up with files that stay open for write for a very long time and give the error "Cannot obtain block length for LocatedBlock{... }". It can be fixed if I explicitly recover the lease. I'm trying to understand what could cause this. I've been trying to reproduce it outside Flume but have had no luck yet. Could someone help me understand when such a situation could happen, where a file on HDFS ends up not being closed and stays like that until manual intervention recovers the lease?
I thought the lease was recovered automatically based on the soft and hard limits. I've tried killing my sample code that writes to HDFS (and also disconnecting the network to make sure no shutdown hooks are executed) to leave a file open for write, but couldn't reproduce it.
We have had recurring problems with Flume, but it's substantially better with Flume 1.6+. We have an agent running on servers external to our Hadoop cluster with HDFS as the sink. The agent is configured to roll to new files (close current, and start a new one on the next event) hourly.
Once an event is queued on the channel, the Flume agent operates in a transactional manner: the file is sent, but the event is not dequeued until the agent can confirm a successful write to HDFS.
In cases where HDFS is unavailable to the agent (restart, network issue, etc.), there are files left on HDFS that are still open. Once connectivity is restored, the Flume agent will find these stranded files and either continue writing to them or close them normally.
However, we have found several edge cases where files seem to get stranded and left open, even after the hourly rolling has successfully renamed the file. I am not sure if this is a bug, a configuration issue, or just the way it is. When it happens, it completely messes up subsequent processing that needs to read the file.
We can find these files with hdfs fsck /foo/bar -openforwrite and can successfully hdfs dfs -mv them then hdfs dfs -cp from their new location back to their original one -- a horrible hack. We think (but have not confirmed) that hdfs debug recoverLease -path /foo/bar/openfile.fubar will cause the file to be closed, which is far simpler.
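Put together, the detection and lease-recovery commands look like this (paths are placeholders carried over from the example above):

```shell
# List files under /foo/bar that are still open for write
hdfs fsck /foo/bar -openforwrite

# Ask the NameNode to recover the lease and close one stuck file
hdfs debug recoverLease -path /foo/bar/openfile.fubar -retries 5
```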
Recently we had a case where we stopped HDFS for a couple minutes. This broke the flume connections, and left a bunch of seemingly stranded open files in several different states. After HDFS was restarted, the recoverLease option would close the files, but moments later there would be more files open in some intermediate state. Within an hour or so, all the files had been successfully "handled" -- my assumption is that these files were reassociated with the agent channels. Not sure why it took so long -- not that many files. Another possibility is that it's pure HDFS cleaning up after expired leases.
I am not sure this is an answer to the question (which is also 1 year old now :-) ) but it might be helpful to others.
Is it possible to force ZooKeeper to take a snapshot at a specific time or at a specific interval, not only via 'snapCount' in the ZooKeeper configuration?
What I'd like to do is schedule a snapshot each day, i.e. every 24h, either configured in ZooKeeper or forced via the command line.
This is so I can roll back my ZooKeeper to a known state in time, in case the data gets corrupted or someone adds incorrect information.
The only method I have found to force a snapshot is to restart the ZooKeeper node, which makes it create a new snapshot with the latest data. That is quite a good solution if you use an ensemble. At least you will get consistent data if you take both the snapshot and the latest log files (snapshots are fuzzy, and using them together with the transaction logs is strongly recommended).
But considering this, forcing the snapshot is not really necessary, I suppose (you should take the log files anyway). Copying log files from an active ZooKeeper is also not a good idea...
So stop the ZooKeeper node and copy its dataDir (along with dataLogDir if it's separate) to get a guaranteed healthy backup with consistent data for the moment in time when you stopped zk.
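A sketch of that backup, with placeholder paths (check dataDir and dataLogDir in your zoo.cfg):

```shell
# Placeholder paths; adjust to match your zoo.cfg
DATADIR=/var/lib/zookeeper
BACKUPDIR=/backups/zk-$(date +%Y%m%d)

zkServer.sh stop                 # stop this ensemble member
cp -a "$DATADIR" "$BACKUPDIR"    # copy snapshots and transaction logs
zkServer.sh start                # rejoin the ensemble
```

In an ensemble the remaining members keep serving requests while this one is stopped.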
I don't believe ZooKeeper internally supports this functionality. However, you can always manually copy the latest snapshot and log file (one each) somewhere for safekeeping. Rolling back to that time is then a matter of copying the files back into the data directory when needed.
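For example (paths hypothetical; snapshot and log files are named after the zxid at which they start):

```shell
# Newest snapshot and newest transaction log in the data directory
DATADIR=/var/lib/zookeeper/version-2
BACKUPDIR=/backups/zk

# Copy the most recent snapshot/log pair somewhere safe;
# restoring means copying that pair back into $DATADIR
cp "$(ls -t "$DATADIR"/snapshot.* | head -1)" "$BACKUPDIR/"
cp "$(ls -t "$DATADIR"/log.*      | head -1)" "$BACKUPDIR/"
```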