On checkpointing, the Definitive Guide says
1. The secondary asks the primary to roll its edits file, so new edits go to a new file
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET)
and at the end of checkpointing the secondary namenode sends the updated fsimage back to the namenode.
Now the secondary namenode has the latest fsimage. During the next checkpoint, will the secondary namenode copy the fsimage from the namenode again? If so, why? Can't it simply compare the two using a checksum?
Yes. When the edits file on the namenode grows to a specific size (default: fs.checkpoint.size = 4194304 bytes), or the checkpoint period elapses, the secondary namenode copies the fsimage and the edits file from the namenode again.
This code from SecondaryNameNode.java shows that:
long size = namenode.getEditLogSize();
if (size >= checkpointSize ||
    now >= lastCheckpointTime + 1000 * checkpointPeriod) {
  doCheckpoint();
  lastCheckpointTime = now;
}
Please check when doCheckpoint() is called.
The answer to "why" lies in the design Hadoop follows (I don't know why they chose this design, though). See in the code below what is being done (I'm keeping only the statements relevant to this question). You can see how the functions downloadCheckpointFiles(sig) and doMerge(sig) are called.
/**
 * Create a new checkpoint
 */
void doCheckpoint() throws IOException {
  // ---other code skipped---

  // Tell the namenode to start logging transactions in a new edit file
  // Returns a token that would be used to upload the merged image.
  CheckpointSignature sig = (CheckpointSignature) namenode.rollEditLog();

  downloadCheckpointFiles(sig);   // Fetch fsimage and edits
  doMerge(sig);                   // Do the merge

  //
  // Upload the new image into the NameNode. Then tell the Namenode
  // to make this new uploaded image as the most current image.
  //
  putFSImage(sig);
  namenode.rollFsImage();
  checkpointImage.endCheckpoint();

  // ---other code skipped---
}
Then, how is downloadCheckpointFiles(sig) called from within doCheckpoint() above? See the code below:
/**
 * Download <code>fsimage</code> and <code>edits</code>
 * files from the name-node.
 * @throws IOException
 */
private void downloadCheckpointFiles(final CheckpointSignature sig) throws IOException {
  try {
    UserGroupInformation.getCurrentUser().doAs(new PrivilegedExceptionAction<Void>() {

      @Override
      public Void run() throws Exception {
        // get fsimage
        String fileid = "getimage=1";
        File[] srcNames = checkpointImage.getImageFiles();
        assert srcNames.length > 0 : "No checkpoint targets.";
        TransferFsImage.getFileClient(fsName, fileid, srcNames);
        LOG.info("Downloaded file " + srcNames[0].getName() + " size " +
                 srcNames[0].length() + " bytes.");

        // get edits file
        fileid = "getedit=1";
        srcNames = checkpointImage.getEditsFiles();
        assert srcNames.length > 0 : "No checkpoint targets.";
        TransferFsImage.getFileClient(fsName, fileid, srcNames);
        LOG.info("Downloaded file " + srcNames[0].getName() + " size " +
                 srcNames[0].length() + " bytes.");

        checkpointImage.checkpointUploadDone();
        return null;
      }
    });
  } catch (InterruptedException e) {
    throw new RuntimeException(e);
  }
}
And, for your last question - "can't it simply compare the two using a checksum?" -
One possible reason is that they don't want to take any risk: the checksums of two different files can occasionally be the same. Say the namenode has an fsimage that differs from the one on the secondary namenode, but their checksums somehow come out equal; this might happen and you would never know. Copying seems to be the safest option they have to ensure the two copies are identical.
Hope this helps.
The information in the MVStore docs on backing up a database is a little vague, and I'm not familiar with all the concepts and terminology, so I wanted to see if the approach I came up with makes sense.
I'm a Clojure programmer, so please forgive my Java here:
// db is an MVStore instance
FileStore fs = db.getFileStore();
FileOutputStream fos = new java.io.FileOutputStream(pathToBackupFile);
FileChannel outChannel = fos.getChannel();
try {
    db.commit();
    db.setReuseSpace(false);
    ByteBuffer bb = fs.readFully(0, fs.size());
    outChannel.write(bb);
}
finally {
    outChannel.close();
    db.setReuseSpace(true);
}
Here's what it looks like in Clojure in case my Java is bad:
(defn backup-db
  [db path-to-backup-file]
  (let [fs (.getFileStore db)
        backup-file (java.io.FileOutputStream. path-to-backup-file)
        out-channel (.getChannel backup-file)]
    (try
      (.commit db)
      (.setReuseSpace db false)
      (let [file-contents (.readFully fs 0 (.size fs))]
        (.write out-channel file-contents))
      (finally
        (.close out-channel)
        (.setReuseSpace db true)))))
My approach seems to work, but I wanted to make sure I'm not missing anything or see if there's a better way. Thanks!
P.S. I used the H2 tag because MVStore doesn't exist and I don't have enough reputation to create it.
The docs currently say:
The persisted data can be backed up at any time, even during write operations (online backup). To do that, automatic disk space reuse needs to be first disabled, so that new data is always appended at the end of the file. Then, the file can be copied. The file handle is available to the application. It is recommended to use the utility class FileChannelInputStream to do this.
The classes FileChannelInputStream and FileChannelOutputStream convert a java.nio.FileChannel into a standard InputStream and OutputStream. There is existing H2 code in BackupCommand.java that shows how to use them. We can improve on it by using Java 9's input.transferTo(output) to copy the data:
public void backup(MVStore s, File backupFile) throws Exception {
    try {
        s.commit();
        s.setReuseSpace(false);
        try (RandomAccessFile outFile = new java.io.RandomAccessFile(backupFile, "rw");
             FileChannelOutputStream output = new FileChannelOutputStream(outFile.getChannel(), false)) {
            try (FileChannelInputStream input = new FileChannelInputStream(s.getFileStore().getFile(), false)) {
                input.transferTo(output);
            }
        }
    } finally {
        s.setReuseSpace(true);
    }
}
Note that when you create the FileChannelInputStream you have to pass false to tell it not to close the underlying file channel when the stream is closed. If you don't do that, it will close the file that your FileStore is trying to use. That code uses try-with-resources syntax to make sure that the output file is properly closed.
To try this, I checked out the MVStore code and modified TestMVStore to add a testBackup() method similar to the existing testSimple() code:
private void testBackup() throws Exception {
    // write some records like testSimple
    String fileName = getBaseDir() + "/" + getTestName();
    FileUtils.delete(fileName);
    MVStore s = openStore(fileName);
    MVMap<Integer, String> m = s.openMap("data");
    for (int i = 0; i < 3; i++) {
        m.put(i, "hello " + i);
    }

    // create a backup
    String fileNameBackup = getBaseDir() + "/" + getTestName() + ".backup";
    FileUtils.delete(fileNameBackup);
    backup(s, new File(fileNameBackup));

    // this throws if you accidentally close the input channel you get from the store
    s.close();

    // open the backup and verify
    s = openStore(fileNameBackup);
    m = s.openMap("data");
    for (int i = 0; i < 3; i++) {
        assertEquals("hello " + i, m.get(i));
    }
    s.close();
}
With your example, you are reading the whole store into a ByteBuffer, which must fit into memory. The stream transferTo method instead copies through an internal buffer that is currently (as of Java 11) 8192 bytes.
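If you would rather stay at the FileChannel level (as in your original snippet) without pulling the whole file into a single ByteBuffer, a chunked positional copy along these lines should also work. This is only a sketch under the same commit/setReuseSpace approach as above; the method name, buffer size and error handling are my own choices, not MVStore requirements:

import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import org.h2.mvstore.MVStore;

// Sketch only: copy the store file in fixed-size chunks so the backup never
// has to hold the whole store in memory.
static void backupChunked(MVStore s, String pathToBackupFile) throws Exception {
    try {
        s.commit();
        s.setReuseSpace(false);                            // new data is only appended while we copy
        try (FileOutputStream fos = new FileOutputStream(pathToBackupFile);
             FileChannel out = fos.getChannel()) {
            FileChannel in = s.getFileStore().getFile();   // shared with the store, do not close it
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            long position = 0;
            long size = in.size();
            while (position < size) {
                buffer.clear();
                int read = in.read(buffer, position);      // positional read, channel position untouched
                if (read < 0) {
                    break;
                }
                buffer.flip();
                while (buffer.hasRemaining()) {
                    out.write(buffer);
                }
                position += read;
            }
        }
    } finally {
        s.setReuseSpace(true);
    }
}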
I wrote a Spark application that generates HFiles to be used for bulk loading with the LoadIncrementalHFiles command later. As the source data pool is very big, the input files are split into iterations that are processed one after the other. Each iteration creates its own HFile directory, so my HDFS structure looks like this:
/user/myuser/map_data/hfiles_0
... /hfiles_1
... /hfiles_2
... /hfiles_3
...
There are about 500 files in this map_data directory, so I'm searching for a way to automatically call the LoadIncrementalHFiles function to process these subdirectories in iterations later as well.
The corresponding command would be this:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable
I need to change this into an iterative command, as this command does not work with subdirectories (it fails when I call it with the /user/myuser/map_data directory)!
I tried to use a Java Process instance to execute the command above automatically, but this doesn't seem to do anything (no output to the console and also no more rows in my HBase table).
Using the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Java class from my code also doesn't work: it's also not responding!
Does anybody have a working example for me? Or is there a parameter that lets me run the above hbase command on the parent directory? I'm working with HBase 1.1.2 in a Hortonworks Data Platform 2.5 cluster.
EDIT: I tried to run the LoadIncrementalHFiles command from a Hadoop client Java application, but I'm getting an exception related to snappy compression; see Run LoadIncrementalHFiles from Java client
The solution was to split the hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no /user/myuser/map_data/hfiles_0 mytable command into its single parts (one array element per command part); see this Java code snippet:
TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);

for (String hFileDir : subDirs) {
    try {
        String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;
        // The key part: the command is built as separate arguments
        String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
                             "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
        ProcessBuilder pb = new ProcessBuilder(execCode);
        pb.redirectErrorStream(true);
        final Process p = pb.start();

        // Write the output of the Process to the console
        new Thread(new Runnable() {
            public void run() {
                BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line = null;
                try {
                    while ((line = input.readLine()) != null)
                        System.out.println(line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();

        // Wait for the end of the execution
        p.waitFor();
        ...
}
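The getHFileDirectories helper used above is not shown in the snippet. A minimal sketch of what such a helper could look like, assuming it simply lists the immediate subdirectories of the given path via the Hadoop FileSystem API (this implementation is my assumption, not the original code):

import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: collect the names of the immediate subdirectories
// (e.g. "hfiles_0", "hfiles_1", ...) of the given HDFS path.
private static TreeSet<String> getHFileDirectories(Path path, Configuration conf) throws IOException {
    TreeSet<String> dirs = new TreeSet<String>();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(path)) {
        if (status.isDirectory()) {
            dirs.add(status.getPath().getName());
        }
    }
    return dirs;
}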
I read in Hadoop Operations that if a datanode fails during the write process,

A new replication pipeline containing the remaining datanodes is opened and the write resumes. At this point, things are mostly back to normal and the write operation continues until the file is closed. The namenode will notice that one of the blocks in the file is under-replicated and will arrange for a new replica to be created asynchronously. A client can recover from multiple failed datanodes provided at least a minimum number of replicas are written (by default, this is one).
But what happens if all the datanodes fail, i.e. the minimum number of replicas is not written?
Will the client ask the namenode for a new list of datanodes, or will the job fail?
Note: my question is NOT what happens when all the datanodes in the cluster fail. The question is what happens if all the datanodes to which the client was supposed to write fail during the write operation.
Suppose the namenode told the client to write block B1 to datanodes D1 in Rack1, D2 in Rack2 and D3 in Rack1. There might be other racks in the cluster as well (Rack 4, 5, 6, ...). If Rack1 and Rack2 failed during the write process, the client knows that the data was not written successfully since it didn't receive the ACK from the datanodes. At this point, will it ask the namenode for a new set of datanodes, maybe in the racks that are still alive?
OK, I got what you are asking. DFSClient will get a list of datanodes from the namenode where it is supposed to write a block (say A) of a file. DFSClient will iterate over that list of datanodes and write block A to those locations. If the block write fails on the first datanode, it will abandon the block write and ask the namenode for a new set of datanodes where it can attempt to write again.
Here is the sample code from DFSClient that shows that:
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
    // ----- other code ------
    do {
        hasError = false;
        lastException = null;
        errorIndex = 0;
        retry = false;
        nodes = null;
        success = false;

        long startTime = System.currentTimeMillis();
        lb = locateFollowingBlock(startTime);
        block = lb.getBlock();
        accessToken = lb.getBlockToken();
        nodes = lb.getLocations();

        //
        // Connect to first DataNode in the list.
        //
        success = createBlockOutputStream(nodes, clientName, false);

        if (!success) {
            LOG.info("Abandoning block " + block);
            namenode.abandonBlock(block, src, clientName);

            // Connection failed. Let's wait a little bit and retry
            retry = true;
            try {
                if (System.currentTimeMillis() - startTime > 5000) {
                    LOG.info("Waiting to find target node: " + nodes[0].getName());
                }
                Thread.sleep(6000);
            } catch (InterruptedException iex) {
            }
        }
    } while (retry && --count >= 0);

    if (!success) {
        throw new IOException("Unable to create new block.");
    }
    return nodes;
}
I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
The foo files have a key=IntWritable, value=IntWritable format (I used a SequenceFileOutputFormat).
Now, I want to start another map-reduce job. The mapper is fine, but each reducer is required to read the entire "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in public void configure(JobConf conf):
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i = 0; i < status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    try {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // do the code for all the pairs.
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The code runs well on a single machine, but I'm not sure if it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() does not have a scheme defined, and hence the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default setting, i.e. the local file system, is used.
Your program is therefore working against workingDir/out/foo on the local file system. It should run on the cluster as well, but it will look at the local file system.
With the above said, I'm not sure why you need the entire set of files from the foo directory; you may want to consider other designs. If it is really needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer, as sketched below. Needless to say, close the opened files in the overridden cleanup method of your reducer. While the files can be read in reducers this way, map/reduce programs are not designed for this kind of functionality.
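A rough sketch of that idea, assuming the newer mapreduce API and an explicit hdfs:// path for the first job's output (the path, class name and key/value types here are placeholders, not taken from the question):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: read the first job's SequenceFile output from HDFS once, in setup().
public class FooAwareReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Placeholder path: an explicit scheme forces HDFS regardless of fs.defaultFS
        Path dir = new Path("hdfs:///user/myuser/out/foo");
        FileSystem fs = FileSystem.get(URI.create(dir.toString()), conf);
        for (FileStatus status : fs.listStatus(dir)) {
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
            try {
                IntWritable key = new IntWritable();
                IntWritable value = new IntWritable();
                while (reader.next(key, value)) {
                    // cache the pairs in a member field for use in reduce()
                }
            } finally {
                reader.close();
            }
        }
    }
}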
My purpose is to migrate the data from HBase tables to flat (say, CSV-formatted) files.
I used
TableMapReduceUtil.initTableMapperJob(tableName, scan,
    GetCustomerAccountsMapper.class, Text.class, Result.class,
    job);
for scanning through the HBase table and TableMapper for the Mapper.
My challenge is in forcing the Reducer to dump the row values (which are normalized in a flattened format) to the local (or HDFS) file system.
My problem is that I can see neither the logs of the Reducer nor any files at the path that I have mentioned in the Reducer.
It's my second or third MR job and my first serious one. After trying hard for two days, I am still clueless about how to achieve my goal.
It would be great if someone could show me the right direction.
Here is my reducer code:
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {

    FileSystem fs = LocalFileSystem.getLocal(new Configuration());
    Path dir = new Path("/data/HBaseDataMigration/" + tableName + "_Reducer" + "/" + key.toString());
    FSDataOutputStream fsOut = fs.create(dir, true);

    for (Result row : rows) {
        try {
            String normRow = NormalizeHBaserow(
                Bytes.toString(key.getBytes()), row, tableName);
            fsOut.writeBytes(normRow);
            //context.write(new Text(key.toString()), new Text(normRow));
        } catch (BadHTableResultException ex) {
            throw new IOException(ex);
        }
    }
    fsOut.flush();
    fsOut.close();
}
My configuration for the Reducer output:
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and, once finished, use hadoop fs to export the file:
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase, take a look at this article on OutputFormat on InfoQ.
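As a sketch of the first suggestion, the reducer can simply emit the normalized rows through the context and let the job's configured output format write them under the FileOutputFormat output path. NormalizeHBaserow, BadHTableResultException and tableName below come from the question's code; everything else (the job's output key/value classes being Text, for example) is an assumption on my part:

// Sketch only: write through the framework instead of creating files by hand.
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {
    for (Result row : rows) {
        try {
            String normRow = NormalizeHBaserow(Bytes.toString(key.getBytes()), row, tableName);
            // Each call adds one record under the path given to FileOutputFormat.setOutputPath()
            context.write(key, new Text(normRow));
        } catch (BadHTableResultException ex) {
            throw new IOException(ex);
        }
    }
}

Once the job has finished, the part files under that output path can be pulled to the local machine with hadoop fs -get as shown above.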