What happens when all the datanodes fail in hadoop? - hadoop

I read from hadoop operations that if a datanode fails during writing process,
A new replication pipeline containing the remaining datanodes is
opened and the write resumes. At this point, things are mostly back to
normal and the write operation continues until the file is closed. The
namenode will notice that one of the blocks in the file is
under-replicated and will arrange for a new replica to be created
asynchronously. A client can recover from multiple failed datanodes
provided at least a minimum number of replicas are written (by
default, this is one).
But what happens if all of those datanodes fail, i.e. the minimum number of replicas is not written?
Will the client ask the namenode for a new list of datanodes, or will the job fail?
Note: My question is NOT what happens when all the datanodes in the cluster fail. The question is what happens if all the datanodes that the client was supposed to write to fail during the write operation.
Suppose the namenode told the client to write block B1 to datanodes D1 in Rack1, D2 in Rack2 and D3 in Rack1. There might be other racks in the cluster as well (Rack 4, 5, 6, ...). If Rack1 and Rack2 fail during the write, the client knows that the data was not written successfully, since it did not receive the ACK from the datanodes. At this point, will it ask the namenode for a new set of datanodes, perhaps in the racks that are still alive?

OK, I see what you are asking. DFSClient gets from the namenode a list of datanodes to which it is supposed to write a block (say A) of a file, and it writes block A to those locations. If it cannot connect to the first datanode in the list, it abandons the block and asks the namenode for a new set of datanodes on which it can attempt the write again.
Here is the sample code from DFSClient that shows this:
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
    //----- other code ------
    do {
        hasError = false;
        lastException = null;
        errorIndex = 0;
        retry = false;
        nodes = null;
        success = false;
        long startTime = System.currentTimeMillis();
        lb = locateFollowingBlock(startTime);
        block = lb.getBlock();
        accessToken = lb.getBlockToken();
        nodes = lb.getLocations();
        //
        // Connect to first DataNode in the list.
        //
        success = createBlockOutputStream(nodes, clientName, false);
        if (!success) {
            LOG.info("Abandoning block " + block);
            namenode.abandonBlock(block, src, clientName);
            // Connection failed. Let's wait a little bit and retry
            retry = true;
            try {
                if (System.currentTimeMillis() - startTime > 5000) {
                    LOG.info("Waiting to find target node: " + nodes[0].getName());
                }
                Thread.sleep(6000);
            } catch (InterruptedException iex) {
            }
        }
    } while (retry && --count >= 0);
    if (!success) {
        throw new IOException("Unable to create new block.");
    }
    return nodes;
}
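The retry behaviour of that loop can be tried out without a cluster. What follows is only a minimal sketch: `BlockWriteRetry`, the `allocator` supplier (standing in for `locateFollowingBlock`) and the `connector` predicate (standing in for `createBlockOutputStream`) are hypothetical names made up for illustration, not Hadoop APIs, and the sleep between attempts is left out.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-in for the DFSClient retry loop; none of these names are Hadoop APIs.
final class BlockWriteRetry {
    /**
     * Asks the allocator for a list of datanodes and tries to connect.
     * On failure the list is dropped and a fresh one is requested,
     * for at most maxRetries + 1 attempts in total.
     */
    static List<String> allocatePipeline(Supplier<List<String>> allocator,
                                         Predicate<List<String>> connector,
                                         int maxRetries) throws IOException {
        int remaining = maxRetries;
        do {
            List<String> nodes = allocator.get();   // plays the role of locateFollowingBlock()
            if (connector.test(nodes)) {            // plays the role of createBlockOutputStream()
                return nodes;
            }
            // The real client calls namenode.abandonBlock(...) and sleeps here.
        } while (--remaining >= 0);
        throw new IOException("Unable to create new block.");
    }
}
```

With an allocator that first returns D1, D2, D3 and then D4, D5, D6, and a connector that rejects any list containing D1, the call returns the second list. If every attempt fails, the `IOException` surfaces to the caller, which is the point at which the write fails.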

Related

Flowfiles stuck in queue (Apache NiFi)

I have following flow:
ListFTP -> RouteOnAttribute -> FetchFTP -> UnpackContent -> ExecuteScript.
Some of the files are stuck in the queue between UnpackContent and ExecuteScript.
ExecuteScript consumed some flowfiles and they just disappeared: the failure and success relationships are empty, and the processor only showed some activity in the Tasks/Time field. All the others are stuck in the queue before ExecuteScript. I tried to empty the queue, but not all of the flowfiles were deleted; about 1/3 of them are still stuck. I tried to disable all processors and empty the queue again, but it returns: 0 FlowFiles (0 bytes) were removed from the queue.
When I try to change the connection destination, it returns:
Cannot change destination of Connection because FlowFiles from this Connection are currently held by ExecuteScript[id=d33c9b73-0177-1000-5151-83b7b938de39]
The ExecuteScript script is from this answer (uses Python).
So I can't empty the queue, because it always reports that there are no flowfiles, and I can't remove the connection. This has been going on for several hours.
Connection configuration:
Scheduling is set to 0 sec, no penalties for flowfiles, etc.
Is it script problem?
UPDATE
Changed script to:
flowFile = session.get()
if (flowFile != None):
    # All processing code starts at this indent
    if errorOccurred:
        session.transfer(flowFile, REL_FAILURE)
    else:
        session.transfer(flowFile, REL_SUCCESS)
# implicit return at the end
Same result.
UPDATE v2
I set concurrent tasks to 50, ran ExecuteScript again, and then terminated it. I got this error:
UPDATE v3
I created an additional ExecuteScript processor with the same script and it worked fine. But after I stopped this new processor and created new flowfiles, it now has the same problem: it is just stuck.
Hilarious. Is ExecuteScript for single use?
You need to modify your code in NiFi 1.13.2, because NIFI-8080 caused these bugs. Alternatively, just use NiFi 1.12.1.
JythonScriptEngineConfigurator:
@Override
public Object init(ScriptEngine engine, String scriptBody, String[] modulePaths) throws ScriptException {
    // Always compile when first run
    if (engine != null) {
        // Add prefix for import sys and all jython modules
        prefix = "import sys\n"
                + Arrays.stream(modulePaths).map((modulePath) -> "sys.path.append(" + PyString.encode_UnicodeEscape(modulePath, true) + ")")
                .collect(Collectors.joining("\n"));
    }
    return null;
}

@Override
public Object eval(ScriptEngine engine, String scriptBody, String[] modulePaths) throws ScriptException {
    Object returnValue = null;
    if (engine != null) {
        returnValue = ((Compilable) engine).compile(prefix + scriptBody).eval();
    }
    return returnValue;
}

IBM MQ tuning for transferring a large number of files

I have a project that transfers files using IBM MQ. There are 10000 clients and one data center. The largest file is almost 8MB. The MQ cluster contains three MQ managers, each on a different Windows server. Each MQ manager has 5 channels for clients and 5 channels for the data center. There are two test cases. Clients are evenly distributed across the MQ managers in each case. Not losing any file is the most important requirement in both cases.
Case 1:
Every client sends 50 files to the data center at the same time. The file sizes are between 150KB and 5MB.
In this case, the total size of the files one client sends is almost 80MB.
Case 2:
The data center sends 10 identical files to every client at the same time. In this case, I create a topic named `myTopic` and the 10000 clients subscribe to it. The data center publishes the 10 identical files to the topic.
The MQ managers are under heavy load. I have already set some attributes in IBM MQ:
Queue Manager:
Max handles: 100000
Maximum message length: 100MB
Max channels: 10000
Are there any attributes that could improve performance?
5/11 update:
First, I have modified the setup of case 2 above. I have a data center server with a 4-core CPU and 32G RAM. I use 4 client servers to simulate 10000 clients, and each client server has a 4-core CPU and 16G RAM.
In case 1, it takes about 37 minutes when 1000 clients send files to the data center. The data center server runs out of memory when it receives files from 2000 clients. I found that 20G of memory is used for buffer/cache. Here is the Java code I use to receive files:
try {
String filePath = ConfigReader.getInstance().getConfig("filePath");
MQMessage mqMsg = new MQMessage();
mqMsg.messageId = CMQC.MQMI_NONE;
mqMsg.correlationId = CMQC.MQCI_NONE;
mqMsg.groupId = CMQC.MQGI_NONE;
int flag = 1;
while (true) {
try {
MQQueueManager queueManager = new MQQueueManager("QMGR1");
int option = CMQC.MQTOPIC_OPEN_AS_SUBSCRIPTION | CMQC.MQSO_DURABLE;
MQTopic subscriber = queueManager.accessTopic("", "myTopic", option, null, "datacenter");
subscriber.get(mqMsg);
if (mqMsg.getDataLength() != 0) {
String fileName = filePath + "_file" + flag + ".txt";
byte[] b = new byte[mqMsg.getDataLength()];
mqMsg.readFully(b);
System.out.println("Receive " + fileName + ", complete time: " + System.currentTimeMillis());
Path path = Paths.get(fileName);
System.out.println("Write " + fileName + ", start time: " + System.currentTimeMillis());
Files.write(path, b);
System.out.println("Write " + fileName + ", complete time: " + System.currentTimeMillis());
flag++;
}
} catch (MQException e) {
// e.printStackTrace();
if (e.reasonCode != 2033) {
e.printStackTrace();
}
} finally {
mqMsg.clearMessage();
mqMsg.messageId = CMQC.MQMI_NONE;
mqMsg.correlationId = CMQC.MQCI_NONE;
mqMsg.groupId = CMQC.MQGI_NONE;
}
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
I use a byte array to read each message and write it to disk. Is it possible that the byte arrays are not released and take up 20G of memory?
In case 2, I found that if I send a 5MB file to myTopic, which has 1000 subscribers on MQ manager01, MQ manager01 takes a lot of time to sync with the other cluster members. The disks on the MQ servers are very busy. There is another problem: sometimes it takes only 7 seconds to send a 5MB file, and sometimes it takes 90 seconds. Here is my Java code to send files:
try {
MQQueueManager queueManager = new MQQueueManager("QMGR1");
MQTopic publisher = queueManager.accessTopic("myTopic", "", CMQC.MQTOPIC_OPEN_AS_PUBLICATION,
CMQC.MQOO_OUTPUT);
System.out.println("---- start publish , time: " + System.currentTimeMillis() + " ----");
publisher.put(InMemoryDataProvider.getInstance().getMessage("my5MBFile"));
System.out.println("---- end publish , time: " + System.currentTimeMillis() + " ----");
publish.getPublisher().close();
} catch (MQException e) {
System.out.println("threadNum: " + publish.getThreadNo() + " publish error");
if (e.reasonCode != 2033) {
e.printStackTrace();
}
}
A couple of things.
MQ has FTE, which transfers files for you. I think it does so using non-persistent messages, so you avoid the disk overhead.
You might try checking your .ini files for parameters like ClntRcvBuffSize=0
(see here).
A value of 0 means use the operating system values.
TCP used to send data in short bursts (64KB chunks), wait until the packets had been acknowledged, and then send more. If the connection is reliable, you get higher throughput by sending bigger logical packets, a technique known as Dynamic Right Sizing. See here.
It works best when the connection is long-lived and sends a lot of data. For example, the first few chunks may be 64KB, then the size increases a bit to 128KB chunks, and eventually up to 100MB (or more) if needed.
You need to set both ends.
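To see why the chunk (window) size matters: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is bounded by window size divided by round-trip time. Here is a small sketch of that arithmetic. The class name is made up, and the 20 ms round-trip time and the 4MB window are assumed example values, not measurements from this setup.

```java
// Upper bound on TCP throughput for a fixed window: one window per round trip.
final class WindowThroughput {
    /** Returns the maximum bytes per second for a window (bytes) and round-trip time (ms). */
    static double maxBytesPerSecond(long windowBytes, double rttMillis) {
        return windowBytes / (rttMillis / 1000.0);
    }

    public static void main(String[] args) {
        long small = 64L * 1024;         // 64KB chunk, as mentioned above
        long large = 4L * 1024 * 1024;   // 4MB window, an illustrative larger value
        double rttMillis = 20.0;         // assumed round-trip time
        System.out.printf("64KB window: %.1f MB/s%n", maxBytesPerSecond(small, rttMillis) / 1e6);
        System.out.printf("4MB window:  %.1f MB/s%n", maxBytesPerSecond(large, rttMillis) / 1e6);
    }
}
```

The bound scales linearly with the window, which is why letting the buffers grow helps on long-lived connections that move a lot of data.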
Depending on the platform, you can use the ss command (the netstat replacement) to display the various window sizes.
For your QM-to-QM channels, specify a large batchsz and batchlim, though this may make your disk I/O worse, as the data gets to the remote end faster.

Spark Streaming: Micro batches Parallel Execution

We are receiving data in Spark Streaming from Kafka. Once execution has started, Spark Streaming executes only one batch at a time and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with multiple executors, cores, back pressure and other settings, but nothing has worked so far. A lot of messages are queued, only one micro-batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism, so that no micro-batch is queued, as we have enough resources available. How can we reduce the processing time by making maximum use of the resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
private static final long serialVersionUID = 1L;
@Override
public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
return kafkaRecord.value();
}
});
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
private static final long serialVersionUID = 1L;
@Override
public String call(byte[] asn1Data) throws Exception {
if(asn1Data.length > 0) {
try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
Writer writer = new StringWriter(); ) {
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
byte[] buffer = new byte[1024];
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
int len;
while((len = gzipInputStream.read(buffer)) != -1) {
byteArrayOutputStream.write(buffer, 0, len);
}
return new String(byteArrayOutputStream.toByteArray());
} catch (Exception e) {
//
producer.flush();
throw e;
}
}
return null;
}
});
// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
private static final long serialVersionUID = 1L;
@Override
public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
//Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
if(!jsonRdd4DF.isEmpty()) {
//JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
airMainJsonProcessor.processAIRData(json, sparkSession);
}
}
});
getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technology that we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics that we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can maximize processing so that nothing is queued.
I was facing the same issue. I tried a few things to resolve it and came to the following findings:
First of all, intuition says that one batch should be processed per executor. On the contrary, only one batch is processed at a time, while its jobs and tasks are processed in parallel.
Multiple batches can be processed concurrently by using spark.streaming.concurrentJobs, but it is not documented and still needs a few fixes. One of the problems is with saving Kafka offsets. Suppose we set this parameter to 4 and 4 batches are processed in parallel: if a later batch finishes before an earlier one, which Kafka offsets should be committed? This parameter is quite useful if batches are independent.
spark.default.parallelism, because of its name, is sometimes assumed to make everything parallel. But its true benefit is in distributed shuffle operations. Try different values and find the optimum for your job; you will see a considerable difference in processing time. It depends on the shuffle operations in your jobs. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD. But I think foreachPartition is better, as with foreachPartitionAsync the batches appear to be processed while their jobs are still queued or running. Maybe I didn't get its usage right, but it behaved the same in my 3 services.
spark.scheduler.mode=FAIR should be used for jobs with lots of tasks, as round-robin assignment of tasks between jobs gives smaller jobs the opportunity to start receiving resources while bigger jobs are processing.
Try to tune your batch duration and input size so that the processing time always stays below the batch duration; otherwise you are going to see a long backlog of batches.
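To make the offset concern with concurrent batches concrete, one possible approach (an assumption on my side, not something Spark does for you) is to commit only up to the highest batch for which every earlier batch has also finished. The sketch below assumes batches are numbered consecutively and that the application commits offsets itself; `ContiguousCommitTracker` is a hypothetical helper, not part of Spark or Kafka.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tracks finished batches and reports the highest batch
// number up to which every batch has completed. Not part of Spark or Kafka.
final class ContiguousCommitTracker {
    private final Set<Long> finished = new HashSet<>();
    private long lastCommitted; // every batch with an id <= this value is done

    ContiguousCommitTracker(long lastCommitted) {
        this.lastCommitted = lastCommitted;
    }

    /** Records a finished batch and returns the highest batch that is now safe to commit. */
    synchronized long markFinished(long batchId) {
        finished.add(batchId);
        // Advance only while the next batch in sequence has already finished.
        while (finished.remove(lastCommitted + 1)) {
            lastCommitted++;
        }
        return lastCommitted;
    }
}
```

Starting from 0, if batch 3 finishes first the tracker still reports 0; once batches 1 and 2 have finished as well it reports 3, so the offsets of a batch are never committed ahead of an unfinished earlier batch.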
These are my findings and suggestions. However, there are many configurations and methods for streaming, and often a set of settings that works for one workload doesn't work for others. Spark Streaming is all about learning, and putting your experience and anticipation together to arrive at an optimum set of configurations.
Hope it helps. It would be a great relief if someone could explain specifically how batches can legitimately be processed in parallel.
We want to achieve parallelism at maximum, so that not any micro batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at the rate slower than it arrives it will be queued. Also, don't expect that processing of one record will suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36s to process 2 records and 70s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, it would take only 4:18 to process all 100 records in a single mini-batch thus beating your record holder.
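The 4:18 figure comes from fitting a straight line (fixed per-batch overhead plus a per-record cost) through the two observations and extrapolating to 100 records. Here is a small sketch of that calculation; the class and method names are made up for illustration.

```java
// Fits time = overhead + perRecord * records through two observed points
// and extrapolates to another record count.
final class BatchTimeEstimate {
    static double estimateSeconds(int n1, double t1, int n2, double t2, int n) {
        double perRecord = (t2 - t1) / (n2 - n1);   // seconds per additional record
        double overhead = t1 - perRecord * n1;      // fixed per-batch cost
        return overhead + perRecord * n;
    }

    public static void main(String[] args) {
        // Observations quoted above: 2 records in 36s, 17 records in 70s.
        double seconds = estimateSeconds(2, 36.0, 17, 70.0, 100);
        System.out.printf("Estimated time for 100 records: %.0f s%n", seconds);
    }
}
```

With those two points the per-record cost is about 2.27s and the fixed overhead about 31.5s, which gives roughly 258s, i.e. 4:18, for 100 records.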
Since your code is not complete, it's hard to tell what exactly takes so much time. Transformations in the code look fine but probably the action (or subsequent transformations) are the real bottlenecks. Also, what's with producer.flush() which wasn't mentioned anywhere in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
messages.foreachRDD{ rdd =>
val f = Future {
// sleep(100)
val newRDD = rdd.map{message =>
val req_message = message.value()
(message.value())
}
println("Request messages: " + newRDD.count())
var resultrows = newRDD.collect()//.collectAsList()
processMessage(resultrows, mlFeatures: MLFeatures, conf)
println("Inside scala future")
1
}
f.onComplete {
case Success(messages) => println("yay!")
case Failure(exception) => println("On no!")
}
}
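For a Java code base like the one in the question, roughly the same idea can be expressed with CompletableFuture from the JDK. This is only a sketch of the pattern under assumptions: `ParallelBatches` is a made-up name, summing integers stands in for the real per-batch processing, and nothing here touches Spark APIs.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Plain-JDK sketch: process independent batches concurrently on a thread pool.
final class ParallelBatches {
    /** Runs a stand-in "process" step (a sum) for each batch concurrently; results keep batch order. */
    static List<Integer> processAll(List<List<Integer>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every batch first so they can run at the same time.
            List<CompletableFuture<Integer>> futures = batches.stream()
                    .map(b -> CompletableFuture.supplyAsync(
                            () -> b.stream().mapToInt(Integer::intValue).sum(), pool))
                    .collect(Collectors.toList());
            // Then wait for each result in submission order.
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

As with the Scala version, this only helps when the batches are independent, because the work for one batch may finish before the work for an earlier one.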
It's hard to tell without having all the details, but the general advice for tackling issues like this is to start with a very simple application, of the "Hello world" kind. Just read from the input stream and print the data to a log file. Once this works, you have shown that the problem was in the application, and you can gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know that the problem is in the configuration or in the Spark cluster itself. Hope this helps.

Hadoop: how reducer nodes are selected?

I just started learning Hadoop, but I don't understand how a datanode becomes a reducer node.
Once the map task completes, the contents of its sort buffer are flushed to the local disk
after the KV pairs are sorted and partitioned.
Then the jobtracker is notified about the spilled partitions.
After that, the reducers start requesting the data from a particular partition.
But how does the jobtracker decide which node becomes a reducer node? I'm reading Hadoop: The Definitive Guide, but this step is not mentioned in the book.
Thanks,
Bruckwald
Pretty much first-come, first-serve. Tasks are assigned by heartbeats, so if a Tasktracker pings the Jobtracker that it is alive, it will get a response that might contain a new task to run:
List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
if (tasks == null ) {
tasks = taskScheduler.assignTasks(taskTrackerStatus);
}
if (tasks != null) {
for (Task task : tasks) {
expireLaunchingTasks.addNewTask(task.getTaskID());
LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
actions.add(new LaunchTaskAction(task));
}
}
Here's the relevant source code of the Jobtracker. So besides which tasktracker comes first, the task scheduler will check resource conditions (e.g. whether there is a free slot, or whether a single node is overloaded).
The relevant code can be found here (it isn't particularly exciting):
//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
//
final int trackerCurrentReduceCapacity =
Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity),
trackerReduceCapacity);
final int availableReduceSlots =
Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
exceededReducePadding = exceededPadding(false, clusterStatus,
trackerReduceCapacity);
synchronized (jobQueue) {
for (JobInProgress job : jobQueue) {
if (job.getStatus().getRunState() != JobStatus.RUNNING ||
job.numReduceTasks == 0) {
continue;
}
Task t = job.obtainNewReduceTask(taskTracker, numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts());
if (t != null) {
assignedTasks.add(t);
break;
}
// Don't assign reduce tasks to the hilt!
// Leave some free slots in the cluster for future task-failures,
// speculative tasks etc. beyond the highest priority job
if (exceededReducePadding) {
break;
}
}
}
Basically, the first tasktracker that heartbeats to the Jobtracker and has enough slots available will get a reduce task.
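The slot arithmetic from the scheduler excerpt can be isolated to see how it behaves. The method below copies the two Math.min expressions from that excerpt; the class name and the sample numbers in the note are mine and purely illustrative.

```java
// Reproduces the reduce-slot arithmetic from the scheduler excerpt above.
final class ReduceSlotMath {
    /** Returns how many reduce tasks this tasktracker may be given on one heartbeat (never more than 1). */
    static int availableReduceSlots(double reduceLoadFactor,
                                    int trackerReduceCapacity,
                                    int trackerRunningReduces) {
        // Cap the tracker's usable capacity by the cluster-wide load factor.
        int trackerCurrentReduceCapacity =
                Math.min((int) Math.ceil(reduceLoadFactor * trackerReduceCapacity),
                        trackerReduceCapacity);
        // At most one reduce task is handed out per heartbeat.
        return Math.min(trackerCurrentReduceCapacity - trackerRunningReduces, 1);
    }
}
```

For a tracker with 4 reduce slots and a load factor of 0.5, the usable capacity is 2: with 1 reduce running the result is 1, so the tracker can receive a task on this heartbeat; with 2 running the result is 0 and it gets nothing.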

checkpointing: Is fsimage always copied from namenode

In checkpointing, the Definitive Guide says:
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
At the end of checkpointing, the secondary namenode sends the updated fsimage to the namenode.
Now the secondary namenode has the latest fsimage. In the next checkpoint, will the secondary namenode copy the fsimage from the namenode again? If so, why? Can't it simply compare the two using a checksum?
Yes. When the edits file on the namenode grows to a specific size (default: fs.checkpoint.size = 4194304), or the checkpoint period has elapsed, the secondary namenode copies the fsimage and the edits file from the namenode server.
This code from SecondaryNameNode.java explains that -
long size = namenode.getEditLogSize();
if (size >= checkpointSize ||
now >= lastCheckpointTime + 1000 * checkpointPeriod) {
doCheckpoint();
lastCheckpointTime = now;
}
Please, check when doCheckpoint(); is called.
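The trigger condition in that snippet can be written as a standalone predicate to experiment with. This is only a sketch: the class and parameter names are made up, and the values in the note below are sample inputs rather than a statement about any particular cluster's configuration.

```java
// Standalone version of the checkpoint trigger condition from SecondaryNameNode.
final class CheckpointTrigger {
    /** True when the edit log is large enough, or the checkpoint period has elapsed. */
    static boolean shouldCheckpoint(long editLogSize, long checkpointSize,
                                    long nowMillis, long lastCheckpointMillis,
                                    long checkpointPeriodSeconds) {
        return editLogSize >= checkpointSize
                || nowMillis >= lastCheckpointMillis + 1000 * checkpointPeriodSeconds;
    }
}
```

For example, with a size threshold of 4194304 bytes and a period of 3600 seconds, a 5000000-byte edit log triggers a checkpoint immediately, while a 100-byte edit log triggers one only once 3600000 ms have passed since the last checkpoint.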
The answer to why lies in the design Hadoop follows (I don't know why they follow this design, though). See in the code below what is being done
(I'm keeping only the statements relevant to this question). You can see how the functions downloadCheckpointFiles(sig) and doMerge(sig) are called.
/**
* Create a new checkpoint
*/
void doCheckpoint() throws IOException {
//---other code skipped---
// Tell the namenode to start logging transactions in a new edit file
// Retuns a token that would be used to upload the merged image.
CheckpointSignature sig = (CheckpointSignature)namenode.rollEditLog();
downloadCheckpointFiles(sig); // Fetch fsimage and edits
doMerge(sig); // Do the merge
//
// Upload the new image into the NameNode. Then tell the Namenode
// to make this new uploaded image as the most current image.
//
putFSImage(sig);
namenode.rollFsImage();
checkpointImage.endCheckpoint();
//----other code skipped----
}
Next, here is downloadCheckpointFiles(sig), which is called from within doCheckpoint() above.
See the code below:
/**
* Download <code>fsimage</code> and <code>edits</code>
* files from the name-node.
* @throws IOException
*/
private void downloadCheckpointFiles(final CheckpointSignature sig
) throws IOException {
try {
UserGroupInformation.getCurrentUser().doAs(new PrivilegedExceptionAction<Void>() {
@Override
public Void run() throws Exception {
// get fsimage
String fileid = "getimage=1";
File[] srcNames = checkpointImage.getImageFiles();
assert srcNames.length > 0 : "No checkpoint targets.";
TransferFsImage.getFileClient(fsName, fileid, srcNames);
LOG.info("Downloaded file " + srcNames[0].getName() + " size " +
srcNames[0].length() + " bytes.");
// get edits file
fileid = "getedit=1";
srcNames = checkpointImage.getEditsFiles();
assert srcNames.length > 0 : "No checkpoint targets.";
TransferFsImage.getFileClient(fsName, fileid, srcNames);
LOG.info("Downloaded file " + srcNames[0].getName() + " size " +
srcNames[0].length() + " bytes.");
checkpointImage.checkpointUploadDone();
return null;
}
});
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
And, for your last question, "can't it simply compare two using checksum":
One possible reason is that they don't want to take any risk, as the checksums of two different files can sometimes be the same. Say the namenode has an fsimage that is different from the one on the secondary namenode, but their checksums somehow turn out to be the same. This might happen and you might never know. Copying seems to be the best option they have to ensure the copies are the same.
Hope this helps.
