"Unable to verify integrity of data" while running MR job

"Unable to verify integrity of data" while running MR job - hadoop

I'm running a relatively big MR job using Amazon Elastic Map Reduce.
I ran the job plenty of times on small data sets with no problem.
But when trying to run it on a large dataset I'm getting the following exception:
Error: com.amazonaws.AmazonClientException: Unable to verify integrity
of data download. Client calculated content length didn't match
content length received from Amazon S3. The data may be corrupt.
I googled it and the only recommendation I got was to set the following:
System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation","true");
That didn't help at all.
I'm using replication 3, 11 M1Large datanodes and 1 M1Medium master node.
Any workaround or known fix for this issue?

Apparently, this is a known bug. Or so I've been told by an Amazon employee here.
It occurs when running on large datasets where an S3 object is bigger than 2GB.
I managed to work around it by moving to Hadoop 2.4.0 and AMI 3.1.0.

Related

Azure Databricks stream fails with StorageException: Could not verify copy source

We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook which consumes from event-hub with PySpark structured streaming, calculates some values based on the data, and streams data back to another event-hub topic.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried to restart job many times, also with clean input data and checkpoint location.
We struggle to determine what is causing this error.
We cannot see any 403 Forbidden errors in logs, which is sometimes mentioned on forums as a reason
.
Any assistance is greatly appreciated.

Issue resolved by moving checkpointing (used internally by Spark) location from standard storage to premium. I don't know why it suddenly started failing after months of running hardly without hiccup.
Premium storage might be a better place for checkpointing anyway since I/O is cheaper.

Hadoop Data Corrupted Following Power Failure

I'm new to Hadoop and learning to use it by working with a small cluster where each node is an Ubuntu Server VM. The cluster consists of 1 name node and 3 data nodes with a replication factor of 3. After a power loss on the machine hosting the VMs, all files stored in the cluster were corrupted and with the blocks storing those files missing. No queries were running at the time power was lost and no files were being written to or read from the cluster.
If I shut down the VMs correctly (even without first stopping the Hadoop cluster), then the data is preserved and I don't run into any issues with missing or corrupted blocks.
The only information I've been able to find suggested setting dfs.datanode.sync.behind.writes to true, but this did not resolve the issue (killing the VMs from the host causes the same issue as a power failure). The information I found here seems to indicate this property will only have an effect when writing data to the disk.
I also tried running hdfs namenode -recover, but this did not resolve the issue. Ultimately I had to remove the data stored in the dfs.namenode.name.dir directory, rebooted each VM in the cluster to remove any Hadoop files in /tmp and reformatted the name node before copying the data back into the cluster from local file storage.
I understand that having all nodes in the cluster running on the same hardware and only 3 data nodes to go with a replication factor of 3 is not an ideal configuration, but I'd like a way to ensure that any data that is already written to disk is not corrupted by a power loss. Is there a property or other configuration I need to implement to avoid this in the future (besides separate hardware, more nodes, power backup, etc.)?
EDIT: To clarify further, the issue I'm trying to resolve is data corruption, not cluster availability. I understand I need to make changes to the overall cluster architecture to improve reliability, but I'd like a way to ensure data is not lost even in the event of a cluster-wide power failure.

Hive / Tez job won't start

I am trying to create an ORC table in Hive by importing from a text file in HDFS. I have tried multiple different ways, searched online for help, and regardless the insert job won't start.
I can get the text file to HDFS, I can read the text file to Hive, but I cannot convert from that to ORC.
I tried many different variations, including this one that can be used as a reference to this question:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
I have a single-node HDP cluster (being used for development) - version:
HDP-2.3.2.0
(2.3.2.0-2950)
And here are the relevant service versions:
Service Version Status Description
HDFS 2.7.1.2.3 Installed Apache Hadoop Distributed File System
MapReduce2 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
YARN 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
Tez 0.7.0.2.3 Installed Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive 1.2.1.2.3 Installed Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
What happens when I run a SQL like this (again, I've tried many variations including directly from online tutorials):
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
My job stays like this:
Total number of applications (application-types: [] and states:
[SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1455989658079_0002 HIVE-3f41161c-b806-4e7d-974e-c18e028d683f TEZ hive root.hive ACCEPTED UNDEFINED 0% N/A
And it just hangs there. (Literally, I've tried a 20 row sample table and let it run for hours before killing it).
I am by no means an Hadoop expert (yet) and am sure it's probably a config issue, but I have been unable to figure it out.
All other Hive operations I've tried, such as creating dropping tables, loading a file to a text table, selects, all work fine. It's just when I create an ORC table that it does this. And I need an ORC table for my requirement.
Any advice would be helpful.

Most of the time it has to do with increasing your Yarn Scheduling capacity, but if your resources are already capped you can also reduce the amount of memory requested by individual TEZ tasks, through adjusting the following property in TEZ configuration :
task.resource.memory.mb
In order to increase the Cluster's capacity you can do it in the configuration settings of YARN or directly through Ambari or Cloudera Manager
In order to monitor what is happening behind the hoods you can run Yarn Resource Manager UI and check the diagnostics tab of the specific Application there are useful explicit messages about resource allocation especially when the job is accepted and keeps pending.

SparkException: Master removed our application

I know there are other very similar questions on Stackoverflow but those either didn't get answered or didn't help me out. In contrast to those questions I put much more stack trace and log file information into this question. I hope that helps, although it made the question to become sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successful as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log I think the interesing parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executors stderr is empty. However, I think it shows nothing useful to tackle the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions but none of those got answered or it didn't help me to solve the issue. Anyway, things I tried and verified are:
Opened port 2552. Nothing changes.
Increased spark.akka.askTimeout which results in the Spark/Shark app to live longer but eventually it still gets removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, one could say that the timeouts and the fact long-running queries are successfully executed in local mode all indicate some misconfiguration. Though I cannot be sure and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).

As mentioned in the attempts, the root cause is a timeout between the master node, and one or more workers.
Another thing to try: Verify that all workers are reachable by hostname from the master, either via dns or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, the adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, then communication from the master using the worker's hostname failed and the master killed the job.

'Stream Closed' error when using s3distcp to copy files from HDFS to Amazon S3

I am using s3distcp to copy files from HDFS to Amazon S3. Recently, I started getting the 'Stream Closed' error for reducer tasks. I noticed that the error only happened where there were multiple threads (each thread assigned a single file to upload to S3) under the same reducer task. The reducer task fails on this error, but eventually succeeds due to the retries.
My initial analysis was that the threads were using the same instance of the FileSystem class. To avoid that, I used configuration.setBoolean("fs.hdfs.impl.disable.cache", true) before calling FileSytem.get to avoid getting a cached instance. But, I am still getting the same error. Here is a part of the syslog:
Caused by: java.io.IOException: Stream closed
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2039)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1976)
at java.io.FilterInputStream.read(FilterInputStream.java:66)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$RewindableInputStream.read(XMLEntityManager.java:2932)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:704)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:772)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1163)
Another point that I should mention is that I recently moved from Hadoop 0.20 to Hadoop 0.20.205 and a new AMI (2.0.4). This issue started happening after moving to the new version.
I have been blocked on this for a while now. Any help or guidance would be helpful.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio