I'm running OpsCenter 5.1.1 with DataStax Enterprise 4.5.1. It's a 3-node cluster on AWS, and I'm backing up to S3 (still). I've started seeing a new error; I think it's different from any I've posted before.
$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.8.39 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
I am seeing this error in the agent.log file:
node1_agent.log: SEVERE: error after writing 15736832/16777216 bytes to https://cassandra-dev-bkup.s3.amazonaws.com/snapshots/407bb4b1-5c91-43fe-9d4f-767115668037/sstables/1430904167-reporting_test-transaction_lookup-jb-288-Index.db?partNumber=2&uploadId=.MA3X4RYssg7xL_Hr7Msgze.J4exDq9zZ_0Y7qEj9gZhJ570j73kZNr5_nbxactmPMJeKf0XyZfEC0KAplWOz9lpyRCtNeeDCvCmtEXDchH8F1J2c57aq4MrxfBcyiZr
java.io.IOException: Error writing request body to server
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3192)
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3175)
at com.google.common.io.CountingOutputStream.write(CountingOutputStream.java:53)
at com.google.common.io.ByteStreams.copy(ByteStreams.java:179)
at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.writePayloadToConnection(JavaUrlHttpCommandExecutorService.java:308)
at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.convert(JavaUrlHttpCommandExecutorService.java:192)
at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.convert(JavaUrlHttpCommandExecutorService.java:72)
at org.jclouds.http.internal.BaseHttpCommandExecutorService.invoke(BaseHttpCommandExecutorService.java:95)
at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.invoke(InvokeSyncToAsyncHttpMethod.java:128)
at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.apply(InvokeSyncToAsyncHttpMethod.java:94)
at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.apply(InvokeSyncToAsyncHttpMethod.java:55)
at org.jclouds.rest.internal.DelegatesToInvocationFunction.handle(DelegatesToInvocationFunction.java:156)
at org.jclouds.rest.internal.DelegatesToInvocationFunction.invoke(DelegatesToInvocationFunction.java:123)
at com.sun.proxy.$Proxy48.uploadPart(Unknown Source)
at org.jclouds.aws.s3.blobstore.strategy.internal.SequentialMultipartUploadStrategy.prepareUploadPart(SequentialMultipartUploadStrategy.java:111)
at org.jclouds.aws.s3.blobstore.strategy.internal.SequentialMultipartUploadStrategy.execute(SequentialMultipartUploadStrategy.java:93)
at org.jclouds.aws.s3.blobstore.AWSS3BlobStore.putBlob(AWSS3BlobStore.java:89)
at org.jclouds.blobstore2$put_blob.doInvoke(blobstore2.clj:246)
at clojure.lang.RestFn.invoke(RestFn.java:494)
at opsagent.backups.destinations$create_blob$fn__12007.invoke(destinations.clj:69)
at opsagent.backups.destinations$create_blob.invoke(destinations.clj:64)
at opsagent.backups.destinations$fn__12170.invoke(destinations.clj:192)
at opsagent.backups.destinations$fn__11799$G__11792__11810.invoke(destinations.clj:24)
at opsagent.backups.staging$start_staging_BANG_$fn__12338$state_machine__7576__auto____12339$fn__12344$fn__12375.invoke(staging.clj:61)
at opsagent.backups.staging$start_staging_BANG_$fn__12338$state_machine__7576__auto____12339$fn__12344.invoke(staging.clj:59)
at opsagent.backups.staging$start_staging_BANG_$fn__12338$state_machine__7576__auto____12339.invoke(staging.clj:56)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at clojure.core.async.impl.ioc_macros$take_BANG_$fn__7592.invoke(ioc_macros.clj:953)
at clojure.core.async.impl.channels.ManyToManyChannel$fn__4097.invoke(channels.clj:102)
at clojure.lang.AFn.run(AFn.java:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
TL;DR -
Your SSTable, which is 38866048 bytes, is both on your filesystem and on S3. This means the file has transferred over and you are in good shape. There is no need to worry about this error (though I have opened an internal ticket to handle this kind of exception gracefully rather than dumping a stack trace).
Details - a summary of what I suspect happened:
1) There was a file transfer error when you reached 15736832 bytes of the 16777216-byte slice of the SSTable.
2) At that point OpsCenter neither finished transferring the slice nor left a partial version in S3.
3) A later backup attempt transferred the SSTable without error, so a valid backup exists.
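If you want to verify this yourself, here is a minimal sketch, assuming the AWS CLI is configured on the node; the bucket name and S3 key come from the log line above, while the local data directory path is an assumption you should adjust:
# Size of the SSTable on the local filesystem (Cassandra data directory path is illustrative)
$ ls -l /var/lib/cassandra/data/reporting_test/transaction_lookup/*-jb-288-Index.db
# Size of the matching object in S3 -- the two byte counts should agree
$ aws s3 ls s3://cassandra-dev-bkup/snapshots/407bb4b1-5c91-43fe-9d4f-767115668037/sstables/ | grep jb-288-Index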
Related
I have a setup in which the NiFi web server suddenly started failing to start after upgrading from version 1.15.3 to 1.16.1. The following exception keeps occurring on the Apache NiFi cluster:
2022-05-11 22:53:40,570 WARN [main] org.apache.nifi.web.server.JettyServer Failed to start web server... shutting down.
org.apache.nifi.encrypt.EncryptionException: Decryption Failed with Algorithm [PBEWITHMD5AND256BITAES-CBC-OPENSSL]
at org.apache.nifi.encrypt.CipherPropertyEncryptor.decrypt(CipherPropertyEncryptor.java:78)
at org.apache.nifi.fingerprint.FingerprintFactory.decrypt(FingerprintFactory.java:931)
at org.apache.nifi.fingerprint.FingerprintFactory.getLoggableRepresentationOfSensitiveValue(FingerprintFactory.java:561)
at org.apache.nifi.fingerprint.FingerprintFactory.addParameter(FingerprintFactory.java:330)
at org.apache.nifi.fingerprint.FingerprintFactory.addParameterContext(FingerprintFactory.java:302)
at org.apache.nifi.fingerprint.FingerprintFactory.addFlowControllerFingerprint(FingerprintFactory.java:210)
at org.apache.nifi.fingerprint.FingerprintFactory.createFingerprint(FingerprintFactory.java:153)
at org.apache.nifi.fingerprint.FingerprintFactory.createFingerprint(FingerprintFactory.java:127)
at org.apache.nifi.controller.inheritance.FlowFingerprintCheck.checkInheritability(FlowFingerprintCheck.java:45)
at org.apache.nifi.controller.XmlFlowSynchronizer.sync(XmlFlowSynchronizer.java:200)
at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:43)
at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1524)
at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:815)
at org.apache.nifi.controller.StandardFlowService.load(StandardFlowService.java:457)
at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:1086)
at org.apache.nifi.NiFi.<init>(NiFi.java:170)
at org.apache.nifi.NiFi.<init>(NiFi.java:82)
at org.apache.nifi.NiFi.main(NiFi.java:330)
Caused by: javax.crypto.BadPaddingException: pad block corrupted
at org.bouncycastle.jcajce.provider.symmetric.util.BaseBlockCipher$BufferedGenericBlockCipher.doFinal(Unknown Source)
at org.bouncycastle.jcajce.provider.symmetric.util.BaseBlockCipher.engineDoFinal(Unknown Source)
at javax.crypto.Cipher.doFinal(Cipher.java:2168)
at org.apache.nifi.encrypt.CipherPropertyEncryptor.decrypt(CipherPropertyEncryptor.java:74)
... 18 common frames omitted
relevant nifi.properties:
nifi.sensitive.props.key=<hidden>
nifi.sensitive.props.key.protected=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.additional.keys=
I have already tried to tear it all down and re-install 1.15.3 with no other changes, but the same issue still persists. Can someone please share any ideas on how to fix this?
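One thing worth checking: this exception generally means the value of nifi.sensitive.props.key no longer matches the key that the sensitive values in flow.xml.gz were encrypted with. A hedged sketch, based on my understanding of the NiFi 1.14+ tooling (the key value is illustrative; confirm against the Administration Guide before running):
# Re-encrypt the flow under a known sensitive properties key using the bundled helper
$ ./bin/nifi.sh set-sensitive-properties-key 'someNewKeyAtLeast12Chars'
# Confirm nifi.properties picked up the new key before restarting
$ grep 'nifi.sensitive.props' ./conf/nifi.properties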
I'm trying to checkpoint/savepoint my Flink state, running on EMR, to an S3 bucket on AWS. Please note:
The instances (master and core nodes) have the IAM role properly set up to access the S3 bucket and all the directories/files inside it (the AmazonS3FullAccess policy is attached to the role and nothing overrides it).
I can run aws s3 cp xxx s3://flink-bc/checkpoints successfully from the slave and master nodes to copy files to the bucket.
Using HDFS for savepoints/checkpoints works.
If I set the checkpoints to use HDFS and then try to savepoint to S3, the savepoint operation fails with:
org.apache.flink.util.FlinkException: Triggering a savepoint for the job 16c162c47f225cddad974056c9494b6d failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:723)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:701)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:698)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1065)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to trigger savepoint. Decline reason: An Exception occurred while triggering the checkpoint.........
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to trigger savepoint. Decline reason: An Exception occurred while triggering the checkpoint.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
and the jobmanager logs:
java.io.IOException: Cannot instantiate file system for URI: s3://flink-bc/savepoints
at org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:187)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:399)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.initializeLocationForSavepoint(AbstractFsCheckpointStorage.java:147)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:511)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:370)
at org.apache.flink.runtime.jobmaster.JobMaster.triggerSavepoint(JobMaster.java:951)
I faced a similar issue while using the latest Flink version (1.10.0) with S3 to store the checkpoints in an S3 bucket.
Please see the detailed working answer I have provided here.
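For context, the jobmanager message (Cannot instantiate file system for URI: s3://...) usually means Flink cannot find an S3 filesystem implementation at runtime. A minimal sketch of the common fix on a plain Flink 1.10 distribution; the paths and jar version are assumptions, and on EMR the bundled Hadoop/EMRFS setup may differ:
# Flink 1.10 loads filesystems from the plugins/ directory; expose the bundled S3 support
$ mkdir -p $FLINK_HOME/plugins/s3-fs-hadoop
$ cp $FLINK_HOME/opt/flink-s3-fs-hadoop-1.10.0.jar $FLINK_HOME/plugins/s3-fs-hadoop/
# Then point savepoints at the bucket in conf/flink-conf.yaml:
#   state.savepoints.dir: s3://flink-bc/savepoints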
I am running Spark jobs on YARN in cluster mode. The job gets its messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time, it runs fine without any issue. If I kill the job and restart it, it throws the exception below in the executor upon receiving a message from Kafka:
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1178)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at net.juniper.spark.stream.LogDataStreamProcessor$2.call(LogDataStreamProcessor.java:177)
at net.juniper.spark.stream.LogDataStreamProcessor$2.call(LogDataStreamProcessor.java:1)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Does anyone have an idea how to resolve this error?
Spark version: 1.5.0
CDH 5.5.1
In my experience, issues where only the first run works have always revolved around the checkpoint data. Moreover, the checkpoint is only consulted when there is something to check, which is the first message from Kafka, so the failure shows up on restart.
I suggest you check whether the job is indeed dead, that is, whether the process is still running on the machine that executed it.
Try running a simple ps -fe and see if something is still running. If there are 2 processes trying to use the same checkpoint folder, it will always fail.
Hope this helps.
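Concretely, something along these lines; the grep pattern comes from the stack trace above, and the checkpoint path is hypothetical:
# Look for a lingering driver/executor still holding the checkpoint folder
$ ps -fe | grep LogDataStreamProcessor
# If the old run is truly gone but its state is corrupt, clearing the checkpoint
# directory forces a clean start (this discards saved offsets and state!)
$ hdfs dfs -rm -r /user/spark/checkpoints/LogDataStreamProcessor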
I have a small cluster up and running with Cloudera CDH4 Hadoop and MapReduce v1. The NameNode, Secondary NameNode, and JobTracker are all on different machines. My three servers are also acting as ZooKeeper servers.
I'm trying to install Accumulo 1.4.4 on top of this cluster (I get the same behavior with Accumulo 1.5.0). I am able to run bin/accumulo init to initialize Accumulo, but starting the individual components fails. I'm trying to make my NameNode the Accumulo master.
bin/start-server.sh localhost monitor spits out a very encouraging Starting monitor on localhost, but nothing gets started. If I examine logs/monitor_localhost.err I find a stack trace:
-bash-4.1$ cat logs/monitor_localhost.err
Thread "monitor" died null
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.accumulo.start.Main$1.run(Main.java:91)
at java.lang.Thread.run(Thread.java:701)
Caused by: java.lang.ExceptionInInitializerError
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2464)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2456)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2323)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:163)
at org.apache.accumulo.core.file.FileUtil.getFileSystem(FileUtil.java:554)
at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceIDFromHdfs(ZooKeeperInstance.java:258)
at org.apache.accumulo.server.conf.ZooConfiguration.getInstance(ZooConfiguration.java:65)
at org.apache.accumulo.server.conf.ServerConfiguration.getZooConfiguration(ServerConfiguration.java:49)
at org.apache.accumulo.server.conf.ServerConfiguration.getSystemConfiguration(ServerConfiguration.java:58)
at org.apache.accumulo.server.monitor.Monitor.run(Monitor.java:440)
at org.apache.accumulo.server.monitor.Monitor.main(Monitor.java:433)
... 6 more
Caused by: java.security.AccessControlException: access denied (java.lang.RuntimePermission accessDeclaredMembers)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:399)
at java.security.AccessController.checkPermission(AccessController.java:557)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
at java.lang.Class.checkMemberAccess(Class.java:2237)
at java.lang.Class.getDeclaredFields(Class.java:1805)
at org.apache.hadoop.util.ReflectionUtils.getDeclaredFieldsIncludingInherited(ReflectionUtils.java:315)
at org.apache.hadoop.metrics2.lib.MetricsSourceBuilder.initRegistry(MetricsSourceBuilder.java:92)
at org.apache.hadoop.metrics2.lib.MetricsSourceBuilder.<init>(MetricsSourceBuilder.java:56)
at org.apache.hadoop.metrics2.lib.MetricsAnnotations.newSourceBuilder(MetricsAnnotations.java:42)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:212)
at org.apache.hadoop.metrics2.MetricsSystem.register(MetricsSystem.java:54)
at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:97)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:190)
... 18 more
The AccessControlException: access denied looks like the important line to me, but I can't imagine what access is being restricted. I'm running everything as the hdfs user, which owns the entire /opt/accumulo-1.4.4/ directory where Accumulo is un-tarred. The /accumulo directory in HDFS is also owned by the hdfs user. SELinux is permissive. Searching online has proved fruitless; has anyone dealt with this error before?
Much thanks.
I started browsing the Apache accumulo-users mailing list archive and came across the solution.
http://mail-archives.apache.org/mod_mbox/accumulo-user/201312.mbox/%3CB9CB2B2BF27F0F46B8ECF781831E00E710970A9F%400015-its-exmb10.us.saic.com%3E
I was copying accumulo.policy.example to accumulo.policy because I thought I needed it in my configuration. Once I deleted the accumulo.policy file, my issues went away and I've been able to stand up Accumulo (1.5.0 at least; 1.4.4 still has some issues for me).
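In other words, a sketch of the fix, assuming the default conf layout (the path is illustrative). My understanding is that the presence of a policy file causes a Java SecurityManager to be installed, which is what denies Hadoop's reflective metrics setup in the trace above:
# Remove the unneeded policy file and try the monitor again
$ rm /opt/accumulo-1.4.4/conf/accumulo.policy
$ bin/start-server.sh localhost monitor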
I want to use distcp over the HFTP protocol to copy files from CDH3 to CDH4.
The command looks like:
hadoop distcp hftp://cluster1:50070/folder1 hdfs://cluster2/folder2
But the job fails with an HTTP connection error, visible in the JobTracker UI:
INFO org.apache.hadoop.tools.DistCp: FAIL test1.dat : java.io.IOException: HTTP_OK expected, received 503
at org.apache.hadoop.hdfs.HftpFileSystem$RangeHeaderUrlOpener.connect(HftpFileSystem.java:376)
at org.apache.hadoop.hdfs.ByteRangeInputStream.openInputStream(ByteRangeInputStream.java:119)
at org.apache.hadoop.hdfs.ByteRangeInputStream.getInputStream(ByteRangeInputStream.java:103)
at org.apache.hadoop.hdfs.ByteRangeInputStream.read(ByteRangeInputStream.java:187)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:424)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Most files in folder1 are copied to folder2, but some files fail with the exception above.
Has anyone had the same problem, and how did you solve it?
Thanks in advance.
HFTP uses the HTTP web server on the datanodes to get data. Check whether this HTTP web server is working on all the datanodes. I got this exact error, and after debugging I found out that the web server on some datanodes wasn't started due to a corrupt jar file.
This web server is started when you start a datanode. You can check the initial 500 lines of the datanode log to see whether the web server is starting or not.
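A quick way to probe this from the shell; the datanode hostname and log path are placeholders, and 50075 is the default datanode HTTP port in CDH3/4:
# Should return an HTTP response if the datanode's embedded web server is up
$ curl -I http://datanode1:50075/
# And look for web server startup (or failure) near the top of the datanode log
$ head -n 500 /var/log/hadoop/hadoop-hdfs-datanode-*.log | grep -i -e jetty -e http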
Are your Hadoop clusters cluster1 and cluster2 running the same version of Hadoop? What are the exact release versions?
Have you enabled any security settings on Hadoop?
HTTP return code 503 means the server is temporarily unavailable; did any network issue occur during your copy?
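For the version check, and to retry only the files that failed, a sketch using the legacy DistCp options (the log directory is illustrative):
# Run on both clusters and compare the output
$ hadoop version
# -i ignores per-file failures so one bad datanode doesn't abort the job; -log records them
$ hadoop distcp -i -log hdfs://cluster2/distcp-logs hftp://cluster1:50070/folder1 hdfs://cluster2/folder2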