We have a small Pivotal Hadoop cluster. In that cluster, we are using Spring XD as the data ingestion tool.
Tried:
When the following command is executed from the Spring xd-admin machine:
[root@host ~]# service spring-xd-admin status
xd-admin dead but pid file exists
Outcome:
Both spring-xd-admin and the container stopped responding.
Hence, the cluster data pipeline has stopped completely.
Thanks in advance for any help.
Looks like the service crashed and left a pid file behind. You would have to remove the pid file manually. Look for a file named xd-admin.pid in the /var/run/ directory.
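For example (a minimal sketch; the exact pid file path is whatever your init script references, and the container service name is an assumption based on a default install):
rm -f /var/run/xd-admin.pid           # remove the stale pid file left by the crash
service spring-xd-admin start
service spring-xd-admin status
service spring-xd-container status    # the container usually runs as its own service; check it too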
Related
I run Apache Storm in a cluster and I was looking for ways to stop and/or restart Nimbus, Supervisor and UI. Would writing a service help? What should I write in this service file and where should I place it? Thank you in advance.
Yes, writing a service is the recommended way to run Storm. The commands you want to run are storm nimbus to start Nimbus (minimum 1 per cluster), storm supervisor to run the supervisor (1 per worker machine), storm ui (1 per cluster) and storm logviewer (1 per worker machine). There are other commands you can also run, but you can find them by simply running storm; it will print the full list.
Regarding how to write the service, take a look at the upstart cookbook http://upstart.ubuntu.com/cookbook/.
There's an example script here that you can probably use to get started: https://unix.stackexchange.com/a/84289
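For illustration, a bare-bones upstart job for Nimbus might look something like this (the job name, install path and user are assumptions; adjust them to your layout):
# /etc/init/storm-nimbus.conf
description "Storm Nimbus"
start on runlevel [2345]
stop on runlevel [016]
respawn
setuid storm
exec /opt/storm/bin/storm nimbus
With that in place you can run sudo start storm-nimbus and sudo stop storm-nimbus, and create sibling jobs for the supervisor, UI and logviewer.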
You can set them up as services that start when the node boots, and the same services can be used to stop them:
/etc/rc.d/SERVICE start (or stop, or restart)
We can also stop them manually: run ps aux | grep nimbus (or supervisor, etc.), find the process ID, and terminate it with the kill command.
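For example (the [n] in the grep pattern is just a trick to keep grep itself out of the output):
ps aux | grep [n]imbus     # note the PID in the second column
kill <pid>                 # sends SIGTERM; fall back to kill -9 <pid> only if it refuses to exit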
I just installed Kafka (from Confluent Platform) on my Windows machine. I started up ZooKeeper and Kafka, and creating topics, producing to them and consuming from them all work. However, as soon as I delete a topic, Kafka crashes like this:
PS C:\confluent-4.1.1> .\bin\windows\kafka-topics.bat -zookeeper 127.0.0.1:2181 --topic foo --create --partitions 1 --replication-factor 1
Created topic "foo".
PS C:\confluent-4.1.1> .\bin\windows\kafka-topics.bat -zookeeper 127.0.0.1:2181 --topic foo --delete
Topic foo is marked for deletion.
Note: This will have no impact if delete.topic.enable is not set to true.
This is the crash output:
[2018-06-08 09:44:54,185] ERROR Error while renaming dir for foo-0 in log dir C:\confluent-4.1.1\data\kafka (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: C:\confluent-4.1.1\data\kafka\foo-0 -> C:\confluent-4.1.1\data\kafka\foo-0.cf697a92ed5246c0977bf9a279f15de8-delete
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
at kafka.log.Log$$anonfun$renameDir$1.apply$mcV$sp(Log.scala:579)
at kafka.log.Log$$anonfun$renameDir$1.apply(Log.scala:577)
at kafka.log.Log$$anonfun$renameDir$1.apply(Log.scala:577)
at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
at kafka.log.Log.renameDir(Log.scala:577)
at kafka.log.LogManager.asyncDelete(LogManager.scala:828)
at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:240)
at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:235)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:258)
at kafka.cluster.Partition.delete(Partition.scala:235)
at kafka.server.ReplicaManager.stopReplica(ReplicaManager.scala:347)
at kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:377)
at kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:375)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:375)
at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:205)
at kafka.server.KafkaApis.handle(KafkaApis.scala:116)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.nio.file.AccessDeniedException: C:\confluent-4.1.1\data\kafka\foo-0 -> C:\confluent-4.1.1\data\kafka\foo-0.cf697a92ed5246c0977bf9a279f15de8-delete
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
... 23 more
[2018-06-08 09:44:54,187] INFO [ReplicaManager broker=0] Stopping serving replicas in dir C:\confluent-4.1.1\data\kafka (kafka.server.ReplicaManager)
[2018-06-08 09:44:54,192] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaFetcherManager)
[2018-06-08 09:44:54,193] INFO [ReplicaAlterLogDirsManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaAlterLogDirsManager)
[2018-06-08 09:44:54,195] INFO [ReplicaManager broker=0] Broker 0 stopped fetcher for partitions and stopped moving logs for partitions because they are in the failed log directory C:\confluent-4.1.1\data\kafka. (kafka.server.ReplicaManager)
[2018-06-08 09:44:54,195] INFO Stopping serving logs in dir C:\confluent-4.1.1\data\kafka (kafka.log.LogManager)
[2018-06-08 09:44:54,197] ERROR Shutdown broker because all log dirs in C:\confluent-4.1.1\data\kafka have failed (kafka.log.LogManager)
[2018-06-08 09:44:54,198] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaFetcherManager)
The user running Zookeeper and Kafka has full access rights to C:\confluent-4.1.1\data\kafka.
What am I missing?
I know I'm late to the party, but keep in mind that even if you delete your topic manually or via some Kafka UI, and you delete all of the Kafka log directories, Kafka still may not start because of the state it syncs with ZooKeeper.
So, make sure you also clean up the ZooKeeper state by deleting ZooKeeper's data.
Please be aware that these actions are irreversible. Also, run as Administrator.
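A rough sketch of what that clean-up can look like on this setup (stop Kafka and ZooKeeper first; the ZooKeeper data path is an assumption, use whatever dataDir points to in your zookeeper.properties, and note that rmr becomes deleteall on newer ZooKeeper versions):
Remove-Item -Recurse -Force C:\confluent-4.1.1\data\kafka\*       # Kafka log.dirs from the question
Remove-Item -Recurse -Force C:\confluent-4.1.1\data\zookeeper\*   # ZooKeeper dataDir (assumed path)
# or, to remove only the topic's state instead of wiping everything:
.\bin\windows\zookeeper-shell.bat 127.0.0.1:2181 rmr /brokers/topics/foo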
I had a similar problem, and it happens only under Windows; see KAFKA-1194, and it still applies to Kafka 1.1.0.
The only workaround available is to disable the log cleaner: log.cleaner.enable=false
For local development under Windows you can ignore this issue, since it does not apply to other operating systems.
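In the Confluent layout the broker config is usually etc/kafka/server.properties (path assumed); the change is a single line:
# Windows-only workaround for KAFKA-1194: disable the log cleaner
log.cleaner.enable=false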
I had a similar problem after deleting a topic. I had to go to the topic's log location and delete it manually, and it worked:
/tmp/kafka-logs/[yourTopicName]
I am not sure if the same will work for you, as I am also new to Kafka.
1- Stop the ZooKeeper and Kafka servers.
2- Go to the 'kafka-logs' folder; there you will see the list of Kafka topic folders. Delete the folder with the topic name.
3- Go to the 'zookeeper-data' folder and delete the data inside it.
4- Start the ZooKeeper and Kafka servers again.
Note: if you get a "The Cluster ID xxxxxxxxxx doesn't match stored clusterId" error, you have to delete all files in the Kafka log directory.
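A hedged Windows sketch of those four steps (the folder names are common defaults and may differ in your installation):
# 1. stop the Kafka and ZooKeeper processes/services first
Remove-Item -Recurse -Force C:\kafka\kafka-logs\foo-*     # 2. the topic's folder(s) under log.dirs
Remove-Item -Recurse -Force C:\kafka\zookeeper-data\*     # 3. ZooKeeper's data
# 4. restart ZooKeeper, then Kafka
# For the "Cluster ID doesn't match stored clusterId" error, clear everything under log.dirs
# (the stored ID lives in meta.properties) before restarting.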
Problem:
I had a similar problem after deleting a topic. ZooKeeper started successfully, but while running Kafka I was getting the above-mentioned issue.
Analysis:
In my case, I had redirected the Kafka logs to a new folder location, C:\Tools\kafka_2.13-2.6.0\kafka-test-logs, but I forgot to create the kafka-test-logs folder. In that case Kafka automatically creates a folder named after the provided path with the separators stripped, e.g. Toolskafka_2.13-2.6.0kafka-test-logs. So even deleting this logs folder did not work in my case.
Solution:
First I stopped ZooKeeper. I created the kafka-test-logs folder that I had forgotten earlier, deleted the automatically created log folder for Kafka, and then restarted ZooKeeper and the Kafka server. That's all it took for me.
Thank you! Cheers and happy coding.
I was also facing the same issue and resolved it by downloading the following version of Kafka from this link:
Version 2.8.1
Then I changed the zookeeper.properties file in the config folder to
dataDir=C:/kafka/zookeeper
and the server.properties file in the config folder to
log.dirs=C:/kafka/kafka-logs
Make sure your Kafka folder is extracted to the C:/ drive, or else amend the paths in the config file properties accordingly.
We have a Spark Streaming job that reads data from Kafka and runs on a 4-node cluster, using a checkpoint directory on HDFS. We had an I/O error where we ran out of space, and we had to go in and delete a few HDFS folders to free some up; now we have bigger disks mounted. We want to restart cleanly, with no need to preserve the checkpoint data or Kafka offsets, but we are getting the error below.
Application application_1482342493553_0077 failed 2 times due to AM Container for appattempt_1482342493553_0077_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077Then, click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1266542908-96.118.179.119-1479844615420:blk_1073795938_55173 file=/user/hadoopuser/streaming_2.10-1.0.0-SNAPSHOT.jar
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1484420770001
final status: FAILED
tracking URL: http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077
user: hadoopuser
From the error, what I can make out is that it is still looking for the old HDFS blocks that we deleted.
From research I found that changing the checkpoint directory should help. I tried changing it and pointing to a new directory, but it still does not let Spark restart from a clean slate; it still gives the same block exception. Are we missing anything while making the configuration changes? And how can we make sure that Spark starts from a clean slate?
Also, this is how we are setting the checkpoint directory:
val ssc = new StreamingContext(sparkConf, Seconds(props.getProperty("spark.streaming.window.seconds").toInt))
ssc.checkpoint(props.getProperty("spark.checkpointdir"))
val sc = ssc.sparkContext
The current checkpoint directory in the property file is:
spark.checkpointdir:hdfs://hadoopuser@hdfs-name-node:8020/user/hadoopuser/.checkpointDir1
Previously it was:
spark.checkpointdir:hdfs://hadoopuser@hdfs-name-node:8020/user/hadoopuser/.checkpointDir
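For reference, a hedged command-line sketch of verifying the damage and clearing the old state (paths are taken from the error output and the property file above; adjust for your cluster):
hdfs fsck /user/hadoopuser -list-corruptfileblocks                             # see which files actually lost blocks
hdfs dfs -rm -r -skipTrash /user/hadoopuser/.checkpointDir /user/hadoopuser/.checkpointDir1   # drop the old checkpoint state
hdfs dfs -rm -skipTrash /user/hadoopuser/streaming_2.10-1.0.0-SNAPSHOT.jar     # the jar named in the BlockMissingException
hdfs dfs -put streaming_2.10-1.0.0-SNAPSHOT.jar /user/hadoopuser/              # re-upload a healthy copy, then resubmit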
I set up percona-xtradb-cluster-56 with three nodes in the cluster. To bootstrap the first node, I use the following command and it starts just fine:
#/etc/init.d/mysql bootstrap-pxc
The other two nodes, however, fail to start when I start them normally using the command:
#/etc/init.d/mysql start
The error I am getting is "The server quit without updating the PID file". The error log contains this message:
Error in my_thread_global_end(): 1 threads didn't exit 150605 22:10:29
mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended.
The cluster nodes are all running Ubuntu 14.04. When I use percona-xtradb-cluster-5.5, the cluster and all the nodes run just fine as expected. But I need to use version 5.6 because I am also using GTID, which is only available in version 5.6 and not supported in earlier versions.
I was following these two percona documentation to setup the cluster:
https://www.percona.com/doc/percona-xtradb-cluster/5.6/installation.html#installation
https://www.percona.com/doc/percona-xtradb-cluster/5.6/howtos/ubuntu_howto.html
Any insight or suggestions on how to resolve this issue would be highly appreciated.
The problem is related to memory, as "The Georgia" writes. There should be at least 500 MB of free memory for the default setup and bootstrapping. See here: http://sysadm.pp.ua/linux/px-cluster.html
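A quick sanity check along those lines (the log and config paths are the Ubuntu defaults and may differ on your nodes):
free -m                                          # confirm roughly 500 MB or more is actually free on each node
tail -n 50 /var/log/mysql/error.log              # the concrete startup failure is usually spelled out here
grep innodb_buffer_pool_size /etc/mysql/my.cnf   # lower this on memory-constrained nodes if needed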
I'm running a few Spark Streaming jobs in a chain (each one looking for input in the output folder of the previous one) on a Hadoop cluster, using HDFS and running in yarn-cluster mode.
job 1 --> reads from folder A outputs to folder A'
job 2 --> reads from folder A' outputs to folder B
job 3 --> reads from folder B outputs to folder C
...
When running the jobs independently they work just fine.
But when they are all waiting for input and I place a file in folder A, job1 will change its status from running to accepting to failed.
I cannot reproduce this error when using the local FS, only when running it on a cluster (using HDFS).
Client: Application report for application_1422006251277_0123 (state: FAILED)
INFO Client:
client token: N/A
diagnostics: Application application_1422006251277_0123 failed 2 times due to AM Container for appattempt_1422006251277_0123_000002 exited with exitCode: 15 due to: Exception from container-launch.
Container id: container_1422006251277_0123_02_000001
Exit code: 15
Even though MapReduce ignores files that start with . or _, Spark Streaming does not.
The problem is that when a file is still being copied or processed and a trace of it is visible on HDFS (e.g. "somefilethatsuploading.txt.tmp"), Spark will try to process it.
By the time the process starts to read the file, it is either gone or not complete yet.
That is why the processes kept blowing up.
Ignoring files that start with . or _ or end with .tmp fixes this issue; see the sketch below.
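One way to apply such a filter is the fileStream variant that takes a path filter. A minimal Scala sketch, assuming a StreamingContext named ssc and a placeholder input directory:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Treat a file as "finished" only if it is not a dot/underscore file and not a .tmp upload.
def isFinished(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith(".") && !name.startsWith("_") && !name.endsWith(".tmp")
}

// "hdfs:///data/folderA" is a placeholder for the watched directory.
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/folderA", isFinished _, newFilesOnly = true)
  .map { case (_, text) => text.toString }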
Addition:
We kept having issues with the chained jobs. It appears that as soon as Spark notices a file (even if it is not completely written), it will try to process it, ignoring any data appended afterwards. Since a rename on HDFS is atomic, writing output under a temporary name (or into a staging directory) and renaming it into the watched folder only once it is complete should prevent these issues.
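A small Scala sketch of that write-then-rename pattern, using the Hadoop FileSystem API (all paths are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val staging   = new Path("hdfs:///data/_staging/part-00000")   // write the batch here first (the leading _ keeps readers away)
val published = new Path("hdfs:///data/folderB/part-00000")    // the directory the next job watches

// ... write the output file to `staging` here ...

// The rename is atomic on HDFS, so the downstream job either sees the whole file or nothing.
if (!fs.rename(staging, published)) {
  throw new IllegalStateException(s"Could not publish $staging to $published")
}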