I'm trying to add a new node to my existing Cassandra cluster, which is running on AWS EC2. I've configured the cluster name, the seed node, and the listen_address in my cassandra.yaml. When I start the Cassandra service, the seed node is found properly, but when the node starts receiving its share of the data, a Java exception occurs:
Error during bootstrap: Stream failed
Any idea what is causing this and how I could solve it? Simply trying again is not a solution :/
Streams can fail due to either a networking problem or an sstable corruption. Grep your logs for the stream-id to get more details.
Something like this should help:
$ grep "0fb1b0d0-8fc9-11e5-a498-4b9679ec178d" system.log | sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/Source/' | awk '{ split($3,a,":"); $2=a[1]; $3=""; $4=""; print }' | uniq -c
You should run this both on the source and target ends of the stream to get the underlying reasons for the stream failures.
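If you don't yet know the stream id, the failure itself is logged together with it; a rough way to surface candidate lines first (log location varies by install):
$ grep -i "stream" system.log | grep -iE "fail|error" | tail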
If the stream failed for networking reasons you'll just have to try
again.
If the stream failed due to an sstable corruption, run scrub for that
sstable at the source node and try again.
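For example, on the source node (keyspace and table names below are placeholders for the ones named in your error):
nodetool scrub my_keyspace my_table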
If your bootstrap almost finished but not quite (most of the data streamed over), you may want to bring the node up with auto_bootstrap: false in cassandra.yaml and repair the rest of the way.
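Concretely, that looks something like this (a sketch; the keyspace name is a placeholder, and the yaml line goes on the node that almost finished bootstrapping):
# in cassandra.yaml:
#   auto_bootstrap: false
# then start Cassandra and repair the ranges the stream didn't finish:
nodetool repair my_keyspace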
This post I wrote is about repairs but it also includes stream troubleshooting and is relevant in the context of bootstrapping.
I'm using NiFi 1.6.0.
I'm trying to write to HDFS and to Hive (Cloudera) with NiFi.
On "PutHDFS" I configured the "Hadoop Configuration Resources" property with the hdfs-site.xml and core-site.xml files and set the directories, but when I try to start it I get the following error:
"Failed to properly initialize processor, If still shcedule to run,
NIFI will attempt to initalize and run the Processor again after the
'Administrative Yield Duration' has elapsed. Failure is due to
java.lang.reflect.InvocationTargetException:
java.lang.reflect.InvicationTargetException"
On "PutHiveStreaming" I'm configure the "Hive Metastore URI" with
thrift://..., the database and the table name and on "Hadoop
Confiugration Resources" I'm put the Hive-site.xml location and when
I'm trying to Start it I got the following error:
"Hive streaming connect/write error, flow file will be penalized and routed to retry.
org.apache.nifi.util.hive.HiveWritter$ConnectFailure: Failed connectiong to EndPoint {metaStoreUri='thrift://myserver:9083', database='mydbname', table='mytablename', partitionVals=[]}:".
How can I solve the errors?
Thanks.
For #1, if you got your *-site.xml files from the cluster, it's possible that they use internal IPs to refer to components like the DataNodes, which you won't be able to reach directly. Try setting dfs.client.use.datanode.hostname to true in the hdfs-site.xml on the client.
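For example, in the client-side hdfs-site.xml you point "Hadoop Configuration Resources" at:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>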
For #2, I'm not sure PutHiveStreaming will work against Cloudera, IIRC they use Hive 1.1.x and PutHiveStreaming is based on 1.2.x, so there may be some Thrift incompatibilities. If that doesn't seem to be the issue, make sure the client can connect to the metastore port (looks like 9083).
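A quick way to check that connectivity from the NiFi host (hostname and port taken from the error above; use telnet if nc isn't available):
nc -vz myserver 9083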
I am running Hadoop on my local system, but when I run the ./start-all.sh command everything starts except the NameNode; it gets connection refused, and the log file prints the exception below:
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 291.
Can you please help me?
Start the NameNode with the recover flag enabled, using the following command:
./bin/hadoop namenode -recover
The metadata in the Hadoop NameNode consists of:
fsimage: contains the complete state of the file system at a point in time
edit logs: contains each file system change (file creation/deletion/modification) that was made after the most recent fsimage.
If you list all the files inside your NN workspace directory, you'll see files like:
fsimage_0000000000000000000 (fsimage)
fsimage_0000000000000000000.md5
edits_0000000000000003414-0000000000000003451 (edit logs; there are many of these, with different names)
seen_txid (a separate file containing the last seen transaction id)
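A hedged way to look at these yourself; the directory comes from dfs.namenode.name.dir, and the path below is only the single-node default:
ls -l /tmp/hadoop-$USER/dfs/name/current/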
When the NN starts, Hadoop loads the fsimage and applies all the edit logs, doing a lot of consistency checks along the way; it aborts if a check fails. Let's make that happen: if I rm edits_0000000000000000001-0000000000000000002 from the edit logs in my NN workspace and then try sbin/start-dfs.sh, I get an error message in the log like:
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 3.
So your error message indicates that your edit logs are inconsistent (maybe corrupted, or maybe some of them are missing). If you are just playing with Hadoop locally and don't care about its data, you can simply run hadoop namenode -format to re-format it and start from the beginning; otherwise you need to recover your edit logs, from the SNN or wherever you backed them up.
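Both options as commands, for reference (the format wipes all HDFS metadata, so only use it on a throwaway local setup):
# throwaway data: re-format and start over
./bin/hadoop namenode -format
# keep the data: attempt edit-log recovery (answers interactive prompts)
./bin/hadoop namenode -recover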
I am running a Mahout RecommenderJob on Hadoop in Syncfusion. I get the following, but no output... it seems to run indefinitely.
Does anyone have an idea why I am not getting an output.txt from this? Why does it seem to run indefinitely?
I suspect this could be due to insufficient disk space on your machine; in that case, I'd suggest cleaning up your disk space and trying again.
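A quick way to check whether HDFS itself is short on space (a sketch; the local Windows drive can be checked with the usual OS tools):
hadoop fs -df -h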
Alternatively, I'd suggest using the Syncfusion Cluster Manager, with which you can form a cluster out of multiple nodes/machines so that there is sufficient capacity available to execute your job.
-Ramkumar
I've tested the same MapReduce job you're trying to execute using Syncfusion BigData Studio, and it worked for me.
Please find the input details I used below.
Command:
hadoop jar E:\mahout-examples-0.12.2-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input=/Input.txt --output=output
Sample input (Input.txt):
For input data, I've used the data available on the Apache Mahout site (see the link below) and saved it in a text file.
http://mahout.apache.org/users/recommender/userbased-5-minutes.html
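For reference, RecommenderJob expects one userID,itemID,preference triple per line; a few illustrative lines (made up here, not copied from the linked page) look like:
1,101,5.0
1,102,3.0
2,101,2.0
2,103,5.0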
I've also noticed a misspelled "COOCCURRENCE" in your command. Please correct this, or else you could face a "Class Not Found" exception.
Output:
Please find the generated output below.
-Ramkumar :)
One of the nodes in a cassandra cluster has died.
I'm using cassandra 2.0.7 throughout.
When I do a nodetool status this is what I see (real addresses have been replaced with fake 10 nets)
[root#beta-new:/opt] #nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.10.1.94 171.02 KB 256 49.4% fd2f76ae-8dcf-4e93-a37f-bf1e9088696e rack1
DN 10.10.1.98 ? 256 50.6% f2a48fc7-a362-43f5-9061-4bb3739fdeaf rack1
I tried to get the token ID for the down node by doing a nodetool ring command, grepping for the IP and doing a head -1 to get the initial one.
[root#beta-new:/opt] #nodetool ring | grep 10.10.1.98 | head -1
10.10.1.98 rack1 Down Normal ? 50.59% -9042969066862165996
I then started following this documentation on how to replace the node:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html?scroll=task_ds_aks_15q_gk
So I installed cassandra on a new node but did not start it.
Set the following options:
cluster_name: 'Jokefire Cluster'
seed_provider:
- seeds: "10.10.1.94"
listen_address: 10.10.1.94
endpoint_snitch: SimpleSnitch
And set the initial token of the new install as the token -1 of the node I'm trying to replace in cassandra.yaml:
initial_token: -9042969066862165995
And after making sure there was no data yet in:
/var/lib/cassandra
I started up the database:
[root#web2:/etc/alternatives/cassandrahome] #./bin/cassandra -f -Dcassandra.replace_address=10.10.1.98
The documentation I link to above says to use the replace_address directive on the command line rather than in cassandra-env.sh if you have a tarball install (which we do) as opposed to a package install.
After I start it up, cassandra fails with the following message:
Exception encountered during startup: Cannot replace_address /10.10.10.98 because it doesn't exist in gossip
So I'm wondering at this point if I've missed any steps or if there is anything else I can try to replace this dead cassandra node?
Has the rest of your cluster been restarted since the node failure, by chance? Most gossip information does not survive a full restart, so you may genuinely not have gossip information for the down node.
This issue was reported as a bug CASSANDRA-8138, and the answer was:
I think I'd much rather say that the edge case of a node dying, and then a full cluster restart (rolling would still work) is just not supported, rather than make such invasive changes to support replacement under such strange and rare conditions. If that happens, it's time to assassinate the node and bootstrap another one.
So rather than replacing your node, you need to remove the failed node from the cluster and start up a new one. If using vnodes, it's quite straightforward.
Discover the node ID of the failed node (from another node in the cluster)
nodetool status | grep DN
And remove it from the cluster:
nodetool removenode <node ID>
Now you can clear out the data directory of the failed node, and bootstrap it as a brand-new one.
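To keep an eye on the removal, and for the "assassinate" step mentioned in the quote above (which only got a nodetool command in Cassandra 2.2+, not in 2.0.7):
nodetool removenode status
nodetool status
# nodetool assassinate 10.10.1.98   (2.2+ only; forcibly removes the endpoint from gossip)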
Some lesser-known issues with Cassandra dead-node replacement are captured in the link below, based on my experience:
https://github.com/laxmikant99/cassandra-single-node-disater-recovery-lessons
I've just started practicing with AWS EMR.
I have a sample word-count application set up, run, and completed from the web interface.
Following the guideline here, I have set up the command-line interface.
so when I run the command:
./elastic-mapreduce --list
I receive
j-27PI699U14QHH COMPLETED ec2-54-200-169-112.us-west-2.compute.amazonaws.com Word count
COMPLETED Setup hadoop debugging
COMPLETED Word count
Now, I want to see the log files. I run the command
./elastic-mapreduce --ssh --jobflow j-27PI699U14QHH
Then I receive the following error:
Error: Jobflow entered COMPLETED while waiting to ssh
Can someone please help me understand what's going on here?
Thanks,
When you set up a job on EMR, Amazon provisions a cluster on demand for you for a limited amount of time. During that time, you are free to ssh to your cluster and look at the logs as much as you want, but by the time your job has finished running, your cluster will be taken down! At that point, you won't be able to ssh anymore because your cluster simply won't exist.
The workflow typically looks like this:
Create your jobflow
It will be in status STARTING for a few minutes. At that point, if you try to run ./elastic-mapreduce --ssh --jobflow <jobid>, it will simply wait because the cluster is not available yet.
After a while the status will switch to RUNNING. If you had already started the ssh command above it should automatically connect you to your cluster. Otherwise you can initiate your ssh command now and it should connect you directly without any wait.
The RUNNING step could take a while or be very short, depending on how much data you're processing and the nature of your computations.
Once all your data has been processed, the status will switch to SHUTTING_DOWN. At that point, if you already sshed before you will get disconnected. If you try to use the ssh command at that point, it will not connect.
Once the cluster has finished shutting down it will enter a terminal state of either COMPLETED or FAILED depending on whether your job succeeded or not. At that point your cluster is no longer available, and if you try to ssh you will get the error you are seeing.
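To see which of these states your jobflow is in without waiting on the ssh command, you can poll it (flags as in the old elastic-mapreduce ruby CLI used above; double-check against your CLI's --help):
./elastic-mapreduce --describe --jobflow j-27PI699U14QHH | grep '"State"'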
Of course there are exceptions: you could set up an EMR cluster in interactive mode, for example if you just want Hive set up so you can ssh in and run Hive queries, and then you would have to take your cluster down manually. But if you just want a MapReduce job to run, then you will only be able to ssh for the duration of the job.
That being said, if all you want to do is debugging, there is no need to ssh in the first place! When you create your jobflow, you have the option to enable debugging, so you could do something like this:
./elastic-mapreduce --create --enable-debugging --log-uri s3://myawsbucket
What that means is that all the logs for your job will end up being written to the S3 bucket specified (you have to own this bucket, of course, and have permission to write to it). If you do that, you can then go into the AWS console afterwards, in the EMR section, and you will see a debug button next to your job, which should make your life much easier.
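If you went the --log-uri route, you can also pull the logs down locally with no cluster at all; a sketch using the newer AWS CLI (bucket and cluster id as above; the exact path layout under the bucket may differ):
aws s3 ls s3://myawsbucket/ --recursive
aws s3 cp s3://myawsbucket/j-27PI699U14QHH/ ./emr-logs/ --recursive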