Launching a cluster using the spark-ec2 script results in:
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves... <...>
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to <...> closed. <...>
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 199 of /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server: /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such file or directory [FAILED]
[timing] ganglia setup: 00h 00m 03s
Connection to <...> closed.
Spark standalone cluster started at <...>:8080
Ganglia started at <...>:5080/ganglia
Done!
However, when I run netstat, nothing is listening on port 5080.
Is this related to the above httpd error, or is it something else?
EDIT:
So the issue is found (see the answer below), and the fix can be applied locally on the instance, after which Ganglia works fine. However, the question is how to fix this issue at the root, so that the spark-ec2 script can start Ganglia normally without intervention.
The fact that Ganglia is not available is related to these errors - Ganglia is a PHP application, and it won't run without the PHP module for Apache.
Which version of Spark are you using to start the cluster?
It is a weird error - this file should be present in the AMI image.
Just traced the error: /etc/httpd/conf/httpd.conf is trying to load the libphp-5.5 library, while modules/ contains the libphp-5.6 version...
Changing httpd.conf fixes the issue; however, it'd be good to know a permanent fix within the spark-ec2 script.
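For example, a minimal sketch of the local fix on the master instance, assuming the module actually present is libphp-5.6.so as found above:
# Point httpd at the PHP module that actually exists, then restart it.
sudo sed -i 's/libphp-5\.5\.so/libphp-5.6.so/g' /etc/httpd/conf/httpd.conf
sudo service httpd restart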
This is because httpd fails to launch. As you have noted, httpd.conf is trying to load modules and failing. You can reproduce the problem via apachectl start and examine exactly which modules fail to load.
In my case there was one involving "auth" and "core". The last four (maybe five) listed will also fail to load. I did not encounter anything related to PHP, so maybe our cases are different. Anyway, the hacky solution is to comment out the problematic lines, as sketched below. I did so and am running Ganglia without issue.
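For reference, a rough sketch of that diagnosis (the failing module names in your httpd.conf will differ):
# Reports the LoadModule lines that fail to load, same as "apachectl start".
sudo apachectl configtest
# Comment out (prefix with '#') each failing LoadModule line in /etc/httpd/conf/httpd.conf, then:
sudo service httpd start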
Following these two tutorials, i.e. tutorial 1 and tutorial 2, I was able to set up an HBase cluster in fully-distributed mode. Initially, the cluster seems to work okay.
The jps output on the HMaster/NameNode:
The jps output on the DataNodes/RegionServers:
Nevertheless, whenever I try to execute hbase shell, it seems that the HBase processes are interrupted due to some Zookeeper error. The error is pasted below:
2021-03-13 11:52:26,047 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2021-03-13 11:52:26,048 WARN  [main] zookeeper.ZKUtil: hconnection-0x4375b0130x0, quorum=137.43.49.59:2181,137.43.49.58:2181,137.43.49.50:2181,137.43.49.49:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:221)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:417)
        at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:6
I made several attempts to solve this issue (including trying different compatible HBase/Hadoop versions), but still no progress.
I would like to have your input on this.
Shared below is other information that may be required.
In the /etc/hosts file:
(I already tried commenting out the HBase-related hosts in /etc/hosts; it still didn't work.)
In hbase-site.xml:
After 5 days of struggle, I learned what went wrong. Posting my solution here; hope it can help some other developers too. I would also like to thank @VV_FS for the comments.
In my scenario, I used virtual machines that I borrowed from an external party, so there were certain firewalls and other security measures in place. If you follow a similar experimental setup, these steps might help you.
To set up the HBase cluster, follow these tutorials.
Set up Hadoop in distributed mode.
Notes when setting up Hadoop in fully-distributed mode:
Make sure to open all the ports mentioned in the post. For example, use sudo ufw allow 9000 to open port 9000. Repeat the command for every port needed to run Hadoop.
Set up Zookeeper in distributed mode.
Notes when setting up Zookeeper in fully-distributed mode:
Make sure to open all the ports mentioned in the post. For example, use sudo ufw allow 3888 to open port 3888. Repeat the command for every port needed to run Zookeeper.
DO NOT START THE ZOOKEEPER NODES AFTER INSTALLATION. ZOOKEEPER WILL BE MANAGED BY HBASE INTERNALLY. THEREFORE, DON'T START ZOOKEEPER AT THIS STAGE.
Set up HBase in distributed mode.
When setting values in hbase-site.xml, use port number 60000 for the hbase.master tag, not 60010 (thanks @VV_FS for pointing this out in the earlier discussion).
Make sure to open all the ports mentioned in the post. For example, use sudo ufw allow 60000 to open port 60000. Repeat the command for every port needed to run HBase (see the sketch after these steps).
[Important thoughts]: If you encounter errors, always refer to the HBase logs. In my case, hbase-master-xxxxx.log and zookeeper-master--xxx.log helped me track down the exact errors.
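Pulling the steps together, a minimal sketch of the firewall and HBase settings involved; only the ports explicitly mentioned above are shown, and the hbase-env.sh path is an assumption:
# Open the example ports from the steps above (repeat for every port your tutorials list).
sudo ufw allow 9000      # Hadoop (example port from the Hadoop step)
sudo ufw allow 3888      # Zookeeper (example port from the Zookeeper step)
sudo ufw allow 60000     # HBase master (per the hbase.master note)
# Let HBase manage Zookeeper itself, matching "don't start Zookeeper at this stage".
echo 'export HBASE_MANAGES_ZK=true' >> /usr/local/hbase/conf/hbase-env.sh    # assumed HBase install path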
I am very new to Hadoop and am trying to set up a pseudo-distributed mode execution with Hadoop-3.1.2.
When I try to start the YARN service, I get the following error; please see the snippet below.
$ sbin/start-yarn.sh
Starting resourcemanagers on []
localhost: ERROR: Cannot set priority of resourcemanager process 13209
pdsh@manager-4: localhost: ssh exited with exit code 1
Starting nodemanagers
localhost: ERROR: Cannot set priority of nodemanager process 13366
pdsh@manager-4: localhost: ssh exited with exit code 1
I tried the solutions in this Stack Overflow question, which is very similar to my problem, but nothing worked out. The same problem is posted in another forum here; however, no solution is available there either.
Then I tried another option, which I describe below.
I set the following exports in the file sbin/start-yarn.sh.
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
Then I executed sbin/start-yarn.sh and got the following error. Please note that I have set up passwordless ssh for root@localhost.
$ sudo sbin/start-yarn.sh
Starting resourcemanagers on []
localhost: Permission denied (publickey).
pdsh@manager-4: localhost: ssh exited with exit code 255
Starting nodemanagers
localhost: Permission denied (publickey).
pdsh@manager-4: localhost: ssh exited with exit code 255
In addition to the steps suggested by zhao, ephraimbuddy, and qitian:
Please make sure that if you have a firewall running, the firewall is not blocking it in any way. Also make sure that the user executing the command has enough permissions to update the process priorities.
Before running the start-yarn script, try the command: ssh localhost
When you have set up passwordless ssh for localhost, change the PDSH_RCMD_TYPE value to ssh:
export PDSH_RCMD_TYPE=ssh
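For instance, a quick way to verify both prerequisites before rerunning start-yarn.sh; persisting the variable is optional, and the hadoop-env.sh location shown is the usual layout (an assumption here):
ssh localhost true                                                            # should return without a password prompt
export PDSH_RCMD_TYPE=ssh                                                     # make pdsh use plain ssh
echo 'export PDSH_RCMD_TYPE=ssh' >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"   # optional: persist it (or use ~/.bashrc)
sbin/start-yarn.sh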
This error message actually confused me a lot. Later I found it happened because I had not correctly configured cgroups. So first check your configuration and make sure it is all right; you can also check your ResourceManager logs.
I had the same issue, what helped me was the guide I found in this link!
The message "Cannot set priority of resourcemanager process" is misleading. I checked the resource manager logs and found that there was an error as follows
Unexpected close tag </property>; expected </configuration>
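A malformed *-site.xml like this can be caught before starting the daemons; a quick sketch, assuming xmllint is installed and the usual configuration layout:
# Prints the offending line if a tag is mismatched; exits silently when the XML is well-formed.
xmllint --noout "$HADOOP_HOME/etc/hadoop/yarn-site.xml"
xmllint --noout "$HADOOP_HOME/etc/hadoop/core-site.xml"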
I had the same issue and was finally able to solve it. I got ResourceManager and NodeManager to run. If you're running Hadoop 3.3 and up, the issue might be with the Java version you're using: hadoop_compatibility
" Apache Hadoop 3.3 and upper supports Java 8 and Java 11 (runtime only)
Please compile Hadoop with Java 8. Compiling Hadoop with Java 11 is not supported"
Solution:
Try switching to Java 8.
Then make sure your JAVA_HOME path variables are pointing to Java 8 (including any JAVA_HOME path variables in hadoop-env.sh); a sketch follows this list.
If the issue persists, check the error messages in the ResourceManager log located in $HADOOP_HOME/logs/.
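A minimal sketch of that switch, assuming an OpenJDK 8 package installed at the path shown (adjust to your system):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64             # assumed Java 8 install path
echo "export JAVA_HOME=$JAVA_HOME" >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"
"$JAVA_HOME/bin/java" -version                                 # should report 1.8.x
sbin/stop-yarn.sh && sbin/start-yarn.sh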
I'm doing development/research in an Ubuntu 14.04 VM with Hadoop 2.6.2, and I'm constantly held back because any command I issue to HDFS takes about 15 seconds to run. I've tried digging around, but I am unable to locate the source of the problem or even determine whether this is expected behavior.
I followed the directions on Apache's website and successfully got it up and running just fine in /opt/hadoop-2.6.2/
The following is a simple test command that I'm using to evaluate whether I have solved the problem.
/opt/hadoop-2.6.2/bin/hdfs dfs -ls /
I have inspected the logs and found no errors or strange warnings. A recommendation that I found online was to set the logger to output to the console.
HADOOP_ROOT_LOGGER=DEBUG,console /opt/hadoop-2.6.2/bin/hdfs dfs -ls /
Doing this yields something of interest: you can watch it hang between the following two lines.
16/01/15 11:59:02 DEBUG impl.MetricsSystemImpl: UgiMetrics, User and group related metrics
16/01/15 11:59:17 DEBUG util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
Thoughts: When I first saw this I assumed that it was hanging on authentication, but not only do I not have Kerberos installed, the default configuration for core-site.xml shows the authentication mode set to "simple". This makes me wonder why it would be looking for anything Kerberos-related to begin with. I attempted to explicitly disable it in the XML, and the lag/slowness didn't go away. I get the feeling that the delay is because it's timing out waiting for something. Does anyone else have any ideas?
I just went ahead and installed Kerberos anyway, just to see if it would work. The large delays have disappeared now that /etc/krb5.conf is present. I wonder if I could have just created the file with nothing in it. Hrmmm...
sudo apt-get install krb5-kdc krb5-admin-server
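Following up on that hunch, an untested, minimal sketch of a /etc/krb5.conf that avoids the realm/KDC lookups without installing the full KDC packages (the realm name is a placeholder):
sudo tee /etc/krb5.conf >/dev/null <<'EOF'
# Placeholder realm; with hadoop.security.authentication=simple it is never actually contacted.
[libdefaults]
    default_realm = EXAMPLE.COM
    dns_lookup_realm = false
    dns_lookup_kdc = false
EOF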
I set up percona_xtradb_cluster-56 with three nodes in the cluster. To start the first node, I use the following command, and it starts just fine:
#/etc/init.d/mysql bootstrap-pxc
The other two nodes, however, fail to start when I start them normally using the command:
#/etc/init.d/mysql start
The error I am getting is "The server quit without updating the PID file". The error log contains this message:
Error in my_thread_global_end(): 1 threads didn't exit 150605 22:10:29
mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended.
The cluster nodes are all running Ubuntu 14.04. When I use percona-xtradb-cluster-5.5, the cluster and all the nodes run just fine as expected. But I need to use version 5.6 because I am also using GTID, which is only available in version 5.6 and not supported in earlier versions.
I was following these two percona documentation to setup the cluster:
https://www.percona.com/doc/percona-xtradb-cluster/5.6/installation.html#installation
https://www.percona.com/doc/percona-xtradb-cluster/5.6/howtos/ubuntu_howto.html
Any insight or suggestions on how to resolve this issue would be highly appreciated.
The problem is related to memory, as "The Georgia" writes. There should be at least 500 MB of free memory for the default setup and bootstrapping. See here: http://sysadm.pp.ua/linux/px-cluster.html A quick check is sketched below.
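A way to check whether a node has enough free memory and to see why mysqld quit; the 500 MB figure comes from the answer above, and the error-log path is the Ubuntu default (an assumption):
free -m                               # available memory should comfortably exceed 500 MB
tail -n 50 /var/log/mysql/error.log   # the lines above the mysqld_safe message usually give the real reason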
I am new to Spark. After installing Spark using the parcels available in Cloudera Manager,
I configured the files as shown in the link below from Cloudera Enterprise:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
After this setup, I started all the nodes in Spark by running /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-all.sh, but I couldn't start the worker nodes, as I got the error below.
[root@localhost sbin]# sh start-all.sh
org.apache.spark.deploy.master.Master running as process 32405. Stop it first.
root@localhost.localdomain's password:
localhost.localdomain: starting org.apache.spark.deploy.worker.Worker, logging to /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain: failed to launch org.apache.spark.deploy.worker.Worker:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
localhost.localdomain: at gnu.java.lang.MainThread.run(libgcj.so.10)
localhost.localdomain: full log in /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain:starting org.apac
When I run the jps command, I get:
23367 Jps
28053 QuorumPeerMain
28218 SecondaryNameNode
32405 Master
28148 DataNode
7852 Main
28159 NameNode
I couldn't run the worker node properly. I actually intended to install standalone Spark, where the master and worker run on a single machine. In the slaves file of the Spark directory, I gave the address as "localhost.localdomin", which is my host name. I am not familiar with this settings file. Could anyone please help me out with this installation process? I couldn't run the worker nodes, but I can start the master node.
Thanks & Regards,
bips
Please notice the error info below:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
I met the same error when I installed and started the Spark master/workers on CentOS 6.2 x86_64. After making sure that libgcj.x86_64 and libgcj.i686 had been installed on my server, I finally solved it. Below is my solution; I hope it can help you.
It seems as if your JAVA_HOME environment variable is not set correctly.
Maybe your JAVA_HOME points to the system's embedded Java, e.g. java version "1.5.0".
Spark needs Java version >= 1.6.0. If you are using Java 1.5.0 to start Spark, you will see this error.
Try export JAVA_HOME="your java home path", then start Spark again; a sketch follows.
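A minimal sketch of that fix; the JDK path and the spark-env.sh location under the Cloudera parcel are assumptions and may differ on your machine:
java -version                                    # if this reports gij/libgcj or 1.5.0, the wrong Java is on the PATH
export JAVA_HOME=/usr/java/jdk1.7.0_67           # assumed path to a JDK >= 1.6
export PATH="$JAVA_HOME/bin:$PATH"
echo "export JAVA_HOME=$JAVA_HOME" >> /opt/cloudera/parcels/SPARK/lib/spark/conf/spark-env.sh   # assumed conf path
/opt/cloudera/parcels/SPARK/lib/spark/sbin/start-all.sh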