How are clients for Hortonworks Sandbox properly configured? - hadoop

Related: How to connect to Hortonworks sandbox HBase using the Java Client API
I am currently building a proof of concept using the Hortonworks Sandbox in a VM. However, I am failing to properly configure the client (outside the VM, but on the same computer). I looked for documentation on how a client needs to be configured, but didn't find any.
I need the client configuration for accessing HBase and MapReduce, but what I would appreciate most is documentation that lists the client configuration for all parts of the sandbox.

It is actually more mundane than I expected: it seems that not all of the necessary ports are forwarded by default, so they all have to be added in the VM's port-forwarding configuration.
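Once the ports are forwarded, a client outside the VM can be pointed at them. Below is a minimal sketch of an HBase client configuration, assuming ZooKeeper's port 2181 (and the HBase master/region server ports) are forwarded to localhost; the exact host names and ports depend on your port-forwarding setup and are assumptions, not sandbox defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SandboxHBaseClient {
    public static void main(String[] args) throws Exception {
        // Client-side configuration; assumes ZooKeeper (2181) is forwarded to localhost.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "127.0.0.1");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        // ZooKeeper hands back the sandbox's internal host name, so you may also
        // need to map that name to 127.0.0.1 in your hosts file.
        HBaseAdmin.checkHBaseAvailable(conf);
        System.out.println("HBase is reachable");
    }
}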

Related

Nodes discovery on Google Cloud with Dynamic IP: Spring Boot Java application

Our app is based on Java Spring Boot, and we run entirely on Google Cloud, where we have dynamic IPs and our server instances run behind an elastic load balancer; instances may get spawned or killed based on resource consumption.
None of these server instances can be assumed to have a static IP.
We are looking for a solution to connect the different server instances with dynamic IPs on Google Cloud.
Since 3.6, Hazelcast offers the Discovery SPI to integrate external discovery mechanisms into the system. As a result there are many discovery plugins, and you can implement your own. See the list of your options here. The Kubernetes plugin might be helpful in your case.
Some additional info on top of what Sertug said:
There is also a Google Compute discovery plugin (built on that SPI) that might be helpful; you can check it out here:
https://github.com/hazelcast/hazelcast-gcp
Also, here's a blog post (a little old but still valid):
https://blog.hazelcast.com/hazelcast-discovery-spi
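To illustrate what the GCP plugin looks like from the Java side, here is a rough sketch; it assumes a recent Hazelcast release (3.12 or later) where hazelcast-gcp is wired in through the join configuration, and the label filter is only an illustrative property:

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class GcpDiscoveryExample {
    public static void main(String[] args) {
        Config config = new Config();
        JoinConfig join = config.getNetworkConfig().getJoin();
        // Multicast does not work between Compute Engine instances, so disable it.
        join.getMulticastConfig().setEnabled(false);
        // Let the hazelcast-gcp plugin discover members through the GCP API.
        join.getGcpConfig()
            .setEnabled(true)
            .setProperty("label", "hazelcast=enabled"); // illustrative: only discover labelled instances
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("Cluster members: " + hz.getCluster().getMembers());
    }
}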

Memcache on kubernetes

I have a Spring Boot API running on a Google Cloud Kubernetes cluster, and I want a caching server for my API, so I thought I would use memcached.
I tried two ways of doing it:
I deployed memcached from the Google Cloud Launcher, which basically deploys an instance of memcached on a VM. I then assigned an external IP to my VM, whitelisted my IP to try it locally, and of course opened port 11211 (the default one). For the client side I used this one, specified the IP address, but I still get a connection cancelled error: java.util.concurrent.CancellationException: Cancelled, and the documentation is bad, so I couldn't find anything that helps.
I decided to try another way, following this tutorial, and now I have the memcached cluster, but I don't know how to consume these pods from my other cluster. Or should the pods be on the same cluster my API is running on?
I would appreciate any help; this is my first encounter with global caching.
So I figured it out based on Jonah Benton's advice.
It was actually pretty simple: I used this tutorial to create a new pod running memcached in my cluster, and then I used this client to connect to it, and it worked like a charm!
Hope it helps someone.
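In case it helps, here is a hedged sketch of the client side using the spymemcached library; the service host name assumes a Kubernetes Service named mycache in the default namespace, which will differ in your setup:

import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class MemcachedSmokeTest {
    public static void main(String[] args) throws Exception {
        // Assumes a ClusterIP Service named "mycache" exposing port 11211 inside the cluster.
        MemcachedClient client = new MemcachedClient(
                new InetSocketAddress("mycache.default.svc.cluster.local", 11211));
        client.set("greeting", 3600, "hello from the API").get(); // wait for the write to complete
        System.out.println(client.get("greeting"));
        client.shutdown();
    }
}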

SSH access for the headnode of FIWARE-Cosmos

I am following this guide on Hadoop/FIWARE-Cosmos and I have a question about the Hive part.
I can access the old cluster’s (cosmos.lab.fiware.org) headnode through SSH, but I cannot do it for the new cluster. I tried both storage.cosmos.lab.fiware.org and computing.cosmos.lab.fiware.org and failed to connect.
My intention in trying to connect via SSH was to test Hive queries on our data through the Hive CLI. After failing to do so, I checked and was able to connect to the 10000 port of computing.cosmos.lab.fiware.org with telnet. I guess Hive is served through that port. Is this the only way we can use Hive in the new cluster?
SSH access is not enabled on the new pair of clusters. This is because users tended to install a lot of stuff (even things not related to Big Data) in the “old” cluster, which had SSH access enabled, as you mention. So the new pair of clusters is intended to be used only through the exposed APIs: WebHDFS for data I/O and Tidoop for MapReduce.
That being said, a Hive server is running as well, and it should be exposing a remote service on port 10000, as you also mention. I say “it should be” because it is running an experimental authenticator module based on OAuth2, as WebHDFS and Tidoop do. In theory, connecting to that port from a Hive client is as easy as using your Cosmos username and a valid token (the same one you are using for WebHDFS and/or Tidoop).
And what about a remote Hive client? Well, that is something your application has to implement. Anyway, I have uploaded some implementation examples to the Cosmos repo. For instance:
https://github.com/telefonicaid/fiware-cosmos/tree/develop/resources/java/hiveserver2-client
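As a rough sketch of what such a remote client could look like with the standard HiveServer2 JDBC driver (whether the experimental OAuth2 authenticator accepts the token as the JDBC password is an assumption here; check it against the examples in the repo above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CosmosHiveClient {
    public static void main(String[] args) throws Exception {
        String user = "your_cosmos_username"; // placeholder
        String token = "your_oauth2_token";   // placeholder: the same token used for WebHDFS/Tidoop
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://computing.cosmos.lab.fiware.org:10000/default", user, token);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}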

Use spark-submit to submit an application to an EC2 cluster

I am new to Spark and I am trying to run it on EC2. I followed the tutorial on the Spark web page and used spark-ec2 to launch a Spark cluster. Then I tried to use spark-submit to submit the application to the cluster. The command looks like this:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://ec2-54-88-9-74.compute-1.amazonaws.com:7077 --executor-memory 2G --total-executor-cores 1 ./examples/target/scala-2.10/spark-examples_2.10-1.0.0.jar 100
However, I got the following error:
ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
Please let me know how to fix it. Thanks.
You're seeing this issue because the master node of your Spark standalone cluster can't open a TCP connection back to the driver (on your machine). The default mode of spark-submit is client, which runs the driver on the machine that submitted the job.
A new cluster deploy mode was added to spark-deploy that submits the job to the master, where the driver is then run on one of the cluster's nodes, removing the need for a direct connection. Unfortunately this mode is not supported in standalone mode.
You can vote for the JIRA issue here: https://issues.apache.org/jira/browse/SPARK-2260
Tunneling your connection via SSH is possible but latency would be a big issue since the driver would be running locally on your machine.
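To make the driver/client distinction concrete, here is a minimal sketch of a driver program; when it is launched with spark-submit in the default client deploy mode, the JVM running this main method is the driver, and the master and executors on EC2 must be able to connect back to it (the master URL is the one from the question):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverLocationDemo {
    public static void main(String[] args) {
        // In client deploy mode this JVM is the driver; the cluster must be able
        // to open TCP connections back to the machine it runs on.
        SparkConf conf = new SparkConf()
                .setAppName("DriverLocationDemo")
                .setMaster("spark://ec2-54-88-9-74.compute-1.amazonaws.com:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);
        long count = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
        System.out.println("count = " + count);
        sc.stop();
    }
}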
I'm curious whether you are still having this issue ... but in case anyone is asking, here is a brief answer. As clarified by jhappoldt, the master node of your Spark standalone cluster can't open a TCP connection back to the driver (on your local machine). Two workarounds are possible; I have tested both successfully.
(1) From the EC2 Management Console, create a new security group and add rules to enable TCP traffic back and forth from your PC (public IP). (What I did was add TCP rules, inbound and outbound.) Then add this security group to your master instance (right click --> Networking --> Change Security Groups). Note: add it, and don't remove the already established security groups.
This solution works well, but in your specific scenario, deploying your application from a local machine to an EC2 cluster, you will face further (resource-related) problems, so the next option is the best one:
(2) Copy your .jar file (or .egg) to the master node using scp. You can check this link http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html for information about how to do that, and then deploy your application from the master node. Note: Spark is already pre-installed there, so you do nothing but write the exact same command you would write on your local machine, from ~/spark/bin. This should work perfectly.
Are you executing the command on your local machine, or on the created EC2 node? If you're doing it locally, make sure port 7077 is open in the security settings, as it's closed to the outside by default.

Session replication in Glassfish Cluster on EC2

I've built a GlassFish cluster, administered via SSH, with 2 instances. I deployed an application that shows the session id.
This application has the following in its web.xml:
<distributable/>
And in the sun-web.xml:
<session-config>
  <cookie-properties>
    <property name="cookieDomain" value="compute.amazonaws.com"/>
  </cookie-properties>
</session-config>
I enabled "Availability" in Edit Application.
But when I access the web app on the two instances I see different session ids.
Can anyone help me?
EDIT: As some users noticed, multicast is not supported on EC2. A solution comes with GlassFish 3.1.2, which allows two other ways to discover a cluster when multicast is not permitted (by listing the instances' IPs or by making it auto-generate the list). How to start a cluster in a non-multicast environment is specified here: Administering GlassFish Server Clusters
Read the High Availability Administration Guide for v3.1.2, specifically section "Discovering a Cluster When Multicast Transport Is Unavailable". Haven't tried it yet, but looking forward. Cheers!
The first thing to try would be to validate whether multicast works on your setup, using the asadmin command below:
asadmin validate-multicast
You can check out this simple YouTube video on how to do that:
http://www.youtube.com/watch?v=sJTDao9OpWA
In case multicast does not work, you may want to try the non-multicast option that is supported in the recent GlassFish 3.1.2 release.
The release notes say that it supports non-multicast clustering:
New Support for non-Multicast clustering. GlassFish High-Availability clustering is now possible in environments where multicast is disabled.
I was not able to find any documentation that provides steps for setting up a non-multicast cluster. There may be some for enterprise support customers.
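For completeness, here is a minimal sketch of a servlet like the one described in the question, which prints the session id so you can check whether both instances return the same id once replication and the cookie domain are configured (the class name and URL pattern are illustrative):

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/session-id")
public class SessionIdServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // getSession(true) creates the session if it does not exist yet; with
        // <distributable/> and availability enabled it should be replicated across instances.
        String id = req.getSession(true).getId();
        resp.setContentType("text/plain");
        resp.getWriter().println("Session id: " + id);
    }
}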
