We have deployed Spark on Mesos and are running into problems when trying to read from HDFS.
Out of the box we tried using Kerberos, which works fine in local mode but fails when running on Mesos (even though we made sure that each machine had a valid token).
My question is, what alternatives do we have?
According to this document, the only two options are Kerberos and a shared secret, of which Kerberos only works on YARN.
I haven't found any article describing how a shared secret could be set up with HDFS.
Related
Is there a way to allow a developer to access a Hadoop command line without SSH? I would like to place some Hadoop clusters in a specific environment where SSH is not permitted. I have searched for alternatives such as a desktop client but so far have not seen anything. I will also need to federate sign-on info for developers.
If you're asking about hadoop fs and similar commands, you don't need SSH for this.
You just need to download the Hadoop clients and configure core-site.xml (fs.defaultFS) and hdfs-site.xml to point at the remote cluster. However, this is an administrative security hole, so setting up an edge node that does have trusted and audited SSH access is preferred.
Similarly, Hive, HBase, or Spark jobs can be run with the appropriate clients or configuration files without any SSH access, just local libraries.
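As a hedged illustration of the client-only route, here is a minimal Scala sketch that lists a directory on a remote HDFS without any SSH (the NameNode host and port are placeholders; in practice they come from the cluster's own core-site.xml):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object RemoteHdfsLs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Point the client at the remote cluster; normally this value would come
    // from the cluster's core-site.xml placed on the classpath.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020")

    val fs = FileSystem.get(new URI("hdfs://namenode.example.com:8020"), conf)
    fs.listStatus(new Path("/user")).foreach(status => println(status.getPath))
    fs.close()
  }
}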
You don't need SSH to use Hadoop. Also, Hadoop is a combination of different stacks; which part of Hadoop are you referring to specifically? If you are talking about HDFS, you can use WebHDFS. If you are talking about YARN, you can use its REST API. There are also various UI tools such as Hue you can use. Notebook apps such as Zeppelin or Jupyter can also be helpful.
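For the WebHDFS route, a rough sketch using nothing but the Scala standard library (the host, port, and user name are placeholder assumptions; the default WebHDFS port is 9870 on Hadoop 3.x and 50070 on Hadoop 2.x):

import scala.io.Source

object WebHdfsList {
  def main(args: Array[String]): Unit = {
    // LISTSTATUS returns a JSON directory listing; user.name only applies
    // when the cluster uses simple (non-Kerberos) authentication.
    val url = "http://namenode.example.com:9870/webhdfs/v1/tmp?op=LISTSTATUS&user.name=dev"
    val response = Source.fromURL(url).mkString
    println(response)
  }
}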
We have a non-Kerberized Hortonworks cluster which needs to access services in a Kerberized Cloudera cluster.
Which are the ways in which the non-Kerberized cluster can communicate with the Kerberized cluster?
Can we
Configure the KDC in the Kerberized cluster to be the common KDC?
Kerberize the Hortonworks cluster by installing and configuring Kerberos, creating SPNs and UPNs, etc.?
"Which are the ways in which the non Kerberized cluster can communicate with the kerberized cluster" generally there are none (with exceptions -see below) .. once you kerberize a cluster, it becomes a "secure" cluster that would require Kerberos authentication to talk to many of that cluster resources. If another (source) cluster where you're making requests is not kerberized, it'll not even have a kerberos ticket to authenticate in the other cluster.
Certain services can control authentication separately, though. For HBase, those are hbase.security.authentication and hbase.rest.authentication.type (each can be set to simple or kerberos). Which services are you trying to use? Hive doesn't have an equivalent for these HBase settings; Solr does, for example (see "Solr Secure Authentication"), and so on (I didn't go through all services).
So authentication can be relaxed for certain services, but that relaxation applies to all access, not just access from the non-Kerberized cluster.
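For concreteness, a minimal Scala sketch of an HBase client talking to a cluster whose authentication has been relaxed to simple; the ZooKeeper quorum host is a placeholder, and the server side has to be configured the same way for this to work:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

object HBaseSimpleAuthCheck {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    // Placeholder quorum host; take the real value from the cluster's hbase-site.xml.
    conf.set("hbase.zookeeper.quorum", "zk.example.com")
    // With "simple" the cluster does not demand Kerberos tickets from clients;
    // note that this relaxation affects every client, not just the other cluster.
    conf.set("hbase.security.authentication", "simple")

    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin
    admin.listTableNames().foreach(table => println(table.getNameAsString))
    connection.close()
  }
}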
What you're looking for might require a custom application in between if you'd like to have access from a non-secure to a secure cluster.
Unless you use self-written applications that are able to deal with the different configuration properties, I do not see a way to set up communication between the two clusters. As @Tagar already mentioned, it will be necessary to maintain a similar configuration.
I have a Hadoop cluster set up in AWS EC2, but my development setup (Spark) is on a local Windows system. When I try to connect to the AWS Hive Thrift server I am able to connect, but I get a connection refused error when trying to submit a job from my local Spark configuration. Please note that on Windows my user name is different from the user name under which the Hadoop ecosystem is running on the AWS server. Can anyone explain how the underlying system works in this setup?
1) When I submit a job from my local Spark to the Hive Thrift server and it involves an MR job, will the AWS Hive setup submit that job to the NameNode with its own identity, or will it carry forward my local Spark identity?
2) In my configuration, do I need to run Spark locally with the same user name as the Hadoop cluster in AWS?
3) Do I also need to configure SSL to authenticate my local system?
Please note that my local system is not part of the Hadoop cluster and cannot be added to the AWS Hadoop cluster.
Please let me know what the actual setup would be for an environment where my Hadoop cluster is in AWS and Spark is running on my local machine.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client node in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you want to run Spark against EC2 from your local machine, then you can set a few properties programmatically.
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be to get your cluster's XML files from HADOOP_CONF_DIR and copy them into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
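A hedged sketch of what "setting a few properties programmatically" can look like from a local driver; all hostnames, ports, and the user name below are placeholders, and the exact properties depend on what is in your cluster's core-site.xml / hive-site.xml:

import org.apache.spark.sql.SparkSession

object RemoteClusterSparkApp {
  def main(args: Array[String]): Unit = {
    // The Hadoop user the remote services should see; UserGroupInformation
    // also honours the HADOOP_USER_NAME environment variable.
    System.setProperty("HADOOP_USER_NAME", "hadoop")

    val spark = SparkSession.builder()
      .appName("remote-cluster-test")
      .master("local[*]")
      // spark.hadoop.* properties are copied into the Hadoop Configuration;
      // these endpoints are placeholders for the EC2 cluster's addresses.
      .config("spark.hadoop.fs.defaultFS", "hdfs://ec2-namenode.example.com:8020")
      .config("spark.hadoop.hive.metastore.uris", "thrift://ec2-hive.example.com:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()
    spark.stop()
  }
}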
I have a web app deployed on my localhost.
I also have a MapReduce job (cleandata.jar) in a Hortonworks Sandbox on my PC.
How can I call my MapReduce .jar from my web app?
I'm trying to do this with JSch and ChannelExec in order to make a system call to the virtual machine, and this works. Is there a more elegant/easier way to do this?
I haven't used the Hortonworks Sandbox, but the proper way of launching YARN (and MapReduce) applications programmatically is by using the YarnClient Java class. It's quite complicated, though, because you need to know some Hadoop internals to do that. First, you need network access to the ResourceManager, NodeManagers, DataNodes, and NameNode. Next, you should set Configuration properties according to the hdfs-site.xml and yarn-site.xml files you will probably find in the sandbox (you can copy them and put them on the classpath of your webapp).
You can take a look here: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Note that if your cluster is secured, the webapp from which you will submit the job has to run on Java with extended security (JCE), and you should authenticate using UserGroupInformation.
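A rough Scala sketch of that YarnClient flow, assuming the sandbox's core-site.xml/yarn-site.xml are on the webapp's classpath; the principal and keytab path are placeholders, and building the actual ApplicationSubmissionContext is left out:

import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object LaunchFromWebApp {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / yarn-site.xml from the classpath, if present.
    val conf = new YarnConfiguration()
    UserGroupInformation.setConfiguration(conf)

    // Only needed on a secured (Kerberized) cluster.
    if (UserGroupInformation.isSecurityEnabled) {
      UserGroupInformation.loginUserFromKeytab(
        "webapp@EXAMPLE.COM", "/etc/security/keytabs/webapp.keytab")
    }

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(conf)
    yarnClient.start()

    // From here you would build an ApplicationSubmissionContext describing
    // the application (jar, launch command, resources) and submit it.
    val newApp = yarnClient.createApplication()
    println(s"New application id: ${newApp.getNewApplicationResponse.getApplicationId}")

    yarnClient.stop()
  }
}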
I made a Spark application that analyzes file data. Since the input data could be big, it's not enough to run my application standalone on a single machine. With one more physical machine, how should I architect this?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do this without HDFS (for sharing the file data)?
Spark supports a couple of cluster managers: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work against your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine to see the Spark UI, which shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation.
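On the application side, a minimal sketch of a driver that connects to such a standalone cluster (the master URL and input path are placeholders). Note that without HDFS, the input file must be readable at the same path on every worker, for example via an NFS mount:

import org.apache.spark.sql.SparkSession

object StandaloneFileAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-analysis")
      // Use the spark://HOST:PORT URL printed by start-master.sh.
      .master("spark://master-host.example.com:7077")
      .getOrCreate()

    // Without HDFS, "file://" paths must resolve on every worker machine,
    // otherwise the read will fail on workers that cannot see the file.
    val lines = spark.read.textFile("file:///shared/data/input.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}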
Hope I have managed to help! :)