Does an embedded Flume agent need Hadoop to function on a cluster? - hadoop

I am trying to write an embedded Flume agent into my web service to transfer my logs to another Hadoop cluster where my Flume agent is running. To work with the embedded Flume agent, do we need Hadoop to be running on the server where my web service runs?

TL;DR: I think not.
Longer version: I haven't checked, but the developer guide (https://flume.apache.org/FlumeDeveloperGuide.html#embedded-agent) says:
Note: The embedded agent has a dependency on hadoop-core.jar.
And in the User Guide (https://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you can specify the HDFS path:
HDFS directory path (eg hdfs://namenode/flume/webdata/)
On the other hand, are you sure you want to work with the embedded agent instead of running Flume where you want to put the data and using, for example, an HTTP Source (https://flume.apache.org/FlumeUserGuide.html#http-source)? (...or any other source you can send data to)
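For orientation, here is a minimal sketch of what such an embedded agent inside the web service could look like, loosely following the developer guide linked above. The embedded agent only supports memory/file channels and Avro sinks, so it forwards events to an Avro source on the remote Flume agent; the hostname and port are placeholders, and only the Flume jars (plus the hadoop-core.jar dependency noted above) need to be on the classpath.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class LogForwarder {
    public static void main(String[] args) throws EventDeliveryException {
        Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "200");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "remote-flume-host"); // placeholder: host of the remote agent's Avro source
        properties.put("sink1.port", "4545");                  // placeholder: port of that Avro source
        properties.put("processor.type", "failover");          // sink processor type; failover also works with a single sink

        EmbeddedAgent agent = new EmbeddedAgent("webServiceAgent");
        agent.configure(properties);
        agent.start();

        // Wrap a log line in a Flume event and hand it to the agent.
        Event event = EventBuilder.withBody("some log line", StandardCharsets.UTF_8);
        agent.put(event);

        agent.stop();
    }
}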

Related

Confluent HDFS Sink Connector: How to configure custom hadoop user and group?

We are currently using the Confluent HDFS Sink Connector platform within a Docker container to write data from Kafka (a separate Kafka cluster) to HDFS (a separate Hadoop cluster). By default the connector writes data to HDFS as the root user and wheel group.
How can I configure the connector to use a specific Hadoop user/group? Is there an environment variable I need to set in Docker?
Thanks.
The Java process in the Docker container runs as root.
You need to either make your own container with your own user account or run the Connect Workers as a different Unix account in some other way.
You could try setting the HADOOP_IDENT_USER or HADOOP_USER_NAME environment variables, but I think these are only picked up by the Hadoop scripts, not the Java API.
Keep in mind that user accounts in Hadoop don't really matter if you're not using a Kerberized cluster.
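If it helps, on a non-Kerberized cluster the Hadoop Java client accepts whatever user name it is told to act as, so "writing as a specific Hadoop user" in plain Java looks roughly like the sketch below (the user name, NameNode address and directory are hypothetical). Whether the Connect worker gives you a hook to run such code is a separate question; this is only to illustrate the point about non-Kerberized clusters.

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class WriteAsUser {
    public static void main(String[] args) throws Exception {
        // Without Kerberos, the cluster trusts the user name the client claims.
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser("etl-user"); // hypothetical user

        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
            try (FileSystem fs = FileSystem.get(conf)) {
                fs.mkdirs(new Path("/topics/example")); // hypothetical target directory
            }
            return null;
        });
    }
}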

Spark Job Submission with AWS Hadoop cluster setup

I have a Hadoop cluster set up in AWS EC2, but my development setup (Spark) is on a local Windows system. I am able to connect to the AWS Hive Thrift server, but I get a connection refused error when trying to submit a job from my local Spark configuration. Please note that my Windows user name is different from the user name under which the Hadoop ecosystem runs on the AWS server. Can anyone explain how the underlying system works in this setup?
1) When I submit a job from my local Spark to the Hive Thrift server and it involves an MR job, will the AWS Hive setup submit that job to the NameNode under its own identity, or will it carry forward my Spark setup's identity?
2) In my configuration, do I need to run Spark locally with the same user name as the Hadoop cluster in AWS?
3) Do I also need to configure SSL to authenticate my local system?
Please note that my local system is not part of the Hadoop cluster and cannot be added to the AWS Hadoop cluster.
Please let me know what the actual setup should be for an environment where my Hadoop cluster is in AWS and Spark runs locally.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client node in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you want to run Spark against EC2 from your local machine, you can set a few properties programmatically, as sketched below.
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way is to take your cluster's XML files from HADOOP_CONF_DIR and copy them into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by the Spark executors.
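A rough sketch of that programmatic route, assuming the cluster's core-site.xml, hdfs-site.xml and yarn-site.xml have been copied into src/main/resources and that network access to the ResourceManager and NameNode is in place; the config values shown are placeholders, not a definitive setup.

import org.apache.spark.sql.SparkSession;

public class RemoteYarnSmokeTest {
    public static void main(String[] args) {
        // The YARN and HDFS addresses are picked up from the cluster XML files on the classpath.
        SparkSession spark = SparkSession.builder()
                .appName("remote-yarn-smoke-test")
                .master("yarn")                                     // submit against the remote YARN cluster
                .config("spark.submit.deployMode", "client")        // driver stays on the local machine
                .config("spark.yarn.jars", "hdfs:///spark/jars/*")  // placeholder: Spark jars staged on the cluster's HDFS
                .getOrCreate();

        // Trivial job to confirm that executors actually start on the cluster.
        long count = spark.range(0, 1000).count();
        System.out.println("count = " + count);

        spark.stop();
    }
}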

how users should work with ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster should actually use it.
For example, a user wants to copy a file to HDFS, run a spark-shell or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell has.
What is the normal/right/expected way to run all these tasks from a user perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use (HDFS, Spark, HBase and so on).
During the process of provisioning a cluster via Ambari, there is a step where you can define one or more servers on which the clients will be installed (I decided to install the clients on all servers).
Afterwards, one way to figure out which servers have the required clients installed is to check the Hosts view in Ambari, which lists the installed clients for each host.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, a client can use a service regardless of which server that service is actually running on.
Second, make sure that you comply with the security mechanisms of your cluster. For HDFS, this can determine which users you are allowed to act as and which directories those users can access. If you do not use security mechanisms such as Kerberos, Ranger and so on, you should be able to run your stated tasks directly from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to a server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
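Regarding the point about specifying the server address: any machine with the Hadoop client libraries, network access and the cluster's client configuration can also address the NameNode explicitly through the HDFS Java API. A minimal sketch (host, port and path are placeholders); the hdfs dfs commands accept fully qualified hdfs:// URIs in the same way.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsTmp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode explicitly.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("/tmp"))) {
                System.out.println(status.getPath());
            }
        }
    }
}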
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

Call MapReduce from Local a web app

I have a web app deployed on my localhost.
I also have a MapReduce job (cleandata.jar) in a Hortonworks Sandbox on my PC.
How can I call my MapReduce .jar from my web app?
I'm trying to do this with JSch and ChannelExec in order to make a system call to the virtual machine, and it works. Is there a more elegant/easier way to do this?
I didn't use the Hortonworks Sandbox, but the proper way of launching YARN (and MapReduce) applications programmatically is by using the YarnClient Java class. It's quite complicated though, because you need to know some Hadoop internals to do that. First, you need network access to the ResourceManager, NodeManagers, DataNodes and NameNode. Next, you should set Configuration properties according to the hdfs-site.xml and yarn-site.xml files you will probably find in the sandbox (you can copy them and put them on the classpath of your webapp).
You can take a look here: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Note that if your cluster is secured, the webapp from which you submit the job has to run on a Java runtime with extended security (JCE), and you should authenticate using UserGroupInformation.
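To give a feel for the moving parts, here is a trimmed-down sketch of the YarnClient flow described above. It assumes the sandbox's XML files are on the webapp classpath; the ApplicationMaster command, memory and queue are placeholders, and a real MapReduce submission also needs the launch context filled in with the job's jar, environment and local resources.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml / hdfs-site.xml copied onto the webapp classpath.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("cleandata");

        // The launch context tells YARN how to start the ApplicationMaster.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("<command that starts your ApplicationMaster>")); // placeholder
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM
        appContext.setQueue("default");

        ApplicationId appId = appContext.getApplicationId();
        yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);

        yarnClient.stop();
    }
}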

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input data could be big, it's not enough to run my application standalone. With one more physical machine, how should I set up the architecture for it?
I'm considering using Mesos as the cluster manager, but I'm pretty much a noob at HDFS. Is there any way to do it without HDFS (for sharing file data)?
Spark supports a couple of cluster modes: YARN, Mesos and Standalone. You may start with Standalone mode, which means you work off your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that bring up a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI there shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)
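Once the master and workers are up, the application itself only needs to point its master URL at the cluster. A minimal sketch (host and input path are placeholders); keep in mind that in Standalone mode without HDFS, the input path must be readable at the same location on every worker (for example via NFS or by copying the file to each machine).

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StandaloneFileAnalysis {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("file-analysis")
                .master("spark://master-host:7077") // placeholder: the URL printed by start-master.sh
                .getOrCreate();

        // Without HDFS, this path has to exist on every worker node.
        Dataset<Row> lines = spark.read().text("/shared/data/input.txt"); // placeholder path
        System.out.println("line count = " + lines.count());

        spark.stop();
    }
}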
