How to deploy & run a Samza job on HDFS?

I want to get a Samza job running on a remote system, with the Samza job itself stored on HDFS. The example for running a Samza job on a local machine (https://samza.apache.org/startup/hello-samza/0.7.0/) involves building a tar file, extracting it, and then running a shell script located inside the extracted archive.
The HDFS example (https://samza.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html) is not really well documented at all. It says to copy the tar file to HDFS and then follow the other steps from the non-HDFS example.
That would imply that the tar file now residing on HDFS needs to be extracted within HDFS, and a shell script then run from the extracted contents. But you can't untar an HDFS tar file with the hadoop fs shell...
Without untarring the tar file, you don't have access to run-job.sh to initiate the Samza job.
Has anyone managed to get this to work please?

We deploy our Samza jobs this way: we keep the Hadoop libraries in /opt/hadoop, the Samza shell scripts in /opt/samza/bin, and the Samza config files in /opt/samza/config. In the config file there is this line:
yarn.package.path=hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz
When we want to deploy a new version of our Samza job, we just create the tgz archive, move it (without untarring) to /deploy/samza/ on HDFS, and run /opt/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///opt/samza/config/$CONFIG_NAME.properties
The only downside is that the config files inside the archive are ignored: if you change the config in the archive, it does not take effect. You have to change the config files in /opt/samza/config. On the other hand, we are able to change the config of our Samza job without deploying a new tgz archive. The shell scripts under /opt/samza/bin remain the same for every build, so you never need to untar the archive just to get at the shell scripts.
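As a rough sketch, the deploy cycle therefore looks like this (the archive name, the hadoop1:8020 namenode address and $CONFIG_NAME come from the setup described above; yours will differ):
# remove the previous archive and upload the new one to the HDFS
# location referenced by yarn.package.path
hdfs dfs -rm -f /deploy/samza/samzajobs-dist.tgz
hdfs dfs -put samzajobs-dist.tgz /deploy/samza/
# start the job; the config stays on local disk, not inside the archive
/opt/samza/bin/run-job.sh \
  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
  --config-path=file:///opt/samza/config/$CONFIG_NAME.properties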
Good luck with Samzing! :-)

Related

hdfs or hadoop command to sync files or folders from local to HDFS

I have local files that get added daily, and I want to sync these newly added files to HDFS.
I tried the command below, but it does a complete copy every time; I want a command that copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package and perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script that compares source/target file times and then copies only the newer files, as sketched below.
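A minimal sketch of that shell-script approach, copying only files that do not yet exist on HDFS (a simpler check than comparing modification times; SRC and DST are placeholder paths and the directories are assumed to be flat):
#!/bin/bash
SRC=/home/user/files
DST=/data/files
for f in "$SRC"/*; do
  name=$(basename "$f")
  # hdfs dfs -test -e returns non-zero when the target path does not exist yet
  if ! hdfs dfs -test -e "$DST/$name"; then
    hdfs dfs -put "$f" "$DST/"
  fi
done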

How to add a jar file in Hive

I'm trying to add hive-contrib-0.10.0.jar in Hive using the ADD JAR hive-contrib-0.10.0.jar command, but it keeps saying hive-contrib-0.10.0.jar does not exist.
I'm using HDP 2.1 right now. I also uploaded this jar file into the /user/root folder using Hue and ran the command
ADD JAR hdfs:///hive-contrib-0.10.0.jar
but it gives me the same error: the jar file doesn't exist.
Is there any way to solve this problem?
Where should I keep this jar file so that it will run successfully, and what command should I use?
Upload the JAR file to an HDFS path, then add it with the ADD JAR command and the full HDFS path.
Example:
hadoop fs -put ~/Downloads/hive.jar /lib/
Open the Hive shell:
add jar hdfs:///lib/hive.jar
I see the following issues with your approach. Before adding, make sure you are able to list the file on the local file system or HDFS, wherever it exists.
The jar you are trying to add is already on the Hive classpath by default, since it is part of $HIVE_HOME/lib (on the local file system wherever you have the Hive client/service installed).
As for your question about how to add jars in Hive, you can add them from the local file system or from the Hadoop distributed file system (HDFS):
Add jar file:///root/hive-contrib-0.10.0.jar (given that you copied this jar to the LFS root directory)
Add jar hdfs://<namenode_hostname>:8020/user/root/hive-contrib-0.10.0.jar (given that you copied it to the HDFS root home)
If you want to permanently add the jars, you need to do the following:
1. Edit hive-site.xml (/etc/hive/conf):
<property>
<name>hive.aux.jars.path</name>
<value>file:///mnt1/hive-jars/hive-contrib-2.1.1.jar</value>
</property>
2. Copy hive-contrib-2.1.1.jar to the path /mnt1/hive-jars configured in hive-site.xml.
This should ideally work after restarting hive-server2:
3. sudo stop hive-server2
4. sudo start hive-server2
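Putting step 2 in command form (a sketch; it assumes the jar sits in your current directory and that /mnt1/hive-jars matches the hive.aux.jars.path value above):
sudo mkdir -p /mnt1/hive-jars
sudo cp hive-contrib-2.1.1.jar /mnt1/hive-jars/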
But sometimes this does not work; I am not sure why, so you can use the following dirty workaround instead.
Put your jar file in the following path so that Hive picks it up automatically on restart:
Copy hive-contrib-2.1.1.jar to /usr/lib/hive-hcatalog/share/hcatalog
sudo stop hive-server2
sudo start hive-server2
I have read the answers above, which were very useful, and I combined them all into one solution:
Put the jar on local disk and give it read/write permissions:
chmod -R 777 /tmp/json.jar
Upload it to the HDFS file system and give it permissions there too:
hdfs dfs -put /tmp/json.jar hdfs://1.1.1.1:8020/jars/
hdfs dfs -chmod -R 777 hdfs://1.1.1.1:8020/jars/
Add the jar in the Hive environment:
add jar hdfs://1.1.1.1:8020/jars/json.jar
You have to give the full path to the JAR, not only its name.
Don't guess the location; check the file system to confirm the jar is actually there before trying to add it.
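For instance (these paths are only illustrative; use whatever location you actually copied the jar to):
# check the local file system
ls -l /root/hive-contrib-0.10.0.jar
# check HDFS
hdfs dfs -ls /user/root/hive-contrib-0.10.0.jar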

Running an Oozie job using a modified Hadoop config file to support S3-to-HDFS copies

Hello, I am trying to copy a file from my S3 bucket into HDFS using the cp command.
I do something like
hadoop --config config fs -cp s3a://path hadooppath
This works well when my config directory is on my local machine.
However, now I am trying to set it up as an Oozie job, and I am unable to pass the configuration files that are in the config directory on my local system. Even if the config is in HDFS, it still doesn't seem to work. Any suggestions?
I tried the -D option in Hadoop and passed name/value pairs, but it still throws an error. It works only from my local system.
Did you try DistCp in Oozie? Hadoop 2.7.2 supports S3 as a data source. You can schedule it with coordinators; just pass the credentials to the coordinator either via the REST API or in the properties files. It's an easy way to copy data periodically (in a scheduled manner).
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/
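If the blocker is only that the S3 credentials live in your local config directory, one option (a sketch; the bucket, paths and namenode address are placeholders, and it uses the s3a connector) is to pass the credentials as -D properties on the DistCp call itself:
hadoop distcp \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  s3a://my-bucket/path/ hdfs://namenode:8020/data/path/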

HDFS file FTP from cluster to another machine

I want to create an Oozie workflow to transfer an HDFS file from an HDFS cluster to another server.
Since Oozie can run commands or scripts on any node in the cluster, is it possible to run a shell script or SFTP on one of the nodes and transfer the file to the destination server?
I think this task can easily be done by performing, from the remote server, an HTTP GET (open operation) on the HDFS file (you can use curl for that).
Anyway, if you want to do it through Oozie, I think you can create a script in charge of moving the desired file from HDFS to the local file system, and then perform an scp to move the file from the local file system to the remote machine, as sketched below.
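A minimal sketch of such a script, e.g. for an Oozie shell action (the HDFS path, local path, user and host are placeholders, and the node running it needs SSH access to the destination):
#!/bin/bash
# pull the file out of HDFS onto the local disk of the node running the action
hdfs dfs -get /data/exports/report.csv /tmp/report.csv
# push it to the remote server over SSH
scp /tmp/report.csv user@remote-host:/incoming/
# clean up the local copy
rm -f /tmp/report.csv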

How to make your mapper write to the local file system in Hadoop

I wish to write a file and create a directory on my local file system through my MapReduce code. Also, if I create a directory in the working directory during job execution, how can I move it to my local file system before the cleanup?
As your mapper runs on some arbitrary machine in your cluster, you can of course use basic Java file operations to write files there. You can use org.apache.hadoop.hdfs.DFSClient to access files on HDFS and copy them to a local file (I'd suggest you write inside HDFS and fetch any files from it after the jobs are finished).
Of course your local files will be local to the client machine (I assume separate machines), so something like NFS will be needed to make the written files available to you on any client. Watch out for concurrency problems.
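For instance, once the job has finished you can pull a side-output directory from HDFS down to the client machine (both paths here are just placeholders):
# run after the MapReduce job completes
hdfs dfs -get /user/me/job-side-output /local/path/job-side-output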
I'm also interested in writing files locally on the datanode. For that, I used java.io.FileWriter and java.io.BufferedWriter:
FileWriter fstream = new FileWriter("log.out",true);
BufferedWriter bout = new BufferedWriter(fstream);
bout.append(build.toString());
bout.close();
It only creates the file when it is executed through Eclipse. When run as a .jar with the following command:
hadoop jar jarFile.jar Mainclass
it doesn't create anything. I don't know whether it is a problem of misexecution, misconfiguration, or just that something is missing.
Actually this is only to create a log file for debugging. The actual files I want the datanode to write locally are created through Runtime.getRuntime(). However, the same thing happens: if the execution is carried out through Eclipse it's OK, but outside Eclipse the job seems to run fine and no file is ever created.
Before doing it on a cluster it should work on a single node, so the whole thing is done on a single computer for now.
