I wrote something like a custom Oozie FTP action (a simple example is described in "Professional Hadoop Solutions" by Boris Lublinsky, Kevin T. Smith, and Alexey Yakubovich). We have HDFS on node1 and the Oozie server on node2. Node2 also has an HDFS client.
My problem:
The Oozie job is started from node1 (all the needed files are located on HDFS on node1).
The custom Oozie FTP action successfully downloads the CSV files from FTP onto node2 (where the Oozie server is located).
I need to move the files into HDFS and create an external table from the CSV on node1.
I tried to use a Java action and call the fileSystem.moveFromLocalFile(...) method. I also tried a Shell action like /usr/bin/hadoop fs -moveFromLocal /tmp/import_folder/filename.csv /user/user_for_import/imported/filename.csv, but it had no effect. All actions seem to look for the files on node1. I get the same result if I start the Oozie job from node2.
Question: can I tell the FTP action which node to download the files to, so they land on node1? Or is there any other way to get the downloaded files into HDFS besides the ones described?
Oozie runs all its actions as MapReduce jobs on nodes from the configured MapReduce cluster. There is no way to make Oozie run a particular action on a particular node.
Basically, you should use Flume to ingest files into HDFS. Set up a Flume agent on your FTP node.
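A minimal sketch of starting such an agent, assuming the agent name and configuration file path; the actual source/sink wiring (e.g. a spooling-directory source watching the FTP download folder and an HDFS sink) would live in that properties file:
# start a Flume agent named ftp-agent using an assumed config file (sketch)
flume-ng agent --name ftp-agent --conf ./conf --conf-file ./conf/ftp-agent.conf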
Oozie allows the user to run a shell script on a particular node via the Oozie ssh action extension.
https://oozie.apache.org/docs/4.2.0/DG_SshActionExtension.html
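For instance, the ssh action could invoke a small script on node1 that pushes the downloaded CSV into HDFS; this is only a sketch, reusing the hypothetical paths from the question:
# upload_csv.sh -- executed on node1 via the Oozie ssh action (sketch, paths assumed)
/usr/bin/hadoop fs -moveFromLocal /tmp/import_folder/filename.csv /user/user_for_import/imported/filename.csv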
How can I execute HDFS copy commands on a Dataproc cluster using Airflow?
After the cluster is created using Airflow, I have to copy a few jar files from Google Cloud Storage to an HDFS folder on the master node.
You can execute HDFS commands on a Dataproc cluster using something like this:
gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --region=europe-west1
The easiest way is via Pig's fs command [1]:
gcloud dataproc jobs submit pig --execute 'fs -ls /'
or otherwise via its sh command [2] as a catch-all for other shell commands.
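For example, a sketch of using [1] to copy a jar from Google Cloud Storage into HDFS on the cluster; the bucket, jar name, and target directory below are placeholders (the target directory is assumed to exist), and the cluster name and region are the ones from the earlier example:
# submit a Pig job whose only statement is an fs (FsShell) command (sketch)
gcloud dataproc jobs submit pig --cluster=my-cluster --region=europe-west1 --execute 'fs -cp gs://my-bucket/libs/my-lib.jar /user/hadoop/jars/'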
For a single small file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
For a large file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consider [3] for details.
[1] https://pig.apache.org/docs/latest/cmds.html#fs
[2] https://pig.apache.org/docs/latest/cmds.html#sh
[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
I am not sure about your use case for doing this via Airflow, because if it is a one-time setup then I think we can run the commands directly on the Dataproc cluster. But I found some links which might be of some help. As I understand it, we can use a BashOperator to run such commands.
https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/
Airflow Dataproc operator to run shell scripts
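The kind of command such a BashOperator could wrap might look like the following; this is only a sketch, and the master node name (Dataproc masters are typically named <cluster>-m) and the zone are assumptions:
# ssh into the (assumed) master node and run an HDFS command there (sketch)
gcloud compute ssh my-cluster-m --zone=europe-west1-b --command='hdfs dfs -mkdir -p /user/hadoop/jars'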
I am learning Hadoop, and so far I have configured a 3-node cluster:
127.0.0.1 localhost
10.0.1.1 hadoop-namenode
10.0.1.2 hadoop-datanode-2
10.0.1.3 hadoop-datanode-3
My Hadoop NameNode directory looks like below:
hadoop
bin
data-> ./namenode ./datanode
etc
logs
sbin
--
--
As I have learned, when we upload a large file into the cluster, HDFS divides the file into blocks. I want to upload a 1 GB file to my cluster and see how it is stored on the datanodes.
Can anyone help me with the commands to upload a file and see where these blocks are being stored?
First, you need to check whether the Hadoop tools are on your PATH; if not, I recommend adding them.
One of the possible ways of uploading a file to HDFS:
hadoop fs -put /path/to/localfile /path/in/hdfs
I would suggest you read the documentation and get familiar with the high-level commands first, as it will save you time:
Hadoop Documentation
Start with the "dfs" command, as it is one of the most often used commands.
If one of the tasks in the Luigi graph needs to run on a remote Hadoop cluster, is that possible? The machine on which Luigi runs is different from the Hadoop cluster. Can Luigi still check whether the HDFS file in the remote cluster exists?
I tried to find documentation for this but wasn't able to.
You can run a job that launches any script.
The HDFS target documentation is here:
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.html
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.target.html
I want to create an Oozie workflow to transfer an HDFS file from an HDFS cluster to another server.
Since Oozie can run commands or scripts on any node in the system, is it possible to run a shell script or SFTP on one of the nodes and transfer the file to the destination server?
I think this task can easily be done by performing, from the remote server, an HTTP GET (the open operation) on the HDFS file (you can use curl for that).
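For illustration, that HTTP approach via WebHDFS might look like this, assuming WebHDFS is enabled; the NameNode host, port (50070 on Hadoop 2.x, 9870 on 3.x), and file path are placeholders:
# fetch an HDFS file over HTTP with WebHDFS; -L follows the redirect to a datanode
curl -L 'http://<namenode-host>:50070/webhdfs/v1/user/hdfs/file.csv?op=OPEN' -o file.csv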
Anyway, if you want to do it through Oozie, I think you can create a script in charge of moving the desired file from HDFS to the local file system, and then perform an scp in order to move the file from the local file system to the remote one.
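A minimal sketch of such a script (the file names, HDFS path, and remote host are assumptions):
# copy the file out of HDFS onto the local disk of the node running the action
hadoop fs -get /user/hdfs/file.csv /tmp/file.csv
# then push it to the remote server
scp /tmp/file.csv user@remote-host:/destination/path/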
I am trying to submit an example MapReduce Oozie job, and all the properties are configured properly with regard to the path, the NameNode, the JobTracker port, etc. I validated the workflow.xml too. When I deploy the job I get a job id, and when I check the status I see the status KILLED, and the details basically say that
/var/tmp/oozie/oozie-oozi7188507762062318929.dir/map-reduce-launcher.jar does not exist.
In order to resolve this error, just create the HDFS folders and give appropriate permissions to them.
http://kadirsert.blogspot.com.tr/2014/03/oozie-says-jar-does-not-exist.html
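Which directories are needed depends on your setup; purely as an illustration, creating an HDFS home directory for the submitting user might look like this (the user name and path are assumptions):
# create the submitting user's HDFS home directory and hand it over to that user (sketch)
hadoop fs -mkdir -p /user/myuser
hadoop fs -chown myuser:myuser /user/myuser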
The local file system (not HDFS) should have a '/var/tmp/oozie' directory.
If the directory doesn't exist, create it and restart the Oozie server. After that, a lot of files appear under /var/tmp/oozie, including the *-launcher.jar files.
'/var/tmp/oozie' is the value of the -Djava.io.tmpdir variable in the Oozie server start-up command line. You can check the value using 'ps -ef | grep oozie' on the machine where the Oozie server is running.
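A quick sketch of checking which temp directory the running Oozie server actually uses, and creating it if it is missing:
# extract the java.io.tmpdir setting from the running Oozie server's command line
ps -ef | grep oozie | grep -o 'java.io.tmpdir=[^ ]*'
# create the directory if it does not exist (path assumed from the output above)
mkdir -p /var/tmp/oozie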