Permanently store pig.jar on hdfs - hadoop

Whenever I run a pig script through the java PigServer interface, a fresh copy of pig-XXX.jar is copied to my remote hdfs cluster. Is there a way to avoid this and instead have the pig.jar permanently stored on my cluster?

Related

How to run HDFS Copy commands using Airflow?

How can I execute HDFS copy commands on a Dataproc cluster using Airflow?
After the cluster is created with Airflow, I have to copy a few jar files from Google Storage to a folder on the HDFS master node.
You can execute HDFS commands on a Dataproc cluster using something like this:
gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --region=europe-west1
The easiest way is via the Pig fs command [1]:
gcloud dataproc jobs submit pig --execute 'fs -ls /'
or otherwise via the sh command [2] as a catch-all for other shell commands.
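For instance, a sketch of copying a jar with the sh catch-all; the bucket, jar name, target directory, cluster name, and region below are placeholders:
gcloud dataproc jobs submit pig --execute 'sh hdfs dfs -cp gs://my-bucket/libs/my-lib.jar /jars/' --cluster=my-cluster --region=europe-west1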
For a single small file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
For a large file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consider [3] for details.
[1] https://pig.apache.org/docs/latest/cmds.html#fs
[2] https://pig.apache.org/docs/latest/cmds.html#sh
[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
I am not sure about your use case for doing this via Airflow, because if it is a one-time setup then you can run the commands directly on the Dataproc cluster. But I found some links which might be of some help. As I understand it, you can use a BashOperator to run such commands; see the sketch after these links.
https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/
Airflow Dataproc operator to run shell scripts
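For example, a minimal sketch of the kind of shell command a BashOperator could wrap, assuming a Dataproc master VM named my-cluster-m in zone europe-west1-b, with a placeholder bucket and target HDFS directory:
gcloud compute ssh my-cluster-m --zone=europe-west1-b --command='hdfs dfs -mkdir -p /jars && hdfs dfs -cp gs://my-bucket/libs/*.jar /jars/'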

How do you create an HDFS data directory?

Every time my Hadoop server reboots, I have to format the NameNode to start Hadoop. This removes all of the files in my Hadoop installation.
I need to move my HDFS storage location from /tmp to a permanent location so that I don't have to format the NameNode whenever the server reboots.
I am very new to Hadoop.
How do I create an HDFS data directory in another location?
How do I reference this data directory in the config file so that I don't have to format the NameNode?
These two properties in hdfs-site.xml determine where the NameNode and DataNodes store their data on the local filesystem; the defaults are under /tmp:
dfs.namenode.name.dir
dfs.datanode.data.dir
You typically have to format a NameNode only when the HDFS processes fail to terminate correctly (such as after a power failure or forced shutdown). Running a standby NameNode is encouraged to protect against these scenarios.
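As a sketch of using those two properties, an hdfs-site.xml pointing both directories at a persistent location could look like the following; the /data/hadoop paths are placeholders and must be directories that survive a reboot:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/datanode</value>
  </property>
</configuration>
After changing these, format the NameNode once more; from then on the metadata and blocks should survive reboots.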

Retrieve files from remote HDFS

My local machine does not have an HDFS installation. I want to retrieve files from a remote HDFS cluster. What's the best way to achieve this? Do I need to get the files from HDFS onto one of the cluster machines' local filesystem and then use ssh to retrieve them? I want to be able to do this programmatically, through, say, a bash script.
Here are the steps:
Make sure there is connectivity between your host and the target cluster.
Configure your host as a client: you need to install compatible Hadoop binaries, and your host needs to run the same operating system.
Make sure you have the same configuration files (core-site.xml, hdfs-site.xml).
You can then run the hadoop fs -get command to get the files directly, as shown below.
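For example, once the client is configured (the paths are placeholders; the -ls call just verifies connectivity):
hadoop fs -ls /data
hadoop fs -get /data/report.csv ./report.csv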
There are also alternatives:
If WebHDFS/HttpFS is configured, you can download files using curl or even your browser, and you can write bash scripts against WebHDFS.
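For instance, a sketch of a WebHDFS download with curl; the NameNode host, user, and path are placeholders, the WebHDFS HTTP port is typically 50070 on Hadoop 2.x and 9870 on 3.x, and -L follows the redirect to the DataNode:
curl -L "http://namenode-host:9870/webhdfs/v1/data/report.csv?op=OPEN&user.name=myuser" -o report.csv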
If your host cannot have Hadoop binaries installed to act as a client, then you can use the following approach:
enable passwordless login from your host to one of the nodes in the cluster
run ssh <user>@<host> "hadoop fs -get <hdfs_path> <os_path>"
then use scp to copy the files to your local machine
You can put the above 2 commands in one script.
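A minimal sketch of such a script, assuming passwordless ssh to a cluster node reachable as user@edge-node and placeholder paths:
#!/bin/bash
# pull the file out of HDFS onto the cluster node, then copy it down locally
ssh user@edge-node "hadoop fs -get /data/report.csv /tmp/report.csv"
scp user@edge-node:/tmp/report.csv ./report.csv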

How to copy files from a remote server to hdfs location

I want to copy files from a remote server to an HDFS location directly using sftp, without copying the files locally first. The HDFS location is on a secured cluster. Please suggest whether this is feasible and how to proceed in that case.
I would also like to know if there is any other way to connect and copy apart from sftp.
I think the most convenient way (given that your remote machine is able to connect to the hadoop cluster) is to make that remote machine act as an HDFS client. Just ssh to that machine, install the hadoop distribution, configure it properly, then run:
hadoop fs -put /local/path /hdfs/path
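If you want to avoid landing the files on local disk at all, one pattern (a sketch; hostnames and paths are placeholders, and on a Kerberos-secured cluster you would need a valid ticket, e.g. via kinit, before running it) is to stream the data over ssh straight into hdfs dfs -put -, which reads from stdin:
ssh user@remote-server "cat /exports/data/file.csv" | hdfs dfs -put - /landing/file.csv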

How do I back up HBase using distcp?

I would like to back up HBase files using distcp, then point HBase at the newly copied files and work with the stored tables.
I realize that there are tools out there which are recommended for this job. However, I'd like to know what I need to do after I've copied the files to get HBase to recognize them.
For example, I'd like to start the HBase shell and scan the stored tables from the newly copied files.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. So if you want to back up your clusterA to clusterB, you'll have to:
do the copy from clusterA to clusterB using distcp
start an HBase master and some RegionServers
enjoy the command line interface on clusterB
This means having 2 clusters, each with HDFS and HBase.
But if you want to back up your data within the same cluster, it is simpler:
do the intra-cluster copy into a different folder: hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
stop all the HBase processes and change the property hbase.rootdir from "hbase" to "backuptest"
restart all the processes
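As a rough sketch of that intra-cluster flow, reusing the same placeholder NameNode address and folder names:
hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
stop-hbase.sh
# edit hbase-site.xml so that hbase.rootdir points to hdfs://nn:8020/backuptest
start-hbase.sh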
