Download hadoop

hadoop file not downloading as zip file
I want to download Hadoop onto my system, but when I go to this URL: https://dlcdn.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz it downloads Hadoop as a .tar.gz archive, not as a zip file. I need a zip file so I can extract it to my C drive. When I compress it to a zip file it does convert, but I end up with the same structure again and none of the folders and files show up. I hope you get what I mean; I'm hoping for a working reply.

Related

How can I transfer a zip file directly from a URL to HDFS by using Java?

How can I transfer a zip file from a URL to HDFS by using Java? I am not supposed to download it; I can only transfer the file directly from the URL to HDFS. Does anyone have an idea that would work?
You can use simple shell commands over ssh, like:
wget http://domain/file.zip
and then
hadoop fs -put /path/file.zip
In Java, you should download the file and then put it into HDFS.
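For the Java part, here is a minimal sketch of that idea, opening the URL as a stream and copying it straight into an HDFS output stream (the URL, the namenode address, and the target path are placeholders, not values from the question):

import java.io.InputStream;
import java.net.URI;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch only: URL, namenode address, and HDFS path are placeholders.
public class UrlToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        try (InputStream in = new URL("http://domain/file.zip").openStream();
             FSDataOutputStream out = fs.create(new Path("/path/file.zip"))) {
            // Copy the HTTP stream into HDFS in 4 KB chunks; the streams are
            // closed by try-with-resources, so tell IOUtils not to close them.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}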

Mount HAR using HDFS-Fuse

Is it possible to mount a Hadoop Archive File when using hdfs-fuse-dfs?
I followed the notes on Cloudera for setting up hdfs-fuse-dfs and am able to mount HDFS. I can view HDFS as expected. However, on our HDFS we have .har files. Within hdfs-fuse-dfs I can see the .har files, but I am not able to access the files within them (aside from viewing the part-0, _index, etc. files).

Copy a file using WebHDFS

Is there a way to copy a file from (let's say) hdfs://old to hdfs://new without first downloading the file and then uploading it again?
Don't know about WebHDFS, but this is achievable using hadoop distcp.
The command looks something like this:
hadoop distcp hdfs://old_nn:8020/old/location/path.file hdfs://new_nn:8020/new/location/path.file
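If you need to do this from code rather than the command line, a minimal Java sketch using the FileSystem API is shown below (namenode addresses and paths are the same placeholders as in the distcp command above); note that, unlike distcp, this copy streams through the client process instead of running as a distributed job:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Sketch only: namenode addresses and paths are placeholders taken from the
// distcp example above. The bytes pass through this client JVM, so for large
// files or many files distcp is still the better tool.
public class HdfsToHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem srcFs = FileSystem.get(URI.create("hdfs://old_nn:8020"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://new_nn:8020"), conf);

        FileUtil.copy(srcFs, new Path("/old/location/path.file"),
                      dstFs, new Path("/new/location/path.file"),
                      false,   // keep the source file
                      conf);
    }
}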

How to deploy & run Samza job on HDFS?

I want to get a Samza job running on a remote system, with the Samza job being stored on HDFS. The example for running a Samza job on a local machine (https://samza.apache.org/startup/hello-samza/0.7.0/) involves building a tar file, then unzipping the tar file, then running a shell script that's located within the tar file.
The example for HDFS (https://samza.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html) is not really well documented at all. It says to copy the tar file to HDFS, then to follow the other steps in the non-HDFS example.
That would imply that the tar file that now resides on HDFS needs to be untarred within HDFS, and then a shell script run from that unpacked tar file. But you can't untar a tar file sitting on HDFS with the hadoop fs shell...
Without untarring the tar file, you don't have access to run-job.sh to initiate the Samza job.
Has anyone managed to get this to work please?
We deploy our Samza jobs this way: we have the Hadoop libraries in /opt/hadoop, the Samza shell scripts in /opt/samza/bin, and the Samza config files in /opt/samza/config. The config file contains this line:
yarn.package.path=hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz
When we want to deploy a new version of our Samza job, we just create the tgz archive, move it (without untarring) to /deploy/samza/ on HDFS, and run:
/opt/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///opt/samza/config/$CONFIG_NAME.properties
The only downside is that the config files inside the archive are ignored; if you change the config in the archive, it does not take effect. You have to change the config files in /opt/samza/config. On the other hand, we are able to change the config of our Samza job without deploying a new tgz archive. The shell scripts under /opt/samza/bin stay the same with every build, so you don't need to untar the archive package just to get at the shell scripts.
Good luck with Samzing! :-)
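For reference, a Samza job config of that era would look roughly like the sketch below. Only the yarn.package.path line comes from the answer above; the other property names are the standard ones from the Samza 0.7 docs, and their values are placeholders:

# /opt/samza/config/my-job.properties (sketch; values are placeholders)
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=my-samza-job

# Points YARN at the tgz archive sitting on HDFS, as described above.
yarn.package.path=hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz

# Task class and input streams are job-specific placeholders.
task.class=com.example.MyStreamTask
task.inputs=kafka.my-input-topic
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory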

MrJob spends a lot of time Copying local files into hdfs

The problem I'm encountering is this:
Having already put my input.txt file (50 MB) into HDFS, I'm running:
python ./test.py hdfs:///user/myself/input.txt -r hadoop --hadoop-bin /usr/bin/hadoop
It seems that MrJob spends a lot of time copying files to HDFS (again?):
Copying local files into hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/
Is this logical? Shouldn't it use input.txt directly from HDFS?
(Using Hadoop version 2.6.0)
Look at the contents of hdfs:///user/myself/tmp/mrjob/test.myself.20150927.104821.148929/files/ and you will see that input.txt isn't the file that's being copied into HDFS.
What's being copied is mrjob's entire python directory, so that it can be unpacked on each of your nodes. (mrjob assumes that mrjob is not installed on each of the nodes in your cluster.)
