HDInsight hadoop-mapreduce-examples.jar where is the Output? - hadoop

I ran the sample wordcount application in HDInsight. The command ran successfully, but I cannot find the output.
The command that I ran is
hadoop jar hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /user/joe/WordCountOutput
I am expecting something to be created on the file system, but I don't see /user/joe/ created.
Please advise.

HDInsight uses Azure blob storage as its HDFS store by default, so your output is in the storage account associated with the cluster. You can use something like CloudXplorer to easily browse your blob storage account and find this data. It will be in your default WASB container under /user/joe/WordCountOutput.
You could also run your command like this to have more control over your output location:
hadoop jar hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt wasb://<container>@<storageaccount>.blob.core.windows.net/user/joe/WordCountOutput
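You can also list and read the output directly from the cluster's head node with the usual hadoop fs commands, since the default file system resolves to the blob container. A minimal illustration (part-r-00000 is the conventional name of the first reducer's output file; adjust if your output differs):
hadoop fs -ls /user/joe/WordCountOutput
hadoop fs -cat /user/joe/WordCountOutput/part-r-00000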

It is not in your machine's filesystem, but in Azure blob storage. Typically, Hadoop MapReduce uses the Hadoop Distributed File System (HDFS), but as Thomas Jungblut correctly pointed out in his comment, Azure blob storage has completely replaced HDFS in HDInsight. Still, you should be able to access the output using the HDFS shell commands, like:
hadoop dfs -ls /user/joe/WordCountOutput
Perhaps HDInsight offers more ways to browse this filesystem (see Andrew Moll's answer), but I am not familiar with them, and this is actually quite easy already.

Related

No passwd entry for user 'hdfs'

I am trying to set up a Hive environment on my Google Compute Engine Hadoop cluster, which was deployed from the one-click deployment.
When I try to switch to the hdfs user (su hdfs), I get the error message below.
No passwd entry for user 'hdfs'
The "one-click deployment" is an older sample which perhaps showcases installation from shell scripts and tarballs, but isn't intended for use as a supported Hadoop service, and doesn't set up typical Hadoop installation configurations like an hdfs user or adding commands to /usr/bin.
If you want a more Hadoop (and Pig+Hive+Spark) specialized service, you may want to consider using Google Cloud Dataproc, which is Google's managed Hadoop solution. You can create clusters from the cloud console UI in Dataproc just like click-to-deploy, and you'll get a more fully installed Hadoop/Hive environment, including a per-cluster persistent MySQL-based Hive metastore which is shared with SparkSQL to make it easy to play with Spark without modifying your Hive environment if you so choose.
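For what it's worth, a Dataproc cluster can also be created from the command line; a minimal sketch, where my-cluster is a placeholder name and the defaults of your gcloud version may require extra flags such as a region:
gcloud dataproc clusters create my-cluster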

Hadoop distcp command using a different S3 destination

I am using a Eucalyptus private cloud on which I have set up a CDH5 HDFS cluster. I would like to back up my HDFS to the Eucalyptus S3. The classic way is to use distcp as suggested here: http://wiki.apache.org/hadoop/AmazonS3 , i.e.
hadoop distcp hdfs://namenode:9000/user/foo/data/fil1 s3://$AWS_ACCESS_KEY:$AWS_SECRET_KEY@bucket/key
but this doesn't work.
It seems that Hadoop is pre-configured with Amazon's S3 endpoint, and I cannot find where this configuration lives in order to change it to the IP address of my S3 service running on Eucalyptus. I would expect to be able to just change the S3 URI in the same way you can change your NameNode URI when using an hdfs:// prefix, but it seems this is not possible... Any insights?
I have already found workarounds for transferring my data. In particular, the s3cmd tools here: https://github.com/eucalyptus/eucalyptus/wiki/HowTo-use-s3cmd-with-Eucalyptus and the s3curl scripts here: aws.amazon.com/developertools/Amazon-S3/2880343845151917 work just fine, but I would prefer to transfer my data using MapReduce with the distcp command.
It looks like Hadoop uses the jets3t library for S3 access. You might be able to use the configuration described in this blog to access Eucalyptus, but note that from version 4 onwards the path is "/services/objectstorage" rather than "/services/Walrus".
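For reference, jets3t reads its settings from a jets3t.properties file on the Hadoop classpath. A rough sketch of what such a file might contain for a private endpoint follows; the property names are standard jets3t settings, but the host, port, and path values are placeholders you would replace with your Eucalyptus details:
s3service.s3-endpoint=<your-eucalyptus-host>
s3service.s3-endpoint-http-port=8773
s3service.https-only=false
s3service.disable-dns-buckets=true
s3service.s3-endpoint-virtual-path=/services/objectstorage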

Define an HDFS file on Elastic MapReduce on Amazon Web Services

I'm starting an implementation of the KMeans algorithm on the Hadoop MapReduce framework, using the Elastic MapReduce service offered by Amazon Web Services. I want to create an HDFS file to store the initial cluster coordinates and the final results of the reducers. I'm totally confused here. Is there any way to create or "upload" this file into HDFS so that it can be seen by all the mappers?
Any clarification in this regard?
Thanks.
In the end I figured out how to do it.
To upload the file into the cluster's HDFS, you have to connect to your cluster via PuTTY (using the security key).
Then run this command:
hadoop distcp s3://bucket_name/data/fileNameinS3Bucket HDFSfileName
where
fileNameinS3Bucket is the name of the file in the S3 bucket, and
HDFSfileName is the name you want the file to have once it is uploaded.
To check that the file has been uploaded:
hadoop fs -ls

Access HDFS from outside Hadoop

I want to run some executables outside of hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node? Or is there a way to access HDFS from outside of Hadoop?
Any other suggestions on how to do this are fine. Unfortunately, my executables cannot be run within Hadoop, though.
Thanks!
There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open. This will give you a stream that acts like a generic open file (a minimal sketch follows after these two options).
You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically create a bridge to this command-line invocation with something like popen.
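A minimal sketch of the Java API option above, assuming a reachable namenode; the URI and file path are placeholders, and in practice the configuration is usually picked up from core-site.xml on the classpath:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; normally provided by core-site.xml instead.
        conf.set("fs.default.name", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // open() returns an FSDataInputStream, which behaves like a normal InputStream.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/path/to/file/part-r-00000"))));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}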
Also check out WebHDFS, which made it into the 1.0.0 release and will be in the 0.23.1 release as well. Since it is based on a REST API, any language can access it, and Hadoop need not be installed on the node from which the HDFS files are accessed. Also, it is about as fast as the other options mentioned by orangeoctopus.
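As an illustration, a WebHDFS read can be done with a plain HTTP client; in this sketch the namenode host is a placeholder, 50070 is the usual namenode HTTP port of that era, and -L follows the redirect to the datanode that actually serves the data:
curl -L "http://<namenode>:50070/webhdfs/v1/path/to/file/part-r-00000?op=OPEN"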
The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem and can act as an HDFS proxy.
I had a similar issue and asked the appropriate question. I needed to access HDFS / MapReduce services from outside the cluster. After I found a solution I posted the answer here for HDFS. The most painful issue turned out to be user authentication, which in my case was solved in the simplest way (the complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on clients, here is a nice Cloudera article on how to configure Maven to build a JAR for this. 100% success in my case.
The main difference between submitting a remote MapReduce job and plain HDFS access is a single configuration setting (check the mapred.job.tracker variable).
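As a hedged illustration of that point, using Hadoop 1.x-era configuration keys (the host names and ports below are placeholders for your cluster), the client-side setup is roughly:
Configuration conf = new Configuration();
// Enough for plain HDFS access from outside the cluster.
conf.set("fs.default.name", "hdfs://namenode:8020");
// The one extra setting needed to submit MapReduce jobs remotely.
conf.set("mapred.job.tracker", "jobtracker:8021");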

Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and that there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know what you are looking for *g*
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
FileSystem.get(new JobConf()).create(new Path("however.file"));
This returns a stream that you can handle with regular Java I/O.
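A slightly fuller sketch of writing a file through that API; the namenode URI and output path are placeholders, and on a properly configured client the Configuration would normally be populated from core-site.xml rather than set by hand:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; usually supplied by core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // create() returns an FSDataOutputStream, a regular java.io.OutputStream.
        FSDataOutputStream out = fs.create(new Path("/user/joe/however.file"));
        out.write("hello from an external client\n".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}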
For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java map/reduce job in which the mapper reads each file from the file server (in this case via HTTPS) and writes it directly to HDFS (via the Java API).
The list of files is read from the job input. An external script populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), and then starts the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced on Cloudera's blog and can be downloaded from a GitHub repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1, and it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1, which at the time of writing is still unreleased, Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can act as a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS, which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0, and one of its major features is that it provides data locality: when you make a read request, you are redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what you need. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the GitHub repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that, unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.).
Install Cygwin, install Hadoop locally (you just need the binary and the configs that point at your NN -- no need to actually run the services), then run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of hdfs using standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard Posix libraries like open, write, read, close from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects:
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource
mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.
I haven't tried any of these, but I will update the answer soon, as I have the same need as the OP.
You can now also try to use Talend, which includes components for Hadoop integration.
You can try mounting HDFS on the machine (call it machine_X) where you are executing your code; machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.
