Get hadoop configuration in Java util - hadoop

I'm writing a Java utility that needs to access the DFS, so I need a Configuration object.
When I create one simply by using
Configuration conf = new Configuration()
it doesn't seem to find the DFS, and just uses the local file system; printing
fs.getHomeDirectory()
gives my local home directory. I've tried adding
core-site.xml, mapred-site.xml, yarn-site.xml, and hdfs-site.xml to the Configuration as resources, but it doesn't change anything. What do I need to do to get it to pick up the HDFS settings?
Thanks for reading

The reason it's pointing to your local file system is that core-site.xml and hdfs-site.xml are not being added properly. The code snippet below should help you.
Configuration conf = new Configuration();
conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml")); // Replace with actual path
conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml")); // Replace with actual path
Path pt = new Path("."); // HDFS Path
FileSystem fs = pt.getFileSystem(conf);
System.out.println("Home directory :"+fs.getHomeDirectory());
Update:
The option above should have worked, so there may be an issue with the configuration files or their paths. As an alternative to adding the configuration files with the addResource method, you can use the set method: open your core-site.xml file, find the value of fs.defaultFS, and set it explicitly.
conf.set("fs.defaultFS","hdfs://<Namenode-Host>:<Port>"); // Refer you core-site.xml file and replace <Namenode-Host> and <Port> with your cluster namenode and Port (default port number should be `8020`).

To get access to the file system you have to use a Configuration and a FileSystem, as outlined below:
Get an instance of Configuration
Get the HDFS instance
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(new URI("hdfs://"+HadoopLocation+":8020"), configuration);
In this case HadoopLocation is the host on which your Hadoop NameNode is running (possibly localhost).
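Once you have the FileSystem handle, you can read a file through it; for example (a sketch continuing from the hdfs variable above, with a made-up path, using org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FSDataInputStream and org.apache.hadoop.io.IOUtils):
Path file = new Path("/user/hduser/in/sample.txt"); // hypothetical HDFS path, replace with a real file
FSDataInputStream in = hdfs.open(file);
IOUtils.copyBytes(in, System.out, 4096, true); // prints the file contents and closes the stream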

Related

Write file to HDFS from (non-cluster) local machine

I am trying to find some tips to solve my problem, but I really have no idea how to do it.
I have a Hadoop HDFS cluster and I need to write data from my local machine to HDFS using Java. Currently, I've solved it this way:
Create file on my local machine
Create SSH connection to cluster where Hadoop is installed
Copy file to the cluster (local OS)
Copy file from the cluster's local OS to HDFS (hadoop fs -put)
It works fine but now I need to copy files without SSH. I mean, I need to do in this way:
Create file on my local machine (not in the cluster)
Copy file directly to HDFS
All the examples I could find show how to copy files from the cluster's local OS to HDFS. Has anyone solved a problem like this?
I wrote code like this:
System.setProperty("java.security.auth.login.config", "jaas.conf");
System.setProperty("java.security.krb5.conf", "krb5.conf");
UsernamePasswordHandler passHandler = new UsernamePasswordHandler("user", "pass");
LoginContext loginContextHadoop = new LoginContext("Client", passHandler);
loginContextHadoop.login();
Configuration configuration = new Configuration();
configuration.set("hadoop.security.authentication", "Kerberos");
org.apache.hadoop.security.UserGroupInformation.setConfiguration(configuration);
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(loginContextHadoop.getSubject());
FileSystem hdfs = FileSystem.get(new URI("hdfs://URI:50020"), configuration);
System.out.println(hdfs.getUsed());
And I'm getting an error like this:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchProtocolException): Unknown protocol: org.apache.hadoop.hdfs.protocol.ClientProtocol
I think I'm using the wrong port; I got it from dfs.datanode.ipc.address. Does anybody have any ideas?
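For reference, 50020 is the DataNode IPC port (dfs.datanode.ipc.address), while ClientProtocol is served by the NameNode, whose RPC address is the fs.defaultFS value in core-site.xml (commonly port 8020), so pointing the URI at the NameNode is likely the fix. A minimal sketch of a direct local-to-HDFS copy (Kerberos login omitted for brevity; host, port, and paths below are placeholders):
Configuration conf = new Configuration();
// Point at the NameNode RPC address from core-site.xml, not the DataNode IPC port
FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode.example.com:8020"), conf);
hdfs.copyFromLocalFile(new Path("/local/path/afile.txt"),   // file on the client machine
                       new Path("/user/hduser/afile.txt")); // destination in HDFS
hdfs.close();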

Intellij Accessing file from hadoop cluster

As part of my IntelliJ environment setup, I need to connect to a remote Hadoop cluster and access its files from my local Spark code.
Is there any way to connect to a remote Hadoop environment without creating a local Hadoop instance?
A connection code snippet would be the ideal answer.
If you have a keytab file to authenticate to the cluster, this is one way I've done it:
val conf: Configuration = new Configuration()
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("user-name", "path/to/keytab/on/local/machine")
FileSystem.get(conf)
I believe that to do this you might also need some configuration XML files, namely core-site.xml, hdfs-site.xml, and mapred-site.xml. These are usually somewhere under /etc/hadoop/conf/.
You would put those in a directory in your project and mark it as a Resources directory in IntelliJ.
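In Java, the same idea looks roughly like this (a sketch, assuming the XML files are on the classpath via that Resources directory, and that the principal and keytab path are placeholders):
Configuration conf = new Configuration();
// Picked up from the classpath, e.g. the IntelliJ Resources directory mentioned above
conf.addResource("core-site.xml");
conf.addResource("hdfs-site.xml");
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab");
FileSystem fs = FileSystem.get(conf);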

Hadoop copying file to hadoop filesystem

I have copied a file from the local file system to HDFS, and it ended up in /user/hduser/in:
hduser#vagrant:/usr/local/hadoop/hadoop-1.2.1$ bin/hadoop fs -copyFromLocal /home/hduser/afile in
Question:
1. How does Hadoop decide, by default, to copy the file to this directory (/user/hduser/in)? Where is this mapping specified in the conf files?
If you write the command as above, the file gets copied to your user's HDFS home directory, which is /user/username. See also here: HDFS Home Directory.
You can use an absolute pathname (one starting with "/") just like in a Linux filesystem, if you want to write the file to a different location.
Are you using a default VM? Basically, if you configure Hadoop from the binaries without using a preconfigured yum package, it doesn't have a default path. But if you install via yum from Hortonworks or use a Cloudera VM, it comes with a default path, I guess.
Check core-site.xml (fs.defaultFS) to see the default filesystem path. "/" will point to the base URL set in that XML, and any folder given in the command without a full path will be appended to it.
Hadoop picks up the default path defined in the configuration and writes the data there.
[Image: how writes work in HDFS]

Hadoop cluster uses HDFS as default fs, client wants to use a file on the local file system

My Hadoop cluster is using HDFS as the default fs, but on the client side the input file is located on the local file system (for some reason, I don't want to move it into HDFS). If I give the MapReduce job a URI like 'file:///opt/myDoc.txt', I get a file-does-not-exist error. How can I access the local file system in this case?
This is where you give the details of the input path:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
Instead of args[0], give the local path.
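In other words, pass the file:// URI explicitly; using the path from the question, something like:
FileInputFormat.setInputPaths(conf, new Path("file:///opt/myDoc.txt"));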

Run a Local file system directory as input of a Mapper in cluster

I gave an input to the mapper from the local filesystem. It runs successfully from Eclipse, but not on the cluster, where it cannot find the local input path, saying: input path does not exist. Can anybody please tell me how to give a local file path to a mapper so that it can run on the cluster and I can get the output in HDFS?
This is a very old question, but I recently faced the same issue.
I am not sure how correct this solution is, but it worked for me; please point out any drawbacks. Here's what I did.
Reading a solution in the mail archives, I realised that if I modify fs.default.name from hdfs://localhost:8020/ to file:///, it can access the local file system. However, I didn't want this for all my MapReduce jobs, so I made a copy of core-site.xml in a local folder (the same one from which I submit my MR jar via hadoop jar).
and in my Driver class for MR I added,
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR job takes its input from the local system and writes the output to HDFS.
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to run your job again, giving it the path of the data in HDFS.
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What happens here is that the mappers read records directly from S3.
If this can be done with S3, why wouldn't Hadoop work similarly with the local filesystem, using file:///input file:///output instead of the s3n paths?
But empirically, this seems to fail in an interesting way: Hadoop gives a file-not-found exception for a file that is indeed in the input directory. That is, it seems able to list the files in the input directory on my local disk, but when it comes time to open them to read the records, the file is not found (or not accessible).
The data must be on HDFS for any MapReduce job to process it. So even if your source is the local file system, a network path, or a web-based store (such as Azure Blob Storage or Amazon block storage), you need to copy the data to HDFS first and then run the job.
The bottom line is that you have to push the data to HDFS first, and there are several ways to do the transfer depending on the source. From the local file system, for example, you would use the following command:
$ hadoop fs -copyFromLocal <SourceFileOrStoragePath> <HDFS-destination-path>
Try setting the input path like this:
FileInputFormat.addInputPath(conf, new Path("file:///the/directory/on/your/local/filesystem"));
If you give the file:// prefix, it can access files from the local system.
I have tried the following code and it solved the problem...
Please try it and let me know.
You need to get a FileSystem object for the local file system and then use the makeQualified method to return the path. Since we need to pass a local-filesystem path (there's no other way to pass this to the InputFormat), I've used makeQualified, which indeed returns only the local file system path.
The code is shown below:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late. It worked fine for me, and I believe it does not need any configuration changes.
You might want to try this by setting the configuration as follows:
Configuration conf=new Configuration();
conf.set("job.mapreduce.tracker","local");
conf.set("fs.default.name","file:///");
After this you can set the FileInputFormat with the local path and you're good to go.
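Putting it together, a rough sketch of that kind of local-only driver (old mapred API; the class name and paths are hypothetical):
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "local"); // run the job in-process
conf.set("fs.default.name", "file:///"); // use the local filesystem as the default FS
JobConf jobConf = new JobConf(conf, MyLocalDriver.class);
FileInputFormat.setInputPaths(jobConf, new Path("/path/to/local/input"));
FileOutputFormat.setOutputPath(jobConf, new Path("/path/to/local/output"));
JobClient.runJob(jobConf);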
