Copying files from HDFS to local file system with JAVA - hadoop

I am trying to copy files from HDFS to the local filesystem for preprocessing. The code below should work according to the documentation. Although it doesn't give any error messages and the MapReduce job runs smoothly, I cannot see any output on my local hard drive. What do you think the problem is? Thanks.
try {
    Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/" + value.toString());
    Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    fs.copyToLocalFile(phdfs_input, plocal_input);
    /* String localoutput_file = "/home/hduser/Desktop/output/" + value.toString();
    String cmd1[] = {"mafia", "-mfi", ".5", "-ascii", "~/Desktop/" + value.toString(), localoutput_file };
    File mafia_dir = new File("/home/hduser/");
    ShellCommandExecutor s = new ShellCommandExecutor(cmd1, mafia_dir); */
} catch (Exception e) {
    e.printStackTrace();
}

Try using /user/hduser/conninput/"+value.toString() in the Path constructor instead of providing the master:54310 part. It should figure out master:54310 from the Configuration.
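As a minimal sketch of that suggestion (assuming fs.default.name in the job's Configuration already points at hdfs://master:54310, and that this runs inside the mapper where value and context exist):
FileSystem fs = FileSystem.get(context.getConfiguration());
// Let the Configuration supply the namenode address instead of hard-coding it in the Path.
Path hdfsInput = new Path("/user/hduser/conninput/" + value.toString());
Path localOutput = new Path("/home/hduser/Desktop/" + value.toString());
// copyToLocalFile(delSrc, src, dst): false keeps the source file in HDFS
fs.copyToLocalFile(false, hdfsInput, localOutput);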

Related

Append to file in HDFS (CDH 5.4.5)

Brand new to HDFS here.
I've got this small section of code to test out appending to a file:
val path: Path = new Path("/tmp", "myFile")
val config = new Configuration()
val fileSystem: FileSystem = FileSystem.get(config)
val outputStream = fileSystem.append(path)
outputStream.writeChars("what's up")
outputStream.close()
It is failing with this message:
Not supported
java.io.IOException: Not supported
at org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:352)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1163)
I looked at the source for ChecksumFileSystem.java, and it seems to be hardcoded to not support appending:
@Override
public FSDataOutputStream append(Path f, int bufferSize,
    Progressable progress) throws IOException {
  throw new IOException("Not supported");
}
How can I make this work? Is there some way to change the default filesystem to another implementation that does support append?
It turned out that I needed to actually run a real Hadoop NameNode and DataNode. I am new to Hadoop and did not realize this. Without them, FileSystem.get falls back to your local filesystem, which is a ChecksumFileSystem and does not support append. So I followed the blog post here to get a cluster up and running on my machine, and now I am able to append.
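One quick way to confirm which FileSystem implementation you are actually getting (a minimal check, shown here in Java; the same two calls work from Scala):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Prints e.g. "file:/// -> ...LocalFileSystem" without a running cluster
        // (append() will then throw "Not supported"), or
        // "hdfs://<namenode>:<port> -> ...DistributedFileSystem" against real HDFS.
        System.out.println(fs.getUri() + " -> " + fs.getClass().getName());
    }
}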
The data you append has to be written to the output stream returned by fs.append(path), not to the FileSystem itself; FileSystem.get() is only used to connect to your HDFS. First, set dfs.support.append to true in hdfs-site.xml:
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
stop all your daemon services using stop-all.sh and restart them using start-all.sh. Then put this in your main method:
String fileuri = "hdfs/file/path";
Configuration conf = new Configuration();
// Connect to HDFS and append to the existing file through the returned stream.
FileSystem fs = FileSystem.get(URI.create(fileuri), conf);
FSDataOutputStream out = fs.append(new Path(fileuri));
PrintWriter writer = new PrintWriter(out);
writer.append("I am appending this to my file");
writer.close();
fs.close();

Reading in a parameter file in Amazon Elastic MapReduce and S3

I am trying to run my Hadoop program on Amazon Elastic MapReduce. My program takes an input file from the local filesystem which contains the parameters it needs to run. However, since the file is normally read from the local filesystem with FileInputStream, the task fails in the AWS environment with an error saying that the parameter file was not found. Note that I have already uploaded the file to Amazon S3. How can I fix this problem? Thanks. Below is the code I use to read the parameter file and then read the parameters it contains.
FileInputStream fstream = new FileInputStream(path);
DataInputStream datain = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(datain));
String[] args = new String[7];
int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    args[i++] = strLine;
}
If you must read the file from the local filesystem, you can configure your EMR job to run with a bootstrap action. In that action, simply copy the file from S3 to a local file using s3cmd or similar.
You could also go through the Hadoop FileSystem class to read the file, as I'm pretty sure EMR supports direct access like this. For example:
FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));
I have not tried Amazon Elastic MapReduce yet, but this looks like a classic application of the distributed cache. You add a file to the cache using the -files option (if you implement Tool/ToolRunner) or the job.addCacheFile(URI uri) method, and then access it as if it existed locally; a sketch of the newer API is shown below.
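A rough sketch of that newer job.addCacheFile route, assuming Hadoop 2's automatic symlinking; the bucket name, file name, and class names here are illustrative:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamCacheExample {

    // Driver side: register the S3 file with the distributed cache.
    // The "#params.txt" fragment makes Hadoop symlink the localized copy
    // under that name in each task's working directory.
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "param-cache-example");
        job.addCacheFile(new URI("s3://my-bucket/params.txt#params.txt"));
        // ... set mapper class, input/output paths, etc. ...
        return job;
    }

    // Mapper side: read the localized copy as a plain local file in setup().
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            try (BufferedReader br = new BufferedReader(new FileReader("params.txt"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // parse one parameter per line here
                }
            }
        }
    }
}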
You can add this file to the distributed cache as follows:
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;
@Override
public void configure(JobConf job) {
    s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
    FileInputStream fstream = new FileInputStream(s3FilePath.toString());
    ...
}

Unable to load OpenNLP sentence model in Hadoop map-reduce job

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:
public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;
    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);
    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }
    logger.info("number of sentences: " + sentences.length);
    <snip>
}
When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:
I've verified that the model file exists in the location sentenceModelPath refers to.
I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
Hadoop is just running on my local machine.
Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?
Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This will try to load a file from the classpath - neither a file in HDFS nor one on the client's local filesystem is part of the classpath at mapper / reducer runtime, which is why you're seeing the null error (getResourceAsStream() returns null if the resource cannot be found).
To get around this you have a number of options:
Amend your code to load the file from HDFS:
modelIn = FileSystem.get(context.getConfiguration()).open(
new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies the file from the local filesystem to HDFS, and back down to the local directory of the running mapper / reducer); a ToolRunner sketch for wiring this up appears after the third option:
modelIn = new FileInputStream("en-sent.bin");
Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:
modelIn = getClass().getResourceAsStream("/en-sent.bin");
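For the -files route in the second option, here is a rough sketch of a driver that implements Tool so GenericOptionsParser handles -files for you (the class name, job name, and launch command are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Launched e.g. with:
//   hadoop jar myjob.jar MyDriver -files /local/path/en-sent.bin <in> <out>
// The -files copy then appears as "en-sent.bin" in each task's working directory.
public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "sentence-split");
        // ... set mapper class, input/output paths from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}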

How to Read file from Hadoop using Java without command line

I wanted to read a file from HDFS, and I could do that using the code below:
String uri = theFilename;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
    in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
    IOUtils.closeStream(in);
}
To run this I have to use hadoop jar myjar.jar com.mycompany.cloud.CatFile /filepathin_hadoop
That works, but how can I do the same from another program, i.e. without using the hadoop jar command?
You can add your core-site.xml to that Configuration object so it knows the URI for your HDFS instance. This method requires HADOOP_HOME to be set.
Configuration conf = new Configuration();
Path coreSitePath = new Path(System.getenv("HADOOP_HOME"), "conf/core-site.xml");
conf.addResource(coreSitePath);
FileSystem hdfs = FileSystem.get(conf);
// rest of code the same
Now, without using hadoop jar you can open a connection to your HDFS instance.
Edit: You have to use conf.addResource(Path). If you use a String argument, it looks in the classpath for that filename.
There is another Configuration method, set(parameterName, value). If you use it, you don't have to specify the location of core-site.xml. This is useful for accessing HDFS from a remote location such as a web server.
Usage is as follows:
String uri = theFilename;
Configuration conf = new Configuration();
conf.set("fs.default.name","hdfs://10.132.100.211:8020/");
FileSystem fs = FileSystem.get(conf);
// Rest of the code

How to design my mapper?

I have to write a MapReduce job but I don't know how to go about it.
I have a jar, MARD.jar, through which I can instantiate MARD objects.
Using it, I call the mard.normaliseFile method on them, i.e. mard.normaliseFile(bunch of arguments).
This in turn creates a certain output file.
For the normalise method to run, it needs a folder called myMard in the working directory.
So I thought I would give the myMard folder as the input path to the Hadoop job, but I'm not sure that would help, because mard.normaliseFile(bunch of arguments) will search for the myMard folder in the working directory and will not find it, since (this is what I think) the Mapper can only access the contents of files through the "values" obtained from the FileSplit; it cannot give direct access to the files in the myMard folder.
In short, I have to execute the following code through MapReduce:
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
MARD mard = new MARD(setupFolder);
Text valuz = new Text();
IntWritable intval = new IntWritable();
File original = new File("Vca1652.txt");
File mardedxml = new File("Vca1652-mardedxml.txt");
File marded = new File("Vca1652-marded.txt");
mardedxml.createNewFile();
marded.createNewFile();
NormalisationStats stats;
try {
    stats = mard.normaliseFile(original, mardedxml, marded, 50.0);
    // This method requires access to the myMard folder
    System.out.println(stats);
} catch (MARDException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Please help