FileNotFoundException when using DistributedCache to access MapFile

FileNotFoundException when using DistributedCache to access MapFile - hadoop

I am using hadoop cdf4.7 run in yarn mode. There is a MapFile in hdfs://test1:9100/user/tagdict_builder_output/part-00000
and it has two file index and data
I used the following code to add it to distributedCache:
Configuration conf = new Configuration();
Path tagDictFilePath = new Path("hdfs://test1:9100/user/tagdict_builder_output/part-00000");
DistributedCache.addCacheFile(tagDictFilePath.toUri(), conf);
Job job = new Job(conf);
And initialize a MapFile.Reader at setup of Mapper：
#Override
protected void setup(Context context) throws IOException, InterruptedException {
Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
if (localFiles != null && localFiles.length > 0 && localFiles[0] != null) {
String mapFileDir = localFiles[0].toString();
LOG.info("mapFileDir " + mapFileDir);
FileSystem fs = FileSystem.get(context.getConfiguration());
reader = new MapFile.Reader(fs, mapFileDir, context.getConfiguration());
}
else {
throw new IOException("Could not read lexicon file in DistributedCache");
}
}
But it throws FileNotFoundException:
Error: java.io.FileNotFoundException: File does not exist: /home/mps/cdh/local/usercache/mps/appcache/application_1405497023620_0045/container_1405497023620_0045_01_000012/part-00000/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1704)
at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:452)
at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:426)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:396)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:405)
at aps.Cdh4MD5TaglistPreprocessor$Vectorizer.setup(Cdh4MD5TaglistPreprocessor.java:61)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
I've also tried /user/tagdict_builder_output/part-00000 as path,or use a symbol link. But these do not work either.How to solve this?Many thanks.

As it says here:
Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks.
So you should try to access your files through the File object:
File f = new File("./part-00000");
EDIT1
My last suggestion:
DistributedCache.addCacheFile(new URI(tagDictFilePath.toString() + "#cache-file"), conf);
DistributedCache.createSymlink(conf);
...
// in mapper
File f = new File("cache-file");

Related

Can't connect to remote hdfs using spring hadoop

I'm just starting in Hadoop and I'm trying to read from remote hdfs (remote hdfs in a docker container, accessible from localhost:32783) trough Spring Hadoop but I get the following error:
org.springframework.data.hadoop.HadoopException:
Cannot list resources Failed on local exception:
java.io.EOFException; Host Details : local host is: "user/127.0.1.1";
destination host is: "localhost":32783;
I'm triying to read the file using the following code:
HdfsClient hdfsClient = new HdfsClient();
Configuration conf = new Configuration();
conf.set("fs.defaultFS","hdfs://localhost:32783");
FileSystem fs = FileSystem.get(conf);
SimplerFileSystem sFs = new SimplerFileSystem(fs);
hdfsClient.setsFs(sFs);
String filePath = "/tmp/tmpTestReadTest.txt";
String output = hdfsClient.readFile(filePath);
What hdfsClient.readFile(filePath) does is the following:
public class HdfsClient {
private SimplerFileSystem sFs;
public String readFile(String filePath) throws IOException {
FSDataInputStream inputStream = this.sFs.open(filePath);
output = getStringFromInputStream(inputStream.getWrappedStream());
inputStream.close();
}
return output;
}
Any guess why I can't read from the remote hdfs? Removing the conf.set("fs.defaultFS","hdfs://localhost:32783"); I can read, but just from local filepath.
I understand that the "hdfs://localhost:32783" is correct because changing it by a random uri gives Connection refused error
There migh be something wrong into my hadoop configuration?
Thank you!

Append to file in HDFS (CDH 5.4.5)

Brand new to HDFS here.
I've got this small section of code to test out appending to a file:
val path: Path = new Path("/tmp", "myFile")
val config = new Configuration()
val fileSystem: FileSystem = FileSystem.get(config)
val outputStream = fileSystem.append(path)
outputStream.writeChars("what's up")
outputStream.close()
It is failing with this message:
Not supported
java.io.IOException: Not supported
at org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:352)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1163)
I looked at the source for ChecksumFileSystem.java, and it seems to be hardcoded to not support appending:
#Override
public FSDataOutputStream append(Path f, int bufferSize,
Progressable progress) throws IOException {
throw new IOException("Not supported");
}
How to make this work? Is there some way to change the default file system to some other implementation that does support append?

It turned out that I needed to actually run a real hadoop namenode and datanode. I am new to hadoop and did not realize this. Without this, it will use your local filesystem which is a ChecksumFileSystem, which does not support append. So I followed the blog post here to get it up and running on my system, and now I am able to append.

The append method has to be called on outputstream not on filesystem. filesystem.get() is just used to connect to your HDFS. First set dfs.support.append as true in hdfs-site.xml
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
stop all your demon services using stop-all.sh and restart it again using start-all.sh. Put this in your main method.
String fileuri = "hdfs/file/path"
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(fileuri),conf);
FSDataOutputStream out = fs.append(new Path(fileuri));
PrintWriter writer = new PrintWriter(out);
writer.append("I am appending this to my file");
writer.close();
fs.close();

FileNotFoundException on hadoop

Inside my map function, I am trying to read a file from the distributedcache, load its contents into a hash map.
The sys output log of the MapReduce job prints the content of the hashmap. This shows that it has found the file, has loaded into the data structure and performed the needed operation. It iterates through the list and prints its contents. Thus proving that the operation was successful.
However, I still get the below error after a few minutes of running the MR job:
13/01/27 18:44:21 INFO mapred.JobClient: Task Id : attempt_201301271841_0001_m_000001_2, Status : FAILED
java.io.FileNotFoundException: File does not exist: /app/hadoop/jobs/nw_single_pred_in/predict
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.(DFSClient.java:1834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here's the portion which initializes Path with the location of the file to be placed in the distributed cache
// inside main, surrounded by try catch block, yet no exception thrown here
Configuration conf = new Configuration();
// rest of the stuff that relates to conf
Path knowledgefilepath = new Path(args[3]); // args[3] = /app/hadoop/jobs/nw_single_pred_in/predict/knowledge.txt
DistributedCache.addCacheFile(knowledgefilepath.toUri(), conf);
job.setJarByClass(NBprediction.class);
// rest of job settings
job.waitForCompletion(true); // kick off load
This one is inside the map function:
try {
System.out.println("Inside try !!");
Path files[]= DistributedCache.getLocalCacheFiles(context.getConfiguration());
Path cfile = new Path(files[0].toString()); // only one file
System.out.println("File path : "+cfile.toString());
CSVReader reader = new CSVReader(new FileReader(cfile.toString()),'\t');
while ((nline=reader.readNext())!=null)
data.put(nline[0],Double.parseDouble(nline[1])); // load into a hashmap
}
catch (Exception e)
{// handle exception }
Help appreciated.
Cheers !

Did a fresh installation of hadoop and ran the job with the same jar, the problem disappeared. Seems to be a bug rather than programming errors.

Reading in a parameter file in Amazon Elastic MapReduce and S3

I am trying to run my hadoop program in Amazon Elastic MapReduce system. My program takes an input file from the local filesystem which contains parameters needed for the program to run. However, since the file is normally read from the local filesystem with FileInputStream the task fails when executed in AWS environment with an error saying that the parameter file was not found. Note that, I already uploaded the file into Amazon S3. How can I fix this problem? Thanks. Below is the code that I use to read the paremeter file and consequently read the parameters in the file.
FileInputStream fstream = new FileInputStream(path);
FileInputStream os = new FileInputStream(fstream);
DataInputStream datain = new DataInputStream(os);
BufferedReader br = new BufferedReader(new InputStreamReader(datain));
String[] args = new String[7];
int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
args[i++] = strLine;
}

If you must read the file from the local file system, you can configure your EMR job to run with a boostrap action. In that action, simply copy the file from S3 to a local file using s3cmd or similar.
You could also go through the Hadoop FileSystem class to read the file, as I'm pretty sure EMR supports direct access like this. For example:
FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));

I did not try Amazon Elastic yet, however it looks like a classical application of distributed cache. You add file do cache using -files option (if you implement Tool/ToolRunner) or job.addCacheFile(URI uri) method, and access it as if it existed locally.

You can add this file to the distributed cache as follows :
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;
#Override
public void configure(JobConf job) {
s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
FileInputStream fstream = new FileInputStream(s3FilePath.toString());
...
}

Files not put correctly into distributed cache

I am adding a file to distributed cache using the following code:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
Then I read the file into the mappers:
protected void setup(Context context)throws IOException,InterruptedException{
Configuration conf = context.getConfiguration();
URI[] cacheFile = DistributedCache.getCacheFiles(conf);
FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
String line;
try {
while ((line = joinReader.readLine()) != null) {
s = line.toString().split("\t");
do stuff to s
} finally {
joinReader.close();
}
The problem is that I only read in one line, and it is not the file I was putting into the cache. Rather it is: cm9vdA==, or root in base64.
Has anyone else had this problem, or see how I'm using distributed cache incorrectly? I am using Hadoop 0.20.2 fully distributed.

Common mistake in your job configuration:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
After you create your Job object, you need to pull back the Configuration object as Job makes a copy of it, and configuring values in conf2 after you create the job will have no effect on the job iteself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
You should also check the number of files in the distributed cache, there is probably more than one and you're opening a random file which is giving you the value you are seeing.
I suggest you use symlinking which will make the files available in the local working directory, and with a known name:
DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);
// then in your mapper setup:
BufferedReader joinReader = new BufferedReader(new FileInputStream("myfile"));

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

FileNotFoundException when using DistributedCache to access MapFile - hadoop

Related

Can't connect to remote hdfs using spring hadoop

Append to file in HDFS (CDH 5.4.5)

FileNotFoundException on hadoop

Reading in a parameter file in Amazon Elastic MapReduce and S3

Files not put correctly into distributed cache

Categories

Resources