hadoop DistributedCache returns null - hadoop

I'm using the Hadoop DistributedCache, but I ran into some trouble.
My Hadoop is in pseudo-distributed mode.
From here we can see that in pseudo-distributed mode we use
DistributedCache.getLocalCache(xx) to retrieve the cached file.
First I put my file into the DistributedCache:
DistributedCache.addCacheFile(new Path(
    "hdfs://localhost:8022/user/administrator/myfile").toUri(),
    job.getConfiguration());
Then I retrieve it in the mapper's setup(), but DistributedCache.getLocalCacheFiles returns null. I can see my cached file through
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
and it prints out:
hdfs://localhost:8022/user/administrator/myfile
Here is my pseudocode:
public static class JoinMapper {
    @Override
    protected void setup(Context context) {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
        Path cacheFile;
        if (cacheFiles != null) {}
    }
}
xx....
public static void main(String[] args) {
    Job job = new Job(conf, "Join Test");
    DistributedCache.addCacheFile(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(),
        job.getConfiguration());
}
Sorry about the poor typesetting. Can anyone help, please?
By the way, I can get the URIs using
URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());
and uris returns:
hdfs://localhost:8022/user/administrator/myfile
When I try to read from the URI, it fails with a FileNotFoundException.

The DistributedCache will copy your files from HDFS to the local file system of every TaskTracker.
How are you reading the file? If the file is in HDFS, you will have to get an HDFS FileSystem; otherwise it is going to use the default (probably the local one). So to read the file in HDFS, try:
URI uri = new Path("hdfs://localhost:8022/user/administrator/myfile").toUri();
FileSystem fs = FileSystem.get(uri, new Configuration());
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(new Path(uri))));
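Once a cached file has been localized to the TaskTracker's disk, the read itself is plain Java I/O. Here is a minimal, Hadoop-free sketch of that same stream-reading pattern; the temporary file and its contents are stand-ins for the localized cache file:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalReadSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for a file the DistributedCache has copied to local disk.
        Path local = Files.createTempFile("myfile", ".txt");
        Files.write(local, "key1\tvalue1\n".getBytes("UTF-8"));

        // Same pattern as fs.open(path): wrap the raw stream in a reader.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(local.toFile()), "UTF-8"))) {
            System.out.println(br.readLine());
        }
        Files.delete(local);
    }
}
```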

Related

Can we use a file in the reduce function in hadoop?

I want to access a different file (other than the map input file) in my reduce function. Is this possible?
Have a look at the DistributedCache. You can send a small file to the mapper or reducer.
(if you use Java)
In your main/driver, set file for job:
job.addCacheFile(new URI("path/to/file/inHadoop/file.txt#var"));
Note: var is a variable name used to access your file in the mapper/reducer, i.e. fn[1] in the code below.
In mapper or reducer, get file from context:
public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFiles = context.getCacheFiles();
    String[] fn = cacheFiles[0].toString().split("#");
    BufferedReader br = new BufferedReader(new FileReader(fn[1]));
    String line = br.readLine();
    // do something with line
}
Note: cacheFiles[0] refers to the file you sent from your main/driver
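The `#var` part is simply the URI's fragment component, so the split in the snippet above can be verified with plain Java, no Hadoop required:

```java
import java.net.URI;

public class CacheUriFragment {
    public static void main(String[] args) throws Exception {
        // Same URI shape as job.addCacheFile(new URI(".../file.txt#var"))
        URI cacheUri = new URI("path/to/file/inHadoop/file.txt#var");

        // The split("#") used in setup() recovers the symlink name...
        String[] fn = cacheUri.toString().split("#");
        System.out.println(fn[1]);

        // ...which is the same thing as the URI's fragment component.
        System.out.println(cacheUri.getFragment());
    }
}
```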

Using Hadoop DistributedCache with archives

Hadoop's DistributedCache documentation doesn't seem to sufficiently describe how to use the distributed cache. Here is the example given:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
        implements Mapper<K, V, K, V> {
    private Path[] localArchives;
    private Path[] localFiles;

    public void configure(JobConf job) {
        // Get the cached archives/files
        File f = new File("./map.zip/some/file/in/zip.txt");
    }

    public void map(K key, V value,
                    OutputCollector<K, V> output, Reporter reporter)
            throws IOException {
        // Use data from the cached archives/files here
        // ...
        output.collect(k, v);
    }
}
I've been searching around for over an hour trying to figure out how to use this. After piecing together a few other SO questions, here's what I came up with:
public static void main(String[] args) throws Exception {
    Job job = new Job(new JobConf(), "Job Name");
    Configuration conf = job.getConfiguration();
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheArchive(new URI("/ProjectDir/LookupTable.zip"), conf);
    // *Rest of configuration code*
}
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    private Path[] localArchives;

    public void configure(JobConf job) {
        // Get the cached archive
        File file1 = new File("./LookupTable.zip/file1.dat");
        BufferedReader br1index = new BufferedReader(new FileReader(file1));
    }

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    { // *Map code* }
}
Where am I supposed to call the void configure(JobConf job) function?
Where do I use the private Path[] localArchives object?
Is my code in the configure() function the correct way to access files within an archive and to link a file with a BufferedReader?
I will answer your questions with respect to the new API and common practices for using the distributed cache.
Where am I supposed to call the void configure(JobConf job) function?
The framework will call the protected void setup(Context context) method once at the beginning of every map task; the logic for using cache files is usually handled here, for example reading the file and storing the data in a variable to be used in the map() function, which is called after setup().
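This setup-once, use-many-times pattern can be sketched without Hadoop at all; the class and method names below are illustrative stand-ins for a Mapper, not part of the Hadoop API:

```java
import java.util.HashMap;
import java.util.Map;

public class SetupThenMapSketch {
    private final Map<String, String> lookup = new HashMap<>();

    // Analogous to setup(Context): runs once, loads the cached data.
    void setup() {
        lookup.put("k1", "v1");
        lookup.put("k2", "v2");
    }

    // Analogous to map(): called once per record, reuses the loaded data.
    String map(String key) {
        return lookup.getOrDefault(key, "MISSING");
    }

    public static void main(String[] args) {
        SetupThenMapSketch task = new SetupThenMapSketch();
        task.setup();                       // framework calls this first
        System.out.println(task.map("k1")); // then map() once per record
        System.out.println(task.map("k3"));
    }
}
```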
Where do I use the private Path[] localArchives object?
It is typically used in the setup() method to retrieve the paths of the cache files, something like this:
Path[] localArchives = DistributedCache.getLocalCacheFiles(context.getConfiguration());
Is my code in the configure() function the correct way to access files within an archive and to link a file with a BufferedReader?
It's missing a call to the method that retrieves the paths where the cache files are stored (shown above). Once a path is retrieved, the file can be read as below:
FileSystem fs = FileSystem.getLocal(context.getConfiguration());
FSDataInputStream in = fs.open(localArchives[0]);
BufferedReader br = new BufferedReader(new InputStreamReader(in));

How to get Output data from hadoop?

I have created a jar that runs a MapReduce job and generates output in some directory.
I need to read the data from that output directory from my Java code, which does not run in the Hadoop environment, without copying it into a local directory.
I am using ProcessBuilder to run the jar. Can anyone help me?
You can write the following code to read the output of the job within your MR driver code.
job.waitForCompletion(true);
FileSystem fs = FileSystem.get(conf);
Path[] outputFiles = FileUtil.stat2Paths(fs.listStatus(output, new OutputFilesFilter()));
for (Path file : outputFiles) {
    InputStream is = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    // ...
}
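Outside Hadoop, the same list-filter-read loop is plain Java. A minimal sketch against a local directory; the `part-` prefix mirrors the MapReduce output-file naming convention, and `_SUCCESS` is the kind of marker entry the filter is there to skip:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ReadOutputSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for a job output directory.
        Path outDir = Files.createTempDirectory("mapout");
        Files.write(outDir.resolve("part-r-00000"), List.of("apple\t3"));
        Files.createFile(outDir.resolve("_SUCCESS"));

        // List the directory, keep only real output files (skip _SUCCESS etc.).
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(outDir, "part-*")) {
            for (Path file : files) {
                lines.addAll(Files.readAllLines(file));
            }
        }
        System.out.println(lines);
    }
}
```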
What's the problem with reading HDFS data using the HDFS API?
public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream inputStream = fs.open(new Path("/mapout/input.txt"));
    System.out.println(inputStream.readLine());
}
Your program can run outside your Hadoop cluster, but the Hadoop daemons must be running.

Accessing files in hadoop distributed cache

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.
Then, my setup function looks like this:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    // etc.
}
However, this localFiles array is always null.
I was initially running on a single-host cluster for testing, but I read that this will prevent the distributed cache from working. I tried with a pseudo-distributed cluster, but that didn't work either.
I'm using Hadoop 1.0.3.
Thanks,
Peter
Problem here was that I was doing the following:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on hadoop user list for the help.
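The order-of-operations pitfall is easy to reproduce without Hadoop. This sketch imitates a constructor that makes a defensive copy of its configuration, just as the `Job` constructor copies the `Configuration`; `FakeJob` is an illustrative stand-in, not a Hadoop class:

```java
import java.util.HashMap;
import java.util.Map;

public class CopySemanticsSketch {
    // Imitates Job: copies the conf it is given, like new Job(conf, ...).
    static class FakeJob {
        private final Map<String, String> conf;
        FakeJob(Map<String, String> conf) {
            this.conf = new HashMap<>(conf); // internal copy, as Job makes
        }
        String get(String key) { return conf.get(key); }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Wrong order: construct first, then mutate the original conf.
        FakeJob job = new FakeJob(conf);
        conf.put("mapred.cache.files", "/user/peter/cacheFile/testCache1");
        System.out.println(job.get("mapred.cache.files")); // copy was made before the put

        // Right order: mutate conf first, then construct.
        FakeJob job2 = new FakeJob(conf);
        System.out.println(job2.get("mapred.cache.files"));
    }
}
```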
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
You can also do it this way.
Once the Job has been constructed with a configuration object, i.e.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
then changes made afterwards to conf, e.g.
conf.set("delimiter", "|");
or
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
would not be reflected in a pseudo cluster or a real cluster; they would, however, work in the local environment.
This version of the code (which is slightly different from the constructs mentioned above) has always worked for me.
// in main(String[] args)
Job job = new Job(conf, "Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in the Mapper code, so here it is:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read the data
}

loading an external properties file in udf

When writing a UDF, let's say an EvalFunc, is it possible to load a configuration file with
properties = new Properties();
properties.load(new FileInputStream("conf/config.properties"));
when running in Hadoop mode?
Best,
Will
Here is a simple example of reading and writing files from the Hadoop DFS, from http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample;
maybe you can find some useful code in it to complete your job.
The following is my code; it successfully loads a properties file in Hadoop. I used Apache Commons Configuration, http://commons.apache.org/configuration/:
public static PropertiesConfiguration loadProperties(String path) throws ConfigurationException, IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inFile = new Path(path);
    FSDataInputStream in = fs.open(inFile);
    PropertiesConfiguration config = new PropertiesConfiguration();
    config.load(in);
    in.close();
    return config;
}
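The same load-from-a-stream pattern also works with the plain java.util.Properties class; here is a Hadoop-free sketch, with the file contents made up for illustration (any InputStream, including one from fs.open(), can be passed to load()):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Properties;

public class LoadPropertiesSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for fs.open(inFile): any InputStream works with load().
        String fileBody = "delimiter=|\nmax.retries=3\n";
        try (InputStream in =
                 new ByteArrayInputStream(fileBody.getBytes("ISO-8859-1"))) {
            Properties props = new Properties();
            props.load(in);
            System.out.println(props.getProperty("delimiter"));
            System.out.println(props.getProperty("max.retries"));
        }
    }
}
```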
Use Apache Commons Configuration2 and VFS2:
Parameters params = new Parameters();
FileBasedConfigurationBuilder<PropertiesConfiguration> builder =
    new FileBasedConfigurationBuilder<>(PropertiesConfiguration.class)
        .configure(params.fileBased().setFileSystem(new VFSFileSystem())
            .setLocationStrategy(new FileSystemLocationStrategy())
            .setEncoding("UTF-8").setFileName(propertyPath));
PropertiesConfiguration config = builder.getConfiguration();
