How to get output data from Hadoop?

I have created a jar that runs a MapReduce job and writes its output to some directory.
I need to read that output directory from Java code that does not run in the Hadoop environment, without copying it into a local directory.
I am using ProcessBuilder to run the jar. Can anyone help?

You can use the following code to read the output of the job from within your MR driver.
job.waitForCompletion(true);
FileSystem fs = FileSystem.get(conf);
Path[] outputFiles = FileUtil.stat2Paths(fs.listStatus(output, new OutputFilesFilter()));
for (Path file : outputFiles) {
    InputStream is = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    String line;
    while ((line = reader.readLine()) != null) {
        // process each line of the job output here
    }
    reader.close();
}

What's the problem with reading HDFS data using the HDFS API?
public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream inputStream = fs.open(new Path("/mapout/input.txt"));
    System.out.println(inputStream.readLine());
}
Your program can run outside the Hadoop cluster, but the Hadoop daemons must be running.
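If you cannot ship the cluster's XML config files with your client, here is a minimal sketch that points the Configuration at the NameNode directly; the NameNode address and the output path are assumptions for your setup:
// Minimal sketch of an HDFS client running outside the cluster.
// The NameNode address and output path are assumptions; replace them
// with the values from your cluster's core-site.xml.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadJobOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as loading core-site.xml; "fs.default.name" is the
        // Hadoop 1.x property name (it is "fs.defaultFS" on Hadoop 2+).
        conf.set("fs.default.name", "hdfs://namenode-host:9000"); // assumed address
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path("/mapout/part-r-00000"); // assumed reducer output file
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(part)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}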

Related

Can we use a file in the reduce function in hadoop?

I want to access a different file (other than the input file to the map) in the reduce function. Is this possible?
Have a look at the Distributed Cache. You can send a small file to each mapper or reducer (if you use Java).
In your main/driver, add the file to the job:
job.addCacheFile(new URI("path/to/file/inHadoop/file.txt#var"));
Note: var is the symlink name used to access your file in the mapper/reducer, i.e. fn[1] in the code below.
In the mapper or reducer, get the file from the context:
public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFiles = context.getCacheFiles();
    String[] fn = cacheFiles[0].toString().split("#");
    BufferedReader br = new BufferedReader(new FileReader(fn[1]));
    String line = br.readLine();
    // do something with line
}
Note: cacheFiles[0] refers to the file you added from your main/driver.

hadoop DistributedCache returns null

I'm using the Hadoop DistributedCache, but I ran into some trouble.
My Hadoop is running in pseudo-distributed mode.
From the documentation we can see that in pseudo-distributed mode we use DistributedCache.getLocalCache(xx) to retrieve a cached file.
First I put my file into the DistributedCache:
DistributedCache.addCacheFile(
        new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(),
        job.getConfiguration());
Then I try to retrieve it in the mapper's setup(), but DistributedCache.getLocalCacheFiles returns null. I can see my cached file through
System.out.println("Cache: "+context.getConfiguration().get("mapred.cache.files"));
and it prints out:
hdfs://localhost:8022/user/administrator/myfile
Here is my pseudocode:
public static class JoinMapper {
    @Override
    protected void setup(Context context) {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
        Path cacheFile;
        if (cacheFiles != null) {
        }
    }
}
...
public static void main(String[] args) {
    Job job = new Job(conf, "Join Test");
    DistributedCache.addCacheFile(
            new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(),
            job.getConfiguration());
}
Sorry about the poor typesetting. Can anyone help?
By the way, I can get the URIs using
URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());
uris returns:
hdfs://localhost:8022/user/administrator/myfile
but when I try to read from the URI, I get a FileNotFoundException.
The Distributed Cache will copy your files from HDFS to the local file system of all TaskTrackers.
How are you reading the file? If the file is in HDFS, you will have to get an HDFS FileSystem; otherwise it is going to use the default one (probably the local one). So to read the file from HDFS, try:
FileSystem fs = FileSystem.get(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(), new Configuration());
Path path = new Path("hdfs://localhost:8022/user/administrator/myfile");
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
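Conversely, to read the local copy that the Distributed Cache created on the TaskTracker node, plain java.io is enough, since getLocalCacheFiles() returns local-filesystem paths. A minimal sketch inside the mapper's setup(), assuming the same single cached file as above:
// Sketch: reading the *local* copy created by the Distributed Cache,
// inside the mapper's setup(). getLocalCacheFiles() returns paths on
// the node's local filesystem, so plain java.io works here.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cacheFiles != null && cacheFiles.length > 0) {
        BufferedReader br = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        String line;
        while ((line = br.readLine()) != null) {
            // use the cached data
        }
        br.close();
    }
}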

Accessing files in hadoop distributed cache

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.
Then, my setup function looks like this:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    // etc.
}
However, this localFiles array is always null.
I was initially running on a single-host cluster for testing, but I read that this will prevent the distributed cache from working. I tried with a pseudo-distributed cluster, but that didn't work either.
I'm using Hadoop 1.0.3.
Thanks,
Peter
The problem here was that I was doing the following:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on the hadoop-user list for the help.
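For what it's worth, on Hadoop 2+ the same pitfall goes away if you add the file through the Job object itself, since Job.addCacheFile modifies the job's own configuration. A minimal sketch of that style, reusing the same path from above:
// Sketch using the newer mapreduce API (Hadoop 2+): adding the cache
// file through the Job avoids mutating a Configuration the Job has
// already copied.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
job.addCacheFile(new URI("/user/peter/cacheFile/testCache1"));
// in the task, retrieve it with: URI[] files = context.getCacheFiles();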
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
You can also do it this way.
Once the Job has been created with a configuration object, i.e.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
any later changes to conf, e.g.
conf.set("delimiter", "|");
or
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
will not be reflected in a pseudo-distributed or fully distributed cluster, although they do work in the local environment.
This version of the code (which is slightly different from the constructs mentioned above) has always worked for me:
// in main(String[] args)
Job job = new Job(conf, "Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in the Mapper code above, so here it is:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
    // [0] because we added just one file
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read data
}

How to load an external property file in Hadoop

I have a Hadoop job which includes some Spring beans. Also, in the Spring context file, there is a PropertyPlaceholderConfigurer that reads app.properties.
This app.properties is inside the jar file; the idea is to move it out of the jar so that some properties can be changed without recompiling.
I tried the -file option and the -jarlibs option, but neither worked.
Any ideas?
What I did was:
Subclass PropertyPlaceholderConfigurer
Override the loadProperties method
If a custom System.getProperty("hdfs_path") is set, load the properties from HDFS:
try {
    Path pt = new Path(hdfsLocationPath);
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
    props.load(br);
} catch (Exception e) {
    LOG.error(e);
}
works like a charm ...
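Pieced together, a minimal sketch of that subclass could look like the following; the class name and the hdfs_path system property name are assumptions, only the loadProperties body above comes from the answer:
// Hypothetical subclass wiring the snippet above into Spring; the class
// and property names are illustrative, not from the original answer.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Properties;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.springframework.beans.factory.config.PropertyPlaceholderConfigurer;

public class HdfsPropertyPlaceholderConfigurer extends PropertyPlaceholderConfigurer {
    private static final Log LOG = LogFactory.getLog(HdfsPropertyPlaceholderConfigurer.class);

    @Override
    protected void loadProperties(Properties props) throws IOException {
        super.loadProperties(props); // keep any locally configured locations
        String hdfsLocationPath = System.getProperty("hdfs_path"); // assumed property name
        if (hdfsLocationPath != null) {
            try {
                Path pt = new Path(hdfsLocationPath);
                FileSystem fs = FileSystem.get(new Configuration());
                BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
                props.load(br);
            } catch (Exception e) {
                LOG.error(e);
            }
        }
    }
}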
You can add this properties file to the distributed cache as follows:
...
String s3PropertiesFilePath = args[0];
DistributedCache.addCacheFile(new URI(s3PropertiesFilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3PropertiesFilePath;
Properties prop = new Properties();

@Override
public void configure(JobConf job) {
    try {
        s3PropertiesFilePath = DistributedCache.getLocalCacheFiles(job)[0];
        // load the properties file
        prop.load(new FileInputStream(s3PropertiesFilePath.toString()));
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    ...
}
PS: If you are not running it on Amazon EMR, then you can keep this properties file in your HDFS and provide that path instead.
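For example (the HDFS location below is an assumption):
DistributedCache.addCacheFile(new URI("hdfs://namenode-host:9000/user/you/app.properties"), conf); // assumed path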

loading an external properties file in udf

When writing a UDF, let's say an EvalFunc, is it possible to pass a configuration file with
properties = new Properties();
properties.load(new FileInputStream("conf/config.properties"));
when running in Hadoop mode?
Best,
Will
Here is a simple example of reading and writing files from Hadoop DFS: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
Maybe you can find some useful code in it to complete your job.
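Along the lines of that wiki example, here is a minimal read/write sketch; the /tmp path is a placeholder:
// Minimal HDFS read/write sketch in the spirit of the wiki example;
// the /tmp path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // write a file to HDFS
        Path out = new Path("/tmp/example.txt"); // placeholder path
        FSDataOutputStream os = fs.create(out, true);
        os.writeUTF("hello hdfs");
        os.close();

        // read it back
        FSDataInputStream is = fs.open(out);
        System.out.println(is.readUTF());
        is.close();
    }
}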
The following is my code; it successfully loads a properties file in Hadoop. I used Apache Commons Configuration: http://commons.apache.org/configuration/
public static void loadProperties(String path) throws ConfigurationException, IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inFile = new Path(path);
    FSDataInputStream in = fs.open(inFile);
    PropertiesConfiguration config = new PropertiesConfiguration();
    config.load(in);
    in.close();
}
Use Apache Commons Configuration2 and VFS2:
Parameters params = new Parameters();
FileBasedConfigurationBuilder<PropertiesConfiguration> builder =
        new FileBasedConfigurationBuilder<>(PropertiesConfiguration.class)
                .configure(params.fileBased()
                        .setFileSystem(new VFSFileSystem())
                        .setLocationStrategy(new FileSystemLocationStrategy())
                        .setEncoding("UTF-8")
                        .setFileName(propertyPath));
config = builder.getConfiguration();
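Once the builder has produced the configuration, reading a value is the usual Commons Configuration call; a hedged example where "delimiter" is an assumed key:
String delimiter = config.getString("delimiter", "|"); // assumed key, with a default value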
