loading an external properties file in udf - hadoop

When writing a UDF, let's say an EvalFunc, is it possible to pass a configuration file with
properties = new Properties();
properties.load(new FileInputStream("conf/config.properties"));
when running in Hadoop Mode?
Best,
Will

Here is a simple example of reading and writing files from Hadoop DFS: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
Maybe you can find some useful code in it to complete your job.
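For the UDF case in the question, a minimal sketch of an EvalFunc that reads its properties from HDFS could look like the following; the class name, the HDFS path, and the property key are assumptions for illustration, not part of the original question:
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ConfiguredFunc extends EvalFunc<String> {

    private Properties properties;

    // Lazily load the properties from HDFS the first time exec() is called.
    private void loadProperties() throws IOException {
        Path confPath = new Path("/user/will/conf/config.properties"); // hypothetical HDFS location
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(confPath);
        properties = new Properties();
        properties.load(in);
        in.close();
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (properties == null) {
            loadProperties();
        }
        // "some.key" is a hypothetical property name used only for illustration.
        return properties.getProperty("some.key", "default");
    }
}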
The following is my code; it successfully loads a properties file in Hadoop. I used Apache Commons Configuration: http://commons.apache.org/configuration/
public static PropertiesConfiguration loadProperties(String path) throws ConfigurationException, IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inFile = new Path(path);
    // Open the file on HDFS and let Commons Configuration parse the stream.
    FSDataInputStream in = fs.open(inFile);
    PropertiesConfiguration config = new PropertiesConfiguration();
    config.load(in);
    in.close();
    return config;
}
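For example (the path here is hypothetical):
PropertiesConfiguration config = loadProperties("/user/will/conf/config.properties");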

Use Apache Commons Configuration2 and VFS2:
Parameters params = new Parameters();
FileBasedConfigurationBuilder<PropertiesConfiguration> builder =
    new FileBasedConfigurationBuilder<>(PropertiesConfiguration.class)
        .configure(params.fileBased()
            .setFileSystem(new VFSFileSystem())
            .setLocationStrategy(new FileSystemLocationStrategy())
            .setEncoding("UTF-8")
            .setFileName(propertyPath));
PropertiesConfiguration config = builder.getConfiguration();
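Note: propertyPath is assumed to hold the location of the properties file (for example an hdfs:// URI), and reading hdfs:// locations through VFS typically also requires the commons-vfs2 HDFS provider and the Hadoop client jars on the classpath.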

Related

Can we use a file in the reduce function in hadoop?

I want to access a different file (other than the input file to map) in the reduce function. Is this possible?
Have a look at Distributed Cache. You can send a small file to mapper or reducer.
(if you use Java)
In your main/driver, set file for job:
job.addCacheFile(new URI("path/to/file/inHadoop/file.txt#var"));
Note: var is an alias used to access your file in the mapper/reducer, i.e. fn[1] in the code below.
In mapper or reducer, get file from context:
public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFiles = context.getCacheFiles();
    // The part after '#' is the alias (var) given in addCacheFile().
    String[] fn = cacheFiles[0].toString().split("#");
    BufferedReader br = new BufferedReader(new FileReader(fn[1]));
    String line = br.readLine();
    // do something with line
    br.close();
}
Note: cacheFiles[0] refers to the file you sent from your main/driver
More information

hadoop DistributedCache returns null

I'm using Hadoop DistributedCache, but I ran into some trouble.
My Hadoop is in pseudo-distributed mode.
From here we can see that in pseudo-distributed mode we use
DistributedCache.getLocalCache(xx) to retrieve the cached file.
First I put my file into the DistributedCache:
DistributedCache.addCacheFile(new Path(
        "hdfs://localhost:8022/user/administrator/myfile").toUri(),
        job.getConfiguration());
Then I retrieve it in the mapper's setup(), but DistributedCache.getLocalCache returns null. I can see my cached file through
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
and it prints out:
hdfs://localhost:8022/user/administrator/myfile
Here is my pseudocode:
public static class JoinMapper {
    @Override
    protected void setup(Context context) {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
        System.out.println("Cache: "
                + context.getConfiguration().get("mapred.cache.files"));
        Path cacheFile;
        if (cacheFiles != null) {}
    }
}
xx....
public static void main(String[] args) {
    Job job = new Job(conf, "Join Test");
    DistributedCache.addCacheFile(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(),
            job.getConfiguration());
}
Sorry about the poor typesetting. Can anyone help, please?
BTW, I can get the URIs using
URI[] uris = DistributedCache.getCacheFiles(context
        .getConfiguration());
uris returns:
hdfs://localhost:8022/user/administrator/myfile
When I try to read from the URI, I get a file not found exception.
The DistributedCache will copy your files from HDFS to the local file system of every TaskTracker.
How are you reading the file? If the file is in HDFS, you will have to get an HDFS FileSystem; otherwise it is going to use the default one (probably the local one). So to read the file from HDFS try:
String url = "hdfs://localhost:8022/user/administrator/myfile";
FileSystem fs = FileSystem.get(new Path(url).toUri(), new Configuration());
Path path = new Path(url);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));

How to get output data from Hadoop?

I have created a jar that runs the MapReduce job and generates the output in some directory.
I need to read the output data from the output directory from my Java code, which does not run in the Hadoop environment, without copying it into a local directory.
I am using ProcessBuilder to run the jar. Can anyone help me?
You can write the following code to read the output of the job within your MR driver code.
job.waitForCompletion(true);
FileSystem fs = FileSystem.get(conf);
Path[] outputFiles = FileUtil.stat2Paths(fs.listStatus(output, new OutputFilesFilter()));
for (Path file : outputFiles) {
    InputStream is = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    // read lines from reader and process them
    // ...
}
What's the problem with reading the HDFS data using the HDFS API?
public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
    conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream inputStream = fs.open(new Path("/mapout/input.txt"));
    System.out.println(inputStream.readLine());
}
Your program can run outside your Hadoop cluster, but the Hadoop daemons must be running.
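If you cannot put the cluster's XML configuration files on the client's classpath, a minimal alternative sketch is to point the client at the namenode directly; the host and port below are assumptions:
Configuration conf = new Configuration();
// fs.default.name is the Hadoop 1.x key; newer versions use fs.defaultFS.
conf.set("fs.default.name", "hdfs://localhost:9000");
FileSystem fs = FileSystem.get(conf);
BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path("/mapout/input.txt"))));
System.out.println(reader.readLine());
reader.close();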

Accessing files in hadoop distributed cache

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.
Then, my setup function looks like this:
public void setup(Context context) throws IOException, InterruptedException{
Configuration conf = context.getConfiguration();
Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
//etc
}
However, this localFiles array is always null.
I was initially running on a single-host cluster for testing, but I read that this will prevent the distributed cache from working. I tried with a pseudo-distributed cluster, but that didn't work either.
I'm using hadoop 1.0.3
thanks
Peter
Problem here was that I was doing the following:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on hadoop user list for the help.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/userpetercacheFiletestCache1"),job.getConfiguration());
You can also do it this way.
Once the Job has been created with a configuration object,
i.e. Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
then changes to the attributes of conf as shown below, e.g.
conf.set("delimiter", "|");
or
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
would not be reflected in a pseudo-distributed cluster or a real cluster; however, they would work in the local environment.
This version of the code (which is slightly different from the constructs mentioned above) has always worked for me.
// in main(String[] args)
Job job = new Job(conf, "Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in the Mapper code, so here it is:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read data
}

How to load an external property file in Hadoop

I have a Hadoop job which includes some Spring beans. Also, in the Spring context file, there is a PropertyPlaceholderConfigurer named app.properties.
This app.properties is inside the jar file; the idea is to remove it from the jar so that some properties can be changed without recompiling.
I tried the -file option and the -jarlibs option, but neither worked.
Any ideas?
What I did was:
- Subclass PropertyPlaceholderConfigurer
- Override the loadProperties method
- If there is a custom System.getProperty("hdfs_path"), load the properties from HDFS:
try {
    Path pt = new Path(hdfsLocationPath);
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
    props.load(br);
} catch (Exception e) {
    LOG.error(e);
}
works like a charm ...
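A fuller sketch of that subclass might look like the following; the class name, the hdfs_path system property, and the logger are assumptions based on the snippet above, not the poster's exact code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Properties;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.springframework.beans.factory.config.PropertyPlaceholderConfigurer;

public class HdfsPropertyPlaceholderConfigurer extends PropertyPlaceholderConfigurer {

    private static final Log LOG = LogFactory.getLog(HdfsPropertyPlaceholderConfigurer.class);

    @Override
    protected void loadProperties(Properties props) throws IOException {
        // Keep the default behaviour (e.g. the app.properties bundled in the jar).
        super.loadProperties(props);

        // If an HDFS location was supplied, let it override the bundled properties.
        String hdfsLocationPath = System.getProperty("hdfs_path");
        if (hdfsLocationPath != null) {
            try {
                Path pt = new Path(hdfsLocationPath);
                FileSystem fs = FileSystem.get(new Configuration());
                BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
                props.load(br);
                br.close();
            } catch (Exception e) {
                LOG.error(e);
            }
        }
    }
}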
You can add this properties file to the distributed cache as follows:
...
String s3PropertiesFilePath = args[0];
DistributedCache.addCacheFile(new URI(s3PropertiesFilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3PropertiesFilePath;
Properties prop = new Properties();

@Override
public void configure(JobConf job) {
    s3PropertiesFilePath = DistributedCache.getLocalCacheFiles(job)[0];
    // load the properties file
    prop.load(new FileInputStream(s3PropertiesFilePath.toString()));
    ...
}
PS: If you are not running it on Amazon EMR, then you can keep this properties file in your hdfs and provide that path instead.
