How to load an external property file in Hadoop - spring

I have a Hadoop job which includes some Spring beans. In the Spring context file there is a PropertyPlaceholderConfigurer that reads app.properties.
This app.properties is currently packaged inside the jar file; the idea is to move it outside the jar so that some properties can be changed without recompiling.
I tried the -files option and the -libjars option, but neither worked.
Any ideas?

What I did was:
Subclass PropertyPlaceholderConfigurer
Override the loadProperties method
If a custom System.getProperty("hdfs_path") is set, load the properties from that HDFS location:
try {
    Path pt = new Path(hdfsLocationPath);
    FileSystem fs = FileSystem.get(new Configuration());
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
        props.load(br);
    }
} catch (Exception e) {
    LOG.error(e);
}
works like a charm ...
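For reference, a minimal sketch of such a subclass (the class name is mine; the hdfs_path property and the HDFS-reading code come from the steps above, the rest is an assumption about how it is wired up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Properties;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.springframework.beans.factory.config.PropertyPlaceholderConfigurer;

// Hypothetical subclass: loads the bundled defaults first, then overlays the
// properties file found at the HDFS location given by -Dhdfs_path=...
public class HdfsPropertyPlaceholderConfigurer extends PropertyPlaceholderConfigurer {

    private static final Log LOG = LogFactory.getLog(HdfsPropertyPlaceholderConfigurer.class);

    @Override
    protected void loadProperties(Properties props) throws IOException {
        super.loadProperties(props); // defaults packaged in the jar
        String hdfsLocationPath = System.getProperty("hdfs_path");
        if (hdfsLocationPath != null) {
            try {
                Path pt = new Path(hdfsLocationPath);
                FileSystem fs = FileSystem.get(new Configuration());
                try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
                    props.load(br); // external values override the bundled ones
                }
            } catch (Exception e) {
                LOG.error(e);
            }
        }
    }
}

The bean definition in the Spring context would then point at this class instead of the stock PropertyPlaceholderConfigurer.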

You can add this properties file to the distributed cache as follows:
...
String s3PropertiesFilePath = args[0];
DistributedCache.addCacheFile(new URI(s3PropertiesFilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3PropertiesFilePath;
Properties prop = new Properties();
@Override
public void configure(JobConf job) {
    try {
        s3PropertiesFilePath = DistributedCache.getLocalCacheFiles(job)[0];
        // load the properties file
        prop.load(new FileInputStream(s3PropertiesFilePath.toString()));
    } catch (IOException e) {
        throw new RuntimeException("Could not load properties from the distributed cache", e);
    }
    ...
}
PS: If you are not running it on Amazon EMR, you can keep this properties file in HDFS and provide that path instead.
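For example, outside EMR the driver could register an HDFS path the same way (the path below is only illustrative):

// Hypothetical HDFS location used instead of the S3 path taken from args[0]
String hdfsPropertiesFilePath = "hdfs://namenode:8020/user/me/app.properties";
DistributedCache.addCacheFile(new URI(hdfsPropertiesFilePath), conf);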

Related

Hadoop PathFilter config is null

I've got a path filter that looks like this:
public class AvroFileInclusionFilter extends Configured implements PathFilter {
    Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public boolean accept(Path path) {
        System.out.println("FileInclusion: " + conf.get("fileInclusion"));
        return true;
    }
}
I am explicitly setting the fileInclusion property on the configuration. For some reason, the configuration being used in the path filter is not the same configuration that I am setting up in my job, like so:
Job job = Job.getInstance(getConf(), "Stock Updater");
job.getConfiguration().set("outputPath", opts.outputPath);
String[] inputPaths = findPathsForDays(job.getConfiguration(),
new Path(opts.inputPath), findDaysToQuery(job.getConfiguration(),
opts.updatefile)).toArray(new String[]{});
job.getConfiguration().set("fileInclusion", "hello`");
AvroKeyValueInputFormat.addInputPath(job, new Path(opts.inputPath));
job.getConfiguration().set("mapred.input.pathFilter.class", AvroFileInclusionFilter.class.getName());
job.setInputFormatClass(AvroKeyValueInputFormat.class);
LazyOutputFormat.setOutputFormatClass(job, AvroKeyValueOutputFormat.class);
AvroKeyValueOutputFormat.setOutputPath(job, new Path(opts.outputPath));
job.addCacheFile(new Path(opts.updatefile).toUri());
AvroKeyValueOutputFormat.setCompressOutput(job, true);
job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC, snappyCodec().toString());
AvroJob.setInputKeySchema(job, DateKey.SCHEMA$);
AvroJob.setInputValueSchema(job, StockUpdated.SCHEMA$);
AvroJob.setMapOutputKeySchema(job, DateKey.SCHEMA$);
AvroJob.setMapOutputValueSchema(job, StockUpdated.SCHEMA$);
AvroJob.setOutputKeySchema(job, DateKey.SCHEMA$);
AvroJob.setOutputValueSchema(job, StockUpdated.SCHEMA$);
job.setMapperClass(StockUpdaterMapper.class);
job.setReducerClass(StockUpdaterReducer.class);
AvroMultipleOutputs.addNamedOutput(job, "output", AvroKeyValueOutputFormat.class,
DateKey.SCHEMA$, StockUpdated.SCHEMA$);
job.setJarByClass(getClass());
boolean success = job.waitForCompletion(true);
The conf.get("fileInclusion") is always null and I cannot seem to figure out why. I've been working on this for quite a while now and I'm pretty much at the end of my rope. Why is the configuration different? I'm submitting the job using "hadoop jar" and "yarn jar".
Instead of creating the Job object by passing getConf() as its argument, try the following:
Configuration conf = new Configuration();
conf.set("outputPath", opts.outputPath);
conf.set("mapred.input.pathFilter.class", AvroFileInclusionFilter.class.getName());
..
..
// After setting the required key/value pairs on the Configuration object, create the Job by supplying conf
Job job = new Job(conf, "Stock Updater");
The PathFilter should use 'implements Configurable' instead of 'extends Configured'; a sketch follows.
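A minimal sketch of the filter rewritten that way, keeping the fileInclusion property from the question (the getConf() implementation is simply what Configurable requires):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class AvroFileInclusionFilter implements PathFilter, Configurable {
    private Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public boolean accept(Path path) {
        // conf is injected via setConf when the framework instantiates the filter
        System.out.println("FileInclusion: " + conf.get("fileInclusion"));
        return true;
    }
}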

hadoop DistributedCache returns null

I'm using the Hadoop DistributedCache, but I've run into some trouble.
My Hadoop is running in pseudo-distributed mode.
From here we can see that in pseudo-distributed mode we use
DistributedCache.getLocalCacheFiles(...) to retrieve the cached file.
First I put my file into the DistributedCache:
DistributedCache.addCacheFile(new Path(
"hdfs://localhost:8022/user/administrator/myfile").toUri(),
job.getConfiguration());
Then I retrieve it in the mapper's setup(), but DistributedCache.getLocalCacheFiles returns null. I can see my cached file through
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
and it prints out:
hdfs://localhost:8022/user/administrator/myfile
Here is my pseudocode:
public static class JoinMapper {
    @Override
    protected void setup(Context context) {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
        Path cacheFile;
        if (cacheFiles != null) {}
    }
}
xx....
public static void main(String[] args) {
    Job job = new Job(conf, "Join Test");
    DistributedCache.addCacheFile(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(),
            job.getConfiguration());
}
Sorry about the poor typesetting. Can anyone help, please?
By the way, I can get the URIs using
URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());
uris returns:
hdfs://localhost:8022/user/administrator/myfile
When I try to read from the URI, I get a file-not-found exception.
The Distributed Cache will copy your files from HDFS to the local file system of every TaskTracker.
How are you reading the file? If the file is in HDFS you will have to get an HDFS FileSystem; otherwise it is going to use the default one (probably the local one). So to read the file from HDFS try:
URI uri = new Path("hdfs://localhost:8022/user/administrator/myfile").toUri();
FileSystem fs = FileSystem.get(uri, new Configuration());
Path path = new Path(uri);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
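Alternatively, since the cache copies the file to each task's local file system, you can read the local copy returned by getLocalCacheFiles. A rough sketch, assuming a single cached file and that setup() declares IOException:

// Inside the mapper's setup(): read the locally cached copy instead of going back to HDFS
Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
if (localFiles != null && localFiles.length > 0) {
    FileSystem localFs = FileSystem.getLocal(context.getConfiguration());
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(localFs.open(localFiles[0])))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // process each line of the cached file
        }
    }
}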

File inside jar is not visible for spring

All
I created a jar file with the following MANIFEST.MF inside:
Manifest-Version: 1.0
Ant-Version: Apache Ant 1.8.3
Created-By: 1.6.0_25-b06 (Sun Microsystems Inc.)
Main-Class: my.Main
Class-Path: . lib/spring-core-3.2.0.M2.jar lib/spring-beans-3.2.0.M2.jar
In its root there is a file called my.config which is referenced in my spring-context.xml like this:
<bean id="..." class="...">
<property name="resource" value="classpath:my.config" />
</bean>
If I run the jar, everything looks fine except for the loading of that specific file:
Caused by: java.io.FileNotFoundException: class path resource [my.config] cannot be resolved to absolute file path because it does not reside in the file system: jar:file:/D:/work/my.jar!/my.config
at org.springframework.util.ResourceUtils.getFile(ResourceUtils.java:205)
at org.springframework.core.io.AbstractFileResolvingResource.getFile(AbstractFileResolvingResource.java:52)
at eu.stepman.server.configuration.BeanConfigurationFactoryBean.getObject(BeanConfigurationFactoryBean.java:32)
at eu.stepman.server.configuration.BeanConfigurationFactoryBean.getObject(BeanConfigurationFactoryBean.java:1)
at org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:142)
... 22 more
classes are loaded from inside the jar
spring and other dependencies are loaded from separate jars
the spring context is loaded (new ClassPathXmlApplicationContext("spring-context/applicationContext.xml"))
my.properties is loaded into a PropertyPlaceholderConfigurer ("classpath:my.properties")
if I put my .config file outside the jar on the file system and change the resource URL to 'file:', everything seems to be fine...
Any tips?
If your spring-context.xml and my.config files are in different jars then you will need to use classpath*:my.config?
More info here
Also, make sure you are using resource.getInputStream() not resource.getFile() when loading from inside a jar file.
When running from the packaged Spring jar, I use new ClassPathResource(filename).getFile(), which throws the exception:
cannot be resolved to absolute file path because it does not reside in the file system: jar
But using new ClassPathResource(filename).getInputStream() solves the problem. The reason is that a configuration file inside the jar does not exist in the operating system's file tree, so you must use getInputStream().
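A minimal sketch of that approach (the resource name and the target Properties object are illustrative):

// Stream a classpath resource that may live inside a jar; never ask for a java.io.File
ClassPathResource resource = new ClassPathResource("my.config");
Properties config = new Properties();
try (InputStream in = resource.getInputStream()) {
    config.load(in);
}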
I know this question has already been answered, but for those using Spring Boot, this link helped me: https://smarterco.de/java-load-file-classpath-spring-boot/
However, resourceLoader.getResource("classpath:file.txt").getFile(); was causing this problem, and sbk's comment:
That's it. A java.io.File represents a file on the file system, in a
directory structure. The Jar is a java.io.File. But anything within
that file is beyond the reach of java.io.File. As far as java is
concerned, until it is uncompressed, a class in jar file is no
different than a word in a word document.
helped me understand why to use getInputStream() instead. It works for me now!
Thanks!
The error message is correct (if not very helpful): the file we're trying to load is not a file on the filesystem, but a chunk of bytes in a ZIP inside a ZIP.
Through experimentation (Java 11, Spring Boot 2.3.x), I found this to work without changing any config or even a wildcard:
var resource = ResourceUtils.getURL("classpath:some/resource/in/a/dependency");
new BufferedReader(
new InputStreamReader(resource.openStream())
).lines().forEach(System.out::println);
I had a similar problem when using Tomcat 6.x and none of the advice I found was helping.
In the end I deleted Tomcat's work folder and the problem was gone.
I know it is illogical, but I'm leaving it here for documentation purposes...
I was having an issue recursively loading resources in my Spring app, and found that the fix was to use resource.getInputStream(). Here's an example showing how to recursively read in all files in config/myfiles that are JSON files.
Example.java
private String myFilesResourceUrl = "config/myfiles/**/";
private String myFilesResourceExtension = "json";
ResourceLoader rl = new ResourceLoader();
// Recursively get resources that match.
// Big note: if you decide to iterate over these,
// use resource.getInputStream() to load the contents,
// or use the readResource method of the ResourceLoader class below.
Resource[] resources = rl.getResourcesInResourceFolder(myFilesResourceUrl, myFilesResourceExtension);
// Recursively get resource and their contents that match.
// This loads all the files into memory, so maybe use the same approach
// as this method, if need be.
Map<Resource,String> contents = rl.getResourceContentsInResourceFolder(myFilesResourceUrl, myFilesResourceExtension);
ResourceLoader.java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.io.support.ResourcePatternResolver;
import org.springframework.util.StreamUtils;
public class ResourceLoader {

    public Resource[] getResourcesInResourceFolder(String folder, String extension) {
        ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
        try {
            String resourceUrl = folder + "/*." + extension;
            Resource[] resources = resolver.getResources(resourceUrl);
            return resources;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public String readResource(Resource resource) throws IOException {
        try (InputStream stream = resource.getInputStream()) {
            return StreamUtils.copyToString(stream, Charset.defaultCharset());
        }
    }

    public Map<Resource, String> getResourceContentsInResourceFolder(
            String folder, String extension) {
        Resource[] resources = getResourcesInResourceFolder(folder, extension);
        HashMap<Resource, String> result = new HashMap<>();
        for (var resource : resources) {
            try {
                String contents = readResource(resource);
                result.put(resource, contents);
            } catch (IOException e) {
                throw new RuntimeException("Could not load resource=" + resource + ", e=" + e);
            }
        }
        return result;
    }
}
For Kotlin users, I solved it like this:
val url = ResourceUtils.getURL("classpath:$fileName")
val response = url.openStream().bufferedReader().readText()
The answer by @sbk is the way we should do it in a Spring Boot environment (apart from @Value("${classpath*:}")), in my opinion. But in my scenario it was not working when executed from a standalone jar; maybe I did something wrong.
But this can be another way of doing it:
InputStream is = this.getClass().getClassLoader().getResourceAsStream(<relative path of the resource from resource directory>);
I was having a more complex issue because I have more than one file with the same name: one is in the main Spring Boot jar and the others are in jars inside the main fat jar.
My solution was to get all the resources with the same name and then pick the one I needed by filtering on the package name.
To get all the files:
ResourceLoader resourceLoader = new FileSystemResourceLoader();
final Enumeration<URL> systemResources = resourceLoader.getClassLoader().getResources(fileNameWithoutExt + FILE_EXT);
In Spring Boot 1.5.22.RELEASE with jar packaging this worked for me:
InputStream resource = new ClassPathResource("example.pdf").getInputStream();
"example.pdf" is in src/main/resources.
And then to read it as byte[]
FileCopyUtils.copyToByteArray(resource);
I had the same issue, ended up using the much more convenient Guava Resources:
Resources.getResource("my.file")
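For example, reading the whole file into a String could look like this (a sketch; the file name is the same placeholder as above):

// Resources.getResource returns a classpath URL; Resources.toString reads it fully
URL url = Resources.getResource("my.file");
String contents = Resources.toString(url, StandardCharsets.UTF_8);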
While this is a very old thread, I also faced the same issue while adding FCM to a Spring Boot application.
In development the file was being opened with no errors, but when I deployed the application to AWS Elastic Beanstalk, a FileNotFoundException was thrown and FCM was not working.
So here's my solution to get it working in both the development environment and a production jar deployment.
I have a @Component class FCMService which has a method as follows:
@PostConstruct
public void initialize() {
    log.info("Starting FCM Service");
    InputStream inputStream;
    try {
        ClassPathResource resource = new ClassPathResource("classpath:fcm/my_project_firebase_config.json");
        URL url = null;
        try {
            url = resource.getURL();
        } catch (IOException e) {
            // fall through to the ResourceUtils fallback below
        }
        if (url != null) {
            inputStream = url.openStream();
        } else {
            File file = ResourceUtils.getFile("classpath:fcm/my_project_firebase_config.json");
            inputStream = new FileInputStream(file);
        }
        FirebaseOptions options = FirebaseOptions.builder().setCredentials(GoogleCredentials.fromStream(inputStream))
                .build();
        FirebaseApp.initializeApp(options);
        log.info("FCM Service started");
    } catch (IOException e) {
        log.error("Error starting FCM Service");
        e.printStackTrace();
    }
}
Hope this helps someone looking for a quick fix with implementing FCM.
Can be handled like:
var serviceAccount = ClassLoader.getSystemResourceAsStream(FB_CONFIG_FILE_NAME);
FirebaseOptions options = new FirebaseOptions.Builder()
.setCredentials(GoogleCredentials.fromStream(serviceAccount))
.build();
Where FB_CONFIG_FILE_NAME is name of file in your 'resources' folder.

Accessing files in hadoop distributed cache

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Where /user/peter/cacheFile/testCache1 is a file that exists in hdfs
Then, my setup function looks like this:
public void setup(Context context) throws IOException, InterruptedException{
Configuration conf = context.getConfiguration();
Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
//etc
}
However, this localFiles array is always null.
I was initially running on a single-host cluster for testing, but I read that this will prevent the distributed cache from working. I tried with a pseudo-distributed setup, but that didn't work either.
I'm using hadoop 1.0.3
thanks
Peter
Problem here was that I was doing the following:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on hadoop user list for the help.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/userpetercacheFiletestCache1"),job.getConfiguration());
You can also do it this way.
Note that once the Job has been created from a configuration object,
i.e.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
any later changes to attributes of conf, e.g.
conf.set("delimiter", "|");
or
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
would not be reflected in a pseudo-distributed or fully distributed cluster, although they would work in the local environment.
This version of the code (which is slightly different from the constructs mentioned above) has always worked for me:
//in main(String [] args)
Job job = new Job(conf,"Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in the Mapper code, so here it is:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read data
}

loading an external properties file in udf

When writing a UDF, let's say an EvalFunc, is it possible to load a configuration file with
properties = new Properties();
properties.load(new FileInputStream("conf/config.properties"));
when running in Hadoop mode?
Best,
Will
Here is a simple example of reading and writing files with the Hadoop DFS: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
Maybe you can find some useful code in it to complete your job.
The following is my code; it successfully loads a properties file in Hadoop. I used Apache Commons Configuration: http://commons.apache.org/configuration/
public static void loadProperties(String path) throws ConfigurationException, IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inFile = new Path(path);
    FSDataInputStream in = fs.open(inFile);
    PropertiesConfiguration config = new PropertiesConfiguration();
    config.load(in);
    in.close();
}
Use the Apache Commons Configuration2 and vfs2:
Parameters params = new Parameters();
FileBasedConfigurationBuilder<PropertiesConfiguration> builder =
new FileBasedConfigurationBuilder<>(PropertiesConfiguration.class)
.configure(params.fileBased().setFileSystem(new VFSFileSystem())
.setLocationStrategy(new FileSystemLocationStrategy())
.setEncoding("UTF-8").setFileName(propertyPath));
config = builder.getConfiguration();
