How to run Hadoop as part of the test suite of a Spring application?

I would like to set up a simple "Hello, World!" to get an understanding of how to use basic Hadoop functionality such as storing/reading files using HDFS.
Is it possible to:
Run an embedded Hadoop as part of my application?
Run an embedded Hadoop as part of my tests?
I would like to put together a minimal Spring Boot setup for this. What is the minimal Spring configuration required? There are plenty of examples illustrating how to read/write files using HDFS, but I still haven't been able to work out what I need as Spring configuration. It's a bit hard to figure out which libraries one really needs, as the Spring Hadoop examples seem to be out of date. Any help would be much appreciated.

You can easily use the Hadoop FileSystem API with any local POSIX filesystem without a Hadoop cluster.
The Hadoop API is very generic and provides many concrete implementations for different storage systems such as HDFS, S3, Azure Data Lake Store, etc.
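For example, here is a minimal sketch (not part of the original answer) of writing and reading a file through the FileSystem API against the local filesystem; the class name and path are illustrative:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsHelloWorld {

    public static void main(String[] args) throws IOException {
        // The same API works against file://, hdfs://, s3a://, etc.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);

        Path file = new Path(System.getProperty("java.io.tmpdir"), "hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, World!");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}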
You can embed HDFS within your application (i.e. run the NameNode and DataNodes within a single JVM process), but this is only reasonable for tests.
There is the Hadoop Minicluster, which you can start from the command line (CLI MiniCluster) or via the Java API in your unit tests with the MiniDFSCluster class found in the hadoop-minicluster package.
You can start the Mini Cluster with Spring by making a separate configuration class for it and using it via @ContextConfiguration in your unit tests.
import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.rules.TemporaryFolder;
import org.springframework.beans.factory.config.BeanDefinition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Primary;
import org.springframework.context.annotation.Scope;

@org.springframework.context.annotation.Configuration
public class MiniClusterConfiguration {

    // JUnit's TemporaryFolder manages the on-disk location of the mini cluster
    @Bean(name = "temp-folder", initMethod = "create", destroyMethod = "delete")
    public TemporaryFolder temporaryFolder() {
        return new TemporaryFolder();
    }

    @Bean
    public Configuration configuration(final TemporaryFolder temporaryFolder) {
        final Configuration conf = new Configuration();
        conf.set(
                MiniDFSCluster.HDFS_MINIDFS_BASEDIR,
                temporaryFolder.getRoot().getAbsolutePath()
        );
        return conf;
    }

    @Bean(destroyMethod = "shutdown")
    public MiniDFSCluster cluster(final Configuration conf) throws IOException {
        final MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
                .clusterId(String.valueOf(this.hashCode()))
                .build();
        cluster.waitClusterUp();
        return cluster;
    }

    @Bean
    public FileSystem fileSystem(final MiniDFSCluster cluster) throws IOException {
        return cluster.getFileSystem();
    }

    // A fresh HDFS directory per injection point
    @Bean
    @Primary
    @Scope(BeanDefinition.SCOPE_PROTOTYPE)
    public Path temp(final FileSystem fs) throws IOException {
        final Path path = new Path("/tmp", UUID.randomUUID().toString());
        fs.mkdirs(path);
        return path;
    }
}
You inject the FileSystem and a temporary Path into your tests, and as I mentioned above, from the API standpoint there is no difference whether it's a real cluster, a mini cluster, or the local filesystem. Note that there is a startup cost to this, so you likely want to avoid annotating your tests with @DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD), which would restart the cluster for each test method.
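For illustration, a hedged sketch of a JUnit 4 test wired against the configuration above; the runner, test class name, and assertion are illustrative, not prescribed by the original answer:
import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@ContextConfiguration(classes = MiniClusterConfiguration.class)
public class HdfsReadWriteTest {

    @Autowired
    private FileSystem fs;

    @Autowired
    private Path temp; // fresh HDFS directory, thanks to the prototype-scoped bean

    @Test
    public void writesAndReadsFile() throws IOException {
        Path file = new Path(temp, "hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, World!");
        }
        try (FSDataInputStream in = fs.open(file)) {
            assertEquals("Hello, World!", in.readUTF());
        }
    }
}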
If you want this code to run on Windows you will need a compatibility layer called winutils (which makes it possible to access the Windows filesystem in a POSIX-like way).
You have to point the HADOOP_HOME environment variable to it and, depending on the version, load its shared library:
// Point Hadoop at the winutils installation and a writable temp directory
String HADOOP_HOME = System.getenv("HADOOP_HOME");
System.setProperty("hadoop.home.dir", HADOOP_HOME);
System.setProperty("hadoop.tmp.dir", System.getProperty("java.io.tmpdir"));
// Load the native Windows shim shipped with winutils
final String lib = String.format("%s/lib/hadoop.dll", HADOOP_HOME);
System.load(lib);

Related

Spring with Apache Beam

I want to use Spring with Apache Beam running on the Google Cloud Dataflow runner. The Dataflow job should be able to use the Spring runtime application context while executing the pipeline steps. I want to use Spring features in my Apache Beam pipeline for DI and other things. After hours of searching, I couldn't find any post or documentation showing Spring integration with Apache Beam. So, if anyone has tried Spring with Apache Beam, please let me know.
In the main class I have initialized the Spring application context, but it is not available during execution of the pipeline steps. I get a NullPointerException for autowired beans. I guess the problem is that the context is not available to the worker threads at runtime.
public static void main(String[] args) {
    initSpringApplicationContext();
    GcmOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(GcmOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    // pipeline definition
}
I want to inject the Spring application context into each of the ParDo functions.
The problem here is that the ApplicationContext is not available on any worker, as the main method is only called when constructing the job and not on any worker machine. Therefore, initSpringApplicationContext is never called on any worker.
I've never tried to use Spring within Apache Beam, but I guess moving initSpringApplicationContext into a static initializer block will lead to your expected result.
public class ApplicationContextHolder {

    private static final ApplicationContext CTX;

    static {
        CTX = initApplicationContext();
    }

    public static ApplicationContext getContext() {
        return CTX;
    }
}
Please be aware that this alone shouldn't be considered a best practice for using Spring within Apache Beam, since it doesn't integrate well with Apache Beam's lifecycle. For example, if an error happens during the initialization of the application context, it will surface at the first place the ApplicationContextHolder happens to be used. Therefore, I'd recommend extracting initApplicationContext out of the static initializer block and calling it explicitly with regard to Apache Beam's lifecycle. The DoFn setup phase (@Setup) would be a good place for this.
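I haven't tried this myself, but a sketch of that idea might look like the following; EnrichFn, MyService, and ApplicationContextHolder.init() are illustrative names (init() standing in for the method extracted from the static initializer), not existing Beam or Spring API:
import org.apache.beam.sdk.transforms.DoFn;
import org.springframework.context.ApplicationContext;

public class EnrichFn extends DoFn<String, String> {

    private transient ApplicationContext ctx;

    @Setup
    public void setup() {
        // Runs on the worker, once per DoFn instance, so the context
        // is created where the elements are actually processed.
        ctx = ApplicationContextHolder.init();
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        MyService service = ctx.getBean(MyService.class);
        c.output(service.enrich(c.element()));
    }
}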

Hazelcast configure file storage

I need a Java cache with file storage that survives JVM crashes.
Previously I used ehcache, configured with .heap().disk().
However, it has a problem with unclean JVM shutdowns: the next startup clears the store.
My only requirement is that at least parts of the data survive a restart.
I tried to use Hazelcast; however, with the following code snippet, even a subsequent run of the program prints "null".
Please suggest how to configure Hazelcast so that cache.put is written to disk and loaded on startup.
import java.io.File;

import javax.cache.Cache;
import javax.cache.Caching;
import javax.cache.spi.CachingProvider;

import com.hazelcast.config.CacheConfig;
import com.hazelcast.config.Config;
import com.hazelcast.config.HotRestartClusterDataRecoveryPolicy;
import com.hazelcast.config.HotRestartPersistenceConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HazelcastTest {

    public static void main(String[] args) throws InterruptedException {
        System.setProperty("hazelcast.jcache.provider.type", "server");

        Config config = new Config();
        HotRestartPersistenceConfig hotRestartPersistenceConfig = new HotRestartPersistenceConfig()
                .setEnabled(true)
                .setBaseDir(new File("cache"))
                .setBackupDir(new File("cache/backup"))
                .setParallelism(1)
                .setClusterDataRecoveryPolicy(HotRestartClusterDataRecoveryPolicy.FULL_RECOVERY_ONLY);
        config.setHotRestartPersistenceConfig(hotRestartPersistenceConfig);

        HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);

        CacheConfig<String, String> cacheConfig = new CacheConfig<>();
        cacheConfig.getHotRestartConfig().setEnabled(true);
        cacheConfig.getHotRestartConfig().setFsync(true);

        CachingProvider cachingProvider = Caching.getCachingProvider();
        Cache<String, String> data = cachingProvider.getCacheManager().createCache("data", cacheConfig);
        System.out.println(data.get("test"));
        data.put("test", "value");
        data.close();
        instance.shutdown();
    }
}
Suggestions for other frameworks that could complete the task are also welcome.
@Igor, Hot Restart is an enterprise feature of Hazelcast. You need to use the Hazelcast Enterprise edition with a valid license key.
Do you really need to store in a file, or just persist cache data somewhere else? If you can use a database, you can use MapStore, which is available in the open source version, to write data to a persistent data store. You can even use write-behind mode to speed up writes. A sketch follows the sample link below.
See this sample project: https://github.com/hazelcast/hazelcast-code-samples/tree/master/distributed-map/mapstore
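For example, a hedged sketch of a MapStore wired with write-behind on an IMap named "data", assuming Hazelcast 3.x package names; JdbcMapStore and its persistence details are illustrative placeholders for whatever database access you use:
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.MapStore;

public class JdbcMapStore implements MapStore<String, String> {

    @Override public void store(String key, String value) { /* INSERT or UPDATE the row */ }
    @Override public void storeAll(Map<String, String> map) { map.forEach(this::store); }
    @Override public void delete(String key) { /* DELETE the row */ }
    @Override public void deleteAll(Collection<String> keys) { keys.forEach(this::delete); }
    @Override public String load(String key) { /* SELECT value by key */ return null; }
    @Override public Map<String, String> loadAll(Collection<String> keys) { return Collections.emptyMap(); }
    @Override public Iterable<String> loadAllKeys() { return Collections.emptyList(); }

    public static void main(String[] args) {
        Config config = new Config();
        MapStoreConfig mapStoreConfig = new MapStoreConfig()
                .setEnabled(true)
                .setImplementation(new JdbcMapStore())
                .setWriteDelaySeconds(5); // 0 = write-through, > 0 = write-behind
        config.getMapConfig("data").setMapStoreConfig(mapStoreConfig);

        HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
        IMap<String, String> data = instance.getMap("data");
        data.put("test", "value"); // persisted to the database asynchronously
    }
}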

ClassNotFound error for Ignite User Defined Function in Flink Cluster

I am trying to cache data streamed by Apache Flink into an Apache Ignite cache. I also want to run a query which uses a user-defined function. As per Ignite, I am using the cacheConf.setSqlFunctionClasses(GetCacheKey.class) setting when declaring the cache. The class declaration is as follows:
public static class GetCacheKey implements Serializable {

    @QuerySqlFunction
    public static long getCacheKey(int mac, long local) {
        long key = (local << 5) + mac;
        return key;
    }
}
When I run the code locally with Apache Flink, it works. But when I submit the code for execution on the Flink cluster, I get an error that the GetCacheKey class is not found. What could be the reason behind this?
Please check whether GetCacheKey.class is on the Ignite nodes' classpath.
The Flink directory must be available on every worker under the same path. You can use a shared NFS directory, or copy the entire Flink directory to every worker node.
Also ensure the Ignite libs are present on the worker nodes' classpath. A configuration sketch follows.
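Not verified against your cluster, but a sketch of the deployment-relevant part; the cache name and value types are illustrative, and the essential point is in the comments:
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheSetup {

    public static void main(String[] args) {
        // setSqlFunctionClasses only registers the UDF for SQL; it does not ship
        // the class. The jar containing GetCacheKey must therefore be on every
        // Ignite server node's classpath (e.g. dropped into $IGNITE_HOME/libs)
        // and on the Flink task managers' classpath as well.
        CacheConfiguration<Long, String> cacheConf = new CacheConfiguration<>("readings");
        cacheConf.setSqlFunctionClasses(GetCacheKey.class);

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, String> cache = ignite.getOrCreateCache(cacheConf);
            // SQL queries using getCacheKey(...) can now resolve the function
        }
    }
}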

Initialize Spring Cloud Config client for SDK development

This question is somewhat similar to this existing question
I am still trying to find the right Spring Boot code that I can customize. I need to develop a Java SDK which connects to an existing config server and provides values for keys. This SDK will be used in Java applications, which might or might not be Spring applications. The same SDK will be used by QA for regression testing of the config server.
So the question is: given
Config server URL
application name
active profile (no need for a label, it will be the default master),
can I initialize some config client class which will give me simple methods like public String getKeyValue(final String key)?
I am looking at the source of classes like ConfigServicePropertySourceLocator, CompositePropertySource, ConfigClientAutoConfiguration, ConfigServiceBootstrapConfiguration, etc.
Do I need to build the Environment object manually? If yes, how?
I have had some success. Posting a possible answer for others to fine-tune further.
@SpringBootApplication
public class ConfigSDKApp {

    @Autowired
    public SomeSpringBean someBean = null;

    private static ConfigSDKApp INSTANCE = null;

    public synchronized static ConfigSDKApp getInstance(String[] args) {
        if (null != INSTANCE) {
            return INSTANCE;
        }
        SpringApplication sprApp = new SpringApplication(ConfigSDKApp.class);
        sprApp.setWebEnvironment(false);
        ConfigurableApplicationContext appContext = sprApp.run(args);
        ConfigSDKApp app = appContext.getBean(ConfigSDKApp.class); // new ConfigSDKApp();
        INSTANCE = app;
        return INSTANCE;
    }
}
It's kind of a singleton class (but with a public constructor), hence a code smell.
Also, what if this SDK is running within a Spring Boot client application? The ApplicationContext and environment are already initialized there.
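Building on that, a hedged sketch of the getKeyValue method the question asks for: once the Boot context has pulled in the config server's property sources (via the usual spring.cloud.config.* bootstrap settings), the SDK can simply delegate to the Environment. ConfigSDKClient and its wiring are illustrative, not an established API:
import org.springframework.core.env.Environment;
import org.springframework.stereotype.Component;

@Component
public class ConfigSDKClient {

    private final Environment environment;

    public ConfigSDKClient(Environment environment) {
        this.environment = environment;
    }

    // Resolves the key against all property sources, including the ones
    // contributed by the config server during bootstrap.
    public String getKeyValue(final String key) {
        return environment.getProperty(key);
    }
}
A non-Spring caller could obtain this bean through the ConfigSDKApp.getInstance(args) singleton shown above and expose it from there.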

Using Spring Integration with Spring Batch

I have a Spring Batch application which reads from a file, does some processing, and finally writes customized output. This all happens in one step. In the next step I have a tasklet which archives the input files (moves them to another folder). This application works fine. But now I have a requirement to SFTP the output files to a remote server where they will be further processed. I found a way to SFTP using Spring Integration, where I created an input channel which feeds an outbound channel adapter. I put my files as payloads in messages and send the messages to the channel. The only problem I see here is that every time I have to get the context I need to load the Spring config file, which seems like a hackish way to do the task. Does anyone know of a way to integrate Spring Integration with Spring Batch?
Let me know if you want to see my config...
Thanks in advance!
Code to access the same application context without loading the Spring config again:
// Must be registered as a Spring bean so that the application context
// is injected at startup.
public class AppContextProvider implements ApplicationContextAware {

    private static ApplicationContext ctx;

    public ApplicationContext getApplicationContext() {
        return ctx;
    }

    @Override
    public void setApplicationContext(ApplicationContext appContext) throws BeansException {
        ctx = appContext;
    }
}
Code to push the output file to the SFTP server:
log.info("Starting transfer of outputFile : " + absoluteOutputFileName);
final File file = new File(absoluteOutputFileName);
final Message<File> message = MessageBuilder.withPayload(file).build();
AppContextProvider context = new AppContextProvider();
final MessageChannel inputChannel = context.getApplicationContext().getBean("toChannel",MessageChannel.class);
inputChannel.send(message);
log.info("transfer complete for : " + absoluteOutputFileName);
Take a look at the spring-batch-integration module within the Spring Batch project. In there, we have components for launching jobs via messages. In your situation, you'd FTP the file down then have the JobLaunchingMessageHandler launch the job.
You can also watch this video of a talk I co-presented at SpringOne a couple years ago on this topic: https://www.youtube.com/watch?v=8tiqeV07XlI
As Michael said, you'll definitely want to look at and leverage spring-batch-integration. We actually use Spring Integration as a wrapper of sorts to launch 100% of our Spring Batch jobs.
One use case we've found particularly useful is leveraging the spring-integration-file Inbound Channel Adapters to poll staging directories to indicate when a new batch file has landed. As the poller finds a new file, we then launch a new batch job using the input filename as a parameter.
This has been a real help when it comes to restartability, because we now have one job instance per file as opposed to having a job kick off at arbitrary intervals and then partition across however many files happen to be in the staging folder. Now if an exception occurs during processing, you can target a specific job for restart immediately rather than waiting for 99 of the 100 "good" files to finish first.
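To make that concrete, here is a hedged sketch of such a flow using the Spring Integration 5.x Java DSL; the staging directory, poll interval, job bean, and parameter name are illustrative, and it assumes spring-integration-file and spring-batch-integration are on the classpath alongside an @EnableBatchProcessing-provided JobLauncher:
import java.io.File;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.integration.launch.JobLaunchRequest;
import org.springframework.batch.integration.launch.JobLaunchingGateway;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.dsl.Files;

@Configuration
public class FilePollingJobLaunchConfig {

    @Bean
    public JobLaunchingGateway jobLaunchingGateway(JobLauncher jobLauncher) {
        return new JobLaunchingGateway(jobLauncher);
    }

    @Bean
    public IntegrationFlow fileToJobFlow(Job importJob, JobLaunchingGateway jobLaunchingGateway) {
        return IntegrationFlows
                .from(Files.inboundAdapter(new File("/staging")).patternFilter("*.csv"),
                        e -> e.poller(Pollers.fixedDelay(5000)))
                // one job instance per file, keyed by the file name
                .transform(File.class, file -> {
                    JobParameters params = new JobParametersBuilder()
                            .addString("input.file.name", file.getAbsolutePath())
                            .toJobParameters();
                    return new JobLaunchRequest(importJob, params);
                })
                .handle(jobLaunchingGateway)
                .log()
                .get();
    }
}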
