Passing objects to MapReduce from a driver - hadoop

I created a driver which reads a config file, builds a list of objects (based on the config) and passes that list to MapReduce (the MapReduce class has a static attribute which holds a reference to that list of objects).
It works, but only locally. As soon as I run the job on a cluster configuration I get all sorts of errors suggesting that the list hasn't been built. It makes me think I'm doing it wrong, and that on a cluster setup MapReduce runs independently from the driver.
My question is how to correctly initialise a Mapper.
(I'm using Hadoop 2.4.1)

This is related to the problem of side data distribution.
There are two approaches for side data distribution.
1) Distributed Caches
2) Configuration
As you have the objects to be shared, we can use the Configuration class.
This discussion will depend on the Configuration class to make an object available across the cluster, accessible to all Mappers and/or Reducers. The approach here is quite simple: the set(String, String) setter of the Configuration class is used for this task. The object to be shared is serialized into a Java String at the driver end and deserialized back into an object in the Mapper or Reducer.
In the example code below, I have used the com.google.gson.Gson class for easy serialization and deserialization. You can use Java serialization as well.
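Before the Gson example, here is a rough sketch of what that plain-Java-serialization alternative could look like. The helper class and method names are made up for illustration, it assumes the shared bean implements java.io.Serializable, and java.util.Base64 requires Java 8 (an equivalent encoder can be substituted otherwise):
import java.io.*;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;

public class ConfSerializationHelper {

    // Driver side: serialize the object and store it in the Configuration as a Base64 string.
    public static void putObject(Configuration conf, String key, Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        conf.set(key, Base64.getEncoder().encodeToString(bos.toByteArray()));
    }

    // Mapper/Reducer side: read the string back from the Configuration and deserialize it.
    public static Object getObject(Configuration conf, String key) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(conf.get(key));
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
The rest of the flow is identical: set the encoded string in the driver, then read and decode it in the Mapper's setup().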
Class that Represents the Object You need to Share
public class TestBean {
String string1;
String string2;
public TestBean(String test1, String test2) {
super();
this.string1 = test1;
this.string2 = test2;
}
public TestBean() {
this("", "");
}
public String getString1() {
return string1;
}
public void setString1(String test1) {
this.string1 = test1;
}
public String getString2() {
return string2;
}
public void setString2(String test2) {
this.string2 = test2;
}
}
The Main Class from where you can set the Configurations
public class GSONTestDriver {
public static void main(String[] args) throws Exception {
System.out.println("In Main");
Configuration conf = new Configuration();
TestBean testB1 = new TestBean("Hello1","Gson1");
TestBean testB2 = new TestBean("Hello2","Gson2");
Gson gson = new Gson();
String testSerialization1 = gson.toJson(testB1);
String testSerialization2 = gson.toJson(testB2);
conf.set("instance1", testSerialization1);
conf.set("instance2", testSerialization2);
Job job = new Job(conf, " GSON Test");
job.setJarByClass(GSONTestDriver.class);
job.setMapperClass(GSONTestMapper.class);
job.setNumReduceTasks(0);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The mapper class from where you can retrieve the object
public class GSONTestMapper extends
Mapper<LongWritable, Text, Text, NullWritable> {
Configuration conf;
String inst1;
String inst2;
public void setup(Context context) {
conf = context.getConfiguration();
inst1 = conf.get("instance1");
inst2 = conf.get("instance2");
Gson gson = new Gson();
TestBean tb1 = gson.fromJson(inst1, TestBean.class);
System.out.println(tb1.getString1());
System.out.println(tb1.getString2());
TestBean tb2 = gson.fromJson(inst2, TestBean.class);
System.out.println(tb2.getString1());
System.out.println(tb2.getString2());
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(value,NullWritable.get());
}
}
The bean is converted to a serialized JSON string using the toJson(Object src) method of the com.google.gson.Gson class. The serialized JSON string is then passed as a value through the Configuration instance and accessed by name in the Mapper, where it is deserialized using the fromJson(String json, Class<T> classOfT) method of the same Gson class. Instead of my test bean, you could use your own objects.
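For completeness, the first approach listed above, the distributed cache, is usually a better fit when the side data is a file rather than a couple of small objects. A minimal sketch on the Hadoop 2.x API, reusing the job variable from the driver above (the HDFS path and file name are made up):
// Driver side: ship an HDFS file to every task. The "#side-data.json" fragment
// creates a symlink with that name in each task's working directory.
job.addCacheFile(new java.net.URI("/user/hadoop/side-data.json#side-data.json"));

// Mapper side, e.g. inside setup(): the cached file is available locally via the symlink.
try (java.io.BufferedReader reader =
        new java.io.BufferedReader(new java.io.FileReader("side-data.json"))) {
    String json = reader.readLine();
    TestBean bean = new Gson().fromJson(json, TestBean.class);
    // use the bean to initialise whatever the map/reduce logic needs
}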

Related

Set hashmap in JobParameter in spring boot

I am working on creating an Excel file from data, and for that I have created a job. I want to set a HashMap in the job parameters so that I can use it in my MyReader class; I have created a CustomJobParameter class.
Below you can find the code that builds the job parameters:
Get Job Parameters:
public JobParameters createJobParam (MyRequest request) {
final JobParameters parameters = new JobParametersBuilder()
.addString("MyParam1", request.getReportGenerationJobId())
.addString("MyParam2", request.getSessionId())
.addLong("time", System.currentTimeMillis())
.addParameter(
"MyObject",
new MyUtils.CustomJobParameter(request.getHsSlideArticles())
)
.toJobParameters();
return parameters;
}
CustomJobParameter Class written in MyUtils class:
public static class CustomJobParameter<T extends Serializable> extends JobParameter {
private HashMap customParam;
public CustomJobParameter (HashMap slideArticles) {
super("");
this.customParam = slideArticles;
}
public HashMap getValue () {
return customParam;
}
}
But when I set the custom parameter this way, a blank string is stored instead of the object I am passing.
How can I pass the HashMap to my reader?
According to the documentation for JobParameter, a job parameter can only be a String, Long, Date, or Double.
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/JobParameter.html
Domain representation of a parameter to a batch job. Only the following types can be parameters: String, Long, Date, and Double.
The identifying flag is used to indicate if the parameter is to be
used as part of the identification of a job instance.
Therefore you cannot extend JobParameter and expect it to work with a HashMap.
However there is another option, JobParameters:
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/JobParameters.html
https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/JobParametersBuilder.html
You could create a Map<String, JobParameter> instead :
Example:
new JobParameters(Collections.singletonMap("yearMonth", new JobParameter("2021-07")))
and then use the JobParametersBuilder method addJobParameters in your createJobParam to simply add all of your Map<String, JobParameter> entries:
addJobParameters(JobParameters jobParameters) // Copy job parameters into the current state.
So your method will look like:
public JobParameters createJobParam(MyRequest request) {
final JobParameters parameters = new JobParametersBuilder()
.addString("MyParam1", request.getReportGenerationJobId())
.addString("MyParam2", request.getSessionId())
.addLong("time", System.currentTimeMillis())
.addJobParameters(new JobParameters(myMapOfJobParameters)) // your Map<String, JobParameter>
.toJobParameters();
return parameters;
}
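For completeness, a rough sketch of how the myMapOfJobParameters map used above could be built from the HashMap on the request. The getter name follows the question; the String-valued map is an assumption, since JobParameter only accepts String, Long, Date, or Double:
private Map<String, JobParameter> buildParameterMap(MyRequest request) {
    Map<String, JobParameter> myMapOfJobParameters = new HashMap<>();
    // Assumes the slide-articles HashMap holds String values; adapt the conversion otherwise.
    HashMap<String, String> slideArticles = request.getHsSlideArticles();
    for (Map.Entry<String, String> entry : slideArticles.entrySet()) {
        myMapOfJobParameters.put(entry.getKey(), new JobParameter(entry.getValue()));
    }
    return myMapOfJobParameters;
}
Each entry then arrives in the reader as an ordinary String job parameter.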

How to parse a list of list of spring properties

I have this Spring boot application.properties
list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=list1,list2,list3
What I'm trying to do is to use, in the topics element of the #KafkaListener annotation, the values of the properties listed in the topics property.
Using the expression
#KafkaListener(topics={"#{'${topics}'.split(',')}"})
I get list1,list2,list3 as separate strings.
How can I loop over this list in order to get valueA,valueB,valueC,valueD,valueE?
Edit: I must parse the topics property so that #KafkaListener registers to consume messages from the topics valueA, valueB, valueC, etc.
I read that it is possible to call a method in this way:
#KafkaListener(topics="#parse(${topics})")
So, I wrote this method:
public String[] parse(String s) {
ExpressionParser parser = new SpelExpressionParser();
return Arrays.stream(s.split(",")).map(key -> (String) parser.parseExpression(key).getValue()).toArray(String[]::new);
}
But the parse method is not invoked.
So I tried to do this directly in the annotation, like this:
#KafkaListener(topics="#{Arrays.stream('${topics}'.split(',')).map(key->${key}).toArray(String[]::new)}")
But this solution also gives me errors.
Edit 2:
If I modify it in this way, the method is invoked:
#KafkaListener(topics="parse()")
#Bean
public String[] parse(String s) {
...
}
The problem is how to get the "topics" property inside the method.
You can't invoke arbitrary methods like that; you need to reference a bean #someBean.parse(...); using #parse requires registering a static method as a function.
However, this works for me and is much simpler:
list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=${list1},${list2},${list3}
and
#KafkaListener(id = "so64390079", topics = "#{'${topics}'.split(',')}")
EDIT
If you can't use placeholders in topics, this works...
#SpringBootApplication
public class So64390079Application {
public static void main(String[] args) {
SpringApplication.run(So64390079Application.class, args);
}
#KafkaListener(id = "so64390079", topics = "#{#parser.parse('${topics}')}")
public void listen(String in) {
System.out.println(in);
}
}
#Component
class Parser implements EnvironmentAware {
private Environment environment;
#Override
public void setEnvironment(Environment environment) {
this.environment = environment;
}
public String[] parse(String[] topics) {
StringBuilder sb = new StringBuilder();
for (String topic : topics) {
sb.append(this.environment.getProperty(topic));
sb.append(',');
}
return StringUtils.commaDelimitedListToStringArray(sb.toString().substring(0, sb.length() - 1));
}
}

Define prefix root node for Jackson Serialization/Deserialization YAML document to POJO with Prefix

I found https://github.com/FasterXML/jackson-dataformat-yaml to deserialize/serialize YAML files. However, I'm having a hard time deserializing/serializing the following:
I want to define a prefix under which the actual document is parsed into a POJO, similar to selecting a subtree of the document.
I want to define a POJO that is a simple object representation instead of creating multiple nested objects.
The error "Unrecognized field "spring" (class ConfigServerProperties), not marked as ignorable (one known property: "repos"])" is shown, but I don't know how to make the prefix "spring.cloud.config.server.git" the root element of the POJO.
Document
spring:
  cloud:
    config:
      server:
        git:
          repos:
            publisher:
              uri: 'https://github.company.com/toos/spring-cloud-config-publisher-config'
              cloneOnStart: true
              username: myuser
              password: password
              pullOnRequest: false
              differentProperty: My Value
            config_test_server_config:
              uri: 'https://github.company.com/mdesales/config-test-server-config'
              cloneOnStart: true
              username: 226b4bb85aa131cd6393acee9c484ec426111d16
              password: ""
              completelyDifferentProp: this is a different one
For this document, the requirements are as follows:
* I want to define the prefix as "spring.cloud.config.server.git".
* I want to create a POJO that represents the object.
POJO
I created the following POJOs to represent this.
ConfigServerProperties: represents the top pojo containing the list of repos.
ConfigServerOnboard: represents each of the elements of the document.
The properties are stored in a map, so that we can add as many different properties as needed.
Each class is as follows:
public class ConfigServerProperties {
private Map<String, ConfigServerOnboard> repos;
public void setRepos(Map<String, ConfigServerOnboard> repos) {
this.repos = repos;
}
public Map<String, ConfigServerOnboard> getRepos() {
return this.repos;
}
}
The second class is as follows:
public class ConfigServerOnboard {
private Map<String, String> properties;
public Map<String, String> getProperties() {
return properties;
}
public void setProperties(Map<String, String> properties) {
this.properties = properties;
}
}
Deserialize
The deserialization strategy I tried is as follows:
public static ConfigServerProperties parseProperties(File filePath)
throws JsonParseException, JsonMappingException, IOException {
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
JsonNodeFactory jsonNodeFactory = new JsonNodeFactory(false);
jsonNodeFactory.textNode("spring.cloud.config");
// tried to use this attempting to get the prefix
mapper.setNodeFactory(jsonNodeFactory);
ConfigServerProperties user = mapper.readValue(filePath, ConfigServerProperties.class);
return user;
}
Error Returned
Exception in thread "main" com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "spring" (class com.company.platform.config.onboarding.files.config.model.ConfigServerProperties), not marked as ignorable (one known property: "repos"])
at [Source: /tmp/config-server-onboards.yml; line: 3, column: 3] (through reference chain: com.company.platform.config.onboarding.files.config.model.ConfigServerProperties["spring"])
at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:62)
at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:834)
at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1094)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1470)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1448)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:282)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:140)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3798)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2740)
at com.company.platform.config.onboarding.files.config.model.ConfigServerProperties.parseProperties(ConfigServerProperties.java:37)
at com.company.platform.config.onboarding.files.config.model.ConfigServerProperties.main(ConfigServerProperties.java:42)
Edit 1: Looking for a possible SpringBoot Solution
I'm open to solutions using SpringBoot's ConfigurationProperties("spring.cloud.config.server.git"). That way, we could have the following:
#ConfigurationProperties("spring.cloud.config.server.git")
public class Configuration {
private Map<String, Map<String, String>> repos = new LinkedHashMap<String, Map<String, String>>();
// getter/setter
}
Questions
How to set the root element of the document?
Deserialization must read the document and produce instances of the POJOs.
Serialization must produce the same document with updated values.
I had to come up with the following:
Create 6 classes, each of them with the property required for the prefix "spring.cloud.config.server.git"
SpringCloudConfigSpring.java
SpringCloudConfigCloud.java
SpringCloudConfigConfig.java
SpringCloudConfigServer.java
SpringCloudConfigGit.java
The holder of all of them is SpringCloudConfigFile.java.
The holder and all the classes have a reference to the next property, which has a reference to the next, etc, with their own setter/getter methods as usual.
public class SpringCloudConfigSpring {
private SpringCloudConfigCloud cloud;
public SpringCloudConfigCloud getCloud() {
return cloud;
}
public void setCloud(SpringCloudConfigCloud cloud) {
this.cloud = cloud;
}
}
This made it easy to implement the representation of the map.
For the last one I used a TreeMap to keep the keys sorted, and an inner map to represent any property that may be added, without changing the representation.
public class SpringCloudConfigGit {
TreeMap<String, Map<String, Object>> repos;
public TreeMap<String, Map<String, Object>> getRepos() {
return repos;
}
public void setRepos(TreeMap<String, Map<String, Object>> repos) {
this.repos = repos;
}
}
Results
Creating the verification as follows:
public static void main(String[] args) throws JsonParseException, JsonMappingException, IOException {
File config = new File("/tmp/config-server-onboards.yml");
SpringCloudConfigFile props = ConfigServerProperties.parseProperties(config);
props.getSpring().getCloud().getConfig().getServer().getGit().getRepos().forEach((appName, properties) -> {
System.out.println("################## " + appName + " #######################3");
System.out.println(properties);
if (appName.equals("github_pages_reference")) {
properties.put("name", "Marcello");
properties.put("cloneOnStart", true);
}
System.out.println("");
});
saveProperties(new File(config.getAbsoluteFile().getParentFile(), "updated-config-onboards.yml"), props);
}
The output is as follows:
################## config_onboarding #######################3
{uri=https://github.company.com/servicesplatform-tools/spring-cloud-config-onboarding-config, cloneOnStart=true, username=226b4bb85aa131cd6393acee9c484ec426111d16, password=, pullOnRequest=false}
################## config_test_server_config #######################3
{uri=https://github.company.com/rlynch2/config-test-server-config, cloneOnStart=true, username=226b4bb85aa131cd6393acee9c484ec426111d16, password=, pullOnRequest=false}
################## github_pages_reference #######################3
{uri=https://github.company.com/servicesplatform-tools/spring-cloud-config-reference-service-config, cloneOnStart=true, username=226b4bb85aa131cd6393acee9c484ec426111d16, password=, pullOnRequest=false}
There are obvious improvements required:
I'd like to have a solution with a single class;
I'd like to have an ObjectMapper method that specifies the "subtree" of the YAML object tree that I'd like to parse (see the sketch after this list).
Maybe a more sophisticated SpringBoot-like #ConfigurationProperties("spring.cloud.config.server.git") would help.
Helper methods for loading and saving the state of these instances.
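As a rough sketch of that "subtree" wish (this is not part of the solution above), Jackson can read the whole document into a tree and then bind only the node at a JSON Pointer. The SubtreeParser and GitSubtree names are made up, and the pointer string assumes the document layout shown in the question:
import java.io.File;
import java.io.IOException;
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

public class SubtreeParser {

    // Illustrative target type: matches the shape directly under the "git" node.
    public static class GitSubtree {
        private Map<String, Map<String, Object>> repos;
        public Map<String, Map<String, Object>> getRepos() { return repos; }
        public void setRepos(Map<String, Map<String, Object>> repos) { this.repos = repos; }
    }

    // Binds only the spring.cloud.config.server.git subtree of the YAML document.
    public static GitSubtree parseGitSubtree(File filePath) throws IOException {
        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
        JsonNode root = mapper.readTree(filePath);
        // JSON Pointer into the parsed tree; returns a "missing node" if the path is absent.
        JsonNode gitNode = root.at("/spring/cloud/config/server/git");
        return mapper.treeToValue(gitNode, GitSubtree.class);
    }
}
This only covers the read side; producing the same document with updated values would still require re-attaching the modified subtree to the full tree before serializing.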
Load Method
public static SpringCloudConfigFile parseProperties(File filePath)
throws JsonParseException, JsonMappingException, IOException {
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
SpringCloudConfigFile file = mapper.readValue(filePath, SpringCloudConfigFile.class);
return file;
}
Save Properties
public static void saveProperties(File filePath, SpringCloudConfigFile file) throws JsonGenerationException, JsonMappingException, IOException {
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
mapper.writeValue(filePath, file);
}
File Saved
The saved file maintained the sorted keys, as implemented.

Spring batch : Assemble a job rather than configuring it (Extensible job configuration)

Background
I am working on designing a file reading layer that can read delimited files and load them into a List. I have decided to use Spring Batch because it provides a lot of scalability options which I can leverage for different sets of files depending on their size.
The requirement
I want to design a generic Job API that can be used to read any delimited file.
There should be a single Job structure that is used for parsing every delimited file. For example, if the system needs to read 5 files, there will be 5 jobs (one for each file). The only way the 5 jobs will differ from each other is that they will use a different FieldSetMapper, column names, directory path, and additional scaling parameters such as commit-interval and throttle-limit.
The user of this API should not need to configure a Spring Batch job, step, chunking, partitioning, etc. on his own when a new file type is introduced in the system.
All that the user needs to do is provide the FieldSetMapper to be used by the job, along with the commit-interval, throttle-limit and the directory where each type of file will be placed.
There will be one predefined directory per file type. Each directory can contain multiple files of the same type and format. A MultiResourcePartitioner will be used to look inside a directory. The number of partitions = the number of files in the directory.
My requirement is to build a Spring Batch infrastructure that gives me a unique job I can launch once I have the bits and pieces that will make up the job.
My solution :
I created an abstract configuration class that will be extended by concrete configuration classes (There will be 1 concrete class per file to be read).
#Configuration
#EnableBatchProcessing
public abstract class AbstractFileLoader<T> {
private static final String FILE_PATTERN = "*.dat";
#Autowired
JobBuilderFactory jobs;
#Autowired
ResourcePatternResolver resourcePatternResolver;
public final Job createJob(Step s1, JobExecutionListener listener) {
return jobs.get(this.getClass().getSimpleName())
.incrementer(new RunIdIncrementer()).listener(listener)
.start(s1).build();
}
public abstract Job loaderJob(Step s1, JobExecutionListener listener);
public abstract FieldSetMapper<T> getFieldSetMapper();
public abstract String getFilesPath();
public abstract String[] getColumnNames();
public abstract int getChunkSize();
public abstract int getThrottleLimit();
#Bean
#StepScope
#Value("#{stepExecutionContext['fileName']}")
public FlatFileItemReader<T> reader(String file) {
FlatFileItemReader<T> reader = new FlatFileItemReader<T>();
String path = file.substring(file.indexOf(":") + 1, file.length());
FileSystemResource resource = new FileSystemResource(path);
reader.setResource(resource);
DefaultLineMapper<T> lineMapper = new DefaultLineMapper<T>();
lineMapper.setFieldSetMapper(getFieldSetMapper());
DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
tokenizer.setNames(getColumnNames());
lineMapper.setLineTokenizer(tokenizer);
reader.setLineMapper(lineMapper);
reader.setLinesToSkip(1);
return reader;
}
#Bean
public ItemProcessor<T, T> processor() {
// TODO add transformations here
return null;
}
#Bean
#JobScope
public ListItemWriter<T> writer() {
ListItemWriter<T> writer = new ListItemWriter<T>();
return writer;
}
#Bean
#JobScope
public Step readStep(StepBuilderFactory stepBuilderFactory,
ItemReader<T> reader, ItemWriter<T> writer,
ItemProcessor<T, T> processor, TaskExecutor taskExecutor) {
final Step readerStep = stepBuilderFactory
.get(this.getClass().getSimpleName() + " ReadStep:slave")
.<T, T> chunk(getChunkSize()).reader(reader)
.processor(processor).writer(writer).taskExecutor(taskExecutor)
.throttleLimit(getThrottleLimit()).build();
final Step partitionedStep = stepBuilderFactory
.get(this.getClass().getSimpleName() + " ReadStep:master")
.partitioner(readerStep)
.partitioner(
this.getClass().getSimpleName() + " ReadStep:slave",
partitioner()).taskExecutor(taskExecutor).build();
return partitionedStep;
}
/*
* #Bean public TaskExecutor taskExecutor() { return new
* SimpleAsyncTaskExecutor(); }
*/
#Bean
#JobScope
public Partitioner partitioner() {
MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
Resource[] resources;
try {
resources = resourcePatternResolver.getResources("file:"
+ getFilesPath() + FILE_PATTERN);
} catch (IOException e) {
throw new RuntimeException(
"I/O problems when resolving the input file pattern.", e);
}
partitioner.setResources(resources);
return partitioner;
}
#Bean
#JobScope
public JobExecutionListener listener(ListItemWriter<T> writer) {
return new JobCompletionNotificationListener<T>(writer);
}
/*
* Use this if you want the writer to have job scope (JIRA BATCH-2269). Also
* change the return type of writer to ListItemWriter for this to work.
*/
#Bean
public TaskExecutor taskExecutor() {
return new SimpleAsyncTaskExecutor() {
#Override
protected void doExecute(final Runnable task) {
// gets the jobExecution of the configuration thread
final JobExecution jobExecution = JobSynchronizationManager
.getContext().getJobExecution();
super.doExecute(new Runnable() {
public void run() {
JobSynchronizationManager.register(jobExecution);
try {
task.run();
} finally {
JobSynchronizationManager.close();
}
}
});
}
};
}
}
Let's say I have to read Invoice data for the sake of discussion. I can therefore extend the above class for creating an InvoiceLoader :
#Configuration
public class InvoiceLoader extends AbstractFileLoader<Invoice>{
private class InvoiceFieldSetMapper implements FieldSetMapper<Invoice> {
public Invoice mapFieldSet(FieldSet f) {
Invoice invoice = new Invoice();
invoice.setNo(f.readString("INVOICE_NO"));
return invoice;
}
}
#Override
public FieldSetMapper<Invoice> getFieldSetMapper() {
return new InvoiceFieldSetMapper();
}
#Override
public String getFilesPath() {
return "I:/CK/invoices/partitions/";
}
#Override
public String[] getColumnNames() {
return new String[] { "INVOICE_NO", "DATE"};
}
#Override
#Bean(name="invoiceJob")
public Job loaderJob(Step s1,
JobExecutionListener listener) {
return createJob(s1, listener);
}
#Override
public int getChunkSize() {
return 25254;
}
#Override
public int getThrottleLimit() {
return 8;
}
}
Let's say I have one more class called InventoryLoader that extends AbstractFileLoader.
On application startup, I can load these two annotation configurations as follows :
AbstractApplicationContext context1 = new AnnotationConfigApplicationContext(InvoiceLoader.class, InventoryLoader.class);
Somewhere else in my application two different threads can launch the jobs as follows :
Thread 1 :
JobLauncher jobLauncher1 = context1.getBean(JobLauncher.class);
Job job1 = context1.getBean("invoiceJob", Job.class);
JobExecution jobExecution = jobLauncher1.run(job1, jobParams1);
Thread 2 :
JobLauncher jobLauncher1 = context1.getBean(JobLauncher.class);
Job job1 = context1.getBean("inventoryJob", Job.class);
JobExecution jobExecution = jobLauncher1.run(job1, jobParams1);
The advantage of this approach is that every time there is a new file to be read, all that the developer/user has to do is subclass AbstractFileLoader and implement the required abstract methods, without the need to get into the details of how to assemble the job.
The questions:
I am new to Spring Batch, so I may have overlooked some of the not-so-obvious issues with this approach, such as shared internal objects in Spring Batch that may cause two jobs running together to fail, or obvious issues such as the scoping of the beans.
Is there a better way to achieve my objective?
The fileName attribute of the #Value("#{stepExecutionContext['fileName']}") is always assigned the value I:/CK/invoices/partitions/, which is the value returned by the getFilesPath method in InvoiceLoader, even though the getFilesPath method in InventoryLoader returns a different value.
One option is passing them as job parameters. For instance:
#Bean
Job job() {
jobs.get("myJob").start(step1(null)).build()
}
#Bean
#JobScope
Step step1(#Value('#{jobParameters["commitInterval"]}') commitInterval) {
steps.get('step1')
.chunk((int) commitInterval)
.reader(new IterableItemReader(iterable: [1, 2, 3, 4], name: 'foo'))
.writer(writer(null))
.build()
}
#Bean
#JobScope
ItemWriter writer(#Value('#{jobParameters["writerClass"]}') writerClass) {
applicationContext.classLoader.loadClass(writerClass).newInstance()
}
With MyWriter:
class MyWriter implements ItemWriter<Integer> {
#Override
void write(List<? extends Integer> items) throws Exception {
println "Write $items"
}
}
Then executed with:
def jobExecution = launcher.run(ctx.getBean(Job), new JobParameters([
commitInterval: new JobParameter(3),
writerClass: new JobParameter('MyWriter'), ]))
Output is:
INFO: Executing step: [step1]
Write [1, 2, 3]
Write [4]
Feb 24, 2016 2:30:22 PM org.springframework.batch.core.launch.support.SimpleJobLauncher$1 run
INFO: Job: [SimpleJob: [name=myJob]] completed with the following parameters: [{commitInterval=3, writerClass=MyWriter}] and the following status: [COMPLETED]
Status is: COMPLETED, job execution id 0
#1 step1 COMPLETED
Full example here.

Hadoop - Mysql new API connection

I am trying to set MySQL as the input of a Hadoop process. How do I use the DBInputFormat class for a Hadoop - MySQL connection in version 1.0.3? The configuration of the job via JobConf from hadoop-1.0.3/docs/api/ doesn't work.
// Create a new JobConf
JobConf job = new JobConf(new Configuration(), MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setCombinerClass(MyJob.MyReducer.class);
job.setReducerClass(MyJob.MyReducer.class);
job.setInputFormat(SequenceFileInputFormat.class);
job.setOutputFormat(SequenceFileOutputFormat.class);
You need to do something like the following (assuming the typical employee table for example):
JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mydatabase");
String[] fields = { "employee_id", "name" };
DBInputFormat.setInput(conf, MyRecord.class, "employees", null /* conditions */, "employee_id", fields);
...
// other necessary configuration
JobClient.runJob(conf);
The configureDB() and setInput() calls configure the DBInputFormat. The first call specifies the JDBC driver implementation to use and which database to connect to. The second call specifies what data to load from the database. The MyRecord class is the class into which data will be read in Java, and "employees" is the name of the table to read. The "employee_id" parameter specifies the table's primary key, used for ordering results; the section "Limitations of the InputFormat" in the linked Cloudera article explains why this is necessary. Finally, the fields array lists which columns of the table to read. An overloaded definition of setInput() allows you to specify an arbitrary SQL query to read from instead (see the sketch after the next paragraph).
After calling configureDB() and setInput(), you should configure the rest of your job as usual, setting the Mapper and Reducer classes, specifying any other data sources to read from (e.g., datasets in HDFS) and other job-specific parameters.
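For reference, here is a rough sketch of that query-based setInput() overload from the old mapred API; the WHERE clause, status column and count query are only examples:
// Same JobConf and MyRecord class as above; the SQL here is illustrative.
DBInputFormat.setInput(
    conf,
    MyRecord.class,
    "SELECT employee_id, name FROM employees WHERE status = 'ACTIVE' ORDER BY employee_id",
    "SELECT COUNT(*) FROM employees WHERE status = 'ACTIVE'");
A deterministic ORDER BY helps keep the per-split pagination of the query consistent.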
You need to create your own implementation of Writable and DBWritable, something like the following (considering id and name as table fields):
class MyRecord implements Writable, DBWritable {
long id;
String name;
public void readFields(DataInput in) throws IOException {
this.id = in.readLong();
this.name = Text.readString(in);
}
public void readFields(ResultSet resultSet) throws SQLException {
this.id = resultSet.getLong(1);
this.name = resultSet.getString(2); }
public void write(DataOutput out) throws IOException {
out.writeLong(this.id);
Text.writeString(out, this.name); }
public void write(PreparedStatement stmt) throws SQLException {
stmt.setLong(1, this.id);
stmt.setString(2, this.name); }
}
The mapper then receives an instance of your DBWritable implementation as its input value. The input key is a row id provided by the database; you’ll most likely discard this value.
public class MyMapper extends MapReduceBase implements Mapper<LongWritable, MyRecord, LongWritable, Text> {
public void map(LongWritable key, MyRecord val, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
// Use val.id, val.name here
output.collect(new LongWritable(val.id), new Text(val.name));
}
}
For more, read the following link (the actual source of my answer): http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
Have a look at this post. It shows how to sink data from MapReduce into a MySQL database.
