We are trying to achieve parallelism through the Spark executors. These are the steps we follow:
Read from Hive.
Transform the data (custom Spring library).
Ship it via a REST endpoint in batches (1000 records per batch).
Problem: We want to do all of these steps in parallel, and we want to use a Spring Boot based library.
Understanding: If we use custom code for the transformation (the code we want to run in parallel, most probably inside the rdd.map() method), then our classes and their composite dependencies need to be serializable, i.e. those classes need to implement Serializable.
We know we can achieve this by performing the tasks in sequence on the driver, but then we need to collect the data on the driver again and again and pass it to the next step. In that case we are not leveraging the power of the executors.
We need your assistance here:
If we ship this Spring Boot dependency to the executors, is there any way for the executors to understand the Spring Boot code and resolve the annotations there?
Sample code:
Code from the Spring Boot library:
public class Process {
    String convert(Row row) {
        return row.mkString();
    }
}
@Component
@ConditionalOnProperty(name = "process.dummy.serialize", havingValue = "true")
class ProcessNotSerialized extends Process {
    @Autowired
    private RecordService recorService; //not-serialized
    public void setName(String name) {
        this.name = name;
    }
    private String name;
    @Override
    public String toString() {
        return "ProcessNotSerialized";
    }
}
Code from my Spark Spring Boot application:
Dataset<Row> sqlDf = sparkSession.sql(sqlQuery); // millions of rows
ProcessNotSerialized process = new ProcessNotSerialized();
System.out.println("object name=>" + process.toString());
List<String> listColumns = sqlDf.select(column)
        .javaRDD()
        .map(row -> {
            return process.convert(row);
        })
        .collect();
Here, the code inside map() will execute in parallel.
Please let me know if there is a better way than serialization or running everything on the driver.
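One pattern that is often suggested for this kind of problem is to build the non-serializable object on the executor itself, once per JVM, and to process each partition as a whole (for example from inside mapPartitions) instead of capturing the object in the closure. The sketch below is plain Java with no Spark or Spring on the classpath, so it only illustrates the shape of the idea. ProcessHolder, PartitionWorker, the upper-casing stand-in for convert() and the sender callback are all hypothetical names, and whether a Spring context can really be bootstrapped this way on your executors is an assumption you would need to verify.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical holder: creates the non-serializable converter lazily, once per JVM.
// Because the field is static, it is never part of a serialized closure.
final class ProcessHolder {
    // Counts how often the converter was built (only here to make the sketch observable).
    static final AtomicInteger CREATED = new AtomicInteger();
    private static Function<String, String> process;

    private ProcessHolder() {}

    static synchronized Function<String, String> get() {
        if (process == null) {
            CREATED.incrementAndGet();
            // Assumption: this is where the Spring-managed bean would be obtained.
            process = row -> row.toUpperCase();
        }
        return process;
    }
}

// Hypothetical per-partition body: transform each row, ship results in fixed-size batches.
final class PartitionWorker {
    private PartitionWorker() {}

    static int processPartition(Iterator<String> rows, int batchSize,
                                Consumer<List<String>> sender) {
        Function<String, String> process = ProcessHolder.get();
        List<String> batch = new ArrayList<>(batchSize);
        int batchesSent = 0;
        while (rows.hasNext()) {
            batch.add(process.apply(rows.next()));
            if (batch.size() == batchSize) {
                sender.accept(batch);           // e.g. one REST call per batch
                batch = new ArrayList<>(batchSize);
                batchesSent++;
            }
        }
        if (!batch.isEmpty()) {                 // last, possibly smaller, batch
            sender.accept(batch);
            batchesSent++;
        }
        return batchesSent;
    }
}
```

With a batch size of 1000, a partition of 2500 rows would be sent as three batches, and the converter is built only on the first call in that JVM.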
I am developing a Spring Boot batch application and wanted to know whether there is sample code showing how to get the Micrometer metrics discussed in the Spring Batch documentation.
I am looking for a way of getting these details per execution. Also, since I run the application on a cron schedule, can we get this data separated per execution?
Found the solution.
No coding is needed; it is all readily available.
Include the actuator dependency in the pom:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
In the properties file, add this line (instead of * you can give a comma-separated list of endpoints, and add security):
management.endpoints.web.exposure.include=*
If your application is running on port 8080, then
http://localhost:8080/actuator/metrics/spring.batch.job
will give the job stats. You can query any other metric in the same way.
You can collect metrics for every execution in the cron schedule as follows.
First, create a custom metrics endpoint.
@Component
@Endpoint(id = "spring.batch.executions")
public class BatchMetricEndpoint {
    @Autowired
    BatchAllJobsMetricContext batchAllJobsMetricContext;
    @ReadOperation
    public Map<String, BatchPerJobMetricContext> features() {
        return batchAllJobsMetricContext.getAll();
    }
}
Second, create a job listener that taps into the local metrics endpoint after each execution.
@Component
public class BatchJobListener implements JobExecutionListener {
    @Autowired
    private MetricsEndpoint metricsEndpoint;
    @Autowired
    BatchAllJobsMetricContext batchAllJobsMetricContext;
    @Autowired
    BatchPerJobMetricContext batchPerJobMetricContext;
    @Override
    public void beforeJob(JobExecution jobExecution) {
    }
    @Override
    public void afterJob(JobExecution jobExecution) {
        MetricsEndpoint.MetricResponse metricResponse = metricsEndpoint.metric("spring.batch.job", null);
        String key = jobExecution.getJobParameters().getString("jobId");
        String execution = "Execution " + jobExecution.getJobParameters().getString("executionCout");
        if (batchAllJobsMetricContext.hasKey(key)) {
            batchAllJobsMetricContext.get(key).put(execution, metricResponse);
        } else {
            batchPerJobMetricContext.put(execution, metricResponse);
            batchAllJobsMetricContext.put(key, batchPerJobMetricContext);
        }
    }
}
Third, tap into the local metrics to aggregate the data.
Please note that holding metrics per execution this way is expensive on memory; you will want to set a limit and push this data to a time series data store.
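One simple way to put a limit on the per-execution data is a map that drops its oldest entry once a maximum size is reached. The following is only a minimal sketch in plain Java; BoundedExecutionMetrics is a hypothetical class, not part of Spring Batch or the actuator, and the value type is left generic because the metric context classes above are not shown in full.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded store: keeps only the most recent maxEntries executions.
final class BoundedExecutionMetrics<V> extends LinkedHashMap<String, V> {
    private final int maxEntries;

    BoundedExecutionMetrics(int maxEntries) {
        super(16, 0.75f, false); // false = iterate in insertion order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each put; returning true evicts the oldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a limit of 2, adding "Execution 1", "Execution 2" and "Execution 3" leaves only the last two in memory; anything older would have to be pushed to the external store before it is evicted.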
Sample code
I'm trying to test a service that communicates with another one.
One of them generates audits, which are stored in memory until a scheduled task flushes them to a Redis node:
@Component
public class AuditFlushTask {
    private AuditService auditService;
    private AuditFlushTask(AuditService auditService) {
        this.auditService = auditService;
    }
    @Scheduled(fixedDelayString = "${fo.audit-flush-interval}")
    public void flushAudits() {
        this.auditService.flush();
    }
}
On the other hand, this service provides an endpoint that returns those flushed audits:
public Collection<String> listAudits(
) {
return this.boService.listRawAudits(deadlineTimestamp);
}
The problem is that I'm building an integration test to check that this process works correctly, that is, that the audits are provided as expected.
So, I don't know how to "wait until the audits have been flushed" in the microservice.
Any ideas?
Don't test the framework: Spring almost certainly has tests that cover fixed delays.
Instead, keep all logic within the service itself, and integration test that in isolation from the Spring @Scheduled method.
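As a rough illustration of that advice, the test can call flush() itself instead of waiting for the scheduler. The sketch below is plain Java; InMemoryAuditService is a hypothetical stand-in for the AuditService in the question, and the in-memory "flushed" list stands in for the Redis node.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the AuditService from the question.
final class InMemoryAuditService {
    private final List<String> pending = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>(); // stands in for Redis

    synchronized void record(String audit) {
        pending.add(audit);
    }

    // The same method the scheduled task would call; a test can invoke it directly.
    synchronized void flush() {
        flushed.addAll(pending);
        pending.clear();
    }

    // Returns a copy so callers cannot modify the internal state.
    synchronized List<String> listFlushed() {
        return new ArrayList<>(flushed);
    }
}
```

A test then records an audit, calls flush(), and asserts on listFlushed(), with no sleeping or polling for the fixed delay to elapse.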
I have N servers, N DBs and N configurations; see the scenario below.
So, on every request, I need to access a server and DB based on the configuration.
How can I implement a dynamic data source in Spring Data JPA?
You can try AbstractRoutingDataSource, provided by Spring since version 2.0.1, which lets you pick the appropriate data source dynamically. For integration with Spring Data JPA, check this very good example. In your case, since your configurations are in a DB instead of a properties file, you would need to perform an extra initial database lookup to get the appropriate database configuration and return the appropriate data source object.
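The routing idea itself is small: something sets a lookup key for the current request, and the data source is resolved from a map by that key (in Spring, the key is supplied by overriding determineCurrentLookupKey()). The sketch below shows only that idea in plain Java, without Spring or JDBC. TenantContext and RoutingRegistry are hypothetical names, and the strings in the usage note are placeholders for real DataSource objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-thread holder for the lookup key of the current request.
final class TenantContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private TenantContext() {}

    static void set(String key) { CURRENT.set(key); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}

// Hypothetical registry that resolves a target by the current lookup key.
final class RoutingRegistry<T> {
    private final Map<String, T> targets;
    private final T defaultTarget;

    RoutingRegistry(Map<String, T> targets, T defaultTarget) {
        this.targets = new HashMap<>(targets);
        this.defaultTarget = defaultTarget;
    }

    // Looks up the target for the current key; falls back to the default if none matches.
    T resolve() {
        String key = TenantContext.get();
        T target = (key == null) ? null : targets.get(key);
        return (target != null) ? target : defaultTarget;
    }
}
```

A request filter would call TenantContext.set("db1") before the repository call and TenantContext.clear() afterwards, so a pooled thread does not leak the key into the next request.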
Another simple approach is to reload different properties per environment.
Indeed, it might be overkill for DB switching, but it keeps your app simple and maintainable, and most importantly, keeps your environments completely isolated.
Step 1: Configure a separate properties file for each configuration you have (keep them in src/main/resources with this naming convention: application-{profile}.properties)
Step 2: At runtime, change the application context to reload your app based on a given profile
Sample code:
In ProfileController:
@RestController
@RequestMapping("/profile")
public class ProfileController {
    @Value("${spring.profiles.active}")
    private String profile;
    @GetMapping("/profile")
    public String getProfile() {
        System.out.println("Current profile is: " + profile);
        return "Current profile is: " + profile;
    }
    @GetMapping("/switch/{profile}")
    public String switchProfile(@PathVariable String profile) {
        System.out.println("Switching profile to: " + profile);
        MyApplication.restartWithNewProfile(profile);
        return "Switched to profile: " + profile;
    }
}
In MyApplication.java:
/**
 * Switches the profile at runtime
 */
public static void restartWithNewProfile(String profile) {
    Thread thread = new Thread(() -> {
        context.close();
        context = SpringApplication.run(MyApplication.class, "--spring.profiles.active=" + profile);
    });
    thread.setDaemon(false);
    thread.start();
}
I am trying to add an aggregator to my code.
I am facing a couple of problems:
1. How do I set up a message store using annotations only?
2. Is there any diagram of how an aggregator works? Basically a picture explaining it.
@MessageEndpoint
public class Aggregator {
    @Aggregator(inputChannel = "abcCH", outputChannel = "reply", sendPartialResultsOnExpiry = "true")
    public APayload aggregatingMethod(List<APayload> items) {
        return items.get(0);
    }
    @ReleaseStrategy
    public boolean canRelease(List<Message<?>> messages) {
        return messages.size() > 2;
    }
    @CorrelationStrategy
    public String correlateBy(Message<AbcPayload> message) {
        return (String) message.getHeaders().get(RECEIVED_MESSAGE_KEY);
    }
}
In the Reference Manual we have a note:
Annotation configuration (@Aggregator and others) for the Aggregator component covers only simple use cases, where most default options are sufficient. If you need more control over those options using Annotation configuration, consider using a @Bean definition for the AggregatingMessageHandler and mark its @Bean method with @ServiceActivator:
And a bit below:
Starting with the version 4.2 the AggregatorFactoryBean is available, to simplify Java configuration for the AggregatingMessageHandler.
So, you should actually configure an AggregatorFactoryBean as a @Bean and annotate it with @ServiceActivator(inputChannel = "abcCH", outputChannel = "reply").
Also consider using the Spring Integration Java DSL to simplify your life with the Java configuration.
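As for how an aggregator works in general: each incoming message is assigned to a group by its correlation key, the group is kept in a store, and once the release condition holds the whole group is handed on and removed from the store. The sketch below is a plain-Java illustration of that flow only, not Spring Integration's implementation. SimpleAggregator is a hypothetical class, its internal map plays the role of the message store, and a release size of 3 mirrors the messages.size() > 2 condition in the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical aggregator: correlate -> store -> release.
final class SimpleAggregator<T> {
    // Plays the role of the message store: pending groups by correlation key.
    private final Map<String, List<T>> groups = new HashMap<>();
    private final int releaseSize;

    SimpleAggregator(int releaseSize) {
        this.releaseSize = releaseSize;
    }

    // Adds a payload to its group; returns the complete group once it can be released.
    synchronized Optional<List<T>> add(String correlationKey, T payload) {
        List<T> group = groups.computeIfAbsent(correlationKey, k -> new ArrayList<>());
        group.add(payload);
        if (group.size() >= releaseSize) {   // release strategy
            groups.remove(correlationKey);   // group is complete, drop it from the store
            return Optional.of(group);
        }
        return Optional.empty();
    }
}
```

The first two payloads for a key return an empty Optional, the third returns all three, and payloads with a different key are collected independently.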
I seem to have hit an issue and have no real clue how to solve it.
My current app is based on Spring Boot with JPA, and the following code hits a lock on its second execution.
@RequestMapping(value="/", method = RequestMethod.GET)
public String index() {
    repository.save(new RawData("test"));
    repository.save(new RawData("test"));
    // hangs when the method index() is run twice in sequence
    RawData rawData = rawDataRepository.findOne(1L);
    System.out.println(rawData);
    return "#: " + repository.count();
}
When run the first time all seems OK, but executing the same code a second time gives me a lock on:
RawData rawData = rawDataRepository.findOne(1L);
Also, trying to connect to the DB gives me a lock timeout while the method hangs or waits for a timeout.
Calling the same code in a Spring service results in the same behaviour.
@Component
public class SyncService {
    @Autowired
    RawDataRepository rawDataRepository;
    void syncWithRemote() {
        // hang on this line...
        RawData rawData = rawDataRepository.findOne(1L);
        System.out.println(rawData);
    }
}
You should use two techniques:
Use optimistic locking by adding a @Version field to your entities.
Add transaction support by annotating your methods with @Transactional. Normally you would also have to annotate a configuration class with @EnableTransactionManagement, but Spring Boot does that for you.
That should solve your problems.
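For readers unfamiliar with optimistic locking, the idea behind a version field is that an update only succeeds if the version the caller read is still the current one; otherwise the update is rejected instead of waiting on a lock. The sketch below shows that check in plain Java with no JPA involved. VersionedRecord is a hypothetical class, and JPA itself signals the conflict with an exception rather than a boolean.

```java
// Hypothetical in-memory record illustrating the version check behind optimistic locking.
final class VersionedRecord {
    private String value;
    private long version;

    VersionedRecord(String value) {
        this.value = value;
        this.version = 0L;
    }

    synchronized String value() { return value; }
    synchronized long version() { return version; }

    // Applies the update only if the caller saw the latest version; bumps the version on success.
    synchronized boolean update(String newValue, long expectedVersion) {
        if (expectedVersion != version) {
            return false; // stale read: someone else updated in between
        }
        value = newValue;
        version++;
        return true;
    }
}
```

Two writers that both read version 0 cannot both succeed: the first update moves the version to 1, and the second is rejected and has to re-read before retrying.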