We are currently evaluating a shift from Spring Batch + Batch Admin to a Spring Cloud based infrastructure.
Our main challenges / questions are:
1. As part of the monolithic design of our Spring Batch jobs, we fetch some general MD and aggregate it into a common data structure that many jobs use to run in a more optimized way. Is the nature of SCDF Tasks going to be a problem in our case? Should we reconsider shifting to Streams instead, and how could that be done?
2. One of the major reasons for using SCDF is its support for scaling for better performance.
For a first POC it is going to be hard for us to create a real cloud infrastructure, so I was looking at a standalone SCDF that uses the remote-partitioning design as a scaling solution. We are looking for a demo/intro GitHub project or guide - I didn't manage to find anything relevant. Does it still require, as in past solutions, communication between nodes via a JMS infrastructure (Spring Integration)?
3. The main challenge for us is to refactor one of our batch jobs so that it supports both remote partitioning and multiple threads on each node. Is it possible to create a Spring Batch job with both of these aspects?
4. Breaking up our monolithic jar with 20 jobs into separate Spring Boot über jars isn't a simple task to achieve - any thoughts / ideas / best practices?
Best,
Elad
I had the same problem as Elad's point 3 and eventually solved it by using the basic framework as demonstrated here but with modified versions of DeployerPartitionHandler and DeployerStepExecutionHandler.
I first tried the naive approach of creating a two-level partitioning where the step that each worker executes is itself partitioned into sub-partitions. But the framework doesn't seem to support that; it got confused about the step's state.
So I went back to a flat set of partitions but passing multiple step execution ids to each worker. For this to work, I created DeployerMultiPartitionHandler which launches the configured number of workers and passes each one a list of step execution ids. Note that there are now two degrees of freedom: the number of workers and the gridSize, which is the total number of partitions that get distributed as evenly as possible to the workers. Unfortunately, I had to duplicate a lot of DeployerPartitionHandler's code here.
@Slf4j
@Getter
@Setter
public class DeployerMultiPartitionHandler implements PartitionHandler, EnvironmentAware, InitializingBean {
public static final String SPRING_CLOUD_TASK_STEP_EXECUTION_IDS =
"spring.cloud.task.step-execution-ids";
public static final String SPRING_CLOUD_TASK_JOB_EXECUTION_ID =
"spring.cloud.task.job-execution-id";
public static final String SPRING_CLOUD_TASK_STEP_EXECUTION_ID =
"spring.cloud.task.step-execution-id";
public static final String SPRING_CLOUD_TASK_STEP_NAME =
"spring.cloud.task.step-name";
public static final String SPRING_CLOUD_TASK_PARENT_EXECUTION_ID =
"spring.cloud.task.parentExecutionId";
public static final String SPRING_CLOUD_TASK_NAME = "spring.cloud.task.name";
private int maxWorkers = -1;
private int gridSize = 1;
private int currentWorkers = 0;
private TaskLauncher taskLauncher;
private JobExplorer jobExplorer;
private TaskExecution taskExecution;
private Resource resource;
private String stepName;
private long pollInterval = 10000;
private long timeout = -1;
private Environment environment;
private Map<String, String> deploymentProperties;
private EnvironmentVariablesProvider environmentVariablesProvider;
private String applicationName;
private CommandLineArgsProvider commandLineArgsProvider;
private boolean defaultArgsAsEnvironmentVars = false;
public DeployerMultiPartitionHandler(TaskLauncher taskLauncher,
JobExplorer jobExplorer,
Resource resource,
String stepName) {
Assert.notNull(taskLauncher, "A taskLauncher is required");
Assert.notNull(jobExplorer, "A jobExplorer is required");
Assert.notNull(resource, "A resource is required");
Assert.hasText(stepName, "A step name is required");
this.taskLauncher = taskLauncher;
this.jobExplorer = jobExplorer;
this.resource = resource;
this.stepName = stepName;
}
@Override
public Collection<StepExecution> handle(StepExecutionSplitter stepSplitter,
StepExecution stepExecution) throws Exception {
final Set<StepExecution> tempCandidates =
stepSplitter.split(stepExecution, this.gridSize);
// Following two lines due to https://jira.spring.io/browse/BATCH-2490
final List<StepExecution> candidates = new ArrayList<>(tempCandidates.size());
candidates.addAll(tempCandidates);
int partitions = candidates.size();
log.debug(String.format("%s partitions were returned", partitions));
final Set<StepExecution> executed = new HashSet<>(candidates.size());
if (CollectionUtils.isEmpty(candidates)) {
return null;
}
launchWorkers(candidates, executed);
candidates.removeAll(executed);
return pollReplies(stepExecution, executed, partitions);
}
private void launchWorkers(List<StepExecution> candidates, Set<StepExecution> executed) {
int partitions = candidates.size();
int numWorkers = this.maxWorkers != -1 ? Math.min(this.maxWorkers, partitions) : partitions;
IntStream.range(0, numWorkers).boxed()
.map(i -> candidates.subList(partitionOffset(partitions, numWorkers, i), partitionOffset(partitions, numWorkers, i + 1)))
.filter(not(List::isEmpty))
.forEach(stepExecutions -> processStepExecutions(stepExecutions, executed));
}
private void processStepExecutions(List<StepExecution> stepExecutions, Set<StepExecution> executed) {
launchWorker(stepExecutions);
this.currentWorkers++;
executed.addAll(stepExecutions);
}
private void launchWorker(List<StepExecution> workerStepExecutions) {
List<String> arguments = new ArrayList<>();
StepExecution firstWorkerStepExecution = workerStepExecutions.get(0);
ExecutionContext copyContext = new ExecutionContext(firstWorkerStepExecution.getExecutionContext());
arguments.addAll(
this.commandLineArgsProvider
.getCommandLineArgs(copyContext));
String jobExecutionId = String.valueOf(firstWorkerStepExecution.getJobExecution().getId());
String stepExecutionIds = workerStepExecutions.stream().map(workerStepExecution -> String.valueOf(workerStepExecution.getId())).collect(joining(","));
String taskName = String.format("%s_%s_%s",
taskExecution.getTaskName(),
firstWorkerStepExecution.getJobExecution().getJobInstance().getJobName(),
firstWorkerStepExecution.getStepName());
String parentExecutionId = String.valueOf(taskExecution.getExecutionId());
if(!this.defaultArgsAsEnvironmentVars) {
arguments.add(formatArgument(SPRING_CLOUD_TASK_JOB_EXECUTION_ID,
jobExecutionId));
arguments.add(formatArgument(SPRING_CLOUD_TASK_STEP_EXECUTION_IDS,
stepExecutionIds));
arguments.add(formatArgument(SPRING_CLOUD_TASK_STEP_NAME, this.stepName));
arguments.add(formatArgument(SPRING_CLOUD_TASK_NAME, taskName));
arguments.add(formatArgument(SPRING_CLOUD_TASK_PARENT_EXECUTION_ID,
parentExecutionId));
}
copyContext = new ExecutionContext(firstWorkerStepExecution.getExecutionContext());
log.info("launchWorker context={}", copyContext);
Map<String, String> environmentVariables = this.environmentVariablesProvider.getEnvironmentVariables(copyContext);
if(this.defaultArgsAsEnvironmentVars) {
environmentVariables.put(SPRING_CLOUD_TASK_JOB_EXECUTION_ID,
jobExecutionId);
environmentVariables.put(SPRING_CLOUD_TASK_STEP_EXECUTION_ID,
String.valueOf(firstWorkerStepExecution.getId()));
environmentVariables.put(SPRING_CLOUD_TASK_STEP_NAME, this.stepName);
environmentVariables.put(SPRING_CLOUD_TASK_NAME, taskName);
environmentVariables.put(SPRING_CLOUD_TASK_PARENT_EXECUTION_ID,
parentExecutionId);
}
AppDefinition definition =
new AppDefinition(resolveApplicationName(),
environmentVariables);
AppDeploymentRequest request =
new AppDeploymentRequest(definition,
this.resource,
this.deploymentProperties,
arguments);
taskLauncher.launch(request);
}
private String resolveApplicationName() {
if(StringUtils.hasText(this.applicationName)) {
return this.applicationName;
}
else {
return this.taskExecution.getTaskName();
}
}
private String formatArgument(String key, String value) {
return String.format("--%s=%s", key, value);
}
private Collection<StepExecution> pollReplies(final StepExecution masterStepExecution,
final Set<StepExecution> executed,
final int size) throws Exception {
final Collection<StepExecution> result = new ArrayList<>(executed.size());
Callable<Collection<StepExecution>> callback = new Callable<Collection<StepExecution>>() {
@Override
public Collection<StepExecution> call() {
Set<StepExecution> newExecuted = new HashSet<>();
for (StepExecution curStepExecution : executed) {
if (!result.contains(curStepExecution)) {
StepExecution partitionStepExecution =
jobExplorer.getStepExecution(masterStepExecution.getJobExecutionId(), curStepExecution.getId());
if (isComplete(partitionStepExecution.getStatus())) {
result.add(partitionStepExecution);
currentWorkers--;
}
}
}
executed.addAll(newExecuted);
if (result.size() == size) {
return result;
}
else {
return null;
}
}
};
Poller<Collection<StepExecution>> poller = new DirectPoller<>(this.pollInterval);
Future<Collection<StepExecution>> resultsFuture = poller.poll(callback);
if (timeout >= 0) {
return resultsFuture.get(timeout, TimeUnit.MILLISECONDS);
}
else {
return resultsFuture.get();
}
}
private boolean isComplete(BatchStatus status) {
return status.equals(BatchStatus.COMPLETED) || status.isGreaterThan(BatchStatus.STARTED);
}
@Override
public void setEnvironment(Environment environment) {
this.environment = environment;
}
@Override
public void afterPropertiesSet() {
Assert.notNull(taskExecution, "A taskExecution is required");
if(this.environmentVariablesProvider == null) {
this.environmentVariablesProvider =
new CloudEnvironmentVariablesProvider(this.environment);
}
if(this.commandLineArgsProvider == null) {
SimpleCommandLineArgsProvider simpleCommandLineArgsProvider = new SimpleCommandLineArgsProvider();
simpleCommandLineArgsProvider.onTaskStartup(taskExecution);
this.commandLineArgsProvider = simpleCommandLineArgsProvider;
}
}
}
The partitions are distributed to the workers with the help of the static function partitionOffset, which ensures that the number of partitions each worker receives differs by at most one:
static int partitionOffset(int length, int numberOfPartitions, int partitionIndex) {
return partitionIndex * (length / numberOfPartitions) + Math.min(partitionIndex, length % numberOfPartitions);
}
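To make the distribution concrete, here is a small standalone illustration (the demo class is my own, only partitionOffset is taken from above): with 10 partitions and 3 workers, the sublists handed to the workers contain 4, 3 and 3 partitions.
public class PartitionOffsetDemo {
    static int partitionOffset(int length, int numberOfPartitions, int partitionIndex) {
        return partitionIndex * (length / numberOfPartitions) + Math.min(partitionIndex, length % numberOfPartitions);
    }
    public static void main(String[] args) {
        int partitions = 10;
        int workers = 3;
        for (int i = 0; i < workers; i++) {
            int from = partitionOffset(partitions, workers, i);
            int to = partitionOffset(partitions, workers, i + 1);
            // prints: worker 0 -> [0, 4), worker 1 -> [4, 7), worker 2 -> [7, 10)
            System.out.printf("worker %d gets partitions [%d, %d)%n", i, from, to);
        }
    }
}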
On the receiving end I created DeployerMultiStepExecutionHandler which inherits the parallel execution of partitions from TaskExecutorPartitionHandler and in addition implements the command line interface matching DeployerMultiPartitionHandler:
@Slf4j
public class DeployerMultiStepExecutionHandler extends TaskExecutorPartitionHandler implements CommandLineRunner {
private JobExplorer jobExplorer;
private JobRepository jobRepository;
private Log logger = LogFactory.getLog(org.springframework.cloud.task.batch.partition.DeployerStepExecutionHandler.class);
@Autowired
private Environment environment;
private StepLocator stepLocator;
public DeployerMultiStepExecutionHandler(BeanFactory beanFactory, JobExplorer jobExplorer, JobRepository jobRepository) {
Assert.notNull(beanFactory, "A beanFactory is required");
Assert.notNull(jobExplorer, "A jobExplorer is required");
Assert.notNull(jobRepository, "A jobRepository is required");
this.stepLocator = new BeanFactoryStepLocator();
((BeanFactoryStepLocator) this.stepLocator).setBeanFactory(beanFactory);
this.jobExplorer = jobExplorer;
this.jobRepository = jobRepository;
}
@Override
public void run(String... args) throws Exception {
validateRequest();
Long jobExecutionId = Long.parseLong(environment.getProperty(SPRING_CLOUD_TASK_JOB_EXECUTION_ID));
Stream<Long> stepExecutionIds = Stream.of(environment.getProperty(SPRING_CLOUD_TASK_STEP_EXECUTION_IDS).split(",")).map(Long::parseLong);
Set<StepExecution> stepExecutions = stepExecutionIds.map(stepExecutionId -> jobExplorer.getStepExecution(jobExecutionId, stepExecutionId)).collect(Collectors.toSet());
log.info("found stepExecutions:\n{}", stepExecutions.stream().map(stepExecution -> stepExecution.getId() + ":" + stepExecution.getExecutionContext()).collect(joining("\n")));
if (stepExecutions.isEmpty()) {
throw new NoSuchStepException(String.format("No StepExecution could be located for step execution id %s within job execution %s", stepExecutionIds, jobExecutionId));
}
String stepName = environment.getProperty(SPRING_CLOUD_TASK_STEP_NAME);
setStep(stepLocator.getStep(stepName));
doHandle(null, stepExecutions);
}
private void validateRequest() {
Assert.isTrue(environment.containsProperty(SPRING_CLOUD_TASK_JOB_EXECUTION_ID), "A job execution id is required");
Assert.isTrue(environment.containsProperty(SPRING_CLOUD_TASK_STEP_EXECUTION_IDS), "A step execution id is required");
Assert.isTrue(environment.containsProperty(SPRING_CLOUD_TASK_STEP_NAME), "A step name is required");
Assert.isTrue(this.stepLocator.getStepNames().contains(environment.getProperty(SPRING_CLOUD_TASK_STEP_NAME)), "The step requested cannot be found in the provided BeanFactory");
}
}
I would like to constrain the combination of active profiles.
For example, I have two groups of profiles:
a) DEV, PROD, TEST
b) ProfileDB1, ProfileDB2, ProfileDB3
I would like to force the application to run with exactly one profile from the first group and one profile from the second group, but no more. Is that possible?
You can add an ApplicationListener that checks, once the application environment is prepared, that the active profiles are as expected before the rest of the application loads.
public class ProfileValidator implements ApplicationListener<ApplicationEnvironmentPreparedEvent> {
private static final Logger LOG = LoggerFactory.getLogger(ProfileValidator.class);
private static final Set<String> DB_PROFILES = new HashSet<>(Arrays.asList("DB1", "DB2", "DB3"));
private static final Set<String> ENVIRONMENT_PROFILES = new HashSet<>(Arrays.asList("dev", "test"));
@Override
public void onApplicationEvent(ApplicationEnvironmentPreparedEvent applicationEnvironmentPreparedEvent) {
List<String> activeProfiles = Arrays.asList(applicationEnvironmentPreparedEvent.getEnvironment().getActiveProfiles());
LOG.info("Validating Allowed Profiles - {}", activeProfiles);
if (activeProfiles.size() > 1) {
long count = activeProfiles.stream().filter(profile -> DB_PROFILES.contains(profile) || ENVIRONMENT_PROFILES.contains(profile)).count();
LOG.debug("Counted {} profiles", count);
if (count != 0 && activeProfiles.size() - 2 != (activeProfiles.size() - count)) {
throw new IllegalArgumentException(String.format("Invalid Profiles detected for %s", activeProfiles.toString()));
}
}
}
}
Here's a working example,
https://github.com/DarrenForsythe/spring-profile-validator
Note the spring.factories file that registers the ApplicationListener. There's also a test class that verifies the functionality; you can just start it up with valid or invalid profile combinations.
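For reference, the registration entry lives in src/main/resources/META-INF/spring.factories and looks roughly like this (the package name is an assumption, adjust it to wherever your listener actually lives):
# META-INF/spring.factories - registers the listener before the context starts
org.springframework.context.ApplicationListener=com.example.ProfileValidator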
You could write some kind of an ActiveProfilesVerifier component in which the Environment is injected and the active profiles are verified:
@Component
public class ActiveProfilesVerifier {
private static final List<String> ENV_PROFILES = Arrays.asList("DEV", "PROD", "TEST");
private static final List<String> DBASE_PROFILES = Arrays.asList("ProfileDB1", "ProfileDB2", "ProfileDB3");
private final Environment environment;
public ActiveProfilesVerifier(Environment environment) {
this.environment = environment;
}
@PostConstruct
public void verifyProfiles() {
String[] activeProfiles = environment.getActiveProfiles();
boolean hasSingleEnvProfile = Arrays.stream(activeProfiles).filter(ENV_PROFILES::contains).count() == 1;
if (!hasSingleEnvProfile) {
throw new IllegalArgumentException("Select exactly one environment profile");
}
boolean hasSingleDbaseProfile = Arrays.stream(activeProfiles).filter(DBASE_PROFILES::contains).count() == 1;
if (!hasSingleDbaseProfile) {
throw new IllegalArgumentException("Select exactly one database profile");
}
}
}
I'd like to write end-to-end tests to validate two boot apps work well together with various profiles.
What already works:
create a third maven module (e2e) for end-to-end tests, in addition to the two tested apps (authorization-server and resource-server)
write tests using TestRestTemplate
Tests work fine if I start authorization-server and resource-server manually.
What I now want to do is automate the tested boot apps startup and shutdown with the right profiles for each test.
I tried:
adding maven dependencies to tested apps in e2e module
using SpringApplication in new threads for each app to start
But I face misconfiguration issues, as all resources and dependencies end up in the same shared classpath...
Is there a way to sort this out?
I'm also considering starting two separate java -jar ... processes, but then how can I ensure the tested apps' fat jars are built before the e2e tests run?
Here is my current app start/shutdown code sample, which fails as soon as I add a Maven dependency on the second app to start:
private Service startAuthorizationServer(boolean isJwtActive) throws InterruptedException {
return new Service(
AuthorizationServer.class,
isJwtActive ? new String[]{ "jwt" } : new String[]{} );
}
private static final class Service {
private ConfigurableApplicationContext context;
private final Thread thread;
public Service(Class<?> appClass, String... profiles) throws InterruptedException {
thread = new Thread(() -> {
SpringApplication app = new SpringApplicationBuilder(appClass).profiles(profiles).build();
context = app.run();
});
thread.setDaemon(false);
thread.start();
while (context == null || !context.isRunning()) {
Thread.sleep(1000);
};
}
@PreDestroy
public void stop() {
if (context != null) {
SpringApplication.exit(context);
}
if (thread != null) {
thread.interrupt();
}
}
}
I think that in your case, running the two applications via docker-compose can be a good idea.
This article shows how you can set up some integration tests using a docker compose image: https://blog.codecentric.de/en/2017/03/writing-integration-tests-docker-compose-junit/
Also, take a look at this post from Martin Fowler: https://martinfowler.com/articles/microservice-testing/
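A minimal docker-compose sketch for such a setup could look like the one below; the image names, ports and profiles are assumptions for illustration, not taken from your build:
version: "3"
services:
  authorization-server:
    image: my-org/authorization-server:latest   # assumed image name
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=jwt              # pick the profile combination under test
  resource-server:
    image: my-org/resource-server:latest        # assumed image name
    ports:
      - "8081:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=jwt
    depends_on:
      - authorization-server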
I got things working with the second solution:
the end-to-end test project has no Maven dependencies other than what is required to run Spring tests with TestRestTemplate
the test config initialises the environment, running mvn package on the required modules in separate processes
the test cases (re)start the apps with the chosen profiles in separate java -jar ... processes
Here is the helper class I wrote for this (taken from there):
class ActuatorApp {
private final int port;
private final String actuatorEndpoint;
private final File jarFile;
private final TestRestTemplate actuatorClient;
private Process process;
private ActuatorApp(File jarFile, int port, TestRestTemplate actuatorClient) {
this.port = port;
this.actuatorEndpoint = getBaseUri() + "actuator/";
this.actuatorClient = actuatorClient;
this.jarFile = jarFile;
Assert.isTrue(jarFile.exists(), jarFile.getAbsolutePath() + " does not exist");
}
public void start(List<String> profiles, List<String> additionalArgs) throws InterruptedException, IOException {
if (isUp()) {
stop();
}
this.process = Runtime.getRuntime().exec(appStartCmd(jarFile, profiles, additionalArgs));
Executors.newSingleThreadExecutor().submit(new ProcessStdOutPrinter(process));
for (int i = 0; i < 10 && !isUp(); ++i) {
Thread.sleep(5000);
}
}
public void start(String... profiles) throws InterruptedException, IOException {
this.start(Arrays.asList(profiles), List.of());
}
public void stop() throws InterruptedException {
if (isUp()) {
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON_UTF8);
headers.setAccept(List.of(MediaType.APPLICATION_JSON_UTF8));
actuatorClient.postForEntity(actuatorEndpoint + "shutdown", new HttpEntity<>(headers), Object.class);
Thread.sleep(5000);
}
if (process != null) {
process.destroy();
}
}
private String[] appStartCmd(File jarFile, List<String> profiles, List<String> additionalArgs) {
final List<String> cmd = new ArrayList<>(
List.of(
"java",
"-jar",
jarFile.getAbsolutePath(),
"--server.port=" + port,
"--management.endpoint.heath.enabled=true",
"--management.endpoint.shutdown.enabled=true",
"--management.endpoints.web.exposure.include=*",
"--management.endpoints.web.base-path=/actuator"));
if (profiles.size() > 0) {
cmd.add("--spring.profiles.active=" + profiles.stream().collect(Collectors.joining(",")));
}
if (additionalArgs != null) {
cmd.addAll(additionalArgs);
}
return cmd.toArray(new String[0]);
}
private boolean isUp() {
try {
final ResponseEntity<HealthResponse> response =
actuatorClient.getForEntity(actuatorEndpoint + "health", HealthResponse.class);
return response.getStatusCode().is2xxSuccessful() && response.getBody().getStatus().equals("UP");
} catch (ResourceAccessException e) {
return false;
}
}
public static Builder builder(String moduleName, String moduleVersion) {
return new Builder(moduleName, moduleVersion);
}
/**
* Configure and build a spring-boot app
*
* @author Ch4mp
*
*/
public static class Builder {
private String moduleParentDirectory = "..";
private final String moduleName;
private final String moduleVersion;
private int port = SocketUtils.findAvailableTcpPort(8080);
private String actuatorClientId = "actuator";
private String actuatorClientSecret = "secret";
public Builder(String moduleName, String moduleVersion) {
this.moduleName = moduleName;
this.moduleVersion = moduleVersion;
}
public Builder moduleParentDirectory(String moduleParentDirectory) {
this.moduleParentDirectory = moduleParentDirectory;
return this;
}
public Builder port(int port) {
this.port = port;
return this;
}
public Builder actuatorClientId(String actuatorClientId) {
this.actuatorClientId = actuatorClientId;
return this;
}
public Builder actuatorClientSecret(String actuatorClientSecret) {
this.actuatorClientSecret = actuatorClientSecret;
return this;
}
/**
* Ensures the app module is found and packaged
* @return app ready to be started
* @throws IOException if module packaging throws one
* @throws InterruptedException if module packaging throws one
*/
public ActuatorApp build() throws IOException, InterruptedException {
final File moduleDir = new File(moduleParentDirectory, moduleName);
packageModule(moduleDir);
final File jarFile = new File(new File(moduleDir, "target"), moduleName + "-" + moduleVersion + ".jar");
return new ActuatorApp(jarFile, port, new TestRestTemplate(actuatorClientId, actuatorClientSecret));
}
private void packageModule(File moduleDir) throws IOException, InterruptedException {
Assert.isTrue(moduleDir.exists(), "could not find module. " + moduleDir + " does not exist.");
String[] cmd = new File(moduleDir, "pom.xml").exists() ?
new String[] { "mvn", "-DskipTests=true", "package" } :
new String[] { "./gradlew", "bootJar" };
Process mvnProcess = new ProcessBuilder().directory(moduleDir).command(cmd).start();
Executors.newSingleThreadExecutor().submit(new ProcessStdOutPrinter(mvnProcess));
Assert.isTrue(mvnProcess.waitFor() == 0, "module packaging exited with error status.");
}
}
private static class ProcessStdOutPrinter implements Runnable {
private InputStream inputStream;
public ProcessStdOutPrinter(Process process) {
this.inputStream = process.getInputStream();
}
@Override
public void run() {
new BufferedReader(new InputStreamReader(inputStream)).lines().forEach(System.out::println);
}
}
public String getBaseUri() {
return "https://localhost:" + port;
}
}
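Usage from a test then looks roughly like the sketch below; the module name, version, port and profile are placeholders, and the JUnit 5 lifecycle annotations are just one way to drive it:
// Hypothetical usage in an e2e test class; names and versions are assumptions.
private static ActuatorApp authorizationServer;

@BeforeAll
static void startApps() throws Exception {
    authorizationServer = ActuatorApp.builder("authorization-server", "0.0.1-SNAPSHOT")
            .port(8080)
            .build();                  // packages the module and locates the fat jar
    authorizationServer.start("jwt");  // (re)starts the app with the chosen profile
}

@AfterAll
static void stopApps() throws Exception {
    authorizationServer.stop();        // asks actuator to shut down, then destroys the process
}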
I am learning Flink and I started with a simple word count using the DataStream API. To enhance the processing I filtered the output to show only the words that were found 3 or more times.
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("localhost", 9000)
.flatMap(new Splitter())
.keyBy(0)
.timeWindow(Time.seconds(5))
.apply(new MyWindowFunction())
.sum(1)
.filter(word -> word.f1 >= 3);
I would like to create a WindowFunction to sort the output by the value of words found. The WindowFunction that I am trying to implement does not compile at all. I am struggling to define the apply method and the parameters of the WindowFunction interface.
public static class MyWindowFunction implements WindowFunction<
Tuple2<String, Integer>, // input type
Tuple2<String, Integer>, // output type
Tuple2<String, Integer>, // key type
TimeWindow> {
void apply(Tuple2<String, Integer> key, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) {
String word = ((Tuple2<String, Integer>)key).f0;
Integer count = ((Tuple2<String, Integer>)key).f1;
.........
out.collect(new Tuple2<>(word, count));
}
}
I am updating this answer to use Flink 1.12.0. In order to sort the elements of a stream, I had to use a KeyedProcessFunction after counting the stream with a ReduceFunction. Then I had to set the parallelism of the very last transformation to 1 in order not to change the order of the elements that I sorted using the KeyedProcessFunction. The sequence that I am using is socketTextStream -> flatMap -> keyBy -> reduce -> keyBy -> process -> print().setParallelism(1). Below is the example:
public class SocketWindowWordCountJava {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream("localhost", 9000)
.flatMap(new SplitterFlatMap())
.keyBy(new WordKeySelector())
.reduce(new SumReducer())
.keyBy(new WordKeySelector())
.process(new SortKeyedProcessFunction(3 * 1000))
.print().setParallelism(1);
String executionPlan = env.getExecutionPlan();
System.out.println("ExecutionPlan ........................ ");
System.out.println(executionPlan);
System.out.println("........................ ");
env.execute("Window WordCount sorted");
}
}
The UDF that I used to sort the stream is the SortKeyedProcessFunction, which extends KeyedProcessFunction. I use a ValueState<List<Event>> listState, where Event implements Comparable<Event>, to keep a sortable list as state. In the processElement method I register a processing-time timer for the moment I add the event to the state (context.timerService().registerProcessingTimeTimer(timeoutTime);), and I collect the events in the onTimer method. I am also using a 3-second timeout (3 * 1000 ms) here.
public class SortKeyedProcessFunction extends KeyedProcessFunction<String, Tuple2<String, Integer>, Event> {
private static final long serialVersionUID = 7289761960983988878L;
// delay after which an alert flag is thrown
private final long timeOut;
// state to remember the last timer set
private ValueState<List<Event>> listState = null;
private ValueState<Long> lastTime = null;
public SortKeyedProcessFunction(long timeOut) {
this.timeOut = timeOut;
}
@Override
public void open(Configuration conf) {
// setup timer and HLL state
ValueStateDescriptor<List<Event>> descriptor = new ValueStateDescriptor<>(
// state name
"sorted-events",
// type information of state
TypeInformation.of(new TypeHint<List<Event>>() {
}));
listState = getRuntimeContext().getState(descriptor);
ValueStateDescriptor<Long> descriptorLastTime = new ValueStateDescriptor<Long>(
"lastTime",
TypeInformation.of(new TypeHint<Long>() {
}));
lastTime = getRuntimeContext().getState(descriptorLastTime);
}
@Override
public void processElement(Tuple2<String, Integer> value, Context context, Collector<Event> collector) throws Exception {
// get current time and compute timeout time
long currentTime = context.timerService().currentProcessingTime();
long timeoutTime = currentTime + timeOut;
// register timer for timeout time
context.timerService().registerProcessingTimeTimer(timeoutTime);
List<Event> queue = listState.value();
if (queue == null) {
queue = new ArrayList<Event>();
}
Long current = lastTime.value();
queue.add(new Event(value.f0, value.f1));
lastTime.update(timeoutTime);
listState.update(queue);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
// System.out.println("onTimer: " + timestamp);
// check if this was the last timer we registered
System.out.println("timestamp: " + timestamp);
List<Event> queue = listState.value();
Long current = lastTime.value();
if (timestamp == current.longValue()) {
Collections.sort(queue);
queue.forEach( e -> {
out.collect(e);
});
queue.clear();
listState.clear();
}
}
}
class Event implements Comparable<Event> {
String value;
Integer qtd;
public Event(String value, Integer qtd) {
this.value = value;
this.qtd = qtd;
}
public String getValue() { return value; }
public Integer getQtd() { return qtd; }
@Override
public String toString() {
return "Event{" +"value='" + value + '\'' +", qtd=" + qtd +'}';
}
@Override
public int compareTo(@NotNull Event event) {
return this.getValue().compareTo(event.getValue());
}
}
So when I use $ nc -lk 9000 and type words into the console, I see them in order in the output:
...
Event{value='soccer', qtd=7}
Event{value='swim', qtd=5}
...
Event{value='basketball', qtd=9}
Event{value='soccer', qtd=8}
Event{value='swim', qtd=6}
The other UDFs are for the other transformations of the stream program and they are here for completeness.
public class SplitterFlatMap implements FlatMapFunction<String, Tuple2<String, Integer>> {
private static final long serialVersionUID = 3121588720675797629L;
@Override
public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word : sentence.split(" ")) {
out.collect(Tuple2.of(word, 1));
}
}
}
public class WordKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
@Override
public String getKey(Tuple2<String, Integer> value) throws Exception {
return value.f0;
}
}
public class SumReducer implements ReduceFunction<Tuple2<String, Integer>> {
@Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> event1, Tuple2<String, Integer> event2) throws Exception {
return Tuple2.of(event1.f0, event1.f1 + event2.f1);
}
}
The .sum(1) method will do everything you need (no need for using apply()), as long as the Splitter class (which should be a FlatMapFunction) is emitting Tuple2<String, Integer> records, where String is the word, and Integer is always 1.
So then .sum(1) will do the aggregation for you. If you needed something different than what sum() does, you would typically use .reduce(new MyCustomReduceFunction()), as that's going to be the most efficient and scalable approach, in terms of not needing to buffer lots in memory.
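Concretely, the pipeline this answer suggests is the question's stream without the apply() step; a minimal sketch (same window size as in the question):
DataStream<Tuple2<String, Integer>> dataStream = env
        .socketTextStream("localhost", 9000)
        .flatMap(new Splitter())          // Splitter must emit Tuple2.of(word, 1)
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .sum(1)                           // per-word count within each 5-second window
        .filter(word -> word.f1 >= 3);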
[long description warning]
I'm running some Cucumber tests which have to be executed interleaved across defined servers - for instance:
a.feature -> JBoss Server 1 | b.feature -> JBoss Server 2 | c.feature -> JBoss Server 1 | etc.
For that, I created a hypothetical ExecutorService like this:
final ExecutorService executorService = Executors.newFixedThreadPool(2); //numberOfServers
for (Runnable task : tasks) {
executorService.execute(task);
}
executorService.shutdown();
try {
executorService.awaitTermination(1000, TimeUnit.SECONDS);
} catch (InterruptedException e) {
//doX();
}
The way I manage which server is chosen to execute is as follows:
inside the Runnable class created for the executorService, I pass an instanceId as a parameter to TestNG (an XmlTest class), as below:
@Override
public void run() {
setupTest().run();
}
private TestNG setupTest() {
TestNG testNG = new TestNG();
XmlSuite xmlSuite = new XmlSuite();
XmlTest xmlTest = new XmlTest(xmlSuite);
xmlTest.setName(//irrelevant);
xmlTest.addParameter("instanceId", String.valueOf(instanceId));
xmlTest.setXmlClasses(..........);
testNG.setXmlSuites(..........);
return testNG;
}
Then, I get this just fine in a class that extends TestNgCucumberAdaptor:
@BeforeTest
@Parameters({"instanceId"})
public void setInstanceId(@Optional("") String instanceId) {
if (!StringUtils.isEmpty(instanceId)) {
this.instanceId = Integer.valueOf(instanceId);
}
}
And inside a @BeforeClass method I'm populating a Pojo with this instanceId and setting the Pojo into a ThreadLocal attribute of another class (sketched just after the context class below). So far, so good.
public class CurrentPojoContext {
private static final ThreadLocal<PojoContext> TEST_CONTEXT = new ThreadLocal<PojoContext>();
...
public static PojoContext getContext(){
return TEST_CONTEXT.get();
}
}
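The @BeforeClass part is roughly the following sketch; setContext(...) is an assumed companion to the getContext() shown above, since only the getter appears in the snippet:
// Hypothetical sketch of how the instanceId ends up in the ThreadLocal.
@BeforeClass
public void initPojoContext() {
    PojoContext pojoContext = new PojoContext();
    pojoContext.instanceId = this.instanceId;   // instanceId received via @Parameters("instanceId")
    CurrentPojoContext.setContext(pojoContext); // assumed setter storing it in the ThreadLocal
}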
Now the problem really starts - I'm using Guice (Cucumber guice as well) in a 3rd class, injecting this pojo object that contains the instanceId. The example follows:
public class Environment {
protected final PojoContext pojoContext;
@Inject
public Environment() {
this.pojoContext = CurrentPojoContext.getContext();
}
public void foo() {
print(pojoContext.instanceId); // output: 1
Another.doSomething(pojoContext);
}
class Another{
public String doSomething(PojoContext p){
print(p.instanceId); // output: 2
}
}
}
The outputs (1 and 2) are not like this every time, but from time to time I see that the execution of different threads is messing with the pojoContext attribute. I know that is a little confusing, but my guess is that the Guice injector is not thread-safe for this scenario - it might be a long shot, but I'd appreciate it if someone else takes a guess.
Regards
Well, just to provide a solution for anyone else, my solution was the following:
Create a class that maintains a Map with a unique identifier as the key and a Guice Injector as the value, in a thread-safe way (a sketch of such a class appears at the end of this answer);
When instantiating the Guice injector, I created my own module:
Guice.createInjector(Stage.PRODUCTION, MyOwnModules.SCENARIO, new RandomModule());
and for this module:
public class MyOwnModules {
public static final Module SCENARIO = new ScenarioModule(MyOwnCucumberScopes.SCENARIO);
}
the scope defined here provides the following:
public class MyOwnCucumberScopes {
public static final ScenarioScope SCENARIO = new ParallelScenarioScope();
}
To sum up, the thread-safety is in the ParallelScenarioScope:
public class ParallelScenarioScope implements ScenarioScope {
private static final Logger LOGGER = Logger.getLogger(ParallelScenarioScope.class);
private final ThreadLocal<Map<Key<?>, Object>> threadLocalMap = new ThreadLocal<Map<Key<?>, Object>>();
@Override
public <T> Provider<T> scope(final Key<T> key, final Provider<T> unscoped) {
return new Provider<T>() {
public T get() {
Map<Key<?>, Object> scopedObjects = getScopedObjectMap(key);
@SuppressWarnings("unchecked")
T current = (T) scopedObjects.get(key);
if (current == null && !scopedObjects.containsKey(key)) {
current = unscoped.get();
scopedObjects.put(key, current);
}
return current;
}
};
}
protected <T> Map<Key<?>, Object> getScopedObjectMap(Key<T> key) {
Map<Key<?>, Object> map = threadLocalMap.get();
if (map == null) {
throw new OutOfScopeException("Cannot access " + key + " outside of a scoping block");
}
return map;
}
@Override
public void enterScope() {
checkState(threadLocalMap.get() == null, "A scoping block is already in progress");
threadLocalMap.set(new ConcurrentHashMap<Key<?>, Object>());
}
@Override
public void exitScope() {
checkState(threadLocalMap.get() != null, "No scoping block in progress");
threadLocalMap.remove();
}
private void checkState(boolean expression, String errorMessage) {
if (!expression) {
LOGGER.info("M=checkState, Will throw exception: " + errorMessage);
throw new IllegalStateException(errorMessage);
}
}
}
Now the gotcha is just to be careful regarding @ScenarioScoped, and the code will work as expected.
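For completeness, here is a minimal sketch of the injector registry described in the first step; the class and method names are my own, only the createInjector call is taken from above:
public class InjectorRegistry {
    private static final ConcurrentMap<String, Injector> INJECTORS = new ConcurrentHashMap<>();

    public static Injector getOrCreate(String id) {
        // one injector per unique identifier (e.g. per scenario/thread), created lazily and thread-safely
        return INJECTORS.computeIfAbsent(id,
                key -> Guice.createInjector(Stage.PRODUCTION, MyOwnModules.SCENARIO, new RandomModule()));
    }
}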
I have implemented a simple Spring Boot application which receives a network message and queues it into a SingleChronicleQueue using appender.writeText(str); another thread polls for a message using tailer.readText(). After some processing, the processed message is placed in another SingleChronicleQueue to be sent away.
I have three queues in the application.
The application rotates the files every night, and the first weird thing is that the file sizes are the same every day for each queue (though different from queue to queue).
The largest cq4 file is about 220MB per day.
The problem I face is that in the three days since startup the memory has grown from 480MB to 1.6GB, which seems unreasonable.
I have a notion that I am missing something in the configuration, or that this is a naive/bad implementation on my part. (I don't close the appender and tailer after every use; should I?)
Here is a stripped down example, maybe someone can shed some light.
@Service
public class QueuesService {
private static Logger LOG = LoggerFactory.getLogger(QueuesService.class);
@Autowired
AppConfiguration conf;
private SingleChronicleQueue Q = null;
private ExcerptAppender QAppender = null;
private ExcerptTailer QTailer = null;
public QueuesService() {
}
@PostConstruct
private void init() {
Q = SingleChronicleQueueBuilder.binary(conf.getQueuePath()).indexSpacing(1).build();
QAppender = Q.acquireAppender();
QTailer = Q.createTailer();
}
public ExcerptAppender getQAppender() {
return QAppender;
}
public ExcerptTailer getQTailer() {
return QTailer;
}
}
@Service
public class ProcessingService {
private static Logger LOG = LoggerFactory.getLogger(ProcessingService.class);
@Autowired
AppConfiguration conf;
@Autowired
private TaskExecutor taskExecutor;
@Autowired
private QueuesService queueService;
private QueueProcessor processor = null;
public ProcessingService() {
}
@PostConstruct
private void init() {
processor = new QueueProcessor();
processor.start();
}
@Override
public Message processMessage(Message msg, Map<String, Object> metadata) throws SomeException {
String strMsg = msg.getMessage().toString();
if (LOG.isInfoEnabled()) {
LOG.info("\n" + strMsg);
}
try {
queueService.getQAppender().writeText(strMsg);
if (LOG.isInfoEnabled()) {
LOG.info("Added new message to queue. index: " + queueService.getQAppender().lastIndexAppended());
}
}
catch(Exception e) {
LOG.error("Unkbown error. reason: " + e.getMessage(), e);
}
}
class QueueProcessor extends Thread {
public void run() {
while (!interrupted()) {
try {
String msg = queueService.getQTailer().readText();
if (msg != null) {
long index = queueService.getQTailer().index();
// process
}
else {
Thread.sleep(10);
}
}
catch (InterruptedException e) {
LOG.warn(e);
this.interrupt();
break;
}
}
ThreadPoolTaskExecutor tp = (ThreadPoolTaskExecutor) taskExecutor;
tp.shutdown();
}
}
}
Chronicle Queue is designed to use virtual memory which can be much larger than main memory (or the heap) without a significant impact on your system. This allows you to access the data at random quickly.
Here is an example of a process writing 1 TB in 3 hours.
https://vanilla-java.github.io/2017/01/27/Chronicle-Queue-storing-1-TB-in-virtual-memory-on-a-128-GB-machine.html
This shows how much slower it gets as the queue grows: even after it is 1 TB in size on a machine with 128 GB of memory, it writes 1 GB in under 2 seconds pretty consistently.
While this doesn't cause a technical problem, we are aware that it concerns people who also find this "weird", and we plan to have a mode which reduces virtual-memory use (even if it is a little slower for some use cases).