Spring Batch read enormous data from Rest web service - spring-boot

I need to process data from a REST web service. The following is a basic example:
import org.springframework.batch.item.ItemReader;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;
import java.util.Arrays;
import java.util.List;
class RESTDataReader implements ItemReader<DataDTO> {
private final String apiUrl;
private final RestTemplate restTemplate;
private int nextDataIndex;
private List<DataDTO> data;
RESTDataReader(String apiUrl, RestTemplate restTemplate) {
this.apiUrl = apiUrl;
this.restTemplate = restTemplate;
nextDataIndex = 0;
}
@Override
public DataDTO read() throws Exception {
if (dataIsNotInitialized()) {
data = fetchDataFromAPI();
}
DataDTO nextData = null;
if (nextDataIndex < data.size()) {
nextData = data.get(nextDataIndex);
nextDataIndex++;
}
else {
nextDataIndex = 0;
data = null;
}
return nextData;
}
private boolean dataIsNotInitialized() {
return this.data == null;
}
private List<DataDTO> fetchDataFromAPI() {
ResponseEntity<DataDTO[]> response = restTemplate.getForEntity(apiUrl,
DataDTO[].class
);
DataDTO[] data = response.getBody();
return Arrays.asList(data);
}
}
However, my fetchDataFromAPI method is called with time slots and can return more than 20 million objects.
For example: if I call it between 01/01/2020 and 01/01/2021, I get 80 million records.
PS: the web service paginates by single days only, i.e. if I want to retrieve the data between 01/09/2020 and 07/09/2020, I have to call it several times (01/09-02/09, then 02/09-03/09, and so on until 06/09-07/09).
My problem in this case is running out of heap space when the data is bulky.
To avoid this problem, I had to create one step per month in my BatchConfiguration (12 steps): the first step calls the web service between 01/01/2020 and 01/02/2020, and so on.
Is there a way to read this whole volume of data with only one step before going to the processor?
Thanks in advance

Since your web service does not provide pagination within a single day, you need to ensure that the process that calls this web service (i.e. your Spring Batch job) has enough memory to store all items returned by this service.
For example: if I call it between 01/01/2020 and 01/01/2021, I get 80 million records.
This means that if you call this web service with curl on a machine that does not have enough memory to hold the result, then the curl command will fail. The point I want to make here is that the only way to solve this issue is to give enough memory to the JVM that runs your Spring Batch job to hold such a big result set.
As a side note: if you have control over this web service, I highly recommend improving it by introducing a more granular pagination mechanism.
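For reference, here is a minimal sketch (not part of the original answer; the URL template, parameter names and class name are hypothetical, and restart state is not handled) of an ItemReader that exploits the per-day pagination: it walks the date range one day per API call, so a whole year can be read in a single step while only one day of data is held in memory at a time:
import java.time.LocalDate;
import java.util.Iterator;
import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.HttpMethod;
import org.springframework.web.client.RestTemplate;
class DayByDayRestDataReader implements ItemReader<DataDTO> {
    private final RestTemplate restTemplate;
    // Hypothetical URL template, e.g. "https://host/api/data?from={from}&to={to}"
    private final String apiUrlTemplate;
    private final LocalDate endDay;
    private LocalDate currentDay;
    private Iterator<DataDTO> currentDayItems;
    DayByDayRestDataReader(RestTemplate restTemplate, String apiUrlTemplate, LocalDate start, LocalDate end) {
        this.restTemplate = restTemplate;
        this.apiUrlTemplate = apiUrlTemplate;
        this.currentDay = start;
        this.endDay = end;
    }
    @Override
    public DataDTO read() {
        // Fetch the next non-empty day; one API call covers exactly one day.
        while ((currentDayItems == null || !currentDayItems.hasNext()) && currentDay.isBefore(endDay)) {
            List<DataDTO> dayData = restTemplate.exchange(
                    apiUrlTemplate, HttpMethod.GET, null,
                    new ParameterizedTypeReference<List<DataDTO>>() {},
                    currentDay, currentDay.plusDays(1)).getBody();
            currentDayItems = (dayData == null) ? null : dayData.iterator();
            currentDay = currentDay.plusDays(1);
        }
        // Returning null tells Spring Batch that this step has no more items to read.
        return (currentDayItems != null && currentDayItems.hasNext()) ? currentDayItems.next() : null;
    }
}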

Related

How to redirect Prometheus Metrics to the default spring boot server

I am trying to expose a custom Gauge metric from my Spring Boot application. I am using Micrometer with the Prometheus registry to do so. I have set up the PrometheusRegistry and configs as per Micrometer Samples - Github, but it creates one more HTTP server for exposing the Prometheus metrics. I need to redirect or expose all the metrics to Spring Boot's default context path /actuator/prometheus instead of a new context path on a new port. I have implemented the following code so far -
PrometheusRegistry.java -
package com.xyz.abc.prometheus;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.time.Duration;
import com.sun.net.httpserver.HttpServer;
import io.micrometer.core.lang.Nullable;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
public class PrometheusRegistry {
public static PrometheusMeterRegistry prometheus() {
PrometheusMeterRegistry prometheusRegistry = new PrometheusMeterRegistry(new PrometheusConfig() {
@Override
public Duration step() {
return Duration.ofSeconds(10);
}
@Override
@Nullable
public String get(String k) {
return null;
}
});
try {
HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
server.createContext("/sample-data/prometheus", httpExchange -> {
String response = prometheusRegistry.scrape();
httpExchange.sendResponseHeaders(200, response.length());
OutputStream os = httpExchange.getResponseBody();
os.write(response.getBytes());
os.close();
});
new Thread(server::start).run();
} catch (IOException e) {
throw new RuntimeException(e);
}
return prometheusRegistry;
}
}
MicrometerConfig.java -
package com.xyz.abc.prometheus;
import io.micrometer.core.instrument.MeterRegistry;
public class MicrometerConfig {
public static MeterRegistry carMonitoringSystem() {
// Pick a monitoring system here to use in your samples.
return PrometheusRegistry.prometheus();
}
}
Code snippet where I am creating a custom Gauge metric. As of now, it's a simple REST API to test - (Please read the comments in between)
@SuppressWarnings({ "unchecked", "rawtypes" })
@RequestMapping(value = "/sampleApi", method = RequestMethod.GET)
@ResponseBody
//This Timed annotation is working fine and this metrics comes in /actuator/prometheus by default
@Timed(value = "car.healthcheck", description = "Time taken to return healthcheck")
public ResponseEntity healthCheck(){
MeterRegistry registry = MicrometerConfig.carMonitoringSystem();
AtomicLong n = new AtomicLong();
//Starting from here none of the Gauge metrics shows up in /actuator/prometheus path instead it goes to /sample-data/prometheus on port 8081 as configured.
registry.gauge("car.gauge.one", Tags.of("k", "v"), n);
registry.gauge("car.gauge.two", Tags.of("k", "v1"), n, n2 -> n2.get() - 1);
registry.gauge("car.help.gauge", 89);
//This thing never works! This gauge metrics never shows up in any URI configured
Gauge.builder("car.gauge.test", cpu)
.description("car.device.cpu")
.tags("customer", "demo")
.register(registry);
return new ResponseEntity("Car is working fine.", HttpStatus.OK);
}
I need all the metrics to show up inside /actuator/prometheus instead of a new HTTP server getting created. I know that I am explicitly creating a new HTTP server, so the metrics are popping up there. Please let me know how to avoid creating a new HTTP server and redirect all the Prometheus metrics to the default path /actuator/prometheus. Also, if I use Gauge.builder to define a custom gauge metric, it never works. Please explain how I can make that work as well. Let me know where I am going wrong.
Thank you.
Every time you call MicrometerConfig.carMonitoringSystem(), it creates a new Prometheus registry (and tries to start a new HTTP server).
You need to inject the MeterRegistry into the class that creates the gauge and register your gauges on that injected instance.
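For illustration, a minimal sketch of that approach (the class name and metric values are hypothetical; it assumes the micrometer-registry-prometheus dependency is on the classpath so Spring Boot auto-configures the registry that backs /actuator/prometheus):
import java.util.concurrent.atomic.AtomicLong;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
@RestController
public class CarHealthController {
    private final AtomicLong carGaugeValue = new AtomicLong();
    // Spring injects the auto-configured PrometheusMeterRegistry here,
    // so everything registered on it shows up under /actuator/prometheus.
    public CarHealthController(MeterRegistry registry) {
        // Register gauges once, at startup, against a long-lived state object.
        registry.gauge("car.gauge.one", Tags.of("k", "v"), carGaugeValue);
        Gauge.builder("car.gauge.test", carGaugeValue, AtomicLong::doubleValue)
                .description("car.device.cpu")
                .tags("customer", "demo")
                .register(registry);
    }
    @GetMapping("/sampleApi")
    public ResponseEntity<String> healthCheck() {
        carGaugeValue.incrementAndGet(); // update the state object the gauge reads from
        return new ResponseEntity<>("Car is working fine.", HttpStatus.OK);
    }
}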

How to read Spring Boot application log files into Splunk? [closed]

I am looking to send log data from the application to Splunk. I came to know that there is nothing to do on the Spring side; Splunk just needs some configuration to read the application's log files. I want to know how we can make Splunk read the application's log files.
Please help me out with Splunk integration with Spring Boot. It would be great if you could provide code snippets or references.
In terms of integration, what are you after? Are you looking to bring data in from Splunk for use in your Spring Boot application, or are you looking to send data from your application into Splunk?
For logging into Splunk, I suggest you look at the following:
https://github.com/splunk/splunk-library-javalogging
https://docs.spring.io/autorepo/docs/spring-integration-splunk/0.5.x-SNAPSHOT/reference/htmlsingle/
https://github.com/barrycommins/spring-boot-splunk-sleuth-demo
If you are looking to interact with the Splunk application and run queries against it, look at the Splunk Java SDK, https://dev.splunk.com/enterprise/docs/java/sdk-java/howtousesdkjava/
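If your goal is simply to ship application logs to Splunk via the HTTP Event Collector, the splunk-library-javalogging project linked above also provides Logback/Log4j appenders. As a rough sketch only (the element names, HEC URL and token below are assumptions based on that library's documentation, so verify them against the version you use), a logback.xml could look like:
<configuration>
  <!-- Hypothetical HEC endpoint and token; both come from your Splunk instance -->
  <appender name="SPLUNK" class="com.splunk.logging.HttpEventCollectorLogbackAppender">
    <url>https://splunk-xx-test.abc.com:8088</url>
    <token>YOUR-HEC-TOKEN</token>
    <index>main</index>
    <source>my-spring-boot-app</source>
    <sourcetype>log4j</sourcetype>
    <layout class="ch.qos.logback.classic.PatternLayout">
      <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </layout>
  </appender>
  <root level="INFO">
    <appender-ref ref="SPLUNK" />
  </root>
</configuration>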
Here are the steps which I have followed to integrate Splunk successfully into my Spring Boot application:
Set up the repository in the pom.xml file by adding the following:
<repositories>
<repository>
<id>splunk-artifactory</id>
<name>Splunk Releases</name>
<url>https://splunk.jfrog.io/splunk/ext-releases-local</url>
</repository>
</repositories>
Add the Maven dependency for the Splunk jar within the dependencies tags, which will download and set up the Splunk jar file in the project (in my case the jar file is splunk-1.6.5.0.jar):
<dependency>
<groupId>com.splunk</groupId>
<artifactId>splunk</artifactId>
<version>1.6.5.0</version>
</dependency>
Configure and run the Splunk query from your controller / service / main class:
package com.my.test;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.splunk.Args;
import com.splunk.HttpService;
import com.splunk.Job;
import com.splunk.SSLSecurityProtocol;
import com.splunk.Service;
@SpringBootApplication
public class Main {
public static String username = "your username";
public static String password = "your password";
public static String host = "your splunk host url like - splunk-xx-test.abc.com";
public static int port = 8089;
public static String scheme = "https";
public static Service getSplunkService() {
HttpService.setSslSecurityProtocol(SSLSecurityProtocol.TLSv1_2);
Map<String, Object> connectionArgs = new HashMap<>();
connectionArgs.put("host", host);
connectionArgs.put("port", port);
connectionArgs.put("scheme", scheme);
connectionArgs.put("username", username);
connectionArgs.put("password", password);
Service splunkService = Service.connect(connectionArgs);
return splunkService;
}
/* Take the Splunk query as the argument and return the results as a JSON
string */
public static String getQueryResultsIntoJsonString(String query) throws IOException {
Service splunkService = getSplunkService();
Args queryArgs = new Args();
//set "from" time of query. 1 = from beginning
queryArgs.put("earliest_time", "1");
//set "to" time of query. now = till now
queryArgs.put("latest_time", "now");
Job job = splunkService.getJobs().create(query);
while(!job.isDone()) {
try {
Thread.sleep(500);
} catch(InterruptedException ex) {
ex.printStackTrace();
}
}
Args outputArgs = new Args();
//set format of result set as json
outputArgs.put("output_mode", "json");
//set offset of result set (how many records to skip from the beginning)
//Default is 0
outputArgs.put("offset", 0);
//set no. of records to get in the result set.
//Default is 100
//If you put 0 here then it would be set to "no limit"
//(i.e. get all records, don't truncate anything in the result set)
outputArgs.put("count", 0);
InputStream inputStream = job.getResults(outputArgs);
//Now read the InputStream of the result set line by line
//And return the final result into a JSON string
//I am using Jackson for JSON processing here,
//which is the default in Spring boot
BufferedReader in = new BufferedReader(new InputStreamReader(inputStream));
String resultString = null;
String aLine = null;
while((aLine = in.readLine()) != null) {
//Convert the line from String to JsonNode
ObjectMapper mapper = new ObjectMapper();
JsonNode jsonNode = mapper.readTree(aLine);
//Get the JsonNode with key "results"
JsonNode resultNode = jsonNode.get("results");
//Check if the resultNode is array
if (resultNode.isArray()) {
resultString = resultNode.toString();
}
}
return resultString;
}
/*Now run your Splunk query from the main method (or a RestController or a Service class)*/
public static void main(String[] args) {
try {
getQueryResultsIntoJsonString("search index=..."); //your Splunk query
} catch (IOException e) {
e.printStackTrace();
}
}
}

How to refresh the key and value in cache after they are expired in Guava (Spring)

So, I was looking at caching methods in Java (Spring), and Guava looked like it would serve the purpose.
This is the use case:
I query some data from a remote service, a kind of configuration field for my application. This field is used by every inbound request to my application, and it would be expensive to call the remote service every time, since it is essentially a constant that changes only periodically.
So, on the first request inbound to my application, when I call remote service, I would cache the value. I set an expiry time of this cache as 30 mins. After 30 mins when the cache is expired and there is a request to retrieve the key, I would like a callback or something to do the operation of calling the remote service and setting the cache and return the value for that key.
How can I do it in Guava cache?
Here is an example of how to use a Guava cache. If you want the removal listener to be triggered, you need to call cleanUp(); here I run a thread which calls cleanUp() every 30 minutes.
import com.google.common.cache.*;
import org.springframework.stereotype.Component;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
@Component
public class Cache {
public static LoadingCache<String, String> REQUIRED_CACHE;
public Cache(){
RemovalListener<String,String> REMOVAL_LISTENER = new RemovalListener<String, String>() {
@Override
public void onRemoval(RemovalNotification<String, String> notification) {
if(notification.getCause() == RemovalCause.EXPIRED){
//do as per your requirement
}
}
};
CacheLoader<String,String> LOADER = new CacheLoader<String, String>() {
@Override
public String load(String key) throws Exception {
return null; // return as per your requirement. if key value is not found
}
};
REQUIRED_CACHE = CacheBuilder.newBuilder().maximumSize(100000000)
.expireAfterWrite(30, TimeUnit.MINUTES)
.removalListener(REMOVAL_LISTENER)
.build(LOADER);
Executors.newSingleThreadExecutor().submit(()->{
while (true) {
REQUIRED_CACHE.cleanUp(); // need to call clean up for removal listener
TimeUnit.MINUTES.sleep(30L);
}
});
}
}
put & get data:
Cache.REQUIRED_CACHE.get("key");
Cache.REQUIRED_CACHE.put("key","value");
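As a hedged side note (the config-service URL and class name below are hypothetical): if all you need is "call the remote service again on the first get() after expiry", a LoadingCache whose CacheLoader.load() performs the remote call already gives you that, because an expired entry is reloaded on the next access; refreshAfterWrite is a further option if you prefer background-style refreshing over hard expiry. A minimal sketch:
import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;
@Component
public class ConfigCache {
    private final RestTemplate restTemplate = new RestTemplate();
    // After 30 minutes the entry expires; the next get() transparently
    // invokes load() again, which re-calls the remote service.
    private final LoadingCache<String, String> configCache = CacheBuilder.newBuilder()
            .expireAfterWrite(30, TimeUnit.MINUTES)
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) {
                    // Hypothetical remote endpoint; replace with your real service call.
                    return restTemplate.getForObject("https://config-service/config/{key}", String.class, key);
                }
            });
    public String getConfig(String key) {
        // getUnchecked wraps any loader exception in an UncheckedExecutionException.
        return configCache.getUnchecked(key);
    }
}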

Long-running AEM EventListener working inconsistently - blacklisted?

As always, AEM has brought new challenges to my life. This time, I'm experiencing an issue where an EventListener that listens for ReplicationEvents is working sometimes, and normally just the first few times after the service is restarted. After that, it stops running entirely.
The first line of the listener is a log line, so if it were running, that would be clear. Here's a simplified example of the listener:
@Component(immediate = true, metatype = false)
@Service(value = EventHandler.class)
@Property(
name="event.topics", value = ReplicationEvent.EVENT_TOPIC
)
public class MyActivityReplicationListener implements EventHandler {
@Reference
private SlingRepository repository;
@Reference
private OnboardingInterface onboardingService;
@Reference
private QueryInterface queryInterface;
private Logger log = LoggerFactory.getLogger(this.getClass());
private Session session;
@Override
public void handleEvent(Event ev) {
log.info(String.format("Starting %s", this.getClass()));
// Business logic
log.info(String.format("Finished %s", this.getClass()));
}
}
Now before you panic that I haven't included the business logic, see my answer below. The main point of interest is that the business logic could take a few seconds.
While crawling through the second page of Google search results for an answer, I came across this article: a German article explaining that EventListeners that take more than 5 seconds to finish are silently quarantined by AEM with no output.
It just so happens that this task might take longer than 5 seconds, as it's working off data that was originally quite small, but has grown (and this is in line with other symptoms).
I put a change in that makes the listener much more like the one in that article - that is, it uses an EventConsumer to asynchronously process the ReplicationEvent using a pub/sub model. Here's a simplified version of the new model (for AEM 6.3):
@Component(immediate = true, property = {
EventConstants.EVENT_TOPIC + "=" + ReplicationEvent.EVENT_TOPIC,
JobConsumer.PROPERTY_TOPICS + "=" + AsyncReplicationListener.JOB_TOPIC
})
public class AsyncReplicationListener implements EventHandler, JobConsumer {
private static final String PROPERTY_EVENT = "event";
static final String JOB_TOPIC = ReplicationEvent.EVENT_TOPIC;
@Reference
private JobManager jobManager;
@Override
public JobConsumer.JobResult process (Job job) {
try {
ReplicationEvent event = (ReplicationEvent)job.getProperty(PROPERTY_EVENT);
// Slow business logic (>5 seconds)
} catch (Exception e) {
return JobResult.FAILED;
}
return JobResult.OK ;
}
@Override
public void handleEvent(Event event) {
final Map <String, Object> payload = new HashMap<>();
payload.put(PROPERTY_EVENT, ReplicationEvent.fromEvent(event));
final Job addJobResult = jobManager.addJob(JOB_TOPIC , payload);
}
}
You can see here that the EventListener passes off the ReplicationEvent wrapped up in a Job, which is then handled by the JobConsumer, which according to this magic article, is not subject to the 5 second rule.
Here is some official documentation on this time limit. Once I had the "5 seconds" keyword, I was able to find a bit more information, here and here, that talks about the 5 second limit as well. The first article uses a similar method to the one above, and the second article shows a way to turn off these time limits.
The time limits can be disabled entirely (or increased) in the configMgr by setting the Timeout property to zero in the Apache Felix Event Admin Implementation configuration.
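For completeness, such an OSGi configuration could also be deployed as a config file, for example /apps/myproject/config/org.apache.felix.eventadmin.impl.EventAdmin.config (the path, PID and property name here are assumptions based on the Felix Event Admin documentation, so verify them in /system/console/configMgr first; 0 disables the timeout, any positive value is a timeout in milliseconds):
org.apache.felix.eventadmin.Timeout=I"0"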

Wicket cluster session store, page store, data store

I am dealing with a custom implementation of the Wicket session store, data store, and page store. I have to cluster Wicket and make it work in the following situation:
There are 2 nodes in the cluster; node one fails and the user should be able to continue the flow without noticing. The pages are stateful, with a lot of Ajax requests. For now I'm storing the Wicket session in a custom storage over RMI, and I'm trying to extend the DiskPageStore. The new challenge is the SessionEntry inner class; it is still held by a ConcurrentMap.
My question is: Has anyone done this before? Do you have any suggestions on how to accomplish this?
My suggestion is to forget about DiskPageStore and SessionEntry in your situation. The ConcurrentMap you mentioned is held in the local heap. Once one of the nodes fails, there is no way to access its ConcurrentMap, and Wicket resources referred to from that ConcurrentMap will be impossible to release.
Therefore, in a clustered environment, you need to cluster the Wicket page store. Page versions can be expired based on certain policy, or deliberately removed when their corresponding session expires.
I've enabled web session and data store clustering for Apache Wicket used in an enterprise web application in production, and it has been working very well. The software I use is:
JDK 1.8.0_60
Apache Tomcat 8.0.33 (Tomcat 7 works too)
Wicket 6.16 (versions 6.22.0 and 7.2.0 should also work)
Apache Ignite 1.7.0
Load balancer: Crossroads
Ubuntu 14.04.1
The idea is to use Apache Ignite for web session clustering, and it is pretty straightforward following its instructions for Web Session Clustering.
Once I got the web session clustered, I then put the data store (which includes the page store already) into the Ignite distributed data grid, while at the same time I disabled the Wicket application scoped cache (so as to make sure all data is clustered). Take a look at the documentation on Wicket's page store to find out how to configure the data store.
Alternatively, you should be able to use Wicket's HttpSessionDataStore to put the data store into the session. As the session is clustered, the data store is clustered automatically. But this approach did not work well with Apache Ignite for me, so I use my own implementation of the IDataStore interface, which puts the data store into the Ignite distributed data grid. See the implementation below.
import java.util.concurrent.TimeUnit;
import javax.cache.expiry.Duration;
import javax.cache.expiry.TouchedExpiryPolicy;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMemoryMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.eviction.lru.LruEvictionPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.wicket.pageStore.IDataStore;
import org.apache.wicket.pageStore.memory.IDataStoreEvictionStrategy;
import org.apache.wicket.pageStore.memory.PageTable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class IgniteDataStore implements IDataStore {
private static final Logger log = LoggerFactory.getLogger(IgniteDataStore.class);
private final IDataStoreEvictionStrategy evictionStrategy;
private Ignite ignite;
IgniteCache<String, PageTable> igniteCache;
public IgniteDataStore(IDataStoreEvictionStrategy evictionStrategy) {
this.evictionStrategy = evictionStrategy;
CacheConfiguration<String, PageTable> cacheCfg = new CacheConfiguration<String, PageTable>("wicket-data-store");
cacheCfg.setCacheMode(CacheMode.PARTITIONED);
cacheCfg.setBackups(1);
cacheCfg.setMemoryMode(CacheMemoryMode.OFFHEAP_VALUES);
cacheCfg.setOffHeapMaxMemory(2 * 1024L * 1024L * 1024L); // 2 Gigabytes.
cacheCfg.setEvictionPolicy(new LruEvictionPolicy<String, PageTable>(10000));
cacheCfg.setExpiryPolicyFactory(TouchedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 14400)));
log.info("IgniteDataStore timeout is set to 14400 seconds.");
ignite = Ignition.ignite();
igniteCache = ignite.getOrCreateCache(cacheCfg);
}
@Override
public synchronized byte[] getData(String sessionId, int id) {
PageTable pageTable = getPageTable(sessionId, false);
byte[] pageAsBytes = null;
if (pageTable != null) {
pageAsBytes = pageTable.getPage(id);
}
return pageAsBytes;
}
@Override
public synchronized void removeData(String sessionId, int id) {
PageTable pageTable = getPageTable(sessionId, false);
if (pageTable != null) {
pageTable.removePage(id);
}
}
@Override
public synchronized void removeData(String sessionId) {
PageTable pageTable = getPageTable(sessionId, false);
if (pageTable != null) {
pageTable.clear();
}
igniteCache.remove(sessionId);
}
@Override
public synchronized void storeData(String sessionId, int id, byte[] data) {
PageTable pageTable = getPageTable(sessionId, true);
if (pageTable != null) {
pageTable.storePage(id, data);
evictionStrategy.evict(pageTable);
igniteCache.put(sessionId, pageTable);
} else {
log.error("Cannot store the data for page with id '{}' in session with id '{}'", id, sessionId);
}
}
@Override
public synchronized void destroy() {
igniteCache.clear();
}
@Override
public boolean isReplicated() {
return true;
}
@Override
public boolean canBeAsynchronous() {
return false;
}
private PageTable getPageTable(String sessionId, boolean create) {
if (igniteCache.containsKey(sessionId)) {
return igniteCache.get(sessionId);
}
if (!create) {
return null;
}
PageTable pageTable = new PageTable();
igniteCache.put(sessionId, pageTable);
return pageTable;
}
}
Hope it helps.
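In case it helps, here is a brief sketch (not part of the original answer; MyClusteredApplication and HomePage are hypothetical names) of how such an IDataStore can be plugged in from the Wicket application's init(), assuming Wicket 6/7's DefaultPageManagerProvider:
import org.apache.wicket.DefaultPageManagerProvider;
import org.apache.wicket.Page;
import org.apache.wicket.pageStore.IDataStore;
import org.apache.wicket.pageStore.memory.PageNumberEvictionStrategy;
import org.apache.wicket.protocol.http.WebApplication;
public class MyClusteredApplication extends WebApplication {
    @Override
    public Class<? extends Page> getHomePage() {
        return HomePage.class; // hypothetical home page class
    }
    @Override
    protected void init() {
        super.init();
        // Swap the default disk-based data store for the Ignite-backed one,
        // leaving the rest of the page manager chain unchanged.
        setPageManagerProvider(new DefaultPageManagerProvider(this) {
            @Override
            protected IDataStore newDataStore() {
                // Evict once a session holds more than 20 stored pages (illustrative number).
                return new IgniteDataStore(new PageNumberEvictionStrategy(20));
            }
        });
    }
}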
