AWS Lambda Java - Implement a simple cache to read a file - caching

I have a Lambda process in Java that reads a JSON file containing a table every time it is triggered. I'd like to implement some kind of cache to keep that file in memory, and I wonder how to do something simple. I don't want to use ElastiCache or Redis.
I read about a similar approach in JavaScript that declares a global variable with let, but I'm not sure how to do that in Java, where it should be declared, and how to test it. Any idea or example you can provide? Thanks

There are global variables in Lambda which can be of help, but they have to be used wisely.
They are usually the variables declared outside of the handler method.
There are pros and cons to using them.
You can't rely on this behavior, but you must be aware it exists: when you call your Lambda function several times, you MIGHT get the same container, which optimizes run duration and setup delay (Use of Global Variables).
At the same time you should be aware of the issues and avoid wrong use of it (caching issues).
If you don't want to use ElastiCache/Redis, then you have very few options left; maybe DynamoDB or S3, that's all I can think of.
Again, the connection to DynamoDB or S3 can be cached here. It won't be as fast as ElastiCache, though.
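Since the question is about a JSON file, here is a minimal sketch of that idea in Java - loading the table from S3 once per container and keeping it in a static field. The bucket/key names, the Map<String, String> shape of the file, and the handler signature are assumptions for illustration, not something from the question:
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CachedTableHandler implements RequestHandler<Map<String, String>, String> {

    // Created once per container; reused across warm invocations.
    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static Map<String, String> table; // the cached JSON table

    @Override
    public String handleRequest(Map<String, String> input, Context context) {
        try {
            if (table == null) { // lazy load on cold start only
                // "my-bucket" and "table.json" are hypothetical names
                String json = S3.getObjectAsString("my-bucket", "table.json");
                table = MAPPER.readValue(json, new TypeReference<Map<String, String>>() {});
            }
            return table.get(input.get("requestedKey"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}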

In Java it's not too hard to do. Just create your cache outside of the handler:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.amazonaws.services.lambda.runtime.RequestStreamHandler;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SampleHandler implements RequestStreamHandler {

    private static final Logger logger = LogManager.getLogger(SampleHandler.class);
    private static Map<String, String> theCache = null;

    public SampleHandler() {
        logger.info("filling cache...");
        theCache = new HashMap<>();
        theCache.put("key1", "value1");
        theCache.put("key2", "value2");
        theCache.put("key3", "value3");
        theCache.put("key4", "value4");
        theCache.put("key5", "value5");
    }

    public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context) throws IOException {
        logger.info("handlingRequest");
        LambdaLogger lambdaLogger = context.getLogger();
        ObjectMapper objectMapper = new ObjectMapper();
        JsonNode jsonNode = objectMapper.readTree(inputStream);
        String requestedKey = jsonNode.get("requestedKey").asText();
        if (theCache.containsKey(requestedKey)) {
            // read from the cache
            String result = "{\"requestedValue\": \"" + theCache.get(requestedKey) + "\"}";
            outputStream.write(result.getBytes());
        }
        logger.info("done with run, remaining time in ms is " + context.getRemainingTimeInMillis());
    }
}
(run with the AWS CLI with aws lambda invoke --function-name lambda-cache-test --payload '{"requestedKey":"key4"}' out, with the output going to the file out)
When this runs with a "cold start" you'll see the "filling cache..." message and then "handlingRequest" in the CloudWatch log. As long as the Lambda is kept "warm" you will not see the cache message again.
Note that if you had hundreds of the same Lambdas running, they would each have their own independent cache. Ultimately this does what you want, though - it's a lazy load of the cache during a cold start, and the cache is reused for warm calls.

Related

How to distribute workload to many compute nodes and do scatter-gather scenarios with Kafka Streams?

I am new to Kafka Streams and Alpakka Kafka.
Problem: I have been using the Java ExecutorService to run parallel jobs and, when ALL of them are done, marking the entire process done. The issues are fault tolerance, high availability, and not utilizing all compute nodes to do the work. It is using just ONE host JVM to do the work.
We have Apache Kafka as infrastructure, so I was wondering how I can use Kafka Streams to do scatter-gather, or just a child-task execution use case, to distribute the workload and then gather the results or get an indication that all tasks are done.
Any pointer to sample work on scatter-gather or fork-join with Kafka Streams or Alpakka Kafka would be great.
Here is a Sample:
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Main {

    private static final ExecutorService executorService = Executors.newFixedThreadPool(15);

    public static void main(String[] args) throws Exception {
        final WebClient webClient = WebClient.builder().build();
        List<CompletableFuture<String>> allTasks = new LinkedList<>();
        String[] urls = {"http://test1", "http://test2", "http://test3"};
        // Distribute the work (WebClient can do async, but I wanted to just give an example).
        for (final String url : urls) {
            CompletableFuture<String> task = CompletableFuture.supplyAsync(() -> {
                // Some task; just as an example I have put a GET call, it could be anything
                String response = webClient.get().uri(url).accept(MediaType.APPLICATION_JSON)
                        .retrieve().bodyToMono(String.class).block();
                return response;
            }, executorService);
            allTasks.add(task);
        }
        // Wait for all to be done (join)
        CompletableFuture.allOf(allTasks.toArray(new CompletableFuture[]{})).join();
        for (CompletableFuture<String> task : allTasks) {
            processResponse(task.get());
        }
        // Shut down the pool so the JVM can exit once all tasks are processed
        executorService.shutdown();
    }

    public static void processResponse(String response) {
        System.out.println(response);
    }
}

Throw not found exception if Pub/Sub topic is not available

I am using Spring Boot to interact with a Pub/Sub topic.
My config class for this connection looks like this:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate;
import org.springframework.cloud.gcp.pubsub.core.publisher.PubSubPublisherTemplate;
import org.springframework.cloud.gcp.pubsub.support.PublisherFactory;
import org.springframework.cloud.gcp.pubsub.support.converter.SimplePubSubMessageConverter;
import org.springframework.util.Assert;
import org.springframework.util.concurrent.ListenableFuture;
import org.springframework.util.concurrent.SettableListenableFuture;

import com.google.api.core.ApiFuture;
import com.google.api.core.ApiFutureCallback;
import com.google.api.core.ApiFutures;
import com.google.pubsub.v1.PubsubMessage;

public abstract class PubSubPublisher {

    private static final Logger LOGGER = LoggerFactory.getLogger(PubSubPublisher.class);

    private final PubSubTemplate pubSubTemplate;

    protected PubSubPublisher(PubSubTemplate pubSubTemplate) {
        this.pubSubTemplate = pubSubTemplate;
    }

    protected abstract String topic(String topicName);

    public ListenableFuture<String> publish(String topicName, String message) {
        LOGGER.info("Publishing to topic [{}]. Message: [{}]", topicName, message);
        return pubSubTemplate.publish(topicName, message);
    }
}
And I am calling this in my service, like this:
publisher.publish(topic-name, payload);
This publish method is an async one, which always passes on without waiting for the acknowledgment. I can add get() after publish to wait until it gets the response from Pub/Sub.
But I wanted to know: in case my topic is not already present and I try to push some message, it should throw some error like resource not found, considering I am using only the default async method.
Maybe implementing the callback would help, but I am unable to do that in my code. And the current overridden publish method which uses the callback just throws a WARN, not an exception; I wanted that to be an exception. That is the reason I wanted to implement the callback.
You can check if the Topic is already present:
from google.cloud import pubsub_v1

project_id = "projectname"
topic_name = "unknowTopic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

try:
    response = publisher.get_topic(topic_path)
except Exception as e:
    print(e)
This returns the error as
404 Resource not found (resource=unknowTopic).
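The same existence check in Java (a sketch, assuming the google-cloud-pubsub client library is on the classpath; the project and topic IDs are the placeholders from the snippet above):
import com.google.api.gax.rpc.NotFoundException;
import com.google.cloud.pubsub.v1.TopicAdminClient;
import com.google.pubsub.v1.TopicName;

public class TopicCheck {
    public static void main(String[] args) throws Exception {
        try (TopicAdminClient topicAdminClient = TopicAdminClient.create()) {
            try {
                // Throws NotFoundException if the topic does not exist
                topicAdminClient.getTopic(TopicName.of("projectname", "unknowTopic"));
            } catch (NotFoundException e) {
                // 404 Resource not found (resource=unknowTopic)
                System.out.println(e.getMessage());
            }
        }
    }
}
Alternatively, calling .get() on the ListenableFuture returned by publish(...) in the question's code will surface the NOT_FOUND error as an ExecutionException instead of only a WARN log.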

How to pass region as query param for an AWS Java Lambda function

I'm new to AWS Lambda. I'm creating an EC2 instance using an AWS Java Lambda function, in which I'm trying to pass the region dynamically using API Gateway.
I'm passing the region as a query param string. I'm not sure how to get the query param inside the Lambda function. I have gone through the questions asked similar to this but am unable to understand how to implement that.
Please find the Java Lambda function below:
package com.amazonaws.lambda.demo;

import java.util.List;

import org.json.simple.JSONObject;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CreateTagsRequest;
import com.amazonaws.services.ec2.model.DescribeInstanceStatusRequest;
import com.amazonaws.services.ec2.model.DescribeInstanceStatusResult;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.InstanceNetworkInterfaceSpecification;
import com.amazonaws.services.ec2.model.InstanceStatus;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;
import com.amazonaws.services.ec2.model.StartInstancesRequest;
import com.amazonaws.services.ec2.model.Tag;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class LambdaFunctionHandler implements RequestHandler<S3Event, String> {

    private static final AWSCredentials AWS_CREDENTIALS;
    static String ACCESS_KEY = "XXXXXXXXXX";
    static String SECRET_KEY = "XXXXXXXXXXXXXXXX";

    static {
        // Your accesskey and secretkey
        AWS_CREDENTIALS = new BasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
    }

    private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

    public LambdaFunctionHandler() {}

    // Test purpose only.
    LambdaFunctionHandler(AmazonS3 s3) {
        this.s3 = s3;
    }

    @Override
    public String handleRequest(S3Event event, Context context) {
        context.getLogger().log("Received event: " + event);
        // Set up the Amazon EC2 client
        AmazonEC2 ec2Client = AmazonEC2ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(AWS_CREDENTIALS))
                .withRegion(Regions.EU_CENTRAL_1)
                .build();
        // Launch an Amazon EC2 instance
        RunInstancesRequest runInstancesRequest = new RunInstancesRequest().withImageId("ami-XXXX")
                .withInstanceType("t2.micro") // https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
                .withMinCount(1)
                .withMaxCount(1)
                .withKeyName("KEY")
                .withNetworkInterfaces(new InstanceNetworkInterfaceSpecification()
                        .withAssociatePublicIpAddress(true)
                        .withDeviceIndex(0)
                        .withSubnetId("subnet-XXX")
                        .withGroups("sg-XXXX"));
        RunInstancesResult runInstancesResult = ec2Client.runInstances(runInstancesRequest);
        Instance instance = runInstancesResult.getReservation().getInstances().get(0);
        String instanceId = instance.getInstanceId();
        String instanceip = instance.getPublicIpAddress();
        System.out.println("EC2 Instance Id: " + instanceId);
        // Setting up the tags for the instance
        CreateTagsRequest createTagsRequest = new CreateTagsRequest()
                .withResources(instance.getInstanceId())
                .withTags(new Tag("Name", "SampleLambdaEc2"));
        ec2Client.createTags(createTagsRequest);
        // Starting the instance
        StartInstancesRequest startInstancesRequest = new StartInstancesRequest().withInstanceIds(instanceId);
        ec2Client.startInstances(startInstancesRequest);
        /*// Stopping the instance
        StopInstancesRequest stopInstancesRequest = new StopInstancesRequest()
                .withInstanceIds(instanceId);
        ec2Client.stopInstances(stopInstancesRequest);*/
        // Describing the instance
        DescribeInstanceStatusRequest describeInstanceRequest = new DescribeInstanceStatusRequest().withInstanceIds(instanceId);
        DescribeInstanceStatusResult describeInstanceResult = ec2Client.describeInstanceStatus(describeInstanceRequest);
        List<InstanceStatus> state = describeInstanceResult.getInstanceStatuses();
        while (state.size() < 1) {
            // Do nothing, just wait, have thread sleep if needed
            describeInstanceResult = ec2Client.describeInstanceStatus(describeInstanceRequest);
            state = describeInstanceResult.getInstanceStatuses();
        }
        String status = state.get(0).getInstanceState().getName();
        System.out.println("status" + status);
        JSONObject response = new JSONObject();
        response.put("instanceip", instanceip);
        response.put("instancestatus", status);
        System.out.println("response=>" + response);
        return response.toString();
    }
}
I would like to pass the query param instead of hard-coding Regions.EU_CENTRAL_1:
// Set up the Amazon EC2 client
AmazonEC2 ec2Client = AmazonEC2ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(AWS_CREDENTIALS))
        .withRegion(Regions.EU_CENTRAL_1)
        .build();
Please find the API configuration below:
Any advice on how to achieve that would be really helpful. Thanks in advance.
How are you?
If you want to get those query parameters, you may want to use Lambda Proxy Integration.
That way, your function will receive an APIGatewayProxyRequestEvent, on which you can call the Map<String, String> getQueryStringParameters() operation.
You'll need to declare your handler like:
public class APIGatewayHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
    /// You cool awesome code here!
}
That way, your method will look like:
@Override
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
    Map<String, String> params = event.getQueryStringParameters();
    Optional<String> region = Optional.ofNullable(params.get("region"));
    // Create EC2 instance. You may need to parse that region string to AWS Region object.
}
Let me know if that works for you!
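For the "parse that region string" comment, here is a small sketch with the v1 SDK - the query parameter name region and the eu-central-1 fallback are assumptions:
import java.util.Map;
import java.util.Optional;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;

public class Ec2ClientFactory {

    public static AmazonEC2 clientFor(APIGatewayProxyRequestEvent event) {
        // The query string can be absent entirely, so guard against null before reading the key.
        Map<String, String> params = event.getQueryStringParameters();
        String regionName = Optional.ofNullable(params)
                .map(p -> p.get("region"))   // query parameter name is an assumption
                .orElse("eu-central-1");     // fallback region is an assumption
        // Regions.fromName expects the lowercase region name, e.g. "eu-central-1".
        return AmazonEC2ClientBuilder.standard()
                .withRegion(Regions.fromName(regionName))
                .build();
    }
}
Add your credentials provider to the builder as in the question's code if you are not relying on the Lambda execution role.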

Access Data from REST API in HIVE

Is there a way to create a Hive table where the location for that table is an HTTP JSON REST API? I don't want to import the data into HDFS every time.
I encountered a similar situation in a project a couple of years ago. This is a low-key way of ingesting data from a RESTful service into HDFS, and then you use Hive analytics to implement the business logic. I hope you are familiar with core Java and MapReduce (if not, you might look into Hortonworks Data Flow, HDF, which is a product of Hortonworks).
Step 1: Your data ingestion workflow should not be tied to your Hive workflow that contains the business logic. It should be executed independently, in a timely manner based on your requirements (volume and velocity of data flow), and monitored regularly. I am writing this code in a text editor. WARN: It's not compiled or tested!!
The code below uses a Mapper which takes in the URL (or tweak it to accept a list of URLs from the FS). The payload, or requested data, is stored as a text file in the specified job output directory (forget about the structure of the data for now).
Mapper Class:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopHttpClientMap extends Mapper<LongWritable, Text, Text, Text> {

    private int file = 0;
    private String jobOutDir;
    private String taskId;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Resolve the job output directory and a per-task id so each mapper writes unique files
        jobOutDir = FileOutputFormat.getOutputPath(context).toString();
        taskId = context.getTaskAttemptID().toString();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Path httpDest = new Path(jobOutDir, taskId + "_http_" + (file++));
        InputStream is = null;
        OutputStream os = null;
        URLConnection connection;
        try {
            connection = new URL(value.toString()).openConnection();
            // implement connection timeout logics
            // authenticate.. etc
            is = connection.getInputStream();
            // Write the fetched payload into the job output directory on HDFS
            os = FileSystem.get(context.getConfiguration()).create(httpDest, true);
            IOUtils.copyBytes(is, os, context.getConfiguration(), true);
        } catch (Throwable t) {
            t.printStackTrace();
        } finally {
            IOUtils.closeStream(is);
            IOUtils.closeStream(os);
        }
        context.write(value, null);
        // context.write(new Text(httpDest.getName()), new Text(os.toString()));
    }
}
Mapper Only Job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopHttpClientJob {

    private static final String data_input_directory = "YOUR_INPUT_DIR";
    private static final String data_output_directory = "YOUR_OUTPUT_DIR";

    public HadoopHttpClientJob() {
    }

    public static void main(String... args) {
        try {
            Configuration conf = new Configuration();
            Path test_data_in = new Path(data_input_directory, "urls.txt");
            Path test_data_out = new Path(data_output_directory);

            @SuppressWarnings("deprecation")
            Job job = new Job(conf, "HadoopHttpClientMap" + System.currentTimeMillis());
            job.setJarByClass(HadoopHttpClientJob.class);

            FileSystem fs = FileSystem.get(conf);
            fs.delete(test_data_out, true);

            job.setMapperClass(HadoopHttpClientMap.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setNumReduceTasks(0);

            FileInputFormat.addInputPath(job, test_data_in);
            FileOutputFormat.setOutputPath(job, test_data_out);
            job.waitForCompletion(true);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
}
Step 2: Create an external table in Hive based on the HDFS directory. Remember to use a Hive SerDe for the JSON data (in your case); then you can copy the data from the external table into managed master tables. This is the step where you implement your incremental logic, compression, etc.
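If you prefer to drive Step 2 from Java as well, here is a hedged sketch via Hive JDBC - the HiveServer2 URL, table columns, SerDe class, and LOCATION path are assumptions and must match your cluster and the ingestion output directory from Step 1 (the SerDe jar, e.g. hive-hcatalog-core, must be on Hive's classpath, and the hive-jdbc driver on the client's):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalJsonTable {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; adjust host/port/database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Columns and paths below are placeholders; adjust to the shape of your JSON payload.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS raw_api_data (id STRING, payload STRING) "
                + "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' "
                + "LOCATION '/data/api_ingest/'");
        }
    }
}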
Step 3: Point your Hive queries (which you might have already created) at the master table to implement your business needs.
Note: If you are actually referring to real-time analysis or a streaming API, you might have to change your application's architecture. Since you have asked an architectural question, I am using my best educated guess to support you. Please go through this once; if you feel you can implement it in your application, you can then ask specific questions and I will try my best to address them.

Route lines from file to persistent JMS queue: How to improve performance?

I need some help with performance tuning of a use case. In this use case, the Camel route tails status lines in a log file and sends each line as a message to a JMS queue. I have implemented the use case like this:
package tests;

import java.io.File;
import java.net.URI;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.broker.BrokerFactory;
import org.apache.activemq.broker.BrokerService;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.sjms.SjmsComponent;
import org.apache.camel.main.Main;

public class LinesToQueue {

    public static void main(String[] args) throws Exception {
        final File file = new File("data/log.txt");
        final String uri = "tcp://127.0.0.1:61616";

        final BrokerService jmsService = BrokerFactory.createBroker(new URI("broker:" + uri));
        jmsService.start();

        final SjmsComponent jmsComponent = new SjmsComponent();
        jmsComponent.setConnectionFactory(new ActiveMQConnectionFactory(uri));

        final Main main = new Main();
        main.bind("jms", jmsComponent);
        main.addRouteBuilder(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                fromF("stream:file?fileName=%s&scanStream=true&scanStreamDelay=0", file.getAbsolutePath())
                        .routeId("LinesToQueue")
                        .to("jms:LogLines?synchronous=false");
            }
        });
        main.enableHangupSupport();
        main.run();
    }
}
When I run this use case with a file already filled with 1,000,000 lines, the overall throughput I get in the route is about 313 lines/second. This means that it takes about 55 minutes to process the file.
As some sort of reference I have also created another use case. In this one the Camel route tails status lines in a log file and sends each line as a document to an Elasticsearch index. I have implemented it like this:
package tests;

import java.io.File;

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class LinesToIndex {

    public static void main(String[] args) throws Exception {
        final File file = new File("data/log.txt");
        final String uri = "local";

        final Main main = new Main();
        main.addRouteBuilder(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                fromF("stream:file?fileName=%s&scanStream=true&scanStreamDelay=0", file.getAbsolutePath())
                        .routeId("LinesToIndex")
                        .bean(new LineConverter())
                        .toF("elasticsearch://%s?operation=INDEX&indexName=log&indexType=line", uri);
            }
        });
        main.enableHangupSupport();
        main.run();
    }
}
When I run this use case with a file already filled with 1,000,000 lines, the overall throughput I get in the route is about 8333 lines/second. This means that it takes about 2 minutes to process the file.
I understand that there is a huge difference between a JMS queue and an Elasticsearch index, but how can I make the JMS use case above perform better?
Update #1:
It seems that the persistence in the JMS service is the bottleneck in my first use case above. If I disable persistence in the JMS service, the throughput in the route is about 11111 lines/second. Which persistence store for the JMS service will give me better performance?
A couple of things to consider...
ActiveMQ producer connections are expensive; make sure you use a pooled connection factory (see the sketch after this list).
Consider using the VM transport for an in-process ActiveMQ instance.
Consider using an external ActiveMQ broker over TCP (so it doesn't compete for resources with your test).
Set up/tune KahaDB or LevelDB to optimize persistent storage for your use case.
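A minimal sketch of the pooled-connection-factory point, assuming the activemq-pool dependency is on the classpath; the broker URL is the one from the question:
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;
import org.apache.camel.component.sjms.SjmsComponent;

public class PooledJmsComponentFactory {

    public static SjmsComponent pooledJmsComponent(String brokerUri) {
        // Wrap the plain connection factory in a pool so producers reuse connections/sessions
        // instead of opening a new TCP connection per exchange.
        PooledConnectionFactory pooled =
                new PooledConnectionFactory(new ActiveMQConnectionFactory(brokerUri));
        pooled.setMaxConnections(8); // tune for your workload

        SjmsComponent jms = new SjmsComponent();
        jms.setConnectionFactory(pooled);
        return jms;
    }
}
You could then register it with main.bind("jms", pooledJmsComponent(uri)); in place of the plain SjmsComponent from the first use case.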
