DynamoDB Stream - Lambda to process formula - aws-lambda

I've got a DynamoDB Table that contains attributes similar to:
{
"pk": "pk1",
"values": {
"v2": 5,
"v1": 90
},
"formula": "(v1 + v2) / 100",
"calc": 5.56
}
I have a Lambda that is triggered by the DDB Stream. Is there any way to calculate the "calc" attribute based on the formula and the values? Ideally I'd like to do it during the update_item call that updates this table every time the stream sends a message.

Your Lambda function receives the stream event and can process each record like this:
def lambda_handler(event, context):
    records = event['Records']
    for record in records:
        new_record = record['dynamodb']['NewImage']
        calc = new_record.get('calc')
        # do your stuff here
        calc = some_functions()
    return event
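If the goal is to compute "calc" from the formula and the values and write it back, here is a minimal sketch (assuming boto3, a hypothetical table name "my-table", and that the formula only references names present in "values"):

import boto3
from decimal import Decimal

table = boto3.resource('dynamodb').Table('my-table')  # hypothetical table name

def lambda_handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'REMOVE':
            continue
        new_image = record['dynamodb']['NewImage']
        # Stream images use DynamoDB JSON, e.g. {"v1": {"N": "90"}}
        formula = new_image['formula']['S']
        values = {k: float(v['N']) for k, v in new_image['values']['M'].items()}
        # Evaluate the formula with only the value names in scope.
        # NOTE: eval on untrusted input is unsafe; a small expression parser
        # (e.g. the simpleeval package) would be a better fit in production.
        calc = eval(formula, {"__builtins__": {}}, values)
        # Only write back when calc actually changed, otherwise the update_item
        # call re-triggers the stream and the function loops on its own writes.
        prev = new_image.get('calc', {}).get('N')
        if prev is None or float(prev) != calc:
            table.update_item(
                Key={'pk': new_image['pk']['S']},
                UpdateExpression='SET calc = :c',
                ExpressionAttributeValues={':c': Decimal(str(calc))},
            )
    return event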

Related

Event Hub Throttling with the error: request was terminated because the entity is being throttled. Error code : 50002. Sub error : 102

I am using Databricks Labs Data Generator to send synthetic data to Event Hub.
Everything appears to be working fine for about two minutes, but then the streaming stops with the following error:
The request was terminated because the entity is being throttled. Error code: 50002. Sub error: 102.
Can someone let me know how to adjust the throttling?
The code I'm using to send data to Event Hub is as follows:
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]

flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
    #.withColumn("body", StringType(), False)
    .withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
    .withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
    .withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
    .withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
    .withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60)", baseColumn=["original_departure", "delay_minutes"])
    .withColumn("reason", "string", values=delay_reasons, random=True)
)

df_flight_data = flightdata_defn.build(withStreaming=True, options={'rowsPerSecond': 100})
streamingDelays = (
    df_flight_data
    .groupBy(
        #df_flight_data.body,
        df_flight_data.flightNumber,
        df_flight_data.airline,
        df_flight_data.original_departure,
        df_flight_data.delay_minutes,
        df_flight_data.delayed_departure,
        df_flight_data.reason,
        window(df_flight_data.original_departure, "1 hour")
    )
    .count()
)
writeConnectionString = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
checkpointLocation = "///checkpoint"

# ehWriteConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# ehWriteConf = {
#     'eventhubs.connectionString' : writeConnectionString
# }

ehWriteConf = {
    'eventhubs.connectionString': writeConnectionString
}
Write body data from a DataFrame to Event Hubs. Events are distributed across partitions using a round-robin model.
ds = streamingDelays \
    .select(F.to_json(F.struct("*")).alias("body")) \
    .writeStream.format("eventhubs") \
    .options(**ehWriteConf) \
    .outputMode("complete") \
    .option("checkpointLocation", "...") \
    .start()
I forgot to mention that I have 1 TU (throughput unit).
This is the usual traffic throttling from Event Hubs. Take a look at the limits for 1 TU: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas. You can increase the number of TUs to 2 and go from there.
If you think the throttling is unexpected, open a support ticket for the issue.
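If upgrading TUs isn't an option right away, one thing worth trying (an assumption on my part, not something from the original post) is to slow the write side down so each micro-batch stays under the 1 TU ingress limit (roughly 1 MB/s or 1000 events/s), e.g. with a processing-time trigger:

from pyspark.sql import functions as F

ds = (streamingDelays
      .select(F.to_json(F.struct("*")).alias("body"))
      .writeStream.format("eventhubs")
      .options(**ehWriteConf)
      .outputMode("complete")
      .option("checkpointLocation", "/tmp/eh-checkpoint")  # hypothetical path
      .trigger(processingTime="30 seconds")  # fewer, larger batches
      .start())

Lowering rowsPerSecond in the data generator's build() options would have a similar effect on the source side.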

Google Sheets API inserting records twice (duplicated records)

I'm currently working on a Ruby script application using Google::Apis::Sheets. I'm encountering an issue and trying to figure out where it comes from: when data is appended to the Google Sheet, it is inserted twice. For example, if 50 records are passed to be appended, 100 records are created.
# data should be an array of arrays
class Sheets
  # some code here

  def append(data)
    # Initialize the API
    service = Google::Apis::SheetsV4::SheetsService.new
    service.client_options.application_name = APPLICATION_NAME
    service.authorization = authorize

    # Prints the names and majors of students in a sample spreadsheet:
    # https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/edit
    spreadsheet_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    range = "Sheet1!A:C"

    request_body = Google::Apis::SheetsV4::ValueRange.new
    request_body.values = data

    service.append_spreadsheet_value(spreadsheet_id, range, request_body, value_input_option: 'USER_ENTERED')
  end
end
The append_spreadsheet_value method above appends the values to the spreadsheet. I'm trying to figure out whether the error comes from the range or from request_body when data is assigned to request_body.values.
The append method is called from another file, reporting.rb, which is hosted on AWS Lambda. That file contains the following script.
def self.process(event:, context:, db: MySQL.new)
  FileUtils.cp('./token.yaml', '/tmp/token.yaml')

  last_day = db.get_last_day()
  sheets = Sheets.new
  data = []

  last_day.each do |row|
    data.push([row["created_at"].strftime("%Y-%m-%d"), row["has_email"], row["type"]])
  end

  sheets.append(data)
  api_gateway_resp(statusCode: 204)
end
Basically, in last_day I'm retrieving some records from a DB via a MySQL2 client. I then iterate over last_day and add each row to data. So data is an array of arrays holding the records in the following format.
data = [
  [2020-08-06, 1, QUARANTINE],
  [2020-08-06, 1, QUARANTINE],
  [2020-08-06, 1, POSITIVE],
  [2020-08-06, 1, POSITIVE],
  [2020-08-06, 1, POSITIVE],
  [2020-08-06, 1, QUARANTINE],
  [2020-08-06, 0, POSITIVE],
  [2020-08-06, 1, QUARANTINE]
]
So if data has 10 records, when sheets.append(data) is called, 20 records are created in the Google Sheet.
Since my other post got nuked, maybe try replacing
request_body = Google::Apis::SheetsV4::ValueRange.new
with
request_body = {
  "range": range,
  "majorDimension": "ROWS",
  "values": values
}
The issue was due to the configuration of the Lambda function trigger being set to run every 10 hours.

Elasticsearch: scroll between specified time frame

I have some data in Elasticsearch, as shown in the image.
I used the example from the link below to do the scrolling:
https://gist.github.com/drorata/146ce50807d16fd4a6aa
page = es.search(
    index=INDEX_NAME,
    scroll='1m',
    size=1000,
    body={"query": {"match_all": {}}})
sid = page['_scroll_id']
scroll_size = page['hits']['total']

# Start scrolling
print("Scrolling...")
while scroll_size > 0:
    print("Page: ", count)
    page = es.scroll(scroll_id=sid, scroll='10m')
    # Update the scroll ID
    sid = page['_scroll_id']
    for hit in page['hits']['hits']:
        # some code processing here
Currently my requirement is that I want to scroll but specify a start timestamp and an end timestamp.
I need help with how to do this using scroll.
Simply replace
body={"query": {"match_all": {}}})
with
body={"query": {"range": {"timestamp": {"gte": "2018-08-05T05:30:00Z", "lte": "2018-08-06T05:30:00Z"}}}})
Example code below; the time range should go into the ES query. Also, you should process the first query result.
es_query_dict = {"query": {"range": {"timestamp": {
    "gte": "2018-08-00T00:00:00Z", "lte": "2018-08-17T00:00:00Z"}}}}

def get_es_logs():
    es_client = Elasticsearch([source_es_ip], port=9200, timeout=300)
    total_docs = 0
    page = es_client.search(scroll=scroll_time,
                            size=scroll_size,
                            body=json.dumps(es_query_dict))
    while True:
        sid = page['_scroll_id']
        details = page["hits"]["hits"]
        doc_count = len(details)
        if doc_count > 0:
            total_docs += doc_count
            print("scroll size: " + str(doc_count))
            print("start bulk index docs")
            # index_bulk(details)
            print("end success")
        else:
            break
        page = es_client.scroll(scroll_id=sid, scroll=scroll_time)
    print("total docs: " + str(total_docs))
Also have a look at elasticsearch.helpers.scan, which already implements the scroll loop for you; just pass it query={"query": {"range": {"timestamp": {"gt": ..., "lt": ...}}}}, for example:
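A minimal sketch of the scan helper (the host, index name and processing function are assumptions here):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es_client = Elasticsearch(["http://localhost:9200"])

query = {"query": {"range": {"timestamp": {
    "gte": "2018-08-05T05:30:00Z", "lte": "2018-08-06T05:30:00Z"}}}}

# scan() drives the scroll loop internally and yields one hit at a time.
for hit in scan(es_client, index="my-index", query=query, scroll="10m"):
    process(hit["_source"])  # hypothetical processing function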

how to get an array response of jdbc request in jmeter?

For example, I have my JDBC Request and the response is like:
X Y Z
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
. . .
. . .
. . .
How can I get all the values of X, Y and Z?
Then I have an HTTP request, and I'm going to assert that its response matches the data selected via JDBC.
example response:
{
  {
    "x": "a1",
    "y": "b1",
    "z": "c1"
  },
  {
    "x": "a2",
    "y": "b2",
    "z": "c2"
  },
  {
    "x": "a3",
    "y": "b3",
    "z": "c4"
  },
  {
    "x": "a4",
    "y": "b4",
    "z": "c4"
  },
  {
    "x": "a5",
    "y": "b5",
    "z": "c5"
  },
  {
    "x": "a6",
    "y": "b6",
    "z": "c6"
  },
  {
    "x": "a7",
    "y": "b7",
    "z": "c7"
  },
  {
    "x": "a8",
    "y": "b8",
    "z": "c8"
  },
  .
  .
  .
  .
}
As per JDBC Request sampler documentation:
If the Variable Names list is provided, then for each row returned by a Select statement, the variables are set up with the value of the corresponding column (if a variable name is provided), and the count of rows is also set up. For example, if the Select statement returns 2 rows of 3 columns, and the variable list is A,,C, then the following variables will be set up:
A_#=2 (number of rows)
A_1=column 1, row 1
A_2=column 1, row 2
C_#=2 (number of rows)
C_1=column 3, row 1
C_2=column 3, row 2
So given you provide "Variable Names" as X,Y,Z you should be able to access the values as ${X_1}, ${Y_2}, etc.
See Debugging JDBC Sampler Results in JMeter for more detailed information on working with JDBC Test Elements results and result sets.
You should declare the "Variable Names" field and also declare a result variable name, as shown below.
Then you can access the values using the _1, _2 suffixes. Please find below the sample code that you can use in a Beanshell post processor.
import java.util.ArrayList;

import net.minidev.json.parser.JSONParser;
import net.minidev.json.JSONObject;
import net.minidev.json.JSONArray;

ArrayList items = vars.getObject("result1");

for (int i = items.size() - 1; i >= 0; i--) {
    JSONObject jsonitemElement = new JSONObject();
    jsonitemElement.put("x", vars.get("x_" + (i + 1)));
    jsonitemElement.put("y", vars.get("y_" + (i + 1)));
    jsonitemElement.put("z", vars.get("z_" + (i + 1)));
    log.info(jsonitemElement.toString());
}
Since you are getting these values in the response payload of the HTTP request, you should add code to parse that JSON response in an assertion or post processor and compare it with the elements from the sample code above.
A point to note: different applications may send the target JSON in any order, so there is no guarantee that the HTTP response will always come back as A1,B1,C1 - A2,B2,C2, etc. It can arrive in any order, starting with A5,B5,C5 for example. It is better to use a hashmap or write your own array comparison to ensure that your result set completely matches the HTTP response.

Querying a parameter that’s not an index on DynamoDb

TableName : people
id | name | age | location
id_1 | A | 23 | New Zealand
id_2 | B | 12 | India
id_3 | C | 26 | Singapore
id_4 | D | 30 | Turkey
keys: id -> hash and age->range
Question 1
I’m trying to execute a query: “Select * from people where age > 25”
I can get queries like "Select age from people where id = id_1 and age > 25" to work, which is not what I need; I just need to select all matching records.
And if I don't need age to be a range index, how should I modify my query params to just return the list of records matching the criterion age > 25?
Question 2
AWS throws an error when either Lines 23 or 24-41 are commented.
: Query Error: ValidationException: Either the KeyConditions or KeyConditionExpression parameter must be specified in the request.
status code: 400, request id: []
Is the KeyConditions/KeyConditionsExpressions parameter required? Does it mean that I cannot query the table on a parameter that's not a part of the index?
func queryDynamo() {
    log.Println("Enter queryDynamo")

    svc := dynamodb.New(nil)

    params := &dynamodb.QueryInput{
        TableName: aws.String("people"), // Required
        Limit:     aws.Long(3),
        // IndexName: aws.String("localSecondaryIndex"),
        ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
            ":v_age": { // Required
                N: aws.String("25"),
            },
            ":v_ID": {
                S: aws.String("NULL"),
            },
        },
        FilterExpression: aws.String("age >= :v_age"),
        // KeyConditionExpression: aws.String("id = :v_ID and age >= :v_age"),
        KeyConditions: map[string]*dynamodb.Condition{
            "age": { // Required
                ComparisonOperator: aws.String("GT"), // Required
                AttributeValueList: []*dynamodb.AttributeValue{
                    { // Required
                        N: aws.String("25"),
                    },
                    // More values...
                },
            },
            "id": { // Required
                ComparisonOperator: aws.String("EQ"), // Required
                // AttributeValueList: []*dynamodb.AttributeValue{
                //     S: aws.String("NOT_NULL"),
                // },
            },
            // More values...
        },
        Select:           aws.String("ALL_ATTRIBUTES"),
        ScanIndexForward: aws.Boolean(true),
    }

    // Get the response and print it out.
    resp, err := svc.Query(params)
    if err != nil {
        log.Println("Query Error: ", err.Error())
    }

    // Pretty-print the response data.
    log.Println(awsutil.StringValue(resp))
}
DynamoDB is a NoSQL based system so you will not be able to retrieve all of the records based on a condition on a non-indexed field without doing a table scan.
A table scan will cause DynamoDB to go through every single record in the table, which for a big table will be very expensive in either time (it is slow) or money (provisioned read IOPS).
Using a filter is the correct approach and will allow the operation to complete if you switch from a query to a scan; a query must always specify the hash key. (A short sketch of the scan-with-filter approach follows below.)
A word of warning, though: if you plan on using a scan operation on a table with more than just a few (fewer than 100) items and it is exposed to a front end, you will be disappointed with the results. If this is some type of cron job or backend reporting task where response time doesn't matter, it is an acceptable approach, but be careful not to exhaust all of your IOPS and impact front-end applications.
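For reference, a minimal boto3 (Python) sketch of the scan-with-filter approach described above, using the table and attribute names from the question:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource('dynamodb').Table('people')

# Scan reads every item and applies the filter server-side after the read,
# so you are still billed for the full scan.
response = table.scan(FilterExpression=Attr('age').gt(25))
items = response['Items']

# Results are paginated; keep going while LastEvaluatedKey is present.
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('age').gt(25),
        ExclusiveStartKey=response['LastEvaluatedKey'],
    )
    items.extend(response['Items'])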