I have built a Kinesis Firehose delivery stream to push data into Redshift and am trying to push data from an EC2 instance using the Kinesis agent.
The agent is able to parse the records but cannot identify the Firehose stream, and I am getting the following Java error.
com.amazon.kinesis.streaming.agent.tailing.AsyncPublisher
[ERROR] AsyncPublisher[kinesis:csvtoredshiftstream:/tmp/redshift.log*]:RecordBuffer
(id=2,records=2,bytes=45) Retriable send error (com.amazonaws.services.kinesis.model.ResourceNotFoundException:
Stream csvtoredshiftstream under account xyz not found.
(Service: AmazonKinesis; Status Code: 400;
Error Code: ResourceNotFoundException;
Request ID: f4a63623-9a15-b2f8-a597-13b478c81bbc)). Will retry.
Could you please provide pointers to identify and resolve the issue?
Regards,
Srivignesh KN
Thank you #peter,
I was able to overcome the error by specifying the inputs in agent.json in the following manner for Firehose events.
{ "cloudwatch.emitMetrics": true, "kinesis.endpoint": "", "firehose.endpoint": "firehose.us-west-2.amazonaws.com", "flows": [ { "filePattern": "/tmp/s3streaming.", "deliveryStream": "S3TestingStream", "partitionKeyOption": "RANDOM" }, { "filePattern": "/tmp/app.log", "deliveryStream": "yourdeliverystream" } ] } –
Moreover, for Kinesis streams to work as expected, the S3 bucket needs to be created in the same region as the stream.
For example, if the stream is created in us-west-2, the S3 bucket should also be created in us-west-2.
Thanks & Regards,
Srivignesh KN
I have multiple queries running in the same Spark Structured Streaming session.
The queries write Parquet records to a Google Cloud Storage bucket and their checkpoints to a Google Cloud Storage bucket as well.
val query1 = df1
.select(col("key").cast("string"),from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
.select("key","data.*")
.writeStream.format("parquet").option("path", path).outputMode("append")
.option("checkpointLocation", checkpoint_dir1)
.partitionBy("key")/*.trigger(Trigger.ProcessingTime("5 seconds"))*/
.queryName("query1").start()
val query2 = df2.select(col("key").cast("string"),from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
.select("key","data.*")
.writeStream.format("parquet").option("path", path).outputMode("append")
.option("checkpointLocation", checkpoint_dir2)
.partitionBy("key")/*.trigger(Trigger.ProcessingTime("5 seconds"))*/
.queryName("query2").start()
Problem: Sometimes the job fails with java.lang.IllegalStateException: Race while writing batch 4
Logs:
Caused by: java.lang.IllegalStateException: Race while writing batch 4
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:67)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 20 more
20/07/24 19:40:15 INFO SparkContext: Invoking stop() from shutdown hook
This error is because there are two writers writing to the output path. The file streaming sink doesn't support multiple writers. It assumes there is only one writer writing to the path. Each query needs to use its own output directory.
Hence, in order to fix this, you can make each query use its own output directory. When reading back the data, you can load each output directory and union them.
You can also use a streaming sink that supports multiple concurrent writers, such as the Delta Lake library. It's also supported by Google Cloud: https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc. This link has instructions about how to use Delta Lake on Google Cloud. It doesn't mention the streaming case, but what you need to do is change format("parquet") to format("delta") in your code.
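For illustration, here is a minimal sketch of the separate-output-directory approach; the gs:// paths are placeholders, and df1, schema, and spark come from your original code:
import org.apache.spark.sql.functions.{col, from_json}

// Give each query its own output path and checkpoint location (paths are placeholders).
val query1 = df1
  .select(col("key").cast("string"),
    from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream
  .format("parquet")                                   // or format("delta") with the Delta Lake library
  .option("path", "gs://my-bucket/output/query1")      // distinct output directory per query
  .option("checkpointLocation", "gs://my-bucket/checkpoints/query1")
  .outputMode("append")
  .partitionBy("key")
  .queryName("query1")
  .start()

// Reading back: load each query's directory and union the results.
val combined = spark.read.parquet("gs://my-bucket/output/query1")
  .union(spark.read.parquet("gs://my-bucket/output/query2"))
query2 would look the same, only with its own .../query2 output and checkpoint paths.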
On my AWS Lambda dashboard, I see a spike in failed invocations. I want to investigate these errors by looking at the logs for these invocations. Currently, the only thing I can do to filter these invocations is to get the timeline of the failed invocations and then look through the logs.
Is there a way I can search for failed invocations, i.e. ones that did not return a 200, and get a request ID that I can then look up in CloudWatch Logs?
You may use AWS X-Ray for this by enabling it in the AWS Lambda dashboard.
In the X-Ray dashboard you can:
view traces
filter them by status code
see all the details of the invocation, including the request ID, total execution time, etc., as in the sample trace document below (a programmatic sketch follows it):
{
  "Document": {
    "id": "ept5e8c459d8d017fab",
    "name": "zucker",
    "start_time": 1595364779.526,
    "trace_id": "1-some-trace-id-fa543548b17a44aeb2e62171",
    "end_time": 1595364780.079,
    "http": {
      "response": {
        "status": 200
      }
    },
    "aws": {
      "request_id": "abcdefg-69b5-hijkl-95cc-170e91c66110"
    },
    "origin": "AWS::Lambda",
    "resource_arn": "arn:aws:lambda:eu-west-1:12345678:function:major-tom"
  },
  "Id": "52dc189d8d017fab"
}
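If you prefer to pull this information programmatically rather than through the console, here is a rough sketch using the AWS SDK for Java (v1) from Scala. The filter expression and the time window are only examples, and the exact class and method names should be verified against the SDK version you use:
import java.util.Date
import com.amazonaws.services.xray.AWSXRayClientBuilder
import com.amazonaws.services.xray.model.GetTraceSummariesRequest
import scala.jdk.CollectionConverters._

val xray = AWSXRayClientBuilder.defaultClient()

// Look at the last hour and keep only traces that ended in an error or fault.
val now = new Date()
val oneHourAgo = new Date(now.getTime - 3600 * 1000L)
val request = new GetTraceSummariesRequest()
  .withStartTime(oneHourAgo)
  .withEndTime(now)
  .withFilterExpression("error = true OR fault = true")  // example X-Ray filter expression

val summaries = xray.getTraceSummaries(request).getTraceSummaries.asScala
summaries.foreach(s => println(s.getId))                 // trace IDs you can open in the X-Ray console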
What I understand from your question is that you are more interested in finding out why your Lambda invocation failed than in finding the request ID for the failed invocation.
You can do this by following the steps below:
Go to your Lambda function in the AWS console.
There will be three tabs: Configuration, Permissions, and Monitoring.
Click on the Monitoring tab. Here you can see the number of invocations, the error count and success rate, and other metrics as well. Click on the Error metrics; you will see at what time the invocation errors happened. You can read more in the Lambda function metrics documentation.
If you already know the time at which your function failed, you can skip the previous step.
Now scroll down. You will find a section named CloudWatch Logs Insights. Here you will see logs for all the invocations that happened within the specified time range.
Adjust your time range under this section. You can choose a predefined time range like 1h, 3h, 1d, etc., or a custom time range.
Now click on the log stream link after the above filter has been applied. It will take you to the CloudWatch console, where you can see the logs (a scripted alternative is sketched below).
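If you would rather script this than click through the console, here is a small sketch using the AWS SDK for Java (v1) from Scala; the function name and the filter pattern are only examples, and Lambda writes its logs to the /aws/lambda/<function-name> log group:
import com.amazonaws.services.logs.AWSLogsClientBuilder
import com.amazonaws.services.logs.model.FilterLogEventsRequest
import scala.jdk.CollectionConverters._

val logs = AWSLogsClientBuilder.defaultClient()

val now = System.currentTimeMillis()
val request = new FilterLogEventsRequest()
  .withLogGroupName("/aws/lambda/my-function")   // example function name
  .withStartTime(now - 3600 * 1000L)             // last hour
  .withEndTime(now)
  .withFilterPattern("?ERROR ?Exception")        // match events containing ERROR or Exception

// First page only; follow the next token on the result for more events.
logs.filterLogEvents(request).getEvents.asScala.foreach { e =>
  println(s"${e.getLogStreamName}: ${e.getMessage.trim}")
}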
How can I check in s3-outbound-gateway whether the bucket is available in S3 storage before processing, using a bucket expression? If the bucket is not available, the message should be redirected to an error channel.
<int-aws:s3-outbound-gateway id="FileGenerationChannelS3"
request-channel="filesOutS3ChainChannel"
reply-channel="filesArchiveChannel"
transfer-manager="transferManager"
bucket-expression="headers.TARGET_BUCKET"
command="UPLOAD">
The <int-aws:s3-outbound-gateway> is a typical MessageHandler-based event-driven consumer. You can apply an ExpressionEvaluatingRequestHandlerAdvice with the desired failureChannel in the <int-aws:request-handler-advice-chain>.
See the Reference Manual for more info.
I am trying to invoke a simple Lambda function (the Lambda function prints hello world to the console) using Ruby. However, when I run the code and look at the SWF dashboard, I see the following error:
Reason: An Activity cannot send a response with data larger than 32768 characters. Please limit the size of the response. You can look at the Activity Worker logs to see the original response.
Could someone help me resolve this issue?
The code is as follows:
require 'aws/decider'
require 'aws-sdk'

class U_Act
  extend AWS::Flow::Activities

  activity :b_u do
    {
      version: "1.0"
    }
  end

  def b_u(c_id)
    # Plain AWS SDK Lambda client (placeholder region and credentials).
    lambda_client = Aws::Lambda::Client.new(
      region: "xxxxxx",
      access_key_id: "XxXXXXXXXXX",
      secret_access_key: "XXXXXXXXXX"
    )
    resp = lambda_client.invoke(
      function_name: "s_u_1" # required
    )
    print "#{resp}"
  end
end
Thanks
According to the AWS documentation, you cannot send an input/result data set larger than 32,768 characters. This limit affects activity or workflow execution result data, input data when scheduling activity tasks or workflow executions, and input sent with a workflow execution signal.
Workarounds to resolve this issue are:
Use AWS S3 to upload the message and pass the S3 path between the activities (see the sketch after this list).
If you need high performance, use ElastiCache to store the values and pass the keys between the activities.
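A minimal sketch of the S3 workaround using the AWS SDK for Java (v1) from Scala; the bucket and key are placeholders, and the idea is that only the short S3 path travels through SWF instead of the oversized payload:
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()

// Producer side: store the large result in S3 and return only its location.
def storeResult(largePayload: String): String = {
  val bucket = "my-swf-payloads"                             // placeholder bucket
  val key = s"results/${java.util.UUID.randomUUID()}.json"   // placeholder key
  s3.putObject(bucket, key, largePayload)                    // upload the oversized payload
  s"s3://$bucket/$key"                                       // well under the 32,768-character limit
}

// Consumer side: resolve the pointer back into the real payload.
def loadResult(pointer: String): String = {
  val Array(bucket, key) = pointer.stripPrefix("s3://").split("/", 2)
  s3.getObjectAsString(bucket, key)
}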
I plan to store images on Amazon S3. How can I retrieve the following from Amazon S3:
file size
image height
image width?
You can store image dimensions in user-defined metadata when uploading your images and later read this data using the REST API.
Refer to this page for more information about user-defined metadata: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
You can get the file size by reading the Content-Length response header of a simple HEAD request for your file. Your client library may help you with this query. More info is in the S3 API docs.
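For example, here is a small sketch using the AWS SDK for Java (v1) from Scala that stores dimensions you computed locally as user-defined metadata on upload, and later reads them back together with the size without downloading the object; the bucket, key, and file names are placeholders:
import java.io.File
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ObjectMetadata, PutObjectRequest}

val s3 = AmazonS3ClientBuilder.defaultClient()

// Upload: attach the dimensions as x-amz-meta-* user metadata.
val metadata = new ObjectMetadata()
metadata.addUserMetadata("width", "1920")
metadata.addUserMetadata("height", "1080")
s3.putObject(new PutObjectRequest("my-bucket", "images/photo.jpg", new File("photo.jpg"))
  .withMetadata(metadata))

// Later: a HEAD request returns size and user metadata without fetching the image itself.
val head = s3.getObjectMetadata("my-bucket", "images/photo.jpg")
println(head.getContentLength)              // file size in bytes
println(head.getUserMetadata.get("width"))  // "1920"
println(head.getUserMetadata.get("height")) // "1080"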
Amazon S3 just provides you with storage, (almost) nothing more. Image dimensions are not accessible through the API. You have to get the whole file and calculate its dimensions yourself. I'd advise you to store this information in your database when uploading the files to S3, if applicable.
On Node, it can be really easy using image-size coupled with node-fetch.
const fetch = require('node-fetch');     // HTTP client used to download the image
const imageSize = require('image-size'); // reads width/height from an image buffer

async function getSize(imageUrl) {
  const response = await fetch(imageUrl);  // fetch the object, e.g. a public S3 URL
  const buffer = await response.buffer();  // node-fetch v2: read the body as a Buffer
  return imageSize(buffer);                // { width, height, type }
}