How to import from a bucket in the S3 eu-west-1 region? - cockroachdb

I am trying to bulk import into a CockroachDB database from an S3 bucket in the eu-west-1 region:
IMPORT TABLE osm.nodes (
    id INT PRIMARY KEY,
    version INT NOT NULL,
    lat DECIMAL NOT NULL,
    lon DECIMAL NOT NULL,
    changeset_id INT NOT NULL,
    visible BOOLEAN NOT NULL
)
CSV DATA ('s3://cockroach-import/nodes.csv?AWS_ACCESS_KEY_ID=<snip>&AWS_SECRET_ACCESS_KEY=<snip>')
WITH
    temp = 's3://cockroach-import/?AWS_ACCESS_KEY_ID=<snip>&AWS_SECRET_ACCESS_KEY=<snip>',
    delimiter = ','
;
I get the error message:
failed to create s3 reader: 400: "The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-west-1'"
How can I specify the S3 region in the import statement?

This is a bug that has been fixed in the upcoming CockroachDB 1.1.2 release. https://github.com/cockroachdb/cockroach/issues/19435 describes the problem and the fix.

Related

How to generate SAS token using python legacy SDK(2.1) without account_key or connection_string

I am using Python 3.6 and azure-storage-blob (version 1.5.0) and trying to use a user-assigned managed identity to connect to my Azure Storage blob from an Azure VM. The problem I am facing is that I want to generate a SAS token to form a downloadable URL.
I am using blob_service = BlockBlobService(account name, token credential) to authenticate, but I am not able to find any method that lets me generate a SAS token without supplying the account key.
I also do not see any way of using the user delegation key, as is available in the new azure-storage-blob (versions >= 12.0.0). Is there any workaround, or will I need to upgrade the Azure Storage library in the end?
I tried to reproduce this in my environment and was able to generate a SAS token without an account key or connection string.
Code:
import datetime as dt
import json
import os

from azure.identity import DefaultAzureCredential
from azure.storage.blob import (
    BlobClient,
    BlobSasPermissions,
    BlobServiceClient,
    generate_blob_sas,
)

credential = DefaultAzureCredential(exclude_shared_token_cache_credential=True)

storage_acct_name = "Accountname"
container_name = "containername"
blob_name = "Filename"
url = f"https://{storage_acct_name}.blob.core.windows.net"

blob_service_client = BlobServiceClient(url, credential=credential)

# A user delegation key is obtained with the AAD credential; no account key is needed.
udk = blob_service_client.get_user_delegation_key(
    key_start_time=dt.datetime.utcnow() - dt.timedelta(hours=1),
    key_expiry_time=dt.datetime.utcnow() + dt.timedelta(hours=1),
)

# Sign the blob SAS with the user delegation key instead of the account key.
sas = generate_blob_sas(
    account_name=storage_acct_name,
    container_name=container_name,
    blob_name=blob_name,
    user_delegation_key=udk,
    permission=BlobSasPermissions(read=True),
    start=dt.datetime.utcnow() - dt.timedelta(minutes=15),
    expiry=dt.datetime.utcnow() + dt.timedelta(hours=2),
)

sas_url = (
    f'https://{storage_acct_name}.blob.core.windows.net/'
    f'{container_name}/{blob_name}?{sas}'
)
print(sas_url)
Output: the script prints a SAS URL for the blob.
Make sure the identity has the Storage Blob Data Contributor role on the storage account.
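If you want to consume the SAS URL from code rather than hand it out, a minimal sketch (assuming the sas_url produced above):
from azure.storage.blob import BlobClient

# The SAS in the URL authorizes the download, so no extra credential is needed.
blob_client = BlobClient.from_blob_url(sas_url)
data = blob_client.download_blob().readall()
print(len(data))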

Azure - Copy LARGE blobs from one container to other using logic apps

I successfully built a logic app where, whenever a blob is added to container-one, it gets copied to container-2. However, it fails whenever a blob larger than 50 MB (the default size limit) is uploaded. Blobs are added via the REST API. Could you please guide me?
Currently, the maximum file size with chunking disabled is 50 MB. One workaround is to use Azure Functions to transfer the files from one container to another.
Below is sample Python code that worked for me when transferring files from one container to another:
from azure.storage.blob import BlobClient, BlobServiceClient
from azure.storage.blob import ResourceTypes, AccountSasPermissions
from azure.storage.blob import generate_account_sas
from datetime import datetime, timedelta

connection_string = '<Your Connection String>'
account_key = '<Your Account Key>'
source_container_name = 'container1'
blob_name = 'samplepdf.pdf'
destination_container_name = 'container2'

# Create client
client = BlobServiceClient.from_connection_string(connection_string)

# Create SAS token for the blob
sas_token = generate_account_sas(
    account_name=client.account_name,
    account_key=account_key,
    resource_types=ResourceTypes(object=True),
    permission=AccountSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=4)
)

# Create blob client for the source blob
source_blob = BlobClient(
    client.url,
    container_name=source_container_name,
    blob_name=blob_name,
    credential=sas_token
)

# Create new blob and start the server-side copy operation
new_blob = client.get_blob_client(destination_container_name, blob_name)
new_blob.start_copy_from_url(source_blob.url)
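Note that start_copy_from_url only schedules a server-side copy; for very large blobs it can still be in progress when the call returns. A minimal sketch of polling the destination blob until the copy finishes (assuming the new_blob client from above):
import time

# Poll the destination blob's copy status until the service-side copy completes.
props = new_blob.get_blob_properties()
while props.copy.status == 'pending':
    time.sleep(5)
    props = new_blob.get_blob_properties()
print(props.copy.status)  # 'success', 'aborted', or 'failed'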
REFERENCES:
General Limits
How to copy a blob from one container to another container using Azure Blob storage SDK

'Integer overflow' in AWS Athena for more than 800M records

I used Spark on EMR to copy tables from Oracle to S3 in Parquet format, then used a Glue crawler to crawl the data from S3 and register it in Athena. The data ingestion is fine, but when I tried to preview the data it showed this error:
GENERIC_INTERNAL_ERROR: integer overflow
I have tried the pipeline multiple times. The original schema is this:
SAMPLEINDEX       NUMBER(38, 0)
GENEINDEX         NUMBER(38, 0)
VALUE             FLOAT
MINSEGMENTLENGTH  NUMBER(38, 0)
I tried casting the data to integer, long, and string, but the error persists. I also inspected the original dataset and didn't find any value that could cause an integer overflow.
Tables with fewer than 800 million rows work perfectly fine, but once a table has more than 800 million rows the error starts to appear.
Here is some sample code in Scala:
val table = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin://#XXX")
  .option("dbtable", "tcga.%s".format(tableName))
  .option("user", "XXX")
  .option("password", "XXX")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("fetchsize", "50000")
  .option("numPartitions", "200")
  .load()

println("writing tablename: %s".format(tableName))

val finalDF = table.selectExpr(
  "cast(SAMPLEINDEX as string) as SAMPLEINDEX",
  "cast(GENEINDEX as string) as GENEINDEX",
  "cast(VALUE as string) as VALUE",
  "cast(MINSEGMENTLENGTH as string) as MINSEGMENTLENGTH")

finalDF.repartition(200)
finalDF.printSchema()
finalDF.write.format("parquet").mode("Overwrite").save("s3n://XXX/CNVGENELEVELDATATEST")
finalDF.printSchema()
finalDF.show()
Does anyone know what may cause the issue?

Syncing DynamoDB with ElasticSearch for old Data

I'm using this function https://github.com/bfansports/dynamodb-to-elasticsearch to sync my DynamoDB table with Elasticsearch. Unfortunately it only processes newly added and updated data, not the rows that already existed in the table, even though I chose "New and old images - both the new and the old images of the item" in the Manage stream section.
How can I fix that?
OK, I ended up updating every item in the DynamoDB table, which triggers the stream, so the sync between DynamoDB and Elasticsearch gets done.
This is the script that I use:
import json
import boto3
import random

def lambda_handler(event, context):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('image-library')

    response = table.scan(
        ProjectionExpression='#k',
        ExpressionAttributeNames={
            '#k': 'id',  # partition key
        }
    )
    items = response['Items']

    random_number = random.randint(0, 1000)
    for item in items:
        response = table.update_item(
            Key=item,
            UpdateExpression='SET #f = :f',
            ExpressionAttributeNames={
                '#f': 'force_update'
            },
            ExpressionAttributeValues={
                ':f': random_number
            }
        )
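One caveat: a single scan call returns at most 1 MB of items, so on a large table the loop above only touches the first page. A sketch of paginating with LastEvaluatedKey (assuming the same table object and id partition key as above):
# Continue scanning until DynamoDB stops returning LastEvaluatedKey,
# so every item in the table gets the forced update.
scan_kwargs = {
    'ProjectionExpression': '#k',
    'ExpressionAttributeNames': {'#k': 'id'},
}
while True:
    response = table.scan(**scan_kwargs)
    for item in response['Items']:
        table.update_item(
            Key=item,
            UpdateExpression='SET #f = :f',
            ExpressionAttributeNames={'#f': 'force_update'},
            ExpressionAttributeValues={':f': random.randint(0, 1000)},
        )
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']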

How to encrypt Lambda variables Using cloudformation

I have an AWS CloudFormation template that includes a Lambda function with sensitive environment variables. I'd like to set up a KMS key and encrypt them with it.
Basic CloudFormation to encrypt the variables is fine, even with the default aws/lambda key.
LambdaFunction:
  Type: AWS::Lambda::Function
  DependsOn: LambdaRole
  Properties:
    Environment:
      Variables:
        key: AKIAJ6W7WERITYHYUHJGHN
        secret: PGDzQ8277Fg6+SbuTyqxfrtbskjnaslkchkY1
        dest: !Ref dstBucket
    Code:
      ZipFile: |
        from __future__ import print_function
        import os
        import json
        import boto3
        import time
        import string
        import urllib

        print('Loading function')

        ACCESS_KEY_ID = os.environ['key']
        ACCESS_SECRET_KEY = os.environ['secret']
        #s3_bucket = boto3.resource('s3',aws_access_key_id=ACCESS_KEY_ID,aws_secret_access_key=ACCESS_SECRET_KEY)
        s3 = boto3.client('s3',aws_access_key_id=ACCESS_KEY_ID,aws_secret_access_key=ACCESS_SECRET_KEY)
        #s3 = boto3.client('s3')

        def handler(event, context):
            source_bucket = event['Records'][0]['s3']['bucket']['name']
            key = event['Records'][0]['s3']['object']['key']
            #key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'])
            #target_bucket = "${dstBucket}"
            target_bucket = os.environ['dest']
            copy_source = {'Bucket': source_bucket, 'Key': key}
            try:
                s3.copy_object(Bucket=target_bucket, Key=key, CopySource=copy_source)
            except Exception as e:
                print(e)
                print('Error getting object {} from bucket {}. Make sure they exist '
                      'and your bucket is in the same region as this '
                      'function.'.format(key, source_bucket))
                raise e
You can store the access key and secret key in AWS SSM Parameter Store, encrypted with a KMS key. Go to AWS Systems Manager -> Parameter Store -> Create Parameter, choose the SecureString type, and choose the KMS key to encrypt with. You can then read the parameter with a boto3 call, for example response = client.get_parameter(Name='AccessKey', WithDecryption=True), and take the access key from the response. Make sure the Lambda function has enough permissions to use that KMS key to decrypt the parameter: attach the necessary ssm:GetParameter and kms:Decrypt permissions to the IAM role the Lambda uses. This way you don't need to pass your access key and secret key as plain environment variables. Hope this helps!
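A minimal sketch of reading such parameters inside the Lambda handler (the parameter names 'AccessKey' and 'SecretKey' are hypothetical):
import boto3

ssm = boto3.client('ssm')

def handler(event, context):
    # WithDecryption=True makes SSM use the KMS key to return the plain value.
    access_key = ssm.get_parameter(Name='AccessKey', WithDecryption=True)['Parameter']['Value']
    secret_key = ssm.get_parameter(Name='SecretKey', WithDecryption=True)['Parameter']['Value']
    s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    # ... use s3 as in the original function ...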
You can create a KMS key manually in the AWS KMS service, or with CloudFormation (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-kms-key.html).
The returned key ARN can then be used for the KmsKeyArn property of the Lambda function resource:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html#cfn-lambda-function-kmskeyarn
Hope this helps!
You can also use the Secrets Manager AWS::SecretsManager::Secret CloudFormation resource to store the secret values.
Use CloudFormation dynamic references to retrieve the secret's value from either SSM Parameter Store or Secrets Manager in the template where you consume it.
