Apache Spark Streaming from Azure Event Hubs with Databricks Error: Endpoint hostname cannot be null or empty

I am successfully streaming data to Azure Event Hubs; however, when I attempt to write the stream using the code below, I get the following error:
Endpoint hostname cannot be null or empty
=== Streaming Query ===
Identifier: [id = 5b2fef8d-886a-411f-b39e-7b2540361cea, runId = 43b637e5-e1ca-47d0-b64e-b4a7af4c84e6]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
logged message The streaming query failed to execute
Exception: Endpoint hostname cannot be null or empty
=== Streaming Query ===
Identifier: [id = 5b2fef8d-886a-411f-b39e-7b2540361cea, runId = 43b637e5-e1ca-47d0-b64e-b4a7af4c84e6]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
The code that I'm using to write the streams is as follows:
streamingQuery = (
    df
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{thelocation}/_checkpoints")
    .start(thelocation)
)
Any thoughts on resolving the issue?
I am passing Endpoint details as follows:
df_flight_data = flightdata_defn.build(withStreaming=True, options={'rowsPerSecond': 1})

streamingDelays = (
    df_flight_data
    .groupBy(
        # df_flight_data.body,
        df_flight_data.flightNumber,
        df_flight_data.airline,
        df_flight_data.original_departure,
        df_flight_data.delay_minutes,
        df_flight_data.delayed_departure,
        df_flight_data.reason,
        window(df_flight_data.original_departure, "1 hour")
    )
    .count()
)
writeConnectionString = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
checkpointLocation = "///checkpoint"

# ehWriteConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# ehWriteConf = {
#     'eventhubs.connectionString' : writeConnectionString
# }

ehWriteConf = {
    'eventhubs.connectionString': writeConnectionString
}

# Write body data from a DataFrame to Event Hubs. Events are distributed across partitions in a round-robin fashion.
ds = streamingDelays \
    .select(F.to_json(F.struct("*")).alias("body")) \
    .writeStream.format("eventhubs") \
    .options(**ehWriteConf) \
    .outputMode("complete") \
    .option("checkpointLocation", "...") \
    .start()
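For reference, a minimal sketch of how the connection string is typically assembled before being encrypted; the values below are placeholders, and the Endpoint=sb://... segment is what supplies the endpoint hostname that the error complains about:
# Placeholder values only; the real string comes from the Event Hubs namespace's "Shared access policies" blade.
connectionString = (
    "Endpoint=sb://<namespace>.servicebus.windows.net/;"
    "SharedAccessKeyName=<policy-name>;"
    "SharedAccessKey=<key>;"
    "EntityPath=<eventhub-name>"
)
ehWriteConf = {
    'eventhubs.connectionString': sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
}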

Related

Does the AsyncElasticsearch client use the same session for async actions?

Does the AsyncElasticsearch client open a new session for each async request?
AsyncElasticsearch (from elasticsearch-py) uses AIOHTTP. From what I understand, AIOHTTP recommends using a context manager for the aiohttp.ClientSession object, so as to not generate a new session for each request:
async with aiohttp.ClientSession() as session:
    ...
I'm trying to speed up my bulk ingests.
How do I know if the AsyncElasticsearch client is using the same session, or setting up multiple?
Do I need the above async with... command in my code snippet below?
# Imports implied by the snippet (assumption):
import asyncio
import os

import numpy as np
import pandas as pd
from elasticsearch import AsyncElasticsearch, helpers
from tqdm import tqdm

# %%------------------------------------------------------------------------------------
# Create async elastic client
async_es = AsyncElasticsearch(
    hosts=[os.getenv("ELASTIC_URL")],
    verify_certs=False,
    http_auth=(os.getenv("ELASTIC_USERNAME"), os.getenv("ELASTIC_PW")),
    timeout=60 * 60,
    ssl_show_warn=False,
)

# %%------------------------------------------------------------------------------------
# Upload csv to elastic
# Chunk files to keep memory low
with pd.read_csv(file, usecols=["attributes"], chunksize=50_000) as reader:
    for df in reader:

        # Upload to elastic with username as id
        async def generate_actions(df_chunk):
            for index, record in df_chunk.iterrows():
                doc = record.replace({np.nan: None}).to_dict()
                doc.update(
                    {"_id": doc["username"], "_index": "users"}
                )
                yield doc

        es_upl_chunk = 1000

        async def main():
            tasks = []
            for i in range(0, len(df), es_upl_chunk):
                tasks.append(
                    helpers.async_bulk(
                        client=async_es,
                        actions=generate_actions(df[i : i + es_upl_chunk]),
                        chunk_size=es_upl_chunk,
                    )
                )
            successes = 0
            errors = []
            print("Uploading to es...")
            progress = tqdm(unit=" docs", total=len(df))
            for task in asyncio.as_completed(tasks):
                resp = await task
                successes += resp[0]
                errors.extend(resp[1])
                progress.update(es_upl_chunk)
            return successes, errors

        responses = asyncio.run(main())
        print(f"Uploaded {responses[0]} documents from {file}")
        if len(responses[1]) > 0:
            print(
                f"WARNING: Encountered the following errors: {','.join(responses[1])}"
            )
It turns out AsyncElasticsearch was not the right client to speed up bulk ingests in this case; I used the helpers.parallel_bulk() function instead.
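For anyone landing here, a minimal sketch of the parallel_bulk() approach (assuming the same pandas chunking as above; the host, index, and field names are placeholders taken from the question):
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(hosts=["http://localhost:9200"])  # placeholder host

def generate_actions(df_chunk):
    # One action dict per row, same idea as the async generator above.
    for _, record in df_chunk.iterrows():
        doc = record.to_dict()
        doc.update({"_id": doc["username"], "_index": "users"})
        yield doc

successes, errors = 0, []
# parallel_bulk fans the bulk requests out over a thread pool and yields (ok, item) per action.
for ok, info in parallel_bulk(client=es, actions=generate_actions(df), chunk_size=1000, thread_count=4):
    if ok:
        successes += 1
    else:
        errors.append(info)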

Spark Streaming With JDBC Source and Redis Stream

I'm trying to build a little mix of technologies to implement a solution at work. Since I'm new to most of them, I sometimes get stuck, but I've been able to find solutions to some of the problems I'm facing. Right now, both objects are running on Spark, but I can't identify why the streaming part is not working.
Maybe it's the way Redis implements its sink on the stream-writing side, or maybe it's the way I'm trying to do the job. Almost all of the streaming examples I found are based on Spark samples, like streaming text or over TCP, and the only solutions I found for relational databases are based on Kafka Connect, which I can't use right now because the company doesn't have the Oracle option for CDC on Kafka.
My scenario is as follows: build an Oracle -> Redis Stream -> MongoDB Spark application.
I've built my code based on the spark-redis examples and used the sample code to try to implement a solution for my case. I load the Oracle data day by day and send it to a Redis stream, which should later be read from the stream and saved to Mongo. Right now the sample below is just trying to consume from the stream and show it on the console, but nothing is shown.
The little 'trick' I tried was to create a CSV directory, read from it as a stream, grab the date from the CSV, use it to query the Oracle DB, and then save the Oracle DataFrame to Redis with the foreachBatch command. The data is saved, but I think not in the right way, because when I use the sample code to read the stream, nothing is received.
Here is the code:
** Writing to Stream **
object SendData extends App {

  Logger.getLogger("org").setLevel(Level.INFO)

  val oracleHost = scala.util.Properties.envOrElse("ORACLE_HOST", "<HOST_IP>")
  val oracleService = scala.util.Properties.envOrElse("ORACLE_SERVICE", "<SERVICE>")
  val oracleUser = scala.util.Properties.envOrElse("ORACLE_USER", "<USER>")
  val oraclePwd = scala.util.Properties.envOrElse("ORACLE_PWD", "<PASSWD>")

  val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
  val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")

  val oracleUrl = "jdbc:oracle:thin:@//" + oracleHost + "/" + oracleService

  val userSchema = new StructType().add("DTPROCESS", "string")

  val spark = SparkSession
    .builder()
    .appName("Send Data")
    .master("local[*]")
    .config("spark.redis.host", redisHost)
    .config("spark.redis.port", redisPort)
    .getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  val csvDF = spark.readStream.option("header", "true").schema(userSchema).csv("/tmp/checkpoint/*.csv")

  val output = csvDF
    .writeStream
    .outputMode("update")
    .foreachBatch { (df: DataFrame, batchId: Long) => {
      val dtProcess = df.select(col("DTPROCESS")).first.getString(0).take(10)

      val query = s"""
        (SELECT
          <FIELDS>
        FROM
          TABLE
        WHERE
          DTPROCESS BETWEEN (TO_TIMESTAMP('$dtProcess 00:00:00.00', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
          AND (TO_TIMESTAMP('$dtProcess 23:59:59.99', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
        ) Table
      """

      // Load the day's data from Oracle into a separate DataFrame
      val oracleDF = spark.read
        .format("jdbc")
        .option("url", oracleUrl)
        .option("dbtable", query)
        .option("user", oracleUser)
        .option("password", oraclePwd)
        .option("driver", "oracle.jdbc.driver.OracleDriver")
        .load()

      oracleDF.cache()

      if (oracleDF.count() > 0) {
        oracleDF.write.format("org.apache.spark.sql.redis")
          .option("table", "process")
          .option("key.column", "PRIMARY_KEY")
          .mode(SaveMode.Append)
          .save()
      }

      if ((new DateTime(dtProcess).toLocalDate()).equals(new LocalDate()))
        Seq(dtProcess).toDF("DTPROCESS")
          .coalesce(1)
          .write.format("com.databricks.spark.csv")
          .mode("overwrite")
          .option("header", "true")
          .save("/tmp/checkpoint")
      else {
        val nextDay = new DateTime(dtProcess).plusDays(1)
        Seq(nextDay.toString(DateTimeFormat.forPattern("YYYY-MM-dd"))).toDF("DTPROCESS")
          .coalesce(1)
          .write.format("com.databricks.spark.csv")
          .mode("overwrite")
          .option("header", "true")
          .save("/tmp/checkpoint")
      }
    }}
    .start()

  output.awaitTermination()
}
** Reading from Stream **
object ReceiveData extends App {

  Logger.getLogger("org").setLevel(Level.INFO)

  val mongoPwd = scala.util.Properties.envOrElse("MONGO_PWD", "bpedes")
  val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
  val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")

  val spark = SparkSession
    .builder()
    .appName("Receive Data")
    .master("local[*]")
    .config("spark.redis.host", redisHost)
    .config("spark.redis.port", redisPort)
    .getOrCreate()

  val processes = spark
    .readStream
    .format("redis")
    .option("stream.keys", "process")
    .schema(StructType(Array(
      StructField("FIELD_1", StringType),
      StructField("PRIMARY_KEY", StringType),
      StructField("FIELD_3", TimestampType),
      StructField("FIELD_4", LongType),
      StructField("FIELD_5", StringType),
      StructField("FIELD_6", StringType),
      StructField("FIELD_7", StringType),
      StructField("FIELD_8", TimestampType)
    )))
    .load()

  val query = processes
    .writeStream
    .format("console")
    .start()

  query.awaitTermination()
}
This code writes the DataFrame to Redis as hashes (not to a Redis stream):
df.write.format("org.apache.spark.sql.redis")
  .option("table", "process")
  .option("key.column", "PRIMARY_KEY")
  .mode(SaveMode.Append)
  .save()
Spark-redis doesn't support writing to a Redis stream out of the box.
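One possible workaround (not from the original answer, just a sketch) is to push each micro-batch into the stream yourself from foreachBatch using a Redis client's XADD, so that the readStream side with stream.keys can consume it. The snippet below is PySpark with redis-py, using a placeholder source, host, and stream name; the same idea applies to the Scala job above.
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RedisStreamSink").master("local[*]").getOrCreate()

# Placeholder streaming source; in the question this would be the Oracle batch result.
df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

def write_batch_to_stream(batch_df, batch_id):
    # XADD one entry per row to the "process" stream (placeholder host/port/stream name).
    r = redis.Redis(host="<REDIS_IP>", port=6379)
    for row in batch_df.collect():  # collect() is fine for small batches
        r.xadd("process", {k: str(v) for k, v in row.asDict().items()})

(df.writeStream
   .foreachBatch(write_batch_to_stream)
   .option("checkpointLocation", "/tmp/checkpoint/redis-stream")
   .start())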

Using SNS as a Target to Trigger Lambda Function

I have a Lambda function that works 100%. I set up my CloudWatch rule, connected the target to the Lambda directly, and everything worked fine.
My manager wants me to change the target in CloudWatch to SNS, and then use the SNS topic as the trigger for my Lambda.
I have made the necessary changes, and now my Lambda function is no longer working.
import os, json, boto3

# Assumed: the STS client used below; it is not defined in the snippet as posted.
sts_client = boto3.client('sts')

def validate_instance(rec_event):
    sns_msg = json.loads(rec_event['Records'][0]['Sns']['Message'])
    account_id = sns_msg['account']
    event_region = sns_msg['region']
    assumedRoleObject = sts_client.assume_role(
        RoleArn="arn:aws:iam::{}:role/{}".format(account_id, 'VSC-Admin-Account-Lambda-Execution-Role'),
        RoleSessionName="AssumeRoleSession1"
    )
    credentials = assumedRoleObject['Credentials']
    print(credentials)
    ec2_client = boto3.client('ec2', event_region,
                              aws_access_key_id=credentials['AccessKeyId'],
                              aws_secret_access_key=credentials['SecretAccessKey'],
                              aws_session_token=credentials['SessionToken'],
                              )

def lambda_handler(event, context):
    ip_permissions = []
    print("The event log is " + str(event))
    # Ensure that we have an event name to evaluate.
    if 'detail' not in event or ('detail' in event and 'eventName' not in event['detail']):
        return {"Result": "Failure", "Message": "Lambda not triggered by an event"}
    elif event['detail']['eventName'] == 'AuthorizeSecurityGroupIngress':
        items_ip_permissions = event['detail']['requestParameters']['ipPermissions']['items']
        security_group_id = event['detail']['requestParameters']['groupId']
        print("The total items are " + str(items_ip_permissions))
        for item in items_ip_permissions:
            s = [val['cidrIp'] for val in item['ipRanges']['items']]
            print("The value of ipranges are " + str(s))
            if ((item['fromPort'] == 22 and item['toPort'] == 22) or (item['fromPort'] == 143 and item['toPort'] == 143) or (item['fromPort'] == 3389 and item['toPort'] == 3389)) and ('0.0.0.0/0' in [val['cidrIp'] for val in item['ipRanges']['items']]):
                print("Revoking the security rule for the item" + str(item))
                ip_permissions.append(item)
        result = revoke_security_group_ingress(security_group_id, ip_permissions)
    else:
        return

def revoke_security_group_ingress(security_group_id, ip_permissions):
    print("The security group id is " + str(security_group_id))
    print("The ip_permissions value to be revoked is " + str(ip_permissions))
    ip_permissions_new = normalize_paramter_names(ip_permissions)
    response = boto3.client('ec2').revoke_security_group_ingress(GroupId=security_group_id, IpPermissions=ip_permissions_new)
    print("The response of the revoke is " + str(response))

def normalize_paramter_names(ip_items):
    # Start building the permissions items list.
    new_ip_items = []
    # First, build the basic parameter list.
    for ip_item in ip_items:
        new_ip_item = {
            "IpProtocol": ip_item['ipProtocol'],
            "FromPort": ip_item['fromPort'],
            "ToPort": ip_item['toPort']
        }
        # CidrIp or CidrIpv6 (IPv4 or IPv6)?
        if 'ipv6Ranges' in ip_item and ip_item['ipv6Ranges']:
            # This is an IPv6 permission range, so change the key names.
            ipv_range_list_name = 'ipv6Ranges'
            ipv_address_value = 'cidrIpv6'
            ipv_range_list_name_capitalized = 'Ipv6Ranges'
            ipv_address_value_capitalized = 'CidrIpv6'
        else:
            ipv_range_list_name = 'ipRanges'
            ipv_address_value = 'cidrIp'
            ipv_range_list_name_capitalized = 'IpRanges'
            ipv_address_value_capitalized = 'CidrIp'
        ip_ranges = []
        # Next, build the IP permission list.
        for item in ip_item[ipv_range_list_name]['items']:
            ip_ranges.append(
                {ipv_address_value_capitalized: item[ipv_address_value]}
            )
        new_ip_item[ipv_range_list_name_capitalized] = ip_ranges
        new_ip_items.append(new_ip_item)
    return new_ip_items
I assume the permissions are missing, which is causing the invocation failure.
You need to explicitly grant SNS permission to invoke the Lambda function.
Below is the CLI command:
aws lambda add-permission --function-name my-function --action lambda:InvokeFunction --statement-id sns-my-topic \
--principal sns.amazonaws.com --source-arn arn:aws:sns:us-east-2:123456789012:my-topic
my-function -> Name of the lambda function
my-topic -> Name of the SNS topic
Reference: https://docs.aws.amazon.com/lambda/latest/dg/access-control-resource-based.html
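The same grant can also be made from Python with boto3 if that is more convenient; a sketch using the same placeholder names as the CLI example above:
import boto3

lambda_client = boto3.client('lambda')

# Allow the SNS topic to invoke the function (placeholder names/ARNs from the CLI example).
lambda_client.add_permission(
    FunctionName='my-function',
    StatementId='sns-my-topic',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn='arn:aws:sns:us-east-2:123456789012:my-topic',
)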

ResourceNotFoundException while adding data to Kinesis Firehose stream using Lambda

I am trying to add data to a Kinesis Firehose delivery stream using put_record with Python 3.6 on AWS Lambda. When calling put_record on the stream, I get the following exception:
An error occurred (ResourceNotFoundException) when calling the PutRecord operation: Stream MyStream under account 123456 not found.
I am executing the following Python code to add data to the stream.
import boto3
import json

def lambda_handler(event, context):
    session = boto3.Session(aws_access_key_id=key_id, aws_secret_access_key=access_key)
    kinesis_client = session.client('kinesis', region_name='ap-south-1')
    records = event['Records']
    write_records = list()
    count = 0
    for record in records:
        count += 1
        if str(record['eventName']).lower() == 'insert':
            rec = record['dynamodb']['Keys']
            rec.update(record['dynamodb']['NewImage'])
            new_record = dict()
            new_record['Data'] = json.dumps(rec).encode()
            new_record['PartitionKey'] = 'PartitionKey' + str(count)
            # Following line throws the exception
            kinesis_client.put_record(StreamName="MyStream", Data=new_record['Data'], PartitionKey='PartitionKey' + str(count))
        elif str(record['eventName']).lower() == 'modify':
            pass
    write_records = json.dumps(write_records)
    print(stream_data)
MyStream's status is active, and the source for the stream data is set to "Direct PUT and other sources".
If you are sure that the stream name is correct, you can create the client with the regional endpoint of Kinesis:
kinesis_client = session.client('kinesis', region_name='ap-south-1', endpoint_url='https://kinesis.ap-south-1.amazonaws.com/')
AWS Service Endpoints List
https://docs.aws.amazon.com/general/latest/gr/rande.html
Hope this helps!

Cloudwatch logs filter to trigger lambda then extract values from log data

I have a question, as per the subject line.
I want to create an AWS CloudWatch Logs filter or event to trigger a Lambda function from a filter pattern, and then extract values from that log data inside the Lambda function in Python.
Example:
Filter name: abcd
Value to extract: 01234 (passed to the Lambda function)
Log data:
abcd:01234
Any ideas?
Here is a simple way to capture the events from CloudWatch. The log data is in the message. You could process it here, or send it on to Firehose and transform it there. Alternatively, you could send CloudWatch Logs directly to Firehose with a subscription, but I think that has to be done with the AWS CLI.
import boto3
import gzip
import json
import base64

firehose = boto3.client('firehose', region_name='us-east-2')

def print_result(firehose_return):
    records_error = int(firehose_return['FailedPutCount'])
    records_sent = len(firehose_return['RequestResponses'])
    return 'Firehose sent %d records, %d error(s)' % (records_sent, records_error)

def lambda_handler(events, context):
    # CloudWatch Logs delivers the payload base64-encoded and gzip-compressed.
    cw_encoded_logs_data = events['awslogs']['data']
    compressed_payload = base64.b64decode(cw_encoded_logs_data)
    cw_decoded_logs_data = gzip.decompress(compressed_payload)
    cw_all_events = json.loads(cw_decoded_logs_data)
    records = []
    for event in cw_all_events['logEvents']:
        log_event = {
            "Data": str(event['message']) + '\n'
        }
        records.insert(len(records), log_event)
        # Firehose PutRecordBatch accepts at most 500 records per call, so flush in batches.
        if len(records) > 499:
            firehose_return = firehose.put_record_batch(
                DeliveryStreamName='streamname',
                Records=records
            )
            print_result(firehose_return)
            records = []
    if len(records) > 0:
        firehose_return = firehose.put_record_batch(
            DeliveryStreamName='streamname',
            Records=records
        )
        print(print_result(firehose_return))
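For the subscription route mentioned above, the setup can also be done programmatically; a hedged boto3 sketch with placeholder names and ARNs (the role must allow CloudWatch Logs to write to the delivery stream):
import boto3

logs = boto3.client('logs', region_name='us-east-2')

# Placeholder log group, filter, and ARNs.
logs.put_subscription_filter(
    logGroupName='/aws/lambda/my-log-group',
    filterName='abcd',
    filterPattern='abcd',
    destinationArn='arn:aws:firehose:us-east-2:123456789012:deliverystream/streamname',
    roleArn='arn:aws:iam::123456789012:role/CWLtoFirehoseRole',
)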
