Databricks reading duplicate rows coming from Event Hubs - azure-databricks

I have been trying to read some data in Azure Databricks coming from Event Hubs. On the first read, it reads the data correctly, but the problem is that when I send the same data or some different data later after some minutes or hours, it is reading both my previous and the new records. And I only want to read the new stream. I am stuck on this part that how can I only get to read the new records, and not duplicates.
The code I am using is written below. Please note I am using Azure Free Tier account along with Databricks community edition.
connectionString = "Endpoint=sb://ingestionlayer1.servicebus.windows.net/;SharedAccessKeyName=userpolicy2;SharedAccessKey=...=;EntityPath=dataingestion1"
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
ehConf['eventhubs.consumerGroup'] = "$default"
import json
startingEventPosition = {
"offset": -1,
"seqNo": -1,
"enqueuedTime": None,
"isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
# Read events from the Event Hub
df_new = spark.readStream.format("eventhubs").options(**ehConf).load()
read_stream = df_new.withColumn("body", df_new["body"].cast("string"))
#display(read_stream)
# Writing to Data lake
saveloc = "/mnt/streamingdatastorage2/datastorage2/Bronze/Table1"
read_stream2.writeStream.format("delta").option("checkpointLocation", f"{saveloc}/_checkpoint").start(saveloc)
I would really appreciate if anyone help me out with this !

Related

Elastic-Cloud Not Receiving Data from Serilog Sink

I set up an Elastic Cloud to offload my local elasticsearch config (as one does), but for reasons unknown to me, I can't get it to show any logs in Elastic Cloud, despite it working fine locally.
The code I got: (modified for privacy reasons)
//var uri = new Uri("http://localhost:9200"); // old one
var uri = new Uri("https://my-server.kb.eastus2.azure.elastic-cloud.com:9243");
var sinkOptions = new ElasticsearchSinkOptions(uri)
{
AutoRegisterTemplate = true,
ModifyConnectionSettings = x => x.BasicAuthentication("elastic", "the password I was given"),
IndexFormat = $"test-logs-{env.EnvironmentName?.ToLower().Replace('.', '-')}-{DateTime.Now:yyyy-MM}",
};
Log.Logger = new LoggerConfiguration()
.ReadFrom.Configuration(config)
.Enrich.FromLogContext()
.Enrich.WithMachineName()
.WriteTo.Console()
.WriteTo.Elasticsearch(sinkOptions)
.Enrich.WithProperty("Environment", env.EnvironmentName)
.CreateLogger();
There are two possible reasons I can think of that might be the cause of this not working:
The credentials are wrong
The Uri is wrong
Every solution I've been given so far has provided the data in this fashion, and nowhere does it say what the URI I'm supposed to use looks like.
I get no errors.
I get no warnings.
I get no logs.
What am I doing wrong here?
The issue was using the incorrect uri. I wrote
my-server.kb.eastus2.azure.elastic-cloud.com:9243 rather than
my-server.es.eastus2.azure.elastic-cloud.com:9243.
Note the very tiny difference that is kb vs es in the url

Very slow connection to Snowflake from Databricks

I am trying to connect to Snowflake using R in databricks, my connection works and I can make queries and retrieve data successfully, however my problem is that it can take more than 25 minutes to simply connect, but once connected all my queries are quick thereafter.
I am using the sparklyr function 'spark_read_source', which looks like this:
query<- spark_read_source(
sc = sc,
name = "query_tbl",
memory = FALSE,
overwrite = TRUE,
source = "snowflake",
options = append(sf_options, client_Q)
)
where 'sf_options' are a list of connection parameters which look similar to this;
sf_options <- list(
sfUrl = "https://<my_account>.snowflakecomputing.com",
sfUser = "<my_user>",
sfPassword = "<my_pass>",
sfDatabase = "<my_database>",
sfSchema = "<my_schema>",
sfWarehouse = "<my_warehouse>",
sfRole = "<my_role>"
)
and my query is a string appended to the 'options' arguement e.g.
client_Q <- 'SELECT * FROM <my_database>.<my_schema>.<my_table>'
I can't understand why it is taking so long, if I run the same query from RStudio using a local spark instance and 'dbGetQuery', it is instant.
Is spark_read_source the problem? Is it an issue between Snowflake and Databricks? Or something else? Any help would be great. Thanks.

Hyperledger composer performance(adding asset) is very low

Following code is simple code to check how many entities can be added per second or minute.
createAsset is calling backend(http:localhost:3000) and add data using post.
When I did test using this code, it took 23 seconds to add 10 entities.
I am using composer 0.19.12 and fabric 1.1. When I checked some thread from GitHub, performance has improved using indexing couchdb. How can I use that feature? (I need to check again, but it seems that it is default feature of recent composer version)
addEntities: async function() {
var start = 0;
var end = start + 100;
var sd = new Date();
console.log(sd.getHours()+':'+sd.getMinutes()+':'+sd.getSeconds()+'.'+sd.getMilliseconds());
for(var i = start; i<end; i++) {
entityData.id = i.toString();
await this.createAsset('/Entity', 'model.Entity', entityData);
}
var ed = new Date();
var totalTime = new Date(ed.getTime()-sd.getTime());
console.log(totalTime.getMinutes()+':'+totalTime.getSeconds()+'.'+totalTime.getMilliseconds());
},
My model is really simple as follows.
asset Entity identified by id {
o String id
}
I have changed the test code to send multiple transactions as follows following david_k's advice.
addEntities: async function() {
var start = 15000;
var dataNumber = 1200;
var loopNumber = 400;
var end = start + dataNumber;
var sd = new Date();
console.log(sd.getHours()+':'+sd.getMinutes()+':'+sd.getSeconds()+'.'+sd.getMilliseconds());
var tasks = [];
for(var i = start; i<end; i++) {
entityData.id = i.toString();
if((i-start)%loopNumber === loopNumber - 1) {
await this.createAsset('/Entity', 'model.Entity', entityData);
console.log('--- i: ' + i + ' loops completed');
}
else {
this.createAsset('/Entity', 'model.Entity', entityData);
}
}
var ed = new Date();
var totalTime = new Date(ed.getTime()-sd.getTime());
console.log(totalTime.getMinutes()+':'+totalTime.getSeconds()+'.'+totalTime.getMilliseconds());
},
The purpose of change is send multiple requests at the same time, and it seems work well because it shows much better performance compared to previous code. However, the performance is still around 8 TPS. As original test code was 1 transaction per 2sec~3sec, it improved a lot. But, 8TPS looks that it cannot be used for commercial application at all. Even it is not good for test purpose as well. Could someone give some advice for this?
That sounds about right looking at your example code and I am assuming you are using either the fabric-dev-servers package which is a very simple fabric network to help get users started with developing a business network and want to try out on a hyperledger fabric network, or you are using the byfn network from the multi-org tutorial which is a hyperledger fabric example of a 2 organisation network in a consortium to demonstrate the required operational steps of composer in a multi-org fabric setup.
Hyperledger Fabric is a distributed ledger technology based around eventual consistency. Composer implements a submit/notify model such that once a transaction has been submitted it will notify the client when that transaction has been committed to the ledger. You can configure which Peers in a network you are interested in informing you when that occurs, but the default is all of them and so the rest server responds once all peers have committed it to the ledger.
Hyperledger fabric doesn't commit individual transactions, it batches them up into blocks and these blocks get committed to the ledger, and it will wait a period of time before building that block with the current set of transactions that have been submitted for ordering, so blocks can contain one or more transactions. You need to configure fabric for your use case to determine how transactions are batched into blocks.

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
context = SparkContext(appName="PythonHBaseBulkLoader")
streamingContext = StreamingContext(context, 5)
stream = streamingContext.textFileStream("file:///test/input");
stream.foreachRDD(bulk_load)
streamingContext.start()
streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
#???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well, and successfully bulk loads data (using Puts or HFiles) I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assume you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
#Your configuration will likely be different. Insert your own quorum and parent node and table name
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\#Split the input into individual lines
.flatMap(csv_to_key_value)#Convert the CSV line to key value pairs
load_rdd.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
cols = row.split(",")#Split on commas.
#Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
#Works well for n>=1 columns
result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
(cols[0], [cols[0], "f2", "c2", cols[2]]),
(cols[0], [cols[0], "f3", "c3", cols[3]]))
return result
The value converter we defined earlier will convert these tuples into HBase Puts
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;
public class GatewayApplication {
public static void main(String[] args)
{
GatewayApplication app = new GatewayApplication();
GatewayServer server = new GatewayServer(app);
server.start();
}
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
#The output class changes, everything else stays
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}#"org.apache.hadoop.hbase.client.Put"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)
#Don't process empty RDDs
if not load_rdd.isEmpty():
#saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
"org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
conf=conf,
keyConverter=keyConv,
valueConverter=valueConv)
#The file has now been written, but HBase doesn't know about it
#Get a link to Py4J
gateway = JavaGateway()
#Convert conf to a fully fledged Configuration type
config = dict_to_conf(conf)
#Set up our HTable
htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
#Set up our path
path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
#Get a bulk loader
loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
#Load the HFile
loader.doBulkLoad(path, htable)
else:
print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
gateway = JavaGateway()
config = gateway.jvm.org.apache.hadoop.conf.Configuration()
keys = conf.keys()
vals = conf.values()
for i in range(len(keys)):
config.set(keys[i], vals[i])
return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)#Sort in ascending order

Importing binary data to parse.com

I'm trying to import data to parse.com so I can test my application (I'm new to parse and I've never used json before).
Can you please give me an example of a json file that I can use to import binary files (images) ?
NB : I'm trying to upload my data in bulk directry from the Data Browser. Here is a screencap : i.stack.imgur.com/bw9b4.png
In parse docs i think 2 sections could help you out depend on whether you want to use REST api of the android sdk.
rest api - see section on POST, uploading files that can be upload to parse using REST POST.
SDk - see section on "files"
code for Rest includes following:
use some HttpClient implementation having "ByteArrayEntity" class or something
Map your image to bytearrayEntity and POST it with the correct headers for Mime/Type in httpclient...
case POST:
HttpPost httpPost = new HttpPost(url); //urlends "audio OR "pic"
httpPost.setProtocolVersion(new ProtocolVersion("HTTP", 1,1));
httpPost.setConfig(this.config);
if ( mfile.canRead() ){
FileInputStream fis = new FileInputStream(mfile);
FileChannel fc = fis.getChannel(); // Get the file's size and then map it into memory
int sz = (int)fc.size();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
data2 = new byte[bb.remaining()];
bb.get(data2);
ByteArrayEntity reqEntity = new ByteArrayEntity(data2);
httpPost.setEntity(reqEntity);
fis.close();
}
,,,
request.addHeader("Content-Type", "image/*") ;
pseudocode for post the runnable to execute the http request
The only binary data allowed to be loaded to parse.com are images. In other cases like files or streams .. etc the most suitable solution is to store a link to the binary data in another dedicated storage for such type of information.

Resources