Is it possible to write multiple blobs in a single request? - azure-blob-storage

We're planning to use Azure Blob storage to save processing log data for later analysis. Our systems generate roughly 2,000 events per minute, and each "event" is a JSON document. Looking at the pricing for Blob storage, the sheer number of write operations would cost us a lot of money if we simply wrote each event to its own blob.
My question is: Is it possible to create multiple blobs in a single write operation, or should I instead plan to create blobs containing multiple event data items (for example, one blob for each minute's worth of data)?

It is possible, but it isn't good practice: merging multipart files takes a long time. That's why we separate the upload action from the entity-persist operation, passing the entity id and updating the document (image) name in another controller.
It also keeps your upload functionality clean. Best wishes.

It's impossible to create multiple blobs in a single write operation.
One feasible solution is to create blobs containing multiple event data items, as you planned (which, in my opinion, is hard to implement and query); another solution is to store the event data in Azure Table storage rather than Blob storage, and leverage Entity Group Transactions to write table entities in one batch (which is billed as a single transaction).
Please note that all table entities in one batch must have the same partition key, which should be considered when you're designing your table (see the Azure Storage Table Design Guide for further information). If some of your events have a data size that exceeds the limits of Azure Table storage (1 MB per entity, 4 MB per batch), you can save those events' data to Blob storage and store the blob links in the table.
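As a sketch of the batching idea in Python, assuming each event carries a timestamp, an id, and a payload (hypothetical field names): group events by a minute-granularity partition key, then split each group into chunks of at most 100 entities, the Entity Group Transaction limit. The `submit_transaction` call in the trailing comment is from the azure-data-tables SDK and is indicative only:

```python
from collections import defaultdict

def batch_events(events, max_batch=100):
    """Group events by a minute-granularity PartitionKey and split each
    group into batches of at most `max_batch` entities (the EGT limit is 100)."""
    by_partition = defaultdict(list)
    for event in events:
        # All entities in one entity group transaction must share a PartitionKey;
        # a minute-level timestamp is one reasonable choice.
        pk = event["timestamp"].strftime("%Y%m%d%H%M")
        by_partition[pk].append({
            "PartitionKey": pk,
            "RowKey": event["id"],
            "Payload": event["payload"],
        })
    for entities in by_partition.values():
        for i in range(0, len(entities), max_batch):
            yield entities[i:i + max_batch]

# Each yielded batch could then be written as one billed transaction, e.g.
# with the azure-data-tables client (indicative, not tested here):
#   table_client.submit_transaction([("create", e) for e in batch])
```

At ~2,000 events per minute this comes out to roughly 20 transactions per minute instead of 2,000 writes.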

Related

Azure Data Factory (Graph Data Connect/Office365 Linked Service): how to work with Binary sink dataset?

Here's what I'm doing.
My company needs me to dump all group members and their corresponding groups into an SQL database. Power Automate takes forever with too many loops and API calls...so I'm trying Data Factory for the first time.
Using the Office365 Linked Service, we can get all organization members--but the only compatible sink option is Azure Blob storage (or DataLake) because the sink MUST be binary.
OK, fine. So we got an Azure Blob storage account configured and set up.
But now that the pipeline 'copy data' has completed (after 4 hours?), I don't know what to do with this binary data. There seems to be no function, method or dataflow option to interpret the binary data as JSON, delimited text, or otherwise. The storage account shows 1042 different blobs, ranging haphazardly from a few kilobytes to dozens of megabytes (why???). Isn't there anything in Data Factory that can interpret this binary data and allow me to dump the columns I need into SQL?
I was able to load the blob data into Power Automate and parse it into usable JSON using the base64 and json functions, but this is robbing Peter to pay Paul, because I have to use a loop to load the contents of 1042 different blobs and I'm exceeding our bandwidth quota. Besides that, some of the blobs are empty!! (again... why??)
I've looked everywhere for answers, no luck. So thank you for any insight.
You can use the Binary dataset in the Copy activity, GetMetadata activity, or Delete activity. When using a Binary dataset, the service does not parse file content but treats it as-is.
So the Data Flow activity, which is used to transform data in Azure Data Factory, doesn't support Binary datasets.
Hence, for another approach you can use an Azure service such as Azure Databricks, where you can use Python, OpenCV, or any other data-engineering library in your preferred programming language.
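If you do hand the blobs off to outside code, the parsing the question did in Power Automate is straightforward in Python. A minimal sketch, assuming each blob holds UTF-8 JSON (either a single object or a list of records):

```python
import json

def parse_blob_bytes(raw: bytes):
    """Interpret the raw bytes of one blob as UTF-8 JSON.
    Empty blobs (which the question reports seeing) yield no records."""
    text = raw.decode("utf-8").strip()
    if not text:
        return []
    doc = json.loads(text)
    # The export may hold a list of records or a single object;
    # normalise to a list either way.
    return doc if isinstance(doc, list) else [doc]
```

From there, each record is an ordinary dict whose columns can be inserted into SQL with whatever client you prefer.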

Data health check tool

I want to perform data health check on huge volume of data, which can be either in RDBMS or cloud file storage like Amazon S3. Which tool would be appropriate for performing data health check, which can give me number of rows, rows not matching a given schema for data type validation, average volume for given time period etc?
I do not want to use any big-data platform like Qubole or Databricks because of the extra cost involved. I found Drools, which can perform similar operations, but it would need to read the full data into memory and associate it with a POJO before validation. Any alternatives would be appreciated where I do not have to load the full data into memory.
You can avoid loading the full data into memory by using Drools' StatelessKieSession. A StatelessKieSession works only on the current event: it does not maintain state across events, and it does not keep objects in memory. Read more about StatelessKieSession here.
Alternatively, you can use a stateful KieSession and give each event an expiry using the @expires declaration, which discards the event after the specified time. Read more about @expires here.
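Outside Drools, the same stay-out-of-memory idea is easy to sketch: stream rows one at a time and accumulate only counters. A minimal Python sketch (the schema convention of mapping column names to parser callables is an assumption for illustration, not a Drools API):

```python
import csv

def health_check(lines, schema):
    """Stream CSV rows one at a time and count rows failing simple type checks.
    `schema` maps column name -> a callable that raises on invalid values.
    Only one row is held in memory at any moment."""
    stats = {"rows": 0, "invalid": 0}
    reader = csv.DictReader(lines)
    for row in reader:
        stats["rows"] += 1
        for col, parse in schema.items():
            try:
                parse(row[col])
            except (ValueError, KeyError, TypeError):
                stats["invalid"] += 1
                break  # one failure is enough to flag the row
    return stats
```

The same pattern extends to row counts per time window or per-column averages, and it works equally well reading line-by-line from S3 or a JDBC cursor.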

Data structure for activity feed

There's a concept of a workspace in our application. A user can be a member of virtually any number of workspaces, and a workspace can have virtually any number of users. I want to implement an activity feed to help users find out what happened in every workspace they're members of, i.e. when someone uploads a file or creates a task in a workspace, this activity appears in that workspace's activity feed and also in each of its users' activity feeds.
The problem is that I can't come up with a suitable data structure for quick read and write operations of activities. What I have come up with is storing each activity with a property Targets, which is a string of all the workspace's user ids, and then filtering activities where that field contains the id of the user I want to fetch activities for. But this approach has serious performance and scalability limitations, because we use SharePoint as our storage.
We can also use Azure Table or Blob storage, and I was thinking of just creating a separate activity entity for every user of a workspace, so that I can then easily filter activities by a user's id. But this could result in hundreds of copies of the same activity if a workspace has hundreds of members, and writing all those copies then becomes problematic, as Azure only supports 100 entities in a single batch operation (correct me if I'm wrong), and SharePoint then is not an option at all.
So I need help figuring out what data structure I could use to store the activities of each workspace so that they're easily retrievable for any member (probably by the member's id) and also for any workspace by the workspace's id.
We can also use Azure Table or Blob Storage and I was thinking of just creating a separate activity entity for every user of a workspace so that then I can just easily filter activities by user's id
Azure Table storage could be a choice for storing your activity entities. Table storage is relatively inexpensive, so you can consider storing the same entity multiple times (with different partitioning strategies), in separate partitions or in separate tables, for efficient reading.
Storing each user's activity entity with workspaceid_userid as a compound key is also a possible approach. For more detailed table design patterns, please refer to this article.
Azure only supports 100 entities in a single batch operation (correct me if I'm wrong)
Yes, a single batch operation can include up to 100 entities.
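A minimal Python sketch of that fan-out-and-batch approach (field names such as workspace_id and body are hypothetical, and note that the 100-entity limit applies only within a single partition key):

```python
from collections import defaultdict

BATCH_LIMIT = 100  # Azure Table storage batch (entity group transaction) limit

def fan_out(activity, member_ids):
    """Create one copy of an activity per workspace member, keyed so each
    user's feed lives under a compound key (PartitionKey = workspaceid_userid)."""
    return [{
        "PartitionKey": f"{activity['workspace_id']}_{uid}",
        "RowKey": activity["id"],
        "Body": activity["body"],
    } for uid in member_ids]

def to_batches(entities):
    """Group by PartitionKey (a batch must share one) and split into <=100."""
    groups = defaultdict(list)
    for e in entities:
        groups[e["PartitionKey"]].append(e)
    for group in groups.values():
        for i in range(0, len(group), BATCH_LIMIT):
            yield group[i:i + BATCH_LIMIT]
```

One trade-off worth noticing: with workspaceid_userid partitioning, every user's copy lands in its own partition, so a batch can only ever group entities for the same user; a per-workspace or per-user partition key fills batches much better.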

Windows Azure Application high volume of records insertions

We are meant to be developing a web-based application on the Azure platform. I have some basic understanding, but still have many questions.
The application we are to develop will have a lot of database interaction and will need to insert a large volume of records every day.
What is the best way to interact with the DB here: via a queue (i.e., the web role writes to a queue, then a worker role reads the queue and saves the data to the DB), or writing directly to SQL Server?
And should it be a multi-tenant application?
I've been playing around with Windows Azure SQL Database for a little while now, and this is a blog post I wrote about inserting large amounts of data:
http://alexandrebrisebois.wordpress.com/2013/02/18/ingesting-massive-amounts-of-relational-data-with-windows-azure-sql-database-70-million-recordsday/
My recipe is as follows: to insert/update data I used the following dataflow:
◾Split your data into reasonably sized DataTables
◾Store the DataTables as blobs in the Windows Azure Blob Storage service
◾Use SqlBulkCopy to insert the data into write tables
◾Once you have reached a reasonable amount of records in your write tables, merge the records into your read tables using reasonably sized batches. Depending on the complexity and the indexes/triggers present on the read tables, batches should be of about 100,000 to 500,000 records.
◾Before merging each batch, be sure to remove duplicates by keeping the most recent records only.
◾Once a batch has been merged, remove the data from the write table. Keeping this table reasonably small is quite important.
◾Once your data has been merged, be sure to check up on your index fragmentation.
◾Rinse & repeat
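Two of the steps above, the dedupe-before-merge and the reasonably sized batches, can be sketched in Python; the field names id and updated_at are placeholders for whatever key and version column your records carry:

```python
def dedupe_latest(records, key="id", version="updated_at"):
    """Keep only the most recent record per key, so each batch merges cleanly.
    (A sketch of the 'remove duplicates' step; field names are placeholders.)"""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        if current is None or rec[version] > current[version]:
            latest[rec[key]] = rec
    return list(latest.values())

def chunked(records, size=250_000):
    """Split records into reasonably sized batches (100k-500k suggested above)."""
    for i in range(0, len(records), size):
        yield records[i:i + size]
```

In practice the dedupe can also be done server-side with ROW_NUMBER() over the key ordered by the version column, which avoids pulling the write table back to the client.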

What specific issues will I have to consider when saving files as binary data to a SQL Server 2005 database?

I'm writing an online tax return filing application using MVC3 and EF 4.1. Part of the application requires that the taxpayer be able to upload documents associated with their return. The users will be able to come back days or weeks later and possibly upload additional documents. Prior to finally submitting their return the user is able to view a list of files that have been uploaded. I've written the application to save the uploaded files to a directory defined in the web.config. When I display the review page to the user I loop through the files in the directory and display it as a list.
I'm now thinking that I should be saving the files to the actual SQL Server as binary data in addition to saving them to the directory. I'm trying to avoid what if scenarios.
What if
A staff member accidentally deletes a file from the directory.
The file server crashes (Other agencies use the same SAN as us)
A staff member saves other files to the same directory. The taxpayer should not see those
Any other scenario that causes us to have to request another copy of a file from a taxpayer (Failure is not an option)
I'm concerned that saving to the SQL Server database will have dire consequences that I am not aware of since I've not done this before in a production environment.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing them. Do not store the employee photo in the employee table; keep it in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(
    -- define the fields here; these columns are just an example
    ID INT IDENTITY(1,1) PRIMARY KEY,
    FileName NVARCHAR(255) NOT NULL,
    FileContent VARBINARY(MAX)
)
ON [Data]                  -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON [LARGE_DATA]  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!
