Azure Data Factory Missing Blob Triggers - azure-blob-storage

I have created an ADF pipeline that should trigger when a blob is added to a storage container (say container1) and copy the blob to another storage container (say container2). All my blob names are alphanumeric with '-' (basically a GUID). I see that the ADF pipeline is triggered only a few times compared to the number of blobs in container1 (i.e. if I have n files in container1, the pipeline is triggered only x times, where x < n).
I also observed that whenever the number of blobs created per second in container1 is high, there are more missed triggers. I am not using any event batching in Event Grid. My storage account is a v2 BlockBlobStorage account.
Is there a way I can resolve this?

I think it is difficult to get a definitive answer for this from the community. It would be better to move this issue to a Microsoft support request, so that after MS's stress test we can find out whether this is a bug.

Related

Azure Data Factory (Graph Data Connect/Office365 Linked Service): how to work with Binary sink dataset?

Here's what I'm doing.
My company needs me to dump all group members and their corresponding groups into an SQL database. Power Automate takes forever with too many loops and API calls...so I'm trying Data Factory for the first time.
Using the Office365 Linked Service, we can get all organization members--but the only compatible sink option is Azure Blob storage (or DataLake) because the sink MUST be binary.
Ok, fine. So we got an Azure Blob storage account configured and set up.
But now that the pipeline 'copy data' has completed (after 4 hours?), I don't know what to do with this binary data. There seems to be no function, method or dataflow option to interpret the binary data as JSON, delimited text, or otherwise. The storage account shows 1042 different blobs, ranging haphazardly from a few kilobytes to dozens of megabytes (why???). Isn't there anything in Data Factory that can interpret this binary data and allow me to dump the columns I need into SQL?
I was able to load the blob data into Power Automate and parse it into usable JSON using the base64 and json functions, but this is robbing Peter to pay Paul because I have to use a loop to load the contents of 1042 different blobs, and I'm exceeding our bandwidth quota. Besides that, some of the blobs are empty!! (again...why??)
I've looked everywhere for answers, no luck. So thank you for any insight.
You can use a Binary dataset in the Copy activity, GetMetadata activity, or Delete activity. When using a Binary dataset, the service does not parse file content but treats it as-is.
So the Data Flow activity, which is used to transform data in Azure Data Factory, isn't supported for Binary datasets.
Hence, you can use another Azure service such as Azure Databricks, in which you can use Python (e.g. OpenCV) or any other data engineering library in your preferred programming language.
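As a rough sketch of that approach (in a Databricks notebook or any Python environment), the blobs written by the Office365 copy activity could be downloaded and parsed directly. The connection string and container name below are placeholders, and the "one JSON object per line" format is an assumption you should verify against your own output.
```python
# Sketch: download the binary blobs produced by the Office 365 copy activity
# and parse them as JSON. Placeholders: connection string, container name.
import json
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>",   # placeholder
    container_name="office365-export",        # hypothetical container name
)

records = []
for blob in container.list_blobs():
    data = container.download_blob(blob.name).readall()
    text = data.decode("utf-8").strip()
    if not text:
        continue  # some of the 1042 blobs were observed to be empty
    # Assumption: each non-empty blob holds one JSON object per line (JSON lines).
    for line in text.splitlines():
        records.append(json.loads(line))

print(f"Parsed {len(records)} records")
# From here the records can be loaded into a DataFrame and written to SQL
# (e.g. with pandas + pyodbc, or spark.createDataFrame in Databricks).
```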

Create static content with images and videos and show it in my spring-boot application

I wrote a basic blog system, which is based on Spring Boot.
I'm trying to figure out how I can create posts with videos and images, without needing to edit everything using HTML.
Right now, I am saving my blog posts in the DB as plain text.
Is it possible to create content combining text, images and videos, and save this "content" as one row in my DB table, without creating connections between different tables?
Many thanks in advance.
Images and videos are heavy content, and storing them in the database can be a costly affair unless you are developing an application purely for research purposes. Querying them from the database and serving them over the network can also hurt your application's performance.
If you want to store everything in a single row, that can be done using a database BLOB object. But I would suggest having two different tables: one containing the BLOB objects for images and videos, and the other your usual table containing the blog text and the key of the corresponding BLOB row.
If you want to take your solution live, it is better to use image/video hosting servers, for the following reasons:
Saves on your database cost
Ensures 24x7 availability
Application performance is faster, as these are hosted independently of the application
Videos can be embedded directly (e.g. in an iframe), i.e. you do not need to query megabytes of data and serve them over the network
As a strict answer to your question: yes, you can use BLOBs to store the videos/images in the database. Think of a BLOB as a column that contains the bytes of the video or image.
For school projects or cases where you have a very small number of videos/images, it's probably OK. However, if you're building a real application, then don't do it :)
Every DBA will raise a bunch of concerns about why not to use BLOBs.
So a more realistic approach would be to find some file-system-like (but distributed) storage, such as S3 on AWS, or a hard drive on the server if you're not in the cloud, etc.
Then store that big image/video there, get back an identifier (like the path to it, if we're talking about the hard drive), and store that identifier in the database along with the metadata you probably already store (like blogPostId, type of file, etc.), as sketched below.
Once the application becomes more "mature" you can switch the "provider" and grow as you go. There are even cloud storage services designed especially for images (like Cloudinary).
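To make the "bytes outside the database, identifier plus metadata inside" idea concrete, here is a minimal sketch. The question is about Spring Boot, but the pattern is language-agnostic, so the sketch uses Python with sqlite3 and a local folder purely for brevity; the table, columns and helper names are hypothetical.
```python
# Sketch: store the heavy file on disk (or S3/Cloudinary in production) and keep
# only its identifier, path and metadata in the database. Names are hypothetical.
import shutil
import sqlite3
import uuid
from pathlib import Path

MEDIA_DIR = Path("media")          # stand-in for S3, a CDN, Cloudinary, etc.
MEDIA_DIR.mkdir(exist_ok=True)

db = sqlite3.connect("blog.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS post_media (
           id            TEXT PRIMARY KEY,
           blog_post_id  INTEGER NOT NULL,
           file_type     TEXT NOT NULL,
           storage_path  TEXT NOT NULL
       )"""
)

def attach_media(blog_post_id: int, source_file: str, file_type: str) -> str:
    """Copy the uploaded file into MEDIA_DIR and record only its path in the DB."""
    media_id = str(uuid.uuid4())
    target = MEDIA_DIR / f"{media_id}_{Path(source_file).name}"
    shutil.copy(source_file, target)          # the heavy bytes never enter the database
    db.execute(
        "INSERT INTO post_media (id, blog_post_id, file_type, storage_path) VALUES (?, ?, ?, ?)",
        (media_id, blog_post_id, file_type, str(target)),
    )
    db.commit()
    return media_id                           # reference this id from the post's markup

# Example: attach_media(42, "holiday.mp4", "video/mp4")
```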

Is it possible to write multiple blobs in a single request?

We're planning to use Azure blob storage to save processing log data for later analysis. Our systems are generating roughly 2000 events per minute, and each "event" is a json document. Looking at the pricing for blob storage, the sheer number of write operations would cost us tons of money if we take each event and simply write it to a blob.
My question is: Is it possible to create multiple blobs in a single write operation, or should I instead plan to create blobs containing multiple event data items (for example, one blob for each minute's worth of data)?
It is possible, but it isn't good practice: it takes a long time for multipart files to be merged. Hence, we try to separate the upload action from the entity-persist operation by passing the entity id and updating the document/image name in another controller.
It also keeps your upload functionality clean. Best wishes.
It's impossible to create multiple blobs in a single write operation.
One feasible solution is to create blobs containing multiple event data items, as you planned (which is hard to implement and query, in my opinion); another solution is to store the event data in Azure Table Storage rather than Blob storage, and leverage Entity Group Transactions to write the table entities in one batch (which is billed as one transaction).
Please note that all table entities in one batch must have the same partition key, which should be considered when you're designing your table (see the Azure Storage Table Design Guide for further information). If some of your events have a data size that exceeds the limits of Azure Table Storage (1 MB per entity, 4 MB per batch), you can save those events' data to Blob storage and store the blob links in the table.
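As a rough illustration of the Entity Group Transaction option, here is a sketch using the azure-data-tables Python SDK (which postdates this answer, so treat it as one possible implementation rather than the answer's own code). The connection string, table name and minute-based partition key are assumptions.
```python
# Sketch: batch one minute's worth of events into Table Storage using Entity Group
# Transactions. All entities in a batch share the same PartitionKey (the minute bucket),
# and a batch is limited to 100 entities / ~4 MB, so the events are chunked accordingly.
import json
import uuid
from datetime import datetime, timezone
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    conn_str="<storage-connection-string>",   # placeholder
    table_name="events",                      # hypothetical table, assumed to exist
)

def write_minute_batch(events):
    """Write a list of event dicts as one or more Entity Group Transactions."""
    minute_bucket = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    operations = [
        ("create", {
            "PartitionKey": minute_bucket,    # same partition key for the whole batch
            "RowKey": str(uuid.uuid4()),
            "Payload": json.dumps(event),     # must stay under the ~1 MB entity limit
        })
        for event in events
    ]
    for i in range(0, len(operations), 100):  # 100-entity limit per batch
        table.submit_transaction(operations[i:i + 100])  # each call billed as one transaction

# Example: write_minute_batch([{"event": "login", "user": "abc"}, {"event": "click"}])
```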

Realistic Data Backup method for Parse.com

We are building an iOS app with Parse.com, but still can't figure out the right way to backup data efficiently.
As a premise, we have and will have a LOT of data store rows.
Say we have a class with 1 million rows; assume we have it backed up and then want to bring it back to Parse after a hazardous situation (like data loss in production).
The few solutions we have considered are the following:
1) Use external server for backup
BackUp:
- use the REST API to constantly back up data to a remote MySQL server (we chose MySQL for customized analytics purposes, since it's way faster and easier for us to handle the data with MySQL)
ImportBack:
a) - recreate JSON objects from MySQL backup and use the REST API to send back to Parse.
Say we use the batch operation, which permits 50 objects to be created with one request, and assume each request takes 1 second; 1 million records would take about 5.5 hours to transfer to Parse (a sketch of this batching is shown after this list).
b) - recreate one JSON file from MySQL backup and use the Dashboard to import data manually.
We just tried this method with a 700,000-record file: it took about 2 hours for the loading indicator to stop and show the number of rows in the left pane, but it never opened in the right pane (it says "operation timed out"), and it's been over 6 hours since the upload started.
So we can't rely on 1.b, and 1.a seems to take too long to recover from a disaster (if we have 10 million records, it would be something like 55 hours ≈ 2.3 days).
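For reference, a minimal sketch of option 1.a, assuming the classic parse.com REST batch endpoint (50 operations per request). The application id, REST API key, class name and the load_rows_from_mysql() helper are placeholders, not something from the original setup.
```python
# Sketch: recreate JSON objects from the MySQL backup and push them back to Parse
# 50 at a time via the REST batch endpoint.
import requests

PARSE_BATCH_URL = "https://api.parse.com/1/batch"
HEADERS = {
    "X-Parse-Application-Id": "<app-id>",        # placeholder
    "X-Parse-REST-API-Key": "<rest-api-key>",    # placeholder
    "Content-Type": "application/json",
}

def restore_class(class_name, rows, batch_size=50):
    """Send `rows` (list of dicts rebuilt from MySQL) back to Parse in batches."""
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        payload = {
            "requests": [
                {"method": "POST", "path": f"/1/classes/{class_name}", "body": row}
                for row in chunk
            ]
        }
        resp = requests.post(PARSE_BATCH_URL, json=payload, headers=HEADERS)
        resp.raise_for_status()   # the response body lists a success/error per object

# Example: restore_class("GameScore", load_rows_from_mysql("game_score"))  # hypothetical helper
```
At roughly one request per second, that is ~20,000 requests for 1 million rows, which is where the 5.5-hour estimate comes from.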
Now we are thinking about the following:
2) Constantly replicate data to another app
Create the following in Parse:
- Production App: A
- Replication App: B
So while A is in production, every single query would be duplicated to B (using a background job running constantly).
The downside, of course, is that it would eat up A's burst limit, since it simply doubles the number of queries. So it is not ideal when thinking about scaling up.
What we want is something like AWS RDS which gives an option to automatically backup daily.
I wonder how this could be difficult for Parse since it's based on AWS infra.
Please let me know if you have any idea on this, will be happy to share know-hows.
P.S.:
We've noticed an important flaw in idea 2) above.
If we replicate using the REST API, all the objectIds of all classes will change, so every 1-to-1 or 1-to-many relation will be broken.
So we are thinking about adding a uuid field to every object class.
Is there any problem with this method?
One thing we want to achieve is
query.include(“ObjectName”)
( or in Obj-C “includeKey”),
but I suppose that won’t be possible if we don’t base our app logic on objectId.
We're looking for a workaround for this issue;
but will uuid-based management be functional under Parse’s Datastore logic?
Parse has never lost production data. While we don't currently offer automated backups, you can request one any time you like, and we're working on making all of this even nicer. Additionally, it's easier in most cases to import the JSON export file through the data browser rather than using the REST batch.
I can confirm that today, Parse did lose my data. Or at least it appeared to be so.
After several errors were detected on multiple apps (acknowledged by the Parse Status Twitter account), we could not retrieve data for one of our apps, without any error being reported.
It turned out that an entire column of one of our classes (of type pointer) had disappeared, and the data was no longer present in the dashboard.
We were using this pointer column to filter/retrieve data, so the returned queries and collections were empty.
So we decided to recreate the column manually. By chance, recreating the column with the same name and type solved the issue and the data was still there... I can't explain it, but I really thought, and the app reacted as if, the data was lost.
So an automated backup and restore option is mandatory; it is not optional.
In December 2015, parse.com released a new dashboard with an improved export feature.
Just select your app and click "App Settings" -> "General" -> "Export app data". Parse generates a JSON file for every class in your app and sends you an email when the export process is done.
UPDATE:
Sad but true, parse.com is winding down: http://blog.parse.com/announcements/moving-on/
I had the same issue of backing up Parse Server data. Since Parse Server uses MongoDB, backing up the data is not an issue. I just did a simple thing: I downloaded the MongoDB dump from the server and then restored it using
mongorestore /path-to-mongodump (extracted files)
As Parse has been turned into open source, we can adopt this technique.
For accidental deletes, writing a 'beforeDelete' cloud function to back up the current row to another class would work.
For regular backups, a manual export of changed records (using a filter) will be useful. For recovery, this requires you to write scripts or use the import option (not so sure) in the data browser. You could also write a cloud function to replicate data to your backup server (haven't tried this yet).
However, there are some limitations to Cloud Code that you should consider before venturing into it:
https://parse.com/docs/cloud_code_guide#functions-resource

Can DB2 tell a web-app when a table data is updated?

I have a table of non-trivial size in a DB2 database that is updated X times a day per user input in another application. This table is also read by my web app to display some info to another set of users. I have a large number of users on my web app, and they need to do lots of fuzzy string lookups on data that is up-to-the-minute accurate. So, I need a server-side cache to run my fuzzy logic on and to keep the DB from getting hammered.
So, what's the best option? I would hate to pull the entire table every minute when the data changes so rarely. I could set up a trigger to update a timestamp in a smaller table and poll that to see if I need to refresh my cache, but that seems hacky too.
Ideally I would like to have DB2 tell my web-app when something changes, or at least provide a very lightweight mechanism to detect data level changes.
I think if your web application is running in WebSphere, setting up MQ would be a pretty good solution.
You could write triggers that use the MQ Series routines to add things to a queue, and your web app could subscribe to the queue and listen for updates.
If your web app is not in WebSphere then you could still look at this option but it might be more difficult.
A simple solution could be to have a timestamp (somewhere) for the latest change to the table.
The timestamp could live in a small table/view that is updated either by the application that updates the big table or by an update trigger on the big table.
The update trigger's only task would be to set this "helper" timestamp to the current timestamp.
Then the webapp only checks this timestamp.
If the timestamp is newer than what the webapp has, the data is re-read from the big table (see the sketch below).
A "low-tech" solution that's fairly non-intrusive to the existing system.
Hope this solution fits your setup.
Regards
Sigersted
Having the database push a message to your webapp is certainly doable via a variety of mechanisms (like MQSeries, etc.). Similar, and easier, is to write a Java stored procedure that gets kicked off by the trigger and hands the data to your cache-maintenance interface. But both of these solutions involve a lot of versioning dependencies, etc., that could be a real PITA.
Another option might be to reconsider the entire approach. Is it possible that instead of maintaining a cache on your app's side you could perform your text searching on the original table?
But my suggestion is to do as you (and the other poster) mention - and just update a timestamp in a single-row table purposed to do this, then have your web-app poll that table. Similarly you could just push the changed rows to this small table - and have your cache-maintenance program pull from this table. Either of these is very simple to implement - and should be very reliable.
