Azure Data Factory (Graph Data Connect/Office365 Linked Service): how to work with Binary sink dataset? - azure-blob-storage

Here's what I'm doing.
My company needs me to dump all group members and their corresponding groups into an SQL database. Power Automate takes forever with too many loops and API calls...so I'm trying Data Factory for the first time.
Using the Office365 Linked Service, we can get all organization members--but the only compatible sink option is Azure Blob storage (or DataLake) because the sink MUST be binary.
Ok, fine. So we got an Azure Blob Storage account configured and set up.
But now that the pipeline 'copy data' has completed (after 4 hours?), I don't know what to do with this binary data. There seems to be no function, method or dataflow option to interpret the binary data as JSON, delimited text, or otherwise. The storage account shows 1042 different blobs, ranging haphazardly from a few kilobytes to dozens of megabytes (why???). Isn't there anything in Data Factory that can interpret this binary data and allow me to dump the columns I need into SQL?
I was able to load the blob data into Power Automate and parse it into usable JSON using the base64 and json functions, but this is robbing Peter to pay Paul because I have to use a loop to load the contents of 1042 different blobs and I'm exceeding our bandwidth quota. Besides that, some of the blobs are empty!! (again...why??)
I've looked everywhere for answers, no luck. So thank you for any insight.

You can use a Binary dataset in the Copy activity, GetMetadata activity, or Delete activity. When using a Binary dataset, the service does not parse the file content but treats it as-is.
So the Data Flow activity, which is used to transform data in Azure Data Factory, isn't supported for Binary datasets.
Hence, you can take another approach with an Azure service such as Azure Databricks, in which you can use Python or any other data engineering library in your preferred programming language to parse the content.
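For what it's worth, here is a minimal Python sketch of that approach (for example in a Databricks notebook). It assumes the copy activity wrote the member records as one JSON document per line; the connection string, container name and selected properties are placeholders for your own values:

    import json
    from azure.storage.blob import ContainerClient

    # Placeholders: substitute your own storage connection string and container name.
    container = ContainerClient.from_connection_string(
        conn_str="<storage-connection-string>",
        container_name="gdc-output",
    )

    rows = []
    for blob in container.list_blobs():
        data = container.download_blob(blob.name).readall()
        if not data:
            continue  # some output blobs can be empty
        # Assumption: each non-empty blob holds one JSON document per line.
        for line in data.decode("utf-8").splitlines():
            if line.strip():
                doc = json.loads(line)
                rows.append((doc.get("id"), doc.get("displayName"), doc.get("mail")))

    # "rows" can now be bulk-inserted into SQL (pyodbc/SQLAlchemy) or turned into
    # a Spark or pandas DataFrame inside Databricks.
    print(f"Parsed {len(rows)} member records")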

Related

Create static content with images and videos and show it in my spring-boot application

I wrote a basic blog system, which is based on Spring Boot.
I'm trying to figure out how I can create posts with videos and images, without needing to edit everything using HTML.
Right now, I am saving my blog posts in the DB as plain text.
Is it possible to create content that combines text, images and videos, and save this "content" as one row in my DB table, without creating connections between different tables?
Many thanks in advance.
Images and videos are heavy content, and storing them in the database can be costly unless you are developing the application for research purposes. Also, querying them from the database and serving them over the network can hurt your application's performance.
If you want to store it in a single row, that can be done using a database BLOB object. But I would suggest having two different tables: one containing the BLOB objects for the images and videos, and the other your usual table containing the blog text and the primary key of the BLOB table.
If you want to take your solution live, it is better to use image/video hosting servers for the following reasons:
It saves database cost
It ensures 24x7 availability
Application performance is faster, as the media is hosted independently of the application
Videos can be iframed directly, i.e. you do not need to query megabytes of records and serve them over the network
A strict answer to your question: yes, you can use BLOBs to store the videos/images in the database. Think of them as a column that contains the bytes of the video or image.
For school projects or cases where you have a very small number of videos/images it's probably OK. However, if you're building a real application, then don't do it :)
Every DBA will raise a bunch of concerns about why not to use BLOBs.
So a more realistic approach would be to find some file-system-like (but distributed) storage, e.g. S3 on AWS, a hard drive on the server if you're not on the cloud, etc.
Then store that big image/video there, get an identifier (like the path to it, if we're talking about the hard drive), and store that identifier in the database along with the metadata you probably already store (like blogPostId, type of file, etc.).
Once the application becomes more "mature" you can switch the "provider"; grow as you go. There are even cloud storages designed especially for images (like Cloudinary).
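As a rough illustration of that pattern (upload the heavy file to object storage and keep only an identifier plus metadata in the database), here is a minimal Python sketch; the bucket, table and column names are invented for the example:

    import sqlite3
    import uuid

    import boto3

    s3 = boto3.client("s3")
    db = sqlite3.connect("blog.db")
    db.execute(
        """CREATE TABLE IF NOT EXISTS post_media (
               id TEXT PRIMARY KEY,
               post_id INTEGER,
               media_type TEXT,
               storage_key TEXT
           )"""
    )

    def attach_media(post_id, local_path, media_type):
        """Upload the file to object storage and record only its key in the DB."""
        key = f"posts/{post_id}/{uuid.uuid4()}"
        s3.upload_file(local_path, "blog-media", key)  # the heavy bytes live in S3, not the DB
        db.execute(
            "INSERT INTO post_media (id, post_id, media_type, storage_key) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), post_id, media_type, key),
        )
        db.commit()
        return key  # render this key (or a CDN URL built from it) in the post body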

How to validate BLOB object in oracle

I have BLOB data (pdf file attachment) in a table.
For us, it's too expensive to write Java or other code to read the BLOB in order to validate it.
Is there any shortcut/easy/less expensive way to validate my BLOB? Any command(s) to read the metadata and validate the BLOB?
I would like to check whether the BLOB object is corrupted or not
That's not something you should do in the database. A BLOB is a binary file which is interpreted by the appropriate client software (Adobe Reader, MS Word, whatever). As far as the database is concerned it's a black box. So your application ought to validate the file before it uploads it into the database.
However, there is a workaround. You can build an Oracle Text CONTEXT index on your BLOB column. CONTEXT is really designed for free text searching of documents but indexing is a way to prove that the uploaded file is readable.
The snag with CONTEXT indexes is that they aren't transactional: normally there's a background job which indexes new documents, but for this purpose you would probably want to call CTX_DDL.SYNC_INDEX() as part of the upload, so that the user gets timely feedback.
I will reiterate that Text is a workaround, and expensive in terms of database resources. The index itself will consume space and the indexing process requires time and cpu cycles. That's a big investment unless you're going to work with the document inside the database.
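A rough sketch of that workaround in Python (using the python-oracledb driver); the table name "docs", the column "pdf_blob" and the index name are placeholders for your own schema:

    import oracledb

    conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")
    cur = conn.cursor()

    # One-time: build an Oracle Text index over the BLOB column.
    cur.execute("CREATE INDEX docs_pdf_ctx ON docs (pdf_blob) INDEXTYPE IS CTXSYS.CONTEXT")

    # After each upload: synchronize the index so the new document is processed
    # immediately rather than by the background job.
    cur.execute("BEGIN CTX_DDL.SYNC_INDEX('DOCS_PDF_CTX'); END;")

    # Documents the Text filter could not read show up in the error view, which is
    # effectively the "is this BLOB readable?" check.
    cur.execute("SELECT err_textkey, err_text FROM ctx_user_index_errors")
    for textkey, message in cur:
        print(f"Unreadable document {textkey}: {message}")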

Storing and processing timeseries with Hadoop

I would like to store a large number of time series from devices. These time series have to be validated, can be modified by an operator, and have to be exported to other systems. Holes in the time series must be found. Time series must be shown in the UI, filtered by serial number and date range.
We have thought about using hadoop, hbase, opentsdb and spark for this scenario.
What do you think about it? Can Spark connect to opentsdb easily?
Thanks
OpenTSDB is really great for storing large amounts of time series data. Internally, it is underpinned by HBase, which means that it had to find a way around HBase's limitations in order to perform well. As a result, the representation of time series is highly optimized and not easy to decode. AFAIK, there is no out-of-the-box connector that would allow you to fetch data from OpenTSDB into Spark.
The following GitHub project might provide you with some guidance:
Achak1987's connector
If you are looking for libs that would help you with time series, have a look at spark-ts - it contains useful functions for missing data imputation as well.
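In the absence of a ready-made connector, one pragmatic option is to pull the data over OpenTSDB's HTTP /api/query endpoint and parallelize the result into Spark yourself. A minimal Python sketch, with the host, metric and tag names as placeholders:

    import requests

    def fetch_series(host, metric, start="24h-ago", tags=None):
        """Query OpenTSDB's HTTP API and flatten the result into (metric, ts, value) rows."""
        body = {
            "start": start,
            "queries": [{"metric": metric, "aggregator": "avg", "tags": tags or {}}],
        }
        resp = requests.post(f"http://{host}:4242/api/query", json=body, timeout=60)
        resp.raise_for_status()
        rows = []
        for series in resp.json():
            for ts, value in series["dps"].items():  # dps = {unix_timestamp: value}
                rows.append((series["metric"], int(ts), float(value)))
        return rows

    rows = fetch_series("opentsdb.example.com", "device.temperature", tags={"serial": "SN-001"})
    # rows can then be handed to spark.createDataFrame(rows, ["metric", "ts", "value"])
    # for validation, gap detection, etc.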
Warp 10 offers the WarpScript language which can be used from Spark/Pig/Flink to manipulate time series and access data stored in Warp 10 via a Warp10InputFormat.
Warp 10 is Open Source and available at www.warp10.io
Disclaimer: I'm CTO of Cityzen Data, maker of Warp 10.
Take a look at Axibase Time Series Database which has a rather unique versioning feature to maintain a history of value changes for the same timestamp. Once enabled with per-metric granularity, the database keeps track of source, status and times of value modifications for audit trail or data reconciliation.
We have customers streaming data from Spark apps using the Network API, typically once the data is enriched with additional metadata (aka series tags) for downstream reporting.
You can query data from ATSD with REST API or SQL.
Disclaimer: I work for Axibase.

Is it possible to write multiple blobs in a single request?

We're planning to use Azure blob storage to save processing log data for later analysis. Our systems are generating roughly 2000 events per minute, and each "event" is a json document. Looking at the pricing for blob storage, the sheer number of write operations would cost us tons of money if we take each event and simply write it to a blob.
My question is: Is it possible to create multiple blobs in a single write operation, or should I instead plan to create blobs containing multiple event data items (for example, one blob for each minute's worth of data)?
It is possible, but it isn't good practice: merging multipart files takes a long time. Hence we try to separate the upload action from the entity-persist operation, by passing the entity id and updating the document/image name in another controller.
It also keeps your upload functionality clean. Best wishes.
It's impossible to create multiple blobs in a single write operation.
One feasible solution is to create blobs containing multiple event data items as you planned (which is hard to implement and query in my opinion); another solution is to store the event data into Azure Storage Table rather than Blob, and leverage EntityGroupTransaction to write table entities in one batch (which is billed as one transaction).
Please note that all table entities in one batch must have the same partition key, which should be considered when you're designing your table (see Azure Storage Table Design Guide for further information). If some of your events have large data size that exceeds the size limitation of Azure Storage Table (1MB per entity, 4MB per batch), you can save data of those events to Blob and store the blob links in Azure Storage Table.
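For reference, a minimal Python sketch of such a batch write with the azure-data-tables package; the table name and the minute-based partition key scheme are illustrative choices, not requirements:

    from datetime import datetime, timezone

    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        "<storage-connection-string>", table_name="events"
    )

    def write_batch(events):
        """Write up to 100 events sharing one partition (here: one minute) as a single transaction."""
        minute = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
        operations = []
        for i, event in enumerate(events[:100]):  # a batch is limited to 100 entities
            entity = {"PartitionKey": minute, "RowKey": f"{i:05d}", **event}
            operations.append(("create", entity))
        table.submit_transaction(operations)  # billed as one transaction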

Can I capture Performance Counters for an Azure Web/Worker Role remotely...?

I am aware of the generation of performance counters and diagnostics in web roles and worker roles in Azure.
My question is: can I get the performance counters from a remote place or remote app, given the subscription ID and the relevant certificates (a 3rd-party app that reports the performance counters)?
In other words, can I get the performance counter data the way I use the Service Management API for any hosted service?
What pre-configuration is required on the server to get CPU data?
Following is the description of the attributes for Performance counters table:
EventTickCount: Stores the tick count (in UTC) when the log entry was recorded.
DeploymentId: Id of your deployment.
Role: Role name
RoleInstance: Role instance name
CounterName: Name of the counter
CounterValue: Value of the performance counter
One of the key things here is to understand how to effectively query this table (and the other diagnostics tables). One of the things we would want from the diagnostics table is to fetch the data for a certain period of time. Our natural instinct would be to query this table on the Timestamp attribute. However, that's a bad design choice, because in an Azure table the data is indexed on PartitionKey and RowKey. Querying on any other attribute will result in a full table scan, which becomes a problem once your table contains a lot of data.
The good thing about these logs table is that PartitionKey value in a way represents the date/time when the data point was collected. Basically PartitionKey is created by using higher order bits of DateTime.Ticks (in UTC). So if you were to fetch the data for a certain date/time range, first you would need to calculate the Ticks for your range (in UTC) and then prepend a "0" in front of it and use those values in your query.
If you're querying using REST API, you would use syntax like:
PartitionKey ge '0<from date/time ticks in UTC>' and PartitionKey le '0<to date/time in UTC>'.
You could use this syntax if you're querying table storage in our tool Cloud Storage Studio, Visual Studio or Azure Storage Explorer.
Unfortunately I don't have much experience with the Storage Client library, but let me work something out. Maybe I will write a blog post about it. Once I do that, I will post the link to my blog post here.
Gaurav
Since the performance counters data gets persisted in Windows Azure Table Storage (WADPerformanceCountersTable), you can query that table from a remote app, either by using Microsoft's Storage Client library or by writing your own custom wrapper around the Azure Table Service REST API to retrieve the data. All you will need is the storage account name and key.
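To illustrate the PartitionKey-range technique described above, here is a rough Python sketch using the azure-data-tables package; the connection string and the example time range are placeholders, and .NET ticks are 100-nanosecond intervals counted from 0001-01-01 (UTC):

    from datetime import datetime

    from azure.data.tables import TableClient

    def to_ticks(dt_utc):
        """Convert a naive UTC datetime into .NET ticks."""
        return int((dt_utc - datetime(1, 1, 1)).total_seconds() * 10_000_000)

    table = TableClient.from_connection_string(
        "<storage-connection-string>", table_name="WADPerformanceCountersTable"
    )

    start = to_ticks(datetime(2023, 1, 1, 0, 0))  # example "from" date/time (UTC)
    end = to_ticks(datetime(2023, 1, 1, 1, 0))    # example "to" date/time (UTC)
    flt = f"PartitionKey ge '0{start}' and PartitionKey le '0{end}'"

    for entity in table.query_entities(flt):
        print(entity["CounterName"], entity["CounterValue"])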
