Dynamics 365 free up files storage space - dynamics-crm

In the Power Platform admin center, Capacity tab, I see that the part "Files" takes a lot of space, and I would like to reduce it. I have checked:
email attachments
notes
but these are only responsible for a small part of the utilized storage. What else can be responsible for the "Files" part of the storage utilization?

File and database: The following tables store data in file and database storage:
Attachment
AnnotationBase
Any custom or out-of-the-box table that has columns of datatype file or image (full size)
Any table that is used by one or more installed Insights applications and ends in - Analytics
WebResourceBase
RibbonClientMetadataBase
https://learn.microsoft.com/en-us/power-platform/admin/capacity-storage#capacity-page-details

Related

How can I monitor a CSV file in Azure blob store for newly added records, similar to unix "tail -f"?

Context:
I am an information architect (not a data engineer, was once a Unix and Oracle developer), so my technical knowledge in Azure is limited to browsing Microsoft documentation.
The context of this problem is ingesting data from a constantly growing CSV file in Azure ADLS into an Azure SQL MI database.
I am designing an Azure data platform that includes a SQL data warehouse with the first source system being a Dynamics 365 application.
The data warehouse is following Data Vault 2.0 patterns. This is well suited to the transaction log nature of the CSV files.
This platform is in early development - not in production.
The CSV files are created and updated (append mode) by an Azure Synapse Link that exports Dataverse write operations on selected Dataverse entities to our ADLS storage account. This service is configured in append mode, so every Dataverse write operation (create, update and delete) produces an append to the corresponding entity's CSV file. Each CSV file is essentially a transaction log of the corresponding Dataverse entity.
Synapse Link operates in an event-based fashion: creating a record in Dataverse triggers a CSV append action. Latency is typically a few seconds. There are no SLAs (promises), and latency can be several minutes if the API caps are breached.
The CSV is partitioned annually. This means that a new CSV file is created at the start of each year and continues to grow throughout the year.
We are currently trialling ADF as the means of extracting records from the CSV for loading into the data warehouse. We are not wedded to ADF and can consider changing horses.
Request:
I'm searching for an event-based ingestion solution that monitors a source CSV file for new records (appended to the end of the file), extracts only those new records, and then processes each record in sequence, resulting in one or more SQL insert operations per new CSV record. If I were back in my old Unix days, I would build a process around the "tail -f" command as the start of the pipeline, with the next step being an ETL process that handles each record served by the tail command. But I can't figure out how to do this in Azure.
This process will be the pattern for many more similar ingestion processes: there could be approximately one thousand CSV files that need to be processed in this event-based, near-real-time fashion. I assume one process per CSV file.
Some nonfunctional requirements are speed and efficiency.
My goal is an event-based solution (low latency = speed)
that doesn't need to read the entire file every 5 minutes to see if there are changes. That would be a (micro) batch process that is horribly inefficient (read: expensive - 15,000x redundant processing). This is where the desire for a process like Unix "tail -f" comes to mind. It watches the file for changes, emitting new data as it is appended to the source file. I'd hate to do something like a 'diff' every 5 minutes, as this is inefficient and, when scaled to thousands of tables, will be prohibitively expensive.
One possible solution to your problem is to store each new CSV record as a separate blob.
You will then be able to use Azure Event Grid to raise events when a new blob is created in Blob Storage, i.e. use Azure Blob Storage as the Event Grid source.
The basic idea is to store the changed CSV data as a new blob and have Event Grid wired to the Blob Created event. An Azure Function can listen to these events and then process only the new data. For auditing purposes, you can save this data in a separate Append Blob once the CSV processing has been completed.
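A minimal sketch of the handler side of this idea, in Python. It assumes the event arrives as a dictionary in the Event Grid schema, with an eventType of Microsoft.Storage.BlobCreated and the blob address under data.url. The function name rows_from_event and the fetch_text downloader are hypothetical placeholders, not part of any Azure SDK; in a real Azure Function you would replace fetch_text with a call to your storage client.

```python
import csv
import io
from typing import Callable, Dict, List, Optional

BLOB_CREATED = "Microsoft.Storage.BlobCreated"


def rows_from_event(
    event: Dict, fetch_text: Callable[[str], str]
) -> Optional[List[List[str]]]:
    """Return the CSV rows of a newly created blob, or None for other events.

    `fetch_text` is a hypothetical callable that downloads a blob by URL
    and returns its content as text.
    """
    if event.get("eventType") != BLOB_CREATED:
        return None  # ignore any event that is not a blob creation
    url = event["data"]["url"]  # address of the blob that was just created
    text = fetch_text(url)
    # Each small blob holds only new records, so it is parsed in full.
    return list(csv.reader(io.StringIO(text)))


# Usage with a stand-in downloader (no Azure access needed):
sample_event = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "data": {"url": "https://example.invalid/container/account/0001.csv"},
}
fake_store = {
    sample_event["data"]["url"]: "1,Contoso,create\n2,Fabrikam,update\n"
}
rows = rows_from_event(sample_event, fake_store.__getitem__)
```

Because each blob contains only the appended records, the handler never re-reads data it has already processed, which is the property the "tail -f" approach was after.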

Partition Parquet files on Azure Blob (pyarrow)

I have been manually partitioning files with pandas (creating an index or multi-index and then writing a separate parquet file for each index in a loop) to Azure Blob.
However, when reading the docs for pyarrow, I see that it is possible to create a 'dataset' which includes a folder structure for partitioned data. https://arrow.apache.org/docs/python/parquet.html
The example for the Monthly / daily folder is exactly what I am trying to achieve.
dataset_name/
  year=2007/
    month=01/
       0.parq
       1.parq
       ...
    month=02/
       0.parq
       1.parq
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...
import pyarrow as pa
import pyarrow.parquet as pq
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'], filesystem=fs)
Can I do this with Azure Blob (or Minio which uses S3 and wraps over my Azure Blob storage)? My ultimate goal is to only read files which make sense for my 'query'.
In my experience, and given your current environment (Linux on an Azure VM), I think there are two ways to read partitioned Parquet files from Azure Storage.
Follow the section Reading a Parquet File from Azure Blob storage of the pyarrow document Reading and Writing the Apache Parquet Format: list the blob names with a prefix such as dataset_name using the API list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) of the Azure Storage SDK for Python, read these blobs one by one into dataframes as in the sample code, and finally concat the dataframes into a single one.
Use Azure/azure-storage-fuse to mount a container of Azure Blob Storage to your Linux filesystem; then you only need to follow the document section Reading from Partitioned Datasets to read the partitioned dataset as if it were local.
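For the first option, the "only read files which make sense for my query" part can be done on the list of blob names before anything is downloaded. The following is a rough sketch under the assumption that the blobs follow the key=value folder layout shown in the question; parse_partitions and prune are hypothetical helper names, and the list of names stands in for whatever the blob listing call returns.

```python
from typing import Dict, Iterable, List


def parse_partitions(blob_name: str) -> Dict[str, str]:
    """Extract key=value folder segments from a blob name."""
    parts = {}
    for segment in blob_name.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(blob_names: Iterable[str], **wanted: str) -> List[str]:
    """Keep only blobs whose partition values match every wanted key."""
    return [
        name
        for name in blob_names
        if all(parse_partitions(name).get(k) == v for k, v in wanted.items())
    ]


# Stand-in for the result of listing blobs under the dataset prefix:
names = [
    "dataset_name/year=2007/month=01/0.parq",
    "dataset_name/year=2007/month=02/0.parq",
    "dataset_name/year=2008/month=01/0.parq",
]
selected = prune(names, year="2007", month="01")
```

Only the blobs in selected would then be downloaded and read into dataframes. Partition values are compared as strings here, so month="01" and month="1" are treated as different.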

Offloading unstructured data saved in RDBMS to Hadoop

My organization is thinking about offloading unstructured data such as text, images, etc., currently saved in tables in an Oracle Database, into Hadoop. The size of the DB is around 10 TB and growing. The size of the CLOB/BLOB columns is around 3 TB. Right now these columns are queried for certain kinds of reports through a web application. They are also written to, but not very frequently.
What kind of approach can we take to offload the data properly while ensuring that the offloaded data remains available for reads through the existing web application?
You can get part of the answer in the Oracle blog (link).
If the data needs to be pulled into an HDFS environment via Sqoop, then you must first read the following from the Sqoop documentation.
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
Reading via a web application is possible if you use an MPP query engine like Impala; it works pretty well and is a production-ready technology. We heavily use complex Impala queries to render content for a SpringBoot application. Since Impala runs everything in memory, there is a chance of slowness or failure on a multi-tenant Cloudera cluster. For smaller user groups (1000-2000 user base) it works perfectly fine.
Do let me know if you need more input.
My recommendations:
Use the Cloudera distribution (read here)
Give enough memory to the Impala daemons
Make sure YARN is configured correctly for scheduling (fair share or priority share) of ETL load vs. web application load
If required, keep the Impala daemons away from YARN
Define a memory quota for Impala so it allows concurrent queries
Flatten your queries so Impala runs faster without joins and shuffles.
If you are reading just a few columns, store in Parquet, it works very fast.

Custom patterns for stream analytics blob storage

My question is about saving data from Stream Analytics to blob storage. In our system we collect clickstream data from many websites via Event Hubs. Then we do some small grouping and aggregating. After that we send the results to our blob storage.
The problem is that we want to separate our results into many blob containers by the ID of each website. Right now we can only do it by a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}
Is there any way of achieving this?
This is a duplicate question:
Stream Analytics: Dynamic output path based on message payload
Azure Stream Analytics -> how much control over path prefix do I really have?
The short version of the above is you can't do this in Stream Analytics. If you have too many target paths for multiple sinks to be feasible, your best bet is to stream to a single blob store sink and process the results with something other than ASA. Azure Functions, WebJobs or ADF tasks are a few possible solutions.
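As a rough illustration of that post-processing step (for example inside an Azure Function), the sketch below regroups records read from the single sink by the desired /{websiteID}/{date}/{time} path. The field names websiteID and windowEnd are assumptions about the payload, and the YYYY/MM/DD and HH layout for {date} and {time} is an assumed choice; adjust both to match your job.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def target_path(record: Dict) -> str:
    """Build '{websiteID}/{date}/{time}' for one aggregated record."""
    # 'windowEnd' is a hypothetical ISO 8601 timestamp field.
    ts = datetime.fromisoformat(record["windowEnd"])
    return f"{record['websiteID']}/{ts:%Y/%m/%d}/{ts:%H}"


def group_by_path(records: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Bucket records by the blob path they should be written to."""
    grouped = defaultdict(list)
    for record in records:
        grouped[target_path(record)].append(record)
    return dict(grouped)


# Stand-in for records read from the single Stream Analytics output:
records = [
    {"websiteID": "site-a", "windowEnd": "2024-03-05T14:30:00", "clicks": 12},
    {"websiteID": "site-b", "windowEnd": "2024-03-05T14:30:00", "clicks": 7},
    {"websiteID": "site-a", "windowEnd": "2024-03-05T14:45:00", "clicks": 3},
]
buckets = group_by_path(records)
```

Each key of buckets is a target path and each value is the list of records to write there, so the number of websites no longer has to match a fixed number of Stream Analytics outputs.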
The problem is that we want to separate our results into many blob containers by the ID of each website. Right now we can only do it by a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}
The official document stream-analytics-define-outputs says the following about the Path Prefix Pattern of the Blob storage output:
The file path used to write your blobs within the specified container.
Within the path, you may choose to use one or more instances of the following 2 variables to specify the frequency that blobs are written:
{date}, {time}
Example 1: cluster1/logs/{date}/{time}
Example 2: cluster1/logs/{date}
Based on my understanding, you could create multiple blob output targets in a single Stream Analytics job, one for each of your websites, and in the SQL-like query filter the event data and send it to the matching output. For more details, you could refer to Common query patterns.

Difference between Database and File Storage in Parse.com

Based on the FAQ at Parse.com:
What is the difference between database storage and file storage?
Database storage refers to data stored as Parse Objects, which are
limited to 128 KB in size. File storage refers to static assets that
are stored using the Parse File APIs, typically images, documents, and
other types of binary data.
Just want some clarification here:
So the Strings, Arrays, etc. that are created are considered Parse Objects and fall under database storage, and the URL of a file also counts as database storage since it is part of a Parse Object. But the actual files themselves fall under file storage?
Thanks.
Yes. Any file that you upload to Parse goes to the File storage, the rest is stored in the database including the URL of such files.
