Azure Blob - getting list of blobs without knowing container name - azure-blob-storage

I have a requirement where I need to fetch blob files within a subdirectory of a blob container.
The problem is that I will not know the container name, as containers are created dynamically.
I do not want to fetch the list of containers, iterate through them, and then get the list of blobs; I'm trying to avoid loops since there could be many containers.
Is there any simple way of getting all blobs from all containers without knowing the container names?
Regards,
Madhu

The simple answer to your question is that it's not possible to get a list of blobs without knowing the container name, because blobs belong to a container. You need to know the name of a blob container to list the blobs in that container, so some iteration over the containers is unavoidable.
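For what it's worth, here is a minimal sketch of that container-by-container iteration, assuming the azure-storage-blob v12 SDK for Python; the connection string and the subdirectory prefix are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholder connection string; use your storage account's own.
service = BlobServiceClient.from_connection_string("<connection-string>")

# Walk every container, then every blob under the subdirectory prefix.
for container in service.list_containers():
    container_client = service.get_container_client(container.name)
    for blob in container_client.list_blobs(name_starts_with="subdirectory/"):
        print(container.name, blob.name)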

Related

Azure Data Factory copy activity creates empty files

Whenever I use ADF copy activity with Blob as source/sink, ADF creates an empty file named after the directory of the sink Blob.
For instance, if I want to copy from input/file.csv to process/file.csv, the copy happens, but I also get a blob called "process" with a size of 0 bytes created each time.
Any idea why?
(Screenshots of the copy activity's Source and Sink settings omitted.)
Firstly, I would suggest you optimize your pipeline's copy activity settings.
Since you are copying one file from one container/folder to another, you can set the source file directly with a parameter. The wildcard path expression *.csv is normally used to copy all files of the same type from a folder.
You can test again and check whether the empty file still appears.
HTH.
This happens if you have an ADLS Gen2 storage account without the hierarchical namespace enabled, but you select ADLS Gen2 when defining your Linked Service and Dataset. A quick fix is to use Azure Blob Storage when defining the LS and DS.

Partition Parquet files on Azure Blob (pyarrow)

I have been manually partitioning files with pandas (creating an index or multi-index and then writing a separate parquet file for each index in a loop) to Azure Blob.
However, when reading the docs for pyarrow, I see that it is possible to create a 'dataset' which includes a folder structure for partitioned data. https://arrow.apache.org/docs/python/parquet.html
The monthly/daily folder example is exactly what I am trying to achieve.
dataset_name/
    year=2007/
        month=01/
            0.parq
            1.parq
            ...
        month=02/
            0.parq
            1.parq
            ...
        month=03/
            ...
    year=2008/
        month=01/
            ...
import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'], filesystem=fs)
Can I do this with Azure Blob (or Minio, which uses S3 and wraps over my Azure Blob storage)? My ultimate goal is to read only the files that are relevant to my 'query'.
Just from my experience, and based on your current environment (Linux on an Azure VM), I think there are two solutions for reading partitioned parquet files from Azure Storage.
Follow the section Reading a Parquet File from Azure Blob storage in the pyarrow document Reading and Writing the Apache Parquet Format: manually list the blob names with a prefix like dataset_name using the list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) API of the Azure Storage SDK for Python, read those blobs one by one into dataframes as in the sample code, and finally concat the dataframes into a single one.
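A rough sketch of that first option, assuming the legacy azure-storage 2.x SDK (BlockBlobService); the account credentials and container name are placeholders:

import io
import pandas as pd
import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService  # legacy 2.x blob SDK

service = BlockBlobService(account_name="<account>", account_key="<key>")
container = "<container>"

# List every blob under the dataset prefix, read each parquet file into a
# dataframe, then concatenate them into a single dataframe.
frames = []
for name in service.list_blob_names(container, prefix="dataset_name/"):
    blob = service.get_blob_to_bytes(container, name)
    frames.append(pq.read_table(io.BytesIO(blob.content)).to_pandas())

df = pd.concat(frames, ignore_index=True)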
Try Azure/azure-storage-fuse to mount a container of Azure Blob Storage into your Linux filesystem; then you just need to follow the document section Reading from Partitioned Datasets to read the partitioned dataset from Azure Blob Storage as if it were local.
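A minimal sketch of that second option; the /mnt/blob mount point and the partition values are assumptions:

import pyarrow.parquet as pq

# The container is assumed to be mounted at /mnt/blob via blobfuse, so the
# partitioned dataset can be read as if it were a local directory tree.
dataset = pq.ParquetDataset(
    "/mnt/blob/dataset_name",
    filters=[("year", "=", 2007), ("month", "=", 1)],  # prune partitions for the "query"
)
df = dataset.read().to_pandas()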

Custom patterns for stream analytics blob storage

My question is about saving data from Stream Analytics to blob storage. In our system we are collecting clickstream data from many websites via Event Hubs. Then we do some small grouping and aggregating. After that we send the results to our blob storage.
The problem is that we want to separate our results into many blob containers by ID for each website. Right now we can only do it with a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}.
Is there any way of achieving this?
This is a duplicate question:
Stream Analytics: Dynamic output path based on message payload
Azure Stream Analytics -> how much control over path prefix do I really have?
The short version of the above is you can't do this in Stream Analytics. If you have too many target paths for multiple sinks to be feasible, your best bet is to stream to a single blob store sink and process the results with something other than ASA. Azure Functions, WebJobs or ADF tasks are a few possible solutions.
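As a rough illustration of that post-processing route (not from the original answer), the following Python sketch re-partitions a single ASA blob output by websiteID. It assumes the job writes line-delimited JSON to a container named asa-output, that each event carries a websiteID field, and that blobs land under a logs/{date}/{time} prefix; all of those names are hypothetical.

import json
from collections import defaultdict
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("asa-output")   # hypothetical single ASA sink
target = service.get_container_client("by-website")   # hypothetical re-partitioned output

for blob in source.list_blobs(name_starts_with="logs/"):
    data = source.download_blob(blob.name).readall().decode("utf-8")
    grouped = defaultdict(list)
    for line in data.splitlines():
        event = json.loads(line)                       # ASA blob output as line-delimited JSON
        grouped[event["websiteID"]].append(line)
    suffix = blob.name[len("logs/"):]                  # keep the original {date}/{time}/file part
    for website_id, lines in grouped.items():
        target.upload_blob(f"{website_id}/{suffix}", "\n".join(lines), overwrite=True)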
The problem is that we want to separate our results into many blob containers by ID for each website. Right now we can only do it with a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}.
As the official document stream-analytics-define-outputs mentions about the Path Prefix Pattern of the Blob storage output:
The file path used to write your blobs within the specified container.
Within the path, you may choose to use one or more instances of the following 2 variables to specify the frequency that blobs are written:
{date}, {time}
Example 1: cluster1/logs/{date}/{time}
Example 2: cluster1/logs/{date}
Based on my understanding, you could create multiple blob output targets from a single Stream Analytics job, one per website, and in your SQL-like query language you could filter the event data and send it to the specific output. For more details, you could refer to Common query patterns.

Build pipeline from Oracle DB to AWS DynamoDB

I have an Oracle instance running on a standalone EC2 VM, and I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then daily I want to append any changes to that source table into the DynamoDB table as another row that will share an id so I can visualize how that row is changing over time.
Ideally I'd like a solution as easy as piping the results of a SQL query into a program that dumps that data into a cloud file system (S3, HDFS?); then I will want to convert that data into a format that can be loaded into DynamoDB.
So I need these things:
1) A transport tool; I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path, I need a cloud file system; S3 looks like the obvious choice since I want a turnkey solution here.
3) This last part is a nice-to-have, because I can always use a temp directory to hold my raw text and convert it in another step.
I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert to happen whenever you copy a new file to S3 is pretty simple, just have S3 trigger a Lambda function to process the data and insert into DynamoDB. I'm not clear on how you are going to get the data into S3 though. If you are just running a cron job to periodically query Oracle and dump some data to a file, which you then copy to S3, then that should work.
You need to know that you can't append to a file on S3; you would have to write the entire file each time you push new data. If you want to stream the data somehow, then using Kinesis instead of S3 might be a better option.
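To make the S3-to-DynamoDB leg concrete, here is a minimal sketch of such a Lambda handler using boto3; the table name, the CSV layout, and the idea of storing the S3 key as a per-snapshot attribute are assumptions, not something the answer prescribes.

import csv
import io
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("oracle_snapshots")  # hypothetical table name

def handler(event, context):
    # Triggered by S3 ObjectCreated events; loads each CSV row into DynamoDB.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        with table.batch_writer() as batch:
            for row in csv.DictReader(io.StringIO(body)):
                # Assumes the export has an "id" column; the object key is added
                # as a "snapshot" attribute so each daily load becomes a new item.
                batch.put_item(Item={**row, "snapshot": key})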

Reading BLOB data from Oracle database using python

This is not a question about code. I need to extract some BLOB data from an Oracle database using a Python script. My question is: what are the steps in dealing with BLOB data, and how do I read it as images, videos and text? Since I have no access to the database itself, is it possible to know the type of the stored BLOBs, i.e. whether they are pictures, videos or text? Do I need encoding or decoding in order to turn these BLOBs into .jpg, .avi or .txt files? These are very basic questions, but I am new to programming and need some help finding a starting point :)
If you have a pure BLOB in the database, as opposed to, say, an ORDImage that happens to be stored in a BLOB under the covers, the BLOB itself has no idea what sort of binary data it contains. Normally, when the table was designed, a column would be added to store the data type and/or the file name.
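As a starting point, here is a minimal sketch using the cx_Oracle driver; the table and column names are hypothetical, and it assumes a file-name column exists alongside the BLOB as described above. The BLOB bytes are written out unchanged, so no extra encoding or decoding is needed.

import cx_Oracle  # or the newer python-oracledb driver

connection = cx_Oracle.connect("user/password@host:1521/service")  # placeholder DSN
cursor = connection.cursor()

# Hypothetical table: file_name tells you what the binary data actually is.
cursor.execute("SELECT file_name, file_data FROM documents")
for file_name, lob in cursor:
    with open(file_name, "wb") as f:
        f.write(lob.read())  # cx_Oracle returns a LOB object; .read() gives the raw bytes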
