How can I monitor a CSV file in Azure blob store for newly added records, similar to unix "tail -f"? - azure-blob-storage

Context:
I am an information architect (not a data engineer, was once a Unix and Oracle developer), so my technical knowledge in Azure is limited to browsing Microsoft documentation.
The context of this problem is ingesting data from a constantly growing CSV file in Azure ADLS into an Azure SQL MI database.
I am designing an Azure data platform that includes a SQL data warehouse with the first source system being a Dynamics 365 application.
The data warehouse is following Data Vault 2.0 patterns. This is well suited to the transaction log nature of the CSV files.
This platform is in early development - not in production.
The CSV files are created and updated (append mode) by an Azure Synapse Link that exports Dataverse write operations on selected Dataverse entities to our ADLS storage account. This service is configured in append mode, so all Dataverse write operations (create, update and delete) produce an append action to the entity's corresponding CSV file. Each CSV file is essentially a transaction log of the corresponding Dataverse entity.
Synapse Link operates in an event-based fashion - creating a record in Dataverse triggers a CSV append action. Latency is typically a few seconds. There aren't any SLAs (promises), and latency can be several minutes if the API caps are breached.
The CSV is partitioned annually. This means a new CSV file is created at the start of each year and continues to grow throughout the year.
We are currently trialling ADF as the means of extracting records from the CSV for loading into the data warehouse. We are not wedded to ADF and can consider changing horses.
Request:
I'm searching for an event-based ingestion solution that monitors a source CSV file for new records (appended to the end of the file), extracts only those new records from the CSV file, and then processes each record in sequence, resulting in one or more SQL insert operations for each new CSV record. If I was back in my old Unix days, I would build a process around the "tail -f" command as the start of the pipeline, with the next step an ETL process that handled each record served by the tail command. But I can't figure out how to do this in Azure.
This process will be the pattern for many more similar ingestion processes - there could be approximately one thousand CSV files that need to be processed in this event-based, near-real-time fashion. I assume one process per CSV file.
Some nonfunctional requirements are speed and efficiency.
My goal is an event-based solution (low latency = speed) that doesn't need to read the entire file every 5 minutes to see if there are changes. That kind of (micro) batch process would be horribly inefficient (read: expensive - 15,000x redundant processing). This is where the desire for a process like Unix "tail -f" comes to mind: it watches the file for changes, emitting new data as it is appended to the source file. I'd hate to do something like a 'diff' every 5 minutes, as this is inefficient and, when scaled to thousands of tables, will be prohibitively expensive.
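To make the goal concrete, the closest thing I can picture to "tail -f" against a blob is remembering the last byte offset processed and fetching only the appended range - a rough sketch (my assumption, using the azure-storage-blob Python SDK; the names and the checkpoint helpers are placeholders, and this is still driven by a timer rather than an event):

    # Sketch only: read just the bytes appended since the last run, "tail -f" style.
    # Account URL, container/blob names and the checkpoint helpers are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobClient

    blob = BlobClient(
        account_url="https://<storageaccount>.blob.core.windows.net",
        container_name="dataverse-export",      # placeholder container
        blob_name="2023/account.csv",           # placeholder annual partition file
        credential=DefaultAzureCredential(),
    )

    last_offset = load_checkpoint()             # hypothetical: last byte already processed

    size = blob.get_blob_properties().size      # current length of the growing CSV blob
    if size > last_offset:
        # Ranged download: only the newly appended bytes, never the whole file.
        new_bytes = blob.download_blob(offset=last_offset, length=size - last_offset).readall()
        for record in new_bytes.decode("utf-8").splitlines():
            process_record(record)              # hypothetical: one or more SQL inserts per record
        save_checkpoint(size)                   # hypothetical: persist the new offset
        # (a robust version would also handle a partial trailing line)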

One possible solution to your problem is to store each new CSV record as a separate blob.
You will then be able to use Azure Event Grid to raise events when a new blob is created in Blob Storage, i.e. use Azure Blob Storage as an Event Grid source.
The basic idea is to store the changed CSV data as a new blob and have Event Grid wired to the Blob Created event. An Azure Function can listen to these events and then process only the new data. For auditing purposes, you can save this data in a separate Append Blob once the CSV processing has been completed.
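For illustration, a minimal Azure Function wired to the Blob Created event might look something like the sketch below (Python, v1 programming model; it assumes an Event Grid trigger binding in function.json, and the SQL insert is only hinted at - connection details, names and helpers are placeholders):

    import logging

    import azure.functions as func
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobClient

    def main(event: func.EventGridEvent):
        # Fires once per Microsoft.Storage.BlobCreated event delivered by Event Grid
        # (assumes an eventGridTrigger binding in function.json).
        blob_url = event.get_json()["url"]
        logging.info("New blob created: %s", blob_url)

        # This blob holds only the newly appended CSV record(s), so there is no
        # need to rescan the large annual file.
        blob = BlobClient.from_blob_url(blob_url, credential=DefaultAzureCredential())
        records = blob.download_blob().readall().decode("utf-8").splitlines()

        for record in records:
            # Placeholder: turn each CSV record into one or more SQL inserts,
            # e.g. via pyodbc against the Azure SQL MI database.
            insert_into_warehouse(record)       # hypothetical helper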

Related

Strategies to upload data to Redis cache

I have data for 4 million users that is looked up by phone number. The source data is in S3. The required search response time is 50 ms, and the system needs to be available 99.995%.
I am thinking about a nightly job like below
S3 Source Data -> Glue ETL -> CSV file or RDS -> Cache Upload job -> AWS Redis global datastore
I am leaning towards RDS since, in case the upload job fails for some reason, I don't want to restart from the beginning. Every row uploaded to Redis will be marked as processed.
I was planning to do an initial load of data from S3 to Redis and then do incremental loads thereafter. The data team has informed me that they can't produce incremental data; in other words, they can't tell me what data changed yesterday.
As a result, every day I will have to empty the Redis cache and reload the entire dataset. Since this upload will take considerable time, I am wondering how to keep the response time at 50 ms while the data is loading.
Will appreciate any ideas. Thanks!
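To make the resume-from-failure idea concrete, the upload job I have in mind would look roughly like this (a sketch with redis-py; the endpoint, key format and the "processed" flag handling are assumptions):

    import redis

    r = redis.Redis(host="my-redis-endpoint", port=6379)   # placeholder endpoint

    def upload_batch(rows):
        # rows: (phone_number, user_json, already_processed) tuples read from RDS
        # (the exact shape and the processed flag are placeholders).
        pipe = r.pipeline(transaction=False)
        for phone_number, user_json, processed in rows:
            if processed:
                continue                                    # resume-safe: skip rows already uploaded
            pipe.set(f"user:{phone_number}", user_json)     # lookup key is the phone number
        pipe.execute()
        mark_rows_processed(rows)                           # hypothetical: flip the flag in RDS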

Cassandra for datawarehouse

Is Cassandra a good alternative to Hadoop as a data warehouse, where data is append-only and updates in the source databases should not overwrite the existing rows in the data warehouse but get appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch / stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends more on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide some complex aggregation report to the user.
So at first you need to save data as fast as possible (as new portions appear very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do some processing via HQL scripts (assume you're not very good at coding but great at complicated SQL). Then you move the report results from HDFS to Cassandra, into a dedicated reports table partitioned by user id.
So when the user wants an aggregation report about his activity in the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (as it is a simple key-value lookup).
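As a rough sketch of that reports table and the key lookup (Python with the DataStax cassandra-driver; the keyspace, table and column names are made up for illustration):

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["cassandra-host"]).connect("analytics")   # placeholder host/keyspace

    # Reports table partitioned by user id, so fetching one user's report is a key lookup.
    session.execute("""
        CREATE TABLE IF NOT EXISTS monthly_activity_report (
            user_id uuid,
            month   text,
            metric  text,
            value   bigint,
            PRIMARY KEY ((user_id), month, metric)
        )
    """)

    user_id = uuid.uuid4()                                       # placeholder: id of the active user
    rows = session.execute(
        "SELECT month, metric, value FROM monthly_activity_report WHERE user_id = %s",
        (user_id,),
    )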
So for your question: yes, it could be an alternative, but the selection strategy depends on the data types and your application's business cases.
You can read more information about the usage of Cassandra here.

Oracle small blob vs large blob

I would like to know the better way of handling large files, such as 3-4 gigabytes, as Oracle SecureFile BLOBs.
The scenario here is: I am planning to upload large files to an Oracle DB over a WCF service. I am splitting the file into smaller chunks of 200 MB and uploading them one by one. On the Oracle side, I just append to a single BLOB until the whole file has been uploaded. This happens sequentially. However, I am thinking of uploading the chunks in parallel to speed up the operation. But this will not be possible to handle on the Oracle end, as I can't update a single BLOB from multiple uploads; the bytes would then not be written in the order they are received from the service. Is it better, then, to insert each chunk as a separate BLOB and merge them later into a single BLOB record on the Oracle side?
Thanks
Jay
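To make the current sequential approach concrete, the append step looks roughly like this (a sketch using the Python cx_Oracle driver instead of the WCF plumbing; table and column names are placeholders):

    import cx_Oracle

    conn = cx_Oracle.connect("user", "password", "db-host/service")  # placeholder credentials/DSN
    cur = conn.cursor()

    def append_chunk(file_id, chunk_bytes):
        # Lock the target row and fetch its BLOB locator
        # (assumes the row was created with an EMPTY_BLOB() column).
        cur.execute("SELECT data FROM file_store WHERE id = :id FOR UPDATE", id=file_id)
        (lob,) = cur.fetchone()
        # LOB offsets are 1-based, so writing at size() + 1 appends to the end.
        lob.write(chunk_bytes, lob.size() + 1)
        conn.commit()

    # Chunks have to arrive in order for this to work, which is exactly why
    # parallel uploads into one BLOB are awkward.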

Import small stream in Impala

We are currently on a Big Data project.
The Big Data platform is Cloudera Hadoop.
As input to our system we have a small flow of data that we collect via Kafka (approximately 80 MB/h, continuously).
Then the messages are stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data from the cold data. After 5 minutes, the data must be accessible in the historical (cold) data. We chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g. INSERT INTO ...).
We are aware that this is not the recommended solution; each Impala insertion creates a small file (<10 KB) in HDFS.
We are looking for a way to insert a small stream into an Impala database that avoids producing many small files.
What solution would you recommend?
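For reference, our current pattern is essentially one INSERT per message, along these lines (a sketch using the Python impyla client rather than JDBC; host, table and column names are placeholders):

    from impala.dbapi import connect

    cur = connect(host="impala-daemon", port=21050).cursor()   # placeholder Impala daemon

    def insert_message(msg):
        # One INSERT per Kafka message: simple, but every statement materialises
        # its own tiny (<10 KB) file in the table's HDFS directory.
        cur.execute(
            "INSERT INTO events (event_time, payload) VALUES (%(ts)s, %(payload)s)",
            {"ts": msg["ts"], "payload": msg["payload"]},
        )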

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a Node server that essentially accepts a payload of log entries, saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on reading the documentation, I think this format is good for something like Hadoop. The way I think this works is: I store these logs on a server, then run a cron job that periodically moves the files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dumps that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, part files will be produced, so you don't have to concatenate them before loading them into Hive.
(On day 0, the logs are in some delimited form and loaded into Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them. But the joins are going to be slow. So, you may want to define fillers/placeholders in your schema.
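For example, the space-separated player-event logs could be exposed as an external Hive table over S3, with a couple of filler columns reserved for future metrics (a sketch driven from Python via PyHive; the host, bucket path and names are placeholders):

    from pyhive import hive

    cur = hive.Connection(host="emr-master-node", port=10000).cursor()   # placeholder host

    # External table over the space-separated log files copied to S3;
    # filler1/filler2 are the schema "placeholders" for metrics added later.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS player_events (
            ts        BIGINT,
            user_name STRING,
            mediatype STRING,
            event     STRING,
            filler1   STRING,
            filler2   STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
        LOCATION 's3://my-log-bucket/player-events/'
    """)

    cur.execute("SELECT event, COUNT(*) AS cnt FROM player_events GROUP BY event")
    print(cur.fetchall())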
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.
