I have a Spark lake database in Synapse that contains 6 tables. The data in all 6 tables was loaded from 6 different CSV files. These CSV files are loaded and updated manually by a third party whenever new data arrives, and they will continue to be loaded manually in the future. The file names will always stay the same.
Currently, in my Synapse notebook, I use the data from those 6 tables to transform a new file that arrives for processing, and I have already transformed one file using PySpark. At the moment I give the file name manually in my code, which points to the Synapse ADLS account where our source files land, but in the future this process will be automated, so the code should work for every new source file that arrives for processing.
My question is about the 6 tables in my Spark lake database: when we build an ETL process for this in Synapse and run my code in a Notebook activity, will the code still be able to read data from those 6 tables? And if new data is loaded into those 6 tables, will I see the changes in my tables, and also in my transformed file?
This is the code I am currently using to load data from one of the tables in my lake database into my notebook:
%%pyspark
df_IndustryData = spark.sql("SELECT * FROM DATA.Industry_data")
display(df_IndustryData)
Thanks in advance for your responses
I'm not sure I understand your question, but I think you may find this applicable.
Lake Database (Spark) Tables are persisted in ADLS as a folder of Parquet files, and then exposed as External Tables. You can then query these tables from either your notebook or Serverless SQL.
A Lake Database Table is therefore a logical schema overlaid on top of the physical files in storage. So whenever you update/overwrite the underlying data [the physical Parquet files], the Lake Database Table (External Table) will show the current data in those files.
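So, to your notebook question: every run of the notebook reads whatever is currently in the underlying files. As a minimal sketch (the storage path, file name and join key below are placeholders, since I don't know your schema or pipeline):
%%pyspark
# Re-read the Lake Database table on every run; it always reflects the current
# contents of the underlying Parquet files, including anything loaded since the last run.
df_IndustryData = spark.sql("SELECT * FROM DATA.Industry_data")

# Hypothetical incoming source file; in an automated pipeline the name would be
# passed in (e.g. as a notebook parameter) rather than hard-coded.
source_path = "abfss://source@<storageaccount>.dfs.core.windows.net/incoming/new_file.csv"
df_source = spark.read.option("header", True).csv(source_path)

# Any transformation that joins against the table therefore sees the latest table data.
df_transformed = df_source.join(df_IndustryData, on="IndustryCode", how="left")  # join key is an assumption
display(df_transformed)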
Context:
I am an information architect (not a data engineer, was once a Unix and Oracle developer), so my technical knowledge in Azure is limited to browsing Microsoft documentation.
The context of this problem is ingesting data from a constantly growing CSV file in Azure ADLS into an Azure SQL MI database.
I am designing an Azure data platform that includes a SQL data warehouse with the first source system being a Dynamics 365 application.
The data warehouse is following Data Vault 2.0 patterns. This is well suited to the transaction log nature of the CSV files.
This platform is in early development - not in production.
The CSV files are created and updated (append mode) by an Azure Synapse Link that exports Dataverse write operations on selected Dataverse entities to our ADLS storage account. This service is configured in append mode, so every Dataverse write operation (create, update and delete) produces an append to the entity's corresponding CSV file. Each CSV file is essentially a transaction log of the corresponding Dataverse entity.
Synapse Link operates in an event-based fashion: creating a record in Dataverse triggers a CSV append action. Latency is typically a few seconds. There aren't any SLAs (promises), and latency can be several minutes if the API caps are breached.
The CSV is partitioned annually. This means a new CSV file is created at the start of each year and continues to grow throughout the year.
We are currently trialling ADF as the means of extracting records from the CSV for loading into the data warehouse. We are not wedded to ADF and can consider changing horses.
Request:
I'm searching for an event-based ingestion solution that monitors a source CSV file for new records (appended to the end of the file), extracts only those new records, and then processes each record in sequence, resulting in one or more SQL insert operations for each new CSV record. If I were back in my old Unix days, I would build a process around the "tail -f" command as the start of the pipeline, with the next step an ETL process that handled each record served by the tail command. But I can't figure out how to do this in Azure.
This process will be the pattern for many more similar ingestion processes: there could be approximately one thousand CSV files that need to be processed in this event-based, near-real-time way. I assume one process per CSV file.
Some nonfunctional requirements are speed and efficiency.
My goal is an event-based solution (low latency = speed)
that doesn't need to read the entire file every 5 minutes to see if there are changes. A scheduled re-read of the whole file is a (micro) batch process that would be horribly inefficient (read: expensive, with roughly 15,000x redundant processing). This is where the desire for something like the Unix "tail -f" command comes from: it watches the file for changes, emitting new data as it is appended to the source file. I'd hate to do something like a 'diff' every 5 minutes, as that is inefficient and, when scaled to thousands of tables, would be prohibitively expensive.
One possible solution to your problem is to store each new CSV record as a separate blob.
You will then be able to use Azure Event Grid to raise events when a new blob is created in Blob Storage i.e. use Azure Blob Storage as Event Grid source.
The basic idea is to store the changed CSV data as new blob and have Event Grid wired to Blob Created event. An Azure Function can listen to these events and then only process the new data. For auditing purposes, you can save this data in a separate Append Blob once the CSV processing has been completed.
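As a rough sketch of that wiring (Python, v1 programming model; the Event Grid subscription is assumed to be bound to the function in function.json, and the blob handling and SQL inserts are only outlined):
import logging

import azure.functions as func


def main(event: func.EventGridEvent):
    # Fires once per Microsoft.Storage.BlobCreated event delivered by Event Grid.
    payload = event.get_json()           # event data for the Blob Created event
    blob_url = payload.get("url")        # URL of the newly created per-record blob
    logging.info("New CSV record blob: %s", blob_url)

    # Download the blob, parse the single CSV record, and issue the corresponding
    # INSERT statement(s) against Azure SQL MI (e.g. with pyodbc); omitted here.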
We are currently using SQL Server in AWS. We are looking at ways to create a data warehouse from that data in SQL Server.
It seems like the easiest way would be to use the AWS DMS tool to send the data to Redshift and have it constantly sync. But Redshift is pretty expensive, so we are looking at other ways of doing it.
I have been working with EMR. Currently I am using Sqoop to take data from SQL Server and put it into Hive. I am currently using the HDFS volume to store data; I have not used S3 for that yet.
Our database has many tables with millions of rows in each.
What is the best way to update this data every day? Does Sqoop support updating data? If not, what other tool is used for something like this?
Any help would be great.
My suggestion: go for Hadoop clusters (EMR) if the processing is very complex and time-consuming; otherwise it is better to use Redshift.
Choose the right tool. If it is for a data warehouse, then go for Redshift.
And why DMS? Are you going to sync in real time? You want a daily sync, so there is no need for DMS.
Better solution:
1. Make sure you have a primary key column and a column that tells you when a row was updated, such as updated_at or modified_at.
2. Run BCP to export the data in bulk from SQL Server to CSV files.
3. Upload the CSV files to S3, then import them into Redshift.
4. Use Glue to fetch the incremental data (based on the primary key column and the updated_at column), then export it to S3.
5. Import the files from S3 into Redshift staging tables.
6. Run an upsert (update + insert) to merge each staging table into the main table.
If you feel that running Glue is a bit expensive, then use SSIS or a PowerShell script to do steps 1 to 4, and then use the psql command to import the files from S3 into Redshift and do steps 5 and 6.
This will handle the inserts and updates in your SQL Server tables, but deletes will not be part of it. If you need all CRUD operations, then go for a CDC approach with DMS or Debezium, and push the changes to S3 and Redshift.
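For step 6, the upsert is typically a delete-then-insert inside one transaction, since Redshift has no native UPSERT. A minimal sketch, assuming psycopg2 and that the staging table has already been loaded from S3 with COPY (connection details, table and key column names are placeholders):
import psycopg2

conn = psycopg2.connect(host="your-redshift-endpoint", port=5439,
                        dbname="dw", user="dw_user", password="***")
with conn, conn.cursor() as cur:
    # Remove rows in the main table that have a newer version in staging...
    cur.execute("""
        DELETE FROM main_table
        USING staging_table
        WHERE main_table.id = staging_table.id
    """)
    # ...then bring in everything from staging; the context manager commits on success.
    cur.execute("INSERT INTO main_table SELECT * FROM staging_table")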
We are currently on a Big Data project.
The Big Data platform is Cloudera Hadoop.
As input to our system we have a small flow of data that we collect via Kafka (approximately 80 MB/h, continuously).
Then the messages are stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data from the cold data: after 5 minutes, the data must be accessible as historical (cold) data. We chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g. INSERT INTO ...).
We are aware that this is not the recommended solution: each Impala insertion creates a small file (<10 KB) in HDFS.
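To make the pattern concrete, what we do today is roughly the following (sketched with impyla rather than the JDBC driver we actually use; the host, table, columns and the Kafka consumer are placeholders):
from impala.dbapi import connect  # assumes the impyla package

conn = connect(host="impalad-host", port=21050)   # placeholder coordinator host
cur = conn.cursor()

for msg in consume_kafka_messages():              # hypothetical generator of decoded messages
    # One statement per message: Impala writes a separate small file for each INSERT.
    cur.execute(
        "INSERT INTO history_data (event_time, payload) VALUES (%(t)s, %(p)s)",
        {"t": msg["event_time"], "p": msg["payload"]},
    )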
We are looking for a way to insert this small stream into an Impala database without producing many small files.
What solution would you recommend?
I have some questions about migration, data model and performance of Hadoop/Impala.
1. How do we migrate an Oracle application to Cloudera Hadoop/Impala?
1.1 How do we replace Oracle stored procedures with Impala, MapReduce, or a Java/Python application?
For example, the original stored procedure includes several parameters and SQL statements.
1.2 How do we replace unsupported or complex SQL, such as Oracle's OVER (PARTITION BY ...) analytic functions, in Impala?
Are there any existing examples or Impala UDFs?
1.3 How do we handle update operations, since part of the data has to be updated?
For example, do we use a data timestamp? Use a storage model that supports updates, such as HBase? Or delete the whole data set/partition/directory and insert it again (INSERT OVERWRITE)?
2. Data storage model, partition design and query performance
2.1 How do we choose between Impala internal tables and external tables (e.g. CSV, Parquet, HBase)?
For example, there are several kinds of data: large existing data imported from Oracle into Hadoop, new business data coming into Hadoop, data computed in Hadoop, and frequently updated data in Hadoop. How do we choose the data model for each? Does anything need special attention if the different kinds of data have to be joined?
We have XX TB of data from Oracle; do you have any suggestions about the file format, e.g. CSV or Parquet? Do we need to import the calculation results into an Impala internal table or just into HDFS? And if those kinds of data can be updated, how should we account for that?
2.2 How do we partition the table/external table when joining?
For example, there is a huge amount of sensor data, and each record includes a measurement, an acquisition timestamp and region information.
We need:
Calculate measurement data by region.
Query a series of measurements during a certain time interval for a specific sensor or region.
Query a specific sensor's data from the huge data set across all time.
Query data for all sensors on a specific date.
Would you please give us some suggestions about how to set up the partitions for internal tables and the directory structure for external tables (CSV)?
In addition, for the directory structure, which is better: date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide about that?
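To make the comparison concrete, the two layouts would correspond to partition clauses roughly like the following (only a sketch to illustrate the question, submitted here via impyla; the table and column names are placeholders, and string partition keys are used so the zero-padded values appear in the paths):
from impala.dbapi import connect  # assumes the impyla package

conn = connect(host="impalad-host", port=21050)   # placeholder coordinator host
cur = conn.cursor()

# Layout 1: .../date=20090101/area=BEIJING (backticks because `date` can be a reserved word)
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_by_date (
        sensor_id STRING,
        measure DOUBLE,
        acquired_at TIMESTAMP
    )
    PARTITIONED BY (`date` STRING, area STRING)
    STORED AS PARQUET
    LOCATION '/data/sensor_by_date'
""")

# Layout 2: .../year=2009/month=01/day=01/area=BEIJING
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_by_ymd (
        sensor_id STRING,
        measure DOUBLE,
        acquired_at TIMESTAMP
    )
    PARTITIONED BY (year STRING, month STRING, day STRING, area STRING)
    STORED AS PARQUET
    LOCATION '/data/sensor_by_ymd'
""")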
We are setting up Hadoop and Hive in our organization.
We will also have sample data created by a data generator tool; the data will be around 1 TB.
My question is: I have to load that data into Hive and Hadoop. What process do I need to follow for this?
We will also have HBase installed alongside Hadoop.
We need to recreate in Hive the same database design that currently exists in SQL Server, because after the data is loaded into Hive we want to use Business Objects 4.1 as a front end to create the reports.
The challenge is to load the sample data into Hive.
Please help, as we want to do all of this as soon as possible.
First, ingest your data into HDFS.
Then create Hive external tables pointing to the location where you ingested the data, i.e. your HDFS directory (a sketch follows below).
You are all set to query the data from the tables you created in Hive.
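As a rough illustration of the external-table step (submitted here with PyHive; the column list, delimiter and HDFS path are all assumptions about your generated data):
from pyhive import hive  # assumes the PyHive package and a running HiveServer2

conn = hive.connect(host="hiveserver2-host", port=10000)  # placeholder host
cur = conn.cursor()

# External table laid over the HDFS directory where the ingested sample data sits.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sample_data (
        id INT,
        name STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/app/sample_data'
""")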
Good luck.
For the first case, you need to put the data into HDFS:
Transfer your data file(s) to a client node (app node).
Put your files into the distributed file system (hdfs dfs -put ...).
Create an external table pointing to the HDFS directory into which you uploaded those files. Your data will have some structure, for instance fields delimited by a semicolon.
Now you can operate on the data with SQL queries.
For the second case, you can create another Hive table (using the HBaseStorageHandler, see https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load it from the first table with an INSERT statement.
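A rough sketch of that HBase-backed table (again submitted via PyHive; the column family, mappings and table names are assumptions):
from pyhive import hive  # assumes the PyHive package and a running HiveServer2

conn = hive.connect(host="hiveserver2-host", port=10000)  # placeholder host
cur = conn.cursor()

# Hive table backed by HBase through the storage handler mentioned above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sample_data_hbase (
        id INT,
        name STRING,
        amount DOUBLE
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:amount')
    TBLPROPERTIES ('hbase.table.name' = 'sample_data')
""")

# Load it from the external table created over the HDFS files in the first case.
cur.execute("INSERT INTO TABLE sample_data_hbase SELECT * FROM sample_data")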
I hope this can help you.