I want to upload CSV files from Google Sheets to a Snowflake database. Is there an option in Fivetran so that only upserts (only modified rows) of these CSV files are synced to the Snowflake table?
You can definitely use Fivetran to upload Google Sheets data into Snowflake: https://fivetran.com/docs/files/google-sheets Fivetran by default only updates modified rows (adds/deletes/updates), so it will indeed upsert if that Google Sheet is modified. I believe by default Fivetran will only 'soft-delete' deletions from source data, so I think if you delete a row from your Google Sheet it will remain in Snowflake, but will have a _fivetran_deleted column flag. I would test this explicitly if this matters to you.
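If soft-deleted rows do stay in the destination, downstream queries would need to filter on the _fivetran_deleted flag. Here is a minimal sketch of that filter. It uses Python's built-in sqlite3 as a stand-in for Snowflake; the table name sheet_data and its columns are made up for illustration, and SQLite stores the flag as 0/1 where Snowflake would use a boolean.

```python
import sqlite3

# In-memory stand-in for the destination table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sheet_data ("
    "id INTEGER PRIMARY KEY, name TEXT, _fivetran_deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO sheet_data VALUES (?, ?, ?)",
    [(1, "alpha", 0), (2, "beta", 1), (3, "gamma", 0)],
)


def active_rows(connection):
    """Return only the rows that are not flagged as soft-deleted."""
    cur = connection.execute(
        "SELECT id, name FROM sheet_data "
        "WHERE _fivetran_deleted = 0 ORDER BY id"
    )
    return cur.fetchall()


print(active_rows(conn))
```

The same WHERE clause idea should carry over to a view in Snowflake, so that consumers never see the flagged rows.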
I'm not sure what you mean by 'only upserts'. No deletions of data once it's stored in Snowflake from Google Sheets? If so, Fivetran may already do exactly what you want out of the box. Fivetran will definitely insert new rows and update existing ones.
I don't think you can do this with Fivetran, but you can two-way sync a Google Sheet with Snowflake using Wax in real time (e.g. as soon as the Sheet is edited, the update is sent to Snowflake).
Disclaimer: I made this.
I have created a number of large result spreadsheets to store data that pull data from another master spreadsheet using importrange that can be updated throughout the year.
My problem is that at the end of the year I want to store the data values from my results spreadsheets from this year statically in say copies but then have the dynamic spreadsheets continue on the next year with new values etc being populated.
Worst case, I can copy the spreadsheets and change the importrange links, but there are lots of sheets and links, so I'm wondering if there is a good way to simply make a copy of the sheet and then make it static, so it no longer pulls data from the master sheet but keeps its values.
I tried downloading the Google Sheets as Excel files and re-uploading them, but they break when re-uploaded to Google Drive.
How can I make a copy of Data from a google sheet with importrange static?
Please clarify my confusion, as I keep hearing that we need to read every Parquet file created by Databricks Delta tables to get to the latest data in the case of an SCD2 table. Is this true?
Can we simply use SQL and get the latest row?
Can we use some date/time columns to get to history of changes to that row?
Thanks
You can create and manage a Delta table as SCD2 if you wish. Actually, what confuses you must be the Time Travel feature. It just allows you to roll back your Delta table to any of its historical states, which are automatically managed by Databricks.
Answer:
No. Instead, we need to look into the Delta Lake transaction logs, which are stored as JSON files inside the _delta_log folder alongside the Parquet files, and then see which Parquet files make up the latest data. See this doc for more information.
Yes. We can use SQL to get the latest row. Actually, any SQL query will always return the latest data. However, there are 2 ways to get a historical version of the table with SQL: 1) specifying "as of version" and 2) specifying "as of timestamp". Read more.
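To make the SCD2 side of the question concrete, here is a minimal sketch of "latest row" and "history of changes" as plain SQL. It runs against Python's built-in sqlite3 rather than Databricks, and the table customer_scd2 with its valid_from, valid_to and is_current columns is a hypothetical layout, not something taken from the question. Time Travel is a separate mechanism and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical SCD2 layout: one row per version of each customer.
conn.execute(
    """CREATE TABLE customer_scd2 (
           customer_id INTEGER,
           city        TEXT,
           valid_from  TEXT,
           valid_to    TEXT,     -- NULL while the version is current
           is_current  INTEGER   -- 1 for the latest version, else 0
       )"""
)
conn.executemany(
    "INSERT INTO customer_scd2 VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Oslo", "2020-01-01", "2021-06-01", 0),
        (1, "Bergen", "2021-06-01", None, 1),
        (2, "Paris", "2019-03-15", None, 1),
    ],
)


def latest_row(connection, customer_id):
    """Latest version of one customer, selected by the is_current flag."""
    return connection.execute(
        "SELECT customer_id, city, valid_from FROM customer_scd2 "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()


def history(connection, customer_id):
    """All versions of one customer, oldest first, using the date columns."""
    return connection.execute(
        "SELECT city, valid_from, valid_to FROM customer_scd2 "
        "WHERE customer_id = ? ORDER BY valid_from",
        (customer_id,),
    ).fetchall()


print(latest_row(conn, 1))
print(history(conn, 1))
```

In both cases the query only names the table; on a Delta table, working out which Parquet files to read is the engine's job, not the query author's.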
PS This is a really good question.
I have multiple Fivetran destination tables in Snowflake. Those tables were created by Fivetran itself, and Fivetran currently writes data into them. Now I would like to stop syncing data into one of the tables and start writing to that table from a different source. Would I run into any trouble with this? Should I do something else in order to make it possible?
What you mention is not possible because of how Fivetran works. Connector sources write to one destination schema and one only. Swapping destination tables between connectors is not a feature as of now.
I have a requirement in the project I am currently working on to compare the most recent version of a record with the previous historical record to detect changes.
I am using the Azure Offline data sync framework to transfer data from a client device to the server which causes records in the synced table to update based on user changes. I then have a trigger copying each update into a history table and a SQL query which runs when building a list of changes to compare the current record vs the most recent historical by doing column comparisons - mainly string but some integer and date values.
Is this the most efficient way of achieving this? Would it be quicker to load the data into memory and perform a code based comparison with rules?
Also, if I continually store all the historical data in a SQL table, will this affect performance over time, and would I be better off storing this data in something like Azure Table Storage? I am also thinking about cost, as SQL usage is much more expensive than Table Storage, but obviously I cannot use a trigger there and would need to insert each synced row into Table Storage manually.
You could avoid querying and comparing the historical data altogether, because the most recent version is already in the main table (and if it's not, it will certainly be new/changed data).
Consider a main table with 50.000 records and 1.000.000 records of historical data (and growing every day).
Instead of updating the main table directly and then querying the 1.000.000 records (and extracting the most recent record), you could query the smaller main table for that one record (probably an ID), compare the fields, and only if there is a change (or no data yet) update those fields and add the record to the historical data (or use a trigger / stored procedure for that).
That way you don't even need a database (probably containing multiple indexes) for the historical data. You could even store it in a flat file if you wanted, depending on what you want to do with that data.
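The compare-before-write idea above can be sketched as follows. This is only an illustration in plain Python: main_table and history are in-memory stand-ins for the main table and the historical store, and apply_update is a hypothetical helper, not part of the Azure sync framework.

```python
from datetime import datetime, timezone

main_table = {}  # record id -> current field values (stand-in for the main table)
history = []     # append-only change log (stand-in for the historical store)


def apply_update(record_id, new_fields):
    """Write to main_table only if something changed; return the changed fields.

    The result maps each changed field name to an (old, new) pair.
    """
    current = main_table.get(record_id)
    if current is None:
        # No data yet: every supplied field counts as a change.
        changes = {key: (None, value) for key, value in new_fields.items()}
    else:
        changes = {
            key: (current.get(key), value)
            for key, value in new_fields.items()
            if current.get(key) != value
        }
    if not changes:
        return {}  # nothing changed, so no update and no history entry

    main_table[record_id] = {**(current or {}), **new_fields}
    history.append(
        {
            "id": record_id,
            "changed_at": datetime.now(timezone.utc).isoformat(),
            "changes": changes,
        }
    )
    return changes
```

Calling apply_update(1, {"name": "Ann", "age": 30}) twice in a row records a single history entry, because the second call finds no differences. The comparison only ever touches the one current record, never the growing history.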
The sync framework I am using deals with the actual data changes, so I only get new history records when there is an actual change. Given a batch of updates to a number of records, I need to compare all the changes with their previous state and produce an output list of what's changed.
We need a CSV viewer that can handle 10MM-15MM rows in a Windows environment, where each column has some filtering capability (regex or plain text search is fine).
I strongly suggest using a database instead, and running queries (eg, with Access). With proper SQL queries you should be able to filter on the columns you need to see, without handling such huge files all at once. You may need to have someone write a script to input each row of the csv file (and future csv file changes) into the database.
I don't want to be the end user of that app. Store the data in SQL. Surely you can define criteria to query on before generating a .csv file. Give the user an online interface with the column headers and filters to apply. Then generate a query based on the selected filters, providing the user only with the lines they need.
This will save many people time, headaches and eye sores.
We had this same issue and used a 'report builder' to build the criteria for the reports prior to actually generating the downloadable csv/Excel file.
As others suggested, I would also choose a SQL database. It's already optimized to run queries over large data sets. There are a couple of embedded databases, like SQLite or FirebirdSQL (embedded).
http://www.sqlite.org/
http://www.firebirdsql.org/manual/ufb-cs-embedded.html
You can easily import a CSV into a SQL database with just a few lines of code and then build a SQL query instead of writing your own solution to filter large tabular data.
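As a rough sketch of what those "few lines" could look like, here is one way to do it with Python's standard csv and sqlite3 modules. The load_csv helper, the orders table and the sample data are made up for illustration. Every column is created as TEXT to keep the example short, and table and column names are assumed to come from a trusted file.

```python
import csv
import io
import sqlite3


def load_csv(connection, table, csv_file):
    """Create `table` from the CSV header row and bulk-insert the other rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute(f'CREATE TABLE "{table}" ({columns})')
    # executemany consumes the reader lazily, so the file is never held
    # in memory all at once.
    connection.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})', reader
    )
    connection.commit()


# Small in-memory sample standing in for a multi-million-row file.
sample = io.StringIO("id,city,amount\n1,Oslo,10\n2,Paris,25\n3,Oslo,7\n")
conn = sqlite3.connect(":memory:")
load_csv(conn, "orders", sample)

# Column filtering becomes an ordinary parameterized query.
rows = conn.execute(
    "SELECT id, amount FROM orders WHERE city = ? ORDER BY id", ("Oslo",)
).fetchall()
print(rows)
```

For a real file, something like open(path, newline="") would replace the StringIO sample, and a file-backed database (sqlite3.connect("data.db")) would avoid re-importing on every run. Text search on a column can use LIKE, and an index on frequently filtered columns would likely help at the 10-15 million row scale.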