Queries over large data tables - performance

We have a large dataset of unstructured data (Azure Blob) and have started noticing that refreshing our model gets quite slow once more than a few thousand records are loaded.
Our current query structure is:
#"Load Data"
Loads data from the Azure Blob, ~1000 files
Parses the files into a table with 3 columns (of list/record types which can be further expanded), ~700k rows
#"Sessions"
Reference #"Load Data"
Expand all 'Session' related columns
#"Users"
Reference #"Load Data"
Expand all 'User' related columns
#"Events"
Reference #"Load Data"
Expand all 'Event' related columns
#"Events By Name"
Reference #"Events"
Groups by 'event.name'- generates a column of tables to each event type's events and properties (these vary between events)
#"Event Name1" (2, 3, etc. one table per event type)
Reference #"Events by Name"
Expands that event name's Table, and generates a table with event.id and each of the properties for that event type
While running this and watching the resource monitor, memory usage goes through the roof, eventually producing tons of hard faults and heavy disk usage. From the query execution popup, it seems that a bunch of queries kick off and run in parallel.
If I load the data from a local folder, the queries still all seem to fetch the data, go through the files, and evaluate the referenced common queries in parallel. I believe this is what's causing the memory usage to go haywire, the disk to kick in, and the queries to take hours to run.
I assumed referenced queries would run once first and then have their resulting tables reused by the individual queries referencing them, but that doesn't seem to be the case. I've also tried using Table.Buffer as the last step of #"Load Data" and #"Events", in an attempt to have those queries computed once and then shared across dependents, but that only seemed to make things worse. Are there ways to:
Make a query run only once, and have its result passed forward to any queries referencing it
Prevent queries from running in parallel, and have them run sequentially instead
Am I just looking at this the wrong way? A lot of 'performance' articles I found only mention structuring your queries to allow Query Folding. However, that is not possible in our case, as the Azure Blob storage really just stores 'blob' files which have to be downloaded and parsed locally.
It's been a real struggle to get these queries running on our current 700k test events, and we expect that to grow to millions in the real environment. Is our only option to pre-process the blobs, push the data into a SQL database, and connect our model to that instead?

Process your data first and store it in a table in your DB, then use that table as the data source for your model. Refresh the data in the source table with a job that runs on a scheduled interval and updates the table.
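As a rough sketch of what that pre-processed source table could look like (the table, column, and staging names below are assumptions, not anything from the original model), the scheduled job would parse the blobs and upsert the flattened rows into something like:

-- Hypothetical flattened table the model would read instead of the raw blobs.
CREATE TABLE dbo.ParsedEvents
(
    EventId        BIGINT         NOT NULL PRIMARY KEY,
    SessionId      BIGINT         NULL,
    UserId         BIGINT         NULL,
    EventName      NVARCHAR(200)  NOT NULL,
    EventTime      DATETIME2      NOT NULL,
    PropertiesJson NVARCHAR(MAX)  NULL   -- per-event-type properties kept as JSON
);

-- The scheduled job fills dbo.ParsedEvents_Staging from the parsed blobs (assumed here),
-- then merges it in, so the model refresh only ever reads an already-flattened table.
MERGE dbo.ParsedEvents AS target
USING dbo.ParsedEvents_Staging AS source
      ON target.EventId = source.EventId
WHEN MATCHED THEN
    UPDATE SET SessionId      = source.SessionId,
               UserId         = source.UserId,
               EventName      = source.EventName,
               EventTime      = source.EventTime,
               PropertiesJson = source.PropertiesJson
WHEN NOT MATCHED THEN
    INSERT (EventId, SessionId, UserId, EventName, EventTime, PropertiesJson)
    VALUES (source.EventId, source.SessionId, source.UserId, source.EventName, source.EventTime, source.PropertiesJson);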

Related

How to do table operations in Google BigQuery?

I wanted some advice on how to deal with table operations (renaming a column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date, e.g. if I have a table named fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename-column wrapper generates aliased queries, i.e. if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I run a batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 slots for ETL jobs that are running.
Also, I haven't found a way to do a poll_job on a batch operation to see whether or not all jobs in the batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
I can propose two options.
Using a view
Creating a view is very simple to script out and execute - it is fast, and free compared with the cost of scanning the whole table in a select-into approach. You can create a view using the Tables: insert API with the type property set appropriately (see the sketch below).
Using Jobs: insert with EXTRACT and then LOAD
Here you extract the table to GCS and then load it back into BigQuery with the adjusted schema.
The above approach will a) eliminate the cost of querying (scanning) the tables and b) may help with the concurrency limitations - but that depends on the actual volume of your tables and the other requirements you might have.
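A minimal sketch of the view option, assuming a dataset called mydataset and one of the date-partitioned tables from the question; the view just aliases the old column names to the new ones, so nothing is scanned until the view is actually queried. It is written here as BigQuery DDL for brevity; a Tables: insert API call would carry the same SELECT in its view.query property.

-- Hypothetical view exposing fact_20160301 under the new column names.
CREATE VIEW mydataset.fact_20160301_renamed AS
SELECT
  address AS location,
  name    AS firstname,
  city    AS town
FROM mydataset.fact_20160301;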
The best way to manipulate a schema is through the Google BigQuery API.
Use the tables.get API to retrieve the existing schema for your table: https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns etc.
Again using the API, perform an update on the schema, setting it to your newly modified version. This should all occur in one job: https://cloud.google.com/bigquery/docs/reference/v2/tables/update

Historical Data Comparison in realtime - faster in SQL or code?

I have a requirement in the project I am currently working on to compare the most recent version of a record with the previous historical record to detect changes.
I am using the Azure offline data sync framework to transfer data from a client device to the server, which causes records in the synced table to be updated based on user changes. I then have a trigger copying each update into a history table, and a SQL query which runs when building a list of changes, comparing the current record against the most recent historical record via column comparisons - mainly strings, but some integer and date values.
Is this the most efficient way of achieving this? Would it be quicker to load the data into memory and perform a code-based comparison with rules?
Also, if I continually store all the historical data in a SQL table, will this affect performance over time, and would I be better off storing this data in something like Azure Table Storage? I am also thinking along the lines of cost, as SQL usage is much more expensive than Table Storage, but obviously I then cannot use a trigger and would need to insert each synced row into Table Storage manually.
You could avoid querying and comparing the historical data altogether, because the most recent version is already in the main table (and if it's not, it will certainly be new/changed data).
Consider a main table with 50,000 records and 1,000,000 records of historical data (and growing every day).
Instead of updating the main table directly and then querying the 1,000,000 records (and extracting the most recent record), you could query the smaller main table for that one record (probably by ID), compare the fields, and only if there is a change (or no data yet) update those fields and add the record to the historical data (or use a trigger / stored procedure for that).
That way you don't even need a database (probably containing multiple indexes) for the historical data; you could even store it in a flat file if you wanted, depending on what you want to do with that data.
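A rough T-SQL sketch of that compare-before-write idea (the table and column names are invented for illustration): only rows that actually differ get updated, and the OUTPUT clause writes just those rows to the history table.

-- dbo.SyncedChanges is assumed to hold the incoming batch from the sync framework.
UPDATE m
SET    m.Name      = c.Name,
       m.Amount    = c.Amount,
       m.UpdatedAt = c.UpdatedAt
OUTPUT inserted.Id, inserted.Name, inserted.Amount, inserted.UpdatedAt
INTO   dbo.MainHistory (Id, Name, Amount, UpdatedAt)
FROM   dbo.Main AS m
JOIN   dbo.SyncedChanges AS c ON c.Id = m.Id
WHERE  m.Name      <> c.Name        -- wrap nullable columns in ISNULL() if needed
    OR m.Amount    <> c.Amount
    OR m.UpdatedAt <> c.UpdatedAt;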
The sync framework I am using deals with the actual data changes, so I only get new history records when there is an actual change. Given a batch of updates to a number of records, I need to compare all the changes with their previous state and produce an output list of what's changed.

temp table vs data flow task on physical table

Here is the scenario: I have one staging table for a CSV file, which is my source. I am loading the file into a physical staging table, and I will be doing transformations on this staging table data in a later part of the package. I need fresh data (as it is from the source).
Should I do the transformations in a temp table, or should I use a data flow task again to reload the staging table?
The data isn't much - just under a million rows.
There is a standard pattern for this.
1. Extract the data (from the CSV to your temp area)
2. Transform the data (clean it, convert it, format it, join other stuff to it, make it compatible with your new system)
3. Load the data (update/insert/delete to your live tables)
This is where the acronym for ETL comes from - http://en.wikipedia.org/wiki/Extract,_transform,_load
The primary advantages are that at step 1 you have only one thread/user loading the data, so it can be extracted quickly; at step 2 you are manipulating the data without causing any locks on other tables; and finally, once the data is ready, you are able to load it into your live tables in the quickest way possible.
Your two biggest (often competing) concerns are Simplicity and Speed. Simplicity is great because it involves less code, requires less debugging, and makes you far more confident that your data is clean. Sometimes you have to sacrifice simplicity for speed, however.
In your case, since you are only loading a few million rows, I'd suggest you just reload the staging table every time so every single load uses the same ETL process. This keeps your ETL mechanism easy to code, maintain and explain.
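A minimal T-SQL sketch of that reload-the-staging-table-every-time pattern (the file path and object names are placeholders):

-- 1. Extract: reload the staging table from the CSV on every run.
TRUNCATE TABLE dbo.StagingSales;

BULK INSERT dbo.StagingSales
FROM 'C:\feeds\sales.csv'                    -- placeholder path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 2. Transform: clean and convert inside the staging area, without locking live tables.
UPDATE dbo.StagingSales
SET    CustomerName = LTRIM(RTRIM(CustomerName));

-- 3. Load: push the prepared rows into the live table in one set-based statement.
INSERT INTO dbo.Sales (SaleId, CustomerName, Amount, SaleDate)
SELECT SaleId, CustomerName, Amount, SaleDate
FROM   dbo.StagingSales;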
FYI - if you're using SQL Server, check out SSIS.

What specific issues will I have to consider when saving files as binary data to a SQL Server 2005 database?

I'm writing an online tax return filing application using MVC3 and EF 4.1. Part of the application requires that the taxpayer be able to upload documents associated with their return. The users will be able to come back days or weeks later and possibly upload additional documents. Prior to finally submitting their return, the user is able to view a list of the files that have been uploaded. I've written the application to save the uploaded files to a directory defined in the web.config. When I display the review page to the user, I loop through the files in the directory and display them as a list.
I'm now thinking that I should be saving the files to the actual SQL Server as binary data, in addition to saving them to the directory. I'm trying to avoid 'what if' scenarios.
What if
A staff member accidentally deletes a file from the directory.
The file server crashes (Other agencies use the same SAN as us)
A staff member saves other files to the same directory. The taxpayer should not see those
Any other scenario that causes us to have to request another copy of a file from a taxpayer (Failure is not an option)
I'm concerned that saving to the SQL Server database will have dire consequences that I am not aware of since I've not done this before in a production environment.
There's a really good paper by Microsoft Research called To Blob or Not To Blob.
Their conclusion after a large number of performance tests and analysis is this:
if your pictures or documents are typically below 256 KB in size, storing them in a database VARBINARY column is more efficient
if your pictures or documents are typically over 1 MB in size, storing them in the filesystem is more efficient (and with SQL Server 2008's FILESTREAM attribute, they're still under transactional control and part of the database)
in between those two, it's a bit of a toss-up depending on your use
If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table; keep it in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.
For filegroups, check out Files and Filegroup Architecture for an intro. Basically, you would either create your database with a separate filegroup for large data structures right from the beginning, or add an additional filegroup later. Let's call it "LARGE_DATA".
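As a quick sketch (the database name, file path, and sizes below are just placeholders), adding such a filegroup to an existing database looks like this:

ALTER DATABASE YourDatabase ADD FILEGROUP LARGE_DATA;

ALTER DATABASE YourDatabase
ADD FILE
(
    NAME = N'YourDatabase_LargeData',
    FILENAME = N'D:\SQLData\YourDatabase_LargeData.ndf',  -- placeholder path
    SIZE = 100MB,
    FILEGROWTH = 100MB
)
TO FILEGROUP LARGE_DATA;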
Now, whenever you have a new table to create which needs to store VARCHAR(MAX) or VARBINARY(MAX) columns, you can specify this file group for the large data:
CREATE TABLE dbo.YourTable
(
    -- define your fields here; these are just example columns
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    FileName NVARCHAR(260)  NOT NULL,
    Content  VARBINARY(MAX) NOT NULL
)
ON [Data]                -- the basic "Data" filegroup for the regular data
TEXTIMAGE_ON LARGE_DATA  -- the filegroup for large chunks of data
Check out the MSDN intro on filegroups, and play around with it!

Referencing object's identity before submitting changes in LINQ

Is there a way of knowing the ID of the identity column of a record inserted via InsertOnSubmit beforehand, i.e. before calling the data context's SubmitChanges?
Imagine I'm populating some kind of hierarchy in the database, but I wouldn't want to submit changes on each recursive call for each child node (e.g. if I had a Directories table and a Files table and were recreating my filesystem structure in the database).
I'd like to do it this way: I create a Directory object, set its name and attributes,
then InsertOnSubmit it into the DataContext.Directories collection, then reference Directory.ID in its child Files. Currently I need to call SubmitChanges to actually insert the 'directory' into the database so that the mapping fills in its ID column. But this creates a lot of transactions and database accesses, and I imagine that if I did this inserting in a batch, the performance would be better.
What I'd like to do is to somehow use Directory.ID before committing the changes, create all my File and Directory objects in advance, and then do one big submit that puts everything into the database. I'm also open to solving this problem via a stored procedure; I assume the performance would be even better if all operations were done directly in the database.
One way to get around this is to not use an identity column. Instead, build an IdService that you can use in code to get a new id each time a Directory object is created.
You can implement the IdService by having a table that stores the last id used. When the service starts up, have it grab that number. The service can then increment away while Directory objects are created, and update the table with the new last id used at the end of the run.
Alternatively, and a bit safer: when the service starts up, have it grab the last id used and then update the last id used in the table by adding 1000 (for example). Then let it increment away. If it uses all 1000 ids, have it grab the next 1000 and update the table again. Worst case, you waste some ids, but if you use a bigint you aren't ever going to care.
Since the Directory id is now controlled in code you can use it with child objects like Files prior to writing to the database.
Simply putting a lock around id acquisition makes this safe to use across multiple threads. I've been using this in a situation like yours. We're generating a ton of objects in memory across multiple threads and saving them in batches.
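A rough T-SQL sketch of the block-reservation part (the table and column names are made up); the in-code IdService would run the UPDATE once per block and then hand out ids from memory under a lock:

-- One row per entity whose ids we manage.
CREATE TABLE dbo.IdAllocation
(
    EntityName SYSNAME NOT NULL PRIMARY KEY,
    LastIdUsed BIGINT  NOT NULL
);
INSERT INTO dbo.IdAllocation (EntityName, LastIdUsed) VALUES (N'Directory', 0);

-- Atomically reserve the next 1000 ids and return the reserved range.
DECLARE @blockSize BIGINT = 1000, @rangeStart BIGINT;

UPDATE dbo.IdAllocation
SET    @rangeStart = LastIdUsed + 1,            -- reads the pre-update value
       LastIdUsed  = LastIdUsed + @blockSize
WHERE  EntityName = N'Directory';

SELECT @rangeStart AS RangeStart, @rangeStart + @blockSize - 1 AS RangeEnd;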
This blog post will give you a good start on saving batches in LINQ to SQL.
I'm not sure off the top of my head if there is a way to run a straight SQL query in LINQ, but this query will return the current identity value of the specified table.
USE [database];
GO
DBCC CHECKIDENT ("schema.table", NORESEED);
GO
