nhibernate - archiving records - performance

This is a simplistic view of our domain model (we are in healthcare):
class Account
{
    List<Registration> Registrations { ... }
    DateTime CreatedDate { ... }
    Type1 Property1 { ... }
    Type2 Property2 { ... }
    ...
}
class Registration
{
    InsuranceInformation { ... }
    PatientVisit { ... }
    Type1 Property1 { ... }
    Type2 Property2 { ... }
    ...
}
Setup
We use NHibernate/Fluent NHibernate to set up and configure the SessionFactory.
Let's assume that we have set up all the required table indices.
Usage
We get ~10,000 new accounts a day.
We have about 500K accounts in total.
We have several LINQ queries that operate over these accounts.
All our queries use LINQ; most are built dynamically using the predicate builder pattern (we don't use HQL).
The problem is that as the number of accounts increases, the execution time of these queries increases.
Note:
Only accounts within a 48-hour window are relevant to our queries/application, but older accounts need to be preserved (so they cannot be deleted). Even though these accounts are not needed by the application, they may be consumed later by an analytics application.
To solve this performance issue, we are considering archiving accounts that are older than 48 hours:
Creating an archive database with the same schema as the main db.
Adding a Windows service, scheduled to run on a nightly basis, that moves "old" accounts from the main db to the archive db.
The Windows service will use NHibernate to read old accounts from the main db, save them (again using NHibernate) to the archive database, and then delete them from the main db. Right now, we think this service will move one account at a time until all the old accounts are moved to the archive database.
Occasionally, when we do get a request to restore an account from the archive db, we will reverse the above steps.
Questions:
Is this archival approach any good? If not, why not? Can you suggest some alternate implementations?
Can I use the same SessionFactory to connect to the main db and the archive db during the copy process? How can I change the connection string dynamically? Can I have two simultaneous open sessions that work with the two databases? (A rough sketch of what we have in mind is below.)
Can I copy more than one account at a time using this approach? Batch copies and batch deletes?
Any help appreciated, thank you for your input.
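For reference, here is a rough, untested sketch of the copy step we have in mind, using a separate SessionFactory per connection string and NHibernate's Replicate to copy a graph into the archive db while keeping its identifiers. We are on SQL Server, so MsSql2008 is used here; AccountMap stands in for our actual Fluent mapping class, and batching, cascades and error handling are glossed over:

using System;
using System.Linq;
using FluentNHibernate.Cfg;
using FluentNHibernate.Cfg.Db;
using NHibernate;
using NHibernate.Linq;

class ArchiveJob
{
    // Move a batch of accounts older than 48 hours from the main db to the archive db.
    static void MoveOldAccounts(string mainConnectionString, string archiveConnectionString)
    {
        // One factory per database; both use the same mappings.
        ISessionFactory mainFactory = BuildFactory(mainConnectionString);
        ISessionFactory archiveFactory = BuildFactory(archiveConnectionString);

        using (var mainSession = mainFactory.OpenSession())
        using (var archiveSession = archiveFactory.OpenSession())
        {
            var cutoff = DateTime.UtcNow.AddHours(-48);
            var oldAccounts = mainSession.Query<Account>()
                .Where(a => a.CreatedDate < cutoff)
                .Take(100)                       // copy in batches instead of one account at a time
                .ToList();

            using (var tx = archiveSession.BeginTransaction())
            {
                foreach (var account in oldAccounts)
                    // Replicate copies an entity graph into another database, preserving its identifier.
                    archiveSession.Replicate(account, ReplicationMode.Overwrite);
                tx.Commit();
            }

            using (var tx = mainSession.BeginTransaction())
            {
                foreach (var account in oldAccounts)
                    mainSession.Delete(account); // relies on cascade settings to remove registrations
                tx.Commit();
            }
        }
    }

    static ISessionFactory BuildFactory(string connectionString)
    {
        return Fluently.Configure()
            .Database(MsSqlConfiguration.MsSql2008.ConnectionString(connectionString))
            .Mappings(m => m.FluentMappings.AddFromAssemblyOf<AccountMap>())
            .BuildSessionFactory();
    }
}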

I think your issue is more database-related than NHibernate-related. A database with 500k records is not that much. To optimize access you should think about how you query and how to optimize for those queries:
Query only the data you need.
Optimize your tables by adding indexes.
Use the 20/80 rule: find the 20% of queries that are expensive and optimize that code/those queries. Your program will be 80% faster.
NHibernate: optimize your mappings.
NHibernate: use a batch size if you do multiple updates (a configuration sketch follows below).
Add stored procedures if something is hard to do in code.
If your db grows, hire a db expert to advise on database optimization (they can improve your performance by 10% to 90%). You need them for a few days at first and then once a week/month, depending on how much work there is.
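To make the batch-size point concrete, here is a Fluent NHibernate configuration sketch. It assumes the same FluentNHibernate namespaces as the question's stack; the value 50 and the AccountMap mapping class are placeholders to tune and rename for your own project:

static NHibernate.ISessionFactory BuildBatchedFactory(string connectionString)
{
    // adonet.batch_size groups inserts/updates/deletes into fewer round trips;
    // 50 is just a placeholder value to benchmark against your own workload.
    return Fluently.Configure()
        .Database(MsSqlConfiguration.MsSql2008
            .ConnectionString(connectionString)
            .AdoNetBatchSize(50))
        .Mappings(m => m.FluentMappings.AddFromAssemblyOf<AccountMap>())
        .BuildSessionFactory();
}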

Related

Advice on Setup

I started my first data analysis job a few months ago, and I am in charge of a SQL database, taking that data and creating dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data to the database ourselves; instead, the data is put into tables based on what is entered into the web portal. Since this database is replicated by another company, I created our own database that is connected via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together, due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse (star) schema, keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data is not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.

Cache and regularly update complex data

Let's start with the background. I have an API endpoint that I have to query every 15 minutes and that returns complex data. Unfortunately, this endpoint does not say what exactly changed, so I have to compare the data I already have in the db against everything returned and then execute updates, adds, or deletes. This is pretty tedious...
I came to the idea that I could simply remove all the data from certain tables and rebuild everything from scratch... But I also have to return this cached data to my clients, so there might be a situation where the db is empty during a client request because it is being "refreshed/rebuilt". And that can't happen, because I have to return something.
So I came to the idea to either:
lock the relevant db tables so that the client has to wait while the db is refreshed, or
use CQRS: https://martinfowler.com/bliki/CQRS.html
Do you have any suggestions on how to solve the problem?
It sounds like you're using a relational database, so I'll try to outline a solution using database terms. The idea, however, is more general than that. In general, it's similar to Blue-Green deployment.
Have two data tables (or two databases, for that matter); one is active, and one is inactive.
When the software starts the update process, it can wipe the inactive table and write new data into it. During this process, the system keeps serving data from the active table.
Once the data update is entirely done, the system can begin to serve data from the previously inactive table. In other words, the inactive table becomes the active table, and vice versa.
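A minimal sketch of the swap in application code, assuming two identically shaped tables (the names are made up) and a single refresh job as the only writer:

using System.Threading;

// Sketch: readers always query the currently active table; the refresh job rebuilds
// the inactive table and then swaps the pointer in one atomic step.
public class BlueGreenTableSwitch
{
    private readonly string[] _tables = { "CachedData_A", "CachedData_B" }; // made-up table names
    private int _activeIndex; // 0 or 1

    public string ActiveTable => _tables[Volatile.Read(ref _activeIndex)];

    public string InactiveTable => _tables[1 - Volatile.Read(ref _activeIndex)];

    // Called by the refresh job once the inactive table has been fully rebuilt.
    public void Swap() => Volatile.Write(ref _activeIndex, 1 - Volatile.Read(ref _activeIndex));
}

The same switch can also be pushed down into the database itself, for example by repointing a synonym or view at whichever table is currently active, so that clients never have to know there are two tables.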

How to handle multiple customers with different SQL databases

Summary
I have a project with multiple existing MSSQL databases. I have already created an Azure Analysis Services instance where I deployed my first tabular cube, and I have already tested access to the Analysis Services model; it worked perfectly.
Finally, I have to duplicate the above for ~90 databases (90 different customers).
I'm unsure how to organize this project, and I'm not sure about the possibilities I have.
What I did
I browsed the Internet for information, but I only found a single source where somebody asked a similar question; the first reply is what I was already thinking about, as described below.
The last reply I don't really understand: what does he mean by one solution? Is there another hierarchy above the project?
Question
One possibility would be to import each database as a source into the same project, but I think this means I have to import each table from each source, which means 5 * 90 = 450 tables in the end; I think this would quickly get out of control.
I also thought about duplicating the whole Visual Studio project folder ~90 times, once per customer; at the moment I fail to find all the references where the name has to be changed, but I don't think that would be too hard.
Is there an easier way to achieve my goal, especially regarding maintainability?
Solution
I will create a completely new database with all the needed tables. Into those tables I will copy the data from all the customer databases, adding a new customerId column. I'll transfer the data with a cyclic job, with the periodicity still to be defined. Updates to rows that already exist in a customer database I will handle with a trigger.
For this, the best approach would be to create a staging database and import the data from the other databases into it, so your tabular model can read the data from there.
Doing 90+ databases is going to be a massive admin overhead, and getting the cube to load them effectively is going to be problematic. Move the data using SSIS/Data Factory, as you'll be able to better orchestrate the data movement and incremental loads that way. Then, if you need to add/remove/update data sources, it is not done in the cube; it's all done at the database/Data Factory level.
Just use one database for all the customers and differentiate each customer with a customer_id column.
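As a rough sketch of the per-customer copy into a shared staging table with a customerId column, something along these lines could work. All connection strings, table and column names here are placeholders, and this only covers a full reload of one table; the incremental/trigger part described above is left out:

using System.Data;
using System.Data.SqlClient;

class CustomerStagingLoader
{
    // Copy one customer's table into the shared staging table, stamping every row
    // with that customer's id so all customers can share a single schema.
    public static void CopyCustomer(string customerConnectionString,
                                    string stagingConnectionString,
                                    int customerId)
    {
        using (var source = new SqlConnection(customerConnectionString))
        using (var cmd = new SqlCommand("SELECT Col1, Col2 FROM dbo.SourceTable", source)) // placeholder columns
        {
            source.Open();
            var table = new DataTable();
            table.Load(cmd.ExecuteReader());

            // Add the discriminator column the staging table expects.
            var idColumn = table.Columns.Add("CustomerId", typeof(int));
            foreach (DataRow row in table.Rows)
                row[idColumn] = customerId;

            using (var bulk = new SqlBulkCopy(stagingConnectionString))
            {
                bulk.DestinationTableName = "dbo.Staging_SourceTable"; // placeholder
                foreach (DataColumn column in table.Columns)
                    bulk.ColumnMappings.Add(column.ColumnName, column.ColumnName);
                bulk.WriteToServer(table);
            }
        }
    }
}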

Dynamically list the contents of a table in a database that continuously updates

It's kind of a real-world problem, and I believe a solution exists, but I couldn't find one.
So we have a database called Transactions that contains tables such as Positions, Securities, Bogies, Accounts, Commodities and so on, which are updated continuously, every second, whenever a new transaction happens. For the time being, we have replicated the master database Transactions to a new database named TRN, on which we do all the querying and updating.
We want a sort of monitoring system (like the htop process viewer in Linux) for the database that dynamically lists the updated rows in the tables of the database at any time.
TL;DR: Is there any way to get a continuously updating list of rows in any table in the database?
Currently we are working with Sybase and Oracle DBMSs on Linux (Ubuntu), but we would like generic answers that cover most platforms and DBMSs (including MySQL), and any tools, utilities or scripts that can do this, so that it can help us easily migrate to other platforms and/or DBMSs in the future.
To list updated rows, you conceptually need one of two things:
The updating statement's effect on the table.
A previous version of the table to compare with.
How you get them and in what form is completely up to you.
The first option allows you to list updates with statement granularity, while the second is more suitable for time-based granularity.
Some options from the top of my head:
Write to a temporary table
Add a field with transaction id/timestamp
Make clones of the table regularly
AFAICS, Oracle doesn't have built-in facilities to get the affected rows, only their count.
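To make the timestamp option concrete, here is a generic polling sketch in .NET terms. The Id and LastUpdated columns are assumptions (something has to maintain the timestamp, e.g. the application or a trigger), and the same idea works with whichever ADO.NET provider matches your DBMS:

using System;
using System.Data.SqlClient; // swap in the provider for your DBMS (Sybase, Oracle, MySQL, ...)

class ChangePoller
{
    // List rows changed since the last poll and return the new watermark.
    public static DateTime ListChangesSince(string connectionString, DateTime since)
    {
        var latest = since;
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Id, LastUpdated FROM Positions WHERE LastUpdated > @since ORDER BY LastUpdated", conn))
        {
            cmd.Parameters.AddWithValue("@since", since);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("row {0} updated at {1}", reader["Id"], reader["LastUpdated"]);
                    latest = (DateTime)reader["LastUpdated"];
                }
            }
        }
        return latest; // pass this back in on the next poll
    }
}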
Not a lot of details in the question so not sure how much of this will be of use ...
'Sybase' is mentioned but nothing is said about which Sybase RDBMS product (ASE? SQLAnywhere? IQ? Advantage?)
by 'replicated master database transaction' I'm assuming this means the primary database is being replicated (as opposed to the database called 'master' in a Sybase ASE instance)
no mention is made of what products/tools are being used to 'replicate' the transactions to the 'new database' named 'TRN'
So, assuming part of your environment includes Sybase(SAP) ASE ...
MDA tables can be used to capture counters of DML operations (eg, insert/update/delete) over a given time period
MDA tables can capture some SQL text, though the volume/quality could be in doubt if a) MDA is not configured properly and/or b) the DML operations are wrapped up in prepared statements, stored procs and triggers
auditing could be enabled to capture some commands but again, volume/quality could be in doubt based on how the DML commands are executed
also keep in mind that there's a performance hit for using MDA tables and/or auditing, with the level of performance degradation based on individual config settings and the volume of DML activity
Assuming you're using the Sybase(SAP) Replication Server product, those replicated transactions sent through repserver likely have all the info you need to know which tables/rows are being affected; so you have a couple options:
route a copy of the transactions to another database where you can capture the transactions in whatever format you need [you'll need to design the database and/or any customized repserver function strings]
consider using the Sybase(SAP) Real Time Data Streaming product (yeah, additional li$ence is required) which is specifically designed for scenarios like yours, ie, pull transactions off the repserver queues and format for use in downstream systems (eg, tibco/mqs, custom apps)
I'm not aware of any 'generic' products that work, out of the box, as per your (limited) requirements. You're likely looking at some different solutions and/or customized code to cover your particular situation.

How to access data in Dynamics CRM?

What is the best way, in terms of platform speed and maintainability, to access data (read only) in Dynamics CRM 4? I've done all three, but I'm interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isn't the fastest approach around because it doesn't allow deep loading of entities. For instance, if you want to look at customers and their orders, you'll have to load both up individually and then join them manually, whereas a SQL query will already have the data joined.
Never mind that the TDS stream is a lot more efficient than the SOAP messages used by the API and web services.
UPDATE
I should point out, in regard to the views and the CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you look up by destination all the time, you'll need to add an index for that property. Depending upon your application, it could make a huge difference in performance.
I'll add to Jake's comment by saying that querying the tables directly instead of the views (*base and *extensionbase) will be even faster.
In order of speed it'd be:
direct table query
view query
filtered view query
API call
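For example, a read-only query against the filtered view for the account entity might look like this from .NET (treat it as a sketch; swapping FilteredAccount for the unfiltered Account view or the AccountBase table is what moves you up the speed list, at the cost of bypassing CRM security):

using System;
using System.Data.SqlClient;

class CrmReadExample
{
    // Read-only query against a CRM filtered view. Filtered views (Filtered<entity>)
    // apply CRM security, which is why they are slower than the raw views or tables.
    public static void ListActiveAccounts(string crmDbConnectionString)
    {
        using (var conn = new SqlConnection(crmDbConnectionString))
        using (var cmd = new SqlCommand(
            "SELECT accountid, name FROM FilteredAccount WHERE statecode = 0", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader["name"]);
            }
        }
    }
}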
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
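As a sketch of what such a direct update might look like, one set-based statement replaces thousands of API calls. The table and column names below are invented for illustration, and again this bypasses the supported API:

using System.Data.SqlClient;

class DirectBulkUpdate
{
    // Unsupported direct update against a custom entity's extension table.
    public static int ArchiveTruckloadsByDestination(string crmDbConnectionString, string destination)
    {
        using (var conn = new SqlConnection(crmDbConnectionString))
        using (var cmd = new SqlCommand(
            "UPDATE new_truckloadExtensionBase SET new_isarchived = 1 WHERE new_destination = @dest", conn))
        {
            cmd.Parameters.AddWithValue("@dest", destination);
            conn.Open();
            return cmd.ExecuteNonQuery(); // number of rows flipped in a single statement
        }
    }
}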
Relative Speed
I agree with XVargas as far as relative speed.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.
