loading method for the stage in business intelligence - business-intelligence

good night,
a query when the origin is passed to the stage base in business intelligence the loading method is total
or total + incremental,
I'm thinking of deleting all the data and reloading it, but if it were a very large database and many records would not be optimal. What do good practices suggest?
I will appreciate your opinions,
thank you very much,

I'm thinking of deleting all the data and reloading it, but if it were
a very large database and many records would not be optimal. What do
good practices suggest?
It depends.
Many companies are more comfortable with Delete-Truncate pattern because it is easy to implement and the amount data isn't a problem only if some conditions are verified (hardware, DBA..)
Incremental Loads (or Up-Sert pattern) are often used to keep data between two systems in sync with one another. They are used in cases when source data is being loaded into the destination on a repeating basis, such as every night or throughout the day.
Benefits of Incremental Data Loads :
They typically run considerably faster since they touch less data. Assuming no bottlenecks, the time to move and transform data is proportional to the amount of data being touched. If you touch half as much data, the run time is often reduced at a similar scale.
Disadvantages of Incremental Data Loads :
Maintainability: With a full load, if there's an error you can re-run the entire load without having to do much else in the way of cleanup / preparation. With an incremental load, the files generally need to be loaded in order. So if you have a problem with one batch, others queue up behind it till you correct it.
TRUNCATING and then INSERTING is two operations whereas UPDATEing is one, making the TRUNCATE and INSERT take (theoretically) more time.
There's also the ease-of-use factor. If you TRUNCATE then INSERT, you have to manually keep track of every column value. If you UPDATE, you just need to know what you want to change.

Related

Efficiently store daily dumps in Hadoop HDFS

I believe a common usage pattern for Hadoop is to build a "data lake" by loading regular (e.g. daily) snapshots of data from operational systems. For many systems, the rate of change from day to day is typically less than 5% of rows (and even when a row is updated, only a few fields may change).
Q: How can such historical data be structured on HDFS, so that it is both economical in space consumption, and efficient to access.
Of course, the answer will depend on how the data is commonly accessed. On our Hadoop cluster:
Most jobs only read and process the most recent version of the data
A few jobs process a period of historical data (e.g. 1 - 3 months)
A few jobs process all available historical data
This implies that, while keeping historical data is important, it shouldn't come at the cost of severely slowing down those jobs that only want to know what the data looked like at close-of-business yesterday.
I know of a few options, none of which seem quite satisfactory:
Store each full dump independently as a new subdirectory. This is the most obvious design, simple, and very compatible with the MapReduce paradigm. I'm sure some people use this approach, but I have to wonder how they justify the cost of storage? Supposing 1Tb is loaded each day, then that's 365Tb added to the cluster per year of mostly duplicated data. I know disks are cheap these days, but most budget-makers are accustomed to infrastructure expanding proportional to business growth, as opposed to growing linearly over time.
Store only the differences (delta) from the previous day. This is a natural choice when the source systems prefer to send updates in the form of deltas (a mindset which seems to date from the time when data was passed between systems in the form of CD-ROMs). It is more space efficient, but harder to get right (for example, how do you represent deletion?), and even worse it implies the need for consumers to scan the whole of history, "event sourcing"-style, in order to arrive at the current state of the system.
Store each version of a row once, with a start and end date. Known by terms such as "time variant data", this pattern pops up very frequently in data warehousing, and more generally in relational database design when there is a need to store historical values. When a row changes, update the previous version to set the "end date", then insert the new version with today as the "start date". Unfortunately, this doesn't translate well to the Hadoop paradigm, where append-only datasets are favoured, and there is no native concept of updating a row (although that effect can be achieved by overwriting the existing data files). This approach requires quite complicated logic to load the data, but admittedly it can be quite convenient to consume data with this structure.
(It's worth noting that all it takes is one particularly volatile field changing every day to make the latter options degrade to the same space efficiency as option 1).
So...is there another option that combines space efficiency with ease of use?
I'd suggest a variant of option 3 that respects the append only nature of HDFS.
Instead of one data set, we keep two with different kinds of information, stored separately:
The history of expired rows, most likely partitioned by the end date (perhaps monthly). This only has rows added to it when their end dates become known.
A collection of snapshots for particular days, including at least the most recent day, most likely partitioned by the snapshot date. New snapshots can be added each day, and old snapshots can be deleted after a couple of days since they can be reconstructed from the current snapshot and the history of expired records.
The difference from option 3 is just that we consider the unexpired rows to be a different kind of information from the expired ones.
Pro: Consistent with the append only nature of HDFS.
Pro: Queries using the current snapshot can run safely while a new day is added as long as we retain snapshots for a few days (longer than the longest query takes to run).
Pro: Queries using history can similarly run safely as long as they explicitly give a bound on the latest "end-date" that excludes any subsequent additions of expired rows while they are running.
Con: It is not just a simple "update" or "overwrite" each day. In practice in HDFS this generally needs to be implemented via copying and filtering anyway so this isn't really a con.
Con: Many queries need to combine the two data sets. To ease this we can create views or similar that appropriately union the two to produce something that looks exactly like option 3.
Con: Finding the latest snapshot requires finding the right partition. This can be eased by having a view that "rolls over" to the latest snapshot each time a new one is available.

Doing analytical queries on large dynamic sets of data

I have a requirement where I have large sets of incoming data into a system I own.
A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time.
The requirements are as follows -
Large sets of data can experience state changes. Updates need to be fast.
I should be able to aggregate data pivoted on various attributes.
Ideally - there should be a way to correlate individual data units to an aggregated results i.e. I want to drill down into the specific transactions that produced a certain aggregation.
(I am aware of the race conditions here, like the state of a data unit changing after an aggregation is performed ; but this is expected).
All aggregations are time based - i.e. sum of x on pivot y over a day, 2 days, week, month etc.
I am evaluating different technologies to meet these use cases, and would like to hear your suggestions. I have taken a look at Hive/Pig which fit the analytics/aggregation use case. However, I am concerned about the large bursts of updates that can come into the system at any time. I am not sure how this performs on HDFS files when compared to an indexed database (sql or nosql).
You'll probably arrive at the optimal solution only by stress testing actual scenarios in your environment, but here are some suggestions. First, if write speed is a bottleneck, it might make sense to write the changing state to an append-only store, separate from the immutable data, then join the data again for queries. Append-only writing (e.g., like log files) will be faster than updating existing records, primarily because it minimizes disk seeks. This strategy can also help with the problem of data changing underneath you during queries. You can query against a "snapshot" in time. For example, HBase keeps several timestamped updates to a record. (The number is configurable.)
This is a special case of the persistence strategy called Multiversion Concurrency Control - MVCC. Based on your description, MVCC is probably the most important underlying strategy for you to perform queries for a moment in time and get consistent state information returned, even while updates are happening simultaneously.
Of course, doing joins over split data like this will slow down query performance. So, if query performance is more important, then consider writing whole records where the immutable data is repeated along with the changing state. That will consume more space, as a tradeoff.
You might consider looking at Flexviews. It supports creating incrementally refreshable materialized views for MySQL. A materialized view is like a snapshot of a query that is updated periodically with the data which has changed. You can use materialized views to summarize on multiple attributes in different summary tables and keep these views transactionally consistent with each other. You can find some slides describing the functionality on slideshare.net
There is also Shard-Query which can be used in combination with InnoDB and MySQL partitioning, as well as supporting spreading data over many machines. This will satisfy both high update rates and will provide query parallelism for fast aggregation.
Of course, you can combine the two together.

SQL Server - Merging large tables without locking the data

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks
Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't NOLOCK MERGE,INSERT, or UPDATE as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTS.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low memory situations. You might be better off simply stuffing more ram into your database server.
An ideal amount would be to have enough ram to pull the whole database into memory as needed. For example, if you have a 4GB database, then make sure you have 8GB of RAM.. in an x64 server of course.
I'm afraid that I've quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows as the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case) leverages the capabilities of the SQL server engine much better. Again, a fair amount of the work is in the analysis of the two data sets and this is done only once.
Another question I have to ask is well is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables? I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx
From personal experience, the main problem with MERGE is that since it does page lock it precludes any concurrency in your INSERTs directed to a table. So if you go down this road it is fundamental that you batch all updates that will hit a table in a single writer.
For example: we had a table on which INSERT took a crazy 0.2 seconds per entry, most of this time seemingly being wasted on transaction latching, so we switched this over to using MERGE and some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds or even 512 in 0.5 seconds, we tested this with load generators and all seemed to be fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was to not only batch the entries from a single producer in a MERGE operation, but also to batch the batch from producers going to individual DB in a single MERGE operation through an additional level of queue (previously also a single connection per DB, but using MARS to interleave all the producers call to the stored procedure doing the actual MERGE transaction), this way we were then able to handle many thousands of INSERTs per second without problem.
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.

Oracle select query performance

I am working on a application. It is in its initial stage so the number of records in table is not large, but later on it will have around 1 million records in the same table.
I want to know what points I should consider while writing select query which will fetch a huge amount of data from table so it does not slow down performance.
First rule:
Don't fetch huge amounts of data back to the application.
Unless you are going to display every single one of the items in the huge amount of data, do not fetch it. Communication between the DBMS and the application is (relatively) slow, so avoid it when possible. It isn't so slow that you shouldn't use the DBMS or anything like that, but if you can reduce the amount of data flowing between DBMS and application, the overall performance will usually improve.
Often, one easy way to do this is to list only those columns you actually need in the application, rather than using 'SELECT *' to retrieve all columns when you'll only use 4 of the 24 that exist.
Second rule:
Try to ensure that the DBMS does not have to look at huge amounts of data.
To the extent possible, minimize the work that the DBMS has to do. It is busy, and typically it is busy on behalf of many people at any given time. If you can reduce the amount of work that the DBMS has to do to process your query, everyone will be happier.
Consider things like ensuring you have appropriate indexes on the table - not too few, not too many. Designed judiciously, indexes can greatly improve the performance of many queries. Always remember, though, that each index has to be maintained, so inserts, deletes and updates are slower when there are more indexes to manage on a given table.
(I should mention: none of this advice is specific to Oracle - you can apply it to any DBMS.)
To get good performance with a database there is a lot of things you need to have in mind. At first, it is the design, and here you should primary think about normalization and denormalization (split up tables but still not as much as performance heavy joins are required).
There are often a big bunch of tuning when it comes to performance. However, 80% of the performance is determined from the SQL-code. Below are some links that might help you.
http://www.smart-soft.co.uk/Oracle/oracle-performance-tuning-part7.htm
http://www.orafaq.com/wiki/Oracle_database_Performance_Tuning_FAQ
A few points to remember:
Fetch only the columns you need to use on the client side.
Ensure you set up the correct indexes that are going to help you find records. These can be done later, but it is better to plan for them if you can.
Ensure you have properly accounted for column widths and data sizes. Don't use an INT when a TINYINT will hold all possible values. A row with 100 TINYINT fields will fetch faster than a row with 100 INT fields, and you'll also be able to fetch more rows per read.
Depending on how clean you need the data to be, it may be permissable to do a "dirty read", where the database fetches data while an update is in progress. This can speed things up significantly in some cases, though it means the data you get might not be the absolute latest.
Give your DBA beer. And hugs.
Jason

Database speed optimization: few tables with many rows, or many tables with few rows?

I have a big doubt.
Let's take as example a database for a whatever company's orders.
Let's say that this company make around 2000 orders per month, so, around 24K order per year, and they don't want to delete any orders, even if it's 5 years old (hey, this is an example, numbers don't mean anything).
In the meaning of have a good database query speed, its better have just one table, or will be faster having a table for every year?
My idea was to create a new table for the orders each year, calling such orders_2008, orders_2009, etc..
Can be a good idea to speed up db queries?
Usually the data that are used are those of the current year, so there are less lines the better is..
Obviously, this would give problems when I search in all the tables of the orders simultaneously, because should I will to run some complex UNION .. but this happens in the normal activities very rare.
I think is better to have an application that for 95% of the query is fast and the remaining somewhat slow, rather than an application that is always slow.
My actual database is on 130 tables, the new version of my application should have about 200-220 tables.. of which about 40% will be replicated annually.
Any suggestion?
EDIT: the RDBMS will be probably Postgresql, maybe (hope not) Mysql
Smaller tables are faster. Period.
If you have history that is rarely used, then getting the history into other tables will be faster.
This is what a data warehouse is about -- separate operational data from historical data.
You can run a periodic extract from operational and a load to historical. All the data is kept, it's just segregated.
Before you worry about query speed, consider the costs.
If you split the code into separate code, you will have to have code that handles it. Every bit of code you write has the chance to be wrong. You are asking for your code to be buggy at the expense of some unmeasured and imagined performance win.
Also consider the cost of machine time vs. programmer time.
If you use indexes properly, you probably need not split it into multiple tables. Most modern DBs will optimize access.
Another option you might consider is to have a table for the current year, and at the end append the data to another table which has data for all the previous years. ?
I would not split tables by year.
Instead I would archive data to a reporting database every year, and use that when needed.
Alternatively you could partition the data, amongst drives, thus maintaining performance, although i'm unsure if this is possible in postgresql.
For the volume of data you're looking at splitting the data seems like a lot of trouble for little gain. Postgres can do partitioning, but the fine manual [1] says that as a rule of thumb you should probably only consider it for tables that exceed the physical memory of the server. In my experience, that's at least a million rows.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
I agree that smaller tables are faster. But it depends on your business logic if it makes sense to split a single entity over multiple tables. If you need a lot of code to manage all the tables than it might not be a good idea.
It also depends on the database what logic you're able to use to tackle this problem. In Oracle a table can be partitioned (on year for example). Data is stored physically in different table spaces which should make it faster to address (as I would assume that all data of a single year is stored together)
An index will speed things up but if the data is scattered across the disk than a load of block reads are required which can make it slow.
Look into partitioning your tables in time slices. Partitioning is good for the log-like table case where no foreign keys point to the tables.

Resources