How do I improve my ETL performance? - etl

Background Info:
I have a tradition ETL (on SQL Server) which takes around 6 hours to complete. I am looking to optimise the ETL. Below are the steps I have already taken:
Removed unnecessary CURSOR from the logic. For the remaining one's that I am not able to remove, I used READ_ONLY, FAST_FORWARD, INSENSITIVE.
There was some sorting of data happening, which I removed.
Tune the SQL Queries that were long running by using compiler hints or Join hints.
Removed unnecessary columns that were being fetched from source.
Partitioned the tables as well. I use partition switch which did improve some performance.
Is there any other method I am missing which could help make the ETL faster? At this point, we don't have the option of adding more powerful hardware resources or migrating into Hadoop.
Any help would be appreciated.

Few questions:
Are your sources SQL Server databases?
Have you reviewed your destination database?
Is this a dimensional datawarehouse or a normalised data store?
Without much knowledge on your source and destination, some other things I might recommend:
1)Remove unwanted lookup transformations, if you have any.
2)If you can afford to, I would look at creating indexes on some of your source tables. Not always feasible, but this helps believe me.
3)Remove unwanted UNIONs
If its possible please share further info on your ETL/Database architecture and I am sure many brains over here would be able to shed more wisdom.
Cheers
Nithin

Related

Data export performance issue - loops or joins? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I look after a system which uploads flat files generated by ABAP. We have a large file (500,000 records) generated from the HR module in SAP every day which generates a record for every person for the next year. One person gets a record if they are rostered on on a certain day or have planned leave for a given day.
This job takes over 8 hours to run and it is starting to get time critical. I am not an ABAP programmer but I was concerned when discussing this with the programmers as they kept on mentioning 'loops'.
Looking at the source it's just a bunch of single row selects inside nested loops after nested loop. Not only that it has loads of SELECT
I suggested to the programmers that they use SQL more heavily but they insist the SAP approved way is to use loops instead of SQL and use the provided SAP functions (i.e to look up the work schedule rule), and that using SQL will be slower.
Being a database programmer I never use loops (cursors) because they are far slower than SQL, and cursors are usually a giveaway that a procedural programmer has been let loose on the database.
I just can't believe that changing an existing program to use SQL more heavily than loops will slow it down. Does anyone have any insight? I can provide more info if needed.
Looking at google, I'm guessing I'll get people from both sides saying it is better.
I've read the question and I stopped when I read this:
Looking at the source it's just a bunch of single row selects inside
nested loops after nested loop. Not only that it has loads of SELECT
*.
Without knowing more about the issue this looks overkilling, because with every loop the program executes a call to the database. Maybe this was done in this way because the dataset of the selected data is too big, however it is possible to load chunks of data, then treat them and then repeat the action or you can make a big JOIN and operate over that data. This is a little tricky but trust me that this does the job.
In SAP you must use this kind of techniques when this situations happen. Nothing is more efficient than handling datasets in memory. To this I can recommend the use of sorted and/or hashed tables and BINARY SEARCH.
On the other hand, using a JOIN does not necessarily improves performance, it depends on the knowledge and use of the indexes and foreign keys in the tables. For example, if you join a table to get a description I think is better to load that data in an internal table and get the description from it with a BINARY SEARCH.
I can't tell exactly what is the formula, it depends on the case, Most of the time you have to tweak the code, debug and test and make use of transactions 'ST05' and 'SE30' to check performance and repeat the process. The experience with those issues in SAP gives you a clear sight of these patterns.
My best advice for you is to make a copy of that program and correct it according to your experience. The code that you describe can definitely be improved. What can you loose?
Hope it helps
Sounds like the import as it stands is looping over single records and importing them into a DB one at a time. It's highly likely that there's a lot of redundancy there. It's a pattern I've seen many times and the general solution we've adopted is to import data in batches...
A SQL Server stored procedure can accept 'table' typed parameters, which on the client/C# side of the database connection are simple lists of some data structure corresponding to the table structure.
A stored procedure can then receive and process multiple rows of your csv file in one call, therefore any joins you need to do are being done on sets of input data which is how relational databases are designed to be used. This is especially beneficial if you're joining out to commonly used data or have lots of foreign keys (which are essentially invoking a join in order to validate the keys you're trying to insert).
We've found that the SQL Server CPU and IO load for a given amount of import data is much reduced by using this approach. It does however require consultation with DBAs and some tuning of indexes to get it to work well.
You are correct.
Without knowing the code, in most cases it is much faster to use views or joins instead of nested loops. (there are exceptions, but they are very rare).
You can define views in SE11 or SE80 and they usually heavily reduce the communication overhead between abap server and database server.
Often there are readily defined views from SAP for common cases.
edit:
You can check where your performance goes to: http://scn.sap.com/community/abap/testing-and-troubleshooting/blog/2007/11/13/the-abap-runtime-trace-se30--quick-and-easy
Badly written parts that are used sparsely don't matter.
With the statistics you know where it hurts an where your optimization effort pays.

Monolithic ETL to distributed/scalable solution and OLAP cube to Elasticsearch/Solr

I am relatively a newbie to big data processing looking for some specific guidance from the SO community.
We are currently setup with a monolithic/sequential ETL, needless to say it is not scalable as our data grows. What are our options (sure distributing and parallelizing are but need specifics)? I have played with Hadoop and it may be appropriate to use here, but I am wondering what are some of the other options out there? May be something that's easier to transition to for a database developer?
Kind of related to question above is we also have an OLAP cube for aggregated data. Is Elasticsearch or Solr good candidates for replacing an OLAP cube? Has anyone successfully done this? What are the gotchas?
same kind of use case currently we are working on.
our approach may be use full.
step 1: we are sqooping data to Hdfs from dbs
step 2: ETL logic in Pig scripting
step 3: building index on aggregated table data to solr.
step 4: search on solr through web interface.
in our use case we are developing pig jobs to perform transformation logic storing them to final folders incrementally. later MR indexer tool will index the data to solr.we are using cloudera-search. let me know if any thing.

Hive vs Pig when performing Joins

I have some scripts which process my website's logs. I have loaded this data into multiple tables in Hive. I run these scripts on daily basis to do the analysis of the traffic.
Lately I am seeing that the hive queries which I have written in these scripts is taking too much time. Earlier, it used to take around 10-15 mins to generate the reports, but now it takes hours to do the same.
I did the analysis of the data and its around 5-10% of increase in dataset.
One of my friends suggested me that Hive is not good when it comes to joining multiple hive tables and I should switch my scripts to Pig. Is Hive bad at joining tables when compared to Pig?
Is Hive bad at joining tables
No. Hive is actually pretty good, but sometimes it takes a bit playing around with the query optimizer.
Depending on which version of Hive you use, you may need to provide hints in your query to tell the optimizer to join the data using a certain algorithm. You can find some details about different hints here.
If you're thinking about using Pig, I think your choice should not be motivated only by performance considerations. In my experience there is no quantifiable gain in using Pig, I have used both over the past years, and in terms of performance there is no clear winner.
What Pig gives you however is more transparency when defining what kind of join you want to use instead of relying on some (sometimes obscure) optimizer hints.
In the end, Pig or Hive doesn't really matter, it just depends how you decide to optimize your queries. If you're considering switching to Pig, I would first really analyze what your needs in terms of processing are, as you'll probably fall even in terms of performance. Here is a good post if you want to compare the 2.

In what real-world scenarios is a semantic database better than a relational one?

Are there any real live (non-academic) and public (open-source or free) examples of a semantic database like Metalog being used to solve a computing problem that traditionally had been done with relational databases?
Semantic databases work much better if only part of your data follows a schema.
If you need additional columns in a semantic database, you just add them. Even for single rows. This is hard or inefficient in a relational database.
Also clustering is much more simple with semantic or tuple databases. Most often, this means just to install the database on N servers and set a few config options.

Mixing column and row oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real time) analytics. We have a database model that is similiar to a star schema, few fact tables and many dimensional tables. The database is running with Mysql and MyIsam engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow if the dimension tables get joined on the fact tables and also aggretations are done. First thing that comes in mind when hearing this is, why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly it wasn't invented yet. So I came to the idea to combine 2 existing systems. Mixing a row oriented and a column oriented database (e.g. like infinidb or infobright). Keeping the mysql MyIsam solution (for fast inserts and row based queries) and add a column oriented database (for fast aggregation operations on few columns) to it and fill it periodically (nightly) via cronjob. Problem would be when the current data (it must be real time) is queried, therefore I maybe would need to get data from both databases which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is, is this a good idea? Has somebody maybe already done this? Maybe there is are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with high customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compresed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.

Resources