Best ways to process bulk records in Delphi 7 - performance

I am working with Delphi 7 and SQL Server 2012 and facing an issue where I need to process bulk data (~5-7 GB) from DBNew and DBOld to generate a combined output.
I am looking for an option in Delphi 7 that makes it easy to join records of DBNew with DBOld, like the DataSet component that exists in .NET.
In short, I am looking for an option in Delphi 7 that makes data handling in the program easy and fast.
Edit:
What about TClientDataSet? I found that it is an in-memory, database-independent dataset. Does it provide a set of records independent of any database, so that I can use it for joins with DBOld?

Related

How much of Talend's functionality is translated into SQL queries and how much into Java?

I am doing an internship and they asked me to learn how to use Talend ETL.
I did; it was not so difficult.
One of the extra tasks assigned to me is to verify how much of the operations I set up on the design workspace are executed in Java and how much is done through queries.
I set up a simple join using the tMap component and monitored the SQL database with SQL Profiler. The result is that only the essential create/drop and the select/insert of the table are done via SQL, while everything else, such as the actual join, is done on the "Java" side.
For a simple operation like a join, wouldn't it be more convenient to execute it through a query without having to bother Java to perform it?
For those who also know SAP: in terms of performance, is there much difference between Talend and SAP?
Only operations in tDB components (create, select, insert, etc.) are actually done through SQL. All operations done in other Talend components (tMap, tFilter, aggregate, etc.) are done in Java.
Indeed, you'll get better performance doing operations on the SQL side. You then have to find the right balance between an "all-in-SQL" type of job and an "all-Java" one (it can be harder for a Talend developer to debug operations if the whole SQL part is done through a single query inside a single component...).
You could definitely have your joins inside a tDBInput component and output the result in a single output flow (a rough plain-JDBC sketch of such a pushed-down join is shown after this answer).
You can also check the ELT* components: they let you use the SQL engine instead of the Java engine to perform all operations (join, aggregate, filter) while still using the Talend interface.
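To illustrate the SQL-side vs. Java-side balance, here is a minimal plain-JDBC sketch of a join executed entirely by the database engine, which is essentially what a tDBInput query or the ELT components do for you. The connection URL, credentials, and the orders/customers tables are made-up placeholders, not anything from the original job:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DbSideJoin {
    public static void main(String[] args) throws Exception {
        // placeholder connection details; the JDBC driver must be on the classpath
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=demo", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     // the join runs inside the database engine, not in Java
                     "SELECT o.order_id, c.customer_name "
                   + "FROM orders o JOIN customers c ON o.customer_id = c.customer_id")) {
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("order_id") + " - " + rs.getString("customer_name"));
                }
            }
        }
    }
}
The point is not the Java code itself but where the work happens: only the already-joined rows cross the wire, instead of two full tables being pulled into tMap.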

Best solutions for implementing simple but intensive reporting

Currently we have data in a transactional database (Oracle) and fetch it through queries to build reports, e.g. fetch all people under company A along with their details and lookup values from several more tables. It looks something like:
Select p.name,
p.address,
(select country_name from country where country_id = p.country_id),
...
...
from
person p, company c, person_file pf...
where c.company_id = p.company_id and c.company_id = 1
.. <all joins and conditions for tables>
The query takes a long time to fetch the records when there are many people against a company. My question is: what would be a better reporting solution, in terms of design and technology, to get results faster? I don't want to stick to Oracle, because the data will grow in the future. Logically, the answer would be something that works in parallel, but an option like Spark seems to be overkill.
First of all, if you want to keep Oracle as the existing store, you can use Spark as your parallel processing framework for the reports. It has a learning curve, but with Spark SQL you can use your own query to read data from Oracle. You can read the data in parallel, though this depends on how many parallel sessions are configured for your Oracle profile, so please check with the DBA. A rough sketch is shown below.
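A minimal sketch of that parallel JDBC read, using Spark's Java API; the connection URL, credentials, table name, partition column and bounds are placeholders you would replace with your own, and the Oracle JDBC driver has to be on the classpath:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleParallelRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("oracle-report")
                .master("local[*]")          // or a cluster master URL
                .getOrCreate();

        // Spark splits the read into numPartitions range scans on partitionColumn,
        // each running over its own JDBC session.
        Dataset<Row> person = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
                .option("dbtable", "PERSON")
                .option("user", "report_user")
                .option("password", "secret")
                .option("partitionColumn", "PERSON_ID")
                .option("lowerBound", "1")
                .option("upperBound", "10000000")
                .option("numPartitions", "8")
                .load();

        // run the reporting query with Spark SQL on the parallel-loaded data
        person.createOrReplaceTempView("person");
        spark.sql("SELECT company_id, COUNT(*) FROM person GROUP BY company_id").show();
    }
}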
Another option is migrating to a NoSQL DB like Cassandra so that you can scale your machines horizontally rather than vertically. But the migration won't be easy or straightforward: since NoSQL databases do not support joins by design, the data model has to change accordingly. Once that is done you can use Spark on top of it. You can also consider Talend, which has predefined Spark components ready.

Apache Drill has bad performance against SQL Server

I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
facts has about 422,000 rows and product has 600 rows; the grouping comes back with 4 rows.
First I tested this query on SQL Server and got a result back in about 150ms.
With Drill I first tried to connect directly to SQL Server and run the query, but that was slow (about 5 sec).
Then I tried saving the tables into JSON files and reading from them, but that was even slower, so I tried Parquet files.
I got the result back on the first run in about 3 sec; the next run was about 900ms and then it stabilized at about 500ms.
From what I've read around, this makes no sense and Drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on Windows, through the Drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries that join across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process, so the scan and join are very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then look up the data sources, access the data using the connectors, run the query and return the results to the first node for aggregation, and only then do you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately, which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary, compressed, column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500ms with Parquet is actually impressive, considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL Server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run it faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how it translates a SQL query into an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say "data source" so people won't think only of databases or other software-based data management systems.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it starts performing whatever is needed to execute the query; that may be filtering, sorting, joining, projecting (selecting specific columns), etc.
So by default Drill doesn't use any of the source's own capabilities for processing the queried data. In fact, the source may not support any such capability at all!
If you wish to leverage any of the source's data processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly remember when I think about Drill's performance is this one:
SELECT a.CUST_ID,
       (SELECT COUNT(*) FROM SALES.CUSTOMERS WHERE CUST_ID < a.CUST_ID) rowNum
FROM SALES.CUSTOMERS a
ORDER BY CUST_ID
Only because of the < comparison operator in the correlated subquery, Drill has to load the whole table (actually a Parquet file), sort it, then perform the join.
This query took around 18 minutes to run on my machine, which is not particularly powerful, but still, the effort Drill needs to process this query must not be ignored.
Drill's purpose is not to be fast; its purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data. There are probably other things I can't think of at the moment, but you may find more information in the other answers.

What is the best method for paging a DataGrid, performance-wise?

On a site with a high number of users, should paging be handled in code or with a stored procedure? If you have employed caching, please include your success factors.
Personally, I never page stuff outside SQL Server. I do this at the database level: if you have a million records to page and you retrieve them all into the application layer to page them there, you are already paying a huge cost.
99.9% of the time, paging should be done on your database server. However, stored procedures are not required to do this, and, in fact, many stored procedure solutions rely on cursors and are quite inefficient. Ideally, use a single SQL statement tailored to your database platform to retrieve just the records you need and no more.
I would do it at the database level. For SQL Server 2005, I would use the ROW_NUMBER() function introduced there; look at:
Paging SQL Server 2005 Results
Where a typical SQL statement would be (the row number has to be assigned in a derived table before you can filter on it):
SELECT UserFirstName, UserLastName
FROM (
    SELECT Row_Number() OVER (ORDER BY UserName) AS RowID,
           UserFirstName, UserLastName
    FROM Users
) AS NumberedUsers
WHERE RowID BETWEEN 1 AND 10  -- first page of 10 rows
Here https://web.archive.org/web/20210510021915/http://aspnet.4guysfromrolla.com/articles/031506-1.aspx you can see how it works and examine a little benchmark by Scott Mitchell.
Most database vendors offer rich paging support at the database level. Make use of it ;-p Note that it doesn't have to be a stored procedure to do this (I'll sideline the ever-running stored-proc vs. ad-hoc command debate).
As an aside, many frameworks will also do this for you efficiently. For example, in .NET 3.5 (with LINQ), you can use Skip() and Take() to do paging that is executed at the database.
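On the Java side the same idea is available; as a rough sketch with the JPA API (the "demo-unit" persistence unit, the User entity and the page numbers are placeholders), setFirstResult/setMaxResults are translated into the database's own paging syntax, so only one page of rows crosses the wire:
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.Persistence;

public class PagedQuery {
    public static void main(String[] args) {
        EntityManager em = Persistence
                .createEntityManagerFactory("demo-unit")   // placeholder persistence unit
                .createEntityManager();
        int page = 2, pageSize = 10;
        // the provider rewrites this into ROW_NUMBER()/OFFSET-FETCH (or the
        // equivalent for your database), so paging happens server-side
        List<?> users = em.createQuery("select u from User u order by u.userName")
                .setFirstResult(page * pageSize)
                .setMaxResults(pageSize)
                .getResultList();
        users.forEach(System.out::println);
        em.close();
    }
}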
I think it depends on the number of records to be paged. For example, if you have 100 records to page, I don't think you need SQL paging stuff for that. I always try to keep the KISS principle and premature optimization in mind.

Is Hibernate good for batch processing? What about memory usage?

I have a daily batch process that involves selecting out a large number of records and formatting up a file to send to an external system. I also need to mark these records as sent so they are not transmitted again tomorrow.
In my naive JDBC way, I would prepare and execute a statement and then begin to loop through the result set. As I only go forwards through the result set, there is no need for my application server to hold the whole result set in memory at one time; groups of records can be fed across from the database server.
Now, let's say I'm using Hibernate. Won't I end up with a bunch of objects representing the whole result set in memory at once?
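(For reference, the "naive JDBC way" described here might look like the following minimal sketch; the connection URL, the records table, the sent column and the fetch size of 500 are assumptions, and whether rows are actually streamed in batches depends on the driver.)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DailyExtract {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:yourdatabase://host/db", "user", "password");   // placeholder URL
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, payload FROM records WHERE sent = 0",
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(500);   // hint: fetch rows in groups instead of all at once
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    writeToFile(rs.getLong("id"), rs.getString("payload"));
                }
            }
        }
    }

    private static void writeToFile(long id, String payload) {
        // format the row and append it to the outgoing file
    }
}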
Hibernate also iterates over the result set, so only one row is kept in memory at a time. This is the default; if you want it to load eagerly, you must tell it so.
Reasons to use Hibernate:
"Someone" was "creative" with the column names (PRXFC0315.XXFZZCC12)
The DB design is still in flux and/or you want one place where column names are mapped to Java.
You're using Hibernate anyway
You have complex queries and you're not fluent in SQL
Reasons not to use Hibernate:
The rest of your app is pure JDBC
You don't need any of the power of Hibernate
You have complex queries and you're fluent in SQL
You need a specific feature of your DB to make the SQL perform
Hibernate offers some possibilities to keep the session small.
You can use Query.scroll(), Criteria.scroll() for JDBC-like scrolling. You can use Session.evict(Object entity) to remove entities from the session. You can use a StatelessSession to suppress dirty-checking. And there are some more performance optimizations, see the Hibernate documentation.
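A rough sketch of what that can look like for the batch in the question, using the Hibernate 3/4/5-style API; the Record entity, its sent flag and the HQL are assumptions standing in for your real mapping:
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class BatchExport {

    // Minimal stub of the assumed entity; the real mapping lives elsewhere.
    @javax.persistence.Entity
    public static class Record {
        @javax.persistence.Id Long id;
        boolean sent;
        public void setSent(boolean sent) { this.sent = sent; }
    }

    public void export(SessionFactory sessionFactory) {
        StatelessSession session = sessionFactory.openStatelessSession(); // no first-level cache, no dirty checking
        Transaction tx = session.beginTransaction();
        ScrollableResults rows = session.createQuery("from Record r where r.sent = false")
                .scroll(ScrollMode.FORWARD_ONLY);   // stream forward, one row at a time
        try {
            while (rows.next()) {
                Record record = (Record) rows.get(0);
                // ... format the record into the outgoing file here ...
                record.setSent(true);
                session.update(record);             // issues the UPDATE immediately, nothing queued in memory
            }
        } finally {
            rows.close();
            tx.commit();
            session.close();
        }
    }
}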
Hibernate, like any ORM framework, is intended for developing and maintaining systems based on object-oriented programming principles. But most databases are relational, not object-oriented, so an ORM is always a trade-off between convenient OOP programming and optimized, most effective DB access.
I wouldn't use ORM for specific isolated tasks, but rather as an overall architectural choice for application persistence layer.
In my opinion I would NOT use Hibernate, since it makes your application a whole lot bigger and less maintainable, and you do not really have a chance of quickly optimizing the generated SQL scripts.
Furthermore, you could use all the SQL functionality the JDBC bridge supports and not be limited to Hibernate's functionality. Another thing is that you also have the limitations that come along with each layer of legacy code.
But in the end it is a philosophical question, and you should do it the way that fits your way of thinking best.
If there are possible performance issues then stick with the JDBC code.
There are a number of well-known pure-SQL optimisations which would be very difficult to do in Hibernate.
Only select the columns you use! (No "SELECT *" stuff.)
Keep the SQL as simple as possible, e.g. don't include small reference tables like currency codes in the join. Instead, load the currency table into memory and resolve currency descriptions with a program lookup.
Depending on the DBMS, minor re-ordering of the SQL WHERE predicates can have a major effect on performance.
If you are updating/inserting, only commit every 100 to 1000 updates, i.e. do not commit every unit of work, but keep a counter so you commit less often (see the sketch after this list).
Take advantage of the aggregate functions of your database. If you want totals by DEPT code then do it in the SQL with " SUM(amount) ... GROUP BY DEPT ".
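The commit-counter point above, as a minimal JDBC sketch; the connection URL, the account table, the column layout of the incoming rows and the batch size of 500 are all assumptions, not anything prescribed by the answer:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchedUpdate {

    // each incoming row: [0] = amount to add, [1] = account id (illustrative layout)
    public static void updateBalances(List<long[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:yourdatabase://host/db", "user", "password");   // placeholder URL
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE account SET balance = balance + ? WHERE account_id = ?")) {
            con.setAutoCommit(false);
            int count = 0;
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.executeUpdate();
                if (++count % 500 == 0) {
                    con.commit();          // commit every 500 updates, not every row
                }
            }
            con.commit();                  // commit the final partial batch
        }
    }
}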

Resources