We have a TDBGrid that connected to TClientDataSet via TDataSetProvider in Delphi 7 with Oracle database.
It goes fine to show content of small tables, but the program hangs when you try to open a table with many rows (for ex 2 million rows) because TClientDataSet tries to load the whole table in memory.
I tried to set "FetchOnDemand" to True for our TClientDataSet and "poFetchDetailsOnDemand" to True in Options for TDataSetProvider, but it does not help to solve the problem. Any ides?
Update:
My solution is:
TClientDataSet.FetchOnDemand = T
TDataSetProvider.Options.poFetchDetailsOnDemand = T
TClientDataSet.PacketRecords = 500
I succeeded to solve the problem by setting the "PacketRecords" property for TCustomClientDataSet. This property indicates the number or type of records in a single data packet. PacketRecords is automatically set to -1, meaning that a single packet should contain all records in the dataset, but I changed it to 500 rows.
When working with RDBMS, and especially with large datasets, trying to access a whole table is exactly what you shouldn't do. That's a typical newbie mistake, or a borrowing from old file based small database engines.
When working with RDBMS, you should load the rows you're interested in only, display/modify/update/insert, and send back changes to the database. That means a SELECT with a proper WHERE clause and also an ORDER BY - remember row ordering is never assured when you issue a SELECT without an OREDER BY, a database engine is free to retrieve rows in the order it sees fit for a given query.
If you have to perform bulk changes, you need to do them in SQL and have them processed on the server, not load a whole table client side, modify it, and send changes row by row to the database.
Loading large datasets client side may fali for several reasons, lack of memory (especially 32 bit applications), memory fragmentation, etc. etc., you will flood the network probably with data you don't need, force the database to perform a full scan, maybe flloding the database cache as well, and so on.
Thereby client datasets are not designed to handle millions of billions of rows. They are designed to cache the rows you need client side, and then apply changes to the remote data. You need to change your application logic.
Related
I’m in the middle of trying to migrate a large amount of data into a oracle database from existing excel-files.
Due to the large amount of rows loaded (10 000 and more) every time, it is not possible to use SQL Developer for this tasks.
In every work-sheet there’s data that need to go into different tables, but at the same time keep the relations and not dropping any data.
As for now, I use one .CSV file for each table and mapping them together afterwards. This is thou combined with a great risk of adding the wrong FK and with that screw up the hole shit. And I don’t have the time, energy or will for clean ups even if it is my own mess…
My initial thought was if I could bulk transfer with sql loader using some kind of plsql-script in maybe an ctl-file (the used for mapping the properties) but it seems like I.m quite out in the bush with that one… (or am I…? )
The other thought was to create a simple program In c# and use fastMember and load the database that way. (But that means that I need to take the time to actually make the program, however small it is).
I can’t possible be the only one that have had this issue, but trying to us my notToElevatedNinjaGoogling-skills ends up with either using sql developer (witch is not an alternative) or the bulk copy thing from sql load (and where I need to map it all together afterwards).
Is there any alternative solutions for my problem or is the above solutions the one that I need to cope with?
Did you consider using CSV files as external tables? As they act as if they were ordinary Oracle tables, you can write (PL/)SQL against them, inserting data into different tables in the target schema. That might give you some more freedom & control over what you are doing.
Behind the scene, it is still SQL*Loader.
I am new to Teradata & fortunately got a chance to work on both DDL-DML statements.
One thing I observed is Teradata is very slow when time comes to UPDATE the data in a table having large number of records.
The simplest way I found on the Google to perform this update is to write an INSERT-SELECT statement with a CASE on column holding values to be update with new values.
But what when this situation arrives in Data Warehouse environment, when we need to update multiple columns from a table holding millions of rows ?
Which would be the best approach to follow ?
INSERT-SELECT only OR MERGE-UPDATE OR MLOAD ?
Not sure if any of the above approach is not used for this UPDATE operation.
Thank you in advance!
At enterprise level, we expect volumes to be huge and updates are often part of some scheduled jobs/scripts.
With huge volume of data, Updates comes as a costly operation that involve risk of blocking table for some time in case the update fails (due to fallback journal). Although scripts are tested well, and failures seldom happen in production environments, it's always better to have data that needs to be updated loaded to a temporary table in required form and inserted back to same table after deleting matching records to maintain SCD-1 (Where we don't maintain history).
I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. my test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
Where facts has about 422,000 rows and product has 600 rows. the grouping comes back with 4 rows.
First I tested this query on SqlServer and got a result back in about 150ms.
With drill I first tried to connect directly to SqlServer and run the query, but that was slow (about 5 sec).
Then I tried saving the tables into json files and reading from them, but that was even slower, so I tried parquet files.
I got the result back in the first run in about 3 sec. next run was about 900ms and then it stabled at about 500ms.
From reading around, this makes no sense and drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on windows, through the drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process so the scan and join is very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then lookup the data sources, access the data using the connectors, run the query, return the results to the first node for aggregation, and then you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500ms with Parquet is actually impressive considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run in faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how Drill translates an SQL query to an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say data source so people won't think of databases or any software-based data management system.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it starts performing what is needed to execute the query, whats needed may be filtering, sorting, joining, projecting (selecting specific columns)...etc
So drill doesn't by default use any of the source's capabilities of processing the queried data. In fact, the source may not support any capability of such !
If you wish to leverage any of the source's data processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly remember when I think about Drill's performance, is this one
Select a.CUST_ID, (Select count(*) From SALES.CUSTOMERS where CUST_ID < a.CUST_ID) rowNum from SALES.CUSTOMERS a Order by CUST_ID
Only because of the > comparison operator, Drill has to load the whole table (i.e actually a parquet file), SORT IT, then perform the join.
This query took around 18 minutes to run on my machine which is a not so powerful machine but still, the effort Drill needs to perform to process this query must not be ignored.
Drill's purpose is not to be fast, it's purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data. And probably other things that I can't think about at the moment but you may find more information for other answers.
I want to select multiple rows from multiple tables, one of them having billions of rows. It sometimes take 20 seconds and there are over thousands of users using it so it is pretty bad.
I looked into COLUMNSTORE and tried it in my local machine and the performance is x50 faster than usual! (note that I was clearing the cache to see the difference)
However, the downside is I can't update, insert and delete rows, which is being constantly done for that table with the billion rows.
Is there a way to optimize it? (Besides the (NOLOCK) dirty read, which security is not an issue btw)
There are already indexes in that table, but doesn't help.
Is there a way to perform BATCH EXECUTION (I see it does row execution)? Or any optimization advice?
Using Microsoft SQL Server 2012
When you get to the scale of billions of rows, you often need to take different approaches for handling the data. Separating the content into multiple databases and storing on different machines might be more effective, however the design is considerably more complex.
An alternative is to consider using a combination of partitioned tables with a column-based index. That way at least, you can stage the updated data for the partition and then swap the updated one for the existing one to perform updates. See: http://technet.microsoft.com/en-us/library/gg492088.aspx#Update
An alternative is to consider using three tables: one that is static -- and is perhaps using column-based storage -- the other one dynamic, holding only recent updates and inserts, and the third holding just a list of deleted rows identified by the primary key. You then have to use a view to reconcile the content for queries.
I am working on Oracle SQL Developer and have created tables with Wikipedia data, so size of data is very huge and have 7 tables. I have created a search engine which fetches and display data using JSP, but the problem is that for each query the application has to access 4 tables making my application very time consuming.
I have added indexes to all tables but still it takes more time, so any suggestion on how to optimize my app and reduce time it is taking to display result.
There are several approaches you can take to tune your application. And it could be either tuning at the database end, front end or a combination of the two.
At the database end you could be looking at say a materialized view to summarize the more commonly searched data. This could either be for your search purposes only or to reduce the size and complexity of the resultset. You might also look at tuning the query itself - perhaps placing indexes on the relevant WHERE clauses of your search or look at denormalizing your tables.
At the application end - the retrieval of vast recordsets - can always cause problems where a single record is large (multi-columned) and the number or records in the resultset - numerous.
What you are probably looking for is a rapid response time from your application so your user doesn't feel they are waiting ... and waiting.
A technique I have seen and used is to retrieve the resultset either as
1) a recordset of ROWIDs and to page through these ROWIDs on the display
2) a simulated "paged" recordset. Retrieving the recordset in chunks.