Apache Drill has bad performance against SQL Server

I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
Where facts has about 422,000 rows and product has 600 rows. The grouping comes back with 4 rows.
First I tested this query on SQL Server and got a result back in about 150 ms.
With Drill I first tried to connect directly to SQL Server and run the query, but that was slow (about 5 sec).
Then I tried saving the tables into JSON files and reading from them, but that was even slower, so I tried Parquet files.
The first run returned the result in about 3 sec. The next run took about 900 ms, and then it stabilized at about 500 ms.
From reading around, this makes no sense and Drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on Windows, through the Drill command line.
Any idea if I need some extra configuration or something?
Thanks!

Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process, so the scan and join are very quick. Apache Drill has much more work to do in comparison. It has to turn your query into a distributed plan and send it to all the drillbit processes, which then look up the data sources, access the data using the connectors, run the query, and return the results to the first node for aggregation. Only then do you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
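To make the row-versus-column point concrete, here is a minimal Python sketch. It is not how Drill or Parquet are actually implemented, and the data and names are made up for illustration. Summing one field from JSON lines means parsing every field of every record, while a column-oriented layout lets the scan touch only the column it needs.

```python
import json

# Hypothetical sample data: 1,000 fact rows with three fields each.
rows = [{"pkey": i % 10, "sales": i, "note": "x" * 20} for i in range(1000)]

# Row-oriented text storage: one JSON document per line.
json_lines = [json.dumps(r) for r in rows]

# Column-oriented storage: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in ("pkey", "sales", "note")}


def sum_sales_from_json(lines):
    # Every line is parsed in full, including fields the query never uses.
    return sum(json.loads(line)["sales"] for line in lines)


def sum_sales_from_columns(cols):
    # Only the "sales" column is read; "pkey" and "note" are never touched.
    return sum(cols["sales"])


total_json = sum_sales_from_json(json_lines)
total_columnar = sum_sales_from_columns(columns)
```

Both functions return the same total; the difference is only in how much data each one has to read and decode to get there.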
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500 ms with Parquet is actually impressive considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL Server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources

One thing people need to understand about how Drill works is how Drill translates an SQL query to an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say data source so people won't think of databases or any software-based data management system.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it performs whatever is needed to execute the query: filtering, sorting, joining, projecting (selecting specific columns), etc.
So by default Drill doesn't use any of the source's own capabilities for processing the queried data. In fact, the source may not have any such capability!
If you wish to leverage any of the source's data processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly remember when I think about Drill's performance, is this one
Select a.CUST_ID, (Select count(*) From SALES.CUSTOMERS where CUST_ID < a.CUST_ID) rowNum from SALES.CUSTOMERS a Order by CUST_ID
Only because of the < comparison operator, Drill has to load the whole table (in this case actually a Parquet file), SORT IT, then perform the join.
This query took around 18 minutes to run on my machine which is a not so powerful machine but still, the effort Drill needs to perform to process this query must not be ignored.
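The cost of that query is easier to see on a toy example. The sketch below is an illustration in Python with the standard sqlite3 module, not Drill, and the table contents are made up. It runs the same correlated subquery on a small in-memory table and checks that it yields the same numbering as sorting the ids once and counting positions.

```python
import sqlite3

# Hypothetical stand-in for SALES.CUSTOMERS with unique, unsorted ids.
cust_ids = [42, 7, 19, 3, 88, 61]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(c,) for c in cust_ids])

# Same shape as the query above: for each row, count the rows with a smaller id.
query = """
    SELECT a.cust_id,
           (SELECT COUNT(*) FROM customers WHERE cust_id < a.cust_id) AS row_num
    FROM customers a
    ORDER BY a.cust_id
"""
via_subquery = conn.execute(query).fetchall()
conn.close()

# Equivalent numbering: sort once, then number the rows starting at 0.
via_sort = [(cust_id, position) for position, cust_id in enumerate(sorted(cust_ids))]
```

Without help from the source, the subquery form compares every row against every other row, which is why the amount of work grows so quickly with table size.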
Drill's purpose is not to be fast; its purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data. There are probably other things that I can't think of at the moment, but you may find more information in other answers.

Related

Cache tables in advance before executing Spark SQL

We are doing a POC to compare different tools, including Spark SQL, Apache Drill and so forth. The benchmark dataset includes almost one thousand Parquet files. For the same query, Apache Drill takes several seconds while Spark SQL takes more than 40 minutes. I guess the running time for Spark SQL is dominated by reading files from disk. The POC aims at finding how long the queries themselves take, so for now we are not concerned with the time spent reading files from disk. Is there a way to cache all the tables in advance and then execute the test queries on these cached tables? We understand that cache is also lazy and only caches data on the first action on a query. Our current solution is to use a dummy action and a dummy query to cache the table first and then execute the test queries.
For example:
table1 = sqlContext.read.load(path1)
table1.registerTempTable("table1")
sqlContext.cacheTable("table1")  # marks the table for caching; nothing is read yet
result1 = sqlContext.sql("sql1")
table2 = sqlContext.read.load(path2)
table2.registerTempTable("table2")
sqlContext.cacheTable("table2")  # marks the table for caching; nothing is read yet
result2 = sqlContext.sql("sql2")
result1.show()  # dummy action
result2.show()  # dummy action
What does the cache really do? As we know, cache is lazy and happens at the first action. Does it make any difference which dummy actions and which queries are used?

insert data from one table to two tables group by for Oracle

I have a situation where a large amount of data (9+ billion rows per day) is being collected in a loading table that has fields like
-TABLE loader
first_seen,request,type,response,hits
1232036346,mydomain.com,A,203.11.12.1,200
1332036546,ogm.com,A,103.13.12.1,600
1432039646,mydomain.com,A,203.11.12.1,30
that needs to be split into two tables (de-duplicated)
-TABLE final
request,type,response,hitcount,id
mydomain.com,A,203.11.12.1,230,1
ogm.com,A,103.13.12.1,600,2
and
-TABLE timestamps
id,times_seen
1,1232036346
2,1432036546
1,1432039646
I can create the schemas and do a select like
select request,type,response,sum(hits) from loader group by request,type,response;
to get data into the final table. For best performance I want to see if I can use "insert all" to move data from the loader to these two tables and perhaps use triggers in the database to try to achieve this. Any ideas and recommendations on the best ways to solve this?
"9+ billion per day"
That's more than just a large number of rows: that's a huge number, and it will require special engineering to handle it.
For starters, you don't just need INSERT statements. The requirement to maintain the count for existing (request,type,response) tuples points to UPDATE too. The need to generate and return a synthetic key is problematic in this scenario. It rules out MERGE, the easiest way of implementing upserts (because the MERGE syntax doesn't support the RETURNING clause).
Beyond that, attempting to handle nine billion rows in a single transaction is a bad idea. How long will it take to process? What happens if it fails halfway through? You need to define a more granular unit of work.
That, though, raises some business issues. Do the users only want to see the whole picture, after the Close-Of-Day? Or would they derive benefit from seeing Intra-day results? If yes, how to distinguish Intra-day from Close-Of-Day results? If no, how to hide partially processed results whilst the rest is still in flight? Also, how soon after Close-Of-Day do they want to see those totals?
Then there are the architectural considerations. These figures mean processing over one hundred thousand (one lakh) rows every second. That requires serious crunch and expensive licensing extras: obviously Enterprise Edition for parallel processing, but also Partitioning and perhaps RAC options.
By now you should have an inkling why nobody answered your question straight-away. This is a consultancy gig not a StackOverflow question.
But let's sketch a solution.
We must have continuous processing of incoming raw data. So we stream records for loading into FINAL and TIMESTAMP tables alongside the LOADER table, which becomes an audit of the raw data (or else perhaps we get rid of the LOADER table altogether).
We need to batch the incoming records to leverage set-based operations. Depending on the synthetic key implementation we should aim for pure SQL, otherwise Bulk PL/SQL.
Keeping the thing going is vital so we need to pay attention to Bulk Error Handling.
Ideally the target tables can be partitioned, so we can load into offline tables and use Partition Exchange to bring the cleaned data online.
For the synthetic key I would be tempted to use a hash key based on the (request,type,response) tuple rather than a sequence, as that would give us the option to load TIMESTAMP and FINAL independently. (Collisions are extremely unlikely.)
Just to be clear, this is a bagatelle not a serious architecture. You need to experiment and benchmark various approaches against realistic volumes of data on Production-equivalent hardware.
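As a rough illustration of the hash-key idea, the Python sketch below derives the id from the (request,type,response) tuple, so the rows for FINAL and TIMESTAMP can be built independently from the same batch. The choice of SHA-256, the truncation to 16 hex characters and the function names are assumptions made for the example, not part of the answer; the rows are the sample data from the question.

```python
import hashlib
from collections import defaultdict

# Sample loader rows from the question: (first_seen, request, type, response, hits).
loader = [
    (1232036346, "mydomain.com", "A", "203.11.12.1", 200),
    (1332036546, "ogm.com", "A", "103.13.12.1", 600),
    (1432039646, "mydomain.com", "A", "203.11.12.1", 30),
]


def hash_key(request, rtype, response):
    # Assumption: SHA-256 over the tuple, truncated to 16 hex characters.
    # The unit separator avoids ambiguity between adjacent fields.
    raw = "\x1f".join((request, rtype, response)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def build_final(batch):
    # One row per distinct tuple, with hits summed (the GROUP BY step).
    totals = defaultdict(int)
    for _first_seen, request, rtype, response, hits in batch:
        totals[(request, rtype, response)] += hits
    return {hash_key(*key): key + (total,) for key, total in totals.items()}


def build_timestamps(batch):
    # Needs no lookup in FINAL: the id is recomputed from the tuple itself.
    return [(hash_key(req, rtype, resp), first_seen)
            for first_seen, req, rtype, resp, _hits in batch]


final_rows = build_final(loader)
timestamp_rows = build_timestamps(loader)
```

Because neither function reads the other's output, the two loads could run in separate sessions or in either order. A truncated hash raises the collision risk, so the key length would need checking against real volumes.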

Dumping Azure tables quickly

My task is to dump entire Azure tables with arbitrary unknown schemas. Standard code to do this resembles the following:
TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>();
foreach (DynamicTableEntity entity in table.ExecuteQuery(query))
{
// Write a dump of the entity (row).
}
Depending on the table, this works at a rate of 1000-3000 rows per second on my system. I'm guessing this (lack of) performance has something to do with separate HTTP requests issued to retrieve the data in chunks. Unfortunately, some of the tables are multi-gigabyte in size, so this takes a rather long time.
Is there a good way to parallelize the above or speed it up some other way? It would seem that those HTTP requests could be sent by multiple threads, as in web crawlers and the like. However, I don't see an immediate method to do so.
Unless you know the PartitionKeys of the entities in the table (or some other querying criteria which includes PartitionKey), AFAIK you would need to take a top down approach which you're doing right now. In order for you to fire queries in parallel which would work efficiently you have to include PartitionKey in your queries.
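The shape of a per-partition fan-out can be sketched as follows. This is a Python illustration of the pattern only, not the Azure SDK: the in-memory `FAKE_TABLE` and `fetch_partition` are hypothetical stand-ins for a table and for a query filtered on one PartitionKey, and it assumes the list of partition keys is already known.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a table, grouped by PartitionKey.
FAKE_TABLE = {
    "p1": [{"RowKey": "a", "value": 1}, {"RowKey": "b", "value": 2}],
    "p2": [{"RowKey": "c", "value": 3}],
    "p3": [{"RowKey": "d", "value": 4}, {"RowKey": "e", "value": 5}],
}


def fetch_partition(partition_key):
    # Placeholder for a real query filtered on a single PartitionKey.
    return [dict(entity, PartitionKey=partition_key)
            for entity in FAKE_TABLE[partition_key]]


def dump_table(partition_keys, max_workers=4):
    # One query per partition, issued concurrently; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_partition = pool.map(fetch_partition, partition_keys)
        return [entity for chunk in per_partition for entity in chunk]


entities = dump_table(sorted(FAKE_TABLE))
```

Threads suit this kind of work because each query spends most of its time waiting on the network rather than using the CPU.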

Delphi: ClientDataSet is not working with big tables in Oracle

We have a TDBGrid that is connected to a TClientDataSet via a TDataSetProvider in Delphi 7 with an Oracle database.
It works fine for showing the content of small tables, but the program hangs when you try to open a table with many rows (for example 2 million rows) because TClientDataSet tries to load the whole table into memory.
I tried setting "FetchOnDemand" to True for our TClientDataSet and "poFetchDetailsOnDemand" to True in Options for TDataSetProvider, but it does not solve the problem. Any ideas?
Update:
My solution is:
TClientDataSet.FetchOnDemand = True
TDataSetProvider.Options.poFetchDetailsOnDemand = True
TClientDataSet.PacketRecords = 500
I managed to solve the problem by setting the "PacketRecords" property of TCustomClientDataSet. This property indicates the number or type of records in a single data packet. PacketRecords defaults to -1, meaning that a single packet should contain all records in the dataset, but I changed it to 500 rows.
When working with RDBMS, and especially with large datasets, trying to access a whole table is exactly what you shouldn't do. That's a typical newbie mistake, or a borrowing from old file based small database engines.
When working with an RDBMS, you should load only the rows you're interested in, display/modify/update/insert, and send the changes back to the database. That means a SELECT with a proper WHERE clause and also an ORDER BY. Remember that row ordering is never assured when you issue a SELECT without an ORDER BY; a database engine is free to retrieve rows in the order it sees fit for a given query.
If you have to perform bulk changes, you need to do them in SQL and have them processed on the server, not load a whole table client side, modify it, and send changes row by row to the database.
Loading large datasets client side may fail for several reasons: lack of memory (especially in 32-bit applications), memory fragmentation, and so on. You will probably flood the network with data you don't need and force the database to perform a full scan, maybe flooding the database cache as well.
Client datasets are therefore not designed to handle millions or billions of rows. They are designed to cache the rows you need client side, and then apply changes to the remote data. You need to change your application logic.
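A minimal sketch of that approach, in Python with the standard sqlite3 module rather than Delphi and Oracle (the table, column names and batch size are made up for the example): select only the rows of interest with a WHERE clause and an ORDER BY, and pull them from the cursor in fixed-size packets instead of all at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany(
    "INSERT INTO orders (id, customer) VALUES (?, ?)",
    [(i, "acme" if i % 2 == 0 else "globex") for i in range(1, 2001)],
)


def fetch_in_packets(connection, customer, packet_size=500):
    # WHERE limits the rows to those of interest; ORDER BY fixes their order.
    cursor = connection.execute(
        "SELECT id, customer FROM orders WHERE customer = ? ORDER BY id",
        (customer,),
    )
    while True:
        packet = cursor.fetchmany(packet_size)  # at most packet_size rows
        if not packet:
            break
        yield packet


packet_sizes = [len(p) for p in fetch_in_packets(conn, "acme")]
first_packet = next(fetch_in_packets(conn, "acme", packet_size=3))
conn.close()
```

The packet size here plays a role similar to PacketRecords: the client holds one packet at a time and asks for the next only when it needs it.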

oracle user_constraints, user_tables etc views for production

Is it OK to use these views in production? I mean, are queries against the data dictionary intended to be called frequently, or is it designed only for occasional use by tools like SQL Navigator, SQL Developer, etc.?
It depends on your definition of "frequently", the size of those objects in your database, and why you need to query them.
In general, it's fine to query data dictionary tables on a regular basis in production-- tons of database monitoring tools, for example, will regularly query a bunch of data dictionary tables to gather performance data. At the same time, though, you can easily configure most of these tools to put a tremendous load on your database by gathering too much data too frequently so your performance monitoring tool becomes the source of performance problems. Normally, you can just dial back the amount of data getting captured and the frequency at which it is captured to get 99% of the monitoring benefit without creating a bunch of issues.
I'm not sure why any tool would frequently need to query user_tables-- since tables aren't getting created or destroyed at runtime in a proper system, there aren't too many reasons why you'd really need to query that particular view all that frequently.

Resources