Spark Rapids: Simple HashAggregate Example - rapids

[Hi All, I am new to Spark Rapids. I was going through the basic introduction to Spark Rapids, where I got a figure (attached) explaining the difference between CPU and GPU based query plans for hashaggregate example. All things in the plans, except the last phase converting to the Row Format is not clear to me. Can anyone please suggest the reason behind this.]

I do not see the referenced figure, but I suspect what is happening in your particular query comes down to one of two possible cases.
If your query is performing some kind of collection of the data back to the driver (e.g.: .show or .collect in Scala or otherwise directly displaying the query results) then the columnar GPU data needs to be converted back to rows before being returned to the driver. Ultimately the driver is working with RDD[InternalRow] which is why a transition from RDD[ColumnarBatch] needs to occur in those cases.
If your query ends by writing the output to files (e.g.: to Parquet or ORC) then the plan often shows a final GpuColumnarToRow transition. Spark's Catalyst optimizer automatically inserts ColumnarToRow transitions when it sees operations that are capable of producing columnar output (i.e.: RDD[ColumnarBatch]) and then the plugin updates those transitions to GpuColumnarToRow when the previous node will operate on the GPU. However in this case the query node is a data write command, and those produce no output in the query plan sense. Output is directly written to files when the node is executed instead of sending the output to a downstream node for further processing. Therefore this is a degenerate transition in practice, as the data write command sends no data to the columnar-to-row transition. I filed an issue against the RAPIDS Accelerator to clean up that degenerate transition, but it has no impact on query performance.

Related

Apache Nifi - Federated Search

My team’s been thrown into the deep end and have been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individuals (and no matching identifiers) and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run on the merged content was pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

Apache Drill has bad performance against SQL Server

I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. my test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
Where facts has about 422,000 rows and product has 600 rows. the grouping comes back with 4 rows.
First I tested this query on SqlServer and got a result back in about 150ms.
With drill I first tried to connect directly to SqlServer and run the query, but that was slow (about 5 sec).
Then I tried saving the tables into json files and reading from them, but that was even slower, so I tried parquet files.
I got the result back in the first run in about 3 sec. next run was about 900ms and then it stabled at about 500ms.
From reading around, this makes no sense and drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on windows, through the drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process so the scan and join is very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then lookup the data sources, access the data using the connectors, run the query, return the results to the first node for aggregation, and then you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500ms with Parquet is actually impressive considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run in faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how Drill translates an SQL query to an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say data source so people won't think of databases or any software-based data management system.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it starts performing what is needed to execute the query, whats needed may be filtering, sorting, joining, projecting (selecting specific columns)...etc
So drill doesn't by default use any of the source's capabilities of processing the queried data. In fact, the source may not support any capability of such !
If you wish to leverage any of the source's data processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly remember when I think about Drill's performance, is this one
Select a.CUST_ID, (Select count(*) From SALES.CUSTOMERS where CUST_ID < a.CUST_ID) rowNum from SALES.CUSTOMERS a Order by CUST_ID
Only because of the > comparison operator, Drill has to load the whole table (i.e actually a parquet file), SORT IT, then perform the join.
This query took around 18 minutes to run on my machine which is a not so powerful machine but still, the effort Drill needs to perform to process this query must not be ignored.
Drill's purpose is not to be fast, it's purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data. And probably other things that I can't think about at the moment but you may find more information for other answers.

Lambda Architecture - Why batch layer

I am going through the lambda architecture and understanding how it can be used to build fault tolerant big data systems.
I am wondering how batch layer is useful when everything can be stored in realtime view and generate the results out of it? is it because realtime storage cant be used to store all of the data, then it wont be realtime as the time taken to retrieve the data is dependent on the the space it took for the data to store.
Why batch layer
To save Time and Money!
It basically has two functionalities,
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
Everything can be stored in realtime view and generate the results out of it - NOT TRUE
The above is certainly possible, but not feasible as data could be 100's..1000's of petabytes and generating results could take time.. a lot of time!
Key here, is to attain low-latency queries over large dataset. Batch layer is used for creating batch views (queries served with low-latency) and realtime layer is used for recent/updated data which is usually small. Now, any ad-hoc query can be answered by merging results from batch views and real-time views instead of computing over all the master dataset.
Also, think of a query (same query?) running again and again over huge dataset.. loss of time and money!
Further to the answer provided by #karthik manchala, data Processing can be handled in three ways - Batch, Interactive and Real-time / Streaming.
I believe, your reference to real-time is more with interactive response than to streaming as not all use cases are streaming related.
Interactive responses are where the response can be expected anywhere from sub-second to few seconds to minutes, depending on the use case. Key here is to understand that processing is done on data at rest i.e. already stored on a storage medium. User interacts with the system while processing and hence waits for the response. All the efforts of Hive on Tez, Impala, Spark core etc are to address this issue and make the responses as fast as possible.
Streaming on the other side is where data streams into the system in real-time - for example twitter feeds, click streams etc and processing need to be done as soon as the data is generated. Frameworks like Storm, Spark Streaming address this space.
The case for batch processing is to address scenarios where some heavy-lifting need to be done on a huge dataset before hand such that user would be made believe that the responses he sees are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing would run for minutes or possibly hours depending on the dataset. However, user who queries the Solr index would get the response in sub-second latency. As you can see, indexing cannot be achieved in real-time as there may be hue amounts of data. Same is the case with Google search, where indexing would be done in a batch mode and the results are presented in interactive mode.
All the three modes of data processing are likely involved in any organisation grappling with data challenges. Lambda Architecture addresses this challenge effectively to use the same data sources for multiple data processing requirements
You can check out the Kappa-Architecture where there is no seperate Batch-Layer.
Everything is analyzed in the Stream-Layer. You can use Kafka in the right configuration as as master-datasetstorage and save computed data in a database as your view.
If you want to recompute, you can start a new Stream-Processing job and recompute your view from Kafka into your database and replace your old view.
It is possible to use only the Realtime view as the main storage for adhoc query but as it is already mentioned in other answers, it is faster if you have much data to do batch-processing and stream-processing seperate instead of doing batch-jobs as a stream-job. It depends on the size of your data.
Also it is cheaper to have a storage like hdfs instead of a database for batch-computing.
And the last point in many cases you have different algorithms for batch and stream processing, so you need to do it seperate. But basically it is possible to only use the "realtime view" as your batch-and stream-layer also without using Kafka as masterset. It depends on your usecase.

Need: In memory object database, transactional safety, indices, LINQ, no persistence

Anyone an idea?
The issue is: I am writing a high performance application. It has a SQL database which I use for persistence. In memory objects get updated, then the changes queued for a disc write (which is pretty much always an insert in a versioned table). The small time risk is given as accepted - in case of a crash, program code will resynclocal state with external systems.
Now, quite often I need to run lookups on certain values, and it would be nice to have standard interface. Basically a bag of objects, but with the ability to run queries efficiently against an in memory index. For example I have a table of "instruments" which all have a unique code, and I need to look up this code.... about 30.000 times per second as I get updates for every instrument.
Anyone an idea for a decent high performance library for this?
You should be able to use an in-memory SQLite database (:memory) with System.Data.SQLite.

Performance of bcp/BULK INSERT vs. Table-Valued Parameters

I'm about to have to rewrite some rather old code using SQL Server's BULK INSERT command because the schema has changed, and it occurred to me that maybe I should think about switching to a stored procedure with a TVP instead, but I'm wondering what effect it might have on performance.
Some background information that might help explain why I'm asking this question:
The data actually comes in via a web service. The web service writes a text file to a shared folder on the database server which in turn performs a BULK INSERT. This process was originally implemented on SQL Server 2000, and at the time there was really no alternative other than chucking a few hundred INSERT statements at the server, which actually was the original process and was a performance disaster.
The data is bulk inserted into a permanent staging table and then merged into a much larger table (after which it is deleted from the staging table).
The amount of data to insert is "large", but not "huge" - usually a few hundred rows, maybe 5-10k rows tops in rare instances. Therefore my gut feeling is that BULK INSERT being a non-logged operation won't make that big a difference (but of course I'm not sure, hence the question).
The insertion is actually part of a much larger pipelined batch process and needs to happen many times in succession; therefore performance is critical.
The reasons I would like to replace the BULK INSERT with a TVP are:
Writing the text file over NetBIOS is probably already costing some time, and it's pretty gruesome from an architectural perspective.
I believe that the staging table can (and should) be eliminated. The main reason it's there is that the inserted data needs to be used for a couple of other updates at the same time of insertion, and it's far costlier to attempt the update from the massive production table than it is to use an almost-empty staging table. With a TVP, the parameter basically is the staging table, I can do anything I want with it before/after the main insert.
I could pretty much do away with dupe-checking, cleanup code, and all of the overhead associated with bulk inserts.
No need to worry about lock contention on the staging table or tempdb if the server gets a few of these transactions at once (we try to avoid it, but it happens).
I'm obviously going to profile this before putting anything into production, but I thought it might be a good idea to ask around first before I spend all that time, see if anybody has any stern warnings to issue about using TVPs for this purpose.
So - for anyone who's cozy enough with SQL Server 2008 to have tried or at least investigated this, what's the verdict? For inserts of, let's say, a few hundred to a few thousand rows, happening on a fairly frequent basis, do TVPs cut the mustard? Is there a significant difference in performance compared to bulk inserts?
Update: Now with 92% fewer question marks!
(AKA: Test Results)
The end result is now in production after what feels like a 36-stage deployment process. Both solutions were extensively tested:
Ripping out the shared-folder code and using the SqlBulkCopy class directly;
Switching to a Stored Procedure with TVPs.
Just so readers can get an idea of what exactly was tested, to allay any doubts as to the reliability of this data, here is a more detailed explanation of what this import process actually does:
Start with a temporal data sequence that is ordinarily about 20-50 data points (although it can sometimes be up a few hundred);
Do a whole bunch of crazy processing on it that's mostly independent of the database. This process is parallelized, so about 8-10 of the sequences in (1) are being processed at the same time. Each parallel process generates 3 additional sequences.
Take all 3 sequences and the original sequence and combine them into a batch.
Combine the batches from all 8-10 now-finished processing tasks into one big super-batch.
Import it using either the BULK INSERT strategy (see next step), or TVP strategy (skip to step 8).
Use the SqlBulkCopy class to dump the entire super-batch into 4 permanent staging tables.
Run a Stored Procedure that (a) performs a bunch of aggregation steps on 2 of the tables, including several JOIN conditions, and then (b) performs a MERGE on 6 production tables using both the aggregated and non-aggregated data. (Finished)
OR
Generate 4 DataTable objects containing the data to be merged; 3 of them contain CLR types which unfortunately aren't properly supported by ADO.NET TVPs, so they have to be shoved in as string representations, which hurts performance a bit.
Feed the TVPs to a Stored Procedure, which does essentially the same processing as (7), but directly with the received tables. (Finished)
The results were reasonably close, but the TVP approach ultimately performed better on average, even when the data exceeded 1000 rows by a small amount.
Note that this import process is run many thousands of times in succession, so it was very easy to get an average time simply by counting how many hours (yes, hours) it took to finish all of the merges.
Originally, an average merge took almost exactly 8 seconds to complete (under normal load). Removing the NetBIOS kludge and switching to SqlBulkCopy reduced the time to almost exactly 7 seconds. Switching to TVPs further reduced the time to 5.2 seconds per batch. That's a 35% improvement in throughput for a process whose running time is measured in hours - so not bad at all. It's also a ~25% improvement over SqlBulkCopy.
I am actually fairly confident that the true improvement was significantly more than this. During testing it became apparent that the final merge was no longer the critical path; instead, the Web Service that was doing all of the data processing was starting to buckle under the number of requests coming in. Neither the CPU nor the database I/O were really maxed out, and there was no significant locking activity. In some cases we were seeing a gap of a few idle seconds between successive merges. There was a slight gap, but much smaller (half a second or so) when using SqlBulkCopy. But I suppose that will become a tale for another day.
Conclusion: Table-Valued Parameters really do perform better than BULK INSERT operations for complex import+transform processes operating on mid-sized data sets.
I'd like to add one other point, just to assuage any apprehension on part of the folks who are pro-staging-tables. In a way, this entire service is one giant staging process. Every step of the process is heavily audited, so we don't need a staging table to determine why some particular merge failed (although in practice it almost never happens). All we have to do is set a debug flag in the service and it will break to the debugger or dump its data to a file instead of the database.
In other words, we already have more than enough insight into the process and don't need the safety of a staging table; the only reason we had the staging table in the first place was to avoid thrashing on all of the INSERT and UPDATE statements that we would have had to use otherwise. In the original process, the staging data only lived in the staging table for fractions of a second anyway, so it added no value in maintenance/maintainability terms.
Also note that we have not replaced every single BULK INSERT operation with TVPs. Several operations that deal with larger amounts of data and/or don't need to do anything special with the data other than throw it at the DB still use SqlBulkCopy. I am not suggesting that TVPs are a performance panacea, only that they succeeded over SqlBulkCopy in this specific instance involving several transforms between the initial staging and the final merge.
So there you have it. Point goes to TToni for finding the most relevant link, but I appreciate the other responses as well. Thanks again!
I don't really have experience with TVP yet, however there is an nice performance comparison chart vs. BULK INSERT in MSDN here.
They say that BULK INSERT has higher startup cost, but is faster thereafter. In a remote client scenario they draw the line at around 1000 rows (for "simple" server logic). Judging from their description I would say you should be fine with using TVP's. The performance hit - if any - is probably negligible and the architectural benefits seem very good.
Edit: On a side note you can avoid the server-local file and still use bulk copy by using the SqlBulkCopy object. Just populate a DataTable, and feed it into the "WriteToServer"-Method of an SqlBulkCopy instance. Easy to use, and very fast.
The chart mentioned with regards to the link provided in #TToni's answer needs to be taken in context. I am not sure how much actual research went into those recommendations (also note that the chart seems to only be available in the 2008 and 2008 R2 versions of that documentation).
On the other hand there is this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
I have been using TVPs since 2009 and have found, at least in my experience, that for anything other than simple insert into a destination table with no additional logic needs (which is rarely ever the case), then TVPs are typically the better option.
I tend to avoid staging tables as data validation should be done at the app layer. By using TVPs, that is easily accommodated and the TVP Table Variable in the stored procedure is, by its very nature, a localized staging table (hence no conflict with other processes running at the same time like you get when using a real table for staging).
Regarding the testing done in the Question, I think it could be shown to be even faster than what was originally found:
You should not be using a DataTable, unless your application has use for it outside of sending the values to the TVP. Using the IEnumerable<SqlDataRecord> interface is faster and uses less memory as you are not duplicating the collection in memory only to send it to the DB. I have this documented in the following places:
How can I insert 10 million records in the shortest time possible? (lots of extra info and links here as well)
Pass Dictionary<string,int> to Stored Procedure T-SQL
Streaming Data Into SQL Server 2008 From an Application (on SQLServerCentral.com ; free registration required)
TVPs are Table Variables and as such do not maintain statistics. Meaning, they report only having 1 row to the Query Optimizer. So, in your proc, either:
Use statement-level recompile on any queries using the TVP for anything other than a simple SELECT: OPTION (RECOMPILE)
Create a local temporary table (i.e. single #) and copy the contents of the TVP into the temp table
I think I'd still stick with a bulk insert approach. You may find that tempdb still gets hit using a TVP with a reasonable number of rows. This is my gut feeling, I can't say I've tested the performance of using TVP (I am interested in hearing others input too though)
You don't mention if you use .NET, but the approach that I've taken to optimise previous solutions was to do a bulk load of data using the SqlBulkCopy class - you don't need to write the data to a file first before loading, just give the SqlBulkCopy class (e.g.) a DataTable - that's the fastest way to insert data into the DB. 5-10K rows isn't much, I've used this for up to 750K rows. I suspect that in general, with a few hundred rows it wouldn't make a vast difference using a TVP. But scaling up would be limited IMHO.
Perhaps the new MERGE functionality in SQL 2008 would benefit you?
Also, if your existing staging table is a single table that is used for each instance of this process and you're worried about contention etc, have you considered creating a new "temporary" but physical staging table each time, then dropping it when it's finished with?
Note you can optimize the loading into this staging table, by populating it without any indexes. Then once populated, add any required indexes on at that point (FILLFACTOR=100 for optimal read performance, as at this point it will not be updated).
Staging tables are good! Really I wouldn't want to do it any other way. Why? Because data imports can change unexpectedly (And often in ways you can't foresee, like the time the columns were still called first name and last name but had the first name data in the last name column, for instance, to pick an example not at random.) Easy to research the problem with a staging table so you can see exactly what data was in the columns the import handled. Harder to find I think when you use an in memory table. I know a lot of people who do imports for a living as I do and all of them recommend using staging tables. I suspect there is a reason for this.
Further fixing a small schema change to a working process is easier and less time consuming than redesigning the process. If it is working and no one is willing to pay for hours to change it, then only fix what needs to be fixed due to the schema change. By changing the whole process, you introduce far more potential new bugs than by making a small change to an existing, tested working process.
And just how are you going to do away with all the data cleanup tasks? You may be doing them differently, but they still need to be done. Again, changing the process the way you describe is very risky.
Personally it sounds to me like you are just offended by using older techniques rather than getting the chance to play with new toys. You seem to have no real basis for wanting to change other than bulk insert is so 2000.

Resources