Custom Mapper and Reducer vs HiveQL - performance

Problem Statement:
I need to compare two tables, Table1 and Table2, which both store the same kind of data. Table1 is the main table against which the comparison has to be made, and after comparing I need to produce a report of any discrepancies found in Table2. These two tables hold a lot of data, on the order of terabytes. Currently I have written HiveQL to do the comparison and get the data back.
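For illustration, a minimal sketch of the kind of discrepancy query being described (the id and value columns are hypothetical, not from the original post):

-- rows present in both tables but with differing values, keyed on a hypothetical id column
SELECT t1.id, t1.value AS table1_value, t2.value AS table2_value
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
WHERE t1.value <> t2.value;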
So my question is: which is better in terms of PERFORMANCE, writing a CUSTOM MAPPER and REDUCER for this kind of job, or is the HiveQL I wrote fine, given that I will be joining these two tables on millions of records? As far as I know, HiveQL internally (behind the scenes) generates optimized map-reduce jobs, submits them for execution, and gets the results back.

The answer to your question is two-fold.
Firstly, if there is some processing that you can express in HiveQL syntax, I would argue that Hive's performance is comparable to that of writing custom map-reduce. The only catch is when you have some extra information about your data that you can make use of in your map-reduce code but not through Hive. For example, if your data is sorted, you may exploit that when processing your file splits in the mapper, whereas unless Hive is made aware of this sort order, it cannot use it to its advantage. Oftentimes there is a way to specify such extra information (through metadata or config properties), but sometimes there may be no way to pass it on to Hive at all.
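As an illustration of passing that kind of metadata to Hive, here is a minimal sketch (table and column names are hypothetical) that declares a sorted, bucketed layout so Hive can pick a sort-merge bucket join instead of a full shuffle join:

-- both tables bucketed and sorted on the join key
CREATE TABLE table1 (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 64 BUCKETS
STORED AS ORC;

CREATE TABLE table2 (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 64 BUCKETS
STORED AS ORC;

-- let Hive exploit the declared layout when joining
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;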
Secondly, sometimes the processing can be convoluted enough not to be easily expressible in SQL-like statements. These cases typically involve having to store intermediate state during processing. Hive UDAFs alleviate this problem to some extent. However, if you need something more custom, I have always preferred plugging in a custom mapper and/or reducer using the Hive transform functionality. It lets you take advantage of map-reduce within the context of a Hive query, so you can mix and match Hive's SQL-like functionality with custom map-reduce scripts, all in the same query.
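As a minimal sketch of that transform mechanism (the script name and columns are hypothetical): the streaming script reads tab-separated rows on stdin and writes tab-separated rows on stdout.

ADD FILE compare_rows.py;

FROM (
  SELECT id, payload FROM table2
  DISTRIBUTE BY id SORT BY id   -- plays the role of the shuffle before a reducer
) t
SELECT TRANSFORM (t.id, t.payload)
USING 'python compare_rows.py'
AS (id STRING, discrepancy STRING);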
Long story short: if your processing is easily expressible as a HiveQL query, I don't see much reason to write map-reduce code to achieve the same thing. One of the main reasons Hive was created was to let people like us write SQL-like queries instead of map-reduce. If we end up writing map-reduce instead of quintessential Hive queries (for performance reasons or otherwise), one could argue that Hive hasn't done a good job at its primary objective. On the other hand, if you have some information about your data that Hive can't take advantage of, you might be better off writing a custom map-reduce implementation that makes use of that information. But, then again, there is no need to write an entire map-reduce program when you can simply plug in the mappers and reducers using the Hive transform functionality, as mentioned before.

Related

How much of Talend's functionality is translated into SQL queries and how much into Java?

I am starting an internship and they asked me to learn how to use Talend ETL.
I did; it was not so difficult.
One of the extra tasks assigned to me is to verify how much of what I set up on the design workspace is executed in Java and how much is done through SQL queries.
I set up a simple join using the tMap component and monitored the SQL database with SQL Profiler. The result is that only the essential create/drop and select/insert statements are done via SQL, while everything else, such as the actual join, is done on the Java side.
For a simple operation like a join, wouldn't it be more convenient to execute it through a query instead of having Java perform it?
For those who also know SAP: in terms of performance, is there a big difference between Talend and SAP?
Only operations in tDB components (create, select, insert, etc.) are actually done through SQL. All operations done in other Talend components (tMap, tFilter, aggregation, etc.) are done in Java.
Indeed, you will get better performance by doing operations on the SQL side. You then have to find the right balance between an "all-in-SQL" type of job and an "all-Java" one (it can be harder for a Talend developer to debug operations if all the SQL work is done through a single query inside one component...).
You could definitely put your joins inside a tDBInput component, as sketched below, and output the result as a single output flow.
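For instance, a minimal sketch of the kind of query that could go into the tDBInput component instead of joining in a tMap (table and column names are hypothetical):

SELECT o.order_id, o.amount, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;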
You can also check the ELT* components: they let you use the SQL engine instead of the Java engine to perform operations (join, aggregate, filter) while still using a Talend interface.

Hive Managed vs External tables maintainability

Which one is better (performance-wise and operationally in the long run) for maintaining loaded data: managed or external tables?
And by maintaining, I mean that these tables will frequently have the following operations on a daily basis:
Select using partitions most of the time, but for some queries the partitions are not used.
Delete specific records, not a whole partition (for example, a problem is found in some columns and those rows need to be deleted and inserted again). I am not sure this is supported for normal tables unless transactional tables are used.
Most importantly, the need to merge files frequently, maybe twice a day, to combine small files and end up with fewer mappers. I know CONCATENATE is available on managed tables and INSERT OVERWRITE on external ones; which one is less costly?
It depends on your use case. External tables are recommended when the data is used by multiple applications: for example, when Pig or another tool processes the data alongside Hive. They are also used when you are mainly reading data.
With managed tables, on the other hand, Hive has complete control over the data. You can convert a table between external and managed in either direction:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');   -- managed -> external
alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- external -> managed
In your case you are doing frequent modifications to the data, so it is better for Hive to have total control over it. In this scenario, managed tables are recommended.
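Since the question also mentions deleting specific records, note that row-level DELETE in Hive requires an ACID (transactional) managed table; a minimal sketch under that assumption (names are hypothetical, and the exact requirements, such as ORC storage and bucketing, depend on the Hive version):

SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE events (id BIGINT, col1 STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

DELETE FROM events WHERE dt = '2020-01-01' AND col1 IS NULL;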
Apart from that, managed tables are more secure than external tables, because an external table's data can be accessed by anyone with access to the underlying files. With managed tables you can implement Hive-level security, which gives better control, whereas with external tables you have to implement HDFS-level security.
You can refer to the link below for a few more pointers to consider:
External vs Managed tables comparison
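On the file-merge point from the question, a hedged sketch of the two options (table, column, and partition names are hypothetical; CONCATENATE applies to ORC or RCFile tables, and rewriting a partition onto itself is a common compaction pattern whose exact behavior can vary by Hive version):

-- managed table: fast file-level merge of one partition
ALTER TABLE logs PARTITION (dt = '2020-01-01') CONCATENATE;

-- external table: rewrite the partition onto itself and let Hive's merge settings combine small files
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.smallfiles.avgsize = 134217728;
INSERT OVERWRITE TABLE logs PARTITION (dt = '2020-01-01')
SELECT id, col1 FROM logs WHERE dt = '2020-01-01';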

How do we define an HBase row key so we can fetch records efficiently when the table has millions of records?

I have 30 million records in the table, but when I try to find one of those records it takes too much time to retrieve it. Could you suggest how I should generate the row key so that records can be fetched fast?
Right now I am using an auto-increment id (1, 2, 3, and so on) as the row key. What steps should I take to improve performance? Let me know your concerns.
Generally, when we tune performance for a structured SQL table, we follow some basic tuning steps: apply proper indexes to the columns used in the query, apply proper logical partitioning or bucketing to the table, and give the buffer enough memory for complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problems come from context switching between disk and buffer, and between different servers. You need to work out how to reduce that context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query plan and try to improve performance.
An integer row key will give the best performance, but design the row key/index up front, when the table is created, because retrofitting it later kills performance.
When creating external tables in Hive / Impala against HBase tables, map the HBase row key to a string column in Hive / Impala (see the sketch after these notes). If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table. Use BETWEEN or =, <, >= instead.
If you are not filtering on the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of the data.
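As a minimal sketch of that Hive-over-HBase mapping (table, column family, and key values are hypothetical), with the row key mapped to a STRING column and a range predicate instead of LIKE:

CREATE EXTERNAL TABLE hbase_events (
  rowkey STRING,
  payload STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');

-- range scan on the row key rather than a full-table LIKE
SELECT rowkey, payload
FROM hbase_events
WHERE rowkey BETWEEN 'user123_20200101' AND 'user123_20200131';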

Apache Drill has bad performance against SQL Server

I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
The facts table has about 422,000 rows and Product has 600 rows; the grouping comes back with 4 rows.
First I tested this query on SQL Server and got a result back in about 150 ms.
With Drill I first tried connecting directly to SQL Server and running the query, but that was slow (about 5 seconds).
Then I tried saving the tables into JSON files and reading from them, but that was even slower, so I tried Parquet files.
I got the result back in about 3 seconds on the first run; the next run took about 900 ms and then it stabilized at about 500 ms.
From what I have read, this makes no sense and Drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on Windows, through the Drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process, so the scan and join are very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan, send it to all the drillbit processes, which then look up the data sources, access the data using the connectors, run the query, and return the results to the first node for aggregation, and only then do you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
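For reference, a hedged sketch of converting a JSON source to Parquet from inside Drill (the paths and workspace are hypothetical; dfs.tmp just needs to be a writable workspace):

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`facts_parquet` AS
SELECT * FROM dfs.`C:/data/facts.json`;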
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500 ms with Parquet is actually impressive, considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL Server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
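As a minimal sketch of that columnstore option (index and table names are hypothetical):

-- T-SQL: store the fact table in columnar format
CREATE CLUSTERED COLUMNSTORE INDEX cci_facts ON dbo.facts;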
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how it translates an SQL query into an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say "data source" so people won't think only of databases or other software-based data management systems.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it starts performing whatever is needed to execute the query; that may be filtering, sorting, joining, projecting (selecting specific columns), etc.
So Drill doesn't, by default, use any of the source's capabilities for processing the queried data. In fact, the source may not support any such capability at all!
If you wish to leverage any of the source's data-processing features, you'll have to modify the storage plugin you're using to access that source.
One query I regularly remember when I think about Drill's performance is this one:
SELECT a.CUST_ID,
       (SELECT COUNT(*) FROM SALES.CUSTOMERS WHERE CUST_ID < a.CUST_ID) rowNum
FROM SALES.CUSTOMERS a
ORDER BY CUST_ID
Only because of the < comparison operator in the correlated subquery, Drill has to load the whole table (actually a Parquet file), sort it, then perform the join.
This query took around 18 minutes to run on my machine, which is not a particularly powerful machine, but still, the effort Drill needs to process this query must not be ignored.
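As an aside, a hedged sketch of computing the same row number with a window function (assuming CUST_ID is unique), which avoids the per-row correlated scan:

SELECT CUST_ID,
       ROW_NUMBER() OVER (ORDER BY CUST_ID) - 1 AS rowNum
FROM SALES.CUSTOMERS
ORDER BY CUST_ID;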
Drill's purpose is not to be fast; its purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data. There are probably other things I can't think of at the moment, but you may find more information in the other answers.

Insert data from one table into two tables with GROUP BY in Oracle

I have a situation where a large amount of data (9+ billion rows per day) is collected in a loading table that has fields like
-TABLE loader
first_seen,request,type,response,hits
1232036346,mydomain.com,A,203.11.12.1,200
1332036546,ogm.com,A,103.13.12.1,600
1432039646,mydomain.com,A,203.11.12.1,30
that need to split into two tables (de-duplicated)
-TABLE final
request,type,response,hitcount,id
mydomain.com,A,203.11.12.1,230,1
ogm.com,A,103.13.12.1,600,2
and
-TABLE timestamps
id,times_seen
1,1232036346
2,1332036546
1,1432039646
I can create the schemas and do a select like
select request,type,response,sum(hits) from loader group by request,type,response;
to get the data into the final table. For best performance I want to see if I can use INSERT ALL to move data from the loader table into these two tables, and perhaps use triggers in the database to achieve this. Any ideas and recommendations on the best way to solve this?
"9+ billion per day"
That's more than just a large number of rows: that's a huge number, and it will require special engineering to handle it.
For starters, you don't just need INSERT statements. The requirement to maintain the count for existing (request,type,response) tuples points to UPDATE too. The need to generate and return a synthetic key is problematic in this scenario. It rules out MERGE, the easiest way of implementing upserts (because the MERGE syntax doesn't support the RETURNING clause).
Beyond that, attempting to handle nine billion rows in a single transaction is a bad idea. How long will it take to process? What happens if it fails halfway through? You need to define a more granular unit of work.
That raises some business issues, though. Do the users only want to see the whole picture, after the Close-Of-Day? Or would they derive benefit from seeing Intra-day results? If yes, how do you distinguish Intra-day from Close-Of-Day results? If no, how do you hide partially processed results while the rest is still in flight? Also, how soon after Close-Of-Day do they want to see those totals?
Then there are the architectural considerations. These figures mean processing over one hundred thousand (one lakh) rows every second. That requires serious crunch and expensive licensing extras: obviously Enterprise Edition for parallel processing, but also the Partitioning and perhaps RAC options.
By now you should have an inkling why nobody answered your question straight-away. This is a consultancy gig not a StackOverflow question.
But let's sketch a solution.
We must have continuous processing of incoming raw data. So we stream records for loading into the FINAL and TIMESTAMPS tables alongside the LOADER table, which becomes an audit of the raw data (or else perhaps we get rid of the LOADER table altogether).
We need to batch the incoming records to leverage set-based operations. Depending on the synthetic key implementation we should aim for pure SQL, otherwise Bulk PL/SQL.
Keeping the thing going is vital so we need to pay attention to Bulk Error Handling.
Ideally the target tables can be partitioned, so we can load into offline tables and use Partition Exchange to bring the cleaned data online.
For the synthetic key I would be tempted to use a hash key based on the (request,type,response) tuple rather than a sequence, as that would give us the option to load TIMESTAMPS and FINAL independently. (Collisions are extremely unlikely.)
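For instance, a hedged sketch of such a hash key using Oracle's STANDARD_HASH (available from 12c); the delimiter and hash algorithm are arbitrary illustrative choices, not part of the original design:

SELECT STANDARD_HASH(request || '|' || type || '|' || response, 'SHA256') AS id,
       request, type, response, SUM(hits) AS hitcount
FROM   loader
GROUP  BY request, type, response;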
Just to be clear, this is a bagatelle not a serious architecture. You need to experiment and benchmark various approaches against realistic volumes of data on Production-equivalent hardware.
