Optimised Hive query with JOIN, having millions of records - hadoop

I have 2 tables:
bpm_agent_data - 40 million records, 5 columns
bpm_loan_data - 20 million records, 5 columns
Now I ran a query in Hive:
select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber) from bpm_agent_data JOIN bpm_loan_data where bpm_loan_data.id = bpm_agent_data.id;
which is taking a very long time to complete.
What is the ideal way to write this query in Hive so that the reducer does not take so much time?

Found the solution for the above query: replaced WHERE with ON.
select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber) from bpm_agent_data JOIN bpm_loan_data ON (bpm_loan_data.id = bpm_agent_data.id);
The reason this helps: with the condition in the WHERE clause, Hive executes the JOIN as a cross join (Cartesian product) funnelled through a single reducer and only filters the result afterwards; moving the condition into the ON clause makes it an equi-join that is shuffled in parallel by the join key.
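As a quick sanity check (a sketch using the same table and column names), EXPLAIN should no longer flag the join as a cross product once the condition lives in the ON clause:
EXPLAIN
select count(bpm_agent_data.AgentID), count(bpm_loan_data.LoanNumber)
from bpm_agent_data
JOIN bpm_loan_data ON (bpm_loan_data.id = bpm_agent_data.id);
-- the plan should show a join keyed on id, not a cartesian product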

Related

How to Improve Cross Join Performance in Hive TEZ?

I have a Hive table with 5 billion records. I want each of these 5 billion records to be joined with a hardcoded 52 records.
To achieve this I am doing a cross join like:
select *
from table1 join table2
ON 1 = 1;
This takes 5 hours to run with the highest possible memory parameters.
Is there any shorter or easier way to achieve this in less time?
Turn on map-join:
set hive.auto.convert.join=true;
select *
from table1 cross join table2;
The table is small (52 records) and should fit into memory. The map-join operator will load the small table into the distributed cache, and each mapper container will use it to process data in memory, much faster than a common join.
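If the auto conversion does not kick in, the size threshold Hive uses to decide whether a table is "small" can be checked or raised (a sketch; 25000000 bytes is the usual default, shown here only for illustration):
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- max size in bytes for the map-join side
select *
from table1 cross join table2;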
Your query is slow because a cross join (Cartesian product) is processed by ONE single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner join, so as to utilize the map-side join optimization.
with t1 as (
  select col1, col2, ..., 0 as k from table1
),
t2 as (
  select col3, col4, ..., 0 as k from table2
)
select *
from t1 join t2
  on t1.k = t2.k;
Now each table (CTE) has a fake column called k with the identical value 0 in every row. The join therefore produces the same result as a cross join, but because it is now an equi-join, Hive can apply the map-side join optimization instead of sending everything through one reducer.

Performance issue in Hive version 0.13.1

I use AWS EMR to run my Hive queries and I have a performance issue while running Hive version 0.13.1.
The newer version of Hive took around 5 minutes to run 10 rows of data, but the same script for 230,804 rows has been running for 2 days and still isn't finished. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive> select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
At the very top of your Hive log output, it states: "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions, which returns every row in the 'b' table for every row in the 'a' table. So, if 'a' is 5 rows and 'b' is 10 rows, you get the product: 5 multiplied by 10 = 50 rows returned.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to produce a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on those 10 billion rows.
So if you can cut down the number of 'b' rows, you will get more benefit than cutting 'a': in your example, if you can filter the ip_logs table (table 2), which I am guessing has more rows than your order-number table, it will cut down on the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It has to scan all of table 'a' over and over. With 10 rows, you will not have a problem; with 20k, you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out which column your data model allows joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to reduce the number of comparisons BEFORE the range condition is evaluated. Another way I have done this in Hive is to create a view over a smaller set of data, and join/match against the view instead of the original table.
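One hedged sketch of that idea for this particular range join (the 1,000,000 bucket size is illustrative, and any range that spans a bucket boundary would need extra handling): derive a coarse bucket key on both sides so Hive can run a parallel equi-join on the bucket, and apply BETWEEN only within each bucket:
SELECT b.itemcode
FROM (
  SELECT CAST(orderno AS BIGINT) AS orderno,
         FLOOR(CAST(orderno AS BIGINT) / 1000000) AS bucket  -- coarse join key
  FROM foo
) a
JOIN (
  SELECT itemcode, startorderno, endorderno,
         FLOOR(startorderno / 1000000) AS bucket             -- same bucketing
  FROM bar
) b ON a.bucket = b.bucket
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;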

SQL stored procedure takes more time to execute as records increase - is there any way to optimize it?

I have 600,000 records and I want to fetch 10 of them, since I display only 10 records in the grid. My stored procedure works fine when fetching records between 1 and 10,000 (e.g. 500-510), but after that the execution time increases as the row number increases; e.g. if I fetch records between 100,000 and 100,010 it takes much longer to execute.
Can anyone please help me? I have used ROW_NUMBER() to number the rows and BETWEEN to retrieve the data.
Please suggest an optimized way to get the records.
The stored procedure creates a SQL query as given below:
SELECT FuelClaimId FROM
( SELECT fc.FuelClaimId, ROW_NUMBER() OVER (ORDER BY fc.FuelClaimId) AS RowNum
  FROM FuelClaims fc
  INNER JOIN Vehicles v ON fc.VehicleId = v.VehicleId
  INNER JOIN Drivers d ON d.DriverId = v.OfficialID
  INNER JOIN Departments de ON de.DepartmentId = d.DepartmentId
  INNER JOIN Provinces p ON de.ProvinceId = p.ProvinceId
  INNER JOIN FuelRates f ON f.FuelRateId = fc.FuelRateId
  INNER JOIN FuelClaimStatuses fs ON fs.FuelClaimStatusId = fc.statusid
  INNER JOIN LogsheetMonths l ON l.LogsheetMonthId = f.LogsheetMonthId
  WHERE fc.IsDeleted = 0 ) AS MyDerivedTable
WHERE MyDerivedTable.RowNum BETWEEN 600000 AND 600010
Try this instead:
SELECT TOP 10 fc.FuelClaimId
FROM FuelClaims fc
INNER JOIN Vehicles v ON fc.VehicleId = v.VehicleId
INNER JOIN Drivers d ON d.DriverId = v.OfficialID
INNER JOIN Departments de ON de.DepartmentId = d.DepartmentId
INNER JOIN Provinces p ON de.ProvinceId = p.ProvinceId
INNER JOIN FuelRates f ON f.FuelRateId = fc.FuelRateId
INNER JOIN FuelClaimStatuses fs ON fs.FuelClaimStatusId = fc.statusid
INNER JOIN LogsheetMonths l ON l.LogsheetMonthId = f.LogsheetMonthId
WHERE fc.IsDeleted = 0 AND fc.FuelClaimId BETWEEN 600001 AND 600010
ORDER BY fc.FuelClaimId
Also, BETWEEN is inclusive, so BETWEEN 10 AND 20 actually returns 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20: 11 rows, not 10. As identity values usually start at 1, you really want BETWEEN 11 AND 20 (hence 600001 in the above).
The above query should fix the issue where performance degrades as you query larger ranges of items.
While it won't always return exactly 10 records, the fix for that is:
WHERE fc.IsDeleted = 0 AND fc.FuelClaimId > @LastMaxFuelClaimId
where @LastMaxFuelClaimId is the MAX FuelClaimId returned from the previous query execution.
Edit: The reason it keeps getting slower is that the ROW_NUMBER() approach has to read more and more of the table to reach the next chunk; it doesn't skip the first 600,000 records, it reads them all and then returns only the next 10. So each time you query, it reads all the previous records over again. The query above does not suffer from the same problem.
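Putting that together, a minimal keyset-pagination sketch (table and column names taken from the question; @LastMaxFuelClaimId is assumed to hold the last id of the previous page):
DECLARE @LastMaxFuelClaimId INT;
SET @LastMaxFuelClaimId = 600000;  -- last FuelClaimId of the previous page

SELECT TOP 10 fc.FuelClaimId
FROM FuelClaims fc
WHERE fc.IsDeleted = 0
  AND fc.FuelClaimId > @LastMaxFuelClaimId  -- seek past the previous page instead of renumbering
ORDER BY fc.FuelClaimId;
Given an index on FuelClaimId, each page then reads only the rows it returns, so a deep page costs roughly the same as the first one.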
You should post an execution plan but a probable cause of performance problems would be inadequate or lack of indexing.
Make sure you have
an index on all your foreign key relations
a covering index on the fields you retrieve and select from
Covering Index
CREATE INDEX IX_FUELCLAIMS_FUELCLAIMID_ISDELETED
ON dbo.FuelClaims (FuelClaimId, VehicleID, IsDeleted)

SQL Server performance using ROWCOUNT

I use SET ROWCOUNT 27900 and then select two columns:
Select
emp.employeeid,
empd.employeedetailid
From
employee emp (NOLOCK)
join
employeedetail empd (NOLOCK) on emp.employeeid = empd.employeeid
This query executes in 3 seconds.
If I use SET ROWCOUNT 27950, the same query takes 20 seconds to execute.
I am not a SQL DBA; why is there a difference of 17 seconds for just 50 more rows? Is this related to page size or indexes?
Can anyone help me fine-tune the query?
Have you tried doing this using TOP instead of SET ROWCOUNT, and adding an ORDER BY?
SELECT TOP 27900 emp.employeeid, ...
ORDER BY ...;
This will give the optimizer a much better chance at optimizing. In simplest terms, SET ROWCOUNT is applied after the entire query has been processed, and only affects the rows that are sent back to the caller...
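A fuller sketch of that rewrite, reusing the column names from the question (the ORDER BY key is an assumption, added only to make TOP deterministic):
SELECT TOP 27900
    emp.employeeid,
    empd.employeedetailid
FROM employee emp
JOIN employeedetail empd
    ON emp.employeeid = empd.employeeid
ORDER BY emp.employeeid;  -- assumed key; use whatever ordering the caller expects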

Query Optimization

I'm having a problem with a query generated by the Mondrian OLAP server; it's taking too long. The query looks as follows.
select "TABLE1"."column0" as "c0" from "FACT_TABLE" as "FACT_TABLE",
"TABLE7" as "TABLE7",
"TABLE6" as "TABLE6",
"TABLE5" as "TABLE5",
"TABLE4" as "TABLE4",
"TABLE4" as "TABLE3",
"TABLE2" as "TABLE2",
"TABLE1" as "TABLE2"
where "FACT_TABLE"."column1" = 'VALUE' and
"FACT_TABLE"."column2" =0 and
"FACT_TABLE"."column3" = 0 and
"TABLE2"."table1Fk" = "TABLE1"."table1Id" and
"TABLE3"."table2Fk" = "TABLE2"."table2Id" and
"TABLE4"."table3Fk" = "TABLE3"."table3Id" and
"TABLE5"."table4Fk" = "TABLE4"."table4Id" and
"TABLE6"."table5Fk" = "TABLE5"."table5Id" and
"TABLE7"."table6Fk" = "TABLE6"."table6Id" and
"FACT_TABLE"."table7Fk" = "TABLE7"."table7Id"
group by "TABLE1"."column0"
order by "TABLE1"."column0" ASC
FACT_TABLE has 1,346,000 rows approx.
TABLE7 has 895 rows approx.
TABLE6 has 445 rows.
TABLE5 has 183 rows.
TABLE4 has 258 rows.
TABLE3 = TABLE4.
TABLE2 has 126 rows.
TABLE1 has 29 rows.
The query takes approx. 2.00 seconds in the production environment, and I really don't know what to do to improve query performance.
The server has 8 GB RAM, an Intel Xeon L5420 at 2.50 GHz, and runs MSSQL 2005.
I'd appreciate any help you can give me.
Have a look at the execution plan. It will notify you if it notices any tables that could use a non-clustered index to improve query time.
Link
That link helps a lot with optimizing queries (and learning a ton of other things.)
I would suspect that you don't have indexes on your foreign key fields.
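For example (a hypothetical sketch; substitute the real table and foreign-key column names from your schema):
CREATE NONCLUSTERED INDEX IX_FACT_TABLE_table7Fk ON FACT_TABLE (table7Fk);
CREATE NONCLUSTERED INDEX IX_TABLE2_table1Fk ON TABLE2 (table1Fk);
-- ...and likewise for each other tableNFk column used in the joins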
This is not causing your problem, but I personally wouldn't want to use a tool that creates implicit instead of explicit joins. How can you trust something that uses a syntax that was replaced 18 years ago with something better?
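For reference, here is the same query with explicit joins (a sketch that keeps the placeholder table and column names from the question):
select "TABLE1"."column0" as "c0"
from "FACT_TABLE"
join "TABLE7" on "FACT_TABLE"."table7Fk" = "TABLE7"."table7Id"
join "TABLE6" on "TABLE7"."table6Fk" = "TABLE6"."table6Id"
join "TABLE5" on "TABLE6"."table5Fk" = "TABLE5"."table5Id"
join "TABLE4" on "TABLE5"."table4Fk" = "TABLE4"."table4Id"
join "TABLE4" as "TABLE3" on "TABLE4"."table3Fk" = "TABLE3"."table3Id"
join "TABLE2" on "TABLE3"."table2Fk" = "TABLE2"."table2Id"
join "TABLE1" on "TABLE2"."table1Fk" = "TABLE1"."table1Id"
where "FACT_TABLE"."column1" = 'VALUE'
  and "FACT_TABLE"."column2" = 0
  and "FACT_TABLE"."column3" = 0
group by "TABLE1"."column0"
order by "TABLE1"."column0" ASC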
Oh, something else I just noticed: I don't see any aggregates, so why are you grouping by?
