Optimizing Apache Spark SQL Queries - performance

I am facing very long latencies in Apache Spark when running some SQL queries. To keep each query simple, I run my calculations sequentially: the output of each query is stored as a temporary table (.registerTempTable('TEMP')) so it can be used in the following SQL query, and so on. But the whole chain takes far too long, while the equivalent 'pure Python' code takes just a few minutes.
sqlContext.sql("""
SELECT PFMT.* ,
DICO_SITES.CodeAPI
FROM PFMT
INNER JOIN DICO_SITES
ON PFMT.assembly_department = DICO_SITES.CodeProg """).registerTempTable("PFMT_API_CODE")
sqlContext.sql("""
SELECT GAMMA.*,
(GAMMA.VOLUME*GAMMA.PRORATA)/100 AS VOLUME_PER_SUPPLIER
FROM
(SELECT PFMT_API_CODE.* ,
SUPPLIERS_PROP.CODE_SITE_FOURNISSEUR,
SUPPLIERS_PROP.PRORATA
FROM PFMT_API_CODE
INNER JOIN SUPPLIERS_PROP ON PFMT_API_CODE.reference = SUPPLIERS_PROP.PIE_NUMERO
AND PFMT_API_CODE.project_code = SUPPLIERS_PROP.FAM_CODE
AND PFMT_API_CODE.CodeAPI = SUPPLIERS_PROP.SITE_UTILISATION_FINAL) GAMMA """).registerTempTable("TEMP_ONE")
sqlContext.sql("""
SELECT TEMP_ONE.* ,
ADCP_DATA.* ,
CASE
WHEN ADCP_DATA.WEEK <= weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_ST + ADCP_DATA.ADD_CAPACITY_ST
WHEN ADCP_DATA.WEEK > weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_LT + ADCP_DATA.ADD_CAPACITY_LT
END AS CAPACITY_REF
FROM TEMP_ONE
INNER JOIN ADCP_DATA
ON TEMP_ONE.reference = ADCP_DATA.PART_NUMBER
AND TEMP_ONE.CodeAPI = ADCP_DATA.API_CODE
AND TEMP_ONE.project_code = ADCP_DATA.PROJECT_CODE
AND TEMP_ONE.CODE_SITE_FOURNISSEUR = ADCP_DATA.SUPPLIER_SITE_CODE
AND TEMP_ONE.WEEK_NUM = ADCP_DATA.WEEK_NUM
""" ).registerTempTable('TEMP_BIS')
sqlContext.sql("""
SELECT TEMP_BIS.CSF_ID,
TEMP_BIS.CF_ID ,
TEMP_BIS.CAPACITY_REF,
TEMP_BIS.VOLUME_PER_SUPPLIER,
CASE
WHEN TEMP_BIS.CAPACITY_REF >= VOLUME_PER_SUPPLIER THEN 'CAPACITY_OK'
WHEN TEMP_BIS.CAPACITY_REF < VOLUME_PER_SUPPLIER THEN 'CAPACITY_NOK'
END AS CAPACITY_CHECK
FROM TEMP_BIS
""").take(100)
Could anyone highlight best practices (if there are any) for writing PySpark SQL queries like these?
Does it make sense that locally on my computer the script is much faster than on the Hadoop cluster?
Thanks in advance

You should cache your intermediate results. What is the data source? Can you retrieve only the relevant data from it, or only the relevant columns? There are many options; you should provide more information about your data.
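For illustration, here is a minimal PySpark sketch of that advice, reusing the temp-table names from the question. The Parquet source path and the column list are assumptions, since the question does not show where the data comes from:

# A minimal sketch of the caching / column-pruning advice, assuming the same
# sqlContext as in the question; the Parquet path and column list are hypothetical.
pfmt = sqlContext.read.parquet("/data/pfmt")  # hypothetical source
# Keep only the columns the later joins and calculations actually need.
pfmt.select("reference", "project_code", "assembly_department",
            "VOLUME", "WEEK_NUM").registerTempTable("PFMT")

pfmt_api_code = sqlContext.sql("""
    SELECT PFMT.*, DICO_SITES.CodeAPI
    FROM PFMT
    INNER JOIN DICO_SITES
      ON PFMT.assembly_department = DICO_SITES.CodeProg
""")
pfmt_api_code.registerTempTable("PFMT_API_CODE")
sqlContext.cacheTable("PFMT_API_CODE")  # materialised on first use, reused by every later query

Without caching, each subsequent sqlContext.sql(...) call recomputes the full lineage of every temp table it references, so the same joins are executed several times; caching the intermediate tables and pruning unused columns usually removes most of that overhead.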

Related

Load only some elements of a nested collection efficiently with LINQ

I have the following LINQ query (using EF Core 6 and MS SQL Server):
var resultSet = dbContext.Systems
.Include(system => system.Project)
.Include(system => system.Template.Type)
.Select(system => new
{
System = system,
TemplateText = system.Template.TemplateTexts.FirstOrDefault(templateText => templateText.Language == locale.LanguageIdentifier),
TypeText = system.Template.Type.TypeTexts.FirstOrDefault(typeText => typeText.Language == locale.LanguageIdentifier)
})
.FirstOrDefault(x => x.System.Id == request.Id);
The requirement is to retrieve the system matching the requested ID and load its project, template and template's type info. The template has multiple TemplateTexts (one for each translated language) but I only want to load the one matching the requested locale, same deal with the TypeTexts elements of the template's type.
The LINQ query above does that in one query and it gets converted to the following SQL query (I edited the SELECT statements to use * instead of the long list of columns generated):
SELECT [t1].*, [t2].*, [t5].*
FROM (
SELECT TOP(1) [p].*, [t].*, [t0].*
FROM [ParkerSystems] AS [p]
LEFT JOIN [Templates] AS [t] ON [p].[TemplateId] = [t].[Id]
LEFT JOIN [Types] AS [t0] ON [t].[TypeId] = [t0].[Id]
LEFT JOIN [Projects] AS [p0] ON [p].[Project_ProjectId] = [p0].[ProjectId]
WHERE [p].[SystemId] = @__request_Id_1
) AS [t1]
LEFT JOIN (
SELECT [t3].*
FROM (
SELECT [t4].*, ROW_NUMBER() OVER(PARTITION BY [t4].[ReferenceId] ORDER BY [t4].[Id]) AS [row]
FROM [TemplateTexts] AS [t4]
WHERE [t4].[Language] = @__locale_LanguageIdentifier_0
) AS [t3]
WHERE [t3].[row] <= 1
) AS [t2] ON [t1].[Id] = [t2].[ReferenceId]
LEFT JOIN (
SELECT [t6].*
FROM (
SELECT [t7].*, ROW_NUMBER() OVER(PARTITION BY [t7].[ReferenceId] ORDER BY [t7].[Id]) AS [row]
FROM [TypeTexts] AS [t7]
WHERE [t7].[Language] = @__locale_LanguageIdentifier_0
) AS [t6]
WHERE [t6].[row] <= 1
) AS [t5] ON [t1].[Id0] = [t5].[ReferenceId]
This is not bad, and it's not a super complicated query, but I feel like my requirement could be satisfied with a much simpler SQL query:
SELECT *
FROM [Systems] AS [p]
JOIN [Templates] AS [t] ON [p].[TemplateId] = [t].[Id]
JOIN [TemplateTexts] AS [tt] ON [p].[TemplateId] = [tt].[ReferenceId]
JOIN [Types] AS [ty] ON [t].[TypeId] = [ty].[Id]
JOIN [TemplateTexts] AS [tyt] ON [ty].[Id] = [tyt].[ReferenceId]
WHERE [p].[SystemId] = @systemId and tt.[Language] = 2 and tyt.[Language] = 2
My question is: is there a different/simpler LINQ expression (either in method syntax or query syntax) that produces the same result (gets all the info in one go)? Ideally I'd like to avoid an anonymous object where the filtered sub-collections are aggregated. For even more brownie points, it'd be great if the generated SQL were simpler and closer to what I think would be a simple query.
Is there a different/simpler LINQ expression (...) that produces the same result
Yes (maybe) and no.
No, because you're querying dbContext.Systems, therefore EF will return all systems that match your filter, even when they don't have TemplateTexts etc. That's why it has to generate outer joins. EF is not aware of your apparent intention to skip systems without this nested data, or of any guarantee that such systems don't occur in the database (which you seem to assume, judging by the second query).
That accounts for the left joins to subqueries.
These subqueries are generated because of FirstOrDefault. In SQL it always requires some sort of subquery to get "first" records of one-to-many relationships. This ROW_NUMBER() OVER construction is actually quite efficient. Your second query doesn't have any notion of "first" records. It'll probably return different data.
Yes (maybe) because you also Include data. I'm not sure why. Some people seem to think Include is necessary to make subsequent projections (.Select) work, but it isn't. If that's your reason to use Includes then you can remove them and thus remove the first couple of joins.
OTOH you also Include system.Project, which is not in the projection, so you seem to have added the Includes deliberately. And in this case they do have an effect, because the entire system entity is in the projection; otherwise EF would ignore them.
If you need the Includes then again, EF has to generate outer joins for the reason mentioned above.
EF decides to handle the Includes and projections separately, while hand-crafted SQL, aided by prior knowledge of the data could do that more efficiently. There's no way to affect that behavior though.
This LINQ query is close to your SQL, but I'm not sure about the correctness of the result:
var resultSet =
(from system in dbContext.Systems
from templateText in system.Template.TemplateTexts
where templateText.Language == locale.LanguageIdentifier
from typeText in system.Template.Type.TypeTexts
where typeText.Language == locale.LanguageIdentifier
select new
{
System = system,
TemplateText = templateText,
TypeText = typeText
})
.FirstOrDefault(x => x.System.Id == request.Id);

How to optimize EF/LINQ query to get a single rather than nested SELECT statement

The following LINQ query takes 90 milliseconds to execute:
.Where(Function(i) (i.RequestedByUserId = MySession.ApplicationUserId)
And (i.RequestKey1 = searchJson)
And (i.RequestMethod = "ProjectPlanService.GetProjectPlanMaintenanceData"))
.Select(Function(i) i.ResultJson).FirstOrDefault
The generated SQL is shown below:
SELECT
[Limit1].[ResultJson] AS [ResultJson]
FROM ( SELECT TOP (1)
[Extent1].[ResultJson] AS [ResultJson]
FROM [dbo].[ApplicationCache] AS [Extent1]
WHERE ([Extent1].[RequestedByUserId] = 2) AND ([Extent1].[RequestKey1] = '{"SortProperty":"","SortOrder":0,"PageNumber":1,"RecordsPerPage":15,"CriteriaCount":"1","CriteriaString":"~=~Id"}') AND ('ProjectPlanService.GetProjectPlanMaintenanceData' = [Extent1].[RequestMethod])
) AS [Limit1]
How can I optimize the above LINQ expression to reduce the execution time?
Is there a way to get a single Select statement like below:
SELECT
[ResultJson]
FROM [dbo].[ApplicationCache] AS [Extent1]
WHERE ([Extent1].[RequestedByUserId] = 2) AND ([Extent1].[RequestKey1] = '{"SortProperty":"","SortOrder":0,"PageNumber":1,"RecordsPerPage":15,"CriteriaCount":"1","CriteriaString":"~=~Id"}') AND ('ProjectPlanService.GetProjectPlanMaintenanceData' = [Extent1].[RequestMethod])
1: There's no need to optimise; it's fine as it is. The extra "wrapping" won't change anything.
2: It's doing a TOP (1) because you asked for FirstOrDefault(); if you want a full list of results, don't do that.

Too many queries for each user to assets table in joomla

My site generates a lot of database queries: six queries for each user. I tried to find the source of these, but my knowledge was not enough. Could anyone help me find the source of these queries?
I used:
Joomla 2.5.8
Main Components: CB, Kunena, SH404SEF, K2, Komento, UddeIM PMS
Main Modules: Gavick News PRO4
Block spam IP
Block bots
The queries which are generated for each user:
SELECT *
FROM `_users`
WHERE `id` = 15
SELECT `g`.`id`,`g`.`title`
FROM `_usergroups` AS g
INNER JOIN `_user_usergroup_map` AS m ON m.group_id = g.id
WHERE `m`.`user_id` = 15
SELECT b.id
FROM _user_usergroup_map AS map
LEFT JOIN _usergroups AS a ON a.id = map.group_id
LEFT JOIN _usergroups AS b ON b.lft <= a.lft
AND b.rgt >= a.rgt
WHERE map.user_id = 15
SELECT a.rules
FROM _assets AS a
WHERE (a.id = 1)
GROUP BY a.id, a.rules, a.lft
SELECT id
FROM _assets
WHERE parent_id = 0
SELECT b.rules
FROM _assets AS a
LEFT JOIN _assets AS b ON b.lft <= a.lft
AND b.rgt >= a.rgt
WHERE (a.id = 1 OR a.parent_id = 0)
GROUP BY b.id, b.rules, b.lft
ORDER BY b.lft
Well, of course there are going to be a fair number of queries per user, as you are using extensions such as CB and Kunena that run queries for each user. Unless you get a message from your host saying too much memory is being used or there is too much traffic, you should be fine.
Joomla is a CMS, and these sorts of things are to be expected when there is a fair number of users.
Actually we just fixed a bug in the rules field that was generating an excessive number of queries. It will be fixed in 3.0.4 which is due out next week and whenever another 2.5 release comes out. In the meantime you can fix it yourself.
https://github.com/joomla/joomla-platform/pull/1792
But that's not what you are asking about. The number of queries isn't really the issue (it's totally reasonable); the question is how fast they are.

converting sql to linq woes

At my job our main application was written long ago, before n-tier was really a thing; ergo, it has tons and tons of business logic being handled in stored procs and such.
So we have finally decided to bite the bullet and make it not suck so bad. I have been tasked with converting a 900+ line SQL script to a .NET exe, which I am doing in C#/LINQ. Problem is, for the last 5-6 years at another job I had been doing LINQ exclusively, so my SQL has gotten somewhat rusty, and some of the things I am converting I have never tried to do in LINQ before, so I'm hitting some roadblocks.
Anyway, enough whining.
I'm having trouble with the following SQL statement, I think because it joins on a temp table and a derived table. Here's the SQL:
insert into #processedBatchesPurgeList
select d.pricebatchdetailid
from pricebatchheader h (nolock)
join pricebatchstatus pbs (nolock) on h.pricebatchstatusid = pbs.pricebatchstatusid
join pricebatchdetail d (nolock) on h.pricebatchheaderid = d.pricebatchheaderid
join
( -- Grab most recent REG.
select
item_key
,store_no
,pricebatchdetailid = max(pricebatchdetailid)
from pricebatchdetail _pbd (nolock)
join pricechgtype pct (nolock) on _pbd.pricechgtypeid = pct.pricechgtypeid
where
lower(rtrim(ltrim(pct.pricechgtypedesc))) = 'reg'
and expired = 0
group by item_key, store_no
) dreg
on d.item_key = dreg.item_key
and d.store_no = dreg.store_no
where
d.pricebatchdetailid < dreg.pricebatchdetailid -- Make sure PBD is not most recent REG.
and h.processeddate < #processedBatchesPurgeDateLimit
and lower(rtrim(ltrim(pbs.pricebatchstatusdesc))) = 'processed' -- Pushed/processed batches only.
So that raises an overall question first: how do you handle temp tables in LINQ? This script uses about 10 of them. I currently have them as Lists. The problem is that if I try to .Join() on one in a query, I get the "Local sequence cannot be used in LINQ to SQL implementations of query operators except the Contains operator." error.
I was able to get the join to the derived table to work using 2 queries, just so a single one wouldn't get nightmarishly long:
var dreg = (from _pbd in db.PriceBatchDetails.Where(pbd => pbd.Expired == false && pbd.PriceChgType.PriceChgTypeDesc.ToLower().Trim() == "reg")
group _pbd by new { _pbd.Item_Key, _pbd.Store_No } into _pbds
select new
{
Item_Key = _pbds.Key.Item_Key,
Store_No = _pbds.Key.Store_No,
PriceBatchDetailID = _pbds.Max(pbdet => pbdet.PriceBatchDetailID)
});
var query = (from h in db.PriceBatchHeaders.Where(pbh => pbh.ProcessedDate < processedBatchesPurgeDateLimit)
join pbs in db.PriceBatchStatus on h.PriceBatchStatusID equals pbs.PriceBatchStatusID
join d in db.PriceBatchDetails on h.PriceBatchHeaderID equals d.PriceBatchHeaderID
join dr in dreg on new { d.Item_Key, d.Store_No } equals new { dr.Item_Key, dr.Store_No }
where d.PriceBatchDetailID < dr.PriceBatchDetailID
&& pbs.PriceBatchStatusDesc.ToLower().Trim() == "processed"
select d.PriceBatchDetailID);
So that query gives the expected results, which I am holding in a List, but then I need to join the results of that query to another one selected from the database, which is leading me back to the aforementioned "Local sequence cannot be used..." error.
That query is this:
insert into #pbhArchiveFullListSaved
select h.pricebatchheaderid
from pricebatchheader h (nolock)
join pricebatchdetail d (nolock)
on h.pricebatchheaderid = d.pricebatchheaderid
join #processedBatchesPurgeList dlist
on d.pricebatchdetailid = dlist.pricebatchdetailid -- PBH list is restricted to PBD purge list rows that have PBH references.
group by h.pricebatchheaderid
The join there on #processedBatchesPurgeList is the problem I am running into.
So uh...help? I have never written SQL like this, and certainly never tried to convert it to Linq.
As pointed out by the comments above, this is no longer being rewritten in LINQ.
Was hoping to get a performance improvement along with achieving better SOX compliance, which was the whole reason for the rewrite in the first place.
I'm happy with just satisfying the SOX compliance issues.
Thanks, everyone.

How to increase sort_area_size

How do I set sort_area_size in Oracle 10g, and what size should it be, given that I have more than 2.2 million rows in a single table? Please also tell me the suggested size for SORT_AREA_RETAINED_SIZE.
My queries are far too slow; they mostly take more than an hour to complete.
Please suggest ways I can optimize my queries and tune the Oracle 10g database.
thanks
Update: the query is:
SELECT A.TITLE,C.TOWN_VILL U_R,F.CODE TOWN_CODE,F.CITY_TOWN_MAKE,A.FRM,A.PRD_CODE,A.BR_CODE,A.SIZE_CODE ,B.PRICES,
A.PROJECT_YY,A.PROJECT_MM,d.province ,D.BR_CODE BRANCH_CODE,D.STRATUM,L.LSM_GRP LSM,
SUM(GET_FRAC_FACTOR_ALL_PR_NEW(A.FRM,A.PRD_CODE,A.BR_CODE,A.SIZE_CODE,A.PROJECT_YY,A.PROJECT_MM,A.FRAC_CODE ,B.PRICES,A.QTY_USED,A.VERIF_CODE, A.PACKING_CODE, J.TYPE ,'R') )
* MAX(D.UNIVERSE) / MAX(E.SAMPLE) /1000000 MARKET , D.UNIVERSE ,E.SAMPLE
FROM A2_FOR_CPMARKETS A,
BRAND J,
PRICES B,CP_SAMPLE_ALL_MONTHS C ,
CP_LSM L,
HOUSEHOLD_GL D,
SAMPLE_CP_ALL_MONTHS E ,
City_Town_ALL F
WHERE A.PRD_CODE = B.PRD_CODE
AND A.BR_CODE = B.BR_CODE
AND DECODE(A.SIZE_CODE,NULL,'L',A.SIZE_CODE) = B.SIZE_CODE -- for unbranded loose
AND DECODE(B.VAR_CODE,'X','X',A.VAR_CODE) = B.VAR_CODE
AND DECODE(B.COL_CODE,'X','X',A.COL_CODE) = B.COL_CODE
AND DECODE(B.PACK_CODE,'X','X',A.PACKING_CODE) = B.PACK_CODE
AND A.project_yy||A.project_MM BETWEEN B.START_DATE AND B.END_DATE
AND A.PRD_CODE=J.PRD_CODE
AND A.BR_CODE=J.BR_CODE
AND A.FRM = C.FRM
AND A.PROJECT_YY=L.YEAR
AND A.frm=L.FORM_NO
AND C.TOWN_VILL= D.U_R
AND C.CLASS = D.CLASS
AND D.TOWN=F.GRP
AND D.TOWN = E.TOWN_CODE
AND A.PROJECT_YY = E.PROJECT_YY
AND A.PROJECT_MM = E.PROJECT_MM
AND A.PROJECT_YY = C.PROJECT_YY
AND A.PROJECT_MM = C.PROJECT_MM
-- FOR HOUSEHOLD_GL
AND A.PROJECT_YY = D.YEAR
AND A.PROJECT_MM = D.MONTH
-- END HOUSEHOLD_GL
AND C.TOWN_VILL = E.TOWN_VILL
AND C.CLASS = E.CLASS
AND C.TOWN_VILL = F.TOWN_VILL
AND C.TOWN_CODE=F.CODE
AND (DECODE(e.PROJECT_YY,'1997','1','1998','1','1999','1','2000','1','2001','1','2002','1','2') = F.TYP )
GROUP BY A.TITLE,C.TOWN_VILL,F.CODE ,F.CITY_TOWN_MAKE,A.FRM,A.PRD_CODE,A.BR_CODE,A.SIZE_CODE ,B.PRICES,
A.PROJECT_YY,A.PROJECT_MM,d.province,D.BR_CODE ,D.STRATUM,L.LSM_GRP ,
UNIVERSE ,E.SAMPLE
[Broken image link: an "explain plan" screenshot that points at a file on the poster's local machine (C:\Documents and Settings\...\explain plan.jpg), so it cannot be viewed.]
Check the Oracle documentation for SORT_AREA_SIZE. You can use alter session set sort_area_size=10000 to modify it for the session, and alter system to modify it for the whole instance. SORT_AREA_RETAINED_SIZE works the same way.
Is your entire table (with 2.2 million rows) fetched in the result set? Is there a sort operation in it?
There could be some other reasons for the query to perform badly. Can you share the query and explain plan?
When you generate an execution plan for the query using DBMS_XPLAN.DISPLAY, Oracle will estimate (usually pretty reasonably) how much temporary tablespace storage you would need to execute it.
2.2 million rows may be irrelevant to the sort size, by the way: the memory required for aggregate operations such as MAX and SUM is related more to the size of the result set than to the size of the source data.
Providing a link to a jpg file stored on your pc does not count as having provided an execution plan, btw.
A.project_yy||A.project_MM BETWEEN B.START_DATE AND B.END_DATE
You know we have DATE datatypes in databases, right? Using the incorrect datatypes makes it harder for Oracle to determine data distributions, predicate selectivity and appropriate query plans.
