I am running this query:
SELECT ?p
WHERE
{
?s ?p ?o .
}
ORDER BY ASC(?p)
in order to get all the ?p values available. However, what I really want is one row per ?p, ordered by number of appearances, say:
p1 <- most appeared
p2
p3
rather than:
p1
p1
p1
p1
p2
p2
p2
How can I achieve this?
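The result described above amounts to counting how often each predicate occurs and sorting by that count, highest first. A minimal sketch of that logic in Python follows; the `triples` list and the `predicates_by_frequency` helper are hypothetical stand-ins for the real data, not part of any SPARQL library. In SPARQL 1.1 the same shape would normally be written with GROUP BY ?p, a COUNT aggregate, and ORDER BY DESC on the count, but that is not shown here.

```python
from collections import Counter

# Hypothetical (subject, predicate, object) triples standing in for the store.
triples = [
    ("s1", "p1", "o1"),
    ("s2", "p1", "o2"),
    ("s3", "p2", "o3"),
    ("s4", "p1", "o4"),
    ("s5", "p2", "o5"),
    ("s6", "p3", "o6"),
    ("s7", "p1", "o7"),
    ("s8", "p2", "o8"),
]


def predicates_by_frequency(triples):
    """Return (predicate, count) pairs, most frequent predicate first."""
    counts = Counter(predicate for _, predicate, _ in triples)
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranked = predicates_by_frequency(triples)
for predicate, count in ranked:
    print(predicate, count)
```

Each predicate appears once in the output, which is the difference from the ORDER BY ASC(?p) query that returns one row per triple.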
Related
I have a series of Pig scripts that are transforming hundreds of millions of records from multiple data sources that need to be joined together. Towards the end of each script, I reach a point where JOIN performance becomes terribly slow. Looking at the DAG in the Tez View, I see that it is split into relatively few tasks (typically 100-200), but each task takes multiple hours to complete. The task description shows that it's doing a HASH_JOIN.
Interestingly, I only run into this bottleneck when running on the Tez execution engine. On MapReduce, it can still take a while, but nothing like the agonizing crawl I get on Tez. However, running on MapReduce is a problem as I have an issue with MapReduce for which I've asked another question here.
Here's a sample of my code (apologies, I've had to make the code very generic to be able to post on the interwebs). I'm wondering what I can do to remove this bottleneck -- would specifying parallelism help? Is there something wrong with my approach?
-- Incoming data:
-- A: hundreds of millions of rows, 19 fields
-- B: hundreds of millions of rows, 3 fields
-- C: hundreds of millions of rows, 5 fields
-- D: a few thousand rows, 5 fields
J = -- This reduces the size of A, but still probably in the hundreds of millions
FILTER A
BY qualifying == 1;
K = -- This is a one-to-one join that doesn't explode the number of rows in J
JOIN J BY Id
, B BY Id;
L =
FOREACH K
GENERATE J1 AS L1
, J2 AS L2
, J3 AS L3
, J4 AS L4
, J5 AS L5
, J6 AS L6
, J7 AS L7
, J8 AS L8
, B1 AS L9
, B2 AS L10
;
M = -- Reduces the size of C to around one hundred million rows
FILTER C
BY Code matches 'Code-.+';
M_WithYear =
FOREACH M
GENERATE *
, (int)REGEX_EXTRACT(Code, 'Code-.+-([0-9]+)', 1) AS year:int
;
SPLIT M_WithYear
INTO M_annual IF year <= (int)'$currentYear' -- roughly 75% of the data from M
, M_lifetime IF Code == 'Code-Lifetime'; -- roughly 25% of the data from M
-- Transformations for M_annual
N =
JOIN M_WithYear BY Id, D BY Id USING 'replicated';
O = -- This is where performance falls apart
JOIN N BY (Id, year, M7) -- M7 matches L7
, L BY (Id, year, L7);
P =
FOREACH O
GENERATE N1 AS P1
, N2 AS P2
, N3 AS P3
, N4 AS P4
, N5 AS P5
, N6 AS P6
, N7 AS P7
, N8 AS P8
, N9 AS P9
, L1 AS P10
, L2 AS P11
;
-- Transformations N-P above repeated for M_lifetime
Table A
A1 A2
1 7
2 8
1 9
Table B
A1 B2
1 2
2 3
I want something like this:
select A.A1,sum(case when distinct A.A1 then B2),sum(A.A2) from
A,B
where A.A1=B.A1(+)
group by A.A1
After joining, my table will be:
A1 A2 B2
1 7 2
2 8 3
1 9 2
Resulting Table
A1 A2 B2
1 7+9 2(only once)
2 8 3
How can I sum B2 only once per distinct A1 after joining the tables as shown above?
Thanks in advance
Use JOIN and GROUP BY.
Query
-- MAX rather than SUM for B2: after the join, B2 repeats once per matching row of TableA
SELECT t1.A1, SUM(t1.A2) AS A2, MAX(t2.B2) AS B2
FROM TableA t1
JOIN TableB t2
ON t1.A1 = t2.A1
GROUP BY t1.A1;
Since table_b.a1 is unique, the best way to do this would be to work out the sum of table_a.a2 first to reduce the number of rows you're joining against, and then join to table_b. Then you don't need to worry about summing the distinct table_b.b2 values, which you would otherwise have to do.
WITH table_a AS (SELECT 1 a1, 7 a2 FROM dual UNION ALL
SELECT 2 a1, 8 a2 FROM dual UNION ALL
SELECT 1 a1, 9 a2 FROM dual),
table_b AS (SELECT 1 a1, 2 b2 FROM dual UNION ALL
SELECT 2 a1, 3 b2 FROM dual)
-- end of mimicking your two tables with sample_data in them;
-- see the sql below:
SELECT ta.a1,
ta.a2,
tb.b2
FROM (SELECT a1, SUM(a2) a2
FROM table_a
GROUP BY a1) ta
INNER JOIN table_b tb ON ta.a1 = tb.a1;
A1 A2 B2
---------- ---------- ----------
1 16 2
2 8 3
If you absolutely must join the two tables first (which I don't recommend, since it makes more work for the database), then you could do something like:
WITH table_a AS (SELECT 1 a1, 7 a2 FROM dual UNION ALL
SELECT 2 a1, 8 a2 FROM dual UNION ALL
SELECT 1 a1, 9 a2 FROM dual),
table_b AS (SELECT 1 a1, 2 b2 FROM dual UNION ALL
SELECT 2 a1, 3 b2 FROM dual)
SELECT ta.a1,
SUM(ta.a2) a2,
MAX(tb.b2) b2
FROM table_a ta
INNER JOIN table_b tb ON ta.a1 = tb.a1
GROUP BY ta.a1;
A1 A2 B2
---------- ---------- ----------
1 16 2
2 8 3
Since there can only be one distinct value for table_b.b2 per table_a.a1, we can just pick one of the values to use via MAX (we could have used MIN or SUM(distinct tb.b2) instead, fyi).
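To check the two approaches side by side, here is a small sketch that loads the sample rows into an in-memory SQLite database from Python. SQLite is only a stand-in for Oracle here (an assumption made so the example is self-contained), so the `FROM dual` scaffolding is replaced with plain tables. It also runs a plain SUM over b2 after the join to show the double counting that MAX avoids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (a1 INTEGER, a2 INTEGER);
    CREATE TABLE table_b (a1 INTEGER, b2 INTEGER);
    INSERT INTO table_a VALUES (1, 7), (2, 8), (1, 9);
    INSERT INTO table_b VALUES (1, 2), (2, 3);
""")

# Approach 1: aggregate table_a first, then join to table_b.
pre_aggregated = conn.execute("""
    SELECT ta.a1, ta.a2, tb.b2
    FROM (SELECT a1, SUM(a2) AS a2 FROM table_a GROUP BY a1) ta
    INNER JOIN table_b tb ON ta.a1 = tb.a1
    ORDER BY ta.a1
""").fetchall()

# Approach 2: join first, then pick the single b2 value per group with MAX.
join_then_max = conn.execute("""
    SELECT ta.a1, SUM(ta.a2), MAX(tb.b2)
    FROM table_a ta
    INNER JOIN table_b tb ON ta.a1 = tb.a1
    GROUP BY ta.a1
    ORDER BY ta.a1
""").fetchall()

# For contrast: SUM(b2) after the join counts b2 once per matching table_a row.
join_then_sum = conn.execute("""
    SELECT ta.a1, SUM(ta.a2), SUM(tb.b2)
    FROM table_a ta
    INNER JOIN table_b tb ON ta.a1 = tb.a1
    GROUP BY ta.a1
    ORDER BY ta.a1
""").fetchall()

print(pre_aggregated)
print(join_then_max)
print(join_then_sum)
```

The first two queries agree with the output shown above, while the third reports 4 instead of 2 for a1 = 1.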
I'm looking for a measure to use in an SSAS Tabular model that will allow me to perform dynamic ranking, so that the rank value updates automatically based on the filters and slicer values applied.
My case is different from this kind of scenario: PowerPivot DAX - Dynamic Ranking Per Group (Min Per Group)
The difference is that my data are not all in the same table.
I have a fact table like this:
-------------------------------------------------------------------------------------
ClientID | ProductID | Transaction Date | Sales
------------------------------------------------------------------------------------
C1 P3 1/1/2012 $100
C2 P1 8/1/2012 $150
C3 P4 9/1/2012 $200
C1 P2 3/5/2012 $315
C2 P2 9/5/2012 $50
C3 P2 12/9/2012 $50
------------------------------------------------------------------------------------
A Customer table
-------------------------------------------------------------------------------------
ClientID | ClientCountry |
C1 France
C2 France
C3 Germany
------------------------------------------------------------------------------------
...and also a Product table
-------------------------------------------------------------------------------------
ProductID | ProductSubCategory |
P1 SB1
P2 SB1
P3 SB2
P4 SB3
------------------------------------------------------------------------------------
So here is the pivot table I use for visualization:
-------------------------------------------------------------------------------------
ProductSubCategory | Sales
SB1 565 (150 + 315 + 50 + 50)
SB2 100
SB3 200
And the measure I'm looking for should behave like this:
-------------------------------------------------------------------------------------
ProductSubCategory | Sales | Rank
SB1 565 (150 + 315 + 50 + 50) 1
SB2 100 3
SB3 200 2
The usage is simple: I browse my cube in Excel, put ProductSubCategory on rows, add the sum of Sales, and expect my measure to give me the correct ranking by ProductSubCategory.
The scenario also includes a slicer on ClientCountry.
So when I select 'France', I expect my measure to give me an adjusted ranking that only includes ProductSubCategory values for clients living in France (so C1 and C2).
I have tried a lot of solutions without any result. Does anyone have an idea for this kind of scenario?
I greatly appreciate your help with this!
Thanks, all
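To pin down the numbers the measure is expected to return, here is a small Python sketch over the sample rows above. It is not DAX and not a proposed implementation of the measure; the `rank_subcategories` helper and the dictionary layout are hypothetical, and the optional `country` argument plays the role of the ClientCountry slicer. In DAX itself, a rank like this is usually built on RANKX evaluated over ALL of the subcategory column, which is worth checking against the function reference.

```python
from collections import defaultdict

# Sample rows copied from the fact, Customer and Product tables in the question.
sales = [
    ("C1", "P3", 100),
    ("C2", "P1", 150),
    ("C3", "P4", 200),
    ("C1", "P2", 315),
    ("C2", "P2", 50),
    ("C3", "P2", 50),
]
client_country = {"C1": "France", "C2": "France", "C3": "Germany"}
product_subcategory = {"P1": "SB1", "P2": "SB1", "P3": "SB2", "P4": "SB3"}


def rank_subcategories(country=None):
    """Return {subcategory: (total_sales, rank)}, optionally for one country."""
    totals = defaultdict(int)
    for client_id, product_id, amount in sales:
        if country is None or client_country[client_id] == country:
            totals[product_subcategory[product_id]] += amount
    # Rank 1 is the highest total; equal totals share the same rank.
    return {
        sub: (total, 1 + sum(other > total for other in totals.values()))
        for sub, total in totals.items()
    }


all_countries = rank_subcategories()
france_only = rank_subcategories("France")
print(all_countries)
print(france_only)
```

With no country selected, this reproduces the ranking in the table above (SB1 first, SB3 second, SB2 third). With 'France' selected, only SB1 and SB2 remain and are ranked 1 and 2.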
Suppose we have the following data:
Key Value Desired Rank
--- ----- ------------
P1 0.6 2
P1 0.6 2
P1 0.6 2
P2 0.8 1
P2 0.8 1
P3 0.6 3
P3 0.6 3
I want to select the distinct keys ordered by Value DESC, to be displayed in a grid that supports pagination.
I don’t know how to generate a rank like the values shown in the Desired Rank column, so that I can paginate correctly over the data set.
When I tried DENSE_RANK() OVER(ORDER BY value), the result was:
Key Value DENSE_RANK() OVER(ORDER BY value)
--- ----- ------------
P1 0.6 2
P1 0.6 2
P1 0.6 2
P2 0.8 1
P2 0.8 1
P3 0.6 2
P3 0.6 2
When I try to select the first two keys “rank between 1 and 2” I receive back 3 keys. And this ruins the required pagination mechanism.
Any ideas?
Thanks
If you want the distinct keys and values, why not use distinct?
select distinct
t.Key,
t.Value
from
YourTable t
order by
t.Value desc
Do you actually need the rank?
If you do, you still could
select distinct
t.Key,
t.Value,
dense_rank() over (order by t.Value desc, t.Key) as Rank
from
YourTable t
order by
t.Value desc
This would work without the distinct as well.
'When I try to select the first two
keys “rank between 1 and 2” I receive
back 3 keys.'
That is because you are ordering just by VALUE, so all KEYS with the same value are assigned the same rank. So you need to include the KEY in the ordering clause, after the value, as a tie-breaker. Like this:
DENSE_RANK() OVER (ORDER BY value DESC, key ASC)
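The Desired Rank column in the question corresponds to ordering by value descending and then by key. A minimal Python sketch of that ranking follows, useful for checking the expected ranks by hand; the `dense_rank_by_value_then_key` helper is hypothetical and only mimics what the analytic function does over distinct (key, value) pairs.

```python
# Sample rows (key, value) copied from the question.
rows = [
    ("P1", 0.6), ("P1", 0.6), ("P1", 0.6),
    ("P2", 0.8), ("P2", 0.8),
    ("P3", 0.6), ("P3", 0.6),
]


def dense_rank_by_value_then_key(rows):
    """Map each distinct (key, value) pair to its rank, highest value first."""
    # Equivalent ordering to: ORDER BY value DESC, key ASC
    ordered = sorted(set(rows), key=lambda pair: (-pair[1], pair[0]))
    # The pairs are distinct, so consecutive positions are the dense ranks.
    return {pair: rank for rank, pair in enumerate(ordered, start=1)}


ranks = dense_rank_by_value_then_key(rows)

# First page of two keys: rank between 1 and 2.
first_page = [key for (key, _), rank in sorted(ranks.items(), key=lambda kv: kv[1])
              if 1 <= rank <= 2]
print(ranks)
print(first_page)
```

Because the key breaks ties between equal values, a page defined as a rank range always contains exactly as many keys as the range is wide.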
Consider the following
Sample Input
SalesBoyName Product Amount
------------ ------- ------
Boy1 P1 100
Boy1 P1 40
Boy1 P2 100
Boy2 P1 100
Boy2 P3 12
Desired Output
SalesBoyName P1 P2 P3
------------ ---- ---- ----
Boy1 140 100 null
Boy2 100 null 12
The SQL Server 2005 query below does the job:
SELECT SalesBoyName, [P1] AS P1, [P2] AS P2,[P3] AS P3
FROM
(SELECT * FROM tblSales ) s
PIVOT
(
SUM (Amount)
FOR Product IN
( [P1], [P2], [P3])
) AS pvt
I want to perform the same thing in Oracle 10g.
How to do this?
This may be trivial, but I am very new to Oracle, so I am asking for help.
Thanks
You can do it like this in 10g:
select salesboyname,
sum (case when product='P1' then amount end) as p1,
sum (case when product='P2' then amount end) as p2,
sum (case when product='P3' then amount end) as p3
from tblsales
group by salesboyname;
In 11g there is a PIVOT keyword similar to SQL Server's.
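The CASE-based pivot is plain conditional aggregation, so it can be tried outside Oracle as well. The sketch below runs the same query shape against an in-memory SQLite database from Python; SQLite is an assumption made purely to keep the example self-contained, and the table is filled with the sample input from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblsales (salesboyname TEXT, product TEXT, amount INTEGER);
    INSERT INTO tblsales VALUES
        ('Boy1', 'P1', 100),
        ('Boy1', 'P1', 40),
        ('Boy1', 'P2', 100),
        ('Boy2', 'P1', 100),
        ('Boy2', 'P3', 12);
""")

# One SUM(CASE ...) per output column; rows for other products contribute NULL,
# so a salesboy with no rows for a product ends up with NULL in that column.
pivot = conn.execute("""
    select salesboyname,
           sum(case when product = 'P1' then amount end) as p1,
           sum(case when product = 'P2' then amount end) as p2,
           sum(case when product = 'P3' then amount end) as p3
    from tblsales
    group by salesboyname
    order by salesboyname
""").fetchall()

for row in pivot:
    print(row)
```

The NULL cells come back as None in Python, matching the null entries in the desired output.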