PySpark groupBy and orderBy used together - sorting

Hi there, I want to achieve something like this:
SAS SQL: select * from flightData2015 group by DEST_COUNTRY_NAME order by count
My data looks like this:
This is my Spark code:
flightData2015.selectExpr("*").groupBy("DEST_COUNTRY_NAME").orderBy("count").show()
I received this error:
AttributeError: 'GroupedData' object has no attribute 'orderBy'
I am new to PySpark. Are PySpark's groupBy and orderBy not the same as in SAS SQL?
I also tried sort: flightData2015.selectExpr("*").groupBy("DEST_COUNTRY_NAME").sort("count").show() and received much the same error: "AttributeError: 'GroupedData' object has no attribute 'sort'"
Please help!

In Spark, groupBy returns a GroupedData object, not a DataFrame, and you would normally follow it with an aggregation. In this case, even though the SAS SQL doesn't have any aggregation, you still have to define one (and you can drop the resulting column later if you want).
(flightData2015
.groupBy("DEST_COUNTRY_NAME")
.count() # this is the "dummy" aggregation
.orderBy("count")
.show()
)
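If you don't want the extra column in the final output, you can drop it after sorting, or register a temporary view and run the query through Spark SQL instead. A minimal sketch, assuming the SparkSession is available as spark and that the aggregated column keeps Spark's default name "count":
(flightData2015
.groupBy("DEST_COUNTRY_NAME")
.count()          # adds a column literally named "count"
.orderBy("count")
.drop("count")    # discard the "dummy" aggregation again
.show()
)
# Or express the whole thing in Spark SQL via a temp view
flightData2015.createOrReplaceTempView("flights")
spark.sql("""
    SELECT DEST_COUNTRY_NAME, count(*) AS n
    FROM flights
    GROUP BY DEST_COUNTRY_NAME
    ORDER BY n
""").show()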

There is no need for group by if you want every row.
You can order by multiple columns.
from pyspark.sql import functions as F
vals = [("United States", "Angola", 13), ("United States", "Anguilla", 38), ("United States", "Antigua", 20), ("United Kingdom", "Antigua", 22), ("United Kingdom", "Peru", 50), ("United Kingdom", "Russia", 13), ("Argentina", "United Kingdom", 13)]
cols = ["destination_country_name", "origin_country_name", "count"]
df = spark.createDataFrame(vals, cols)
# display(df.orderBy(["destination_country_name", F.col("count").desc()]))  # if you want count to be descending
display(df.orderBy(["destination_country_name", "count"]))  # display() is Databricks-specific; use .show() elsewhere

Related

Convert rank() partition by oracle query to pyspark sql [duplicate]

I'm trying to use some window functions (ntile and percentRank) on a DataFrame, but I don't know how to use them.
Can anyone help me with this, please? There are no examples of them in the Python API documentation.
Specifically, I'm trying to get quantiles of a numeric field in my data frame.
I'm using Spark 1.4.0.
To be able to use a window function you have to create a window first. The definition is pretty much the same as in normal SQL: you can define the order, the partitioning, or both. First, let's create some dummy data:
import numpy as np
np.random.seed(1)
keys = ["foo"] * 10 + ["bar"] * 10
values = np.hstack([np.random.normal(0, 1, 10), np.random.normal(10, 1, 10)])
df = sqlContext.createDataFrame([
{"k": k, "v": round(float(v), 3)} for k, v in zip(keys, values)])
Make sure you're using HiveContext (Spark < 2.0 only):
from pyspark.sql import HiveContext
assert isinstance(sqlContext, HiveContext)
Create a window:
from pyspark.sql.window import Window
w = Window.partitionBy(df.k).orderBy(df.v)
which is equivalent to
(PARTITION BY k ORDER BY v)
in SQL.
As a rule of thumb, window definitions should always contain a PARTITION BY clause, otherwise Spark will move all the data to a single partition. ORDER BY is required for some functions, while in other cases (typically aggregates) it may be optional.
There are also two optional clauses which can be used to define the window span, ROWS BETWEEN and RANGE BETWEEN. They won't be useful for us in this particular scenario.
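For reference only, a frame clause would look roughly like the sketch below; it is not needed for percentRank or ntile, and the Window.unboundedPreceding / Window.currentRow constants assume Spark 2.1+ (older versions use explicit numeric bounds instead):
# Running frame from the start of the partition up to the current row
w_running = (Window
.partitionBy(df.k)
.orderBy(df.v)
.rowsBetween(Window.unboundedPreceding, Window.currentRow))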
Finally we can use it for a query:
from pyspark.sql.functions import percentRank, ntile  # in recent Spark versions use percent_rank instead of percentRank
df.select(
"k", "v",
percentRank().over(w).alias("percent_rank"),
ntile(3).over(w).alias("ntile3")
)
Note that ntile is not related in any way to the quantiles.
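If the actual goal is quantiles of a numeric column rather than per-row ranks, DataFrame.approxQuantile (available since Spark 2.0) is a more direct route; a small sketch, assuming a Spark 2.0+ session:
# Approximate 25th/50th/75th percentiles of column "v" over the whole DataFrame;
# the last argument is the relative error (0.0 means exact, but more expensive)
quartiles = df.approxQuantile("v", [0.25, 0.5, 0.75], 0.01)
# Per-group quantiles can be computed with percentile_approx in an aggregation
from pyspark.sql import functions as F
df.groupBy("k").agg(F.expr("percentile_approx(v, 0.5)").alias("median_v")).show()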

Pyspark filter with a column from a different dataframe

I would like to filter the Ids in price down to those that exist in the events dataframe. My code is below but it is not working in PySpark. How can I fix this?
events = spark.createDataFrame([(657,'Conferences'),
(765, 'Seminars '),
(776, 'Meetings'),
(879, 'Conferences'),
(765, 'Meetings'),
(879, 'Seminars'),
(985, 'Meetings'),
(879, 'Meetings'),
(657, 'Seminars'),
(657,'Conferences')]
,['Id', 'event_name'])
events.show()
price = spark.createDataFrame([(657,10),
(879,45),
(776,54),
(879,45),
(765, 65)]
,['Id','Price'])
price[price.Id.isin(events.Id)].show()
A simple join will get only the prices for the Ids present in the events table:
events.join(price, "Id").select("Id", "Price").distinct().show()
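If you only want the rows of price (nothing pulled in from events, and no duplicate rows when an Id occurs several times in events), a left semi join expresses the "exists" check directly. A sketch; the collect-based variant mirrors the isin attempt from the question:
# left_semi keeps only the price rows whose Id appears in events,
# without adding columns from events and without duplicating rows
price.join(events, on="Id", how="left_semi").show()
# Alternatively, collect the event Ids to the driver and use isin
# (reasonable for small Id lists only)
event_ids = [row.Id for row in events.select("Id").distinct().collect()]
price.filter(price.Id.isin(event_ids)).show()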

Pig - how to select only some values from the list (not just simple distinct)?

Let's say I have input_file.txt (user_id, event_code, event_date):
1,a,1
1,b,2
2,a,3
2,b,4
2,b,5
2,b,6
2,c,7
2,b,8
As you can see, user_id = 2 has events like this: abbbcb
I'd like to have a result like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
So when we have several events with the same code, I'd like to take only the last one.
Can you please share any hints?
Regards
Pawel
The main thing you are describing is what GROUP BY does.
In this case:
B = GROUP A BY user_id;
Gets your records together by user_id. Your data will now look like this:
(1,{(1,a,1),(1,b,2)})
(2,{(2,a,3),(2,b,4),(2,b,5),(2,b,6),(2,c,7),(2,b,8)})
You say you only want the last one (I assume you mean the one with the greatest event_date). To do this, you can do a nested FOREACH with an ORDER BY to sort by date, and then take the first one with LIMIT. Note that this has arbitrary behavior when there are ties.
C = FOREACH B {
DA = ORDER A BY event_date DESC;
DB = LIMIT DA 1;
GENERATE FLATTEN(group), FLATTEN(DB.event_code), FLATTEN(DB.event_date);
}
Your data should now look like this:
1,b,2
2,b,8
Another option would be to use a UDF to write some custom behavior on the groups given by GROUP BY:
B = GROUP A BY user_id;
C = FOREACH B GENERATE YourUDFThatYouBuilt(group, A);
In that UDF you'd write whatever custom behavior you want (in this case, return the tuple with the greatest date).
It seems like you could use the DistinctBy UDF from Apache DataFu to achieve this. This UDF, given a bag, returns the first instance found for a given field. In your case the field you care about is event_code. But we have to reverse the order, as you actually want the last instance.
One clarification though. Correct me if I'm wrong, but I think the intended output is:
1,{(a,1),(b,2)}
2,{(a,3),(b,6),(c,7),(b,8)}
That is, the (a,3) event occurs for member 2. The (a,2) event occurs for member 1.
Here's how you can do it:
-- pass in 1 because we want distinct by event code (position 1)
define DistinctBy datafu.pig.bags.DistinctBy('1');
result = FOREACH (GROUP A BY user_id) {
-- reverse so we can take the last event code occurrence
A_reversed = ORDER A BY event_date DESC;
-- use DistinctBy to get the first tuple having an occurrence of a field value
A_distinct_by_code = DistinctBy(A_reversed);
-- put back in order again
A_ordered = ORDER A_distinct_by_code BY event_date ASC;
GENERATE group as user_id, A_ordered.(event_code,event_date);
}

Apache Pig: Easier way to filter by a bunch of values from the same field

Say I want to select a subset of data by their values from the same field. Right now I have to do something like this:
TestLocationsResults = FILTER SalesData by (StoreId =='17'
or StoreId =='85'
or StoreId =='12'
or StoreId =='45'
or StoreId =='26'
or StoreId =='75'
or StoreId =='13'
);
In SQL, we can simply do this:
SELECT * FROM SalesData where StoreID IN (17, 12, 85, 45, 26, 75, 13)
Is there a similar shortcut in Pig that I am missing?
It looks like Pig 0.12 added an IN operator.
So you can do
FILTER SalesData BY StoreID IN (17, 12, 85, 45, 26, 75, 13);
There is no IN keyword in Pig (prior to 0.12) to do this sort of set membership detection.
One suggestion is to write a UDF (as seen in this question / answer).
Another could be to create a relation with the values of each StoreId you want to filter by and then perform an inner join on the two relations.
My solution to this, when the data type is chararray, is to use a regular expression:
TestLocationsResults = FILTER SalesData by StoreID MATCHES '(17|12|85|45|26|75|13)';
When the data type is an int, you could try casting to a chararray.
One workaround could be to use the built-in function INDEXOF.
Ex:
TestLocationsResults = FILTER SalesData by INDEXOF(',17,12,85,45,26,75,13,', CONCAT(CONCAT(',', StoreId), ',')) > -1;
Amended to take the comment into account: the ',' symbols are introduced around StoreId to get an exact match rather than a partial one.
The way you're currently doing it is the best way to do it in Pig. All of the alternatives to what you're doing now are either hacky, slow, or both. Hopefully, Pig adds an "in" query in a future version, but for now you're doing it the best way available.

Max/Min for whole sets of records in PIG

I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in PIG as well but I'm having trouble finding it. It has a MAX and MIN function but when I tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
This didn't work. I had better luck adding an extra column with the same value to each row, grouping on that column, and then getting the max of that new group. This seems like a convoluted way of getting what I want, so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said, you need to group all the data together, but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)
