How to get distinct rows in a DataFrame using PySpark?

I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment. Thank you in advance.
I have an interim DataFrame:
+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |
+----------------------------+---+
What I need is to remove all the redundant items in the host column; in other words, I need to get the final distinct result like:
+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
+----------------------------+---+

If df is the name of your DataFrame, there are two ways to get unique rows:
df2 = df.distinct()
or
df2 = df.drop_duplicates()
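For instance, here is a minimal sketch (the local SparkSession and the sample rows are illustrative, mirroring the data in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Small sample mirroring the question's data
df = spark.createDataFrame(
    [("in24.inetnebr.com", 1),
     ("uplherc.upl.com", 1),
     ("uplherc.upl.com", 1),
     ("ix-esc-ca2-07.ix.netcom.com", 1)],
    ["host", "day"],
)

# Both calls remove rows that are duplicated across all columns
df.distinct().show(truncate=False)
df.drop_duplicates().show(truncate=False)
Both calls behave the same here, because every column is considered when deciding whether two rows are duplicates.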

The normal distinct is not so user friendly, because you can't restrict it to particular columns.
In this case it is enough for you to use:
df = df.distinct()
but if you have other values in the day column, you won't get back only the distinct elements of host:
+--------------------+---+
| host|day|
+--------------------+---+
| in24.inetnebr.com| 1|
| uplherc.upl.com| 1|
| uplherc.upl.com| 2|
| uplherc.upl.com| 1|
| uplherc.upl.com| 1|
|ix-esc-ca2-07.ix....| 1|
| uplherc.upl.com| 1|
+--------------------+---+
after distinct you will get back the following:
df.distinct().show()
+--------------------+---+
| host|day|
+--------------------+---+
| in24.inetnebr.com| 1|
| uplherc.upl.com| 2|
| uplherc.upl.com| 1|
|ix-esc-ca2-07.ix....| 1|
+--------------------+---+
thus you should use this:
df = df.dropDuplicates(['host'])
It will keep the first value of day.
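As a rough sketch of that behaviour (assuming sample data like the output above, with uplherc.upl.com appearing under both day 1 and day 2):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("in24.inetnebr.com", 1),
     ("uplherc.upl.com", 1),
     ("uplherc.upl.com", 2),
     ("ix-esc-ca2-07.ix.netcom.com", 1)],
    ["host", "day"],
)

# One row survives per distinct host; the day that is kept comes from
# whichever of the duplicate rows Spark encounters first
df.dropDuplicates(["host"]).show(truncate=False)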
If you are familiar with SQL, this will also work for you:
df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")
+--------------------+-----------------+
| first(host, false)|first(day, false)|
+--------------------+-----------------+
| in24.inetnebr.com| 1|
|ix-esc-ca2-07.ix....| 1|
| uplherc.upl.com| 1|
+--------------------+-----------------+
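If you prefer to stay in the DataFrame API, a rough equivalent of that SQL (a sketch, not part of the original answer, applied to the same df) would be:
from pyspark.sql import functions as F

# Group by host and take the first day observed for each host,
# mirroring the first()/GROUP BY query above
new_df = df.groupBy("host").agg(F.first("day").alias("day"))
new_df.show(truncate=False)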

Related

predicting vehicle utilization

This is my data for the full year (8000 records), and I would like to make a prediction for the next 3 months.
Could you please advise which algorithm I should use, and share any other advice? (I am a beginner.)
|Branch |month| date |Util |
|:----- |:---:|:-----------:|-----------:|
|1101 |1 | 2022-01-01 | 43.54 |
|1103 |1 | 2022-01-02 | 74.37 |
|1104 |1 | 2022-01-03 | 0 |
|1126 |2 | 2022-01-04 | 65.83 |

How to update multiple values in Oracle?

I have two tables:
table 1:
|Project type|Quarter1|Quarter2|Quarter3|Quarter4|
|------------|--------|--------|--------|--------|
|type1 |1 |3 |5 |7 |
|type2 |2 |4 |6 |8 |
table 2:
|Project|Value|Quarter|
|-------|-----|-------|
|type1 | |1 |
|type2 | |1 |
|type1 | |2 |
|type2 | |2 |
|type1 | |3 |
|type2 | |3 |
|type1 | |4 |
|type2 | |4 |
I want to update the Value column of table 2 with data from table 1, and the expected outcome is:
|Project|Value|Quarter|
|-------|-----|-------|
|type1 |1 |1 |
|type2 |2 |1 |
|type1 |3 |2 |
|type2 |4 |2 |
|type1 |5 |3 |
|type2 |6 |3 |
|type1 |7 |4 |
|type2 |8 |4 |
I know that updating a single value can be written as:
update table2 a
set a.value = (select Quarter1
               from table1
               where projecttype = 'type1')
where a.project = 'type1'
  and a.quarter = '1';
Please tell me how I can update all the values at once.
Thank you!
One way is with a merge statement:
merge into table_2 t
using table_1 s
on (t.project = s.project_type)
when matched then update
  set t.value = case t.quarter when 1 then s.quarter1
                               when 2 then s.quarter2
                               when 3 then s.quarter3
                               when 4 then s.quarter4
                end;
This was my first thought: use a loop to repeat the updating process. The main body follows mathguy's answer (thanks again). It may complicate the code in this scenario, but it would be useful when there are numerous columns in table1, such as years instead of quarters.
declare
  code varchar2(2000);
begin
  for quart_num in 1..4
  loop
    -- build one MERGE per quarter column and run it dynamically
    code := 'merge into table2 a
             using table1 b
             on (a.project = b.projecttype)
             when matched then
               update set a.value = b.quarter' || quart_num ||
            ' where a.quarter = ' || quart_num;
    execute immediate code;
  end loop;
end;

Laravel relationship method to get some total from array

I have a question about my code. I have tables like these:
tb teacher
|id |name      |id_prodi |
|---|----------|---------|
|1  |Teacher A |1        |
|2  |Teacher B |2        |
|3  |Teacher C |1        |
tb prodi
|id |prodi   |
|---|--------|
|1  |Prodi 1 |
|2  |Prodi 2 |
|3  |Prodi 3 |
tb year
|id |year |
|---|-----|
|1  |2018 |
|2  |2019 |
|3  |2020 |
tb attendence
|id |teacher_id |year_id |date_in    |
|---|-----------|--------|-----------|
|1  |1          |3       |2020-03-20 |
|2  |1          |3       |2020-03-21 |
|3  |3          |3       |2020-03-21 |
|4  |1          |3       |2020-03-22 |
Those are my tables. What I need is to get a result like this in my datatable (I have a combobox condition to select by year):
|#  |name      |year |total_attend |
|---|----------|-----|-------------|
|1  |Teacher A |2020 |3            |
|2  |Teacher C |2020 |1            |
Something like that for the final result.
And this is my controller:
public function showData(Request $request)
{
    $year = $request->year_id;
    $prodi_id = $request->prodi_id;

    $datas = Teacher::whereHas('attendance', function ($query) use ($year) {
            $query->where('year_id', $year);
        })
        ->where('prodi_id', $prodi_id)
        ->get();

    return response()->json([
        'success' => 'Success',
        'data' => $datas
    ]);
}
My problem is that I can't get the total column by counting the attendance rows for the selected year. Does anyone have a clue?
You can achieve this using selectRaw() with count() on the teacher name:
DB::table('attendence')
    ->selectRaw('t.name as name, y.year as year, count(t.name) as total_attend')
    ->join('teacher as t', 't.id', '=', 'attendence.teacher_id')
    ->join('year as y', 'attendence.year_id', '=', 'y.id')
    ->where([['attendence.year_id', $year], ['t.prodi_id', $prodi_id]])
    ->groupBy('name', 'year')
    ->get();

Spark / Java: Why is Dataset.show() not showing all fields when using a custom encoder?

While using map() on a Dataset, it does not return all fields.
Code snippet:
RecordParser parser = new RecordParser();
Dataset<CensusData> censusData =
        records.map(parser, Encoders.bean(CensusData.class));
censusData.show(40);
The above code returns only 5 fields, whereas the class has 13 fields.
output:
+----------+---------+-------+------------+--------+-------------------+
|activityId|contentId|daypart|deviceTypeId|errorMsg| genreId|
+----------+---------+-------+------------+--------+-------------------+
| null| null| null| null| null| null|
| 4| 0002| 1| 1| null| DR1|
| 4| 0004| 1| 2| null|Children (0-12 yrs)|
| | 0018| 1| 3| null| Entertainment|
How can I solve this problem?
Edit:
Details:
Java version: 8
Spark version: 2.1
I found out the issue: the bean class must have getters and setters for every field you want returned.

How to update duplicate rows within a given condition

I have the following table:
|ID |group_id |subjectlist_id |article_id |
|---|---------|---------------|-----------|
|1  |1        |2              |1          |
|2  |2        |2              |1          |
|3  |3        |3              |4          |
|4  |4        |1              |1          |
|5  |5        |1              |1          |
How do I update the table so it looks like this?
|ID |group_id |subjectlist_id |article_id |marked |
|---|---------|---------------|-----------|-------|
|1  |1        |2              |1          |done   |
|2  |2        |2              |1          |done   |
|3  |3        |3              |4          |       |
|4  |4        |1              |1          |       |
|5  |5        |1              |1          |       |
So far I have this query:
$duplicates = DB::table('table')
    ->select('subjectlist_id', 'article_id')
    ->whereIn('group_id', array(1, 2, 3))
    ->groupBy('subjectlist_id', 'article_id')
    ->havingRaw('COUNT(*) > 1')
    ->update(['marked' => 'done']);
Simply remove ->havingRaw('COUNT(*) > 1').
If you add DB::raw('COUNT(*)') to the select, you can see how many records each possible combination of subjectlist_id and article_id has.
