How to apply a function to each row in a group after groupby in PySpark? - user-defined-functions

I have data like this:
userID sessionID time
"" xxx 2019-06-01
"" xxx 2019-06-02
user1 xxx 2019-06-03
"" yyy 2019-06-04
user2 yyy 2019-06-05
"" yyy 2019-06-06
user3 yyy 2019-06-07
What I want is:
userID sessionID time
user1 xxx 2019-06-01
user1 xxx 2019-06-02
user1 xxx 2019-06-03
user2 yyy 2019-06-04
user2 yyy 2019-06-05
user3 yyy 2019-06-06
user3 yyy 2019-06-07
Can I group by sessionID, apply a UDF to each group, and get the userID for each row in each session?
Update:
I solved it by replacing the empty strings with null and then filling within each session (note: the column names 'jsession', 'request_time' and 'userid' below differ from the sample data above):
from pyspark.sql import Window
from pyspark.sql.functions import col, first, when
import sys

# replace empty strings with null so first(..., ignorenulls=True) can skip them
df = df.withColumn('userid', when(col('userid') == '', None).otherwise(col('userid')))

# window over each session, ordered by time, from the current row to the end
window = Window.partitionBy('jsession')\
    .orderBy('request_time')\
    .rowsBetween(0, sys.maxsize)

# for each row, take the first non-null userid at or after it
filled_column = first(df['userid'], ignorenulls=True).over(window)

# do the fill
df = df.withColumn('filled_userid', filled_column)
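For intuition, the same fill can be sketched in plain Python without Spark: within each session, ordered by time, every row takes the first non-missing userID at or after it. This is a minimal sketch using the sample data from the question, with None standing in for the empty strings:

```python
from itertools import groupby

def fill_user_ids(rows):
    """rows: list of (userID or None, sessionID, time) tuples.
    Returns the rows with each missing userID replaced by the first
    non-null userID at or after it within the same session."""
    out = []
    # sort by session, then time, so each session is contiguous and ordered
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        filled = []
        carry = None
        # walk backwards, carrying the next known userID down to earlier rows
        for user, sess, t in reversed(list(group)):
            if user is not None:
                carry = user
            filled.append((carry, sess, t))
        out.extend(reversed(filled))
    return out

data = [
    (None, "xxx", "2019-06-01"), (None, "xxx", "2019-06-02"),
    ("user1", "xxx", "2019-06-03"), (None, "yyy", "2019-06-04"),
    ("user2", "yyy", "2019-06-05"), (None, "yyy", "2019-06-06"),
    ("user3", "yyy", "2019-06-07"),
]
```

Running `fill_user_ids(data)` reproduces the desired output above: the two leading rows of session xxx get user1, and session yyy splits between user2 and user3.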


Related

Connecting oracle database using Spark/Scala getting error

I'm new to Scala. I have a text file which I'm trying to read and load into a dataframe, and thereafter load into a database. While loading into the database I'm getting the error given below; on the other hand, when I use the same credentials in Toad I can connect successfully. Any help will be appreciated.
text.txt
TYPE,CODE,SQ_CODE,RE_TYPE,VERY_ID,IN_DATE,DATE
"F","000544","2017002","OP","95032015062763298","20150610","20150529"
"F","000544","2017002","LD","95032015062763261","20150611","20150519"
"F","000544","2017002","AK","95037854336743246","20150611","20150429"
val sparkSession = SparkSession.builder().master("local").appName("IT_DATA").getOrCreate()

val Driver = "oracle.jdbc.driver.OracleDriver"
val Url = "jdbc:oracle:thin:@xxx.com:1521/DATA00.WORLD"
val username = "xxx"
val password = "xxx"

val dbProp = new java.util.Properties
dbProp.setProperty("driver", Driver)
dbProp.setProperty("user", username)
dbProp.setProperty("password", password)

// Create dataframe object
val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("location", "/xx/xx/xx/xx/test.csv")
  .option("delimiter", ",")
  .option("dateFormat", "yyyyMMdd")
  .load().cache()

df.write.mode("append").jdbc(Url, TableTemp, dbProp)
df.show
+-----+-------+---------+---------+-------------------+---------+-------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE| DATE |
+-----+-------+---------+---------+-------------------+---------+-------------+
| F | 000544| 2017002| OP | 95032015062763298| 20150610| 2015-05-29|
| F | 000544| 2017002| LD | 95032015062763261| 20150611| 2015-05-19|
| F | 000544| 2017002| AK | 95037854336743246| 20150611| 2015-04-29|
+-----+-------+---------+---------+-------------------+---------+-------------+
Error
java.sql.SQLException: ORA-01017: invalid username/password; logon denied

monetdb remote table: cannot register

I have two nodes and am attempting to create a remote table.  To set up I do the following:
on each host:
$ monetdbd create /opt/mdbdata/dbfarm
$ monetdbd set listenaddr=0.0.0.0 /opt/mdbdata/dbfarm
$ monetdbd start /opt/mdbdata/dbfarm
On the first host:
$ monetdb create w0
$ monetdb release w0
On second:
$ monetdb create mst
$ monetdb release mst
$ mclient -u monetdb -d mst
password:
Welcome to mclient, the MonetDB/SQL interactive terminal (Dec2016-SP4)
Database: MonetDB v11.25.21 (Dec2016-SP4), 'mapi:monetdb://nkcdev11:50000/mst'
Type \q to quit, \? for a list of available commands
auto commit mode: on
sql>create table usr ( id integer not null, name text not null );
operation successful (0.895ms)
sql>insert into usr values(1,'abc'),(2,'def');
2 affected rows (0.845ms)
sql>select * from usr;
+------+------+
| id   | name |
+======+======+
|    1 | abc  |
|    2 | def  |
+------+------+
2 tuples (0.652ms)
sql>
On first:
$ mclient -u monetdb -d w0
password:
Welcome to mclient, the MonetDB/SQL interactive terminal (Dec2016-SP4)
Database: MonetDB v11.25.21 (Dec2016-SP4), 'mapi:monetdb://nkcdev10:50000/w0'
Type \q to quit, \? for a list of available commands
auto commit mode: on
sql>create remote table usr_rmt ( id integer not null, name text not null ) on 'mapi:monetdb://nkcdev11:50000/mst';
operation successful (1.222ms)
sql>select * from usr_rmt;
(mapi:monetdb://monetdb@nkcdev11/mst) Cannot register
project (
table(sys.usr_rmt) [ usr_rmt.id NOT NULL, usr_rmt.name NOT NULL ] COUNT 
) [ usr_rmt.id NOT NULL, usr_rmt.name NOT NULL ] REMOTE mapi:monetdb://nkcdev11:50000/mst
sql>
$
$ monetdb discover
             location
mapi:monetdb://nkcdev10:50000/w0
mapi:monetdb://nkcdev11:50000/mst
Can anyone nudge me in the right direction?
[EDIT - Solved]
The problem was self-inflicted: the remote table name must be exactly the same as the local table name, but I had used usr_rmt while the local table was usr. The statement should have been:
create remote table usr ( id integer not null, name text not null ) on 'mapi:monetdb://nkcdev11:50000/mst';
At first sight, what you are trying to do ought to work.
Recently I had similar problems with remote table access, though that was with the non-released version; see bug 6289. (The MonetDB version number mentioned in that bug report is incorrect.) What you are experiencing may or may not be the same underlying issue.
After the weekend I will check whether I can reproduce your example on -SP4 and on the development version.
Joeri

schedule and automate sqoop import/export tasks

I have a sqoop job which requires to import data from oracle to hdfs.
The sqoop command I'm using is:
sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '1' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test1 --fields-terminated-by '\t'
I am re-running the same command again and again, only changing partitionid from 1 to 96, so I have to execute the sqoop import manually 96 times. The table ORDERS contains millions of rows, and each row has a partitionid from 1 to 96. I need to import 10,000 rows (rownum < 10001) from each partitionid into HDFS.
Is there any way to do this? How can I automate the sqoop job?
Run the script: $ ./script.sh 20    (for the 20th partition)
ramisetty@HadoopVMbox:~/ramu$ cat script.sh
#!/bin/bash
PART_ID=$1
TARGET_DIR_ID=$PART_ID
echo "PART_ID:" $PART_ID "TARGET_DIR_ID: "$TARGET_DIR_ID
sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '$PART_ID' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test/$TARGET_DIR_ID --fields-terminated-by '\t'
For all partitions 1 to 96 in a single shot:
ramisetty@HadoopVMbox:~/ramu$ cat script_for_all.sh
#!/bin/bash
for part_id in {1..96};
do
PART_ID=$part_id
TARGET_DIR_ID=$PART_ID
echo "PART_ID:" $PART_ID "TARGET_DIR_ID: "$TARGET_DIR_ID
sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '$PART_ID' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test/$TARGET_DIR_ID --fields-terminated-by '\t'
done
Use crontab for scheduling; see man crontab in a terminal for documentation.
Put your sqoop import command in a shell script and execute that script from crontab.
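If you would rather generate the 96 commands programmatically (for example to feed a workflow scheduler), the same parameterization can be sketched in Python; the connect string and target paths below simply mirror the placeholders from the question:

```python
def build_sqoop_commands(first_part=1, last_part=96):
    """Build one sqoop import command string per partitionid,
    mirroring the shell loop above."""
    template = (
        "sqoop import --connect jdbc:oracle:thin:@hostname:port/service "
        "--username sqoop --password sqoop "
        "--query \"SELECT * FROM ORDERS WHERE orderdate = "
        "To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '{pid}' "
        "AND rownum < 10001 AND \\$CONDITIONS\" "
        "--target-dir /test/{pid} --fields-terminated-by '\\t'"
    )
    return [template.format(pid=p) for p in range(first_part, last_part + 1)]

# one command per partition, ready to hand to a scheduler or subprocess call
commands = build_sqoop_commands()
```

Each command differs only in the partitionid predicate and the target directory, so the list can be executed sequentially or fanned out in parallel.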

Iteration of batch script based on search string

I have a Program2.bat containing an Oracle query, which generates a file log.txt whose content is structured either like:
SQL> SELECT SID, SERIAL#, STATUS, USERNAME, LOGON_TIME FROM GV$SESSION WHERE USERNAME = 'DBADMIN' and status in ('ACTIVE','INACTIVE','KILLED') ;
no rows selected
SQL> spool off;
OR like:
SQL> SELECT SID, SERIAL#, STATUS, USERNAME, LOGON_TIME FROM GV$SESSION WHERE USERNAME = 'DBADMIN' and status in ('ACTIVE','INACTIVE','KILLED') ;
SID SERIAL# STATUS USERNAME LOGON_TI
---------- ---------- ---------- ---------- -------------
2388 54 Active DBADMIN 28-FEB-14
2391 37 Active DBADMIN 17-FEB-14
2395 111 Inactive DBADMIN 28-FEB-14
2780 325 Killed DBADMIN 26-FEB-14
2790 111 Killed DBADMIN 8-FEB-14
5 rows selected.
SQL> spool off
I'm trying to achieve following things:
Action-1: Call Program2.bat, which will generate log.txt
Action-2: Search for the string "no rows selected" in log.txt
Action-3: If string found then call program1.bat
Action-4: If the string is not found, then:
call program2.bat again, and
do Action-2 again at an interval of 5 minutes (i.e. keep searching for the string "no rows selected" in log.txt every 5 minutes until it is found).
@echo off
setlocal
Call Program2.bat
>nul findstr /c:"no rows selected" log.txt && (
call program1.bat
) || (
call program2.bat
)
How can point 2 of Action-4 be achieved in the above script, or is there another way to achieve all of this more easily?
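For comparison, the whole Action-1 through Action-4 loop can be sketched in Python; the .bat file names come from the question, and the check itself is a plain substring test:

```python
import subprocess
import time

def log_indicates_no_rows(log_text):
    # Action-2: the success condition is that the log contains "no rows selected"
    return "no rows selected" in log_text

def poll_until_no_rows(interval_s=300):
    # Action-1/Action-4: run Program2.bat to regenerate log.txt, check it,
    # and retry every 5 minutes (300 s) until the string is found
    while True:
        subprocess.run("Program2.bat", shell=True)
        with open("log.txt") as f:
            if log_indicates_no_rows(f.read()):
                subprocess.run("program1.bat", shell=True)  # Action-3
                return
        time.sleep(interval_s)
```

Separating the substring check into its own function keeps the condition testable apart from the process-spawning loop.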

Appending heredoc string after a specific heredoc string if file doesn't contain that heredoc string?

I want to append a heredoc string after a specific heredoc string if file doesn't contain it already.
Eg.
Here are 2 files:
# file1
Description:
I am a coder
Username: user1
Password: password1
# file2
Description:
I am a coder
Username: user2
Password: password2
Address:
Email: user@gmail.com
Street: user street 19 A
I want to add:
Address:
Email: user@gmail.com
Street: user street 19 A
if file doesn't contain it already, and after:
Description:
I am a coder
So in the above files it will be added to the first one only. and that file will then look like this:
# file1
Description:
I am a coder
Address:
Email: user@gmail.com
Street: user street 19 A
Username: user1
Password: password1
How could I do this in Ruby?
The question is not well formulated; you are getting the concept of heredocs confused: a heredoc is just a way to write a multi-line string literal.
I'll leave some code which I hope helps with your task in some way:
end_of_line_delimiter = "\n"
file1_arr = File.read('file1.txt').split(end_of_line_delimiter) # array of lines

file1_has_address = file1_arr.index { |a_line| a_line =~ /^Address:/ }
unless file1_has_address
  # file1 does not contain "Address:", so build the address text
  email = "some@email"
  street = "some street"
  address_txt = <<~END
    Address:
    Email: #{email}
    Street: #{street}
  END

  # insert address_txt 2 lines after the "Description:" line
  description_line_index = file1_arr.index { |a_line| a_line =~ /^Description:/ }
  raise "Trying to insert address, but 'Description:' line was not found!" unless description_line_index
  insert_line_index = description_line_index + 2
  file1_arr.insert(insert_line_index, *address_txt.split(end_of_line_delimiter))
end

# file1_arr will now have any address lines added
file1_txt = file1_arr.join(end_of_line_delimiter)
puts file1_txt
Please report back any success with the code :)
