Save a file in HDFS from PySpark - hadoop

I have an empty table in Hive, i.e. there are no records in that table.
Using this empty table I have created a data frame in PySpark:
df = sqlContext.table("testing.123_test")
I have registered this data frame as a temp table:
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
This table has a column called id.
Now I want to query the temp table like below:
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I want to save date, min_id and max_id into a file in HDFS.
I have done it like below:
from pyspark.sql import functions as f
(sqlContext.table("myTempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))
Now when I check the file in HDFS it shows all NULL values.
The file output in HDFS is below:
NULL,NULL,NULL
What I want is
Date,0,0
Here Date is the current timestamp.
How can I achieve what I want?

This is in Scala, but you should easily be able to replicate it in Python.
The function you need here is na.fill. You'll have to replace the Scala Map with a Python dictionary in the code below:
This is what your DF looks like:
scala> nullDF.show
+----+----+----+
|date|   x|   y|
+----+----+----+
|null|null|null|
+----+----+----+
// You have already done this using Python's datetime functions
val format = new java.text.SimpleDateFormat("dd/MM/yyyy HH:mm:ss")
val curr_timestamp = format.format(new java.util.Date())
//Use na fill to replace null values
//Column names as keys in map
//And values are what you want to replace NULL with
val df = nullDF.na.fill(scala.collection.immutable.Map(
  "date" -> curr_timestamp,
  "x" -> "0",
  "y" -> "0"))
This should give you
+-------------------+---+---+
|               date|  x|  y|
+-------------------+---+---+
|10/06/2017 12:10:20|  0|  0|
+-------------------+---+---+
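For reference, here is a minimal PySpark sketch of the same na.fill idea, reusing the temp table, id column and output path from the thread; the intermediate column names date/x/y follow the Scala answer. Note it uses f.lit(date) directly rather than f.first(f.lit(date)), since first returns null on an empty table. Treat it as a sketch rather than a drop-in:
from datetime import datetime
from pyspark.sql import functions as f

date = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Select the three values as separate columns; min/max over an empty table
# produce nulls, which na.fill then replaces with 0.
row_df = (sqlContext.table("mytempTable")
          .select(f.lit(date).alias("date"),
                  f.min("id").alias("x"),
                  f.max("id").alias("y"))
          .na.fill({"x": 0, "y": 0}))

# Concatenate into a single string column and append to the HDFS path.
(row_df.select(f.concat_ws(",", f.col("date"),
                           f.col("x").cast("string"),
                           f.col("y").cast("string")))
       .coalesce(1)
       .write.format("text").mode("append").save("/tmp/fooo"))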

Related

Match individual records during Batch predictions with VertexAI pipeline

I have a custom model in Vertex AI and a table storing the features for the model along with the record_id.
I am building a pipeline component for the batch prediction and facing a critical issue.
When I submit the batch prediction, I have to exclude the record_id from the job, but how can I map each record if I don't have the record_id in the result?
from google.cloud import bigquery
from google.cloud import aiplatform
aiplatform.init(project=project_id)
client = bigquery.Client(project=project_id)
query = '''
SELECT * except(record_id) FROM `table`
'''
df = client.query(query).to_dataframe() # drop the record_id and load it to another table
job = client.load_table_from_dataframe(
    df, "table_wo_id",
)
clf = aiplatform.Model(model_name='custom_model')
clf.batch_predict(
    job_display_name='custom model batch prediction',
    bigquery_source='bq://table_wo_id',
    instances_format='bigquery',
    bigquery_destination_prefix='bq://prediction_result_table',
    predictions_format='bigquery',
    machine_type='n1-standard-4',
    max_replica_count=1,
)
As in the above example, there is no record_id column in prediction_result_table, so there is no way to map the results back to each record.

issues returning pyspark dataframe using for loop

I am applying a for loop in PySpark. How can I get the actual values in the dataframe? I am doing dataframe joins and filtering too.
I haven't added a dataset here; I just need the approach or pseudocode to figure out what I am doing wrong.
Help is really appreciated, I have been stuck on this for a long time.
values1 = values.collect()
temp1 = []
for index, row in enumerate(sorted(values1, key=lambda x: x.w_vote, reverse=False)):
    tmp = data_int.filter(data_int.w_vote >= row.w_vote)
    # Left join service types to results
    it1 = dt.join(master_info, dt.value == master_info.value, 'left').drop(dt.value)
    print(tmp)
    it1 = it1.withColumn('iteration', F.lit('index')).otherwise(it1.iteration1)
    it1 = it1.collect()[index]
    # concatenate the results to the final hh list
    temp1.append(it1)
    print('iterations left:', total_values - (index + 1), "Threshold:", row.w_vote)
The problem I am facing is that the output of temp1 comes out as below:
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 240 Threshold: 0.1
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 239 Threshold: 0.2
Why are my actual values not getting displayed in the output as a list?
print applied to a DataFrame executes the DataFrame's __repr__ method, which is what you are seeing. If you want to print the content of the DataFrame, use either show to display the first 20 rows, or collect to get the full DataFrame back as a list of rows.
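For example, with it1 being any of the DataFrames built in the loop:
print(it1)           # only prints the schema: DataFrame[value_x: bigint, ...]

it1.show()           # displays the first 20 rows as a table

rows = it1.collect() # brings all rows back to the driver as a list of Row objects
print(rows)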

Extracting a specific number from a string using regex function in Spark SQL

I have a table in MySQL which has POST_ID and a corresponding INTEREST:
I used the following regular expression query to select interests containing 1, 2 and 3:
SELECT * FROM INTEREST_POST where INTEREST REGEXP '(?=.*[[:<:]]1[[:>:]])(?=.*[[:<:]]3[[:>:]])(?=.*[[:<:]]2[[:>:]])';
I imported the table into HDFS. However, when I use the same query in Spark SQL, it returns null records.
How do I use the REGEXP function here in Spark to select interests containing 1, 2 and 3?
The regex you are using needs to be changed a bit. You could do something like the following:
scala> val myDf2 = spark.sql("SELECT * FROM INTEREST_POST where INTEREST REGEXP '^[1-3](,[1-3])*$'")
myDf2: org.apache.spark.sql.DataFrame = [INTEREST_POST_ID: int, USER_POST_ID: int ... 1 more field]
scala> myDf2.show
+----------------+------------+--------+
|INTEREST_POST_ID|USER_POST_ID|INTEREST|
+----------------+------------+--------+
|               1|           1|   1,2,3|
+----------------+------------+--------+
I got the solution. You can do something like this:
var result = hiveContext.sql("""SELECT USER_POST_ID
| FROM INTEREST_POST_TABLE
| WHERE INTEREST REGEXP '(?=.*0[1])(?=.*0[2])(?=.*0[3])' """)
result.show
Fetching Records from INTEREST_POST_TABLE
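One likely culprit is that Spark SQL's REGEXP/RLIKE uses Java regular expressions, which do not understand MySQL's [[:<:]] / [[:>:]] word boundaries; Java's \b boundary with the same lookaheads should behave like the MySQL query. A hedged PySpark sketch, assuming the imported data is available as a table named INTEREST_POST and sqlContext is in scope:
from pyspark.sql import functions as f

df = sqlContext.table("INTEREST_POST")

# Java regex: one lookahead per required token, \b instead of [[:<:]] / [[:>:]]
pattern = r"(?=.*\b1\b)(?=.*\b2\b)(?=.*\b3\b)"

df.filter(f.col("INTEREST").rlike(pattern)).show()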

SparkRDD Operations

Let's assume I have a table of two columns, A and B, in a CSV file. I pick the maximum value from column A [max value = 100] and I need to return the corresponding value of column B [return value = AliExpress] using JavaRDD operations, without using DataFrames.
Input Table :
COLUMN A Column B
56 Walmart
72 Flipkart
96 Amazon
100 AliExpress
Output Table:
COLUMN A Column B
100 AliExpress
This is what I have tried till now.
Source code:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));
From the above code I can fetch only one column's data. Is there any way to get two columns? Any suggestions? Thanks in advance.
You can use either the top or the takeOrdered function to achieve it.
rdd.top(1) //gives you top element in your RDD
Data:
COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress
Creating df using Spark 2
val df = sqlContext.read.option("header", "true")
.option("inferSchema", "true")
.csv("filelocation")
df.show
import sqlContext.implicits._
import org.apache.spark.sql.functions._
Using Dataframe functions
df.orderBy(desc("COLUMN_A")).take(1).foreach(println)
OUTPUT:
[100,AliExpress]
Using RDD functions
df.rdd
.map(row => (row(0).toString.toInt, row(1)))
.sortByKey(false)
.take(1).foreach(println)
OUTPUT:
(100,AliExpress)
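Since the question asked for an RDD-only approach, here is a minimal PySpark sketch of the same top idea with a key function; the file path, the header filter and the column order are assumptions based on the sample data above:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse "value,name" lines into (int, str) pairs, skipping the header line.
pairs = (sc.textFile("/path/to/data.csv")
           .filter(lambda line: not line.startswith("COLUMN_A"))
           .map(lambda line: line.split(","))
           .map(lambda cols: (int(cols[0]), cols[1])))

# top(1) with a key returns the record with the largest column A value.
print(pairs.top(1, key=lambda p: p[0]))   # [(100, 'AliExpress')]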

How to split rows of a Spark RDD by Delimiter

I am trying to split data in Spark into the form of an RDD of Array[String]. Currently I have loaded the file into an RDD of String.
> val csvFile = textFile("/input/spam.csv")
I would like to split on a "," delimiter.
This:
val csvFile = textFile("/input/spam.csv").map(line => line.split(","))
returns an RDD[Array[String]].
If you need the first column as its own RDD, then use the map function to return only the first element of each Array:
val firstCol = csvFile.map(_(0))
You should be using the spark-csv library, which is able to parse your file taking headers into account and allows you to specify the delimiter. It also does a pretty good job at inferring the schema. I'll let you read the documentation to discover the many options at your disposal.
It may look like this:
sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","your delimitor")
.load(pathToFile)
Be aware that this returns a DataFrame, which you may have to convert to an RDD using the .rdd method.
Of course, you will have to load the package into the driver for it to work.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// convert dataframe to rdd[row]
val rddRow = df.rdd
// print 2 rows
rddRow.take(2)
// convert df to rdd[string] for specific column
val oneColumn = df.select("colName").as[(String)].rdd
oneColumn.take(2)
// convert df to rdd[string] for multiple columns
val multiColumn = df.select("col1Name","col2Name").as[(String, String)].rdd
multiColumn.take(2)
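For completeness, a rough PySpark equivalent of the same flow; the path and column names are placeholders, exactly as in the Scala version, and the final maps replace the Dataset encoders that Python does not have:
from pyspark.sql import SparkSession

# create spark session
spark = (SparkSession.builder
         .master("local")
         .appName("Spark CSV Reader")
         .getOrCreate())

# read csv with a header, dropping malformed rows
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .option("delimiter", ",")
      .load("/your/csv/dir/simplecsv.csv"))

# convert dataframe to rdd[Row] and print 2 rows
print(df.rdd.take(2))

# rdd of plain values for one or several columns
one_column = df.select("colName").rdd.map(lambda row: row[0])
multi_column = df.select("col1Name", "col2Name").rdd.map(lambda row: (row[0], row[1]))
print(one_column.take(2))
print(multi_column.take(2))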
