withColumn does not return negative value - azure-databricks

I am trying to add a column to a dataframe using withColumn. If the reported date or acknowledgement date is null it should return -1; otherwise it should return the difference between the two dates.
I write the dataframe to a CSV. It adds the new column to the CSV with the date difference from the otherwise clause, but it does not return -1 when either of the date values is null; the CSV file has a blank value where the when clause should apply. What am I doing wrong?
val df_asbreportssv2 = df_asbreportssv1.withColumn(("AckOverdueby"),
  when(((df_asbreportssv1("nh_reporteddate").isNull) || (df_asbreportssv1("nh_acknowledgementdate").isNull) == "true"), -1)
    .otherwise(datediff((df_asbreportssv1("nh_acknowledgementdate")), (df_asbreportssv1("nh_reporteddate")))))
val TempFilePath = "adl://dldataplatformdev1.azuredatalakestore.net/DDS_Learn/DDS_ASB/temp"
df_asbreportssv2.write
.mode("overwrite")
.format("csv")
.option("header", "true")
.save(TempFilePath)

The first line of code does not work as expected because the result of the null check is compared to the string "true" instead of being passed to when as a boolean condition. Remove that comparison and it works:
val df_asbreportssv2 = df_asbreportssv1.withColumn("AckOverdueby",
  when(df_asbreportssv1("nh_reporteddate").isNull || df_asbreportssv1("nh_acknowledgementdate").isNull, -1)
    .otherwise(datediff(df_asbreportssv1("nh_acknowledgementdate"), df_asbreportssv1("nh_reporteddate"))))
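An equivalent formulation, if you prefer it, is to let datediff produce null and substitute -1 afterwards. A sketch using Spark's built-in coalesce and lit functions on the same columns:
import org.apache.spark.sql.functions.{coalesce, datediff, lit}

// datediff returns null when either date is null; coalesce replaces that null with -1
val df_asbreportssv2 = df_asbreportssv1.withColumn("AckOverdueby",
  coalesce(
    datediff(df_asbreportssv1("nh_acknowledgementdate"), df_asbreportssv1("nh_reporteddate")),
    lit(-1)))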
Hope it helps :)

Related

Excel PowerQuery: how do I add an IsNotNull column

I have a simple function that I'd like to run on the values in a column, resulting in another column.
let
ThisIsNotNull = (input) => if (input = null) then false else true,
Source = ...
Eventually there is a text column with nulls in it; let's call it TextColumn.
I'd like to add another column alongside it with a value of =ThisIsNotNull(TextColumn).
Add Column... Custom Column with the formula
= ThisIsNotNull([NameOfColumnToTest])
But really you can skip the function and just use
= if [NameOfColumnToTest] = null then false else true
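For reference, the step that the Custom Column dialog generates looks roughly like this in the Advanced Editor (a sketch; the step and column names are placeholders):
= Table.AddColumn(Source, "IsNotNull", each [NameOfColumnToTest] <> null, type logical)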

issues returning pyspark dataframe using for loop

I am applying a for loop in pyspark. How can I get the actual values in the dataframe? I am doing dataframe joins and filtering too.
I haven't added the dataset here; I need the approach or pseudocode just to figure out what I am doing wrong here.
Help is really appreciated, I have been stuck on this for a long time.
values1 = values.collect()
temp1 = []
for index, row in enumerate(sorted(values1, key=lambda x: x.w_vote, reverse=False)):
    tmp = data_int.filter(data_int.w_vote >= row.w_vote)
    # Left join service types to results
    it1 = dt.join(master_info, dt.value == master_info.value, 'left').drop(dt.value)
    print(tmp)
    it1 = it1.withcolumn('iteration', F.lit('index')).otherwise(it1.iteration1)
    it1 = it1.collect()[index]
    # concatenate the results to the final hh list
    temp1.append(it1)
    print('iterations left:', total_values - (index + 1), "Threshold:", row.w_vote)
The problem I am facing is that the output of temp1 comes out as below:
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 240 Threshold: 0.1
DataFrame[value_x: bigint, value_y: bigint, type_x: string, type_y: string, w_vote: double]
iterations left: 239 Threshold: 0.2
Why are my actual values not getting displayed in the output as a list?
print applied to a DataFrame executes the __repr__ method of the dataframe, which is what you get. If you want to print the content of the dataframe, use either show to display the first 20 rows, or collect to get the full dataframe back as a list of rows.
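For example, with a dataframe like the tmp in your loop (a sketch, not your actual data):
# print() only shows the schema, via __repr__:
# DataFrame[value_x: bigint, value_y: bigint, ..., w_vote: double]
print(tmp)

# show() renders the first rows as a table on the driver
tmp.show(20)

# collect() returns the full content as a list of Row objects
rows = tmp.collect()
print(rows[:5])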

Talend: Save variable for later use

I'm trying to save a value from the spreadsheet's header for later use as a new column value.
This is the reduced version, with the value (XYZ) in the header:
The value in the header must be used for the new column CODE:
This is my design:
tFilterRow_1 is used to reject rows without values in A, B, C columns.
There is a conditional in tJavaRow_1 to set a global variable:
if (String.valueOf(row1.col_a).equals("CODE:")) {
    globalMap.putIfAbsent("code", row1.col_b);
}
The Var expression in tMap_1 to get the global variable is:
(String)globalMap.get("code")
The Var "code" is mapped to column "code" but I'm getting this output:
a1|b1|c1|
a2|b2|c2|
a3|b3|c3|
What am I missing, or is there a better approach to accomplish this scenario?
Thanks in advance.
Short answer:
In tJavaRow use input_row, or the actual rowN, in this case row4.
Longer answer: how I'd do it.
What I'd do is let the Excel flow in AS-IS. By using some Java tricks we can simply skip the first few rows and then let the rest of the flow go through.
So the filter + tJavaRow combo can be replaced with a tJavaFlex.
In the tJavaFlex I'd do:
begin:
boolean contentFound = false;

main:
if (input_row.col1 != null && input_row.col1.equalsIgnoreCase("Code:")) {
    globalMap.put("code", input_row.col2);
}
if (input_row.col1 != null && input_row.col1.equalsIgnoreCase("Column A:")) {
    contentFound = true;
} else {
    if (false == contentFound) continue;
}
This way you'll simply skip the first few records (i.e. the header) and only care about the actual data.

Spark - How to count number of records by key

This is probably an easy problem, but basically I have a dataset where I have to count the number of females for each country. Ultimately I want to group each count by country, but I am unsure of what to use for the value, since there is no count column in the dataset that I can use as the value in a groupByKey or reduceByKey. I thought of using reduceByKey(), but that requires a key-value pair and I only want to count the key and make a counter the value. How do I go about this?
val lines = sc.textFile("/home/cloudera/desktop/file.txt")
val split_lines = lines.map(_.split(","))
val femaleOnly = split_lines.filter(x => x._10 == "Female")
Here is where I am stuck. The country is at index 13 in the dataset.
The output should look something like this:
(Australia, 201000)
(America, 420000)
etc
Any help would be great.
Thanks
You're nearly there! All you need is a countByValue:
val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Prints (Australia, 230), (America, 23242), etc.
(In your example, I assume you meant x(10) rather than x._10)
All together:
sc.textFile("/home/cloudera/desktop/file.txt")
.map(_.split(","))
.filter(x => x(10) == "Female")
.map(_(13))
.countByValue()
Have you considered manipulating your RDD using the DataFrames API?
It looks like you're loading a CSV file, which you can do with spark-csv.
Then it's a simple matter (if your CSV is titled with the obvious column names) of:
import com.databricks.spark.csv._
import sqlContext.implicits._   // needed for the $"colname" column syntax

val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already splits by field
  .filter($"gender" === "Female")
  .groupBy("country").count().show()
If you want to go deeper in this kind of manipulation, here's the guide:
https://spark.apache.org/docs/latest/sql-programming-guide.html
You can easily create a key, it doesn't have to be in the file/database. For example:
val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")   // x is an Array[String], so index it with x(10), not x._10
  .map(x => (x(13), x(10)))         // <<<< here you generate a new key: (country, gender)
  .groupByKey();
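Note that groupByKey on its own only groups the rows; to end up with the (country, count) pairs you asked for, you still have to reduce each group to its size. A sketch of the same pipeline with reduceByKey (assuming the same file and column indexes):
val countsByCountry = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")
  .map(x => (x(13), 1))      // (country, 1) for every female record
  .reduceByKey(_ + _)        // sum the 1s per country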

Linq DateTime comparison not working

I have the following code:
DateTime timeStamp = Convert.ToDateTime(Request.QueryString["TimeStamp"]);
var result = (from rs in db.VRec
where
rs.TimeStamp == timeStamp &&
rs.Fixure == wFixture
select rs).ToList();
The result contains 0 records even though the correct timeStamp is passed.
If I remove the part where I do the TimeStamp comparison:
rs.TimeStamp == timeStamp
The code works fine.
Any idea on why the datetime comparison may not be working?
DateTime has a pretty fine resolution - likely you are comparing timestamps that only differ in milliseconds, which will fail. You probably want something like:
DateTime now = DateTime.Now;
DateTime then = now.Add(TimeSpan.FromMilliseconds(1));
const int EPSILON_MS = 10;

// use the absolute difference so the order of the two timestamps doesn't matter
if (Math.Abs(now.Subtract(then).TotalMilliseconds) < EPSILON_MS)
{
    Console.WriteLine("More or less equal!");
}
Linq converts DateTime arguments to DateTime2 in the SQL query that is executed.
That is, when you do the comparison the actual SQL executed will compare a DateTime to a DateTime2. This comparison will "cast" the DateTime to a DateTime2, and the millisecond part will be expanded to a greater resolution (in an odd way in my opinion; please enlighten me).
Try to execute the following sql:
declare @d1 datetime = '2016-08-24 06:53:01.383'
declare @d2 datetime2 = '2016-08-24 06:53:01.383'
declare @d3 datetime2 = @d1

select @d1 as 'd1', @d2 'd2', @d3 'converted'

select (case when (@d1 = @d2) then 'True' else 'False' end) as 'Equal',
       (case when (@d1 > @d2) then 'True' else 'False' end) as 'd1 greatest'
From the question, I do not know if you want to compare the date with time or only the date part. If you only want to compare the date, then the following would work:
var result = (from rs in db.VRec
where
rs.TimeStamp.Date == timeStamp.Date &&
rs.Fixure == wFixture
select rs).ToList();
Since you are using some reference to db, it gives me a feeling that you are fetching your records from a database (which ORM you are using is not obvious from the question or tags). Assuming that you are using Entity Framework, the above query will fail with an exception saying that .Date has no direct translation to SQL. If so, you can rewrite the query as follows to make it work:
var result = (from rs in db.VRec
where
rs.TimeStamp.Day == timeStamp.Day &&
rs.TimeStamp.Month == timeStamp.Month &&
rs.TimeStamp.Year == timeStamp.Year &&
rs.Fixure == wFixture
select rs).ToList();
The benefit of this approach is that you can compare properties to an arbitrarily deep level, i.e. you can compare Hours, Minutes, Seconds, etc. in your query. The second query is tested in Entity Framework 5.
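If you are on Entity Framework, another option for a date-only comparison is the canonical TruncateTime function, which does translate to SQL. A sketch (DbFunctions is the EF6 name; EF5 exposes the same method as EntityFunctions):
// using System.Data.Entity;  (EF6)
var result = (from rs in db.VRec
              where DbFunctions.TruncateTime(rs.TimeStamp) == timeStamp.Date
                    && rs.Fixure == wFixture
              select rs).ToList();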
