Using hash mod to sample a dataframe - hadoop

I have a dataframe with a field transactionId and I want to sample on this
field. I want to sample on the hash of the field because the sampled data will be joined to the sample of another dataframe, and I want to have the same ids in both samples. The problem is that I'm getting stuck on how to hash and mod within a filter, having tried various versions of this:
scala> val dfSampled = df.filter($"transactionId".hashCode() % 10 == 0)
<console>:27: error: overloaded method value filter with alternatives:
(conditionExpr: String)org.apache.spark.sql.DataFrame <and>
(condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
val dfSampled = df.filter($"transactionId".hashCode() % 10 == 0)
^
Can anyone give me some advice?

This is incorrect for two different reasons:
you take the hash of a column object, not of the values in the DataFrame,
you use the wrong equality operator.
A correct solution would be something like this:
import org.apache.spark.sql.functions.hash
val df = sc.range(1L, 100L).toDF("transactionId")

df.filter(hash($"transactionId") % 10 === 0).show
// +-------------+
// |transactionId|
// +-------------+
// |            4|
// |           16|
// |           18|
// |           26|
// |           27|
// +-------------+
Please note that it uses Murmur3Hash, not Java hash codes.
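To connect this back to the join use case in the question, here is a minimal sketch in the same spark-shell session, assuming a second DataFrame df2 that shares the transactionId column (df2 and its amount column are made up for illustration). Because hash is deterministic, both filters keep exactly the same set of ids, so the subsequent join does not drop sampled rows:
import org.apache.spark.sql.functions.hash

// Hypothetical second DataFrame sharing the transactionId key.
val df2 = sc.range(1L, 100L).toDF("transactionId")
  .withColumn("amount", $"transactionId" * 2)

// The same deterministic hash-mod filter applied to both sides.
val sampleA = df.filter(hash($"transactionId") % 10 === 0)
val sampleB = df2.filter(hash($"transactionId") % 10 === 0)

// Every id kept in sampleA is also kept in sampleB, so the join is lossless.
val joined = sampleA.join(sampleB, "transactionId")
joined.show()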

Related

Extracting a specific number from a string using regex function in Spark SQL

I have a table in MySQL which has POST_ID and a corresponding INTEREST:
I used the following regular expression query to select interests containing 1, 2, 3.
SELECT * FROM INTEREST_POST where INTEREST REGEXP '(?=.*[[:<:]]1[[:>:]])(?=.*[[:<:]]3[[:>:]])(?=.*[[:<:]]2[[:>:]])';
I imported the table into HDFS. However, when I use the same query in Spark SQL, it returns null records.
How can I use the REGEXP function in Spark SQL to select interests containing 1, 2, and 3?
The regex you are using needs to be changed a bit, because Spark SQL applies Java regular expressions, which do not understand MySQL's [[:<:]] and [[:>:]] word-boundary tokens. You could do something like the following.
scala> val myDf2 = spark.sql("SELECT * FROM INTEREST_POST where INTEREST REGEXP '^[1-3](,[1-3])*$'")
myDf2: org.apache.spark.sql.DataFrame = [INTEREST_POST_ID: int, USER_POST_ID: int ... 1 more field]
scala> myDf2.show
+----------------+------------+--------+
|INTEREST_POST_ID|USER_POST_ID|INTEREST|
+----------------+------------+--------+
| 1| 1| 1,2,3|
+----------------+------------+--------+
I got the solution. You can do something like this:
var result = hiveContext.sql("""SELECT USER_POST_ID
| FROM INTEREST_POST_TABLE
| WHERE INTEREST REGEXP '(?=.*0[1])(?=.*0[2])(?=.*0[3])' """)
result.show
Fetching Records from INTEREST_POST_TABLE
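As an additional sketch using the DataFrame API (the sample rows and DataFrame name below are made up; only the column names come from the question), the same containment check can be written with Column.rlike. Spark evaluates RLIKE with Java regular expressions, so lookaheads and \b word boundaries are available here:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical data standing in for INTEREST_POST.
val interestDf = Seq(
  (1, 1, "1,2,3"),
  (2, 1, "1,4")
).toDF("INTEREST_POST_ID", "USER_POST_ID", "INTEREST")

// Keep rows whose INTEREST contains 1, 2 and 3 as whole tokens, in any order.
val matched = interestDf.filter(
  $"INTEREST".rlike("(?=.*\\b1\\b)(?=.*\\b2\\b)(?=.*\\b3\\b)")
)
matched.show()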

Avoid multiple sums in custom crossfilter reduce functions

This question arises from some difficulties in creating a crossfilter dataset, in particular regarding how to group the different dimensions and compute derived values. The final aim is to have a number of dc.js graphs using the dimensions and groups.
(Fiddle example https://jsfiddle.net/raino01r/0vjtqsjL/)
Question
Before going on with the explanation of the setting, the key question is the following:
How can I create custom add, remove, and init functions to pass to .reduce so that the first two do not sum the same feature multiple times?
Data
Let's say I want to monitor the failure rate of a number of machines (just an example). I do this using different dimensions: month, machine's location (room), and type of failure.
For example I have the data in the following form:
| month | room | failureType | failCount | machineCount |
|---------|------|-------------|-----------|--------------|
| 2015-01 | 1 | A | 10 | 5 |
| 2015-01 | 1 | B | 2 | 5 |
| 2015-01 | 2 | A | 0 | 3 |
| 2015-01 | 2 | B | 1 | 3 |
| 2015-02 | . | . | . | . |
Expected
For the three given dimensions, I should have:
month_1_rate = $\frac{10+2+0+1}{5+3}$;
room_1_rate = $\frac{10+2}{5}$;
type_A_rate = $\frac{10+0}{5+3}$.
Idea
Essentially, what counts in this setting is the pair (month, room): given a month and a room there should be a rate attached to them (the crossfilter should then take the other filters into account).
Therefore, a way to go could be to store the pairs that have already been used and not sum machineCount for them again; however, we still want to update the failCount value.
Attempt (failing)
My attempt was to create custom reduce functions that do not sum machineCount values that were already taken into account.
However, there is some unexpected behaviour. I'm sure this is not the way to go, so I hope to get some suggestions on this.
// A dimension is one of:
// ndx = crossfilter(data);
// ndx.dimension(function(d){return d.month;})
// ndx.dimension(function(d){return d.room;})
// ndx.dimension(function(d){return d.failureType;})
// Goal: have a general way to get the group given the dimension:
function get_group(dim){
return dim.group().reduce(add_rate, remove_rate, initial_rate);
}
// month is given as datetime object
var monthNameFormat = d3.time.format("%Y-%m");
//
function check_done(p, v){
return p.done.indexOf(v.room+'_'+monthNameFormat(v.month))==-1;
}
// The three functions needed for the custom `.reduce` block.
function add_rate(p, v){
var index = check_done(p, v);
if (index) p.done.push(v.room+'_'+monthNameFormat(v.month));
var count_to_sum = (index)? v.machineCount:0;
p.mach_count += count_to_sum;
p.fail_count += v.failCount;
p.rate = (p.mach_count==0) ? 0 : p.fail_count*1000/p.mach_count;
return p;
}
function remove_rate(p, v){
var index = check_done(p, v);
var count_to_subtract = (index)? v.machineCount:0;
if (index) p.done.push(v.room+'_'+monthNameFormat(v.month));
p.mach_count -= count_to_subtract;
p.fail_count -= v.failCount;
p.rate = (p.mach_count==0) ? 0 : p.fail_count*1000/p.mach_count;
return p;
}
function initial_rate(){
return {rate: 0, mach_count:0, fail_count:0, done: new Array()};
}
Connection with dc.js
As mentioned, the previous code is needed to create dimension, group to be passed in three different bar graphs using dc.js.
Each graph will have .valueAccessor(function(d){return d.value.rate};).
See the jsfiddle (https://jsfiddle.net/raino01r/0vjtqsjL/) for an implementation. The numbers are different, but the data structure is the same. Notice that in the fiddle you would expect a machine count of 18 (in both months), however you always get double that (because of the 2 different locations).
Edit
Reductio + dc.js
Following Ethan Jewett's answer, I used Reductio to take care of the grouping. The updated fiddle is here: https://jsfiddle.net/raino01r/dpa3vv69/
My reducer object needs two exceptions (month, room) when summing the machineCount values. Hence it is built as follows:
var reducer = reductio()
reducer.value('mach_count')
.exception(function(d) { return d.room; })
.exception(function(d) { return d.month; })
.exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
.sum(function(d) { return d.failCount; })
This seems to fix the numbers when the graphs are rendered.
However, I do get some strange behaviour when filtering on a single month and looking at the numbers in the type graph.
Possible solution
Rather than creating two exceptions, I could merge the two fields when processing the data, i.e. as soon as the data is loaded I could do:
data.forEach(function(x){
x['room_month'] = x['room'] + '_' + x['month'];
});
Then the above reduction code should become:
var reducer = reductio()
reducer.value('mach_count')
.exception(function(d) { return d.room_month; })
.exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
.sum(function(d) { return d.failCount; })
This solution seems to work. However, I am not sure whether it is a sensible thing to do: if the dataset is large, adding a new feature could slow things down quite a lot!
A few things:
Don't calculate rates in your Crossfilter reducers. Calculate the components of the rates instead, and do the actual division in your value accessor. This will keep things both simpler and faster.
You've basically got the right idea. I think there are two problems that I see immediately:
In your remove_rate you are not removing the key from the p.done array. You should be doing something like if (index) p.done.splice(p.done.indexOf(v.room+'_'+monthNameFormat(v.month)), 1); to remove it.
In your reduce functions, index is a boolean. (index == -1) will never evaluate to true, IIRC. So your added machine count will always be 0. Use var count_to_sum = index ? v.machineCount : 0; instead.
If you want to put together a working example, I or someone else will be happy to get it going for you, I'm sure.
You may also want to try Reductio. Crossfilter reducers are difficult to do right and efficiently, so it may make sense to use a library to help. With Reductio, creating a group that calculates your machine count and failure count looks like this:
var reducer = reductio()
reducer.value('mach_count')
.exception(function(d) { return d.room; })
.exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
.sum(function(d) { return d.failCount; })
var dim = ndx.dimension(...)
var grp = dim.group()
reducer(grp)

Kotlin - Sort List by using formatted date string (functional)

I'm trying to create a Kotlin REST API, which retrieves values from a PostgreSQL database. The values in these results look like "14-10-2016 | 15:48" and "01-08-2015 | 09:29", so the format basically is dd-MM-yyyy | HH:mm.
Now what I'm trying to do is create a function that will order them by date placed. (Assume these strings are in a list.)
var list = listOf("14-10-2016 | 15:48",
"01-08-2015 | 09:29",
"15-11-2016 | 19:43")
What would be the cleanest (and most functional) way of sorting these? For example, is there a way where I don't have to take substrings of the day, month, etc., cast them to Int, compare them in a nested loop and write the results to a different array? (That's the only way I could think of.)
More than one approach can be used; it depends on how you want to process the sorted result.
Points to note:
java.time.LocalDateTime already implements the
java.lang.Comparable<T> interface, so we can use the Kotlin stdlib sorted / sortedBy functions to sort a List<LocalDateTime> directly.
Ref:
// https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/sorted-by.html
fun <T, R : Comparable<R>> Iterable<T>.sortedBy(
selector: (T) -> R?
): List<T>
The easiest way is to transform each String -> java.time.LocalDateTime and use sorted / sortedBy directly.
The whole implementation could be like this:
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
...
// Create a convert function, String -> LocalDateTime
val dateTimeStrToLocalDateTime: (String) -> LocalDateTime = {
LocalDateTime.parse(it, DateTimeFormatter.ofPattern("dd-MM-yyyy | HH:mm"))
}
val list = listOf("14-10-2016 | 15:48",
"01-08-2015 | 09:29",
"15-11-2016 | 19:43")
// You will get List<LocalDateTime> sorted in ascending order
list.map(dateTimeStrToLocalDateTime).sorted()
// You will get List<LocalDateTime> sorted in descending order
list.map(dateTimeStrToLocalDateTime).sortedDescending()
// You will get List<String> which is sorted in ascending order
list.sortedBy(dateTimeStrToLocalDateTime)
// You will get List<String> which is sorted in descending order
list.sortedByDescending(dateTimeStrToLocalDateTime)
If you want to use org.joda.time.DateTime, you can just make a tiny change on the convert function.
A friendly reminder, always pick val as your first choice in Kotlin :).
Another alternative to Saravana's excellent answer (for minimalist and compact freaks like me) is:
val cmp = compareBy<String> { LocalDateTime.parse(it, DateTimeFormatter.ofPattern("dd-MM-yyyy | HH:mm")) }
list.sortedWith(cmp).forEach(::println)
01-08-2015 | 09:29
14-10-2016 | 15:48
15-11-2016 | 19:43
PS: "it" is the implicit name of the lambda's single parameter.
You could use DateTimeFormatter to parse the strings and then compare them as LocalDateTime:
List<String> dates = Arrays.asList("14-10-2016 | 15:48", "01-08-2015 | 09:29", "15-11-2016 | 19:43");
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MM-yyyy | HH:mm");
List<LocalDateTime> dateTimes = dates.stream().map(date -> LocalDateTime.parse(date, formatter)).sorted().collect(Collectors.toList());
System.out.println(dateTimes);
Output
[2015-08-01T09:29, 2016-10-14T15:48, 2016-11-15T19:43]
Update
You could simply convert to LocalDateTime in the Comparator alone:
List<String> sortedDates = dates.stream().sorted(Comparator.comparing(date -> LocalDateTime.parse(date, formatter))).collect(Collectors.toList());
Output
[01-08-2015 | 09:29, 14-10-2016 | 15:48, 15-11-2016 | 19:43]
If you are using a list of custom date objects, the list can be sorted in place as below.
println("--- ASC ---")
dates.sortBy { it.year }
println("--- DESC ---")
dates.sortByDescending { it.year }
You can use sortByDescending {it.field} for descending order.

Spark: How to create a sessionId based on userId and timestamp

Sorry for a newbie question.
Currently I have log files which contain fields such as userId, event, and timestamp, but no sessionId. My aim is to create a sessionId for each record based on the timestamp and a pre-defined value TIMEOUT.
If the TIMEOUT value is 10, and sample DataFrame is:
scala> eventSequence.show(false)
+----------+------------+----------+
|userId    |event       |timestamp |
+----------+------------+----------+
|U1        |A           |1         |
|U2        |B           |2         |
|U1        |C           |5         |
|U3        |A           |8         |
|U1        |D           |20        |
|U2        |B           |23        |
+----------+------------+----------+
The goal is:
+----------+------------+----------+----------+
|userId    |event       |timestamp |sessionId |
+----------+------------+----------+----------+
|U1        |A           |1         |S1        |
|U2        |B           |2         |S2        |
|U1        |C           |5         |S1        |
|U3        |A           |8         |S3        |
|U1        |D           |20        |S4        |
|U2        |B           |23        |S5        |
+----------+------------+----------+----------+
I find one solution in R (Create a "sessionID" based on "userID" and differences in "timeStamp"), while I am not able to figure it out in Spark.
Thanks for any suggestions on this problem.
Shawn's answer addresses "How to create a new column", while my aim is "How to create a sessionId column based on the timestamp". After days of struggling, the Window function turned out to be a simple solution in this scenario.
Window functions have been available since Spark 1.4; they are designed for operations that
"operate on a group of rows while still returning a single value for every input row".
In order to create a sessionId based on the timestamp, I first need to get the difference between a user's two consecutive operations. windowDef defines a Window partitioned by "userId" and ordered by timestamp; diff is then a column that, for each row, contains the timestamp of the row 1 position after the current row in the partition (group), or null if the current row is the last row in its partition.
def handleDiff(timeOut: Int) = {
udf {(timeDiff: Int, timestamp: Int) => if(timeDiff > timeOut) timestamp + ";" else timestamp + ""}
}
val windowDef = Window.partitionBy("userId").orderBy("timestamp")
val diff: Column = lead(eventSequence("timestamp"), 1).over(windowDef)
val dfTSDiff = eventSequence.
withColumn("time_diff", diff - eventSequence("timestamp")).
withColumn("event_seq", handleDiff(TIME_OUT)(col("time_diff"), col("timestamp"))).
groupBy("userId").agg(GroupConcat(col("event_seq")).alias("event_seqs"))
Updated:
Then exploit the Window function to apply a "cumsum"-like operation (as provided in Pandas):
// Define a Window, partitioned by userId (partitionBy), ordered by timestamp (orderBy), and delivers all rows before current row in this partition as frame (rowsBetween)
val windowSpec = Window.partitionBy("userId").orderBy("timestamp").rowsBetween(Long.MinValue, 0)
val sessionDf = dfTSDiff.
withColumn("ts_diff_flag", genTSFlag(TIME_OUT)(col("time_diff"))).
select(col("userId"), col("eventSeq"), col("timestamp"), sum("ts_diff_flag").over(windowSpec).alias("sessionInteger")).
withColumn("sessionId", genSessionId(col("userId"), col("sessionInteger")))
Previously:
Then split by ";" to get each session and create a sessionId; afterwards split by "," and explode to the final result. Thus the sessionId is created with the help of string operations.
(This part should be replaced by a cumulative sum operation instead, however I did not find a good solution at the time.)
Any idea or thought about this question is welcome.
GroupConcat could be found here: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function
Reference: databricks introduction
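For completeness, here is a minimal, self-contained sketch of the cumulative-sum variant described above, applied to the sample data. The answer's genTSFlag, genSessionId and GroupConcat helpers are not shown, so the helpers and column names below are assumptions; it also uses lag instead of lead to flag session starts, and the generated ids are per-user (e.g. U1_1) rather than the global S1...S5 of the example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat, lag, lit, sum, when}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val TIME_OUT = 10

val eventSequence = Seq(
  ("U1", "A", 1), ("U2", "B", 2), ("U1", "C", 5),
  ("U3", "A", 8), ("U1", "D", 20), ("U2", "B", 23)
).toDF("userId", "event", "timestamp")

val byUser = Window.partitionBy("userId").orderBy("timestamp")
// All rows from the start of the partition up to and including the current row.
val cumulative = byUser.rowsBetween(Long.MinValue, 0)

val sessions = eventSequence
  // Time since the user's previous event (null for the user's first event).
  .withColumn("time_diff", col("timestamp") - lag("timestamp", 1).over(byUser))
  // 1 marks the start of a new session, 0 continues the current one.
  .withColumn("new_session",
    when(col("time_diff").isNull || col("time_diff") > TIME_OUT, 1).otherwise(0))
  // A cumulative sum of the flags yields a per-user session number.
  .withColumn("session_no", sum("new_session").over(cumulative))
  .withColumn("sessionId", concat(col("userId"), lit("_"), col("session_no").cast("string")))

sessions.orderBy("timestamp").show(false)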
dt.withColumn("sessionId", <expression for the new column sessionId>)
for example:
dt.timestamp + pre-defined value TIMEOUT

How to split a column which has data in XML form to different rows of new Database as KEY VALUE in TALEND

In the old DB I have data in one column as:
<ADDRESS>
<CITY>ABC</CITY>
<STATE>PQR</STATE>
</ADDRESS>
In my new DB I want this data to be stored in a KEY-VALUE fashion like:
USER_ID  KEY    VALUE
1        CITY   ABC
1        STATE  PQR
Could someone please help me migrate this kind of data using the Talend tool?
Design the job like below:
tOracleInput---tExtractXMLField---output
In the tOracleInput component you can select the XML column and set its datatype to String.
In the tExtractXMLField component, pass this XML column as the "XML field" and set the Loop XPath expression to "/ADDRESS".
Add two new columns, CITY and STATE, to the output schema of tExtractXMLField.
Set the XPath query in the mapping to "/ADDRESS/CITY" for CITY and "/ADDRESS/STATE" for STATE.
Now you have both values in the output.
See the image for more details.
As I explained in your previous post, you can follow the same approach for making the key-value pairs:
how-to-split-one-row-in-different-rows-in-talend
Or you can use the tUnpivot component as you did there.
As you said the source data has special characters, use the expression below to replace them.
Steps: after the Oracle input, add a tMap and use this code to replace the special symbol:
row24.XMLField.replaceAll("&", "<![CDATA["+"&"+"]]>")
Once that is done, execute the job and check the result; it should work.
I'd use tJavaFlex.
Component Settings:
tJavaFlex schema:
In the begin part, use
String input = ((String)globalMap.get("row2.xmlField")); // get the XML field's value
String firstTag = input.substring(input.indexOf("<")+1,input.indexOf(">"));
input = input.replace("<"+firstTag+">","").replace("</"+firstTag+">",""); // strip the wrapping tag, e.g. <ADDRESS>
int tagCount = input.length() - input.replace("</", "<").length(); // number of closing tags = number of key/value pairs
int closeTagFinish = -1;
for (int i = 0; i<tagCount ; i++) {
In the main part, parse the XML tag name and value, and have the output schema contain those 2 additional columns. The main part will look like this:
/*set up the output columns */
output.user_id = ((String)globalMap.get("row2.user_id"));
output.user_first_name = ((String)globalMap.get("row2.user_first_name"));
output.user_last_name = ((String)globalMap.get("row2.user_last_name"));
Then we can calculate the key-value pairs for the XML without knowing the key names in advance.
/*calculate columns out of XML */
int openTagStart = input.indexOf("<",closeTagFinish+1);
int openTagFinish = input.indexOf(">",openTagStart);
int closeTagStart = input.indexOf("<",openTagFinish);
closeTagFinish = input.indexOf(">",closeTagStart);
output.xmlKey = input.substring(openTagStart+1,openTagFinish);
output.xmlValue = input.substring(openTagFinish+1,closeTagStart);
tJavaFlex End part:
}
Output looks like:
.-------+---------------+--------------+------+--------.
| tLogRow_2 |
|=------+---------------+--------------+------+-------=|
|user_id|user_first_name|user_last_name|xmlKey|xmlValue|
|=------+---------------+--------------+------+-------=|
|1 |foo |bar |CITY |ABC |
|1 |foo |bar |STATE |PQR |
'-------+---------------+--------------+------+--------'
