Spark repartitionAndSortWithinPartitions with tuples - sorting

I'm trying to follow this example to partition hbase rows: https://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/
However, I have data already stored in (String, String, String) where the first is the rowkey, second is column name, and third is column value.
I tried writing an implicit ordering to achieve the OrderedRDD implicit
implicit val caseInsensitiveOrdering: Ordering[(String, String, String)] = new Ordering[(String, String, String)] {
override def compare(x: (String, String, String), y: (String, String, String)): Int = ???
}
but repartitionAndSortWithinPartitions is still not available. Is there a way I can use this method with this tuple?

RDD must have key and value, not only values, for ex.:
val data = List((("5", "6", "1"), (1)))
val rdd : RDD[((String, String, String), Int)] = sparkContext.parallelize(data)
implicit val caseInsensitiveOrdering = new Ordering[(String, String, String)] {
override def compare(x: (String, String, String), y: (String, String, String)): Int = 1
}
rdd.repartitionAndSortWithinPartitions(..)

Related

Apache Beam Go SDK: how to convert PCollection<string> to PCollection<KV<string, string>>?

I'm using the Apache Beam Go SDK and having a hard time getting a PCollection in the correct format for grouping/combining by key.
I have multiple records per key in a PCollection of strings that look like this:
Bob, cat
Bob, dog
Carla, cat
Carla, bunny
Doug, horse
I want to use GroupByKey and CombinePerKey so I can aggregate each person's pets like this:
Bob, [cat, dog]
Carla, [cat, bunny]
Doug, [horse]
How do I convert a PCollection<string> to PCollection<KV<string, string>>?
They mention something similar here, but the code to aggregate the string values is not included.
I can use a ParDo to get the string key and string value as shown below, but I can't figure out how to convert to the KV<string, string> or CoGBK<string, string> format required as input to GroupPerKey.
pcolOut := beam.ParDo(s, func(line string) (string, string) {
cleanString := strings.TrimSpace(line)
openingChar := ","
iStart := strings.Index(cleanString, openingChar)
key := cleanString[0:iStart]
value := cleanString[iStart+1:]
// How to convert to PCollection<KV<string, string>> before returning?
return key, value
}, pcolIn)
groupedKV := beam.GroupByKey(s, pcolOut)
It fails with the following error. Any suggestions?
panic: inserting ParDo in scope root
creating new DoFn in scope root
binding fn main.main.func2
binding params [{Value string} {Value string}] to input CoGBK<string,string>
values of CoGBK<string,string> cannot bind to {Value string}
To map to KVs, you can apply MapElements and use into() to set KV types and in the via() logic, create a new KV.of(myKey, myValue), for example, to get a KV<String,String>, use something like this:
PCollection<KV<String, String>> kvPairs = linkpages.apply(MapElements.into(
TypeDescriptors.kvs(
TypeDescriptors.strings(),
TypeDescriptors.strings()))
.via(
linkpage -> KV.of(dataFile, linkpage)));
Maybe you mistake the next pardo iter type
test this code
pcolIn := beam.CreateList(s, []string{"Bob, cat",
"Bob, dog",
"Carla, cat",
"Carla, bunny",
"Doug, horse",
})
pcolOut := beam.ParDo(s, func(line string) (string, string) {
cleanString := strings.TrimSpace(line)
openingChar := ","
iStart := strings.Index(cleanString, openingChar)
key := cleanString[0:iStart]
value := cleanString[iStart+1:]
// How to convert to PCollection<KV<string, string>> before returning?
return key, value
}, pcolIn)
groupedKV := beam.GroupByKey(s, pcolOut)
beam.ParDo0(s, func(key string, iter func(*string) bool) {
vals := []string{}
val := ""
for iter(&val) {
vals = append(vals, strings.TrimSpace(val))
}
fmt.Println(key, vals)
}, groupedKV)

Allowing for a variable number of return values in method declaration

I have a function that solves the problem of Go not allowing for the setting of default values in method declarations. I want to make it just a little bit better by allowing for a variable number of return variables. I understand that I can allow for an array of interfaces as a return type and then create an interface array with all the variables to return, like this:
func SetParams(params []interface{}, args ...interface{}) (...[]interface{}) {
var values []interface{}
for i := range params {
var value interface{}
paramType := reflect.TypeOf(params[i])
if len(args) < (i + 1) {
value = params[i]
} else {
argType := reflect.TypeOf(args[i])
if paramType != argType {
value = params[i]
}
value = args[i]
}
values = append(values, value)
}
return values
}
This is an example of a method you want to define default values for. You build it as a variadic function (allowing a variable number of parameters) and then define the default values of the specific params you are looking for inside the function instead of in the declaration line.
func DoSomething(args ...interface{}) {
//setup default values
str := "default string 1"
num := 1
str2 := "default string 2"
//this is fine
values := SetParams([]interface{str, num, str2}, args)
str = values[0].(string)
num = values[1].(int)
str = values[2].(string)
//I'd rather support this
str, num, str2 = SetParams(params, args)
}
I understand that
[]interface{str, num, str2}
in the above example is not syntactically correct. I did it that way to simplify my post. But, it represents another function that builds the array of interfaces.
I would like to support this:
str, num, str2 = SetParams(params, args)
instead of having to do this:
values := SetParams([]interface{str, num, str2}, args)
str = values[0].(string)
num = values[1].(int)
str = values[2].(string)
Any advice? Help?
Please don't write horrible (and ineffective due to reflect) code to solve nonexistent problem.
As was indicated in comments, turning a language into
one of your previous languages is indeed compelling
after a switch, but this is counterproductive.
Instead, it's better to work with the idioms and approaches
and best practices the language provides --
even if you don't like them (yet, maybe).
For this particular case you can roll like this:
Make the function which wants to accept
a list of parameters with default values
accept a single value of a custom struct type.
For a start, any variable of such type, when allocated,
has all its fields initialized with the so-called "zero values"
appropriate to their respective types.
If that's enough, you can stop there: you will be able
to pass values of your struct type to your functions
by producing them via literals right at the call site --
initializing only the fields you need.
Otherwise have pre-/post- processing code which
would provide your own "zero values" for the fields
you need.
Update on 2016-08-29:
Using a struct type to simulate optional parameters
using its fields being assigned default values which happen
to be Go's native zero values for their respective data types:
package main
import (
"fmt"
)
type params struct {
foo string
bar int
baz float32
}
func myfun(params params) {
fmt.Printf("%#v\n", params)
}
func main() {
myfun(params{})
myfun(params{bar: 42})
myfun(params{foo: "xyzzy", baz: 0.3e-2})
}
outputs:
main.params{foo:"", bar:0, baz:0}
main.params{foo:"", bar:42, baz:0}
main.params{foo:"xyzzy", bar:0, baz:0.003}
As you can see, Go initializes the fields of our params type
with the zero values appropriate to their respective types
unless we specify our own values when we define our literals.
Playground link.
Providing default values which are not Go-native zero values for
the fields of our custom type can be done by either pre-
or post-processing the user-submitted value of a compound type.
Post-processing:
package main
import (
"fmt"
)
type params struct {
foo string
bar int
baz float32
}
func (pp *params) setDefaults() {
if pp.foo == "" {
pp.foo = "ahem"
}
if pp.bar == 0 {
pp.bar = -3
}
if pp.baz == 0 { // Can't really do this to FP numbers; for demonstration purposes only
pp.baz = 0.5
}
}
func myfun(params params) {
params.setDefaults()
fmt.Printf("%#v\n", params)
}
func main() {
myfun(params{})
myfun(params{bar: 42})
myfun(params{foo: "xyzzy", baz: 0.3e-2})
}
outputs:
main.params{foo:"ahem", bar:-3, baz:0.5}
main.params{foo:"ahem", bar:42, baz:0.5}
main.params{foo:"xyzzy", bar:-3, baz:0.003}
Playground link.
Pre-processing amounts to creating a "constructor" function
which would return a value of the required type pre-filled
with the default values your choice for its fields—something
like this:
func newParams() params {
return params{
foo: "ahem",
bar: -3,
baz: 0.5,
}
}
so that the callers of your function could call newParams(),
tweak its fields if they need and then pass the resulting value
to your function:
myfunc(newParams())
ps := newParams()
ps.foo = "xyzzy"
myfunc(ps)
This approach is maybe a bit more robust than post-processing but
it precludes using of literals to construct the values to pass to
your function right at the call site which is less "neat".
Recently I was playing with anonymous functions in Go and implemented an example which accepts and returns undefined parameters:
func functions() (funcArray []func(args ... interface{}) (interface{}, error)) {
type ret struct{
first int
second string
third bool
}
f1 := func(args ... interface{}) (interface{}, error){
a := args[0].(int)
b := args[1].(int)
return (a < b), nil
}
funcArray = append(funcArray , f1)
f2 := func(args ... interface{}) (interface{}, error){
return (args[0].(string) + args[1].(string)), nil
}
funcArray = append(funcArray , f2)
f3 := func(args ... interface{}) (interface{}, error){
return []int{1,2,3}, nil
}
funcArray = append(funcArray , f3)
f4 := func(args ... interface{}) (interface{}, error){
return ret{first: 1, second: "2", third: true} , nil
}
funcArray = append(funcArray , f4)
return funcArray
}
func main() {
myFirst_Function := functions()[0]
mySecond_Function := functions()[1]
myThird_Function := functions()[2]
myFourth_Function := functions()[3]
fmt.Println(myFirst_Function(1,2))
fmt.Println(mySecond_Function("1","2"))
fmt.Println(myThird_Function())
fmt.Println(myFourth_Function ())
}
I hope it helps you.
https://play.golang.org/p/d6dSYLwbUB9

Return values of function as input arguments to another

If I have
func returnIntAndString() (i int, s string) {...}
And I have:
func doSomething(i int, s string) {...}
Then I can do the following successfully:
doSomething(returnIntAndString())
However, let's say I want to add another argument to doSomething like:
func doSomething(msg string, i int, s string) {...}
Go complains when compiling if I call it like:
doSomething("message", returnIntAndString())
With:
main.go:45: multiple-value returnIntAndString() in single-value context
main.go:45: not enough arguments in call to doSomething()
Is there a way to do this or should I just give up and assign the return values from returnIntAndString to some references and pass msg and these values like doSomething(msg, code, str) ?
It's described here in the spec. It requires the inner function to return the correct types for all arguments. There is no allowance for extra parameters along with a function that returns multiple values.
As a special case, if the return values of a function or method g are
equal in number and individually assignable to the parameters of
another function or method f, then the call f(g(parameters_of_g)) will
invoke f after binding the return values of g to the parameters of f
in order. The call of f must contain no parameters other than the call
of g, and g must have at least one return value. If f has a final ...
parameter, it is assigned the return values of g that remain after
assignment of regular parameters.
func Split(s string, pos int) (string, string) {
return s[0:pos], s[pos:]
}
func Join(s, t string) string {
return s + t
}
if Join(Split(value, len(value)/2)) != value {
log.Panic("test fails")
}
If those specific conditions are not met, then you need to assign the return values and call the function separately.
I had the same question. The best solution I could come up with is creating types or structs for my desired extra parameters and writing methods for them like this:
package main
import (
"fmt"
)
type Message string
type MessageNumber struct {
Message string
Number int
}
func testfunc() (foo int, bar int) {
foo = 4
bar = 2
return
}
func (baz Message) testfunc2(foo int, bar int) {
fmt.Println(foo, bar, baz)
}
func (baz MessageNumber) testfunc3(foo int, bar int) {
fmt.Println(foo, bar, baz.Number, baz.Message)
}
func main() {
Message("the answer").testfunc2(testfunc())
MessageNumber{"what were we talking about again?", 0}.testfunc3(testfunc())
fmt.Println("Done. Have a day.")
}
The output looks like this:
user#Frodos-Atari-MEGA-STE:~/go/test$ go run main.go
4 2 the answer
4 2 0 what were we talking about again?
Done. Have a day.

Validation versus disjunction

Suppose I want to write a method with the following signature:
def parse(input: List[(String, String)]):
ValidationNel[Throwable, List[(Int, Int)]]
For each pair of strings in the input, it needs to verify that both members can be parsed as integers and that the first is smaller than the second. It then needs to return the integers, accumulating any errors that turn up.
First I'll define an error type:
import scalaz._, Scalaz._
case class InvalidSizes(x: Int, y: Int) extends Exception(
s"Error: $x is not smaller than $y!"
)
Now I can implement my method as follows:
def checkParses(p: (String, String)):
ValidationNel[NumberFormatException, (Int, Int)] =
p.bitraverse[
({ type L[x] = ValidationNel[NumberFormatException, x] })#L, Int, Int
](
_.parseInt.toValidationNel,
_.parseInt.toValidationNel
)
def checkValues(p: (Int, Int)): Validation[InvalidSizes, (Int, Int)] =
if (p._1 >= p._2) InvalidSizes(p._1, p._2).failure else p.success
def parse(input: List[(String, String)]):
ValidationNel[Throwable, List[(Int, Int)]] = input.traverseU(p =>
checkParses(p).fold(_.failure, checkValues _ andThen (_.toValidationNel))
)
Or, alternatively:
def checkParses(p: (String, String)):
NonEmptyList[NumberFormatException] \/ (Int, Int) =
p.bitraverse[
({ type L[x] = ValidationNel[NumberFormatException, x] })#L, Int, Int
](
_.parseInt.toValidationNel,
_.parseInt.toValidationNel
).disjunction
def checkValues(p: (Int, Int)): InvalidSizes \/ (Int, Int) =
(p._1 >= p._2) either InvalidSizes(p._1, p._2) or p
def parse(input: List[(String, String)]):
ValidationNel[Throwable, List[(Int, Int)]] = input.traverseU(p =>
checkParses(p).flatMap(s => checkValues(s).leftMap(_.wrapNel)).validation
)
Now for whatever reason the first operation (validating that the pairs parse as strings) feels to me like a validation problem, while the second (checking the values) feels like a disjunction problem, and it feels like I need to compose the two monadically (which suggests that I should be using \/, since ValidationNel[Throwable, _] doesn't have a monad instance).
In my first implementation, I use ValidationNel throughout and then fold at the end as a kind of fake flatMap. In the second, I bounce back and forth between ValidationNel and \/ as appropriate depending on whether I need error accumulation or monadic binding. They produce the same results.
I've used both approaches in real code, and haven't yet developed a preference for one over the other. Am I missing something? Should I prefer one over the other?
This is probably not the answer you're looking, but I just noticed Validation has the following methods
/** Run a disjunction function and back to validation again. Alias for `#\/` */
def disjunctioned[EE, AA](k: (E \/ A) => (EE \/ AA)): Validation[EE, AA] =
k(disjunction).validation
/** Run a disjunction function and back to validation again. Alias for `disjunctioned` */
def #\/[EE, AA](k: (E \/ A) => (EE \/ AA)): Validation[EE, AA] =
disjunctioned(k)
When I saw them, I couldn't really see their usefulness until I remembered this question. They allow you to do a proper bind by converting to disjunction.
def checkParses(p: (String, String)):
ValidationNel[NumberFormatException, (Int, Int)] =
p.bitraverse[
({ type L[x] = ValidationNel[NumberFormatException, x] })#L, Int, Int
](
_.parseInt.toValidationNel,
_.parseInt.toValidationNel
)
def checkValues(p: (Int, Int)): InvalidSizes \/ (Int, Int) =
(p._1 >= p._2) either InvalidSizes(p._1, p._2) or p
def parse(input: List[(String, String)]):
ValidationNel[Throwable, List[(Int, Int)]] = input.traverseU(p =>
checkParses(p).#\/(_.flatMap(checkValues(_).leftMap(_.wrapNel)))
)
The following is a pretty close translation of the second version of my code for Cats:
import scala.util.Try
case class InvalidSizes(x: Int, y: Int) extends Exception(
s"Error: $x is not smaller than $y!"
)
def parseInt(input: String): Either[Throwable, Int] = Try(input.toInt).toEither
def checkValues(p: (Int, Int)): Either[InvalidSizes, (Int, Int)] =
if (p._1 >= p._2) Left(InvalidSizes(p._1, p._2)) else Right(p)
import cats.data.{EitherNel, ValidatedNel}
import cats.instances.either._
import cats.instances.list._
import cats.syntax.apply._
import cats.syntax.either._
import cats.syntax.traverse._
def checkParses(p: (String, String)): EitherNel[Throwable, (Int, Int)] =
(parseInt(p._1).toValidatedNel, parseInt(p._2).toValidatedNel).tupled.toEither
def parse(input: List[(String, String)]): ValidatedNel[Throwable, List[(Int, Int)]] =
input.traverse(fields =>
checkParses(fields).flatMap(s => checkValues(s).toEitherNel).toValidated
)
To update the question, this code is "bouncing back and forth between ValidatedNel and Either as appropriate depending on whether I need error accumulation or monadic binding".
In the almost six years since I asked this question, Cats has introduced a Parallel type class (improved in Cats 2.0.0) that solves exactly the problem I was running into:
import cats.data.EitherNel
import cats.instances.either._
import cats.instances.list._
import cats.instances.parallel._
import cats.syntax.either._
import cats.syntax.parallel._
def checkParses(p: (String, String)): EitherNel[Throwable, (Int, Int)] =
(parseInt(p._1).toEitherNel, parseInt(p._2).toEitherNel).parTupled
def parse(input: List[(String, String)]): EitherNel[Throwable, List[(Int, Int)]] =
input.parTraverse(fields =>
checkParses(fields).flatMap(checkValues(_).toEitherNel)
)
We can switch the the par version of our applicative operators like traverse or tupled when we want to accumulate errors, but otherwise we're working in Either, which gives us monadic binding, and we no longer have to refer to Validated at all.

Map of methods in Go

I have several methods that I'm calling for some cases (like Add, Delete, etc..). However over time the number of cases is increasing and my switch-case is getting longer. So I thought I'd create a map of methods, like Go map of functions; here the mapping of functions is trivial. However, is it possible to create a map of methods in Go?
When we have a method:
func (f *Foo) Add(a string, b int) { }
The syntax below create compile-time error:
actions := map[string]func(a, b){
"add": f.Add(a,b),
}
Is it possible to create a map of methods in Go?
Yes. Currently:
actions := map[string]func(a string, b int){
"add": func(a string, b int) { f.Add(a, b) },
}
Later: see the go11func document guelfi mentioned.
There is currently no way to store both receiver and method in a single value (unless you store it in a struct). This is currently worked on and it may change with Go 1.1 (see http://golang.org/s/go11func).
You may, however, assign a method to a function value (without a receiver) and pass the receiver to the value later:
package main
import "fmt"
type Foo struct {
n int
}
func (f *Foo) Bar(m int) int {
return f.n + m
}
func main() {
foo := &Foo{2}
f := (*Foo).Bar
fmt.Printf("%T\n", f)
fmt.Println(f(foo, 42))
}
This value can be stored in a map like anything else.
I met with a similar question.
How can this be done today, 9 years later:
the thing is that the receiver must be passed to the method map as the first argument. Which is pretty unusual.
package main
import (
"fmt"
"log"
)
type mType struct {
str string
}
func (m *mType) getStr(s string) {
fmt.Println(s)
fmt.Println(m.str)
}
var (
testmap = make(map[string]func(m *mType, s string))
)
func main() {
test := &mType{
str: "Internal string",
}
testmap["GetSTR"] = (*mType).getStr
method, ok := testmap["GetSTR"]
if !ok {
log.Fatal("something goes wrong")
}
method(test, "External string")
}
https://go.dev/play/p/yy3aR_kMzHP
You can do this using Method Expressions:
https://golang.org/ref/spec#Method_expressions
However, this makes the function take the receiver as a first argument:
actions := map[string]func(Foo, string, int){
"add": Foo.Add
}
Similarly, you can get a function with the signature func(*Foo, string, int) using (*Foo).Add
If you want to use pointer to type Foo as receiver, like in:
func (f *Foo) Add(a string, b int) { }
then you can map string to function of (*Foo, string, int), like this:
var operations = map[string]func(*Foo, string, int){
"add": (*Foo).Add,
"delete": (*Foo).Delete,
}
Then you would use it as:
var foo Foo = ...
var op string = GetOp() // "add", "delete", ...
operations[op](&foo, a, b)
where GetOp() returns an operation as string, for example from a user input.
a and b are your string and int arguments to methods.
This assumes that all methods have the same signatures. They can also have return value(s), again of the same type(s).
It is also possible to do this with Foo as receiver instead of *Foo. In that case we don't have to de-reference it in the map, and we pass foo instead of &foo.

Resources