Pyspark - Lambda Expressions operating on specific columns - random

I have a pyspark dataframe that looks like:
+---------------+---+---+---+---+---+---+
|         Entity| id|  7| 15| 19| 21| 27|
+---------------+---+---+---+---+---+---+
|              a|  0|  0|  1|  0|  0|  0|
|              b|  1|  0|  0|  0|  1|  0|
|              c|  2|  0|  0|  0|  1|  0|
|              d|  3|  2|  0|  0|  0|  0|
|              e|  4|  0|  3|  0|  0|  0|
|              f|  5|  0| 25|  0|  0|  0|
|              g|  6|  2|  0|  0|  0|  0|
+---------------+---+---+---+---+---+---+
I want to add a random value between 0 and 1 to all elements in every column except Entity & id. There could be any number of columns after Entity & id (in this case there are 5, but there could be 100, 1,000, or more).
Here's what I have so far:
random_df = data.select("*").rdd.map(
    lambda x, r=random: [Row(str(row)) if isinstance(row, unicode) else
                         Row(float(r.random() + row)) for row in x]).toDF(data.columns)
However, this will also add a random value to the id column. Normally, if I knew the number of columns ahead of time and knew they were fixed, I could explicitly call them out in the lambda expression with:
data.select("*").rdd.map(lambda (a,b,c,d,e,f,g):
    Row(a, b, r.random() + c, r.random() + d, r.random() + e,
        r.random() + f, r.random() + g))
But, unfortunately, this won't work due to not knowing how many columns I'll have ahead of time. Thoughts? I really appreciate the help!
EDIT: I should also note that 'id' is a result of calling:
data = data.withColumn("id", monotonically_increasing_id())
Adding this edit as I tried to convert the column 'id' into a StringType so that my 'isinstance(row, unicode)' would trigger, but I wasn't successful. The following code:
data = data.withColumn("id", data['id'].cast(StringType))
results in:
raise TypeError("unexpected type: %s" % type(dataType))
TypeError: unexpected type: <class 'pyspark.sql.types.DataTypeSingleton'>

You should try .cast("string") on the id column. The TypeError comes from passing the StringType class itself rather than an instance; cast expects either a DataType instance such as StringType() or a type name string such as "string".
import random
import pyspark.sql.functions as f
from pyspark.sql.types import Row
df = sc.parallelize([
    ['a', 0, 1, 0, 0, 0],
    ['b', 0, 0, 0, 1, 0],
    ['c', 0, 0, 0, 1, 0],
    ['d', 2, 0, 0, 0, 0],
    ['e', 0, 3, 0, 0, 0],
    ['f', 0, 25, 0, 0, 0],
    ['g', 2, 0, 0, 0, 0],
]).toDF(('entity', '7', '15', '19', '21', '27'))
df = df.withColumn("id", f.monotonically_increasing_id())
df = df.withColumn("id_string", df["id"].cast("string")).drop("id")
df.show()
random_df = df.select("*").rdd.map(
    lambda x, r=random: [Row(str(row)) if isinstance(row, unicode) else
                         Row(float(r.random() + row)) for row in x]).toDF(df.columns)
random_df.show()
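For completeness: the same thing can be done without dropping to the RDD at all, using pyspark.sql.functions.rand(), which produces a column of uniform values in [0, 1). A minimal sketch against the df built above (assuming the columns to leave untouched are exactly entity and id_string):
# Sketch only: add uniform noise to every column except the ones in `keep`.
keep = {"entity", "id_string"}
noisy_cols = [
    f.col(c) if c in keep else (f.col(c) + f.rand()).alias(c)
    for c in df.columns
]
noisy_df = df.select(noisy_cols)
noisy_df.show()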

Related

How can you efficiently flip a large range of indices' values from 1 to 0 or vice versa

You're given an array arr of size N. Suppose there's a contiguous interval arr[a....b] where you want to flip all the 1s to 0s and vice versa. Now suppose there are a large number (millions or billions) of these intervals (they could have different starting and end points) that you need to process. Is there an efficient algorithm to get this done?
Note that a and b are inclusive. N can be any finite size essentially. The purpose of the question was just to practice algorithms.
Consider arr = [0,0,0,0,0,0,0]
Consider that we want to flip the following inclusive intervals: [1,3], [0,4].
After processing [1,3], we have arr = [0,1,1,1,0,0,0], and after processing [0,4], we have arr = [1,0,0,0,1,0,0], which is the final array.
The obvious efficient way to do that is to not do that. Instead first collect at what indices the flipping changes, and then do one pass to apply the collected flipping information.
Python implementation of a naive solution, the efficient solution, and testing:
def naive(arr, intervals):
    for a, b in intervals:
        for i in range(a, b+1):
            arr[i] ^= 1

def efficient(arr, intervals):
    flips = [0] * len(arr)
    for a, b in intervals:
        flips[a] ^= 1
        flips[b+1] ^= 1
    xor = 0
    for i, flip in enumerate(flips):
        xor ^= flip
        arr[i] ^= xor

def test():
    import random
    n = 30
    arr = random.choices([0, 1], k=n)
    intervals = []
    while len(intervals) < 100:
        a = random.randrange(n-1)
        b = random.randrange(n-1)
        if a <= b:
            intervals.append((a, b))
    print(f'{arr = }')
    expect = arr * 1
    naive(expect, intervals)
    print(f'{expect = }')
    result = arr * 1
    efficient(result, intervals)
    print(f'{result = }')
    print(f'{(result == expect) = }')

test()
Demo output:
arr = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
expect = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
result = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
(result == expect) = True
Cast to an int array and use bitwise NOT if you are using C or C++. But this is a SIMD task, so it's parallelizable if you wish.
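Staying in Python, the same "flip a whole range at once" idea can be expressed with an arbitrary-precision int acting as the bit array, so each interval becomes a single XOR against a mask. A small sketch, separate from the answers above:
def flip_range_bits(bits, a, b):
    # Flip bits a..b (inclusive) of an int used as a bitset, with one XOR.
    mask = ((1 << (b - a + 1)) - 1) << a
    return bits ^ mask

bits = 0                              # arr = [0,0,0,0,0,0,0]
bits = flip_range_bits(bits, 1, 3)    # arr = [0,1,1,1,0,0,0]
bits = flip_range_bits(bits, 0, 4)    # arr = [1,0,0,0,1,0,0]
print([(bits >> i) & 1 for i in range(7)])   # [1, 0, 0, 0, 1, 0, 0]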

Constructing Binary Tree from Inorder and Postorder Traversal

I am trying to construct a binary tree from postorder and inorder traversals. I believe the recursion part is correct; however, I'm not sure about the base cases. Any pointers would be appreciated.
I have tried different combinations of base cases but I can't seem to get it working.
class BinaryTreeNode:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def binary_tree_from_postorder_inorder(postorder, inorder):
    node_to_inorder_idx = {data: i for i, data in enumerate(inorder)}

    def helper(postorder_start, postorder_end, inorder_start, inorder_end):
        if postorder_end >= postorder_start or inorder_end <= inorder_start:
            return None
        root_inorder_idx = node_to_inorder_idx[postorder[postorder_start]]
        left_subtree_size = root_inorder_idx - inorder_start
        root = BinaryTreeNode(postorder[postorder_start])
        root.right = helper(
            postorder_start - 1,
            postorder_start - 1 - left_subtree_size,
            root_inorder_idx + 1,
            inorder_end,
        )
        root.left = helper(
            postorder_start - 1 - left_subtree_size,
            postorder_end,
            inorder_start,
            root_inorder_idx,
        )
        return root

    return helper(len(postorder) - 1, -1, 0, len(inorder))

def inorder(tree):
    stack = []
    results = []
    while stack or tree:
        if tree:
            stack.append(tree)
            tree = tree.left
        else:
            tree = stack.pop()
            results.append(tree.data)
            tree = tree.right
    return results

inorder = ["F", "B", "A", "E", "H", "C", "D", "I", "G"]
postorder = ["F", "A", "E", "B", "I", "G", "D", "C", "H"]
root_pos_in = binary_tree_from_postorder_inorder(postorder, inorder)
print(inorder(root_pos_in))
Inputs:
inorder = ["F", "B", "A", "E", "H", "C", "D", "I", "G"]
postorder = ["F", "A", "E", "B", "I", "G", "D", "C", "H"]
Actual output using inorder traversal:
["A", "B", "E", "H", "C"]
Expected output:
["F", "B", "A", "E", "H", "C", "D", "I", "G"]
It's been a while since I dealt with Python, but that looks like a lot of code for what seems a simple algorithm.
Here is an example of the application of the algorithm:
We start with
postorder | inorder
-----------|----------
|
FAEBIGDCH | FBAEHCDIG
^ |
| |
`-+-------------- last value of postorder: 'H': this is the root value
|
FAEBIGDCH | FBAEHCDIG
| ^
| |
| `------- index of 'H' in inorder: 4
|
FAEB_.... | FBAE_....
^ | ^
| | |
| | `--------- everything before index 4
| |
`-------+-------------- everything before index 4
|
....IGDC_ | ...._CDIG
^ | ^
| | |
| | `---- everything beginning with index 5 (4 + 1)
| |
`---+-------------- everything between index 4 and the 'H' at the end
|
FAEB | FBAE
^ | ^
| | |
`-------+---+---------- recur on these if not empty: this is the left child
|
IGDC | CDIG
^ | ^
| | |
`--+--------+----- recur on these if not empty: this is the right child
This will quickly lead us to a tree like
H
|
+--------+--------+
| |
B C
| |
+-----+-----+ +-----+
| | |
F E D
| |
+---+ +---+
| |
A G
+-+
|
I
So while I can't really critique your Python, I can offer a pretty simple JS version:
const makeTree = (
  postord, inord,
  len = postord.length, val = postord[len - 1], idx = inord.indexOf(val)
) =>
  len == 1
    ? {val}
    : {
        val,
        ...(idx > 0 ? {left: makeTree(postord.slice(0, idx), inord.slice(0, idx))} : {}),
        ...(idx < len - 1 ? {right: makeTree(postord.slice(idx, len - 1), inord.slice(idx + 1, len))} : {})
      }
const postOrder = ["F", "A", "E", "B", "I", "G", "D", "C", "H"]
const inOrder = ["F", "B", "A", "E", "H", "C", "D", "I", "G"]
console .log (
makeTree (postOrder, inOrder)
)
After fiddling for a little longer, I was able to fix the problem. See my updated function below:
def binary_tree_from_postorder_inorder(postorder, inorder):
    if not inorder or not postorder or len(postorder) != len(inorder):
        return None
    node_to_inorder_idx = {data: i for i, data in enumerate(inorder)}

    def helper(postorder_start, postorder_end, inorder_start, inorder_end):
        if postorder_start > postorder_end or inorder_start > inorder_end:
            return None
        root_index = node_to_inorder_idx[postorder[postorder_end]]
        left_subtree_size = root_index - inorder_start
        return BinaryTreeNode(
            postorder[postorder_end],
            helper(
                postorder_start,
                postorder_start + left_subtree_size - 1,
                inorder_start,
                root_index - 1,
            ),
            helper(
                postorder_start + left_subtree_size,
                postorder_end - 1,
                root_index + 1,
                inorder_end,
            ),
        )

    return helper(0, len(postorder) - 1, 0, len(inorder) - 1)
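A quick sanity check with the traversals from the question (a sketch using the iterative inorder helper defined there, with the input lists renamed so they don't shadow it):
inorder_seq = ["F", "B", "A", "E", "H", "C", "D", "I", "G"]
postorder_seq = ["F", "A", "E", "B", "I", "G", "D", "C", "H"]
root = binary_tree_from_postorder_inorder(postorder_seq, inorder_seq)
print(inorder(root))  # ['F', 'B', 'A', 'E', 'H', 'C', 'D', 'I', 'G']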

Pandas Series correlation against a single vector

I have a DataFrame with a list of arrays as one column.
import pandas as pd
v = [1, 2, 3, 4, 5, 6, 7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame({'A': [v1, v2, v3]})
print df
Output:
A
0 [1, 0, 0, 0, 0, 0, 0]
1 [0, 1, 0, 0, 1, 0, 0]
2 [1, 1, 0, 0, 0, 0, 1]
I want to do a pd.Series.corr for each row of df.A against the single vector v.
I'm currently doing this with a loop over df.A (roughly like the sketch below), but it is very slow.
Expected Output:
A B
0 [1, 0, 0, 0, 0, 0, 0] -0.612372
1 [0, 1, 0, 0, 1, 0, 0] -0.158114
2 [1, 1, 0, 0, 0, 0, 1] -0.288675
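For reference, the per-row loop being replaced presumably looks something like this (a hypothetical reconstruction, not the asker's actual code):
df['B'] = df['A'].apply(lambda row: pd.Series(row).corr(pd.Series(v)))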
Here's one using the correlation definition with NumPy tools meant for performance, with corr2_coeff_rowwise (sketched below) -
a = np.array(df.A.tolist()) # or np.vstack(df.A.values)
df['B'] = corr2_coeff_rowwise(a, np.asarray(v)[None])
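corr2_coeff_rowwise is a helper that isn't defined in this excerpt. A sketch of a rowwise Pearson correlation with the same call signature (my reconstruction, not necessarily the original helper) could look like:
import numpy as np

def corr2_coeff_rowwise(A, B):
    # Pearson correlation of each row of A against the single row in B (shape (1, n)).
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean(axis=1, keepdims=True)
    ssA = (A_mA ** 2).sum(axis=1)
    ssB = (B_mB ** 2).sum(axis=1)
    return (A_mA * B_mB).sum(axis=1) / np.sqrt(ssA * ssB)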
Runtime test -
Case #1 : 1000 rows
In [59]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(1000)]})
In [60]: v = np.random.randint(0,9,(7)).tolist()
# @jezrael's soln
In [61]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
10 loops, best of 3: 142 ms per loop
In [62]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
1000 loops, best of 3: 461 µs per loop
Case #2 : 10000 rows
In [63]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(10000)]})
In [64]: v = np.random.randint(0,9,(7)).tolist()
# @jezrael's soln
In [65]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
1 loop, best of 3: 1.38 s per loop
In [66]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
100 loops, best of 3: 3.05 ms per loop
Use corrwith, but if performance is important, Divakar's answer should be faster:
df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
print (df)
A new
0 [1, 0, 0, 0, 0, 0, 0] -0.612372
1 [0, 1, 0, 0, 1, 0, 0] -0.158114
2 [1, 1, 0, 0, 0, 0, 1] -0.288675

What are the values in the array cells_per_number referring to?

This was a solution to a problem on GitHub. I was looking over the solution and was wondering what the numbers in the array are referring to.
LED Clock: You are (voluntarily) in a room that is completely dark except for
the light coming from an old LED digital alarm clock. This is one of those
clocks with 4 seven segment displays using an HH:MM time format. The clock is
configured to display time in a 24 hour format and the leading digit will be
blank if not used. What is the period of time between when the room is at its
darkest to when it is at its lightest?
def compute_brightness(units)
  cells_per_number = [6, 2, 5, 5, 4, 5, 6, 3, 7, 6]
  units.each_with_object({}) do |t, hash|
    digits = t.split('')
    hash[t] = digits.map { |d| cells_per_number[d.to_i] }.reduce(:+)
  end
end
The numbers refer to the number of segments that are "on" when displaying the corresponding digit. When displaying the number "0," six segments are "on" (all except the center segment), so the number at index 0 is 6. When displaying 1, only two segments are "on," so the number at index 1 is 2. You get the idea.
_ _ _
0) | | = 6 1) | = 2 2) _| = 5 3) _| = 5
|_| | |_ _|
_ _ _
4) |_| = 4 5) |_ = 5 6) |_ = 6 7) | = 3
| _| |_| |
_ _
8) |_| = 7 9) |_| = 6
|_| _|
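For example, a clock showing 10:08 lights 2 + 6 + 6 + 7 = 21 segments, while 1:11 lights only 2 + 2 + 2 = 6, since the blank leading digit contributes nothing.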
@Jordan has answered your specific question, but the code isn't a complete solution to the stated problem. Here's a way of doing that.
lpd = { 0=>6, 1=>2, 2=>5, 3=>5, 4=>4, 5=>5, 6=>6, 7=>3, 8=>7, 9=>6 }
def min_leds(lpd, range)
  leds(lpd, range).min_by(&:last).first
end

def max_leds(lpd, range)
  leds(lpd, range).max_by(&:last)
end

def leds(lpd, range)
  lpd.select { |k,_| range.cover?(k) }
end

darkest =
  [ *[
      [max_leds(lpd, (1..1)), max_leds(lpd, (0..2))],
      [[0,0], max_leds(lpd, (1..9))]
    ].max_by { |(_,a), (_,b)| a+b },
    max_leds(lpd, (0..5)),
    max_leds(lpd, (0..9))
  ].transpose.first.join.insert(2,':')
  #=> "10:08"

lightest = [0, min_leds(lpd, (1..9)),
            min_leds(lpd, (0..5)),
            min_leds(lpd, (0..9))
           ].join.insert(2,':')
  #=> "01:11"
To make the solution more realistic, an array (possibly empty) of the locations of the burnt-out LEDs should be passed to the method.

Projecting an N-dimensional array to 1-d

I have an n-dimensional array I'd like to display in a table. Something like this:
@data = [[1,2,3],[4,5,6],[7,8,9]]
@dimensions = [{:name => "speed",    :values => [0..20, 20..40, 40..60]},
               {:name => "distance", :values => [0..50, 50..100, 100..150]}]
And I'd like the table to end up looking like this:
speed | distance | count
0..20 | 0..50 | 1
0..20 | 50..100 | 2
0..20 | 100..150 | 3
20..40 | 0..50 | 4
20..40 | 50..100 | 5
20..40 | 100..150 | 6
40..60 | 0..50 | 7
40..60 | 50..100 | 8
40..60 | 100..150 | 9
Is there a pretty way to pull this off? I have a working solution that I'm actually kind of proud of; this post is a bit humble-brag. However, it does feel overly complicated, and there's no way I or anyone else is going to understand what's going on later.
[nil].product(*@dimensions.map do |d|
  (0...d[:values].size).to_a
end).map(&:compact).map(&:flatten).each do |data_idxs|
  row = data_idxs.each_with_index.map{|data_idx, dim_idx|
    @dimensions[dim_idx][:values][data_idx]
  }
  row << data_idxs.inject(@data){|data, idx| data[idx]}
  puts row.join(" |\t ")
end
What about this?
first, *rest = @dimensions.map {|d| d[:values]}

puts first
  .product(*rest)
  .transpose
  .push(@data.flatten)
  .transpose
  .map {|row| row.map {|cell| cell.to_s.ljust 10}.join '|' }
  .join("\n")
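For the sample @data and @dimensions above, this prints the same nine rows as the desired table, one per line, with each range and the count left-justified to 10 characters and the cells joined by '|'.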
Bent, let me first offer a few comments on your solution. (Then I will offer an alternative approach that also uses Array#product.) Here is your code, formatted to expose the structure:
[nil].product(*@dimensions.map { |d| (0...d[:values].size).to_a })
     .map(&:compact)
     .map(&:flatten)
     .each do |data_idxs|
       row = data_idxs.each_with_index.map { |data_idx, dim_idx|
         @dimensions[dim_idx][:values][data_idx] }
       row << data_idxs.inject(@data) { |data, idx| data[idx] }
       puts row.join(" |\t ")
     end
I find it very confusing, in part because of your reluctance to define intermediate variables. I would first compute product's argument and assign it to a variable x. I say x because it's hard to come up with a good name for it. I would then assign the results of product to another variable, like so: y = x.shift.product(*x) or (if you don't want x modified) y = x.first.product(*x[1..-1]). This avoids the need for compact and flatten.
I find the choice of variable names confusing. The root of the problem is that @dimensions and @data both begin with d! This problem would be diminished greatly if you simply used, say, @vals instead of @data.
It would be more idiomatic to write data_idxs.each_with_index.map as data_idxs.map.with_index.
Lastly, but most important, is your decision to use indices rather than the values themselves. Don't do that. Just don't do that. Not only is this unnecessary, but it makes your code so complex that figuring it out is time-consuming and headache-producing.
Consider how easy it is to manipulate the data without any reference to indices:
vals = @dimensions.map {|h| h.values }
  # [["speed",    [0..20, 20..40, 40..60]],
  #  ["distance", [0..50, 50..100, 100..150]]]
attributes = vals.map(&:shift)
  # ["speed", "distance"]
  # vals => [[[0..20, 20..40, 40..60]], [[0..50, 50..100, 100..150]]]
vals = vals.flatten(1).map {|a| a.map(&:to_s)}
  # [["0..20", "20..40", "40..60"], ["0..50", "50..100", "100..150"]]
rows = vals.first.product(*vals[1..-1]).zip(@data.flatten).map { |a,d| a << d }
  # [["0..20",  "0..50",    1], ["0..20",  "50..100", 2], ["0..20",  "100..150", 3],
  #  ["20..40", "0..50",    4], ["20..40", "50..100", 5], ["20..40", "100..150", 6],
  #  ["40..60", "0..50",    7], ["40..60", "50..100", 8], ["40..60", "100..150", 9]]
I would address the problem in such a way that you could have any number of attributes (i.e., "speed", "distance", ...) and the formatting would be dictated by the data:
V_DIVIDER = ' | '
COUNT = 'count'

attributes = @dimensions.map {|h| h[:name]}
sd = @dimensions.map { |h| h[:values].map(&:to_s) }
fmt = sd.zip(attributes)
        .map(&:flatten)
        .map {|a| a.map(&:size)}
        .map {|a| "%-#{a.max}s" }

attributes.zip(fmt).each { |a,f| print f % a + V_DIVIDER }
puts COUNT

prod = (sd.shift).product(*sd)
flat_data = @data.flatten
until flat_data.empty? do
  prod.shift.zip(fmt).each { |d,f| print f % d + V_DIVIDER }
  puts (flat_data.shift)
end
If
@dimensions = [{:name => "speed",    :values => [0..20, 20..40, 40..60]   },
               {:name => "volume",   :values => [0..30, 30..100, 100..1000]},
               {:name => "distance", :values => [0..50, 50..100, 100..150] }]
this is displayed:
speed  | volume    | distance | count
0..20  | 0..30     | 0..50    | 1
0..20  | 0..30     | 50..100  | 2
0..20  | 0..30     | 100..150 | 3
0..20  | 30..100   | 0..50    | 4
0..20  | 30..100   | 50..100  | 5
0..20  | 30..100   | 100..150 | 6
0..20  | 100..1000 | 0..50    | 7
0..20  | 100..1000 | 50..100  | 8
0..20  | 100..1000 | 100..150 | 9
It works as follows (with the original value of @dimensions, having just the two attributes, "speed" and "distance"):
Attributes is a list of the attributes. Being an array, it maintains their order:
attributes = @dimensions.map {|h| h[:name]}
# => ["speed", "distance"]
We pull out the ranges from @dimensions and convert them to strings:
sd = @dimensions.map { |h| h[:values].map(&:to_s) }
# => [["0..20", "20..40", "40..60"], ["0..50", "50..100", "100..150"]]
Next we compute the string formatting for all columns but the last:
fmt = sd.zip(attributes)
        .map(&:flatten)
        .map {|a| a.map(&:size)}
        .map {|a| "%-#{a.max}s" }
  # => ["%-6s", "%-8s"]
Here
sd.zip(attributes)
# => [[["0..20", "20..40", "40..60"], "speed" ],
# [["0..50", "50..100", "100..150"], "distance"]]
The 8 in "%-8s" equals the maximum of the length of the column label, distance (8), and the length of the longest string representation of a distance range (also 8, for "100..150"). The - in the formatting string left-adjusts the strings.
We can now print the header:
attributes.zip(fmt).each { |a,f| print f % a + V_DIVIDER }
puts COUNT
speed | distance | count
To print the remaining lines, we construct an array containing the contents of the first two columns. Each element of the array corresponds to a row of the table:
prod = (sd.shift).product(*sd)
# => ["0..20", "20..40", "40..60"].product(*[["0..50", "50..100", "100..150"]])
# => ["0..20", "20..40", "40..60"].product(["0..50", "50..100", "100..150"])
# => [["0..20", "0..50"], ["0..20", "50..100"], ["0..20", "100..150"],
# ["20..40", "0..50"], ["20..40", "50..100"], ["20..40", "100..150"],
# ["40..60", "0..50"], ["40..60", "50..100"], ["40..60", "100..150"]]
We need to flatten @data:
flat_data = @data.flatten
# => [1, 2, 3, 4, 5, 6, 7, 8, 9]
The first time through the until do loop,
r1 = prod.shift
# => ["0..20", "0..50"]
# prod now => [["0..20", "50..100"],...,["40..60", "100..150"]]
r2 = r1.zip(fmt)
# => [["0..20", "%-6s"], ["0..50", "%-8s"]]
r2.each { |d,f| print f % d + V_DIVIDER }
0..20 | 0..50 |
puts (flat_data.shift)
0..20 | 0..50 | 1
# flat_data now => [2, 3, 4, 5, 6, 7, 8, 9]
