DataFrame, count unique values, Java - java-8

I have a DataFrame and I want to count the unique rows over two of its columns. For example:
a x
a x
a y
b y
b y
b y
should become:
a x 2
a y 1
b y 3
I know how to do this with a pandas DataFrame, but now I want to do it directly in Java (ideally with Java 8).

I am not sure what input type you have, but assuming you have a List<DataFrame> list and DataFrame implements equals/hashCode as expected, you could use a combination of two collectors:
Map<DataFrame, Long> count = list.stream().collect(groupingBy(x -> x, counting()));
which requires the following static imports:
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

I found the following solution myself. I am copying it here in case somebody is interested:
DataFrame df2 = df.groupBy("Column_one", "Column_two").count();
df2.show();
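For reference, the pandas equivalent the asker alludes to could look something like this sketch (the column names are invented for the example):
import pandas as pd

# Column names are made up; the data mirror the example in the question.
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col2': ['x', 'x', 'y', 'y', 'y', 'y']})
print(df.groupby(['col1', 'col2']).size())
# col1  col2
# a     x       2
#       y       1
# b     y       3
# dtype: int64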

Related

Random value from two seeds

I have a two-dimensional grid and need a reproducible, random value for every integer coordinate on this grid. This value should be as unique as possible; in a grid of, say, 1000 x 1000, no value should occur twice.
To put it more mathematically: I need a function f(x, y) which gives a unique number no matter what x and y are, as long as they are each in the range [0, 1000].
f(x, y) has to be reproducible and not have side-effects.
There is probably some trivial solution, but everything that comes to mind, such as multiplying x and y and adding some salt, does not lead anywhere, because the resulting number can easily occur multiple times.
One working solution I have is to use a random number generator and simply compute ALL values in the grid, but that is either too computationally heavy (if done every time a value is needed) or requires too much memory in my case (I want to avoid pre-computing all the values).
Any suggestions?
Huge thanks in advance.
I would use the zero-padded concatenation of your x and y as a seed for a built-in random generator. I'm actually using something like this in some of my current experiments.
For example, x = 13 and y = 42 would become int('0013' + '0042') = 130042 to use as the random seed. Then you can use the random generator of your choice to get the kind (float, int, etc.) and range of values you need:
Example in Python 3.6+:
import numpy as np
from itertools import product
X = np.zeros((1000, 1000))
for x, y in product(range(1000), range(1000)):
    np.random.seed(int(f'{x:04}{y:04}'))
    X[x, y] = np.random.random()
Each value in the grid is randomly generated, but independently reproducible.
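If pre-computing the full grid is a concern (as the question notes), the same seeding idea can be wrapped in a small per-cell function that is evaluated on demand. The sketch below just illustrates the answer's approach; grid_value is a made-up helper name:
import numpy as np

def grid_value(x, y):
    """Reproducible pseudo-random value for a single cell, computed on demand."""
    np.random.seed(int(f'{x:04}{y:04}'))   # zero-padded concatenation as seed
    return np.random.random()

# Same coordinates always give the same value, with no precomputed grid needed.
assert grid_value(13, 42) == grid_value(13, 42)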

flatten a BlockMatrix into a Matrix in Sympy

SymPy has a BlockMatrix class, but it is not a regular Matrix; e.g., you cannot matrix-multiply a BlockMatrix.
BlockMatrix is a convenient way to build a structured matrix, but I do not see a way to use it with unstructured matrices.
Is there a way to flatten a BlockMatrix, or another convenient way to build a regular Matrix from blocks, similar to numpy.block?
You can use the method as_explicit() to get a flat explicit matrix, like this:
from sympy import *
n = 3
X = Identity(n)
Y = Identity(n)
Z = Identity(n)
W = Identity(n)
R = BlockMatrix([[X,Y],[Z,W]])
print(R.as_explicit())
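As a quick sanity check that the flattened result behaves like an ordinary Matrix (the question's original complaint was that a BlockMatrix cannot be matrix-multiplied), here is a short follow-up sketch:
from sympy import Identity, BlockMatrix

n = 3
R = BlockMatrix([[Identity(n), Identity(n)], [Identity(n), Identity(n)]])

M = R.as_explicit()   # an ordinary, explicit 6x6 Matrix
print(M.shape)        # (6, 6)
print(M * M)          # regular matrix multiplication now works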

Sorted Two-Way Tabulation of Many Values

I have a decent-sized dataset (about 18,000 rows) with two variables I want to tabulate: one takes on many string values, and the second takes on just 4 values. I want to tabulate the string values by the 4 categories, sorted. I have tried several commands, including tabsort, which works, but only if I restrict the number of rows it uses to the first 603 (at least with the way the data are currently sorted). If the number of rows is greater than that, I get the r(134) error that there are too many values. Is there anything to be done? My goal is to create a table with the most common words and export it to LaTeX. Would it be a lot easier to try to do this in something like R?
Here's one way, via contract and texsave from SSC:
/* Fake Data */
set more off
clear
set matsize 5000
set seed 12345
set obs 1000
gen x = string(rnormal())
expand mod(_n,10)
gen y = mod(_n,4)
/* Collapse Data to Get Frequencies for Each x-y Cell */
preserve
contract x y, freq(N)
reshape wide N, i(x) j(y)
forvalues v=0/3 {
    lab var N`v' "`v'" // need this for labeling
    replace N`v'=0 if missing(N`v')
}
egen T = rowtotal(N*)
gsort -T x // sort by occurrence
keep if T > 0 // set occurrence threshold
capture ssc install texsave
texsave x N0 N1 N2 N3 using "tab_x_y.tex", varlabel replace title("tab x y")
restore
/* Check Calculations */
type "tab_x_y.tex"
tab x y, rowsort
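Since the question asks whether a tool like R would make this easier, here is a rough pandas sketch of the same kind of sorted two-way tabulation for comparison; the column names and toy data are invented for illustration:
import pandas as pd

# Tiny made-up example: `word` plays the role of the string variable and
# `group` the 4-level variable from the question.
df = pd.DataFrame({'word': ['alpha', 'beta', 'alpha', 'gamma', 'beta', 'alpha'],
                   'group': [0, 1, 2, 3, 0, 1]})

tab = pd.crosstab(df['word'], df['group'])        # two-way frequency table
tab['Total'] = tab.sum(axis=1)                    # row totals for sorting
tab = tab.sort_values('Total', ascending=False)   # most common words first
tab.to_latex('tab_word_group.tex')                # export to LaTeX
print(tab)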

Pandas pivot table Nested Sorting

Given this data frame and pivot table:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['x', 'y', 'z', 'x', 'y', 'z'],
                   'B': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'C': [7, 5, 3, 4, 1, 6]})
df
A B C
0 x one 7
1 y one 5
2 z one 3
3 x two 4
4 y two 1
5 z two 6
table = pd.pivot_table(df, index=['A', 'B'],aggfunc=np.sum)
table
A B
x one 7
two 4
y one 5
two 1
z one 3
two 6
Name: C, dtype: int64
I want to sort the pivot table such that the order of 'A' is z, x, y and, within each value of 'A', 'B' is ordered by the corresponding values of column 'C' in descending order.
Like this:
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
Name: C, dtype: int64
Thanks in advance!
I don't believe there is an easy way to accomplish your objective. The following solution first sorts your table in descending order based on the values of column C. It then concatenates each slice based on your desired order.
order = ['z', 'x', 'y']
table = table.reset_index().sort_values('C', ascending=False)
>>> pd.concat([table.loc[table.A == val, :].set_index(['A', 'B']) for val in order])
C
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
If you can read in column A as categorical data, then it becomes much more straightforward. Setting your categories as list('zxy') and specifying ordered=True uses your custom ordering.
You can read in your data using something similar to:
'A':pd.Categorical(['x','y','z','x','y','z'], list('zxy'), ordered=True)
Alternatively, you can read in the data as you currently are, then use astype to convert A to categorical:
df['A'] = df['A'].astype('category', categories=list('zxy'), ordered=True)
Once A is categorical, you can pivot the same as before, and then sort with:
table = table.sort_values(ascending=False).sortlevel(0, sort_remaining=False)
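Note that in current pandas the categories= keyword of astype and the sortlevel method have been removed; a rough sketch of the same categorical idea with the newer API (pd.CategoricalDtype and sort_index) might look like this:
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z', 'x', 'y', 'z'],
                   'B': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'C': [7, 5, 3, 4, 1, 6]})

# categories= was removed from astype; an explicit ordered CategoricalDtype
# expresses the custom z, x, y ordering instead.
df['A'] = df['A'].astype(pd.CategoricalDtype(list('zxy'), ordered=True))

# Same aggregation as the pivot table in the question.
table = df.groupby(['A', 'B'], observed=True)['C'].sum()

# Sort by value first (descending), then restore the categorical order of
# level 'A' without re-sorting the remaining level.
result = table.sort_values(ascending=False).sort_index(level='A', sort_remaining=False)
print(result)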
Solution
custom_order = ['z', 'x', 'y']
kwargs = dict(axis=0, level=0, drop_level=False)
new_table = pd.concat(
    [table.xs(idx_v, **kwargs).sort_values(ascending=False) for idx_v in custom_order]
)
Alternate one-liner
pd.concat([table.xs(i, drop_level=0).sort_values(ascending=0) for i in list('zxy')])
Explanation
custom_order is your desired order.
kwargs is a convenient way to improve readability (in my opinion). Key elements to note: axis=0 and level=0 might be important for you if you want to leverage this further. However, those are also the default values and can be left out.
drop_level=False is the key argument here; it is necessary to keep the idx_v level we are taking an xs of, so that pd.concat puts everything together in the way we'd like.
I use a list comprehension in almost the exact same manner as Alexander within the pd.concat call.
Demonstration
print(new_table)
A B
z two 6
one 3
x one 7
two 4
y one 5
two 1
Name: C, dtype: int64

How to carry over a calculated value within an RDD? - Apache Spark

SOLVED: There is no good solution to this problem
I am sure that this is just a syntax-related question and that the answer is an easy one.
What I am trying to achieve is to:
- pass a variable to the RDD
- change the variable according to the RDD data
- get the adjusted variable back
Let's say I have:
var b = 2
val x = sc.parallelize(0 to 3)
What I want to do is obtain the value 2+0 + 2+0+1 + 2+0+1+2 + 2+0+1+2+3 = 18.
That is, obtain the value 18 by doing something like
b = x.map(i=> … b+i...).collect
The problem is that, for each i, I need to carry over the value of b, to be incremented with the next i.
I want to use this logic for adding the elements to an array that is external to the RDD.
How would I do that without doing the collect first?
As mentioned in the comments, it's not possible to mutate one variable with the contents of an RDD as RDDs are distributed across potentially many different nodes while mutable variables are local to each executor (JVM).
Although not particularly performant, it's possible to implement these requirements on Spark by translating the sequential algorithm into a series of transformations that can be executed in a distributed environment.
Using the same example as in the question, this algorithm in Spark could be expressed as:
val initialOffset = 2
val rdd = sc.parallelize(0 to 3)
val halfCartesian = rdd.cartesian(rdd).filter{case (x,y) => x>=y}
val partialSums = halfCartesian.reduceByKey(_ + _)
val adjustedPartials = partialSums.map{case (k,v) => v+initialOffset}
val total = adjustedPartials.reduce(_ + _)
scala> total
res33: Int = 18
Note that cartesian is a very expensive transformation as it creates (m x n) elements, or in this case n^2.
This is just to say that it's not impossible, but probably not ideal.
If the amount of data to be processed sequentially fits in the memory of one machine (maybe after filtering or reducing), then Scala has a built-in collection operation that does exactly what's being asked: scan[Left|Right]
val arr = Array(0,1,2,3)
val cumulativeScan = arr.scanLeft(initialOffset)(_ + _)
// we drop the head because scanLeft prepends the initial element to the sequence
val result = cumulativeScan.tail.sum
result: Int = 18
