How do I add specific values of columns to create new columns? - format

I have a dataset which I want to format in order to perform repeated measures anova. My dataset is of the form:
set.seed(32)
library(tibble)
id<- rep(1:2,each=3)
y_0 <- rep(rnorm(2,mean=50,sd=10),each=3)
time <- rep(c(1,2,3),times=2)
c<-rep(rnorm(2,mean=10,sd=12),each=3)
data <- tibble(id,y,t,c)
I want to bring the dataset in the form of a dataset for repeated measures anova meaning I want to have only one value for id in each column and create 3 more columns. One for y+c in time 1 named y_1,y+c in time 2 named y_2 and y+c in time 3 named y_3. Can anyone provide some assistance?

Related

Fastest way to select a value from a single row table

I have a disconnected table called TableX with two columns and a single row of data:
Date | Amount
4-Jan | 10.5
There will always only be 1 row of data.
The model it is part of is huge and the amount 10.5 will be used extensively in measures.
To extract 10.5 from this table I've used the following:
AmountValue := MIN(TableX[Amount])
Is MIN the fastest way to extract this value?
Here test from DaxStudio (5 Cold + 5 Hot cache execution) for disconnected table containin only one row as in your example :
DEFINE
MEASURE 'TableX'[AmountValue3] = ALL(TableX[Amount])
evaluate
{[AmountValue2]}
DEFINE
MEASURE 'TableX'[AmountValue2] = Values(TableX[Amount])
evaluate
{[AmountValue2]}
DEFINE
MEASURE 'TableX'[AmountValue] = min(TableX[Amount])
evaluate
{[AmountValue]}

How to understand part and partition of ClickHouse?

I see that clickhouse created multiple directories for each partition key.
Documentation says the directory name format is: partition name, minimum number of data block, maximum number of data block and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think partition is split into different parts.
Am I right?
Parts -- pieces of a table which stores rows. One part = one folder with columns.
Partitions are virtual entities. They don't have physical representation. But you can say that these parts belong to the same partition.
Select does not care about partitions.
Select is not aware about partitioning keys.
BECAUSE each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx
These files contain min and max values of these columns in this part.
Also this minmax_ values are stored in memory in a (c++ vector) list of parts.
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked in-memory part list and found 0 parts because minmax_A.idx = (1,1) and this select needed (555, 555).
CH does not store partitioning key values.
So for example toYYYYMM(today()) = 202002 but this 202002 is not stored in a part or anywhere.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today()))
In my case, I had used groupArray() and arrayEnumerate() for ranking in Populate. I thought that Populate can run query with new data on the partition (in my case: toStartOfDay(Date)), the total sum of new inserted data is correct but the groupArray() function is doesn't work correctly.
I think it's happened because when insert one Part, CH will groupArray() and rank on each Part immediately then merging Parts in one Partition, therefore i wont get exactly the final result of groupArray() and arrayEnumerate() function.
Summary, Merge
[groupArray(part_1) + groupArray(part_2)] is different from
groupArray(Partition)
with
Partition=part_1 + part_2
The solution that i tried is insert new data as one block size, just like using groupArray() to reduce the new data to the number of rows that is lower than max_insert_block_size=1048576. It did correctly but it's hard to insert new data of 1 day as one Part because it will use too much memory for querying when populating the data of 1 day (almost 150Mn-200Mn rows).
But do u have another solution for Populate with groupArray() for new inserting data, such as force CH to use POPULATE on each Partition, not each Part after merging all the part into one Partition?

"Hive" max column value from multiple columns

I have a situation where I need to find the max value on 3 calculated fields and store it in another field, is it possible to do it in one SQL query? Below is the example
SELECT
Income1, Income1 * 2% as Personal_Income, Income2, Income2 * 10% as Share_Income, Income3, Income3 * 1% as Job_Income, Max(Personal_Income, Share_Income, Job_Income) From Table
one way I tried is to calculate Personal_Income, Share_Income, Job_Income in the first pass and in the second pass I used
Select Case when Personal_income > Share_Income and Personal_Income > Job_Income then Personal_income when Share_income > Job_Income then Share_income Else Job_income as the greatest_income
but this require me to do 2 scans on a billion rows table, How can I avoid this and do it in a single pass? Any help much appreciated.
You can use greatest which results in the greatest value among multiple columns on a given row.
select greatest(Income1*1.02, Income2*1.1, Income3*1.01) as greatest_Income
From Table
Note this is not an aggregate function and other columns can be included in select as needed.

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is, how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given the user1, product1 like input above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 0 of the matrix, userB is mapped to row 1, productX is mapped to column 0 and so on.
Given the size of data, I would have to use Hadoop Map-Reduce but can't think of a foolproof way of efficiently doing this.
This can be solved if we can do the following:
Dump unique userIds.
Dump unique productIds.
Map each unique userId in (1) to a row index.
Map each unique productId in (2) to a column index.
I can do (1) and (2) easily but having trouble coming up with an efficient approach to solve (3) (4 will be solved if we solve 3).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1,2,3....N (where N is configurable. N = 100 for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrix row Id)
This method should work but I am not sure how to handle cases when reducer tasks failed/killed.
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

R - Sorting and Sub-setting Maximum Values within Columns

I am trying to iteratively sort data within columns to extract N maximum values.
My data is set up with the first and second columns containing occupation titles and codes, and all of the rest of the columns containing comparative values (in this case location quotients that had to be previously calculated for each city) for those occupations for various cities:
*occ_code city1 ... city300*
occ1 5 ... 7
occ2 20 ... 22
. . . .
. . . .
occ800 20 ... 25
For each city I want to sort by the maximum values, select a subset of those maximum values matched by their respective occupations titles and titles. I thought it would be relatively trivial but...
edit for clarification: I want end to with a sorted subset of the data for analysis.
occ_code city1
occ200 10
occ90 8
occ20 2
occ95 1.5
At the same time I want to be able to repeat the sort column-wise (so I've tried lots of order commands through calling columns directly: data[,2]; just to be able to run the same analysis functions over the entire dataset.
I've been messing with plyr for the past 3 days and I feel like the setup of my dataset is just not conducive to how plyer was meant to be used.
I'm not exactly sure what your desired output is according to your example snippit. Here's how you could get a data frame like that for every city using plyr and reshape
#using the same df from nico's answer
library(reshape)
df.m <- melt(df, id = 1)
a.cities <- cast(df.m, codes ~ . | variable)
library(plyr)
a.cities.max <- aaply(a.cities, 1, function(x) arrange(x, desc(`(all)`))[1:4,])
Now, a.cities.max is an array of data frames, with the 4 largest values for each city in each data frame. To get one of these data frames, you can index it with
a.cities.max$X13
I don't know exactly what you'll be doing with this data, but you might want it back in data frame format.
df.cities.max <- adply(a.cities.max, 1)
One way would be to use order with ddply from the package plyr
> library(plyr)
> d<-data.frame(occu=rep(letters[1:5],2),city=rep(c('A','B'),each=5),val=1:10)
> ddply(d,.(city),function(x) x[order(x$val,decreasing=TRUE)[1:3],])
order can sort on multiple columns if you want that.
This will output the max for each city. Similar results can be obtained using sort or order
# Generate some fake data
codes <- paste("Code", 1:100, sep="")
values <- matrix(0, ncol=20, nrow=100)
for (i in 1:20)
values[,i] <- sample(0:100, 100, replace=T)
df <- data.frame(codes, values)
names(df) <- c("Code", paste("City", 1:20, sep=""))
# Now for each city we get the maximum
maxval <- apply(df[2:21], 2, which.max)
# Output the max for each city
print(cbind(paste("City", 1:20), codes[maxval]))

Resources