DolphinDB: pairwise correlation

I want to calculate the pairwise correlations of stock returns based on the volume-weighted average prices (vwap).
My starting point is:
priceMatrix=pivot(wavg, [t.price,t.volume],t.trade_date,t.sym)
This creates a vwap price matrix with the trade time as row label and the stock symbol as column label.
Output (the real one has 250+ rows and 1576 columns):
           S1    S2    S3    S4
           ----  ----  ----  ----
2020.10.01 38.5  29.0  9.8   7.1
2020.10.02 38    29.1  10.4  7.2
2020.10.03 37.2  29.3  10.8  7.6
...
What I need now is to do the pairwise correlations for each column like this:
    S1  S2  S3  S4
    --- --- --- ---
S1  1   **  **  **
S2  **  1   **  **
S3  **  **  1   **
S4  **  **  **  1
Does anyone know of a way I can get a pairwise correlation matrix in DolphinDB? Thanks!

You can calculate the pairwise correlation with the cross function, for example:
priceMatrix=pivot(avg, t.open,t.trade_date,t.ts_code)
t1=each(ratios,priceMatrix)-1;    // ratios gives p[t]/p[t-1] per column; minus 1 yields simple returns
pairwisecorr=cross(corr,t1,t1);   // apply corr to every pair of columns
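For comparison (not part of the DolphinDB answer), here is a minimal pandas sketch of the same computation, using the example values from the question:

import pandas as pd

# Toy vwap matrix from the question: rows = trade dates, columns = symbols
prices = pd.DataFrame(
    {"S1": [38.5, 38.0, 37.2], "S2": [29.0, 29.1, 29.3],
     "S3": [9.8, 10.4, 10.8],  "S4": [7.1, 7.2, 7.6]},
    index=pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-03"]),
)

returns = prices.pct_change().dropna()  # equivalent of each(ratios, m) - 1
print(returns.corr())                   # pairwise correlation matrix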

Fast calculation of probability distribution in board game Da Vinci Code

I'm interested in efficiently calculating the probability distribution over possible secret numbers given what one can observe of the opponents' hand (and your own hand) in the board game Da Vinci Code. A link to the game here: https://boardgamegeek.com/boardgame/8946/da-vinci-code
I have abstracted the problem into the following:
You are given an array A of length N and a finite set of numbers Si for each index i of the array. Now,
we are to place a number from Si at each index i to fill the entire array A;
while ensuring that the number is unique across the entire array A;
and for 3 disjoint subarrays A1, A2, A3 of A such that concat(A1, A2, A3) = A, the numbers in each subarray must follow a strictly increasing order;
given all the possible numbers to form A that satisfy the above constraints, what is the probability distribution over each number at each index?
Here I provide an example below:
Assuming we have the following array of length 5, with each column representing Si at the index of the column:
| 6 6 | 6 6 | 6 |
| 5 | 5 | |
| 4 4 | | 4 |
| | 3 3 | |
| 2 | 2 2 | |
| 1 1 | | |
| ___ | __ | _ |
| A1 | A2 | A3|
The set of all possible arrays is:
14236
14256
14356
15234
15236
15264
15364
16234
16254
16354
24356
25364
26354
45236
Therefore the probability distribution over each number [1-6] at each index is:
6 0 4/14 0 3/14 6/14
5 0 6/14 0 6/14 0
4 1/14 4/14 0 0 8/14
3 0 0 6/14 5/14 0
2 3/14 0 8/14 0 0
1 10/14 0 0 0 0
___________ __________ ______
A1 A2 A3
Brute-forcing this problem is obviously doable, but I have a gut feeling that there must be a more efficient algorithm for this.
The reason I think so is that one can derive the probability distribution from the set of all possibilities but not the other way around, so the distribution itself must contain less information than the set of all possibilities does. Therefore, I believe that we do not need to generate all possibilities just to obtain the probability distribution.
Hence, I am wondering if there is any smart matrix operation we could use for this problem or even fixed-point iteration/density evolution to approximate the end probability distribution? Some other potentially more efficient approaches to this problem are also appreciated.
Edit: By brute force, I mean specifically enumerating all possibilities with constraint propagation, like in Sudoku. My hope is to obtain an accurate solution, or an approximate solution that approximates well (better than plain Monte Carlo), and that works better than CP in terms of running time.
Edit2: The better solution I desire should have the characteristic that it does not need to generate all possibilities to obtain or approximate the probability distribution.
Did you consider Constraint Propagation?
When you assign a number to a position, that number cannot appear in any other position, so exclude that number from the remaining positions
When you assign a number in the first column of a subarray, the second column must contain a larger value, so exclude all values that are lower or equal
With a BF approach in your example the code would generate and check 4 * 4 * 3 * 4 * 2 = 384 possibilities; with the CP approach we only generate 65 possibilities.
Here is a sample Python implementation:
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DaVinci:
    grid : List[List[int]]
    top : int
    lastcol : int = 0
    solved : List = field(default_factory=list)
    count : int = 0
    distrib : List[Dict[int, int]] = field(init=False)

    def __post_init__(self):
        self.lastcol = len(self.grid) - 1
        self.distrib = [{x: 0 for x in range(1, self.top + 1)}
                        for y in range(len(self.grid))]
        self.solve_next(current=0, even=True, blocked=[], minval=0, solving=[])
        self.count = len(self.solved)

    def solve_next(self, current, even, blocked, minval, solving):
        # Try each candidate for the current column, skipping numbers that
        # are already used (blocked) or would break the increasing order.
        for n in self.grid[current]:
            if n not in blocked and n > minval:
                if current != self.lastcol:
                    # even alternates per column; for this 2/2/1 layout it is
                    # True exactly when the next column stays in the same
                    # subarray, so n * even resets the bound at a boundary.
                    self.solve_next(current + 1, not even, blocked + [n],
                                    n * even, solving + [n])
                else:
                    for col in range(self.lastcol):
                        self.distrib[col][solving[col]] += 1
                    self.distrib[self.lastcol][n] += 1
                    self.solved.append(solving + [n])

    def show_solved(self):
        for sol in self.solved:
            print(''.join(map(str, sol)))

    def show_distrib(self):
        for i in range(1, self.top + 1):
            print(i, end=' ')
            for col in range(len(self.grid)):
                print(f'{self.distrib[col][i]:2d}/{self.count}', end=' ')
            print()
dv = DaVinci([[1,2,4,6],[1,4,5,6],[2,3,6],[2,3,5,6],[4,6]], 6)
dv.show_solved()
14236
14256
14356
15234
15236
15264
15364
16234
16254
16354
24356
25364
26354
45236
dv.show_distrib()
1 10/14 0/14 0/14 0/14 0/14
2 3/14 0/14 8/14 0/14 0/14
3 0/14 0/14 6/14 5/14 0/14
4 1/14 4/14 0/14 0/14 8/14
5 0/14 6/14 0/14 6/14 0/14
6 0/14 4/14 0/14 3/14 6/14
A simple idea to get an approximation for the distribution is to use a Monte Carlo approach.
Set a variable total := 0 and a matrix M[N][Q] with all entries initially set to zero (Q is the total number of allowed values).
Fix a positive integer K and perform K iterations. At each iteration, for each i in [1..N], take a random element from Si to fill the array A. When the array A is completely filled, verify in O(N) whether it satisfies your conditions. If so, increment the variable total by one and iterate through the array, incrementing the matrix entries M[i][A[i]] by one, for i in [1..N].
In the end, iterate through all the elements of the matrix M in O(N Q) and divide its elements by total to get an approximation for the distribution.
Total time complexity is O(N (K + Q)).
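A minimal Python sketch of this rejection-sampling idea, using the candidate sets and the 2/2/1 subarray layout from the example above (the names and the constant K are illustrative):

import random

S = [[1,2,4,6], [1,4,5,6], [2,3,6], [2,3,5,6], [4,6]]  # candidate sets per index
bounds = [(0, 2), (2, 4), (4, 5)]   # A1, A2, A3 as (start, end) slices
Q, N, K = 6, len(S), 200_000

def valid(a):
    if len(set(a)) != len(a):                      # all numbers unique
        return False
    return all(all(a[i] < a[i+1] for i in range(lo, hi-1))
               for lo, hi in bounds)               # each subarray strictly increasing

total = 0
M = [[0] * (Q + 1) for _ in range(N)]
for _ in range(K):
    a = [random.choice(s) for s in S]
    if valid(a):
        total += 1
        for i, v in enumerate(a):
            M[i][v] += 1

for v in range(1, Q + 1):
    print(v, [round(M[i][v] / total, 3) for i in range(N)])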
You can also precalculate stuff to make the approximation more precise. For example, you can precalculate all increasing sequences in the groups A1, A2 and A3. Put them in arrays I1, I2, I3. Then, at each iteration, instead of taking random elements from each Si, you take random sequences from I1, I2 and I3 and verify if the concatenation has no repeated elements (in O(N)). If so, proceed as before. The total time complexity (apart from the expensive precalculation) remains O(N (K + Q)).
Start by converting all legal subarray selections into bitvectors.
E.g., for A2 we have [2,3], [2,5], [2,6], [3,5], [3,6]
[2,3] as a bitvector is 000110
[3,5] is 010100
Next, arrange your three subarrays by the number of bitvectors they have.
Next, put these in a hash for each subarray/member combination except the smallest subarray. Use the smallest set bit as the key.
E.g. For [2,3] in A2, we'd have {2 => 000110}
Note that the values of the map need to be in an array, since there will be multiple bitvectors for each index/element combo.
Finally:
For every bitvec of subarray_small:
    For every non-set bit of that bitvec:
        Find the list that has that bit as a key in subarray_medium.
        For every bitvec in this list:
            Check if the inverse of (bitvec_small | bitvec_medium) is in the hash for subarray_large.
            If it is, we have a valid arrangement; update your frequency counts.
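To make the bitvector representation concrete, here is a Python sketch that enumerates each subarray's legal increasing selections as bitmasks and counts combinations with pairwise-disjoint masks. It uses plain disjointness tests instead of the hash-by-lowest-set-bit lookup described above, so treat it as an illustration of the encoding rather than of the full scheme:

from itertools import product

S = [[1,2,4,6], [1,4,5,6], [2,3,6], [2,3,5,6], [4,6]]
bounds = [(0, 2), (2, 4), (4, 5)]          # A1, A2, A3
N, Q = len(S), 6

def selections(lo, hi):
    """All strictly increasing tuples for columns lo..hi-1, with their bitmask."""
    out = []
    for combo in product(*S[lo:hi]):
        if all(a < b for a, b in zip(combo, combo[1:])):
            mask = 0
            for v in combo:
                mask |= 1 << v
            out.append((combo, mask))
    return out

groups = [selections(lo, hi) for lo, hi in bounds]
total = 0
freq = [[0] * (Q + 1) for _ in range(N)]
for (c1, m1), (c2, m2), (c3, m3) in product(*groups):
    if m1 & m2 or (m1 | m2) & m3:          # masks must be pairwise disjoint
        continue
    total += 1
    for i, v in enumerate(c1 + c2 + c3):
        freq[i][v] += 1

print(total)                               # 14 for the example
for v in range(1, Q + 1):
    print(v, [f"{freq[i][v]}/{total}" for i in range(N)])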

R: How to solve Lapack routine dgesv: system is exactly singular in Mahalanobis distance

I am trying to run an Explanatory Factor Analysis on my questionnaire data.
I have data for 201 participants and 30 questions. The head of my data looks something like this (I am showing only the first six questions to give an idea of the dataset structure):
  Q1 Q2 Q3 Q4 Q5 Q6
1 14  0 20  0  0  0
2 14 14 20 20 20  1
3 20 18 20 20 20  9
4 14 14 20 20 20  0
5 20 18 20 20 20  5
6 20 18 20 20  8  7
I want to find multivariate outliers, so I am trying to calculate the Mahalanobis distance (cases with Mahalanobis distance p-values smaller than 0.001 are considered outliers).
I am using this code in RStudio (all_data_EFA is my dataset name):
library(dplyr)

distance <- as.matrix(mahalanobis(all_data_EFA, colMeans(all_data_EFA),
                                  cov = cov(all_data_EFA)))
Mah_significant <- all_data_EFA %>%
  transmute(row_number = 1:nrow(all_data_EFA),
            Mahalanobis_distance = distance,
            Mah_p_value = pchisq(distance, df = ncol(all_data_EFA),
                                 lower.tail = FALSE)) %>%
  filter(Mah_p_value <= 0.001)
However, when I run "distance" I get the following Error:
Error in solve.default(cov, ...) :
Lapack routine dgesv: system is exactly singular: U[26,26] = 0
As far as I understood, this means that the covariance matrix of my data is singular, hence the matrix is not invertible and I cannot calculate Mahalanobis distance.
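(To see why this happens, a minimal numpy sketch on synthetic data: when one column is a linear combination of others, the covariance matrix is rank-deficient and cannot be inverted. The pseudo-inverse at the end is shown only as one possible fallback, not as the established fix.)

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(201, 29))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # 30th column = col 1 + col 2

S = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(S))               # 29 < 30: rank-deficient, not invertible

# Mahalanobis distances via the Moore-Penrose pseudo-inverse instead of solve():
mu = X.mean(axis=0)
d = np.einsum('ij,jk,ik->i', X - mu, np.linalg.pinv(S), X - mu)
print(d[:3])                                  # squared distances for the first rows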
Is there an alternative way to calculate multivariate outliers or how can I solve this problem?
Many thanks.

What is the encoded variant in High-Performance Linpack Benchmark?

When I run HPL with multiple options like different problem sizes etc., the benchmark performs multiple runs on the system. In my example:
multiple NBMIN
multiple BCAST
multiple DEPTH
etc.
When I then look at the single output file of the run, I don't see how I can differentiate those outputs. In my example, how do I know which variant WR01R2C4, WR01R2C8, or WR03R2C4 is which?
The output gives a clue with an encoded variant, but I couldn't find any info on how to decode it.
Does anybody know?
Here is a snippet of my output file...
(on another note: is there an option to highlight (i.e. make bold) text inside my codeblock on stackoverflow?)
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 9000
NB : 640
PMAP : Row-major process mapping
P : 3
Q : 3
PFACT : Crout
NBMIN : 4 8
NDIV : 2
RFACT : Right
BCAST : 1ringM 2ringM
DEPTH : 0 1
SWAP : Mix (threshold = 60)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01R2C4 9000 640 3 3 9.42 5.1609e+01
HPL_pdgesv() start time Mon Nov 29 13:12:56 2021
HPL_pdgesv() end time Mon Nov 29 13:13:05 2021
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.34317645e-03 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01R2C8 9000 640 3 3 9.35 5.2011e+01
HPL_pdgesv() start time Mon Nov 29 13:13:06 2021
HPL_pdgesv() end time Mon Nov 29 13:13:15 2021
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.50831382e-03 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR03R2C4 9000 640 3 3 9.32 5.2164e+01
HPL_pdgesv() start time Mon Nov 29 13:13:16 2021
HPL_pdgesv() end time Mon Nov 29 13:13:25 2021
If it isn't documented, just look into the source code. In testing/ptest/HPL_pdtest.c you'll find the following line:
HPL_fprintf( TEST->outfp,
             "W%c%1d%c%c%1d%c%1d%12d %5d %5d %5d %18.2f %18.3e\n",
             ( GRID->order == HPL_ROW_MAJOR ? 'R' : 'C' ),
             ALGO->depth, ctop, crfact, ALGO->nbdiv, cpfact, ALGO->nbmin,
             N, NB, nprow, npcol, wtime[0], Gflops );
Hence, the format of the encoded variant is:
WR01R2C4
^^^^^^^^
||||||||
|||||||+--- NBMIN
||||||+---- PFACT (C = Crout, L = Left, R = Right)
|||||+----- NBDIV
||||+------ RFACT (see PFACT)
|||+------- BCAST (0 = 1ring, 1 = 1ringM, 2 = 2ring, 3 = 2ringM, 4 = long)
||+-------- DEPTH
|+--------- PMAP (R = Row-major, C = Column-major)
+---------- always W
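For convenience, a small Python decoder following that format string (the lookup tables just restate the legend above; a sketch, not part of HPL itself):

# Decode an HPL "encoded variant" such as WR01R2C4, following the
# printf format W%c%1d%c%c%1d%c%1d from HPL_pdtest.c.
PMAP  = {'R': 'Row-major', 'C': 'Column-major'}
FACT  = {'C': 'Crout', 'L': 'Left', 'R': 'Right'}
BCAST = {'0': '1ring', '1': '1ringM', '2': '2ring', '3': '2ringM', '4': 'long'}

def decode_variant(v: str) -> dict:
    assert len(v) == 8 and v[0] == 'W'
    return {
        'PMAP':  PMAP[v[1]],
        'DEPTH': int(v[2]),
        'BCAST': BCAST[v[3]],
        'RFACT': FACT[v[4]],
        'NBDIV': int(v[5]),
        'PFACT': FACT[v[6]],
        'NBMIN': int(v[7]),
    }

for v in ('WR01R2C4', 'WR01R2C8', 'WR03R2C4'):
    print(v, decode_variant(v))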

gnuplot gives wrong results from stats matrix

Suppose that I have the file data.dat with the following content:
Days 1 2 4 6 10 15 20 30
Group 01 37.80 30.67 62.88 86.06 26.24 98.49 65.42 61.28
Group 02 38.96 72.99 38.24 74.11 39.54 91.59 81.14 91.22
Group 03 82.34 75.25 82.58 28.22 39.21 81.30 41.30 42.48
Group 04 75.52 42.83 66.80 20.50 94.08 74.78 95.09 53.16
Group 05 89.32 56.78 30.05 68.07 59.18 94.18 39.77 67.56
Group 06 70.03 78.71 37.59 60.55 46.40 82.73 67.34 93.38
Group 07 67.83 88.73 48.01 62.19 49.40 67.68 25.97 58.98
Group 08 61.15 96.06 59.62 39.42 60.06 94.18 76.06 32.02
Group 09 65.61 72.39 54.07 92.79 56.58 39.14 81.81 39.16
Group 10 59.65 77.81 40.51 68.49 66.15 80.33 87.31 42.07
The final intention is to create a histogram using the clustered histogram style.
Besides the graph, I need some values from data.dat, such as
size_x, size_y, min, max, and mean. To achieve the latter I used
set datafile separator tab
stats 'data.dat' skip 1 matrix
The summarized output was:
* MATRIX: [9 X 10]
Minimum: 0.0000 [ 0 0 ]
Maximum: 98.4900 [ 6 0 ]
Mean: 56.0549
The size_x and size_y values are correct – 9 columns and 10 rows – but the min is not.
This is due to the fact that the first column is string-type.
When I include every
set datafile separator tab
stats 'data.dat' skip 1 matrix every ::1
to skip the first column, the summarized output is:
* MATRIX: [9 X 8]
Minimum: 20.5000 [ 0 3 ]
Maximum: 98.4900 [ 5 0 ]
Mean: 63.0617
This time the min and max values are right, but the size_y (shown 8, expected 9) and the index of the min (expected [ 3 3 ]) are not.
What is going on? Did I make some mistake? Am I not noticing something?
The program tries to read a value from the first field of each row, sees "Group xx" and ends up filling in 0 for that entry. You need to tell it to skip the first column.
Amended answer
I think there is a bug here, as well as confusion between documentation and the actual implementation. The matrix rows and columns as implemented by the every selector are indexed from 0 to N-1 as they would be for C language arrays. The documentation incorrectly states or at least implies that the first row and column is matrix[1][1] rather than [0][0]. So the full command needed for your case is
gnuplot> set datafile sep tab
gnuplot> stats 'data.dat' every 1:1:1:1 matrix
warning: matrix contains missing or undefined values
* FILE:
Records: 80
Out of range: 0
Invalid: 0
Header records: 0
Blank: 10
Data Blocks: 1
* MATRIX: [9 X 8]
Mean: 63.0617
Std Dev: 20.6729
Sample StdDev: 20.8033
Skewness: -0.1327
Kurtosis: 1.9515
Avg Dev: 17.4445
Sum: 5044.9400
Sum Sq.: 352332.2181
Mean Err.: 2.3113
Std Dev Err.: 1.6343
Skewness Err.: 0.2739
Kurtosis Err.: 0.5477
Minimum: 20.5000 [ 0 3 ]
Maximum: 98.4900 [ 5 0 ]
I.e. every 1:1:1:1 tells it for both rows and columns the index increment is 1 and the submatrix starts at [1][1] rather than at the origin [0][0].
The output values are all correct, but the indices shown for the size [9 x 8] and the min/max entries are wrong. I will file a bug report for both issues.
I got sidetracked trying to characterize the bug revealed by the original answer and forgot to mention a simpler alternative. For this specific case of one row of column headers and one column of rowheaders, gnuplot provides a special syntax that works without error:
set datafile separator tab
stats 'data.dat' matrix rowheaders columnheaders
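If you want to cross-check these numbers outside gnuplot, a small pandas sketch under the same assumptions (tab-separated file, one header row, one row-label column):

import pandas as pd

# data.dat: tab-separated; first row holds the day labels,
# first column holds the group labels
df = pd.read_csv('data.dat', sep='\t', index_col=0)

v = df.to_numpy()
print(v.shape)                       # (10, 8): 10 groups x 8 days
print(v.min(), v.max(), v.mean())    # 20.5 98.49 63.0617...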

Using Arrays to Calculate Previous and Next Values

Is there a way I can use ClickHouse (arrays?) to calculate sequential values that depend on previously calculated values?
For example:
On day 1, I start with 0, consume 5, and add 100, ending up with 0 - 5 + 100 = 95.
Day 2 starts with what I ended up with on day 1, which is 95; I again consume 10 and add 5, ending up with 95 - 10 + 5 = 90 (which will be the start for day 3).
Given
ConsumeArray [5,10,25]
AddArray [100,5,10]
Calculate EndingPosition (which becomes StartingPosition for the next day):

                                                  Day1   Day2   Day3
--------------------------------------------------------------------------------
StartingPosition (a) = previous EndingPosition |     0     95     90   Calculate
Consumed (b)                                   |     5     10     25
Added (c)                                      |   100      5     10
EndingPosition (d) = a - b + c                 |    95     90     75   Calculate
Just finish all the add/consume operations first and then do an accumulation.
WITH [5,10,25] AS ConsumeArray,
     [100,5,10] AS AddArray
SELECT arrayCumSum(arrayMap((c, a) -> a - c, ConsumeArray, AddArray));
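The same pattern in Python, for readers who want to sanity-check the logic (itertools.accumulate plays the role of arrayCumSum):

from itertools import accumulate

consume = [5, 10, 25]
add = [100, 5, 10]

# Net change per day, then a running sum: identical in spirit to
# arrayCumSum(arrayMap((c, a) -> a - c, ...)) in the ClickHouse query.
ending = list(accumulate(a - c for c, a in zip(consume, add)))
print(ending)                     # [95, 90, 75]
starting = [0] + ending[:-1]      # each day's start is the previous day's end
print(starting)                   # [0, 95, 90]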
