Reorder columns and rows of a HoloViews HeatMap based on a similarity measure (e.g. cosine similarity)

I was surprised that no one seems to have asked this before.
Assuming I have a pandas dataframe (random example), I can get a heatmap with Holoviews and Bokeh renderer:
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
rownames = 'ABCDEFGHIJKLMNO'
df = pd.DataFrame(np.random.randint(0, 20, size=(20, len(rownames))), columns=list(rownames))
hv.HeatMap({'x': df.columns, 'y': df.index, 'z': df},
           kdims=[('x', 'Col Categories'), ('y', 'Row Categories')],
           vdims='z').opts(cmap="viridis", width=520, height=520)
The data (x and y) is categorical, therefore the initial order of rows or columns is unimportant. I wanted to sort rows/columns based on some similarity measure.
One way is to use seaborn clustermap:
import seaborn as sns
heatmap_sns = sns.clustermap(df, metric="cosine", standard_scale=1, method="ward", cmap="viridis")
In the resulting clustermap (figure not reproduced here), columns and rows have been ordered according to similarity (in this case cosine, based on the dot product; other metrics such as 'correlation' are also available).
However, I want to display the clustered heatmap in HoloViews. How do I update the ordering of the original dataframe from the seaborn result?

A much cleaner approach than the one in Alex's answer (the previously accepted answer) is to use the data2d attribute of the object returned by sns.clustermap(); it contains the reordered data, i.e. the data after clustering. So:
df_ro = heatmap_sns.data2d
replaces all the following lines:
# get col and row names by ID
colname_list = [df.columns[col_id] for col_id in heatmap_sns.dendrogram_col.reordered_ind]
rowname_list = [df.index[row_id] for row_id in heatmap_sns.dendrogram_row.reordered_ind]
# update dataframe
df_ro = df.reindex(rowname_list)
df_ro = df_ro[colname_list]

It is possible to access the indices of reordered columns/rows from the seaborn clustermap using:
print(f'rows: {heatmap_sns.dendrogram_row.reordered_ind}')
print(f'columns: {heatmap_sns.dendrogram_col.reordered_ind}')
rows: [5, 0, 13, 2, 18, 7, 4, 16, 12, 19, 14, 15, 10, 3, 8, 6, 17, 11, 1, 9]
columns: [7, 1, 10, 5, 9, 0, 8, 13, 2, 6, 14, 3, 4, 11, 12]
To update row/column order of the original dataframe:
# get col and row names by ID
colname_list = [df.columns[col_id] for col_id in heatmap_sns.dendrogram_col.reordered_ind]
rowname_list = [df.index[row_id] for row_id in heatmap_sns.dendrogram_row.reordered_ind]
# update dataframe
df_ro = df.reindex(rowname_list)
df_ro = df_ro[colname_list]
I've done it here by first getting the names; perhaps there is an even more direct way to update columns/rows by indices.
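For example, an equivalent positional reindex with iloc (just a sketch of the same idea, using the heatmap_sns object from above) skips the name lookup entirely:
# reorder by integer position instead of by label; equivalent to the reindex above
df_ro = df.iloc[heatmap_sns.dendrogram_row.reordered_ind,
                heatmap_sns.dendrogram_col.reordered_ind]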
hv.HeatMap({'x': df_ro.columns, 'y': df_ro.index, 'z': df_ro},
           kdims=[('x', 'Col Categories'), ('y', 'Row Categories')],
           vdims='z').opts(cmap="viridis", width=520, height=520)
Since I have used random data, there is little structure in the categories, but the picture still looks a little less noisy. Note that the HoloViews/dataframe y axis is simply inverted compared to the seaborn clustermap matrix, which is why the graphic looks flipped.
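If the flipped orientation is distracting, HoloViews' invert_yaxis plot option should bring the heatmap in line with the clustermap layout; a minimal sketch of the same call with that option added:
hv.HeatMap({'x': df_ro.columns, 'y': df_ro.index, 'z': df_ro},
           kdims=[('x', 'Col Categories'), ('y', 'Row Categories')],
           vdims='z').opts(cmap="viridis", width=520, height=520,
                           invert_yaxis=True)  # flip y to mirror the clustermap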

Related

How to calculate the optimal time interval in multiple time series forecasts?

First things first: I am new to the world of statistics.
Problem statement:
I have three predicted time series. They represent three independent scores whose sum should be minimized over a timeslot when selecting that timeslot. The length of the timeslot is already given. I have read that there is confidence-based selection of prediction intervals for such problems, but I used an LSTM to predict the time series, which may keep me from using that approach; as far as I understand, calculating a prediction interval relates to a single time series.
E.g., consider the arrays below as the three predicted time series:
arr1 = [23, 34, 16, 5, 45, 10, 2, 34, 56, 11]
arr2 = [123, 100, 124, 245, 125, 120, 298, 124, 175, 200]
arr3 = [1, 3, 10, 7, 2, 2, 10, 7, 8, 12]
time slot length = 3
As you can see, the optimal timeslot for arr1 is [5, 7], for arr2 it is [0, 2], and for arr3 it is [3, 5], but I need only one timeslot for all three time series.
Questions:
Which error paradigm should I employ to select the optimal timeslot?
I am also given weights (positive real numbers in [0, 1]) that represent the importance of each time series in deciding the timeslot. How do I employ them in the error paradigm?
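Not an authoritative answer, but one simple way to frame the selection (a sketch assuming a weighted sum of the three scores is an acceptable objective; the weights w1, w2, w3 below are made up) is a sliding-window minimization:
import numpy as np

arr1 = np.array([23, 34, 16, 5, 45, 10, 2, 34, 56, 11])
arr2 = np.array([123, 100, 124, 245, 125, 120, 298, 124, 175, 200])
arr3 = np.array([1, 3, 10, 7, 2, 2, 10, 7, 8, 12])
w1, w2, w3 = 0.5, 0.3, 0.2   # hypothetical importance weights
slot_len = 3                 # given timeslot length

combined = w1 * arr1 + w2 * arr2 + w3 * arr3
# total combined score over every window of length slot_len
window_sums = np.convolve(combined, np.ones(slot_len), mode="valid")
start = int(np.argmin(window_sums))
print(f"optimal timeslot: [{start}, {start + slot_len - 1}]")
Because the three series live on very different scales, some normalization before weighting would probably be needed; which error measure is appropriate remains the open question.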

Is there a standard way to choose tick locations for an axis in a plot?

When marking tick locations on a plot, are there any standard solutions to how to place the tick markers? I looked at Matplotlib's MaxNLocator (https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/ticker.py#L1212) but it's not immediately clear what all the different options do, or which of them are necessary for basic tick placement.
Can someone provide pseudocode for a simple tick location function?
I think the rule of thumb for placing ticks on a plot is to use multiples of 1, 2, 5, and 10. In my experience, matplotlib seems to abide by this. If you have reason to deviate from the default ticks, I think the easiest way to set them is to use the set_ticks() method for a particular axis. The relevant documentation is here: http://matplotlib.org/api/axis_api.html.
Example
import numpy as np
import matplotlib.pyplot as plt
ax = plt.subplot() # create axes to plot into
foo = np.array([0, 4, 12, 13, 18, 22]) # awkwardly spaced data
bar = np.random.rand(6) # random bar heights
plt.bar(foo, bar) # bar chart
ax.xaxis.get_ticklocs() # check tick locations -- currently array([ 0., 5., 10., 15., 20., 25.])
ax.xaxis.set_ticks(foo) # set the ticks to be right at each bar
ax.xaxis.get_ticklocs() # array([ 0, 4, 12, 13, 18, 22])
plt.draw()
ax.xaxis.set_ticks([0, 10, 20]) # minimal set of ticks
ax.xaxis.get_ticklocs() # array([ 0, 10, 20])
plt.draw()
Of the three options in my example, I would keep the default behaviour in this case; but there are definitely times when I would override the defaults. For example, another rule of thumb is that we should minimize the amount of ink in our plots that is not data (i.e. markers and lines). So if the default tick set was [0, 1, 2, 3, 4, 5, 6], I might change that to [0, 2, 4, 6], since that's less ink for the plot ticks without losing clarity.
Edit: The ticks at [0, 10, 20] can also be accomplished with locators, as suggested in the comment. Examples:
ax.xaxis.set_major_locator(plt.FixedLocator([0,10,20]))
ax.xaxis.set_major_locator(plt.MultipleLocator(base=10))
ax.xaxis.set_major_locator(plt.MaxNLocator(nbins=3))
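To address the pseudocode request directly: here is a rough sketch of a 1-2-5 style tick chooser. It is not matplotlib's actual MaxNLocator algorithm, just an illustration of the rule of thumb mentioned above:
import math

def nice_ticks(vmin, vmax, target_n=5):
    # assumes vmax > vmin; picks a step of 1, 2, 5 or 10 times a power of ten
    raw_step = (vmax - vmin) / max(target_n, 1)
    magnitude = 10 ** math.floor(math.log10(raw_step))
    for factor in (1, 2, 5, 10):
        step = factor * magnitude
        if step >= raw_step:
            break
    start = math.floor(vmin / step) * step
    end = math.ceil(vmax / step) * step
    n_ticks = int(round((end - start) / step)) + 1
    return [start + i * step for i in range(n_ticks)]

print(nice_ticks(0, 22))  # [0, 5, 10, 15, 20, 25]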

Order of Elements in comprehension

I have a Python expression that returns the squares of the values in a given set.
print(str({x*x for x in {1,2,3,4,5}}))
The values are correct but not in the same order as submitted. Why is this?
{16, 1, 4, 25, 9}
You've created a set, rather than a list, and the order in which a set's elements are presented is arbitrary.
Compare:
>>> print([x*x for x in [1,2,3,4,5]])
[1, 4, 9, 16, 25]
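If a predictable order is needed from the set version, one option (a small sketch) is to sort the result explicitly:
>>> print(sorted({x*x for x in {1, 2, 3, 4, 5}}))
[1, 4, 9, 16, 25]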

Creating auto-correlated random values

We are trying to create auto-correlated random values which will be used as a time series.
We have no existing data to refer to and just want to create the vector from scratch.
On the one hand, we of course need a random process with a distribution and its standard deviation (SD).
On the other hand, the autocorrelation influencing the random process has to be described: the values of the vector are autocorrelated with decreasing strength over several time lags,
e.g. lag 1 has 0.5, lag 2 has 0.3, lag 3 has 0.1, etc.
So in the end the vector should look something like this:
2, 4, 7, 11, 10 , 8 , 5, 4, 2, -1, 2, 5, 9, 12, 13, 10, 8, 4, 3, 1, -2, -5
and so on.
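As an illustration of the kind of process being described (a sketch, not a full answer): an AR(1) model x_t = phi * x_(t-1) + eps_t has autocorrelation phi**k at lag k, so phi of about 0.5 gives roughly 0.5, 0.25, 0.125 at lags 1, 2, 3 (close to, though not exactly, the 0.5/0.3/0.1 pattern asked for). A minimal numpy sketch:
import numpy as np

rng = np.random.default_rng(0)
phi = 0.5      # lag-1 autocorrelation of the AR(1) process
sigma = 1.0    # standard deviation of the innovations
n = 500        # length of the generated series

x = np.empty(n)
x[0] = rng.normal(0, sigma)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(0, sigma)

# empirical autocorrelation at lags 1..3 (roughly 0.5, 0.25, 0.125)
for k in (1, 2, 3):
    print(k, round(float(np.corrcoef(x[:-k], x[k:])[0, 1]), 3))
Matching an arbitrary autocorrelation sequence exactly would require a higher-order AR model (or a tool such as statsmodels' ArmaProcess) fitted to the target lags.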

Conditional Data Manipulation in Mathematica

I am trying to prepare the best tools for efficient Data Analysis in Mathematica.
I have approximately 300 columns and 100,000 rows.
What would be the best tricks to "remove", "extract" or simply "consider" parts of the data structure, e.g. for plotting?
One of the trickiest examples I could think of is:
Given a data structure, extract columns 1 to 3 and 6 to 9, as well as the last one, for every row where the value in column 2 is equal to x and the value in column 8 is different from y.
I also welcome any general advice on data manipulation.
For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for Select manually. However, with many columns and many different queries, the chances of messing up indexes are high. Here is the modified solution from the mentioned post, which provides a more friendly syntax:
Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];
ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
  With[{colRules = Dispatch[Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
        indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
    With[{selF = Apply[Function, Hold[condition] /. colRules]},
      Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];
What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):
rows = Array[#1 #2 &, {5, 15}];
We need to define the column names (must be strings or symbols without values):
In[425]:= colnames = "c" <> ToString[#] & /@ Range[15]

Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12",
           "c13", "c14", "c15"}
(in practice, usually names are more descriptive, of course). Here is the table then:
table = Prepend[rows, colnames];
Here is the select statement you need (I picked x = 4 and y=2):
select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
where["c2" == 4 && "c8" != 2]]
{{2, 4, 6, 12, 14, 16, 18, 30}}
Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as
In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]
Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}
and similar.
Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).
EDIT
On a philosophical note, I am sure that many Mathematica users (myself included) have found themselves writing similar code again and again. The fact that Mathematica has a concise syntax makes it often very easy to write for any particular case. However, as long as one works in some specific domain (like, for example, data manipulation in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is one possible way out: create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.
data = RandomInteger[{1, 20}, {40, 20}]
x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}
Some useful commands to get pieces of matrices and lists are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices.
Part ([[...]]) in combination with ;; can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).
Select can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. Its twin brother is Cases, which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument on which this function is applied (the elements of the list, in this case). Since the elements are lists themselves (the rows of the matrix), I can refer to the columns by using the Part ([[..]]) function.
To pull out columns (or rows) you can do it by part indexing
data = Array[#1 #2 &, {5, 15}];
data[[All, Flatten@{Range@3, Range @@ {6, 9}, -1}]]
MatrixForm@%
The last line is just to display it nicely.
As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span (;;) command. If you are joining multiple disjoint ranges, using Flatten to combine the separate ranges created with Range is easier than entering them by hand.
I read:
Extract columns 1 to 3 and 6 to 9, as well as the last one, for every row where the value in column 2 is equal to x and the value in column 8 is different from y
as meaning that we want:
elements 1-3 and 6-9 from each row
AND
the last element from rows wherein [[2]] == x && [[8]] != y.
This is what I hacked together:
a = RandomInteger[5, {20, 10}];  (*define the array*)
x = 4; y = 0;  (*define the test values*)
Join @@ Range @@@ {1 ;; 3, 6 ;; 9};  (*define the column ranges*)
#2 == x && #8 != y & @@@ a;  (*test the rows*)
Append[%%, #] & /@ % /. {True -> -1, False :> Sequence[]};  (*complete the ranges according to the test*)
MapThread[Part, {a, %}] // TableForm  (*extract and display*)
