Graphviz straight lines - graphviz

I am trying to get straight line edges that exit a node on the right and enter on the left.
I have tried to use splines='line' but it does not appear to create straight lines. Code below, as executed in a Jupyter notebook.
from graphviz import Digraph

g = Digraph('G', filename='cluster.gv')
with g.subgraph(name='cluster_0') as c:
    c.attr(style='filled', color='lightgrey')
    c.node_attr.update(style='filled', color='white')
    c.edges([('a0', 'a1'), ('a1', 'a2'), ('a2', 'a3')])
    c.attr(label='process #1')
with g.subgraph(name='cluster_1') as c:
    c.attr(color='blue')
    c.node_attr['style'] = 'filled'
    c.edges([('b0', 'b1'), ('b1', 'b2'), ('b2', 'b3')])
    c.attr(label='process #2')
g.edge('a1', 'b3', splines='line', tailport="e", headport="w", constraint='false')
g.edge('a2', 'b0', splines='line', tailport="e", headport="w", constraint='false')
g.view()
This is the graph that is produced by the code:
Graph

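I have solved this. The placement of splines='line' was incorrect: splines is a graph attribute, so it has to be set on the graph with g.attr() rather than on individual edges. The code below uses splines='false', which also draws edges as straight line segments. Correct code below in case any future users have the same problem.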
from graphviz import Digraph

g = Digraph('G', filename='cluster.gv')
with g.subgraph(name='cluster_0') as c:
    c.attr(style='filled', color='lightgrey')
    c.node_attr.update(style='filled', color='white')
    c.edges([('a0', 'a1'), ('a1', 'a2'), ('a2', 'a3')])
    c.attr(label='process #1')
with g.subgraph(name='cluster_1') as c:
    c.attr(color='blue')
    c.node_attr['style'] = 'filled'
    c.edges([('b0', 'b1'), ('b1', 'b2'), ('b2', 'b3')])
    c.attr(label='process #2')
# splines applies to the whole graph, so set it here instead of on each edge
g.attr(splines='false')
g.edge('a1', 'b3', tailport="e", headport="w", constraint='false')
g.edge('a2', 'b0', tailport="e", headport="w", constraint='false')
g.view()
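To make the attribute placement easier to see, here is a minimal sketch that writes the equivalent DOT text by hand using only the standard library. The helper name dot_with_straight_edges is hypothetical and is not part of the graphviz package; the sketch assumes every edge should leave on the east side and enter on the west side, as in the question.

```python
def dot_with_straight_edges(edges, splines="false"):
    """Build DOT source with `splines` set once at graph level.

    Hypothetical helper for illustration only. Ports are written in
    DOT's node:compass syntax, equivalent to tailport="e", headport="w".
    """
    lines = ["digraph G {", f"  splines={splines};"]
    for tail, head in edges:
        # constraint=false keeps these edges from influencing node ranking
        lines.append(f"  {tail}:e -> {head}:w [constraint=false];")
    lines.append("}")
    return "\n".join(lines)


dot_source = dot_with_straight_edges([("a1", "b3"), ("a2", "b0")])
print(dot_source)
```

The point of the sketch is only that splines appears once, in the graph body, and never inside an edge's attribute list.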

Related

getting 6 equally spaced points from a sequence of linestring in python

I have data whose geometry is a sequence of line strings. From this geometry, we intend to obtain 6 equally spaced points. I have tried using the np.linspace and np.arange functions but I can't get what I am looking for. Kindly help me with the algorithm for working this out. Below is my geometry:
0 0 [(34.220545045730496, 4.442393531636864), (34.2224155889151, 4.441156322612342), (34.223315935853314, 4.440427900224906), (34.224077028342684, 4.4399257458670185), (34.224766879680814, 4.439278243976972), (34.22556405912324, 4.438776319912429), (34.226398022458866, 4.438165968807749), (34.22734024644533, 4.437556318146236), (34.22943852729713, 4.4365111090884435), (34.23209935027103, 4.435703837269645), (34.234239592861904, 4.434838866278886), (34.23690044724009, 4.434031568657816), (34.238982889400376, 4.4332242366641825), (34.24054473957206, 4.4327052342468), (34.244651946024526, 4.431263497605144), (34.246271748538476, 4.430629103905764), (34.24783374809697, 4.429764000807084), (34.24904866823832, 4.429014226798339), (34.25043718779403, 4.428091408213787), (34.2521728964003,
(34.352324364376926, 4.208738505912754), (34.35256435592366, 4.208668509240028)]
Name: geometry, type: object
I tried this code but still can't get the 25th and 75th percentile points right.
first_coord = N_df["geometry"].apply(lambda g: g.coords[0])
Point_25th = N_df['geometry'].agg(lambda g: np.percentile(g, 25))
center_point = N_df['geometry'].centroid
point_75th = N_df['geometry'].agg(lambda g: np.percentile(g, 75))
last_coord = N_df["geometry"].apply(lambda g: g.coords[-1])
N_df["start_coord"] = first_coord
N_df["25th percentile"] = Point_25th
N_df["Midpoint"] = center_point
N_df["75th percentile"] = point_75th
N_df["last_coord"] = last_coord
# N_df ...?
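One possible approach, sketched here under assumptions, is to space the points by distance travelled along the line rather than by percentiles of the coordinate values. The function name equally_spaced_points is hypothetical, and the sketch works on a plain list of (x, y) tuples such as list(g.coords). With shapely geometries, LineString.interpolate(fraction, normalized=True) should give the same kind of result for fractions 0, 0.2, ..., 1.

```python
import math


def equally_spaced_points(coords, n=6):
    """Return n points evenly spaced by distance along a polyline.

    coords is a list of (x, y) vertices. Distances are plain Euclidean
    distances in the units of the coordinates (degrees for lon/lat data).
    """
    if n < 2 or len(coords) < 2:
        raise ValueError("need at least two vertices and n >= 2")
    # Cumulative distance from the first vertex to each vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    points = []
    seg = 0  # index of the segment that contains the current target
    for i in range(n):
        target = total * i / (n - 1)
        # Move forward until the target distance falls inside segment `seg`.
        while seg < len(coords) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else min((target - cum[seg]) / seg_len, 1.0)
        (x0, y0), (x1, y1) = coords[seg], coords[seg + 1]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


sample_line = [(0.0, 0.0), (5.0, 0.0), (5.0, 5.0)]
print(equally_spaced_points(sample_line, n=6))
```

The first and last returned points are the start and end of the line, and the four points in between sit at 20%, 40%, 60% and 80% of its length.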

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I, and where can I, adjust the flow_style based on whether a flow-style map would trigger line wrapping?
I think the most accurate approach is this: when you encounter a flow-style mapping during the dumping process, first emit it to a buffer, then get the length of the buffer, and if that length combined with the column you are in exceeds the maximum width, actually emit block-style instead.
Any attempt to estimate the length of the output without actually writing that part of the tree is going to be hard, if not impossible. Among other things, the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ([1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]). So a simple calculation based on the length of the scalars that are the keys and values, plus the separating commas, colons and spaces, is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
If you run the following, the initial dumped value for quote will be on one line.
The change_to_block method as presented converts every flow-style mapping/sequence that starts on a line that is too long.
import sys
import ruamel.yaml

yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
  I have seen things you people wouldn't believe.,
  Attack ships on fire off the shoulder of Orion.,
  I watched C-beams glitter in the dark near the Tannhäuser Gate.,
]}
"""

class Blockify:
    def __init__(self, width, only_first=False, verbose=0):
        self._width = width
        self._yaml = None
        self._only_first = only_first
        self._verbose = verbose

    @property
    def yaml(self):
        if self._yaml is None:
            self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
            y.preserve_quotes = True
            y.width = 2**16
        return self._yaml

    def __call__(self, d):
        pass_nr = 0
        changed = [True]
        while changed[0]:
            changed[0] = False
            try:
                s = self.yaml.dumps(d)
            except AttributeError:
                print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
                sys.exit(1)
            if self._verbose > 1:
                print(s)
            too_long = set()
            max_ll = -1
            for line_nr, line in enumerate(s.splitlines()):
                if len(line) > self._width:
                    too_long.add(line_nr)
                if len(line) > max_ll:
                    max_ll = len(line)
            if self._verbose > 0:
                print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
                sys.stdout.flush()
            new_d = self.yaml.load(s)
            self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
            d = new_d
            pass_nr += 1
        return d, s

    @staticmethod
    def change_to_block(d, too_long, changed, only_first):
        if isinstance(d, dict):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            # don't change keys if any value is changed
            for v in d.values():
                Blockify.change_to_block(v, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
            if changed[0]:  # don't change keys if value has changed
                return
            for k in d:
                Blockify.change_to_block(k, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
        if isinstance(d, (list, tuple)):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            for elem in d:
                Blockify.change_to_block(elem, too_long, changed, only_first)
                if only_first and changed[0]:
                    return

blockify = Blockify(96, verbose=2)  # set verbose to 0, to suppress progress output
yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output)  # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
  [Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that when using ruamel.yaml<0.18 the sequence [Batty, Roy] will never be in block style,
because the tuple subclass CommentedKeySeq never gets a line number attached.

Line fitting python

I have these two arrays x and y (pink scatter points), and I want to fit a line as shown in the figure (blue line). However, I have not been able to find a proper fit or a suitable equation. I have tried exponential and polynomial fits (polyfit), but nothing works. Maybe someone here has an idea. Original x and y data are here.
x=[50.56457351, -25.94178241, 10.20492002, -4.42553174,
-27.33151148, -29.9721279 , 36.06962759, 7.56321785,
52.94172778, 1.99279668, 70.60954309, -41.28236289,
-28.56116707, -43.98117436, 19.52197933, -13.99493533,
-35.4383463 , -44.3050207 , -3.96679614, 14.88981502]
y=[-3.39506291, 5.07105136, 21.0870528 , 17.67702032, 0.03026157,
6.45900452, -0.19873009, 7.50795551, -1.82804242, -3.65200329,
-2.93492827, -5.86174941, -2.50507783, -3.43619693, -2.66194664,
-3.59860008, -0.19628881, -2.94505151, -2.3179664 , -2.29120004]

How to get the correlation matrix of a pyspark data frame? NEW 2020

I have the same question from this topic:
How to get the correlation matrix of a pyspark data frame?
"I have a big pyspark data frame. I want to get its correlation matrix. I know how to get it with a pandas data frame.But my data is too big to convert to pandas. So I need to get the result with pyspark data frame.I searched other similar questions, the answers don't work for me. Can any body help me? Thanks!"
df4 is my dataset; it has 9 columns and all of them are integers:
reference__YM_unix:integer
tenure_band:integer
cei_global_band:integer
x_band:integer
y_band:integer
limit_band:integer
spend_band:integer
transactions_band:integer
spend_total:integer
I have first done this step:
# convert to vector column first
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df4.columns, outputCol=vector_col)
df_vector = assembler.transform(df4).select(vector_col)
# get correlation matrix
matrix = Correlation.corr(df_vector, vector_col)
And got the following output:
(matrix.collect()[0]["pearson({})".format(vector_col)].values)
Out[33]: array([ 1. , 0.0760092 , 0.09051543, 0.07550633, -0.08058203,
-0.24106848, 0.08229602, -0.02975856, -0.03108094, 0.0760092 ,
1. , 0.14792512, -0.10744735, 0.29481762, -0.04490072,
-0.27454922, 0.23242408, 0.32051685, 0.09051543, 0.14792512,
1. , -0.03708623, 0.13719527, -0.01135489, 0.08706559,
0.24713638, 0.37453265, 0.07550633, -0.10744735, -0.03708623,
1. , -0.49640664, 0.01885793, 0.25877516, -0.05019079,
-0.13878844, -0.08058203, 0.29481762, 0.13719527, -0.49640664,
1. , 0.01080777, -0.42319841, 0.01229877, 0.16440178,
-0.24106848, -0.04490072, -0.01135489, 0.01885793, 0.01080777,
1. , 0.00523737, 0.01244241, 0.01811365, 0.08229602,
-0.27454922, 0.08706559, 0.25877516, -0.42319841, 0.00523737,
1. , 0.32888075, 0.21416322, -0.02975856, 0.23242408,
0.24713638, -0.05019079, 0.01229877, 0.01244241, 0.32888075,
1. , 0.53310864, -0.03108094, 0.32051685, 0.37453265,
-0.13878844, 0.16440178, 0.01811365, 0.21416322, 0.53310864,
1. ])
I've tried to put this result into arrays or an Excel file, but it didn't work.
I did:
matrix2 = (matrix.collect()[0]["pearson({})".format(vector_col)])
Then I got the following error when I tried to display this info:
display(matrix2)
Exception: ML model display does not yet support model type <class 'pyspark.ml.linalg.DenseMatrix'>.
I was expecting to put the column names from df4 back on the matrix, but I didn't succeed. I've read that I need to use df4.columns, but I have no idea how it works.
Finally, I was expecting to produce the following plot, which I saw in a Medium article:
https://medium.com/towards-artificial-intelligence/feature-selection-and-dimensionality-reduction-using-covariance-matrix-plot-b4c7498abd07
But that didn't work either:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_std = stdsc.fit_transform(df4.iloc[:,range(0,7)].values)
cov_mat =np.cov(X_std.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 12},
                 cmap='coolwarm',
                 yticklabels=cols,
                 xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients', size = 18)
plt.tight_layout()
plt.show()
AttributeError: 'DataFrame' object has no attribute 'iloc'
I've tried replacing df4 with matrix2, and that didn't work either.
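You can use the following to get the correlation matrix in a form you can manipulate. Note that toArray() is a method of the DenseMatrix (matrix2 in the question), not of the DataFrame returned by Correlation.corr:
corr_list = matrix2.toArray().tolist()
From there you can convert it to a pandas dataframe with pd.DataFrame(corr_list, columns=df4.columns, index=df4.columns), which puts the column names back and allows you to plot the heatmap, save to Excel, etc.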
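If pandas is not available on the driver, the flat values array shown in the question can also be reshaped and labelled with the standard library alone. The sketch below is only an illustration: the helper names reshape_square and to_csv_text are hypothetical, and the tiny 2x2 input stands in for the 81 values and the names in df4.columns. DenseMatrix.values is stored column by column, but for a symmetric correlation matrix reading it row by row gives the same result.

```python
import csv
import io


def reshape_square(flat_values, names):
    """Cut a flat list of n*n values into n rows of n values each."""
    n = len(names)
    if len(flat_values) != n * n:
        raise ValueError("expected len(names) ** 2 values")
    return [list(flat_values[i * n:(i + 1) * n]) for i in range(n)]


def to_csv_text(rows, names):
    """Return CSV text with the names as header row and as first column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + list(names))
    for name, row in zip(names, rows):
        writer.writerow([name] + row)
    return buf.getvalue()


# Stand-in data; in the question these would be the collected matrix
# values and df4.columns.
names = ["tenure_band", "spend_total"]
flat = [1.0, 0.5, 0.5, 1.0]
rows = reshape_square(flat, names)
csv_text = to_csv_text(rows, names)
print(csv_text)
```

The resulting CSV text can be written to a file and opened in Excel.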

Gnuplot: data normalization of multiple dataset in one file

Imagine one file with 250 datasets of varying length (2000 +-500 lines) and 11 columns. Here is a small, self-contained example:
file.sum:
0.00000e+00 9.51287e-09
1.15418e-04 8.51287e-09
4.16445e-04 7.51287e-09
8.53721e-04 6.51287e-09
1.42697e-03 5.51287e-09
1.70302e-03 4.51287e-09
2.27189e-03 3.51287e-09
2.54732e-03 1.51287e-09
3.11304e-03 0.51287e-09


0.00000e+00 13.28378e-09
1.15418e-04 12.28378e-09
3.19663e-04 11.28378e-09
5.78178e-04 10.28378e-09
8.67479e-04 09.28378e-09
1.20883e-03 08.28378e-09
1.58817e-03 07.28378e-09
1.75840e-03 06.28378e-09
2.21069e-03 05.28378e-09
I want to display every 10th dataset and normalize each one to its first element. The first value to normalize by is 9.51287e-09 and the second would be 13.28378e-09. Of course, with such a massive dataset I cannot do this manually or even split the file.
So far I can plot every tenth dataset, but I am having trouble with the normalization.
#!/usr/bin/gnuplot
reset
set xrange [0:0.1]
plot for [val=1:250:10] 'file.sum' i val u 1:11 w l
A working script for the small example above:
plot.gp:
#!/usr/bin/gnuplot
reset
set xrange [0:0.01]
plot for [val=1:2:1] 'file.sum' i val u 1:2 w l
Some hints I found in:
Gnuplot: data normalization
I guess you could write an awk script to handle this, but there may be a more gnuplot-friendly way. Any suggestions are appreciated.
Assuming you have one file with data sections each separated by two or more empty lines you can use the script below.
In the gnuplot console check help pseudocolumns. column(-2) tells you which block you are in and column(0) tells you which line of this block you are on (counting starts from 0).
Define a function Normalized(n) which does the following: if you are in the first line of a subblock put the value of column(n) into the variable y0. All values of this block will now be divided by y0. Also check help ternary.
In case you want a legend for the blocks you can plot a dummy plot, actually plotting NaN (i.e. nothing) but place an entry for the key.
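As a cross-check of the per-block logic described above, here is a minimal Python sketch of the same normalization done outside gnuplot. The function name normalize_blocks is hypothetical. Unlike gnuplot's index handling, this sketch starts a new block at any run of blank lines, not only at two or more.

```python
def normalize_blocks(text, col=1):
    """Divide column `col` of every row by its value in the block's first row.

    Returns one list per block, each holding (x, normalized_value) pairs.
    Blocks are separated by one or more blank lines.
    """
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append([float(v) for v in line.split()])
        elif current:
            # A blank line closes the block that is currently being read.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    result = []
    for block in blocks:
        y0 = block[0][col]  # first value of the block, like y0 in Normalized(n)
        result.append([(row[0], row[col] / y0) for row in block])
    return result


sample = "0.0 2.0\n1.0 1.0\n\n\n0.0 4.0\n1.0 1.0\n"
normalized = normalize_blocks(sample)
print(normalized)
```

Each block's first normalized value is 1.0 by construction, which is the property the gnuplot function below relies on as well.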
Code:
### normalize each block by its first value
reset session
set colorsequence classic

$Data <<EOD
0.00000e+00 9.51287e-09
1.15418e-04 8.51287e-09
4.16445e-04 7.51287e-09
8.53721e-04 6.51287e-09
1.42697e-03 5.51287e-09
1.70302e-03 4.51287e-09
2.27189e-03 3.51287e-09
2.54732e-03 1.51287e-09
3.11304e-03 0.51287e-09


0.00000e+00 13.28378e-09
1.15418e-04 12.28378e-09
3.19663e-04 11.28378e-09
5.78178e-04 10.28378e-09
8.67479e-04 09.28378e-09
1.20883e-03 08.28378e-09
1.58817e-03 07.28378e-09
1.75840e-03 06.28378e-09
2.21069e-03 05.28378e-09
EOD

Normalized(n) = column(n)/(column(0)==0 ? y0=column(n) : y0)

plot $Data u 1:(Normalized(2)):(myBlocks=column(-2)+1) w lp pt 7 lc var notitle, \
     for [i=0:myBlocks-1] '' u 1:(NaN) w lp pt 7 lc i+1 ti sprintf("Block %d",i)
### end of code
Result:

Resources