cross entropy loss not equivalent to binary log loss in lgbm - lightgbm

problem trying to solve:
I am compressing training instances by grouping rows that share the same features: the label becomes the weighted mean of the group's labels and the weight becomes the sum of the group's weights. The goal is to keep the binary log loss on the original data equal to the cross-entropy loss on the compressed data. The example and the log-loss test cases below show that the two losses agree.
original data:                           compressed data:
feature, label, weight, prediction       feature, label, weight, prediction
x1,      1,     1,      0.8              x1,      1/3,   3,      0.8
x1,      0,     2,      0.8        -->
x2,      1,     2,      0.1              x2,      2/3,   3,      0.1
x2,      0,     1,      0.1
x3,      1,     1,      0.9              x3,      1,     1,      0.9
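The compression step in the table above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the experiments: the helper name `compress` and the tuple layout `(feature, label, weight)` are assumptions made for the example.

```python
from collections import defaultdict

def compress(rows):
    """Group (feature, label, weight) rows by feature.

    Returns {feature: (weighted mean label, total weight)}.
    `compress` is a hypothetical helper written for this illustration.
    """
    weight_sum = defaultdict(float)
    weighted_label_sum = defaultdict(float)
    for feature, label, weight in rows:
        weight_sum[feature] += weight
        weighted_label_sum[feature] += weight * label
    return {
        feature: (weighted_label_sum[feature] / total, total)
        for feature, total in weight_sum.items()
    }

# The original data from the table above (predictions are not needed here).
original = [
    ("x1", 1, 1), ("x1", 0, 2),
    ("x2", 1, 2), ("x2", 0, 1),
    ("x3", 1, 1),
]
compressed = compress(original)
for feature, (label, weight) in compressed.items():
    print(feature, round(label, 4), weight)
```

Each group keeps one row whose label is the weighted mean and whose weight is the group total, which matches the compressed table.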
issue: binary log loss is not always equivalent to cross-entropy loss in LightGBM. The change in model metrics (log loss, average precision, ROC AUC) is mild, but the individual predictions and the prediction distribution differ significantly. Experiment 1 shows that the two objectives are equivalent when the labels are binary, while Experiment 2 shows cases where binary log loss does not align with cross entropy (see the examples for details).
first, verify binary log loss is same as cross entropy loss by numpy
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from lightgbm.sklearn import LGBMRegressor, LGBMClassifier
import lightgbm
# use X of cancer data as training feature for both experiment 1 and 2
X, _ = load_breast_cancer(return_X_y=True)
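def logloss(y_true, y_pred, weight):
    # weighted binary log loss, averaged over the rows
    l = np.mean((-(y_true * np.log(y_pred)) - ((1 - y_true) * np.log(1 - y_pred))) * weight)
    # normalize loss by the total weight instead of the row count
    l = l * y_true.shape[0] / weight.sum()
    return l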
"""
feature, label, weight, prediction       feature, label, weight, prediction
x1,      1,     1/3,    0.7
x1,      1,     1/3,    0.7        -->   x1,      2/3,   1,      0.7
x1,      0,     1/3,    0.7
"""
l1 = logloss(np.array([1,1,0]), np.array([0.7,0.7,.7]), np.array([1/3,1/3,1/3]))
l2 = logloss(np.array([2/3]), np.array([0.7]), np.array([1]))
"""
feature, label, weight, prediction       feature, label, weight, prediction
x1,      1,     1,      0.8              x1,      1/3,   3,      0.8
x1,      0,     2,      0.8        -->
x2,      1,     2,      0.1              x2,      2/3,   3,      0.1
x2,      0,     1,      0.1
x3,      1,     1,      0.9              x3,      1,     1,      0.9
"""
l3 = logloss(np.array([1,0,1,0,1]),
np.array([0.8,0.8,0.1,0.1,0.9]),
np.array([1,2,2,1,1]))
l4 = logloss(np.array([1/3,2/3,1]), np.array([0.8,0.1,0.9]), np.array([3,3,1]))
np.testing.assert_almost_equal(l1, l2, decimal=4)
np.testing.assert_almost_equal(l3, l4, decimal=4)
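The reason both assertions pass can be written out as a short derivation. It assumes, as in the tables above, that all rows in a group g share the same features and therefore the same prediction p; the symbols W_g and \bar{y}_g are introduced here only for the derivation.

```latex
\begin{aligned}
\sum_{i \in g} w_i \bigl[-y_i \log p - (1 - y_i)\log(1 - p)\bigr]
  &= -\Bigl(\sum_{i \in g} w_i y_i\Bigr)\log p
     - \Bigl(\sum_{i \in g} w_i - \sum_{i \in g} w_i y_i\Bigr)\log(1 - p) \\
  &= W_g \bigl[-\bar{y}_g \log p - (1 - \bar{y}_g)\log(1 - p)\bigr],
\qquad
W_g = \sum_{i \in g} w_i,\quad
\bar{y}_g = \frac{1}{W_g}\sum_{i \in g} w_i y_i .
\end{aligned}
```

Summing over groups and dividing by the total weight, which compression leaves unchanged, gives the same normalized loss for the original and the compressed data. This only shows that the loss values agree for fixed predictions; it says nothing about how a particular library trains on the two datasets.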
experiment 1 (binary log loss is equivalent to cross entropy loss in binary label case):
######## data for experiment 1
np.random.seed(42)
n = X.shape[0]
y_binary = np.random.randint(0,2,size=(n))
eps = 1e-2
y_float = np.random.uniform(eps,1-eps,size=(n))
lgbm_params = {
    'boosting_type': 'gbdt',
    'class_weight': None,
    'colsample_bytree': 1,
    'importance_type': 'split',
    'learning_rate': 0.06472914709339864,
    'max_depth': 46,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_estimators': 20,
    'n_jobs': 1,
    'num_leaves': 178,
    'random_state': 1574094090,
    'reg_alpha': 0.4894283599023894,
    'reg_lambda': 0.09743058458885945,
    'silent': True,
    'subsample': 1,
    # 'subsample_for_bin': 200000, # try larger values (10M+)
    # 'subsample_freq': 252,
    'min_data_in_bin': 1,
    'min_child_samples': 1,
}
X_train_array, X_test_array, y_train_binary, y_test_binary, y_train_float, y_test_float = \
train_test_split(X, y_binary, y_float, test_size=0.3, random_state=1)
##### binary label case in sklearn API that binary objective is equivalent to cross_entropy objective
binary_model1 = LGBMClassifier(objective='binary')
binary_model1.set_params(**lgbm_params)
binary_model1.fit(
X_train_array,
y_train_binary,
sample_weight=np.ones(X_train_array.shape[0])
)
binary_model2 = LGBMRegressor(objective='cross_entropy')
binary_model2.set_params(**lgbm_params)
binary_model2.fit(
X_train_array,
y_train_binary,
sample_weight=np.ones(X_train_array.shape[0])
)
binary_pred_1 = binary_model1.predict_proba(X_test_array)[:,1]
binary_pred_2 = binary_model2.predict(X_test_array)
binary_y_pred_diff = binary_pred_1-binary_pred_2
# binary log loss and cross_entropy loss are same given binary labels
np.testing.assert_almost_equal(binary_pred_1, binary_pred_2, decimal=4)
experiment 2: cross entropy loss can be different from log loss (not sure why)
######## data for experiment 2
def make_compressed_df(X, fixed_ratio=None):
    """
    Simulate compressed data: instances with the same features are deduplicated,
    the label becomes the mean of the instance labels and the weight becomes
    the sum of the instance weights.

    args:
        fixed_ratio: int or None. If int, neg_count = fixed_ratio * pos_count for
            every row, so the pos_count/neg_count ratio is constant (key of the experiment!)

    ex.
    original_data:               compressed_data:
    feature, label, weight       feature, label, pos_count, neg_count, weight
    x1,      1,     1
    x1,      1,     1      -->   x1,      2/3,   2,         1,         3
    x1,      0,     1
    -------------------------------------------------
    x2,      0,     1
    x2,      1,     1      -->   x2,      1/2,   1,         1,         2
    -------------------------------------------------
    x3,      1,     1
    x3,      1,     1      -->   x3,      2/2,   2,         0,         2
    """
    compressed_df = pd.DataFrame(X)
    pos_count = np.random.randint(1, 3, size=(X.shape[0]))
    compressed_df['pos_count'] = pos_count
    if fixed_ratio:
        compressed_df['neg_count'] = int(fixed_ratio) * compressed_df['pos_count']
    else:
        neg_count = np.random.randint(1, 3, size=(X.shape[0]))
        compressed_df['neg_count'] = neg_count
    compressed_df['total_count'] = compressed_df['pos_count'] + compressed_df['neg_count']
    compressed_df['weight'] = compressed_df['pos_count'] + compressed_df['neg_count']
    compressed_df['label'] = compressed_df['pos_count'] / compressed_df['total_count']
    return compressed_df
def restore_data(df):
    """
    Restore the original features, labels and weights based on pos_count and neg_count.
    Each compressed row is repeated (pos_count + neg_count) times, the labels become
    [1]*pos_count + [0]*neg_count, and the weight becomes weight/(pos_count + neg_count).

    ex.
    compressed_data:                                original_data:
    feature, label, pos_count, neg_count, weight    feature, label, weight
                                                    x1,      1,     1
    x1,      2/3,   2,         1,         3   -->   x1,      1,     1
                                                    x1,      0,     1
    -------------------------------------------------
                                                    x2,      0,     1
    x2,      1/2,   1,         1,         2   -->   x2,      1,     1
    -------------------------------------------------
                                                    x3,      1,     1
    x3,      2/2,   2,         0,         2   -->   x3,      1,     1
    """
    # .copy() so that assigning the label does not write into a view of df
    pos_df = df.loc[df.index.repeat(df['pos_count'])].copy()
    pos_df['label'] = 1
    neg_df = df.loc[df.index.repeat(df['neg_count'])].copy()
    neg_df['label'] = 0
    df = pd.concat([pos_df, neg_df], axis=0)
    del pos_df, neg_df
    df['weight'] = df['weight'] / df['total_count']
    df = df.drop(['pos_count', 'neg_count', 'total_count'], axis=1)
    return df
def make_compressed_and_restored_data(X, fixed_ratio):
    np.random.seed(42)
    compressed_df = make_compressed_df(X, fixed_ratio)
    compressed_train_df, compressed_test_df = train_test_split(
        compressed_df, test_size=0.3, random_state=1)
    restored_train_df = restore_data(compressed_train_df)
    restored_test_df = restore_data(compressed_test_df)
    return (compressed_train_df, compressed_test_df), (restored_train_df, restored_test_df)
# when ratio of pos_count/neg_count is not fixed, objectives are different
(compressed_train_random_ratio_df, compressed_test_df), \
(restored_train_random_ratio_df, restored_test_random_ratio_df) = \
make_compressed_and_restored_data(X, fixed_ratio=None)
model1 = LGBMClassifier(objective='binary')
model1.set_params(**lgbm_params)
model1.fit(
restored_train_random_ratio_df.iloc[:,:30],
restored_train_random_ratio_df['label'],
sample_weight=restored_train_random_ratio_df['weight']
)
model2 = LGBMRegressor(objective='cross_entropy')
model2.set_params(**lgbm_params)
model2.fit(
compressed_train_random_ratio_df.iloc[:,:30],
compressed_train_random_ratio_df['label'],
sample_weight=compressed_train_random_ratio_df['weight']
)
y1 = model1.predict_proba(compressed_test_df.iloc[:,:30])[:,1]
y2 = model2.predict(compressed_test_df.iloc[:,:30])
# this assertion fails
np.testing.assert_almost_equal(y1, y2, decimal=4)
# when ratio of pos_count/neg_count is fixed, objectives are same
(compressed_train_fixed_ratio_df, compressed_test_fixed_ratio_df), \
(restored_train_fixed_ratio_df, restored_test_fixed_ratio_df) = \
make_compressed_and_restored_data(X, fixed_ratio=2)
model3 = LGBMClassifier(objective='binary')
model3.set_params(**lgbm_params)
model3.fit(
restored_train_fixed_ratio_df.iloc[:,:30],
restored_train_fixed_ratio_df['label'],
sample_weight=restored_train_fixed_ratio_df['weight']
)
model4 = LGBMRegressor(objective='cross_entropy')
model4.set_params(**lgbm_params)
model4.fit(
compressed_train_fixed_ratio_df.iloc[:,:30],
compressed_train_fixed_ratio_df['label'],
sample_weight=compressed_train_fixed_ratio_df['weight']
)
y3 = model3.predict_proba(compressed_test_fixed_ratio_df.iloc[:,:30])[:,1]
y4 = model4.predict(compressed_test_fixed_ratio_df.iloc[:,:30])
# this assertion passes
np.testing.assert_almost_equal(y3, y4, decimal=4)

It looks like this question was cross-posted here and in the official LightGBM repo.
LightGBM maintainers have provided an answer there: https://github.com/microsoft/LightGBM/issues/3576.

Related

SymPy: Extract the lower triangular part of a matrix

I am trying to extract the lower triangular part of a SymPy matrix. Since I could not find a tril method in SymPy, I defined:
def tril(M):
    m = M.copy()
    for row_index in range(m.rows):
        for col_index in range(row_index + 1, m.cols):
            m[row_index, col_index] = 0
    return m
It seems to work:
Is there a more elegant way to extract the lower triangular part of a SymPy matrix?
Is .copy() the recommended way to ensure the integrity of the original matrix?
In SymPy, M.lower_triangular(k) will give the lower triangular elements below the kth diagonal. The default is k=0.
In [99]: M
Out[99]:
⎡a b c⎤
⎢ ⎥
⎢d e f⎥
⎢ ⎥
⎣g h i⎦
The other answer suggests using the np.tril function:
In [100]: np.tril(M)
Out[100]:
array([[a, 0, 0],
[d, e, 0],
[g, h, i]], dtype=object)
That converts M into a numpy array - object dtype because of the symbols. And the result is also a numpy array.
Your function returns a sympy.Matrix.
In [101]: def tril (M):
     ...:     m = M.copy()
     ...:     for row_index in range (m.rows):
     ...:         for col_index in range (row_index + 1, m.cols):
     ...:             m[row_index, col_index] = 0
     ...:     return (m)
     ...:
In [102]: tril(M)
Out[102]:
⎡a 0 0⎤
⎢ ⎥
⎢d e 0⎥
⎢ ⎥
⎣g h i⎦
As a general rule mixing sympy and numpy leads to confusion, if not errors. numpy is best for numeric work. It can handle non-numeric objects like symbols, but the math is hit-or-miss.
The np.tri... functions are built on the np.tri function:
In [114]: np.tri(3).astype(int)
Out[114]:
array([[1, 0, 0],
[1, 1, 0],
[1, 1, 1]])
We can make a symbolic Matrix from this:
In [115]: m1 = Matrix(np.tri(3).astype(int))
In [116]: m1
Out[116]:
⎡1 0 0⎤
⎢ ⎥
⎢1 1 0⎥
⎢ ⎥
⎣1 1 1⎦
and do element-wise multiplication:
In [117]: M.multiply_elementwise(m1)
Out[117]:
⎡a 0 0⎤
⎢ ⎥
⎢d e 0⎥
⎢ ⎥
⎣g h i⎦
np.tri works by comparing a column array with a row:
In [123]: np.arange(3)[:,None]>=np.arange(3)
Out[123]:
array([[ True, False, False],
[ True, True, False],
[ True, True, True]])
In [124]: _.astype(int)
Out[124]:
array([[1, 0, 0],
[1, 1, 0],
[1, 1, 1]])
Another answer suggests lower_triangular. It's interesting to look at its code:
def entry(i, j):
    return self[i, j] if i + k >= j else self.zero
return self._new(self.rows, self.cols, entry)
It applies an i>=j test to each element. _new must be iterating on the rows and columns.
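The same element-wise test can be sketched without touching SymPy internals by using the public `Matrix(rows, cols, function)` constructor. This is only an illustration of the idea; the function below reuses the question's name `tril` and adds a diagonal offset `k` as an assumption, mirroring the `i + k >= j` test in the excerpt.

```python
from sympy import Matrix, symbols

def tril(M, k=0):
    """Return a new matrix keeping entries on or below the k-th diagonal.

    The input matrix is not modified, so no explicit copy is needed.
    """
    return Matrix(M.rows, M.cols,
                  lambda row, col: M[row, col] if row + k >= col else 0)

a, b, c, d, e, f, g, h, i = symbols("a b c d e f g h i")
M = Matrix([[a, b, c], [d, e, f], [g, h, i]])
L = tril(M)        # lower triangle including the main diagonal
L_strict = tril(M, -1)  # strictly below the main diagonal
print(L)
print(L_strict)
```

Because the constructor builds a fresh matrix, the result stays a sympy Matrix and the original M is left intact.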
You can simply use the numpy function:
import numpy as np
np.tril(M)
*of course, as noted below, you should convert back to sympy.Matrix(np.tril(M)). But it depends on what you're going to do next.

Is it possible to get principal point from a projection matrix?

Is it possible to get principal point (cx, cy) from a 4x4 projection matrix? This is the same matrix asked in this question: Getting focal length and focal point from a projection matrix
(SCNMatrix4)
s = (m11 = 1.83226573,
m12 = 0,
m13 = 0,
m14 = 0,
m21 = 0,
m22 = 2.44078445,
m23 = 0,
m24 = 0,
m31 = -0.00576340035,
m32 = -0.0016724075,
m33 = -1.00019991,
m34 = -1,
m41 = 0,
m42 = 0,
m43 = -0.20002,
m44 = 0)
The values I'm trying to calculate in this 3x3 camera matrix is x0 and y0.
I recently ran into this problem and was quite astonished that I couldn't find a relevant solution on the Internet, because it seems to be a simple mathematics problem.
After a few days of struggling with matrices, I found a solution.
Let's define two Cartesian coordinate systems: the camera coordinate system with x', y', z' axes, and the world coordinate system with x, y, z axes. The camera (or the eye) is positioned at the origin of the camera coordinate system, and the image plane (the plane containing the screen) is z' = -n, where n is the focal length and the focal point is the position of the camera. I am using the OpenGL convention, where n is the nearVal argument of glFrustum().
You can define a 4x4 transformation matrix M in a homogeneous coordinate system to deal with a projection. The M transforms a coordinate (x, y, z) in the world coordinate system into a coordinate (x', y', z') in the camera coordinate system like the following, where # means a matrix multiplication.
[
[x_prime_h],
[y_prime_h],
[z_prime_h],
[w_prime_h],
] = M # [
[x_h],
[y_h],
[z_h],
[w_h],
]
[x, y, z] = [x_h, y_h, z_h] / w_h
[x_prime, y_prime, z_prime] = [x_prime_h, y_prime_h, z_prime_h] / w_prime_h
Now assume you are given M = P V, where P is a perspective projection matrix and V is a view transformation matrix. The theoretical projection matrix is like the following.
P_theoretical = [
[n, 0, 0, 0],
[0, n, 0, 0],
[0, 0, n, 0],
[0, 0, -1, 0],
]
In OpenGL, an augmented matrix like the following is used to cover the normalization and the nonlinear scaling of z coordinates, where l, r, b, t, n, f are the left, right, bottom, top, nearVal, farVal arguments of glFrustum(). (The resulting z' coordinate is not actually the coordinate of a projected point, but a value used for Z-buffering.)
P = [
[2*n/(r-l), 0, (r+l)/(r-l), 0],
[0, 2*n/(t-b), (t+b)/(t-b), 0],
[0, 0, -(f+n)/(f-n), -2*n*f/(f-n)],
[0, 0, -1, 0],
]
The transformation V is like the following, where r_ij is the element at i-th row and j-th column of the 3x3 rotational matrix R and (c_0, c_1, c_2) is the coordinate of the camera.
V = [
[r_00, r_01, r_02, -(r_00*c_0 + r_01*c_1 + r_02*c_2)],
[r_10, r_11, r_12, -(r_10*c_0 + r_11*c_1 + r_12*c_2)],
[r_20, r_21, r_22, -(r_20*c_0 + r_21*c_1 + r_22*c_2)],
[0, 0, 0, 1],
]
The P and V can be represented with block matrices like the following.
C = [
[c_0],
[c_1],
[c_2],
]
A = [
[2*n/(r-l), 0, (r+l)/(r-l)],
[0, 2*n/(t-b), (t+b)/(t-b)],
[0, 0, -(f+n)/(f-n)],
]
B = [
[0],
[0],
[-2*n*f/(f-n)],
]
P = [
[A,B],
[[0, 0, -1], [0]],
]
V = [
[R, -R # C],
[[0, 0, 0], [1]],
]
M = P # V = [
[A # R, -A # R # C + B],
[[0, 0, -1] # R, [0, 0, 1] # R # C],
]
Let m_ij be the element of M at the i-th row and j-th column. Taking the first block of the second row of the above block notation of M, you can solve for the unit (elementary) z' vector of the camera coordinate system, which points opposite to the direction from the camera to the principal point. (The principal point is the intersection of the image plane with its normal line through the focal point.)
e_z_prime = [0, 0, 1] # R = -[m_30, m_31, m_32]
Taking the second column of the above block notation of M, you can solve for C like the following, where inv(X) is an inverse of a matrix X.
C = - inv([
[m_00, m_01, m_02],
[m_10, m_11, m_12],
[m_30, m_31, m_32],
]) # [
[m_03],
[m_13],
[m_33],
]
Let p_ij be the element of P at i-th row and j-th column.
Now you can solve for p_23 = -2nf/(f-n) like the following.
B = [
[m_03],
[m_13],
[m_23],
] + [
[m_00, m_01, m_02],
[m_10, m_11, m_12],
[m_20, m_21, m_22],
] # C
p_23 = B[2] = m_23 + (m_20*c_0 + m_21*c_1 + m_22*c_2)
Now using the fact p_20 = p_21 = 0, you can get p_22 = -(f+n)/(f-n) like the following.
p_22 * e_z_prime = [m_20, m_21, m_22]
p_22 = -(m_20*m_30 + m_21*m_31 + m_22*m_32)
Now you can get n and f from p_22 and p_23 like the following.
n = p_23/(p_22-1)
= -(m_23 + m_20*c_0+m_21*c_1+m_22*c_2) / (m_20*m_30+m_21*m_31+m_22*m_32 + 1)
f = p_23/(p_22+1)
= -(m_23 + m_20*c_0+m_21*c_1+m_22*c_2) / (m_20*m_30+m_21*m_31+m_22*m_32 - 1)
From the camera position C, the focal length n and the elementary z' vector e_z_prime, you can get the principal point, C - n * e_z_prime.
As a side note, you can prove that the input matrix of inv() in the formula for C is nonsingular. You can also find the elementary x' and y' vectors of the camera coordinate system, and find l, r, b, t using these vectors. (There will be two valid solutions for the (e_x_prime, e_y_prime, l, r, b, t) tuple, due to the symmetry.) Finally, this solution can be extended to the case where the transformation matrix also contains a world transformation that does an anisotropic scaling, that is, when M = P V W and W can have unequal eigenvalues.
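The recovery of C, n, f and the principal point can be checked numerically with a small NumPy sketch. It assumes the column-vector convention used in this answer (a matrix stored in the transposed, row-vector layout would have to be transposed first), and the helper names `frustum`, `view` and `decompose`, as well as the test camera values, are made up for this illustration.

```python
import numpy as np

def frustum(l, r, b, t, n, f):
    """glFrustum-style projection matrix P (column-vector convention)."""
    return np.array([
        [2 * n / (r - l), 0.0, (r + l) / (r - l), 0.0],
        [0.0, 2 * n / (t - b), (t + b) / (t - b), 0.0],
        [0.0, 0.0, -(f + n) / (f - n), -2 * n * f / (f - n)],
        [0.0, 0.0, -1.0, 0.0],
    ])

def view(R, C):
    """View matrix V = [[R, -R C], [0, 1]] for rotation R and camera position C."""
    V = np.eye(4)
    V[:3, :3] = R
    V[:3, 3] = -R @ C
    return V

def decompose(M):
    """Recover C, e_z_prime, n, f and the principal point from M = P V."""
    e_z = -M[3, :3]                         # e_z_prime = -[m_30, m_31, m_32]
    rows = [0, 1, 3]                        # rows used in the formula for C
    C = -np.linalg.solve(M[rows, :3], M[rows, 3])
    p23 = M[2, 3] + M[2, :3] @ C            # p_23 = -2nf/(f-n)
    p22 = M[2, :3] @ e_z                    # p_22 = -(f+n)/(f-n)
    n = p23 / (p22 - 1.0)
    f = p23 / (p22 + 1.0)
    return C, e_z, n, f, C - n * e_z

# Hypothetical test camera: rotated 30 degrees about the y axis.
angle = np.radians(30.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
C_true = np.array([1.0, -2.0, 3.0])
M = frustum(-0.05, 0.07, -0.04, 0.04, 0.1, 100.0) @ view(R, C_true)

C, e_z, n, f, principal = decompose(M)
print("camera position:", C)
print("near, far:", n, f)
print("principal point (world coordinates):", principal)
```

Solving the 3x3 system with `np.linalg.solve` replaces the explicit inv() in the formula for C; the result is the same, and it avoids forming the inverse.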

pure ruby: calculate sparse matrix rank fast(er)

How do I speed up the rank calculation of a sparse matrix in pure ruby?
I'm currently calculating the rank of a matrix (std lib) to determine the rigidity of a graph.
That means I have a sparse matrix of about 2 rows * 9 columns to about 300 rows * 300 columns.
That translates to times of several seconds to determine the rank of the matrix, which is very slow for a GUI application.
Because I use Sketchup I am bound to Ruby 2.0.0.
I'd like to avoid the hassle of setting up gcc on windows, so nmatrix is (I think) not a good option.
Edit:
Example matrix:
[[12, -21, 0, -12, 21, 0, 0, 0, 0],
[12, -7, -20, 0, 0, 0, -12, 7, 20],
[0, 0, 0, 0, 14, -20, 0, -14, 20]]
Edit2:
I am using integers instead of floats to speed it up considerably.
I have also added a fail fast mechanism earlier in the code in order to not call the slow rank function at all.
Edit3:
Part of the code
def rigid?(proto_matrix, nodes)
  matrix_base = Array.new(proto_matrix.size) { |index|
    # initialize the row with 0
    arr = Array.new(nodes.size * 3, 0.to_int)
    proto_row = proto_matrix[index]
    # ids of the nodes in the graph
    node_ids = proto_row.map { |hash| hash[:id] }
    # set the values of both of the nodes' positions
    [0, 1].each { |i|
      vertex_index = vertices.find_index(node_ids[i])
      # predetermined vector associated to the node
      vec = proto_row[i][:vec]
      arr[vertex_index * 3] = vec.x.to_int
      arr[vertex_index * 3 + 1] = vec.y.to_int
      arr[vertex_index * 3 + 2] = vec.z.to_int
    }
    arr
  }
  matrix = Matrix::rows(matrix_base, false)
  rank = matrix.rank
  # graph is rigid if the rank of the matrix is bigger or equal
  # to the amount of node coordinates minus the degrees of freedom
  # of the whole graph
  rank >= nodes.size * 3 - 6
end

MATLAB - Sort 2D points ensuring adjacent points differ by one coordinate? [duplicate]

I have 2 vectors that are x and y coordinates of the 8 vertexes of a polygon
x=[5 5 7 7 9 9 5 7]
y=[8 6 6 8 6 8 10 10]
I want to sort them (clockwise) to obtain the correct vectors (so that the polygon is drawn correctly)
x=[5 7 9 9 7 7 5 5]
y=[6 6 6 8 8 10 10 8]
Step 1: Find the unweighted mean of the vertices:
cx = mean(x);
cy = mean(y);
Step 2: Find the angles:
a = atan2(y - cy, x - cx);
Step 3: Find the correct sorted order:
[~, order] = sort(a);
Step 4: Reorder the coordinates:
x = x(order);
y = y(order);
Python version (numpy) for Ben Voigt's algorithm:
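import numpy as np

def clockwise(points):
    # points is a 2xN array: row 0 holds x, row 1 holds y
    x = points[0,:]
    y = points[1,:]
    cx = np.mean(x)
    cy = np.mean(y)
    # angle of each vertex around the centroid
    a = np.arctan2(y - cy, x - cx)
    order = a.ravel().argsort()
    x = x[order]
    y = y[order]
    return np.vstack([x,y])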
Example:
In [281]: pts
Out[281]:
array([[7, 2, 2, 7],
[5, 1, 5, 1]])
In [282]: clockwise(pts)
Out[282]:
array([[2, 7, 7, 2],
[1, 1, 5, 5]])
I tried the solutions by @ben-voight and @mclafee, but I think they sort in the wrong direction.
When using atan2 the angles are stated in the following way:
Matlab Atan2
The angle is positive for counter-clockwise angles (upper half-plane,
y > 0), and negative for clockwise angles (lower half-plane, y < 0).
Wikipedia Atan2
This means that using ascending sort() of Numpy or Matlab will progress counterclockwise.
This can be verified using the shoelace formula
Wikipedia Shoelace
Python Shoelace
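As a rough check of the orientation claim, the shoelace formula can be evaluated on the sorted vertices: a positive signed area means counterclockwise order, a negative one means clockwise. The sketch below uses only the standard library; the function names `signed_area` and `sort_by_angle` are made up for this illustration, and like the angle sort itself it assumes a convex polygon.

```python
import math

def signed_area(xs, ys):
    """Shoelace formula: positive for counterclockwise order, negative for clockwise."""
    n = len(xs)
    return 0.5 * sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i]
                     for i in range(n))

def sort_by_angle(xs, ys, clockwise=True):
    """Sort vertices by their atan2 angle around the unweighted mean of the vertices."""
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    order = sorted(range(len(xs)),
                   key=lambda i: math.atan2(ys[i] - cy, xs[i] - cx),
                   reverse=clockwise)  # descending angle = clockwise
    return [xs[i] for i in order], [ys[i] for i in order]

xs = [7, 2, 2, 7]
ys = [5, 1, 5, 1]
ccw = sort_by_angle(xs, ys, clockwise=False)  # ascending angles
cw = sort_by_angle(xs, ys, clockwise=True)    # descending angles
print(ccw, signed_area(*ccw))
print(cw, signed_area(*cw))
```

For this rectangle the ascending sort yields a positive area and the descending sort a negative one, which agrees with the statement that an ascending sort progresses counterclockwise.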
So, adjusting the answers mentioned above to use descending sorting the correct solution in Matlab is
cx = mean(x);
cy = mean(y);
a = atan2(y - cy, x - cx);
[~, order] = sort(a, 'descend');
x = x(order);
y = y(order);
The solution in numpy is
import numpy as np

def clockwise(points):
    # points is a 2xN array: row 0 holds x, row 1 holds y
    x = points[0,:]
    y = points[1,:]
    cx = np.mean(x)
    cy = np.mean(y)
    a = np.arctan2(y - cy, x - cx)
    # descending angle order is clockwise
    order = a.ravel().argsort()[::-1]
    x = x[order]
    y = y[order]
    return np.vstack([x,y])

pts = np.array([[7, 2, 2, 7],
                [5, 1, 5, 1]])
print(pts)
print(clockwise(pts))
pts = np.array([[1.0, 1.0],
                [-1.0, -1.0],
                [1.0, -1.0],
                [-1.0, 1.0]]).transpose()
print(pts)
print(clockwise(pts))
Output:
[[7 2 2 7]
[5 1 5 1]]
[[2 7 7 2]
[5 5 1 1]]
[[ 1. -1. 1. -1.]
[ 1. -1. -1. 1.]]
[[-1. 1. 1. -1.]
[ 1. 1. -1. -1.]]
Please notice the [::-1] used to invert arrays / lists.
This algorithm does not apply to non-convex polygons.
Instead, consider using MATLAB's poly2cw()

Affine transformation algorithm

Does anyone know of any standard algorithms to determine an affine transformation matrix based upon a set of known points in two co-ordinate systems?
Affine transformations are given by 2x3 matrices. We perform an affine transformation M by taking our 2D input (x y), bumping it up to a 3D vector (x y 1), and then multiplying (on the left) by M.
So if we have three points (x1 y1) (x2 y2) (x3 y3) mapping to (u1 v1) (u2 v2) (u3 v3) then we have
  [x1 x2 x3]   [u1 u2 u3]
M [y1 y2 y3] = [v1 v2 v3].
  [ 1  1  1]
You can get M simply by multiplying on the right by the inverse of
[x1 x2 x3]
[y1 y2 y3]
[ 1  1  1].
A 2x3 matrix multiplied on the right by a 3x3 matrix gives us the 2x3 we want. (You don't actually need the full inverse, but if matrix inverse is available it's easy to use.)
Easily adapted to other dimensions. If you have more than 3 points you may want a least squares best fit. You'll have to ask again for that, but it's a little harder.
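Both cases can be sketched with NumPy. The first function solves the three-point equation above exactly (using a linear solve rather than an explicit inverse); the second fits M to more than three points in the least-squares sense with `np.linalg.lstsq`. The function names and the test matrix are assumptions made for the illustration.

```python
import numpy as np

def affine_from_three_points(src, dst):
    """Exact 2x3 affine matrix M with M @ [x, y, 1] = [u, v] for three point pairs."""
    src = np.asarray(src, dtype=float)      # shape (3, 2)
    dst = np.asarray(dst, dtype=float)      # shape (3, 2)
    X = np.vstack([src.T, np.ones(3)])      # 3x3 matrix with columns (x, y, 1)
    U = dst.T                               # 2x3 matrix with columns (u, v)
    # M X = U  is equivalent to  X^T M^T = U^T
    return np.linalg.solve(X.T, U.T).T

def affine_least_squares(src, dst):
    """Least-squares 2x3 affine matrix for N >= 3 point pairs."""
    src = np.asarray(src, dtype=float)      # shape (N, 2)
    dst = np.asarray(dst, dtype=float)      # shape (N, 2)
    X = np.hstack([src, np.ones((len(src), 1))])    # N x 3
    M_t, *_ = np.linalg.lstsq(X, dst, rcond=None)   # 3 x 2
    return M_t.T

# Hypothetical ground truth used to generate test points.
M_true = np.array([[2.0, 0.5, 1.0],
                   [-1.0, 3.0, -2.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0], [-1.0, 4.0]])
dst = (M_true @ np.vstack([src.T, np.ones(len(src))])).T

M_exact = affine_from_three_points(src[:3], dst[:3])
M_fit = affine_least_squares(src, dst)
print(M_exact)
print(M_fit)
```

The three source points must not be collinear, otherwise the 3x3 matrix is singular and the exact solve fails.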
I'm not sure how standard it is, but there is a nice formula especially for your case presented in "Beginner's guide to mapping simplexes affinely" and "Workbook on mapping simplexes affinely".
Putting it into code should look something like this (sorry for the bad code style; I'm a mathematician, not a programmer)
import numpy as np
# input data
ins = [[1, 1, 2], [2, 3, 0], [3, 2, -2], [-2, 2, 3]] # <- points
out = [[0, 2, 1], [1, 2, 2], [-2, -1, 6], [4, 1, -3]] # <- mapped to
# calculations
l = len(ins)
B = np.vstack([np.transpose(ins), np.ones(l)])
D = 1.0 / np.linalg.det(B)
entry = lambda r,d: np.linalg.det(np.delete(np.vstack([r, B]), (d+1), axis=0))
M = [[(-1)**i * D * entry(R, i) for i in range(l)] for R in np.transpose(out)]
A, t = np.hsplit(np.array(M), [l-1])
t = np.transpose(t)[0]
# output
print("Affine transformation matrix:\n", A)
print("Affine transformation translation vector:\n", t)
# unittests
print("TESTING:")
for p, P in zip(np.array(ins), np.array(out)):
image_p = np.dot(A, p) + t
result = "[OK]" if np.allclose(image_p, P) else "[ERROR]"
print(p, " mapped to: ", image_p, " ; expected: ", P, result)
This code recovers the affine transformation from the given points ("ins" mapped to "out") and tests that it works.

Resources