inverse_transform MinMaxScaler from scikit_learn does not provide the correct inverted value - sklearn-pandas

I have fitted the scaler like this:
values2 = raw.values
# ensure all data is float
values2 = values2.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values2)
The shape of the values is (10220, 47), i.e. I have 47 columns of data. After fitting, I split my dataset into train, validation, and test sets. I trained an encoder-decoder LSTM model for multivariate multi-step prediction. After training the model, I predict on the test data and try to calculate the error between the predicted and ground-truth values after they have been unscaled by the inverse_transform method. The code is shown below:
y_test_predicted = model_encoder_training.predict([test_X2,test_decoder_input_data])
pred= tf.reshape(y_test_predicted, [1022, 3])
orig=tf.reshape(test_y2, [1022, 3])
test_X3 = test_X2.reshape((test_X2.shape[0], n_hours*n_features))
inv_yhat2 = concatenate((test_X3[:, :44],pred), axis=1)
inv_yhat2 = scaler.inverse_transform(inv_yhat2)
inv_yhat2 = inv_yhat2[:, [44,45,46]]
inv_y2 = concatenate((test_X3[:, :44],orig), axis=1)
inv_y2 = scaler.inverse_transform(inv_y2)
inv_y2 = inv_y2[:, [44,45,46]]
However, surprisingly, the inverted values for both the predicted and the ground-truth data are incorrect except for the last column, as shown below.
array([[3.5567944 , 0.5624023 , 3.922 ],
[3.5567944 , 0.5624023 , 3.7129998 ],
[3.5567944 , 0.5324324 , 3.922 ],
...,
[4.550709 , 0.72759545, 5.074 ],
[4.550709 , 0.72759545, 5.074 ],
[4.550709 , 0.72759545, 5.074 ]], dtype=float32)
What could be the reason for such incorrect inverse-transformed values in columns 0 and 1 in this case?
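For reference, here is a minimal sketch of the arithmetic behind min-max scaling, written in plain Python as a stand-in for MinMaxScaler. The helper names fit_min_max, transform, and inverse_transform below are hypothetical and are not the scikit-learn API, and the small data list is made up. With feature_range=(0, 1), each column is scaled with its own fitted minimum and maximum, and the inverse uses the statistics of whatever column position a value is placed in. A value therefore only inverts correctly if it sits at the same column index it had when the scaler was fitted, which may be worth checking for the concatenated arrays above.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs for a list of equal-length rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def transform(rows, stats):
    """Scale every column to [0, 1] using that column's own min and max."""
    return [[(v - lo) / (hi - lo) for v, (lo, hi) in zip(row, stats)]
            for row in rows]

def inverse_transform(rows, stats):
    """Undo the scaling; the statistics are picked by column position."""
    return [[v * (hi - lo) + lo for v, (lo, hi) in zip(row, stats)]
            for row in rows]

# Made-up data with three columns on very different scales.
data = [[1.0, 100.0, 10.0],
        [2.0, 200.0, 30.0],
        [3.0, 400.0, 50.0]]
stats = fit_min_max(data)
scaled = transform(data, stats)

# Same column order as at fit time: the round trip recovers the data.
restored = inverse_transform(scaled, stats)

# Columns 0 and 1 swapped before inverting: each value is unscaled with
# the wrong column's min and max, so only the last column comes back right.
swapped = [[row[1], row[0], row[2]] for row in scaled]
wrong = inverse_transform(swapped, stats)
```

In this sketch the last column of wrong still matches the original data while the first two do not, which is the same pattern as a result where only one column inverts correctly.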

Related

Power Bi - Add Total Average column in Matrix

Hi, I am trying to add an AVERAGE column to a matrix. When I add my measure, the average is repeated for every column, but I need a single AVERAGE column and a single total column at the end.
What I have:
What I need:
Group   Maria  Pedro  average  total
First   4      6      5        10
Second  5      10     7.5      15
Regards
Following the example detailed in the sample data table, to get the Total you could add the following measure:
Total By Group = CALCULATE( SUM(AverageExample[Maria]) + SUM(AverageExample[Pedro]))
and to average
Average By Group = [Total By Group] / 2
Based on the first three columns, this will provide the average and total columns shown under "What I need".
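The arithmetic behind these two measures can be checked outside Power BI. The following is a small Python sketch (not DAX) over the sample table; the dictionary scores is a hypothetical stand-in for the AverageExample table.

```python
# Hypothetical stand-in for the AverageExample table from the question.
scores = {
    "First": {"Maria": 4, "Pedro": 6},
    "Second": {"Maria": 5, "Pedro": 10},
}

results = {}
for group, row in scores.items():
    total = row["Maria"] + row["Pedro"]   # mirrors [Total By Group]
    average = total / 2                   # mirrors [Average By Group]
    results[group] = (average, total)

for group, (average, total) in results.items():
    print(group, average, total)
```

Dividing by a fixed 2 only works while there are exactly two name columns; with more people the divisor would have to change accordingly.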
You have to build a DAX table (or Power Query) and a designated measure.
Matrix Table =
UNION(
DATATABLE("Detail", STRING, "Detail Order", INTEGER, "Type", STRING, {{"Average", 1000, "Agregate"}, {"Total", 1001, "Agregate"}}),
SUMMARIZE('Your Names Table', 'Your Names Table'[Name], 'Your Names Table'[Name Order], "Type", "Names")
)
This should give you a table with the list of people and two more rows for the aggregations.
After that, you create a measure using variables and a switch function.
Matrix Measure =
var ft = FIRSTNONBLANK('Matrix Table'[Type], 0)
var fd = FIRSTNONBLANK('Matrix Table'[Detail], 0)
return SWITCH(TRUE,
ft = "Names", CALCULATE([Total], KEEPFILTERS('Your Names Table'[Name] = fd)),
fd = "Total", [Your Total Measure],
fd = "Average", [Your Averagex Measure]
)
The rest is up to you: adjust the ordering, add any aggregate measures, and so on.
Note that the Matrix Table should have no relation with any table from your model.
You can also hide it and the Matrix measure.

DAX IF measure - return fixed value

This should be a very simple requirement. But it seems impossible to implement in DAX.
Data model: a User lookup table joined to many "Cards" rows linked to each user.
I have a measure setup to count rows in CardUser. That is working fine.
<measureA> = count rows in CardUser
I want to create a new measure,
<measureB> = IF(User.boolean = 1,<measureA>, 16)
If User.boolean = 1, I want to return a fixed value of 16. Effectively, bypassing measureA.
I can't simply put User.boolean = 1 in the IF condition; it throws an error.
I can modify measureA itself to return 0 if User.boolean = 1:
<measureA> =
CALCULATE (
COUNTROWS(CardUser),
FILTER ( User.boolean != 1 )
)
This works, but I still can't find a way to return 16 ONLY if User.boolean = 1.
That's easy in DAX, you just need to learn "X" functions (aka "Iterators"):
Measure B =
SUMX( VALUES(User.boolean),
IF(User.Boolean, [Measure A], 16))
The VALUES function generates a list of the distinct User.boolean values (1 and 0 in this case). SUMX then iterates over this list and applies the IF logic to each value.
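The iterator logic can be sketched in Python to make the evaluation order visible. This is only an illustration under simplifying assumptions, not DAX: the card_user rows, the measure_a and measure_b functions, and the idea of carrying the user flag on each card row are all hypothetical. It mirrors the argument order of the measure exactly as written above.

```python
# Hypothetical data: each card row carries the flag of the user it belongs to.
card_user = [
    {"user": "ann", "boolean": 1},
    {"user": "ann", "boolean": 1},
    {"user": "bob", "boolean": 0},
    {"user": "bob", "boolean": 0},
    {"user": "bob", "boolean": 0},
]

def measure_a(rows):
    """Stand-in for [Measure A]: count the rows in the current context."""
    return len(rows)

def measure_b(rows):
    """Stand-in for SUMX(VALUES(User.boolean), IF(User.boolean, [Measure A], 16))."""
    total = 0
    # VALUES: the distinct flag values visible in the current context.
    for flag in sorted({row["boolean"] for row in rows}):
        # Each iteration narrows the context to the rows with this flag value
        # before the inner measure is evaluated.
        in_context = [row for row in rows if row["boolean"] == flag]
        total += measure_a(in_context) if flag else 16
    return total

only_bob = [row for row in card_user if row["user"] == "bob"]
only_ann = [row for row in card_user if row["user"] == "ann"]
print(measure_b(only_bob), measure_b(only_ann), measure_b(card_user))
```

Because the IF is evaluated once per distinct flag value, the fixed value is returned once per value rather than once per card row.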

Extracting vectors from Doc2Vec

I am trying to extract the document vectors to feed into a regression model for prediction.
I have fed around 1,400,000 labelled sentences into doc2vec for training; however, I was only able to retrieve 10 vectors using model.docvecs.
This is a snapshot of the labelled sentences I used to train the doc2vec model:
In : documents[0]
Out: TaggedDocument(words=['descript', 'yet'], tags='0')
In : documents[-1]
Out: TaggedDocument(words=['new', 'tag', 'red', 'sparkl', 'firm', 'price', 'free', 'ship'], tags='1482534')
This is the code used to train the doc2vec model:
model = gensim.models.Doc2Vec(min_count=1, window=5, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples =len(documents), epochs=1)
These are the dimensions of the document vectors:
In : model.docvecs.doctag_syn0.shape
Out: (10, 100)
On which part of the code did I mess up?
Update:
Adding on to the comment from sophros, it appears that I made a mistake when creating the TaggedDocument objects prior to training, which resulted in 1.4 million documents appearing as 10 documents.
Courtesy of Irene Li's tutorial on Doc2vec, I made some slight edits to the class she used to generate TaggedDocument objects:
def get_doc(data):
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = stopwords.words('english')
    p_stemmer = PorterStemmer()
    taggeddoc = []
    texts = []
    for index,i in enumerate(data):
        # for tagged doc
        wordslist = []
        tagslist = []
        i = str(i)
        # clean and tokenize document string
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]
        # remove numbers
        number_tokens = [re.sub(r'[\d]', ' ', i) for i in stopped_tokens]
        number_tokens = ' '.join(number_tokens).split()
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in number_tokens]
        # remove empty
        length_tokens = [i for i in stemmed_tokens if len(i) > 1]
        # add tokens to list
        texts.append(length_tokens)
        td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
        taggeddoc.append(td)
    return taggeddoc
The mistake was fixed when I made the change from
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
to this
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),[str(index)])
It appears that the tags of a TaggedDocument must be given as a list for TaggedDocument to work properly. For more details as to why, please refer to this answer by gojomo.
The gist of the error was: the tags for each individual TaggedDocument were being provided as plain strings, like '101' or '456'.
But, tags should be a list-of-separate tags. By providing a simple string, it was treated as a list-of-characters. So '101' would become ['1', '0', '1'], and '456' would become ['4', '5', '6'].
Across any number of TaggedDocument objects, there were thus only 10 unique tags, single digits ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']. Every document just caused some subset of those tags to be trained.
Correcting tags to be a list-of-one tag, eg ['101'], allows '101' to be seen as the actual tag.
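The effect can be reproduced without gensim. The sketch below is plain Python with a hypothetical helper, unique_tags, that simply iterates each tags value the way a list of tags would be consumed; it is an illustration of the string-versus-list behaviour, not gensim's internal code.

```python
def unique_tags(tag_values):
    """Collect the distinct tags seen when each `tags` value is iterated."""
    seen = set()
    for tags in tag_values:
        for tag in tags:  # iterating a str yields its single characters
            seen.add(tag)
    return seen

n_docs = 1000
as_strings = [str(i) for i in range(n_docs)]   # like tags='101'
as_lists = [[str(i)] for i in range(n_docs)]   # like tags=['101']

digits = unique_tags(as_strings)
full = unique_tags(as_lists)

print(list('101'))   # the characters of the string, not one tag
print(len(digits))   # number of distinct tags when tags are plain strings
print(len(full))     # number of distinct tags when tags are one-element lists
```

However many documents are passed in, the plain-string variant can never produce more than the ten digit characters as tags, which matches the (10, 100) shape reported in the question.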

h2o.auc( perf , xval =TRUE) - what does this call return?

My code is as follows
gbm.fit.hex = h2o.gbm(x= xcols , y =1865 , training_frame = tr.hex , distribution = "bernoulli", model_id = "gbm.model" , key = "gbm.model.key" , ntrees = gbm.trees , max_depth = gbm.depth , min_rows = gbm.min.rows , learn_rate = gbm.learn.rate , nbins = 20 , balance_classes = gbm.balance , nfolds = gbm.folds )
perf <- h2o.performance(gbm.fit.hex , tr.hex)
a = h2o.auc(perf , xval = TRUE)
What does the auc call return? Does it return the AUC on the training dataset or on the cross-validation results?
It retrieves the cross-validated AUC.
Since you set the nfolds argument to something non-zero, the h2o.gbm function also performs k-fold cross-validation in addition to training a GBM model on the full training set. In your command, you did not specify a validation set, so the AUC values you can retrieve are training AUC, h2o.auc(perf, train = TRUE), and cross-validated AUC (as above).
If you want to evaluate performance on a separate validation (or test) set, you can pass that frame using the validation_frame argument and retrieve the validation AUC using h2o.auc(perf, valid = TRUE).

Slicing in PyTables

What is the fastest way to slice arrays that are saved in h5 using PyTables?
The scenario is the following:
The data was already saved (no need to optimize here):
filters = tables.Filters(complib='blosc', complevel=5)
h5file = tables.open_file(hd5_filename, mode='w',
                          title='My Data',
                          filters=filters)
group = h5file.create_group(h5file.root, 'Data', 'Data')
X_atom = tables.Float32Atom(shape=[50,50,50])
X = h5file.create_carray(group, 'X', atom=X_atom, title='XData',
                         shape=(1000,), filters=filters)
The data is opened:
h5file = tables.open_file(hd5_filename, mode="r")
node = h5file.get_node('/', data_node)
X = getattr(node, X_str)
This is where I need optimization: I need to perform many array slices of the following kind, which cannot be sorted, for many indexes and different min/max locations:
for index, min_x, min_y, min_z, max_x, max_y, max_z in my_very_long_list:
    current_item = X[index][min_x:max_x,min_y:max_y,min_z:max_z]
    do_something(current_item)
The question is:
Is this the fastest way to do the task?
