Combining one-hot encoded categorical data and continuous data in TensorFlow.js

When using both one-hot encoded categorical data and continuous numerical data, how do you combine the two types to fit a model?
Previously in TensorFlow.js (this is just a side project I am using to learn more about Google's TensorFlow), I used only one-hot encoded data for both the feature and label values. All the features had the same shape, as did the labels. I am simply building a prediction model whose features are one-hot encoded names such as
"name1" = [1,0,0]
"name2" = [0,1,0]
...
with outcomes that are one-hot encoded too such as
"outcome1" = [1,0,0]
"outcome2" = [0,1,0]
...
This worked pretty well, and I was able to build a model that performed well when tested.
However, once I wanted to do some feature engineering and use a continuous numerical piece of data, I could not see how to combine the categorical and continuous data.
What I want to do is use an additional non-categorical piece of variable data, so my features would be
"name1" = [1,0,0]
"name2" = [0,1,0]
variablePieceOfInformation = 10
I do not initially see a way within TensorFlow of using these pieces of data together.
Initially, all my features were identically sized one-hot encoded category names, and my labels were one-hot encoded outcomes that I wanted to predict.
Now I want to use an additional piece of information that is continuously variable, and I do not see examples of how to combine the two types.
What I was expecting was that there would be a way of combining categorical (one-hot encoded) and continuously variable data.
My git repo:
import * as tf from "@tensorflow/tfjs";
import getTrainingData, { TrainingData, getNames } from "./getModelData";
import save from "./saveModel";
import { numAllTimeTeams } from "#gvhinks/epl-constants";

const createModel = async (): Promise<tf.Sequential> => {
  const model: tf.Sequential = tf.sequential({
    name: "predict"
  });
  model.add(tf.layers.dense({ inputShape: [2, numAllTimeTeams], units: 3, useBias: true, name: "teams_layer" }));
  model.add(tf.layers.flatten());
  model.add(tf.layers.dense({ units: 3, useBias: true, name: "results_layer" }));
  model.compile({ optimizer: tf.train.adam(0.001), loss: "meanSquaredError" });
  const { labelValues, featureValues } = await getTrainingData();
  const numTeamsInLeague: number = featureValues[0][0].length;
  const featureTensors = tf.tensor3d(featureValues, [featureValues.length, 2, numTeamsInLeague], "int32");
  const labelTensors = tf.tensor2d(labelValues, [labelValues.length, 3], "int32");
  const fitArgs = {
    batchSize: 500,
    epochs: 200,
    verbose: 0
  };
  await model.fit(featureTensors, labelTensors, fitArgs);
  return model;
};

export { getTrainingData, createModel as default, TrainingData, getNames, save };

Building a tensor that holds both categorical and continuous data is in itself not a complicated task: concatenate (using tf.concat) two tensors, one containing the one-hot encoded categorical data and the other the numerical data.
With such an input, the prediction can only be either categorical or numerical, but not both, because the type of the output data determines the choice of loss function.
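A minimal sketch of the idea in plain JavaScript (the feature values and the scale factor are invented for illustration): each sample's one-hot vectors are flattened and the continuous value is appended, giving one flat input vector per sample.

```javascript
// Combine one-hot encoded vectors with a continuous value into a single
// flat feature vector per sample (an illustrative sketch, not the asker's code).
function combineFeatures(oneHotVectors, continuousValue, scale) {
  // Flatten e.g. [[1,0,0],[0,1,0]] -> [1,0,0,0,1,0]
  const flat = oneHotVectors.flat();
  // Scale the continuous value so it sits on a range comparable to the 0/1 entries
  flat.push(continuousValue / scale);
  return flat;
}

const sample = combineFeatures([[1, 0, 0], [0, 1, 0]], 10, 100);
// sample is [1, 0, 0, 0, 1, 0, 0.1]
```

With vectors like this, `tf.tensor2d(samples)` and a dense layer with a matching `inputShape` would accept the combined data; in TensorFlow.js proper the equivalent is `tf.concat([oneHotTensor, continuousTensor], 1)` along the feature axis.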

Related

How Should Complex ReQL Queries be Composed?

Are there any best practices or ReQL features that help with composing complex ReQL queries?
In order to illustrate this, imagine a fruits table. Each document has the following structure.
{
"id": 123,
"name": "name",
"colour": "colour",
"weight": 5
}
If we wanted to retrieve all green fruits, we might use the following query.
r
.db('db')
.table('fruits')
.filter({colour: 'green'})
However, in more complex cases we might wish to use a variety of command combinations. Bespoke queries could be written for each case, but that would be difficult to maintain and could violate the Don't Repeat Yourself (DRY) principle. Instead, we might wish to write reusable components that can be chained as custom commands, allowing complex queries to be composed in a modular fashion. This might take the following form.
r
.db('db')
.table('fruits')
.custom(component)
The component could be a function which accepts the last entity in the command chain as its argument and returns the amended chain, as follows.
function component(chain)
{
return chain
.filter({colour: 'green'});
};
This is not so much a feature proposal as an illustration of the problem of complex queries, although such a feature does seem intuitively useful.
Personally, my own efforts in resolving this problem have involved the creation of a compose utility function. It takes an array of functions as its main argument. Each function is called, passed a part of the query chain, and is expected to return an amended version of the query chain. Once the iteration is complete, a composition of the query components is returned. This can be viewed below.
function compose(queries, parameters)
{
if (queries.length > 1)
{
let composition = queries[0](parameters);
for (let index = 1; index < queries.length; index++)
{
let query = queries[index];
composition = query(composition, parameters);
};
return composition;
}
else
{
throw new Error('Must be two or more queries.');
};
};
function startQuery()
{
return RethinkDB;
};
function filterQuery1(query)
{
return query.filter({name: 'Grape'});
};
function filterQuery2(query)
{
return query.filter({colour: 'Green'});
};
function filterQuery3(query)
{
return query.orderBy(RethinkDB.desc('created'));
};
let composition = compose([startQuery, filterQuery1, filterQuery2, filterQuery3]);
composition.run(connection);
It would be great to know whether something like this exists, whether there are best practices for handling such cases, or whether this is an area where ReQL could benefit from improvements.
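The compose pattern above can be exercised without a database at all; a self-contained sketch using plain arrays in place of a ReQL chain (the fruit data and filter functions are invented for illustration):

```javascript
// Same shape as the compose utility above, minus the stray semicolons
function compose(queries, parameters) {
  if (queries.length > 1) {
    let composition = queries[0](parameters);
    for (let index = 1; index < queries.length; index++) {
      composition = queries[index](composition, parameters);
    }
    return composition;
  }
  throw new Error('Must be two or more queries.');
}

// Plain-array stand-ins for ReQL commands
const start = () => [
  { name: 'Grape', colour: 'Green' },
  { name: 'Apple', colour: 'Red' },
  { name: 'Grape', colour: 'Red' }
];
const byName = (rows) => rows.filter((r) => r.name === 'Grape');
const byColour = (rows) => rows.filter((r) => r.colour === 'Green');

const result = compose([start, byName, byColour]);
// result is [{ name: 'Grape', colour: 'Green' }]
```

Because ReQL commands are themselves chainable functions of the previous chain, the same composition works unchanged when the stand-ins are replaced with real query components.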
The RethinkDB documentation states it clearly: all ReQL queries are chainable.
Queries are constructed by making function calls in the programming
language you already know. You don’t have to concatenate strings or
construct specialized JSON objects to query the database. All ReQL
queries are chainable. You begin with a table and incrementally chain
transformers to the end of the query using the . operator
You do not have to compose an extra layer that merely obscures your code, making it harder to read, and which may eventually prove unnecessary.
The simple way is to assign the RethinkDB query and filter to variables; any time you need to add more complex logic, add it directly to these variables, then run() the query when it is complete.
Suppose I have to search a list of products with different filter inputs and paginate the results. The following JavaScript is simple code for illustration only.
let sorterDirection = 'asc';
let sorterColumnName = 'created_date';
// If there is no condition to start with, you could use r.expr(true)
let buildFilter = r.row('app_id').eq(appId).and(r.row('status').eq('public'));
// Append every filter onto buildFilter when its input is present
if (escapedKeyword != "") {
  buildFilter = buildFilter.and(r.row('name').default('').downcase().match(escapedKeyword));
}
// You may have other filters to add; append them to buildFilter in the same way.
// Start to build the query
let query = r.table('yourTableName').filter(buildFilter);
query.orderBy(r[sorterDirection](sorterColumnName))
  .slice(pageIndex * pageSize, (pageIndex * pageSize) + pageSize)
  .run();

MLlib RandomForest (Spark 2.0): predict a single vector

After training a RandomForestRegressor in a PipelineModel using ml and DataFrames (Spark 2.0),
I loaded the saved model into my real-time environment in order to predict using the model. Each request
is handled and transformed through the loaded PipelineModel, but in the process I had to convert the
single request vector to a one-row DataFrame using spark.createDataFrame, and all of this takes around 700 ms,
compared to 2.5 ms if I use the mllib RDD RandomForestRegressor.predict(VECTOR).
Is there any way to use the new ml API to predict a single vector without converting to a DataFrame, or do something else to speed things up?
The DataFrame-based org.apache.spark.ml.regression.RandomForestRegressionModel also takes a Vector as input. I don't think you need to convert a vector to a DataFrame for every call.
Here is how I think your code should work.
// Load the trained RF model
val rfModel = RandomForestRegressionModel.load("path")
val predictionData = ??? // a DataFrame containing a column "feature" of type Vector
predictionData.map { row =>
  val feature = row.getAs[Vector]("feature")
  val result = rfModel.predict(feature)
  (feature, result)
}

Ember - How to sort by transform?

I am trying to sort a model in Ember by a field that has a transform. The original field is an IntegerField (served by an API), which is ordered appropriately. When the field is deserialized by the transform into its string representation, Ember sorts by the string, as opposed to the original order.
For instance, if this is the deserialization:
{
10: 'Built',
20: 'Started',
30: 'Finished',
};
I want this to appear as Built, Started, Finished when sorted, according to the original enum. However, it will actually be sorted as Built, Finished, Started, per the alphabetized strings.
Is this possible when using Ember.computed.sort?
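A framework-agnostic way to recover the enum ordering is to sort with a rank map built from the original integer keys; a sketch in plain JavaScript (the status names come from the deserialization mapping above):

```javascript
// Map each display string back to its original integer rank
const rank = { Built: 10, Started: 20, Finished: 30 };

// Sort status strings by their underlying enum value, not alphabetically
function sortByEnumOrder(statuses) {
  return [...statuses].sort((a, b) => rank[a] - rank[b]);
}

const sorted = sortByEnumOrder(['Finished', 'Built', 'Started']);
// sorted is ['Built', 'Started', 'Finished']
```

In Ember, a comparator like this could back a custom computed property, since Ember.computed.sort's simple key-based form only compares the string values themselves.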
I know it is not what you want, but after much searching this is the kind of sort I ended up doing in Ember:
var sortArray = Ember.ArrayProxy.extend(Ember.SortableMixin).create();
sortArray.set("content", []);
sortArray.addObject({ "id": 10, "name": 'Built' });
sortArray.addObject({ "id": 20, "name": 'Started' });
sortArray.addObject({ "id": 30, "name": 'Finished' });
sortArray.set("sortProperties", ["id"]);
sortArray.set("sortAscending", true);

How to export dc.js filtered data

I'm using dc.js library to generate graphs and I would like to be able to retrieve the filtered data when filters are applied.
Create another dimension and then call dimension.top(Infinity) on it.
https://github.com/square/crossfilter/wiki/API-Reference#dimension_top
You will need the extra dimension because dimensions do not observe their own filters, only the filters on other dimensions.
Then you can use e.g. d3.csv.format to produce text, if you need to.
https://github.com/mbostock/d3/wiki/CSV#format
In version 4 of d3.js, d3.csv.format no longer exists; you must use d3.csvFormat instead.
const cf = crossfilter(data);
const csvDimension = cf.dimension(x => x);
const csvContent = d3.csvFormat(csvDimension.top(Infinity), [field, field2, ...]);
As Gordon said, csvDimension must be a new dimension in order for the filters to be applied.
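As an illustration of what the formatting step produces, here is a hand-rolled sketch of turning an array of row objects into CSV text (naive, with no quoting or escaping; in practice d3.csvFormat handles those edge cases):

```javascript
// Naive CSV formatting of row objects for the given columns
// (no quoting/escaping of values; d3.csvFormat does this properly)
function toCsv(rows, columns) {
  const header = columns.join(',');
  const body = rows.map((row) => columns.map((c) => row[c]).join(','));
  return [header, ...body].join('\n');
}

const csv = toCsv(
  [{ a: 1, b: 2 }, { a: 3, b: 4 }],
  ['a', 'b']
);
// csv is "a,b\n1,2\n3,4"
```

The rows passed in would be the result of csvDimension.top(Infinity), i.e. the records that survive every other dimension's filters.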

backbone.js: Retrieve a smaller version of model building a collection

I'm trying to build an API to create a collection in Backbone. My model is called log and has these (shortened) properties (format for getLog/<id>):
{
'id': string,
'duration': float,
'distance': float,
'startDate': string,
'endDate': string
}
I need to create a collection, because I have many logs and I want to display them in a list. The API for creating the collection (getAllLogs) takes 30 s to run, which is too slow. It returns the same format as the getLog/<id> API, but as an array with one element for each log in the database.
To speed things up, I rebuilt the API several times and optimized it to its limits, but it still takes 30 s, which is too slow.
My question is whether it is possible to fill a collection with instances of a model that carry not ALL the information, just the part needed to display the list. This would speed up loading the collection and displaying the list, while in the background I could continue loading the other properties, or load them only for the elements I really need.
In my case, the model would load only with this information:
{
'id': string,
'distance': float
}
and all other properties could be loaded later.
How can I do it? Is it a good idea anyway?
Thanks.
One way to do this is to use map to get the shortened model. Something like this will convert a Backbone.Collection "collection" with all properties to one with only "id" and "distance":
var shortCollection = new Backbone.Collection(collection.toJSON().map(function(x) {
return { id: x.id, distance: x.distance };
}));
Here's a Fiddle illustration.
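Loading the remaining properties later then amounts to merging the full records into the short ones by id; a plain-JavaScript sketch of that merge step (the log data is invented for illustration; on a Backbone model, model.set does the per-model equivalent):

```javascript
// Merge full records into partial ones by id, keeping partial fields
// and adding whatever the full record supplies.
function mergeDetails(partials, fullRecords) {
  const byId = new Map(fullRecords.map((r) => [r.id, r]));
  return partials.map((p) => ({ ...p, ...(byId.get(p.id) || {}) }));
}

const merged = mergeDetails(
  [{ id: '1', distance: 2.5 }],
  [{ id: '1', duration: 30.0, startDate: 's', endDate: 'e' }]
);
// merged[0] now has id, distance, duration, startDate and endDate
```

This keeps the list responsive with the short payload while the detailed fetch fills in the rest.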
