If statement based on Rapidminer clustering results - pseudocode

Suppose a clustering process such as k-means is run on a set of points and produces 5 clusters. Is it possible to write to a database based on the majority of points within each cluster?
For example, in pseudocode:
if majority of points within cluster have attribute category == 'state'
add record in database with attribute description == 'state'
else attribute description == 'private'
I hope my explanation is clear!

This is a relatively complex process, but here's a worked example you can copy.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.0.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.0.000" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
<parameter key="k" value="10"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
<list key="function_descriptions">
<parameter key="category" value="if(rand()&gt;0.5, &quot;state&quot;, &quot;notstate&quot;)"/>
<parameter key="categoryNumeric" value="if(category==&quot;state&quot;, 1, 0)"/>
</list>
</operator>
<operator activated="true" class="aggregate" compatibility="7.0.000" expanded="true" height="82" name="Aggregate" width="90" x="246" y="238">
<list key="aggregation_attributes">
<parameter key="categoryNumeric" value="average"/>
</list>
<parameter key="group_by_attributes" value="cluster"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="380" y="340">
<list key="function_descriptions">
<parameter key="description" value="if ([average(categoryNumeric)]&gt;0.5, &quot;state&quot;,&quot;private&quot;)"/>
</list>
</operator>
<operator activated="true" class="join" compatibility="7.0.000" expanded="true" height="82" name="Join" width="90" x="514" y="238">
<parameter key="join_type" value="left"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="cluster" value="cluster"/>
</list>
</operator>
<operator activated="true" class="jdbc_connectors:write_database" compatibility="7.0.000" expanded="true" height="68" name="Write Database" width="90" x="715" y="238">
<parameter key="connection" value="LocalMYSQL"/>
<parameter key="schema_name" value="ascom"/>
<parameter key="table_name" value="joinresult"/>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
<connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Write Database" to_port="input"/>
<connect from_op="Write Database" from_port="through" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
The main points are:
Create an attribute called categoryNumeric, derived from category, which is set to 1 if category is state and 0 otherwise.
Aggregate by cluster and take the average of categoryNumeric. If the average for a cluster is greater than 0.5, the majority of the examples in that cluster have category equal to state.
Create a new attribute in the aggregation result called description based on the majority determination.
Each cluster now has additional data, which can be joined to the original data using the cluster identifier as a key.
Write the result to a database (I used MySQL).
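To make the logic of these steps easier to follow outside RapidMiner, here is a minimal Python sketch of the same idea. It is not part of the process above: the cluster names and rows are made-up sample data, and the database write is left out.

```python
from collections import defaultdict

# Hypothetical clustered examples: (cluster id, category) pairs.
rows = [
    ("cluster_0", "state"),
    ("cluster_0", "state"),
    ("cluster_0", "notstate"),
    ("cluster_1", "notstate"),
    ("cluster_1", "notstate"),
    ("cluster_1", "state"),
]

# Step 1: encode category as 1 for "state" and 0 otherwise (categoryNumeric).
flags = defaultdict(list)
for cluster, category in rows:
    flags[cluster].append(1 if category == "state" else 0)

# Steps 2 and 3: average per cluster, then derive the description.
description = {
    cluster: "state" if sum(values) / len(values) > 0.5 else "private"
    for cluster, values in flags.items()
}

# Step 4: join the per-cluster description back onto every example.
joined = [(cluster, category, description[cluster]) for cluster, category in rows]
```

Note that with the strict `> 0.5` test, a cluster split exactly half and half ends up as "private".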
Hope this helps as a start.

Related

Missing predicted label in performance evaluation of test data

I have trained a model using the Neural Net operator, and now I want to apply that model and evaluate its performance on test data (without a label attribute). For this, I used the Apply Model operator: its first input is the output of my trained model (which contains the predicted and confidence values), and its second input is my unlabelled test data. For reference, see "How to test on testset using Rapidminer?". Below is the screenshot of my original model before execution:
When I execute the process, it throws the error "Input example set must have special attribute label"; see the screenshot below:
When I click the "Help me solve the problem" link, it adds a Set Role operator in which I set my label attribute. After execution, it reports a missing predicted label attribute.
UPDATED:
Please see the XML below:
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve" width="90" x="246" y="34">
<parameter key="repository_entry" value="../data/neural"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role (2)" width="90" x="380" y="34">
<parameter key="attribute_name" value="Elective1"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.2.000" expanded="true" height="103" name="Nominal to Numerical" width="90" x="514" y="34">
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="neural_net" compatibility="8.2.000" expanded="true" height="82" name="Neural Net" width="90" x="648" y="34">
<list key="hidden_layers"/>
</operator>
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve (2)" width="90" x="246" y="136">
<parameter key="repository_entry" value="../data/testing neural"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="187">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="648" y="187">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="916" y="85">
<parameter key="attribute_name" value="prediction(Elective1)"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="performance" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="1184" y="136"/>
<connect from_op="Retrieve" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Neural Net" to_port="training set"/>
<connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Neural Net" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Any suggestions?
As far as I know, you don't need two "Apply Model" operators.
Try it with a single Apply Model: connect the test data to the unl port and the model trained on the training data to the mod port.
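As a rough illustration of why both error messages in the question can appear, here is a small Python sketch of a performance check that needs two special roles at once, a true label and a prediction. The `accuracy` function, the `roles` mapping and the sample values are hypothetical stand-ins, not RapidMiner's actual implementation.

```python
def accuracy(examples, roles):
    """Compare the label column with the prediction column; both roles must be set."""
    missing = [role for role in ("label", "prediction") if role not in roles]
    if missing:
        raise ValueError("missing special attribute: " + ", ".join(missing))
    hits = sum(1 for ex in examples if ex[roles["label"]] == ex[roles["prediction"]])
    return hits / len(examples)

# Hypothetical scored examples; the values are invented for illustration.
scored = [
    {"Elective1": "A", "prediction(Elective1)": "A"},
    {"Elective1": "B", "prediction(Elective1)": "A"},
]

def try_accuracy(roles):
    """Return the accuracy, or the error text if a role is missing."""
    try:
        return accuracy(scored, roles)
    except ValueError as err:
        return str(err)

# Test data without a label role: nothing to compare the prediction against.
no_label = try_accuracy({"prediction": "prediction(Elective1)"})
# Re-assigning the prediction column as the label leaves no prediction role.
prediction_as_label = try_accuracy({"label": "prediction(Elective1)"})
# With a true label and a prediction, the comparison works.
both_roles = try_accuracy({"label": "Elective1", "prediction": "prediction(Elective1)"})
```

Under these assumptions, a test set needs its true labels for any performance figure to be computed.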

Rapid Miner Row Maximum

Sorry, I'm totally new to RapidMiner and have only done the basic tutorial.
I have a dataset like
MatchID Value1 Value2 Value3
1 5 1 2
1 4.5 1.5 2
...
and would like to know if it is possible to get the highest value per column (for example Value1) and make further calculations with it (generate attributes).
Thank you.
There are lots of ways, as it happens. Here's one that uses the Aggregate operator to find the maxima, Join to join them to the original data, and Generate Attributes to do some calculating.
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.2.003" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="aggregate" compatibility="7.2.003" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
<parameter key="use_default_aggregation" value="true"/>
<parameter key="default_aggregation_function" value="maximum"/>
<list key="aggregation_attributes"/>
</operator>
<operator activated="true" class="join" compatibility="7.2.003" expanded="true" height="82" name="Join" width="90" x="313" y="34">
<parameter key="join_type" value="outer"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="34">
<list key="function_descriptions">
<parameter key="deltaA1" value="[maximum(a1)]-a1"/>
<parameter key="deltaA2" value="[maximum(a2)]-a2"/>
<parameter key="deltaA3" value="[maximum(a3)]-a3"/>
<parameter key="deltaA4" value="[maximum(a4)]-a4"/>
</list>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
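For readers who want to see the aggregate, join and calculate idea in plain code, here is a minimal Python sketch. It uses the two sample rows from the question rather than the Iris data, and the new column names are only loosely modelled on the ones in the process above.

```python
# Sample rows taken from the question's MatchID/Value1/Value2/Value3 layout.
rows = [
    {"MatchID": 1, "Value1": 5.0, "Value2": 1.0, "Value3": 2.0},
    {"MatchID": 1, "Value1": 4.5, "Value2": 1.5, "Value3": 2.0},
]
value_columns = ["Value1", "Value2", "Value3"]

# Aggregate: one maximum per column over the whole data set.
maxima = {col: max(row[col] for row in rows) for col in value_columns}

# Join + Generate Attributes: attach each maximum to every row
# and compute the difference from it.
enriched = []
for row in rows:
    new_row = dict(row)
    for col in value_columns:
        new_row[f"maximum({col})"] = maxima[col]
        new_row[f"delta{col}"] = maxima[col] - row[col]
    enriched.append(new_row)
```

The original rows are left untouched; each entry of `enriched` is a copy with the extra columns.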
Another way is to use the Extract Macro operator with the statistics setting max. This stores the maximum for a given attribute as a macro value, which then can be used, e.g. in Generate Attributes.
The advantage is that you don't modify the original dataset and don't have to use a join or multiply operator.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="7.5.000" expanded="true" height="68" name="Extract Macro" width="90" x="179" y="34">
<parameter key="macro" value="maxA1"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="statistics" value="max"/>
<parameter key="attribute_name" value="a1"/>
<list key="additional_macros"/>
<description align="center" color="transparent" colored="false" width="126">extract maximum of attribute a1 and store it in a macro</description>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.5.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
<list key="function_descriptions">
<parameter key="DifferenceA1" value="parse(%{maxA1})-a1"/>
</list>
<description align="center" color="transparent" colored="false" width="126">calculate the difference of a1 from the maximum using the macro value</description>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hint: since macro values are stored as text, you first have to parse them to use their numerical value.
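The hint about parsing can be sketched in a few lines of Python. This is only an analogy: the `macros` dictionary stands in for RapidMiner's macro store, and the a1 values are invented.

```python
# Hypothetical values of attribute a1.
a1 = [5.1, 4.9, 7.9, 6.3]

# Extract Macro stand-in: the maximum is stored as text, as macro values are.
macros = {"maxA1": str(max(a1))}

# Generate Attributes stand-in: parse the text back to a number before subtracting.
max_a1 = float(macros["maxA1"])
difference_a1 = [max_a1 - value for value in a1]
```

Skipping the `float(...)` step would fail here, because a string cannot be subtracted from a number; that mirrors the need for parse() in the expression above.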
A third option is to Sort the example set and keep only the example with the maximum value using a Filter Example Range operator. This comes in handy if you are mostly interested in the values of other attributes when a certain attribute is maximal.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="sort" compatibility="7.5.000" expanded="true" height="82" name="Sort" width="90" x="179" y="34">
<parameter key="attribute_name" value="a1"/>
<parameter key="sorting_direction" value="decreasing"/>
<description align="center" color="transparent" colored="false" width="126">sorting the example set on a1 decreasing</description>
</operator>
<operator activated="true" class="filter_example_range" compatibility="7.5.000" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="1"/>
<description align="center" color="transparent" colored="false" width="126">only keeping the first example, which has the maximum for a1</description>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
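The sort-then-keep-first approach looks like this as a minimal Python sketch. The three rows are invented sample data in the shape of the Iris set, not actual values from it.

```python
# Hypothetical examples with an id, the attribute a1 and a label.
rows = [
    {"id": "id_1", "a1": 5.1, "label": "Iris-setosa"},
    {"id": "id_2", "a1": 7.9, "label": "Iris-virginica"},
    {"id": "id_3", "a1": 6.4, "label": "Iris-versicolor"},
]

# Sort stand-in: order the examples by a1, largest first.
sorted_rows = sorted(rows, key=lambda row: row["a1"], reverse=True)

# Filter Example Range stand-in: keep only the first example.
top = sorted_rows[:1]
```

Because the whole row is kept, the other attributes of the example with the largest a1 (here the id and label) remain available.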

Rapidminer's Multilayer Perceptron strange results

I have a dataset of 1000 examples, 500 positive and 500 negative. I am validating with a 0.7 split ratio and feeding the data to RapidMiner's multilayer perceptron (MLP) with default parameters, except for two hidden layers of 25 nodes each.
However, when I validate it, all my predictions are negative, and I have no idea why. Even with a poorly optimized MLP (as in this very example), I should be getting at least a single positive prediction.
This is the first time I am doing this in RapidMiner, so it's probably a very basic mistake, but I can't find it.
XML code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="split_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation (6)" width="90" x="112" y="255">
<process expanded="true">
<operator activated="true" class="neural_net" compatibility="5.3.008" expanded="true" height="76" name="Neural Net" width="90" x="69" y="30">
<list key="hidden_layers">
<parameter key="Layer" value="25"/>
<parameter key="Layer2" value="25"/>
</list>
<parameter key="training_cycles" value="100"/>
<parameter key="shuffle" value="false"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (6)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (6)" width="90" x="147" y="30"/>
<connect from_port="model" to_op="Apply Model (6)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (6)" to_port="unlabelled data"/>
<connect from_op="Apply Model (6)" from_port="labelled data" to_op="Performance (6)" to_port="labelled data"/>
<connect from_op="Performance (6)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
So far your process looks quite good. The interesting question is: what happens to your data? To investigate this, you could set some breakpoints and examine your samples. A breakpoint set before the NN learner will show you what the training set looks like; another one set before the model applier lets you inspect the test set.
To ensure a proper class distribution, you may enable stratified sampling for the validation operator. The shuffle option of the NN learner allows the operator to shuffle the training set before training the model. This is useful in case your data items are already sorted, which can lead to an inappropriate model.
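To see why the class distribution matters, here is a small Python sketch comparing an order-preserving split with a stratified one. It assumes, purely for illustration, that the 1000 examples are sorted by class; whether that is true of the asker's data is unknown, and the two split functions are hypothetical helpers, not RapidMiner's sampling code.

```python
import random

# Assumed layout: 500 positive examples followed by 500 negative ones.
labels = ["positive"] * 500 + ["negative"] * 500

def linear_split(items, ratio):
    """Take the first `ratio` share for training and the rest for testing, keeping order."""
    cut = round(len(items) * ratio)
    return items[:cut], items[cut:]

def stratified_split(items, ratio, seed=0):
    """Shuffle each class separately and split it with the same ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(items)):
        members = [item for item in items if item == cls]
        rng.shuffle(members)
        cut = round(len(members) * ratio)
        train += members[:cut]
        test += members[cut:]
    return train, test

linear_train, linear_test = linear_split(labels, 0.7)
strat_train, strat_test = stratified_split(labels, 0.7)

# Class counts in each part, for inspection.
linear_counts = (linear_train.count("positive"), linear_test.count("positive"))
strat_counts = (strat_train.count("positive"), strat_test.count("positive"))
```

Under this assumption the order-preserving split leaves no positive examples in the test part, while the stratified split keeps the 50/50 balance in both parts. Inspecting the data at a breakpoint, as suggested above, would show which situation actually applies.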

Guaranteeing the same subset for several techniques in Rapidminer's X-Validation

I am in the feature selection stage of a class data mining project whose main objective is to compare several data mining techniques (Naive Bayes, SVM, etc.). In this stage I am using a wrapper with X-Validation, as in the example below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="optimize_selection" compatibility="5.3.008" expanded="true" height="94" name="Optimize Selection (3)" width="90" x="179" y="120">
<parameter key="generations_without_improval" value="100"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="maximum_number_of_generations" value="-1"/>
<process expanded="true">
<operator activated="true" class="x_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation" width="90" x="179" y="75">
<process expanded="true">
<operator activated="true" class="naive_bayes" compatibility="5.3.008" expanded="true" height="76" name="Naive Bayes (4)" width="90" x="119" y="30"/>
<connect from_port="training" to_op="Naive Bayes (4)" to_port="training set"/>
<connect from_op="Naive Bayes (4)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (8)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (8)" width="90" x="209" y="30"/>
<connect from_port="model" to_op="Apply Model (8)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (8)" to_port="unlabelled data"/>
<connect from_op="Apply Model (8)" from_port="labelled data" to_op="Performance (8)" to_port="labelled data"/>
<connect from_op="Performance (8)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
The issue is that, to compare the techniques, I must guarantee that the sets generated in the cross-validation phase are identical for all of them, so that I know the accuracy results were produced under exactly the same conditions. However, I can't put more than one model-creating operator inside the X-Validation operator, so I don't know how to guarantee that.
The Optimize Selection operator uses the performance of the inner operators to determine which attributes to retain or remove during forward or backward selection. This means the attribute order will be determined by the performance returned by the inner learner. A different inner learner will yield a different ordering in general. If this is what you want to do then it would be possible to take a copy of the example set inside the Optimize Selection operator using the Multiply operator and pass this to another validation block containing the other learner. You could then use the Log operator to record performance values for this learner and the original one that is driving the attribute ordering. The Optimize Selection operator also can have its progress logged and it is possible to record the feature names currently being considered.
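The underlying requirement, identical folds for every learner, can be sketched in Python as follows. This is a generic illustration rather than a RapidMiner recipe: the two "learners" are trivial hypothetical stand-ins, and the fold assignment is fixed by a seed chosen arbitrarily.

```python
import random

def make_folds(n_examples, k, seed):
    """Return k lists of example indices; the same seed always gives the same folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

# Hypothetical stand-ins for two learners: each takes the training labels
# and returns a function that predicts a label.
def majority_learner(train_labels):
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return lambda: majority

def always_yes_learner(train_labels):
    return lambda: "yes"

def cross_validate(learner, labels, folds):
    """Accuracy per fold, training on all examples outside the fold."""
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        predict = learner(train_labels)
        correct = sum(1 for i in test_idx if predict() == labels[i])
        accuracies.append(correct / len(test_idx))
    return accuracies

labels = ["yes"] * 6 + ["no"] * 14
folds = make_folds(len(labels), k=5, seed=1992)  # built once, reused for every learner

results = {
    "majority": cross_validate(majority_learner, labels, folds),
    "always_yes": cross_validate(always_yes_learner, labels, folds),
}
```

Because `folds` is built once and passed to both calls, the per-fold accuracies in `results` are directly comparable between the learners.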

How to test on testset using Rapidminer?

I'm using Rapidminer to do an analysis. I used cross-validation on several models to get the best working model. Now I want to use this model to test on a separate testset that I made using Split Data to estimate the performance.
How do I use the test set? As far as I can tell, all the validation modules use the training set that the model was made on. Which performance measure can I use that takes in a model and my test set?
Use the "Apply Model" operator with your model as the first input and your test set as the second input. This operator will return a labelled data set, which is your input data with some additional special attributes, e.g. the prediction and the confidence. The "Performance" operator needs these attributes to measure the performance of the model applied to your test set.
Here is a small example which uses a training and a test set from the "Samples" repository.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.007" expanded="true" height="60" name="Golf" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="decision_tree" compatibility="5.3.007" expanded="true" height="76" name="Decision Tree" width="90" x="179" y="30"/>
<operator activated="true" class="retrieve" compatibility="5.3.007" expanded="true" height="60" name="Golf-Testset" width="90" x="179" y="120">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
<operator activated="true" breakpoints="before,after" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model" width="90" x="313" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.007" expanded="true" height="76" name="Performance" width="90" x="447" y="30"/>
<connect from_op="Golf" from_port="output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Golf-Testset" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
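Conceptually, the apply-then-measure flow amounts to the following minimal Python sketch. The four test examples and the one-rule `model` function are hypothetical; they are not the Golf sample data or a trained decision tree.

```python
# Hypothetical labelled test set: each example has an attribute plus the true label "Play".
test_set = [
    {"Outlook": "sunny", "Play": "no"},
    {"Outlook": "overcast", "Play": "yes"},
    {"Outlook": "rain", "Play": "yes"},
    {"Outlook": "sunny", "Play": "yes"},
]

def model(example):
    """Hypothetical stand-in for a trained model: a single hand-written rule."""
    return "no" if example["Outlook"] == "sunny" else "yes"

# Apply Model stand-in: keep the original columns and add a prediction column.
labelled = [dict(example, prediction=model(example)) for example in test_set]

# Performance stand-in: compare the true label with the prediction.
correct = sum(1 for example in labelled if example["Play"] == example["prediction"])
accuracy = correct / len(labelled)
```

The measurement only works because each test example carries both its true label and the added prediction.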
