XPath statement for RapidMiner - xpath

In RapidMiner I am trying to extract data from an XML page using XPath. I have tried a number of different expressions without success. Below is the data I am trying to retrieve; I want all the features from the unordered list.
<div id="features">
<h3>Features:</h3>
<ul><li>Front garden</li>
<li>Rear Large Shed</li>
<li>Superb condition and tastefully decorated</li>
<li>Energy Efficent with a B2 Ber rating</li>
<li>Gravel &amp; driveway</li>
</ul></div>

Assuming you want a sequence of strings there is the quick way:
//li/string()
And the specific way:
/div[@id='features']/ul/li/string()
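To check the attribute-predicate form (`@id='features'`) outside RapidMiner, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. That module supports only a subset of XPath, so the sketch reads each element's `.text` instead of calling `string()`. The `wrapper` element and the function name `extract_features` are assumptions made for illustration, and the ampersand is written as `&amp;` so the sample parses as XML.

```python
import xml.etree.ElementTree as ET

# Sample data from the question; "&" is escaped so the document is well-formed XML.
XML = """<div id="features">
<h3>Features:</h3>
<ul><li>Front garden</li>
<li>Rear Large Shed</li>
<li>Superb condition and tastefully decorated</li>
<li>Energy Efficent with a B2 Ber rating</li>
<li>Gravel &amp; driveway</li>
</ul></div>"""


def extract_features(xml_text):
    """Return the text of every <li> under <div id="features">/<ul>."""
    div = ET.fromstring(xml_text)
    # ElementTree paths are relative to a parent element, so the root <div>
    # is placed inside a hypothetical wrapper to let the predicate match it.
    wrapper = ET.Element("wrapper")
    wrapper.append(div)
    items = wrapper.findall("./div[@id='features']/ul/li")
    return [li.text.strip() for li in items]


features = extract_features(XML)
for feature in features:
    print(feature)
```

If the `id` value is changed to something that does not exist in the document, `findall` returns an empty list rather than raising an error.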

You didn't really specify how you need to pull this data or where it is going, but perhaps this will help:
I saved your example xml as follows:
<?xml version="1.0" encoding="utf-8" ?>
<div id="features">
<h3>Features:</h3>
<ul><li>Front garden</li>
<li>Rear Large Shed</li>
<li>Superb condition and tastefully decorated</li>
<li>Energy Efficent with a B2 Ber rating</li>
<li>Gravel &amp; driveway</li>
</ul></div>
I then created the following RapidMiner process which pulls each list element as an individual attribute:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_xml" compatibility="5.3.005" expanded="true" height="60" name="Read XML" width="90" x="112" y="165">
<parameter key="file" value="path/to/Test.xml"/>
<parameter key="xpath_for_examples" value="//h3"/>
<enumeration key="xpaths_for_attributes">
<parameter key="xpath_for_attribute" value="//li[1]/text()"/>
<parameter key="xpath_for_attribute" value="//li[2]/text()"/>
<parameter key="xpath_for_attribute" value="//li[3]/text()"/>
<parameter key="xpath_for_attribute" value="//li[4]/text()"/>
<parameter key="xpath_for_attribute" value="//li[5]/text()"/>
</enumeration>
<list key="namespaces"/>
<parameter key="use_default_namespace" value="false"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<connect from_op="Read XML" from_port="output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
I think the XPath query you are looking for is "//li[n]/text()", where n is the position of the li element you want to pull data from. I hope this helps!
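The positional predicate is 1-based, which is easy to get wrong when coming from a 0-based language. The following Python sketch illustrates this with the standard library's `xml.etree.ElementTree`, which accepts `li[n]` but not the `text()` step, so `.text` is read instead. The helper name `nth_item` and the shortened three-item document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample list (three items only).
DOC = ET.fromstring(
    "<div id='features'><h3>Features:</h3><ul>"
    "<li>Front garden</li>"
    "<li>Rear Large Shed</li>"
    "<li>Superb condition and tastefully decorated</li>"
    "</ul></div>"
)


def nth_item(root, n):
    """Return the text of the n-th <li> (1-based, as in XPath), or None if absent."""
    node = root.find(f".//li[{n}]")
    return None if node is None else node.text


first = nth_item(DOC, 1)    # position 1 is the first list item
third = nth_item(DOC, 3)
missing = nth_item(DOC, 4)  # no fourth item in this shortened document
print(first, third, missing, sep=" | ")
```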

Related

Rapid Miner Row Maximum

Sorry, I'm totally new to RapidMiner and have only completed the basic tutorial.
I have a dataset like
MatchID Value1 Value2 Value3
1 5 1 2
1 4.5 1.5 2
...
and would like to know whether it is possible to get the highest value per column (for example Value1) and use it in further calculations (generate attributes).
Thank you.
There are lots of ways as it happens. Here's one using the Aggregate operator to find the maxima, Join to join this to the original and Generate Attributes to do some calculating.
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.2.003" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="aggregate" compatibility="7.2.003" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
<parameter key="use_default_aggregation" value="true"/>
<parameter key="default_aggregation_function" value="maximum"/>
<list key="aggregation_attributes"/>
</operator>
<operator activated="true" class="join" compatibility="7.2.003" expanded="true" height="82" name="Join" width="90" x="313" y="34">
<parameter key="join_type" value="outer"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="34">
<list key="function_descriptions">
<parameter key="deltaA1" value="[maximum(a1)]-a1"/>
<parameter key="deltaA2" value="[maximum(a2)]-a2"/>
<parameter key="deltaA3" value="[maximum(a3)]-a3"/>
<parameter key="deltaA4" value="[maximum(a4)]-a4"/>
</list>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="left"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
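The logic of this process (compute the column maxima, attach them to every row, then derive new attributes) can be sketched in plain Python as follows. This is only an illustration of the data flow, not RapidMiner code. It uses the two rows shown in the question, and the names `maxima`, `add_deltas` and the `delta…` attribute names are assumptions.

```python
# The two rows shown in the question (the elided rows are not reproduced).
rows = [
    {"MatchID": 1, "Value1": 5.0, "Value2": 1.0, "Value3": 2.0},
    {"MatchID": 1, "Value1": 4.5, "Value2": 1.5, "Value3": 2.0},
]
value_columns = ["Value1", "Value2", "Value3"]

# Step 1 (Aggregate): one maximum per column.
maxima = {col: max(row[col] for row in rows) for col in value_columns}


def add_deltas(rows, maxima):
    """Steps 2 and 3 (Join + Generate Attributes): add max - value per column."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows stay untouched
        for col, col_max in maxima.items():
            new_row[f"delta{col}"] = col_max - row[col]
        result.append(new_row)
    return result


result = add_deltas(rows, maxima)
for row in result:
    print(row)
```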
Another way is to use the Extract Macro operator with the statistics setting max. This stores the maximum for a given attribute as a macro value, which then can be used, e.g. in Generate Attributes.
The advantage is that you don't modify the original dataset and don't have to use a join or multiply operator.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="7.5.000" expanded="true" height="68" name="Extract Macro" width="90" x="179" y="34">
<parameter key="macro" value="maxA1"/>
<parameter key="macro_type" value="statistics"/>
<parameter key="statistics" value="max"/>
<parameter key="attribute_name" value="a1"/>
<list key="additional_macros"/>
<description align="center" color="transparent" colored="false" width="126">extract maximum of attribute a1 and store it in a macro</description>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.5.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
<list key="function_descriptions">
<parameter key="DifferenceA1" value="parse(%{maxA1})-a1"/>
</list>
<description align="center" color="transparent" colored="false" width="126">calculate the difference of a1 from the maximum using the macro value</description>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hint: since macro values are stored as text, you first have to parse them to use their numerical value.
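The text-versus-number point in the hint can be illustrated with a small Python analogy. A dictionary of strings stands in for RapidMiner's macros here; the attribute values and the function name `difference_from_max` are hypothetical.

```python
# Hypothetical values of attribute a1.
a1_values = [5.1, 4.9, 7.9, 6.3]

# Analogy for Extract Macro: the maximum is stored as text, not as a number.
macros = {"maxA1": str(max(a1_values))}


def difference_from_max(value, macros):
    """Analogy for parse(%{maxA1}) - a1: convert the text to a number first."""
    return float(macros["maxA1"]) - value


differences = [difference_from_max(v, macros) for v in a1_values]
print(macros["maxA1"], differences)
```

Subtracting directly from `macros["maxA1"]` without the `float(...)` conversion would raise a `TypeError` in Python, which mirrors why the `parse` call is needed in the expression.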
A third option is to Sort the example set and keep only the example with the maximum value using a Filter Example Range operator. This comes in handy if you are mostly interested in the values of the other attributes when a certain attribute is maximal.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.5.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="sort" compatibility="7.5.000" expanded="true" height="82" name="Sort" width="90" x="179" y="34">
<parameter key="attribute_name" value="a1"/>
<parameter key="sorting_direction" value="decreasing"/>
<description align="center" color="transparent" colored="false" width="126">sorting the example set on a1 decreasing</description>
</operator>
<operator activated="true" class="filter_example_range" compatibility="7.5.000" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="34">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="1"/>
<description align="center" color="transparent" colored="false" width="126">only keeping the first example, which has the maximum for a1</description>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
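As a rough Python equivalent of sorting in decreasing order and keeping only the first example, consider the sketch below. The three rows and the function name `keep_top` are hypothetical, and the 1-based `first`/`last` arguments are modelled on the `first_example` and `last_example` parameters in the process above.

```python
# Hypothetical examples with an id, the attribute a1 and a label.
examples = [
    {"id": "id_1", "a1": 5.1, "label": "Iris-setosa"},
    {"id": "id_2", "a1": 7.9, "label": "Iris-virginica"},
    {"id": "id_3", "a1": 6.4, "label": "Iris-versicolor"},
]


def keep_top(examples, attribute, first=1, last=1):
    """Sort by `attribute` in decreasing order and keep rows first..last (1-based)."""
    ordered = sorted(examples, key=lambda e: e[attribute], reverse=True)
    return ordered[first - 1:last]


top = keep_top(examples, "a1")
print(top)  # the whole row is kept, so the other attributes are available too
```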

If statement based on Rapidminer clustering results

Say a k-means clustering process is run on a set of points and the result is 5 clusters. Is it possible to write to a database based on the majority of points within each separate cluster?
ie. pseudo:
if the majority of points within a cluster have attribute category == 'state'
add a record in the database with attribute description == 'state'
else add a record with attribute description == 'private'
Hope my explanation was clear !
A relatively complex process but here's a worked example you can copy.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.0.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="k_means" compatibility="7.0.000" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
<parameter key="k" value="10"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
<list key="function_descriptions">
<parameter key="category" value="if(rand()>0.5, &quot;state&quot;, &quot;notstate&quot;)"/>
<parameter key="categoryNumeric" value="if(category==&quot;state&quot;, 1, 0)"/>
</list>
</operator>
<operator activated="true" class="aggregate" compatibility="7.0.000" expanded="true" height="82" name="Aggregate" width="90" x="246" y="238">
<list key="aggregation_attributes">
<parameter key="categoryNumeric" value="average"/>
</list>
<parameter key="group_by_attributes" value="cluster"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="380" y="340">
<list key="function_descriptions">
<parameter key="description" value="if ([average(categoryNumeric)]>0.5, &quot;state&quot;,&quot;private&quot;)"/>
</list>
</operator>
<operator activated="true" class="join" compatibility="7.0.000" expanded="true" height="82" name="Join" width="90" x="514" y="238">
<parameter key="join_type" value="left"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="cluster" value="cluster"/>
</list>
</operator>
<operator activated="true" class="jdbc_connectors:write_database" compatibility="7.0.000" expanded="true" height="68" name="Write Database" width="90" x="715" y="238">
<parameter key="connection" value="LocalMYSQL"/>
<parameter key="schema_name" value="ascom"/>
<parameter key="table_name" value="joinresult"/>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
<connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Write Database" to_port="input"/>
<connect from_op="Write Database" from_port="through" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
The main points are:
Create an attribute corresponding to category called categoryNumeric which is set to 1 if category is state and 0 otherwise.
Aggregate by cluster and take the average of categoryNumeric. If any aggregation value is greater than 0.5, it means the majority of the examples for a cluster have category equal to state.
Create a new attribute in the aggregation result called description based on the majority determination.
Each cluster now has additional data and it can be joined to the original data using the cluster identifier as a key.
Write to a database (I used MySQL)
Hope this helps as a start.
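The majority rule described in the main points can be sketched in plain Python as follows. The six examples, the cluster names and the function name `describe_clusters` are hypothetical, and the database write is left out; the sketch only shows the aggregate, decide and join steps.

```python
from collections import defaultdict

# Hypothetical clustered examples; "cluster" plays the role of the cluster id.
examples = [
    {"cluster": "cluster_0", "category": "state"},
    {"cluster": "cluster_0", "category": "state"},
    {"cluster": "cluster_0", "category": "notstate"},
    {"cluster": "cluster_1", "category": "notstate"},
    {"cluster": "cluster_1", "category": "state"},
    {"cluster": "cluster_1", "category": "notstate"},
]


def describe_clusters(examples):
    """Return {cluster: 'state' or 'private'} based on the average of categoryNumeric."""
    totals = defaultdict(lambda: [0, 0])  # cluster -> [sum of categoryNumeric, count]
    for e in examples:
        category_numeric = 1 if e["category"] == "state" else 0
        totals[e["cluster"]][0] += category_numeric
        totals[e["cluster"]][1] += 1
    return {
        cluster: "state" if total / count > 0.5 else "private"
        for cluster, (total, count) in totals.items()
    }


descriptions = describe_clusters(examples)

# Join the per-cluster description back onto every example via the cluster id.
joined = [dict(e, description=descriptions[e["cluster"]]) for e in examples]
print(descriptions)
```

Note that an exact 50/50 split falls into the "private" branch here because the comparison is strictly greater than 0.5, matching the expression used in the process.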

Rapidminer's Multilayer Perceptron strange results

I have a dataset of 1000 examples, 500 positive and 500 negative. I validate them with a 0.7 split ratio and feed them to RapidMiner's multilayer perceptron (MLP) with default parameters, except for two hidden layers of 25 nodes each.
However, when I validate it, all my predictions are negative and I have no idea why. Even with a poorly optimized MLP (as in this example) I should be getting at least a single positive prediction.
This is the first time I am doing this in RapidMiner, so it's probably a very basic mistake, but I can't find it.
XML code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="split_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation (6)" width="90" x="112" y="255">
<process expanded="true">
<operator activated="true" class="neural_net" compatibility="5.3.008" expanded="true" height="76" name="Neural Net" width="90" x="69" y="30">
<list key="hidden_layers">
<parameter key="Layer" value="25"/>
<parameter key="Layer2" value="25"/>
</list>
<parameter key="training_cycles" value="100"/>
<parameter key="shuffle" value="false"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (6)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (6)" width="90" x="147" y="30"/>
<connect from_port="model" to_op="Apply Model (6)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (6)" to_port="unlabelled data"/>
<connect from_op="Apply Model (6)" from_port="labelled data" to_op="Performance (6)" to_port="labelled data"/>
<connect from_op="Performance (6)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
So far your process looks quite good. The interesting question is what happens to your data. To investigate this, you could set some breakpoints and examine your samples. A breakpoint set before the NN learner will show you what the training set looks like; another one set before the model applier lets you inspect the test set.
To ensure a proper class distribution, you may enable stratified sampling for the validation operator. The shuffle option of the NN learner allows the operator to shuffle the training set before training the model (the posted process sets shuffle to false). This is useful in case your data items are already sorted, which can lead to an inappropriate model.
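To see why sorted data and the sampling type matter, here is a small Python sketch comparing a linear split with a stratified split on 500 positive and 500 negative labels. It assumes the data is sorted by label, which the question does not confirm, and the function names `linear_split` and `stratified_split` are made up for this illustration.

```python
import random
from collections import Counter


def linear_split(labels, ratio):
    """Take the first `ratio` share as training data, the rest as test data."""
    cut = round(len(labels) * ratio)
    return labels[:cut], labels[cut:]


def stratified_split(labels, ratio, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [label for label in labels if label == cls]
        rng.shuffle(members)
        cut = round(len(members) * ratio)
        train += members[:cut]
        test += members[cut:]
    return train, test


# Assumption: the examples are sorted by label (all positives first).
labels = ["positive"] * 500 + ["negative"] * 500

lin_train, lin_test = linear_split(labels, 0.7)
str_train, str_test = stratified_split(labels, 0.7)

print("linear     train:", dict(Counter(lin_train)), "test:", dict(Counter(lin_test)))
print("stratified train:", dict(Counter(str_train)), "test:", dict(Counter(str_test)))
```

Under this assumption the linear split leaves only negative examples in the test set, while the stratified split keeps both classes balanced in the training and test parts.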

Guaranteeing the same subset for several techniques in Rapidminer's X-Validation

I am in the feature selection stage of a class data mining project whose main objective is to compare several data mining techniques (Naive Bayes, SVM, etc.). In this stage I am using a wrapper with X-Validation, as in the example below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="optimize_selection" compatibility="5.3.008" expanded="true" height="94" name="Optimize Selection (3)" width="90" x="179" y="120">
<parameter key="generations_without_improval" value="100"/>
<parameter key="limit_number_of_generations" value="true"/>
<parameter key="maximum_number_of_generations" value="-1"/>
<process expanded="true">
<operator activated="true" class="x_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation" width="90" x="179" y="75">
<process expanded="true">
<operator activated="true" class="naive_bayes" compatibility="5.3.008" expanded="true" height="76" name="Naive Bayes (4)" width="90" x="119" y="30"/>
<connect from_port="training" to_op="Naive Bayes (4)" to_port="training set"/>
<connect from_op="Naive Bayes (4)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (8)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (8)" width="90" x="209" y="30"/>
<connect from_port="model" to_op="Apply Model (8)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (8)" to_port="unlabelled data"/>
<connect from_op="Apply Model (8)" from_port="labelled data" to_op="Performance (8)" to_port="labelled data"/>
<connect from_op="Performance (8)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
</process>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
The issue is that, to compare the techniques fairly, I must guarantee that the subsets generated in the cross-validation phase are identical for all of them, so that I know the accuracies were obtained under exactly the same conditions. However, I can't put more than one model-creating operator inside the X-Validation operator, so I don't know how to guarantee that.
The Optimize Selection operator uses the performance of the inner operators to determine which attributes to retain or remove during forward or backward selection. This means the attribute order will be determined by the performance returned by the inner learner. A different inner learner will yield a different ordering in general. If this is what you want to do then it would be possible to take a copy of the example set inside the Optimize Selection operator using the Multiply operator and pass this to another validation block containing the other learner. You could then use the Log operator to record performance values for this learner and the original one that is driving the attribute ordering. The Optimize Selection operator also can have its progress logged and it is possible to record the feature names currently being considered.
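The underlying requirement in the question, evaluating several learners on identical cross-validation partitions, can be sketched outside RapidMiner as follows. The idea is to build the folds once from a fixed seed and reuse them for every learner. The toy data and the two learners (`majority_learner`, `threshold_learner`) are hypothetical stand-ins, not Naive Bayes or SVM.

```python
import random


def make_folds(n_examples, k, seed):
    """Shuffle the indices once with a fixed seed and deal them into k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train_fn, data, folds):
    """Average accuracy of the learner `train_fn` over the given folds."""
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in test_idx]
        model = train_fn(train)
        correct = sum(1 for x, y in test if model(x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def majority_learner(train):
    """Hypothetical baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority


def threshold_learner(train):
    """Hypothetical learner: threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


# Toy data: one numeric attribute, label 1 for values of 10 and above.
data = [(float(i), 1 if i >= 10 else 0) for i in range(20)]

folds = make_folds(len(data), k=5, seed=42)  # built once, shared by all learners
learners = {"majority": majority_learner, "threshold": threshold_learner}
results = {name: cross_validate(fn, data, folds) for name, fn in learners.items()}
print(results)
```

Because both learners see exactly the same train/test partitions, any difference in the averaged accuracy comes from the learners rather than from the sampling.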

How to test on testset using Rapidminer?

I'm using Rapidminer to do an analysis. I used cross-validation on several models to get the best working model. Now I want to use this model to test on a separate testset that I made using Split Data to estimate the performance.
How do I use the test set? As far as I can tell, all the validation modules use the training set that the model was made on. Which performance measure can I use that takes in a model and my test set?
Use the "Apply Model" operator with your model as the first input and your test set as the second input. This operator returns a labelled data set, which is your input data with some additional special attributes, e.g. the prediction and the confidence. The "Performance" operator needs these attributes to measure the performance of the model applied to your test set.
Here is a small example that uses a training set and a test set from the "Samples" repository.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.007" expanded="true" height="60" name="Golf" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
<operator activated="true" class="decision_tree" compatibility="5.3.007" expanded="true" height="76" name="Decision Tree" width="90" x="179" y="30"/>
<operator activated="true" class="retrieve" compatibility="5.3.007" expanded="true" height="60" name="Golf-Testset" width="90" x="179" y="120">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
<operator activated="true" breakpoints="before,after" class="apply_model" compatibility="5.3.007" expanded="true" height="76" name="Apply Model" width="90" x="313" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.3.007" expanded="true" height="76" name="Performance" width="90" x="447" y="30"/>
<connect from_op="Golf" from_port="output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Golf-Testset" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
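The same train, apply and measure flow can be sketched in plain Python. This is an analogy rather than RapidMiner code: the rows below are hypothetical golf-like data (not the contents of the Samples repository), and a simple one-attribute rule learner stands in for the Decision Tree operator.

```python
from collections import Counter, defaultdict


def train_one_rule(examples, attribute, label):
    """Learn a rule that predicts the most common label for each attribute value."""
    by_value = defaultdict(Counter)
    for e in examples:
        by_value[e[attribute]][e[label]] += 1
    default = Counter(e[label] for e in examples).most_common(1)[0][0]
    rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda e: rule.get(e[attribute], default)


def apply_model(model, examples):
    """Like Apply Model: return copies of the examples with a prediction added."""
    return [dict(e, prediction=model(e)) for e in examples]


def accuracy(labelled, label):
    """Like Performance: share of examples whose prediction equals the true label."""
    return sum(1 for e in labelled if e["prediction"] == e[label]) / len(labelled)


# Hypothetical training and test sets.
training = [
    {"Outlook": "sunny", "Play": "no"},
    {"Outlook": "sunny", "Play": "no"},
    {"Outlook": "overcast", "Play": "yes"},
    {"Outlook": "overcast", "Play": "yes"},
    {"Outlook": "rain", "Play": "yes"},
    {"Outlook": "rain", "Play": "yes"},
    {"Outlook": "rain", "Play": "no"},
]
test = [
    {"Outlook": "sunny", "Play": "no"},
    {"Outlook": "overcast", "Play": "yes"},
    {"Outlook": "rain", "Play": "no"},
    {"Outlook": "rain", "Play": "yes"},
]

model = train_one_rule(training, "Outlook", "Play")  # trained on the training set only
labelled = apply_model(model, test)                  # applied to the separate test set
test_accuracy = accuracy(labelled, "Play")
print(test_accuracy)
```

The key point is that the model never sees the test rows during training; they are only used when the model is applied and the performance is measured.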

Resources