I want to use Random Forest with Mahout for my work. I have started from this tutorial, but I don't know how to apply it to my own data.
Could anyone help me or suggest how to do that?
See this tutorial; it was very helpful in my case.
You just need to change the data description and the input and output paths to match your own data set.
Here is an example of a data description.
Given the data below:
ID name age weight target
1 john 18 70 Y
....
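the description will be
-d I C 2 N L
where I ignores the ID column, C marks name as categorical, 2 N marks age and weight as numerical, and L marks target as the label (a descriptor should contain only one label).
You can also modify the source code of BuildForest.java or TestForest.java in the Mahout examples.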
Hi, I am a newbie in Python. We are trying to find country and city names with Python's geotext library, but it is not picking up every name correctly. Could anyone suggest what might be wrong?
While reading data from an email, it picks up "Mobile" as a city, although the word only appears in the email signature.
from geotext import GeoText
# A multi-line string needs triple quotes in Python
places = GeoText("""Hi, we need to book a flight from Mumbai to London on 13 Aug through shivaji terminal.
Regards,
xyz
Mobile : 5368536
""")
print(places.cities)
Output: ['Mumbai', 'Mobile']
Please help.
There are three cities named 'Mobile' in various states of the US. You cannot avoid picking it up, unless you decide to block that specific word from being treated as a city, but there could easily be other cities whose names match common words.
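The blocklist idea mentioned above could be sketched as follows. This is only a minimal sketch and not part of geotext: the `filter_cities` helper and the `BLOCKED` set are hypothetical names, and the candidate list is written out by hand so the example runs without the library. In practice it would come from `GeoText(...).cities`.

```python
# Hypothetical blocklist: words that are real city names but are far more
# likely to be ordinary words in an email signature.
BLOCKED = {"mobile"}


def filter_cities(candidates, blocked=BLOCKED):
    """Drop candidate city names found in the blocklist (case-insensitive)."""
    return [name for name in candidates if name.lower() not in blocked]


# Candidate list as reported in the question.
candidates = ["Mumbai", "Mobile"]
print(filter_cities(candidates))
```

The trade-off is the one the answer points out: a real trip to Mobile, Alabama would be filtered out as well, so the blocklist has to be tuned to the kind of text being processed.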
I need a numerical example that demonstrates how clustering with the CURE algorithm works.
https://www.cs.ucsb.edu/~veronika/MAE/summary_CURE_01guha.pdf
The pyclustering library has a number of clustering algorithms with examples, and example code on its GitHub. Here is a link to the CURE example.
Googling "CURE algorithm example" also turned up a fair bit.
Hopefully that helps!
Using the pyclustering library, you can extract information about the representative points and the means with the corresponding methods (link to the generated pyclustering documentation for CURE):
from pyclustering.cluster.cure import cure

# create an instance of the algorithm
cure_instance = cure(<algorithm parameters>)
# start processing
cure_instance.process()
# get the allocated clusters
clusters = cure_instance.get_clusters()
# get the representative points
representative = cure_instance.get_representors()
# get the cluster means
means = cure_instance.get_means()
You can also modify the source code of the CURE algorithm to display the changes after each step, for example by printing them to the console or even visualizing them. Here is an example of how to modify the code to display the changes at each clustering step (after line 219), where a star marks a representative point, small points are the data points themselves, and big points are the means:
# New cluster and updated clusters should be relocated in the queue
self.__insert_cluster(merged_cluster);
for item in cluster_relocation_requests:
    self.__relocate_cluster(item);

#
# ADD THE FOLLOWING PIECE OF CODE TO DISPLAY CHANGES ON EACH STEP
# (needs: from pyclustering.cluster import cluster_visualizer)
#
temp_clusters = [ cure_cluster_unit.indexes for cure_cluster_unit in self.__queue ];
temp_representors = [ cure_cluster_unit.rep for cure_cluster_unit in self.__queue ];
temp_means = [ cure_cluster_unit.mean for cure_cluster_unit in self.__queue ];

visualizer = cluster_visualizer();
visualizer.append_clusters(temp_clusters, self.__pointer_data);
for cluster_index in range(len(temp_clusters)):
    visualizer.append_cluster_attribute(0, cluster_index, temp_representors[cluster_index], '*', 7);
    visualizer.append_cluster_attribute(0, cluster_index, [ temp_means[cluster_index] ], 'o');
visualizer.show();
You will see a sequence of images, one per step.
Thus, you can display any information that you need.
I would also like to add that you can use the C++ implementation of the algorithm for visualization (it is also part of pyclustering): https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/cure.cpp
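For a small numerical illustration of what the representative points are, here is a minimal sketch of the two steps CURE applies to a single cluster: picking well-scattered points and shrinking them toward the cluster mean. It is not the pyclustering implementation. The function names, the toy cluster, the number of representatives `c = 2`, and the shrink factor `alpha = 0.5` are all assumptions chosen for illustration.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of equally sized coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def scattered_points(points, c):
    """Pick up to c well-scattered points: first the point farthest from the
    mean, then repeatedly the point farthest from those already chosen."""
    mean = centroid(points)
    chosen = []
    for _ in range(min(c, len(points))):
        best, best_dist = None, -1.0
        for p in points:
            if p in chosen:
                continue
            if chosen:
                d = min(math.dist(p, q) for q in chosen)
            else:
                d = math.dist(p, mean)
            if d > best_dist:
                best, best_dist = p, d
        chosen.append(best)
    return chosen


def shrink(reps, mean, alpha):
    """Move every representative a fraction alpha of the way toward the mean."""
    return [tuple(r_i + alpha * (m_i - r_i) for r_i, m_i in zip(r, mean))
            for r in reps]


# Toy cluster (hypothetical data).
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (6.0, 2.0)]
mean = centroid(cluster)                  # (2.0, 1.0)
reps = scattered_points(cluster, c=2)     # [(6.0, 2.0), (0.0, 0.0)]
shrunk = shrink(reps, mean, alpha=0.5)    # [(4.0, 1.5), (1.0, 0.5)]
print(mean, reps, shrunk)
```

In the full algorithm, the distance between two clusters is the minimum distance between their shrunken representatives, and the closest pair of clusters is merged at each step. That merge loop is omitted here.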
I tried to solve this problem by reading other answers but did not find a solution.
I am fitting an lmer model:
MODHET <- lmer(PERC ~ SITE + TREAT + HET + TREAT*HET + (1|PINE), data = PRESU)
PERC is the percentage of predation. SITE is a categorical variable that I am using as a blocking factor; it is the identity of the site where I performed the experiment. TREAT is a categorical variable with 2 levels. HET is a continuous variable. There are 56 observations divided among 7 sites.
Maybe the problem is how I expressed the random factor. In every site I selected 8 pines out of 15 to perform the experiment. I included the pine identity as a categorical random factor. For instance, in site 1 the pines are called a1, a3, a7, etc., while in site 2 they are called b1, b4, b12, etc.
The output of the model is
Error: number of levels of each grouping factor must be < number of observations
I don't understand where the mistake is. Could it be how I named the pines?
I also tried
MODHET <- lmer(PERC ~ SITE + TREAT + HET + TREAT*HET + (1|SITE:PINE), data = PRESU)
but the output is the same.
I hope I have explained my problem well. I have read similar questions on this forum, but I still cannot find the solution.
Thank you for your help.
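Use the argument control = lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.nRE = "ignore") in your lmer call to suppress this error (as far as I can tell, check.nobs.vs.nlev governs this particular message and check.nobs.vs.nRE the related check on the number of random effects). However, I guess this does not solve the actual problem. With 56 observations from 7 sites of 8 pines each, every pine contributes a single observation, so PINE has as many levels as there are observations and your grouping factor contains no "groups". Probably SITE is your random intercept?
If you consider PINE nested as "subjects" within SITE, then I would suggest the following formula: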
MODHET <- lmer(PERC ~ TREAT*HET + (1|SITE), data = PRESU)
or,
MODHET <- lmer(PERC ~ TREAT*HET + (1 | SITE / PINE), data = PRESU)
But my answer may be wrong; I'm not sure I have enough information to fully understand what you're aiming at.
Edit:
Sorry, the nesting was not specified correctly; I have fixed it in the formula above. See also this answer.
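The error message boils down to a simple count comparison, which can be mimicked outside R. The sketch below is not lme4 code. The `grouping_levels_ok` helper is a hypothetical name, and the pine labels are made up (numbered 1 to 8 per site) to match the design described in the question: 7 sites, 8 pines per site, one observation per pine.

```python
def grouping_levels_ok(group_labels):
    """Return True if the grouping factor has fewer levels than observations."""
    n_obs = len(group_labels)
    n_levels = len(set(group_labels))
    return n_levels < n_obs


# Hypothetical labels: one row per observation.
sites = "abcdefg"
site = [s for s in sites for _ in range(8)]               # 7 levels, 56 rows
pine = [f"{s}{i}" for s in sites for i in range(1, 9)]    # 56 levels, 56 rows

print(grouping_levels_ok(site))  # SITE passes the check
print(grouping_levels_ok(pine))  # PINE fails: every level occurs exactly once
```

With one observation per pine, a pine-level random intercept cannot be separated from the residual, which is why a grouping factor needs repeated observations per level.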
I have the SS-US-ELM MATLAB code. I want to compute HOG/HOF features on a data set such as KTH and use the output as the input of the SS-ELM algorithm, but I don't know how to save the output so that the SS-ELM code can read it correctly.
The demo folder contains a "g50c.mat" file with 4 variables: X, y, idxLabs, and idxUnls. How can I make a ".mat" file from the KTH data set and use it as input instead of "g50c.mat"? Any help would be greatly appreciated.
Here is the code that loads "g50c.mat" and uses its variables:
% Semi-supervised ELM (SS-ELM) for semi-supervised classification.
% Ref: Huang Gao, Song Shiji, Gupta JND, Wu Cheng, Semi-supervised and
% unsupervised extreme learning machines, IEEE Transactions on Cybernetics, 2014
format compact;
clear;
addpath(genpath('functions'))
% load data
trial=1;
load g50c;
l=size(idxLabs,2);
u=ceil(size(y,1)*3/4)-2*l;
Xl=X(idxLabs(trial,:),:);
Yl=y(idxLabs(trial,:),:);
% Create validation set
labels=unique(y);
idx_V=[];
for i=1:numel(labels)
    idx_V=[idx_V;find(y(idxUnls(trial,:))==labels(i),l/length(labels),'first')];
end
Xv=X(idxUnls(trial,idx_V),:);
Yv=y(idxUnls(trial,idx_V));
% Create unlabeled and testing set
idxSet=1:size(idxUnls,2);
idx_UT=setdiff(idxSet,idx_V);
idx_rand=randperm(size(idx_UT,2));
Xu=X(idxUnls(trial,idx_UT(idx_rand(1:u))),:);
Yu=y(idxUnls(trial,idx_UT(idx_rand(1:u))),:);
Xt=X(idxUnls(trial,idx_UT(idx_rand(u+1:end))),:);
Yt=y(idxUnls(trial,idx_UT(idx_rand(u+1:end))),:);
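Reading the demo code above, the layout it expects can be inferred: X has one row per sample, y is a column of labels, and each row of idxLabs and idxUnls holds the 1-based sample indices for one trial. The sketch below builds variables with that layout in plain Python. It is only an illustration under those assumptions: the `make_split_indices` helper is hypothetical, the features are toy numbers standing in for real HOG/HOF descriptors, and the purely random split is a guess at how g50c.mat was produced.

```python
import random


def make_split_indices(n_samples, n_labeled, n_trials, seed=0):
    """Return (idx_labs, idx_unls): one row of 1-based indices per trial."""
    rng = random.Random(seed)
    idx_labs, idx_unls = [], []
    for _ in range(n_trials):
        perm = list(range(1, n_samples + 1))  # 1-based, as MATLAB indexing expects
        rng.shuffle(perm)
        idx_labs.append(sorted(perm[:n_labeled]))
        idx_unls.append(sorted(perm[n_labeled:]))
    return idx_labs, idx_unls


# Toy stand-in for the descriptors: 6 clips, 4 features each (hypothetical).
X = [[float(i + j) for j in range(4)] for i in range(6)]
y = [[1], [1], [2], [2], [3], [3]]            # column vector of class labels
idxLabs, idxUnls = make_split_indices(len(X), n_labeled=3, n_trials=2)

data = {"X": X, "y": y, "idxLabs": idxLabs, "idxUnls": idxUnls}
print(idxLabs, idxUnls)
```

If SciPy is available, `scipy.io.savemat("kth.mat", data)` should write these four variables to a file that MATLAB's `load` can read ("kth.mat" is a made-up file name). Note that the demo script also needs enough unlabeled samples of each class to build its validation set, so a real data set has to be much larger than this toy one.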
When I ran Weka's J48 with the binary split option, the following decision tree was built.
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
The input explanatory variable is a single nominal attribute made from question id + answer id.
There is one nominal value per transaction.
I'm wondering why the tree grows on only one side.
Is it caused by my data set, by the table definition, or by the way binary splits work?
I'd like the tree to have nodes on both sides.
If you know of such an option, please show me.
Sample data:
usr,qa,class
A,11,1
A,21,1
A,31,1
B,12,2
B,22,2
B,32,2
C,13,3
C,23,3
C,33,3
D,11,4
D,22,4
D,31,4
E,11,1
E,23,1
E,31,1
F,12,2
F,22,2
F,33,2
G,13,3
G,22,3
G,32,3
H,12,4
H,21,4
H,33,4
There is no error in the tree that was built, and no option would really change it. If your question is related to your Akinator project, please reformat your data so that all questions (i.e. 11, 21, 31) are on the same instance/line, with the answer as the target class.
PS: if you import the data as CSV, Weka will treat the values as numeric (not as nominal). You should then add a non-digit character (e.g. #1, #2, #3...) so that Weka treats the data as nominal.
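The reshaping suggested above could be sketched like this. It assumes that the first digit of qa is the question id and the second digit is the answer id, which is my reading of "question id + answer id". The `pivot` helper and the q1/q2/q3 column names are hypothetical, and only two users from the sample data are used to keep the example short.

```python
import csv
import io

LONG_CSV = """usr,qa,class
A,11,1
A,21,1
A,31,1
D,11,4
D,22,4
D,31,4
"""


def pivot(long_csv):
    """Turn one row per (user, question) into one row per user.
    Values get a '#' prefix so that Weka reads them as nominal."""
    rows = {}
    for rec in csv.DictReader(io.StringIO(long_csv)):
        question, answer = rec["qa"][0], rec["qa"][1]
        row = rows.setdefault(rec["usr"], {"class": "#" + rec["class"]})
        row["q" + question] = "#" + answer
    return rows


wide = pivot(LONG_CSV)

# Write the wide table as CSV text, one line per user.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["q1", "q2", "q3", "class"])
for usr, row in wide.items():
    writer.writerow([row["q1"], row["q2"], row["q3"], row["class"]])
print(out.getvalue())
```

Each line then describes one user with one attribute per question, so J48 can split on a different question at each node.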