Weka - Error importing a .csv file into Weka - algorithm

I'm using Weka to implement unsupervised algorithms like LOF to detect outliers within a given dataset. Of course, the dataset has to contain binary attributes, since LOF depends on this (according to Weka's compatibility information), and it does (0,0,1,0,1). However, when I try to import the dataset from the preprocessing section in Weka, I get back an error stating this:
File: C:\Users\JohnDoe\Downloads\labelled_Dataset.csv
Reason: index 13 out of bounds for length 13
Problem encountered on line: 293
Does anyone here have any ideas on how to overcome this? I've never encountered this error before, and a Google search doesn't particularly help in this context either, as it comes back with unrelated results. I've been at this for four days; any advice will be taken into consideration.
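This error usually means that line 293 of the CSV has a different number of fields than the 13 columns the header declares (for example because of an unquoted comma inside a value, or a trailing comma). A minimal sketch in Python to confirm this, assuming only the file path shown in the error message:

import csv

# Report every row whose field count differs from the header's.
# A mismatch on line 293 would explain "index 13 out of bounds for length 13".
path = r"C:\Users\JohnDoe\Downloads\labelled_Dataset.csv"

with open(path, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    expected = len(header)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"line {line_no}: {len(row)} fields, expected {expected}")

If a malformed row turns up, quoting the offending field or repairing that row by hand is usually enough for Weka's CSV loader to accept the file.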

Related

Interpret Google AutoML Online Prediction Results

We are using Google AutoML Tables with CSV files as input. We have imported the data, linked the whole schema with nullable columns, trained the model, and then deployed it and used online prediction to predict the value of one column.
The column we targeted has values in the range 44-263 (min-max).
When we deployed and ran the online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range 44-263? We didn't find much documentation online on this.
We are looking for documentation references and an interpretation of the results, including the 95% prediction interval.
To clarify (I'm the PM of AutoML Tables):
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a distribution of min/max 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column
2) Your input features for this prediction are dramatically different from what was seen in the training data.
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further.
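For possibility 1), one quick sanity check is to look at the distribution of the column that was actually used as the label before retraining. A minimal pandas sketch; the file and column names here are placeholders, not taken from the question:

import pandas as pd

# Placeholder names -- substitute the real training CSV and target column.
df = pd.read_csv("training_data.csv")
target = "target_column"

# If min/max here are far from 44-263, the wrong label column was selected.
print(df[target].describe()[["min", "max", "mean"]])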

Generative Adversarial Networks, divergence of the discriminator and picture artifacts (Pix2Pix)

I'm currently trying to implement the Pix2Pix algorithm, which is a GAN structure, but I have some issues with the convergence of the discriminator and the output pictures of the generator...
1) Convergence Problem :
It seems that the discriminator doesn't converge at all. When I print the loss of the generator, it seems to work very well:
But when I print the loss of the discriminator, I get the following plot:
Or, more precisely:
Do you know what the possible reasons for such behavior are?
How can I stabilize the learning of the discriminator?
2) Chromatic aberrations
I also have some problems with the generated pictures. I often get a total saturation of the colors; the rendered objects have only one color, like this:
A workaround seems to be to train the discriminator only every 200 steps, for example (see the sketch after this question); in that case I obtain something like this:
But it is not satisfying at all...
(To clarify: the first column is the input of the generator, the second is the output of the generator, and the third is the target picture. For the moment I'm only trying to get my network to reproduce the same picture... it should be easy...)
NB: the initialization also seems to play a really important role in the colors; with the exact same parameters, I obtained very different results after thousands of steps.
Does anyone have an idea to explain these phenomena?
Thanks a lot for reading, and for your potential help!
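To make the every-200-steps workaround from the question concrete, here is a minimal, self-contained PyTorch sketch with toy one-layer networks and random tensors standing in for the real Pix2Pix models and data; it only illustrates where the update interval fits in the training loop, not a recommended configuration:

import torch
import torch.nn as nn

# Toy stand-ins for the Pix2Pix nets, just to show the update schedule.
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(6, 1, 3, padding=1))  # D sees input+output stacked
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()
d_update_every = 200  # the interval mentioned in the question

for step in range(1000):
    x = torch.randn(1, 3, 64, 64)  # dummy input picture
    y = torch.randn(1, 3, 64, 64)  # dummy target picture
    fake = G(x)

    if step % d_update_every == 0:  # update D only occasionally
        opt_d.zero_grad()
        real_logits = D(torch.cat([x, y], dim=1))
        fake_logits = D(torch.cat([x, fake.detach()], dim=1))
        d_loss = (bce(real_logits, torch.ones_like(real_logits))
                  + bce(fake_logits, torch.zeros_like(fake_logits)))
        d_loss.backward()
        opt_d.step()

    opt_g.zero_grad()  # update G every step: fool D + L1 term, as in Pix2Pix
    fake_logits = D(torch.cat([x, fake], dim=1))
    g_loss = (bce(fake_logits, torch.ones_like(fake_logits))
              + 100 * nn.functional.l1_loss(fake, y))
    g_loss.backward()
    opt_g.step()

More common stabilizers worth trying before a hard 200-step interval are a lower learning rate for the discriminator, label smoothing, or balancing the number of D and G updates per step.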

SystemML Decision Tree - "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10"

I am trying to run a decision tree on the SystemML standalone version on Windows (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/decision-tree.dml), but I keep receiving the error "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10. THIS NODE IS DECLARED AS LEAF!". It seems like the code is not computing any split, although I am able to build the tree via R. Has anyone used this algorithm before and has some tips on how to solve the error?
Thank you
This message generally indicates that a split on the best categorical or scale features would not give any additional gain.
I would recommend that you:
1) Investigate the computed gain (best_cat_gain, best_scale_gain)
2) Double-check that the metadata (num_cat_features, num_scale_features) is correctly recognised
You could simply put additional print statements into the script to do that. In case the metadata is invalid, you might want to check that the optional input R has the right layout, as described in the header of the script.
If this does not help, please share the input arguments, the format of the input data, etc., and we'll have a closer look.

How do Statistica's 75%/25% data sampling and 10-fold cross-validation work together?

I performed an analysis on some data using Dell's Statistica software, and I am using this analysis in a scientific paper. Although data mining is not my primary topic, I took a Data Mining class before and have some knowledge.
I know that data is either separated into 75%/25% (the numbers may change) training and test parts, or n-fold cross-validation is used to test the model's performance.
In Statistica's SVM modeling, prior to the execution of the model there are tabs for configuration. In the data sampling tab I entered a 75%/25% separation, and in the cross-validation tab I entered 10-fold cross-validation. In the output, I see that the data was actually separated into training and test parts (model predictions are given for the test values).
There is also a cross-validation error. I will copy the results below. I have difficulty understanding and interpreting this output. I hope someone who knows statistics better than me, and/or who is more experienced with this tool, can explain to me how it works.
Ferda
Support Vector Machine results
SVM type: Regression type 1 (capacity=9.000, epsilon=0.100)
Kernel type: Radial Basis Function (gamma=0.053)
Number of support vectors = 705 (674 bounded)
Cross-validation error = 0.244
Mean squared error = 1.830 (Train), 0.193 (Test), 1.267 (Overall)
S.D. ratio = 0.952 (Train), 37076026627971.336 (Test), 0.977 (Overall)
Correlation coefficient = 0.314 (Train), -0.000 (Test), 0.272 (Overall)
I found out that the Statistica website has an answer to my misunderstanding. In the Sampling tab, the data may be separated into training and test sets; in the Cross-validation tab, if for example 10 is selected, then 10-fold cross-validation is used to decide the proper SVM parameters (nu, epsilon, etc.) for the execution of the SVM modeling.
This explanation cleared up my problem. I hope it helps people in similar situations...
Ferda
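For readers who think in code, the workflow described above maps roughly onto the following scikit-learn sketch; this is an analogy with synthetic data, not Statistica's actual implementation:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=0)

# Sampling tab: 75% / 25% split into training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Cross-validation tab: 10-fold CV on the training part only, used to pick
# SVM hyperparameters (capacity C, epsilon), not to report final performance.
search = GridSearchCV(SVR(kernel="rbf"),
                      param_grid={"C": [1, 9, 100], "epsilon": [0.01, 0.1, 1.0]},
                      cv=10)
search.fit(X_train, y_train)

# Final model predictions are evaluated on the held-out 25%, as in the output above.
print(search.best_params_, search.score(X_test, y_test))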

Cross product and reading headers in Hadoop

I have a Hadoop document-similarity project that I'm working on, and I'm stuck on one part. The situation looks like this: I have a document-term index table stored in a CSV file:
"", t1,t2,t3,t4,....
doc1,f11,f12,f13,f14,....
doc2,f21,f22,f23,f24,....
doc3,f31,f32,f33,f34,....
.
.
.
where f12 means the frequency of term 2 (t2) in document 1 (doc1).
On the other hand, I have a query file containing the queries whose nearest or most similar documents need to be found:
"", t1,t3,t122,t34,....
q1,f11,f12,f13,f14,....
q2,f21,f22,f23,f24,....
q3,f31,f32,f33,f34,....
.
.
.
but the query file may contain different terms, so I need to take the cross product of these two (the term index and the queries) in order to find the distances between each query and the existing documents.
The problem has two parts:
First, how do I read the headers of each of these CSV files and store them in some term vector, considering that the files will be split across different machines?
Second, how do I compute the cross product of these two files in order to find the similarity (i.e., build a representation that has all the possible terms (dimensions) so the similarity can be computed)?
I'm planning to write a k-nearest-neighbour algorithm to find the similarity.
Which tool or tools should I use: Pig, Hive, Mahout?
There is a separate chapter in the book MapReduce Design Patterns on the Cartesian product, with source code given.
Yes to @vefthym's answer, and I have been reading the same chapter in the same book!
HOWEVER, the runtime is incredibly long! Following the approach in the book, for a 600 KB dataset containing 20,000 records, running the Cartesian product takes over 10 hours! I know that for 20,000 records the computation runs nearly 200 million times and the I/O accesses number about 400 million, which is huge, but I feel it is impractical for this to work on a big dataset amounting to GBs or TBs.
I am contacting the author to see if he got the same runtime. I will let you guys know.
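As a side note on the second part of the question: before distributing anything, the dimension-alignment and cross-product step itself can be sketched in plain single-machine Python (the file names here are hypothetical):

import csv

def read_matrix(path):
    # Read a CSV like the ones above into ({row_id: {term: freq}}, term list).
    with open(path, newline="") as f:
        reader = csv.reader(f)
        terms = next(reader)[1:]  # header row: "", t1, t2, ...
        rows = {r[0]: dict(zip(terms, map(float, r[1:]))) for r in reader}
    return terms, rows

doc_terms, docs = read_matrix("docs.csv")        # hypothetical file names
query_terms, queries = read_matrix("queries.csv")

# Union of both headers: the shared dimension space for every vector.
all_terms = set(doc_terms) | set(query_terms)

def distance(a, b):
    # Euclidean distance over the unioned terms; missing terms count as 0.
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in all_terms) ** 0.5

# Cross product: every (query, document) pair, as a 1-nearest-neighbour search.
for q_id, q_vec in queries.items():
    nearest = min(docs, key=lambda d: distance(q_vec, docs[d]))
    print(q_id, "->", nearest)

A MapReduce version distributes exactly these (query, document) pairs, which is why the Cartesian-product pattern becomes so expensive as the inputs grow.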
