How to split a dataset into two random halves in Weka?

I want to split my dataset into two random halves in Weka.
How can I do it?

I had the same question, and the answer is quite simple. First, randomly shuffle the order of instances with a Weka filter (Unsupervised -> Instance), and then split the dataset into two parts. You can find a complete explanation at the link below:
http://cs-people.bu.edu/yingy/intro_to_weka.pdf

You can first use the Randomize filter to shuffle the dataset, then use the RemovePercentage filter: run it once at 30% for the test set and save it, then run it again on the same randomized data with the invert option checked to get the other 70% and save that as well.
That way you end up with randomized, properly split training and test sets.
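If you prefer to do the same thing from the Weka Java API instead of the Explorer GUI, a minimal sketch might look like this (Randomize and RemovePercentage are the standard Weka filter classes; the file name and random seed are just placeholders, so adapt them to your setup):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Randomize;
import weka.filters.unsupervised.instance.RemovePercentage;

public class SplitDataset {
    public static void main(String[] args) throws Exception {
        // Load the full dataset (path is just an example)
        Instances data = DataSource.read("dataset.arff");

        // Shuffle the instances
        Randomize randomize = new Randomize();
        randomize.setRandomSeed(42);
        randomize.setInputFormat(data);
        Instances shuffled = Filter.useFilter(data, randomize);

        // First half: remove 50% of the instances
        RemovePercentage removeFirst = new RemovePercentage();
        removeFirst.setPercentage(50);
        removeFirst.setInputFormat(shuffled);
        Instances half1 = Filter.useFilter(shuffled, removeFirst);

        // Second half: same filter with the selection inverted
        RemovePercentage removeSecond = new RemovePercentage();
        removeSecond.setPercentage(50);
        removeSecond.setInvertSelection(true);
        removeSecond.setInputFormat(shuffled);
        Instances half2 = Filter.useFilter(shuffled, removeSecond);

        System.out.println(half1.numInstances() + " + " + half2.numInstances());
    }
}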

Here is an idea that does not use the Weka API: how about a random number generator? Math.random() generates numbers between 0 and 1.
Suppose that we want to split dataset into set1 and set2.
// dataset: any iterable collection of instances; set1/set2: the two output collections
for (Object instance : dataset) {
    if (Math.random() < 0.5) {
        set1.add(instance);
    } else {
        set2.add(instance);
    }
}
This method should produce roughly equal numbers of instances in the two subsets. If you want exactly equal sizes, you can add extra conditions to the if-else.
Hope this offers you some inspiration.
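If exactly equal halves are required, a simpler alternative (my own sketch, not part of the answer above; it assumes the instances are held in a plain java.util.List) is to shuffle a copy of the list and cut it at the midpoint:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class HalfSplitter {
    // Shuffle a copy of the instances and split it in the middle, so the
    // two halves differ in size by at most one element.
    static <T> List<List<T>> splitInHalf(List<T> dataset) {
        List<T> copy = new ArrayList<>(dataset);
        Collections.shuffle(copy);
        int mid = copy.size() / 2;
        List<List<T>> halves = new ArrayList<>();
        halves.add(new ArrayList<>(copy.subList(0, mid)));
        halves.add(new ArrayList<>(copy.subList(mid, copy.size())));
        return halves;
    }
}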

Related

Can't distribute items between arrays

Imagine you have a list of objects. Each object looks like:
{'itemName':'name',
'totalItemAppearance':100,
'appearancePerList': 20}
and some number X which stands for the number of lists that can contain such items.
What I need to do is randomly pick items and put them into lists while respecting the item parameters.
In the end I expect X lists in which each item is used (across all lists) exactly 'totalItemAppearance' times, but in each single list it appears at most 'appearancePerList' times.
It looks simple, but I don't know how to build an algorithm for it properly, and I can't classify the type of "distribution problem" this is, so I could properly ask Google.
Thank you for your replies!
First of all, you need not consider all the different types of objects at the same time: there are no relations between different kinds of objects, so I will only consider the case where there is only one type of object.
What you want to do is to pick a uniform random sample from a set of objects satisfying some condition. The objects here are all possible distributions of the objects to the lists, and the condition is that the total number of objects should be 'totalItemAppearance' and that no list contains more than 'appearancePerList' objects.
If 'appearancePerList' is not too small then you can apply the following algorithm (and not wait for an eternity):
--> Pick a uniform random distribution of 'totalItemAppearance' items to lists (much easier to do)
--> If there are at most 'appearancePerList' objects in each list accept
--> Otherwise repeat
This algorithm will produce the uniform samples you wanted; it is essentially a special case of rejection sampling.
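A minimal Java sketch of this rejection loop (the variable names mirror the question; items are treated as distinguishable, which makes the accepted samples uniform over the valid assignments):

import java.util.Arrays;
import java.util.Random;

class RejectionSampler {
    // Repeatedly draw a random assignment of items to lists and accept it
    // only if no list exceeds the per-list cap; otherwise reject and retry.
    static int[] sampleDistribution(int totalItemAppearance, int appearancePerList,
                                    int numLists, Random rng) {
        int[] counts = new int[numLists];
        while (true) {
            Arrays.fill(counts, 0);
            for (int i = 0; i < totalItemAppearance; i++) {
                counts[rng.nextInt(numLists)]++;   // each item picks a list uniformly
            }
            boolean ok = true;
            for (int c : counts) {
                if (c > appearancePerList) { ok = false; break; }
            }
            if (ok) return counts;                 // accept; otherwise repeat
        }
    }
}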

algorithm: is there a map-reduce way to merge a group of sets by deleting all the subsets

The problem is: Suppose we have a group of Sets: Set(1,2,3) Set(1,2,3,4) Set(4,5,6) Set(1,2,3,4,6), we need to delete all the subsets and finally get the Result: Set(4,5,6) Set(1,2,3,4,6). (Since both Set(1,2,3) and Set(1,2,3,4) are the subsets of Set(1,2,3,4,6), both are removed.)
And suppose that the elements of the set have order, which can be Int, Char, etc.
Is it possible to do it in a map-reduce way?
The reason to do it in a map-reduce way is that the group of sets sometimes has a very large size, which makes it impossible to handle in the memory of a single machine. So we hope to do it in a map-reduce way; it may not be very efficient, but it just needs to work.
My problem is:
I don't know how to define a key for the key-value pair in the map-reduce process to group Sets properly.
I don't know when the process should be finished, that all the subsets have been removed.
EDIT:
The size of the data will keep growing larger in the future.
The input can be either a group of sets or multiple lines, each containing a group of sets. Currently the input is val data = RDD[Set]; I first do data.collect(), which yields one overall group of sets. But I can modify the generation of the input into an RDD[Array[Set]], which will give me multiple lines, each containing a group of sets.
The elements in each set can be sorted by modifying other parts of the program.
I doubt this can be done by a traditional map-reduce technique which is essentially a divide-and-conquer method. This is because:
in this problem each set has to essentially be compared to all of the sets of larger cardinality whose min and max elements lie around the min and max of the smaller set.
unlike sorting and other problems amenable to map-reduce, we don't have a transitivity relation, i.e., if A is not-a-subset-of B and B is-not-subset-of C, we cannot make any statement about A w.r.t. C.
Based on the above observations this problem seems to be similar to duplicate detection and there is research on duplicate detection, for example here. Similar techniques will work well for the current problem.
Since subset-of is a transitive relation (proof), you could take advantage of that and design an iterative algorithm that eliminates subsets in each iteration.
The logic is the following:
Mapper:
eliminate local subsets and emit only the supersets. Let the key be the first element of each superset.
Reducer:
eliminate local subsets and emit only the supersets.
You could also use a combiner with the same logic.
Each time, the number of reducers should decrease, until, in the last iteration, a single reducer is used. This way, you can fix the number of iterations from the beginning. E.g., by starting with 8 reducers and using half of them in each following iteration, your program will terminate after 4 iterations (8 reducers, then 4, then 2, then 1). In general, it will terminate after log2(n) + 1 iterations, where n is the initial number of reducers; n should therefore be a power of 2 and, of course, less than the number of mappers. If this feels restrictive, you can use more drastic decreases in the number of reducers (e.g., divide by 4, or more).
Regarding the choice of the key, this can create balancing issues, if, for example, most of the sets start with the same element. So, perhaps you could also make use of other keys, or define a partitioner to better balance the load. This policy makes sure, though, that sets that are equal will be eliminated as early as possible.
If you have MapReduce v.2, you could implement the aforementioned logic like this (pseudocode):
Mapper:

Set<Set<Integer>> superSets;

setup() {
    superSets = new HashSet<>();
}

map(Set<Integer> inputSet) {
    Set<Set<Integer>> toReplace = new HashSet<>();
    for (Set<Integer> superSet : superSets) {
        if (superSet.containsAll(inputSet)) {
            return;                     // inputSet is covered by an existing set: drop it
        }
        if (inputSet.containsAll(superSet)) {
            toReplace.add(superSet);    // an existing set is covered by inputSet: replace it
        }
    }
    superSets.removeAll(toReplace);
    superSets.add(inputSet);
}

close() {
    for (Set<Integer> superSet : superSets) {
        // emit with the first element as the key (elements are assumed sorted)
        context.write(superSet.iterator().next(), superSet);
    }
}
You can use the same code in the reducer and in the combiner.
As a final note, I doubt that MapReduce is the right environment for this kind of computation. Perhaps Apache Spark or Apache Flink offer better alternatives.
If I understand correctly:
your goal is to detect and remove every set that is a subset of another set, within a large collection of sets;
there are too many sets to manage them all at once (memory limit);
the strategy is map-reduce (or something like it).
What I take into account:
the main problem is that you cannot manage everything at the same time;
the usual map-reduce method splits the data and treats each part independently, which does not quite work here (because any set can intersect with any other set).
If I do some calculations:
suppose you have a large collection: 1,000,000 sets of 3 to 20 numbers drawn from 1 to 100;
you would have to compare on the order of 1,000 billion pairs of sets;
even with 100,000 sets (10 billion pairs), it takes too much time (I stopped the test).
What I propose (tested with 100,000 sets; a code sketch follows at the end of this answer):
1. Define a criterion to split the collection into smaller, compatible packages. Compatible packages are groups of sets chosen so that every subset relation appears inside at least one package: that way you are guaranteed to find every subset to remove. Said differently: if set A is a subset of set B, then A and B will both end up in at least one common package.
I simply take every set that contains a given element (1, 2, 3, ...) => with the assumptions above, this gives roughly 11,500 sets per package.
That becomes reasonable to compare (about 120,000 comparisons).
It takes 180 seconds on my machine and finds about 900 subsets to remove.
You have to do this 100 times (once per element), so about 18,000 seconds in total.
And of course, you will find duplicates (but not too many: a few percent, and the goal is elimination anyway).
2. At the end it is easy and fast to merge the results. The duplicated work is light.
3. Bigger filters:
with a filter on 2 common elements, each package shrinks to about 1,475 sets => you find roughly 30 sets to delete per package, and it takes 2-3 seconds;
but you have to do that 10,000 times.
Advantages of this method:
the selection of sets by the criterion is linear and very simple. It is also hierarchical: split on one element, then on a second, and so on.
it is stateless: you can filter millions of sets and only have to keep the good ones. The more data you have, the more filtering passes you need => the solution is scalable.
if you want to work on smaller clouds of sets, you can require 3, 4, ... elements in common.
this way, you can spread the work across multiple machines (as many as you have).
At the end, you have to reconcile all your data and deletions.
This solution does not save much time overall (you can do the math), but it suits the need of splitting the work.
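Here is a rough Java sketch of the packaging idea above (my own illustration, in-memory for clarity; in practice each package would be processed independently, possibly on a different machine):

import java.util.*;

class PackageFilter {
    // Group sets into "packages": package e holds every set containing element e.
    // If A is a non-empty subset of B, any element of A puts both A and B in the
    // same package, so every subset relation is visible inside at least one package.
    static Set<Set<Integer>> findSubsetsToRemove(List<Set<Integer>> allSets) {
        Map<Integer, List<Set<Integer>>> packages = new HashMap<>();
        for (Set<Integer> s : allSets) {
            for (int e : s) {
                packages.computeIfAbsent(e, k -> new ArrayList<>()).add(s);
            }
        }
        // Pairwise comparison inside each package (the expensive part, but each
        // package is far smaller than the whole collection and is independent).
        Set<Set<Integer>> toRemove = new HashSet<>();
        for (List<Set<Integer>> pack : packages.values()) {
            for (int i = 0; i < pack.size(); i++) {
                for (int j = 0; j < pack.size(); j++) {
                    if (i != j && pack.get(j).containsAll(pack.get(i))
                            && pack.get(j).size() > pack.get(i).size()) {
                        toRemove.add(pack.get(i));   // proper subset: mark for removal
                    }
                }
            }
        }
        return toRemove;
    }
}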
Hope it helps.

Building a histogram faster

I am working with a large dataset that I need to build a histogram of. I feel like my method of just going through the entire list and recording the frequencies in a second array is a slow approach. Any suggestions on how to speed the process up?
Given that a histogram is a graph containing the counts of all items in each bin, you can't make one without visiting all the items.
However, you can:
Create the histogram as you collect the data. Then it takes no time to generate.
Break up the data into N parts and work on each part in parallel. When each part is done counting, just sum the results for each bin; see the sketch after this list. (You can also combine this with #1.)
Sample the data. In theory, by looking at a fraction of your data, you should be able to estimate the rest of it.
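For option 2, a rough Java sketch of the split-count-merge idea (equal-width bins over [min, max); the double[] input, bin count and thread count are all assumptions for illustration):

import java.util.*;
import java.util.concurrent.*;

class ParallelHistogram {
    // Count values into numBins equal-width bins over [min, max), in parallel.
    static long[] histogram(double[] data, int numBins, double min, double max, int nThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int chunk = (data.length + nThreads - 1) / nThreads;
        List<Future<long[]>> parts = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            final int from = t * chunk;
            final int to = Math.min(data.length, from + chunk);
            parts.add(pool.submit(() -> {
                long[] local = new long[numBins];          // each worker counts its own chunk
                for (int i = from; i < to; i++) {
                    int bin = (int) ((data[i] - min) / (max - min) * numBins);
                    if (bin >= 0 && bin < numBins) local[bin]++;
                }
                return local;
            }));
        }
        long[] total = new long[numBins];
        for (Future<long[]> f : parts) {                   // merge: sum the per-chunk counts
            long[] local = f.get();
            for (int b = 0; b < numBins; b++) total[b] += local[b];
        }
        pool.shutdown();
        return total;
    }
}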

Algorithm for grouping RESTful routes

Given a list of URLs known to be somewhat "RESTful", what would be a decent algorithm for grouping them so that URLs mapping to the same "controller/action/view" are likely to be grouped together?
For example, given the following list:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
It would group them as follows:
http://www.example.com/foo

http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3

http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
Nothing is known about the order or structure of the URLs ahead of time. In my example, it would be somewhat easy since the IDs are obviously numeric. Ideally, I'd like an algorithm that does a good job even if IDs are non-numeric (as in http://www.example.com/products/rocket and http://www.example.com/products/ufo).
It's really just an effort to say, "Given these URLs, I've grouped them by removing what I think is the 'variable' ID part of the URL."
Aliza has the right idea: you want to look for the 'articulation points' (in REST, basically where a parameter is being passed). Looking only for a single point of change gets tricky, though.
Example
http://www.example.com/foo/1/new
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/bar/1/new
These can be grouped in several equally good ways, since we have no idea of the URL semantics. It really boils down to one question: is this piece of the URL part of the REST descriptor, or a parameter? If we knew what all the descriptors were, the rest would be parameters and we would be done.
Given a sufficiently large dataset, we'd want to look at the statistics of all URLs at each depth, e.g. /x/y/z/t/. We would count the number of occurrences in each slot and build a large joint probability distribution table.
We can now look at the distribution of symbols: a slot with many distinct values is likely a parameter. We would start from the bottom and look at conditional probabilities, i.e. what is the probability of x being foo, then what is the probability of y taking some value given x, and so on. I'd have to think more to find a systematic way of extracting these, but it seems like a promising start.
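A small sketch of the counting step (my own illustration with hypothetical names; it just tallies how often each token occurs at each path depth, which is the signal the approach above relies on):

import java.util.*;

class SlotStatistics {
    // For each path depth, count how often each token occurs.
    // Slots with many distinct tokens are likely parameters (IDs);
    // slots dominated by a few tokens are likely fixed descriptors.
    // Note: "//" in the scheme produces an empty token, but it is the same
    // for every URL, so it does not affect the comparison.
    static List<Map<String, Integer>> countSlots(List<String> urls) {
        List<Map<String, Integer>> slots = new ArrayList<>();
        for (String url : urls) {
            String[] parts = url.split("/");
            for (int depth = 0; depth < parts.length; depth++) {
                while (slots.size() <= depth) slots.add(new HashMap<>());
                slots.get(depth).merge(parts[depth], 1, Integer::sum);
            }
        }
        return slots;
    }
}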
Split each URL into an array of strings, using '/' as the delimiter.
e.g. http://www.example.com/foo/1/edit will give the array [http:,www.example.com,foo,1,edit]
If two arrays (URLs) share the same value in all indices except for one, they will be in the same group.
e.g. http://www.example.com/foo/1/edit = [http:,www.example.com,foo,1,edit] and
http://www.example.com/foo/2/edit = [http:,www.example.com,foo,2,edit]. The arrays match in all indices except for #3, which is 1 in the first array and 2 in the second array. Therefore, the URLs belong to the same group.
It is easy to see that urls like http://www.example.com/foo/3 and http://www.example.com/foo/1/edit will not belong to the same group according to this algorithm.
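A sketch of this rule in Java (my own illustration; it merges URLs transitively with a small union-find, so for example /foo/1 and /foo/3 end up together via /foo/2):

import java.util.*;

class UrlGrouper {
    private static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    // Two URLs belong together if their token arrays have equal length and
    // differ in exactly one position (that position is the presumed ID).
    private static boolean differInOneSlot(String[] a, String[] b) {
        if (a.length != b.length) return false;
        int diffs = 0;
        for (int i = 0; i < a.length; i++) if (!a[i].equals(b[i])) diffs++;
        return diffs == 1;
    }

    static Collection<List<String>> group(List<String> urls) {
        int n = urls.size();
        String[][] tokens = new String[n][];
        for (int i = 0; i < n; i++) tokens[i] = urls.get(i).split("/");
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (differInOneSlot(tokens[i], tokens[j]))
                    parent[find(parent, i)] = find(parent, j);   // union the two groups
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++)
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(urls.get(i));
        return groups.values();
    }
}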

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party" and "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned.
start scanning from the first line; for that line's ("calling party", "called party") combination, check for the same combination in the rest of the dataset
sum the call durations and divide the result by the sum of all call durations
add the row numbers of the summed lines to the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although this is not programming related, I would like to emphasize that, due to law and respect for user privacy, all information that could possibly reveal a user's identity was hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
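In Java, the same idea can be expressed with a HashMap keyed on the directed (calling, called) pair; a rough sketch, where the {calling, called, duration} record layout is an assumption:

import java.util.HashMap;
import java.util.Map;

class CallAggregator {
    // Accumulated statistics for one directed (calling, called) edge.
    static class EdgeStats {
        long calls = 0;
        double totalDuration = 0;
    }

    // key = callingParty + "->" + calledParty, so the direction is preserved
    static Map<String, EdgeStats> aggregate(Iterable<String[]> records) {
        Map<String, EdgeStats> edges = new HashMap<>();
        for (String[] r : records) {              // r = {calling, called, duration}
            String key = r[0] + "->" + r[1];
            EdgeStats s = edges.computeIfAbsent(key, k -> new EdgeStats());
            s.calls++;
            s.totalDuration += Double.parseDouble(r[2]);
        }
        return edges;
    }
}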
Since there are 600 M records, it seems to be large enough to leverage a database (and not too large to require a distributed Database). So, you could simply load this into a DB (MySQL, SQLServer, Oracle, etc) and run the following queries:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max (call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hashfunction and to find the matching record from the bucket.
Although it may be tempting to create a two-dimensional array indexed by (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it would be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining whom X frequently called) or following the links backward (determining by whom X was frequently called)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sort of near Y?
If you want fast and frequently adapted results, be aware that a dense representation with good memory and temporal locality can easily make a huge difference in performance. In particular, it can easily outweigh a factor of ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch them if you can avoid it at all; they are likely to be a factor of 10,000 slower - or more, the more complex the queries you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately and then do an N-way merge. That is memory efficient and can be easily parallelized.
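A sketch of the single aggregation pass over the already-sorted records (the chunked sort and N-way merge themselves are omitted; the {calling, called, duration} record layout is an assumption):

import java.util.Iterator;

class SortedPassAggregator {
    // records are assumed sorted by (calling, called); emit one weight per pair.
    static void aggregate(Iterator<String[]> sortedRecords) {
        String currentKey = null;
        double weight = 0;
        while (sortedRecords.hasNext()) {
            String[] r = sortedRecords.next();        // r = {calling, called, duration}
            String key = r[0] + "->" + r[1];
            if (!key.equals(currentKey)) {
                if (currentKey != null) System.out.println(currentKey + " " + weight);
                currentKey = key;
                weight = 0;
            }
            weight += Double.parseDouble(r[2]);       // sum of call durations for this pair
        }
        if (currentKey != null) System.out.println(currentKey + " " + weight);
    }
}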
