Mahout Clustering with one dim K-means [duplicate] - hadoop

This question already has answers here:
1D Number Array Clustering
(6 answers)
Closed 8 years ago.
Can I cluster data on a single variable instead of many (which is what I have already tested) using the Mahout k-means algorithm? If yes (I hope so :) ), could you give me an example of such a clustering? Thanks.

How big is your data? If it is not exabytes, you would be better off without Mahout.
If it is exabytes, use sampling, and then process it on a single machine.
See also:
Cluster one-dimensional data optimally?
1D Number Array Clustering
Which clustering algorithm is suitable for one-dimensional Lists without knowing k?
and many more.
Mahout is not your general go-to place for data analysis. It only shines when you have Google-scale data. Otherwise, the overhead is too large.
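Since the point above is that one-dimensional data fits comfortably on a single machine, here is a minimal sketch (no Mahout) of Lloyd's k-means on a plain double[]. The class name, the quantile-style seeding, and the fixed iteration count are illustrative choices only; for truly optimal 1-D clustering you would use the sort-plus-dynamic-programming approach from the linked questions.

```java
import java.util.Arrays;

// Minimal single-machine k-means (Lloyd's algorithm) on 1-D data.
// Illustrative sketch only; for optimal 1-D clustering, sort the values
// and use dynamic programming (see the linked questions).
public class OneDimKMeans {

    public static double[] cluster(double[] data, int k, int maxIter) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] centers = new double[k];
        // Seed centers with evenly spaced order statistics of the sorted data.
        for (int c = 0; c < k; c++) {
            centers[c] = sorted[(int) ((c + 0.5) * sorted.length / k)];
        }
        int[] assign = new int[data.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: nearest center by absolute distance.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) {
                        best = c;
                    }
                }
                assign[i] = best;
            }
            // Update step: each center becomes the mean of its assigned points.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[assign[i]] += data[i];
                count[assign[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] data = {1.1, 0.9, 1.0, 5.2, 4.8, 5.0, 9.9, 10.1};
        System.out.println(Arrays.toString(cluster(data, 3, 20)));
    }
}
```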

Related

Design and Analysis of Algorithms? [duplicate]

This question already has answers here:
Sort with the limited memory
(6 answers)
Closed 5 years ago.
To sort a limited number of records, we usually use RAM to hold the elements being processed. The problem arises when we are asked to sort millions of records, where each record contains a set of elements. Such a huge file cannot be sorted with traditional in-memory sorting algorithms. How can I solve this problem?
You need to look for an efficient algorithm that sorts data which cannot be read completely into memory. A few adaptations to merge sort can achieve this (this is known as external sorting), and there are Java implementations of merge sort that handle very large files this way.
Take a look at these too:
http://en.wikipedia.org/wiki/Merge_sort
http://en.wikipedia.org/wiki/External_sorting
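To make the merge-sort adaptation concrete, here is a rough sketch of a two-phase external sort in Java: sort chunks that fit in memory, spill each sorted chunk (a "run") to a temporary file, then k-way merge the runs with a priority queue. The class name, the one-record-per-line format, and CHUNK_SIZE are illustrative assumptions, not a reference implementation.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of an external merge sort for a text file with one record per line.
// Phase 1: read chunks that fit in memory, sort each, spill it to a temp file.
// Phase 2: k-way merge all runs with a priority queue.
public class ExternalSort {
    private static final int CHUNK_SIZE = 100_000; // lines per in-memory chunk (tune to your RAM)

    public static void sort(Path input, Path output) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        mergeRuns(runs, output);
    }

    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run-", ".txt"); // temp files are not cleaned up in this sketch
        Files.write(run, chunk);
        return run;
    }

    private static void mergeRuns(List<Path> runs, Path output) throws IOException {
        // Each queue entry holds the next unconsumed line of one run.
        PriorityQueue<RunReader> pq = new PriorityQueue<>(Comparator.comparing((RunReader r) -> r.head));
        for (Path run : runs) {
            RunReader r = new RunReader(run);
            if (r.head != null) pq.add(r); else r.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!pq.isEmpty()) {
                RunReader r = pq.poll();
                out.write(r.head);
                out.newLine();
                if (r.advance()) pq.add(r); else r.close();
            }
        }
    }

    private static final class RunReader implements Closeable {
        final BufferedReader reader;
        String head;
        RunReader(Path p) throws IOException { reader = Files.newBufferedReader(p); head = reader.readLine(); }
        boolean advance() throws IOException { head = reader.readLine(); return head != null; }
        public void close() throws IOException { reader.close(); }
    }
}
```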

Clustering in High Dimensions + some basic stuff [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I've been studying Support Vector Machines (SVM) for a while, and recently started reading articles on clustering. With SVMs we did not need to worry much about the dimensionality of the data; however, I learned that in clustering, because of the "Curse of Dimensionality", dimensionality is a big issue. Furthermore, sparsity and data size greatly affect which clustering algorithm you should choose. So I understand that there is no single "best algorithm" for clustering; it all depends on the nature of the data.
Having said that, I want to ask some really basic questions on Clustering.
When people say "high dimension", what do they mean specifically? Is 100 dimensions already high, or does this depend on the type of data you have?
I've seen answers on this website saying things like "using k-means on data with hundreds of dimensions is very usual"; if this is true, does it also hold for other clustering algorithms that use the same distance metric as k-means?
On p. 649 of the survey "Survey of Clustering Algorithms" (http://goo.gl/WQyuxo) by Rui Xu et al., the table shows that CURE has "the capability of tackling high dimensional data", and I was wondering how high a dimensionality they are talking about.
If I wanted to cluster high-dimensional data of adequate size, randomly sampled from the initial big data, what kind of algorithms would be appropriate to use? I understand that density-based algorithms such as DBSCAN do not perform well under random sampling.
Can anybody tell me how well or badly CURE performs on high-dimensional data? Intuitively I guess CURE does not perform well, considering the "Curse of Dimensionality", but it would be great if there were some detailed results.
Are there any websites/papers/textbooks explaining the pros and cons of clustering algorithms? I've seen some papers on the pros and cons of basic algorithms, i.e. k-means, hierarchical clustering, DBSCAN, etc., but wanted to know more about other algorithms such as CURE, CLIQUE, CHAMELEON, etc.
Sorry for asking so many questions all at once!
It would be awesome if anybody could answer any of my questions. Also, if I have stated a question poorly or asked a completely pointless one, don't hesitate to tell me.
And if anybody knows a great textbook or survey paper on clustering that elaborates on these subjects, please tell me!
Thank you in advance.
You may be interested in this survey:
Kriegel, H. P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.
One of the authors wrote DBSCAN, so it will likely shed some light on your DBSCAN questions.
100-dimensional data can be high-dimensional data, if it isn't sparse. For NLP people, 100 dimensions is laughably little, but their data is special: it is essentially binary in nature (a word is present or not), so each dimension actually carries less than one bit of information. If you have dense 100-dimensional data, you usually are in trouble.
There are some nice figures in a related / follow up survey by the same authors:
Zimek, A., Schubert, E., & Kriegel, H. P. (2012). A survey on unsupervised outlier detection in high‐dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363-387.
They have analyzed the behavior of distance functions nicely for such data. The essence is:
High-dimensional data can be hard, or easy; it all depends on the signal-to-noise ratio. If all of your dimensions carry signal, additional dimensions can actually make your problem easier. If the additional dimensions are distracting, things can break down.
This may also explain why the "kernel trick" with SVMs works: it does not really add information content; the increased dimensionality is only virtual, not intrinsic. You get a larger search and solution space, but your data still lies on a lower-dimensional manifold within that space.
k-means results on high-dimensional data tend to become meaningless. In many cases they still work "well enough", because often quality does not really matter much and any convex partitioning will do (e.g. bag-of-words approaches for image similarity don't seem to improve substantially with "better" k-means clusterings).
CURE, which also seems to use sum-of-squares (like k-means), should suffer from the same problems: in high dimensions, all sum-of-squares values become increasingly similar (i.e. any partitioning is about as good as any other).
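To see the concentration effect behind the last two points for yourself, here is a small self-contained Java demo (not taken from the cited surveys; all names are made up): it draws random points and prints how the relative contrast between the nearest and farthest squared distance from a query shrinks as the dimensionality grows.

```java
import java.util.Random;

// Illustration of distance concentration: as dimensionality grows, the
// relative contrast (dmax - dmin) / dmin between the farthest and nearest
// point of a random query shrinks, so sum-of-squares partitionings
// become harder to tell apart.
public class DistanceConcentration {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 1000;
        for (int d : new int[] {2, 10, 100, 1000}) {
            double[] query = randomPoint(rnd, d);
            double dmin = Double.MAX_VALUE, dmax = 0;
            for (int i = 0; i < n; i++) {
                double dist = squaredDistance(query, randomPoint(rnd, d));
                dmin = Math.min(dmin, dist);
                dmax = Math.max(dmax, dist);
            }
            System.out.printf("d=%4d  relative contrast (dmax-dmin)/dmin = %.3f%n",
                    d, (dmax - dmin) / dmin);
        }
    }

    static double[] randomPoint(Random rnd, int d) {
        double[] p = new double[d];
        for (int i = 0; i < d; i++) p[i] = rnd.nextDouble();
        return p;
    }

    static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
        return s;
    }
}
```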
Yes, there are plenty of textbooks, surveys, and studies that have tried to compare clustering algorithms. But in the end there are too many factors involved: what your data looks like, how you preprocessed it, whether you have a well-chosen and appropriate distance measure, how good your implementation is, whether you have index acceleration to speed up some algorithms, and so on. There is no rule of thumb; you will have to try things out.

Sample data for Hadoop [duplicate]

This question already has answers here:
Download large data for Hadoop [closed]
(7 answers)
Closed 9 years ago.
For educational purposes I am looking for a large data set. Data from social networks could be interesting but is difficult to obtain. Data from scientific experiments could require writing very complicated algorithms to get interesting results. Does anyone have an idea how or where I can generate or find a large, interesting data set?
Here are some public data sets I have gathered over time
http://wiki.gephi.org/index.php/Datasets
Download large data for Hadoop
http://datamob.org/datasets
http://konect.uni-koblenz.de/
http://snap.stanford.edu/data/
http://archive.ics.uci.edu/ml/
https://bitly.com/bundles/hmason/1
http://www.inside-r.org/howto/finding-data-internet
http://goo.gl/Jecp6
http://ftp3.ncdc.noaa.gov/pub/data/noaa/1990/
http://data.cityofsantacruz.com/
Amazon also has a list of some huge public datasets you may try out:
http://aws.amazon.com/publicdatasets/

Optimal vector data structure? [duplicate]

This question already has answers here:
A data structure supporting O(1) random access and worst-case O(1) append?
Closed 10 years ago.
I saw an answer a while ago on StackOverflow regarding a provably optimal vector ("array list") data structure, which, if I remember correctly, lazily copied elements onto a larger vector so that it wouldn't cause a huge pause every time the vector reallocated.
I remember it needed O(sqrt(n)) extra space for bookkeeping, and that the answer linked to a published paper, but that's about it... I'm having a really hard time searching for it (you can imagine that searches like optimal vector are getting me nowhere).
Where can I find the paper?
I think the paper you are referring to is "Resizable Arrays in Optimal Time and Space" by Brodnik et al. Their data structure uses the lazy-copying dynamic array you mentioned in your question as a building block. There is an older question on Stack Overflow describing the lazy-copying structure, which might help you get a better feel for how it works.
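For intuition, here is a minimal sketch of that lazy-copying building block (append and get only): when the backing array fills, elements are migrated two at a time into a twice-as-large array on subsequent appends, so no single append ever copies everything. This is an illustration under simplifying assumptions, not the full O(sqrt(n))-extra-space structure from the Brodnik et al. paper.

```java
// Sketch of a dynamic array with worst-case O(1) append: instead of copying
// everything at once when the backing array fills up, elements are copied
// lazily, two per append, into a twice-as-large array.
// Simplified illustration (append + get only), not the Brodnik et al. structure.
public class LazyCopyVector<T> {
    private Object[] oldArr;   // array still holding not-yet-migrated elements
    private Object[] newArr;   // current target array (twice the old capacity)
    private int size;          // number of elements stored
    private int copied;        // how many old elements have been migrated so far

    public LazyCopyVector() {
        oldArr = new Object[0];
        newArr = new Object[1];
    }

    public void append(T value) {
        if (size == newArr.length) {
            // newArr is full and (by construction) fully migrated: start a new round.
            oldArr = newArr;
            newArr = new Object[2 * oldArr.length];
            copied = 0;
        }
        newArr[size++] = value;
        // Migrate up to two old elements per append; this always finishes
        // before newArr can fill up again.
        for (int i = 0; i < 2 && copied < oldArr.length; i++) {
            newArr[copied] = oldArr[copied];
            copied++;
        }
    }

    @SuppressWarnings("unchecked")
    public T get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException();
        // Migrated and freshly appended elements live in newArr;
        // the rest have not been moved yet and are still in oldArr.
        if (index < copied || index >= oldArr.length) return (T) newArr[index];
        return (T) oldArr[index];
    }

    public int size() { return size; }
}
```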
Hope this helps!

Advantages of BTree+ over BTree [duplicate]

This question already has answers here:
B-trees, B+ trees difference
Closed 13 years ago.
What are the advantages and disadvantages of a B+ tree over a B-tree? When should I prefer one over the other? I'm also interested in any real-world examples where one has been preferred over the other.
According to the Wikipedia article on B+ trees, this kind of data structure is frequently used for indexing block-oriented storage. In a B+ tree, only keys (and not values) are stored in the intermediate nodes. This means that you need fewer intermediate node blocks, which increases the likelihood of a cache hit.
Real-world examples include various file systems; see the linked article.
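As a rough illustration of that difference, here is a sketch of the node layouts only (no insert/split logic; field names and the fixed order are made up for the example): B-tree nodes carry values everywhere, while B+ tree internal nodes carry only routing keys, and the leaves hold the key/value pairs and are chained for range scans.

```java
// Illustrative node layouts only: the point is that B+ tree internal nodes
// carry keys and child pointers but no values, so more keys fit per disk
// block, the tree is shallower, and leaves can be chained for range scans.
public class BPlusTreeLayout {
    static final int ORDER = 64; // max children per internal node (block-size dependent, illustrative)

    // B-tree node: every node stores values alongside keys,
    // so fewer keys fit into one block.
    static class BTreeNode<K, V> {
        K[] keys;
        V[] values;              // values live in every node
        BTreeNode<K, V>[] children;
    }

    // B+ tree internal node: keys act only as routing separators.
    static class InternalNode<K, V> implements Node<K, V> {
        K[] separators;          // no values here -> more separators per block
        Node<K, V>[] children;   // child may be another internal node or a leaf
    }

    // B+ tree leaf node: all key/value pairs live here.
    static class LeafNode<K, V> implements Node<K, V> {
        K[] keys;
        V[] values;
        LeafNode<K, V> next;     // leaves are linked for sequential range scans
    }

    interface Node<K, V> {}
}
```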

Resources