How to link Weka with Hadoop?

I have a dataset on which I need to run PCA (Principal Component Analysis, a dimensionality reduction technique), which is easy to do in Weka.
Since the dataset is large, however, Weka runs into memory issues. I believe these could be resolved by linking Weka with Hadoop, so that the algorithm runs on a server. Could anyone help me with this? How can I connect Weka with Hadoop to deal with a larger dataset?
Thank you.

Weka 3.7 has new packages for distributed processing in Hadoop. One of the jobs provided by these packages will compute a correlation (or covariance) matrix in Hadoop. The user can optionally have the job use the correlation matrix as input to a PCA analysis (this part runs outside of Hadoop) and produce a "trained" Weka PCA filter. This scales Weka's PCA analysis in the number of instances (but not in the number of original features since the PCA computation still happens locally on the client machine).
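The two-stage approach described above (correlation matrix computed in Hadoop, PCA derived from it locally) can be illustrated outside Weka. The following is a minimal NumPy sketch of the idea, not the distributedWekaHadoop implementation itself; here the correlation matrix is simply computed locally where the Hadoop job would otherwise produce it:

```python
import numpy as np

# Stand-in for the distributed step: in distributedWekaHadoop the
# correlation matrix is computed in Hadoop; here we compute it locally.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # 1000 instances, 5 features
corr = np.corrcoef(X, rowvar=False)     # 5x5: size depends on features, not instances

# Local step: eigendecomposition of the (small) correlation matrix yields
# the principal components -- this is why the approach scales in instances
# but not in the number of original features.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]       # sort components by explained variance
components = eigvecs[:, order]

# Project the standardized data onto the top 2 components.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ components[:, :2]
print(scores.shape)                     # (1000, 2)
```

Whatever the number of instances, the client machine only ever has to eigendecompose a features-by-features matrix.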
For more info on the Hadoop packages see:
http://markahall.blogspot.co.nz/2013/10/weka-and-hadoop-part-1.html
The distributedWekaHadoop package can be installed via the package manager in Weka 3.7.
Cheers,
Mark.

Depending on the algorithm, it may be quite complex to rewrite it to use Hadoop.
You can use Apache Mahout instead. It does have support for PCA.

Related

Data analytics using python or hadoop?

Which technology is more efficient for analyzing data, Hadoop or Python? And which of the two is faster?
Hadoop clusters today mostly run Spark. If the underlying framework you are using to analyze or crunch your data includes Spark, you are good to go with Scala, PySpark, or even R. Using plain Python on its own won't give you the benefits of Spark, which makes data analysis faster and supports the various transformations needed on big data. So whichever language you use, the point is to use Spark.
Scala or PySpark: both expose almost all of Spark's features.
When analyzing data with speed as a criterion, two key factors determine it: how much data you have and where the data is located.
If you have big data, consider using Hadoop or Spark to analyze it. This will be much faster, and you will not be limited by load time. If you have only a few gigabytes of data, plain Python may be the better choice, though it can still slow down your machine.
As for where the data lives: if your data is on premise, Python is a sound approach. If it is in the cloud, then Azure, GCP, and AWS all offer big data tools that make exploration easier.
So in terms of speed, it really depends on those two constraints. If you have big data located in a cloud system, consider using Hadoop to analyze it. If you have only a few gigabytes of data on premise, use Python.

Apache Hadoop vs Google Bigdata

Can anyone explain to me the key difference between Apache Hadoop and Google's big data offering?
Which one is better, Hadoop or Google's big data stack?
The simple answer would be: it depends on what you want to do with your data.
Hadoop is used for massive storage of data and batch processing of that data. It is very mature and popular, and a lot of libraries support it. But if you want to do real-time analysis or queries on your data, Hadoop is not suitable.
Google's BigQuery was developed specifically to solve this issue: you can do real-time processing on your data with it.
You can use BigQuery in place of Hadoop, or you can use BigQuery alongside Hadoop to query datasets produced by MapReduce jobs.
So it entirely depends on how you want to process your data. If a batch processing model is required and sufficient, use Hadoop; if you need real-time processing, choose Google's BigQuery.
Edit: You can also explore other technologies that work with Hadoop, like Spark, Storm, Hive, etc., and choose depending on your use case.
Some useful links for more exploration:
1: gavinbadcock's blog
2: cloudacademy's blog

Hadoop - CPU intensive application - Small data

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed at processing so-called Big Data, and I wonder how it performs with a small amount of data (but a CPU-intensive workload).
I would mainly like to know whether a better approach exists for this scenario, or whether I should stick with Hadoop.
Hadoop is a distributed computing framework built around a MapReduce engine. If you can express your parallelizable, CPU-intensive application in this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classic example of a Hadoop computation is the calculation of pi, which needs no input data at all. As you'll see here, Yahoo managed to determine the two-quadrillionth digit of pi thanks to Hadoop.
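The pi example shows the shape such jobs take: CPU-heavy, independent map tasks with almost no input, combined by a single reduce at the end. Here is a hedged single-machine sketch of that map/reduce shape in Python, using plain Monte Carlo estimation (Hadoop's actual example job uses a quasi-random scheme, but the structure is the same):

```python
import random
from functools import reduce

def map_task(seed, samples):
    """One 'mapper': count random points that fall inside the unit quarter-circle."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return inside, samples

def reduce_task(a, b):
    """The 'reducer': combine partial (inside, total) counts."""
    return a[0] + b[0], a[1] + b[1]

# In Hadoop, each map_task would run on a different node; here they run in sequence.
parts = [map_task(seed, 100_000) for seed in range(8)]
inside, total = reduce(reduce_task, parts)
pi_estimate = 4.0 * inside / total
print(round(pi_estimate, 2))
```

Because the map tasks share no state, distributing them across a cluster requires no change to the per-task logic, only to how tasks are launched and their outputs collected.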
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for that purpose. For instance, it provides a file system designed to hold huge files. These huge files are chunked into many blocks spread across a large number of nodes, and to ensure data integrity each block is replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommend anything specific since I have no idea what exactly you need, I think you can find more lightweight frameworks than Hadoop.
Well, a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs, possibly on many nodes. For this you should use a scalable language designed for exactly this problem: in other words, Scala. Using Scala with Spark is much easier and much faster than Hadoop.
If you don't have access to a cluster, it can still be a good idea to use Spark so that you can scale up more easily in the future. Or just use .par in Scala, which will parallelize your code and use all the CPUs on your local machine.
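The same "use every local core first" idea exists outside Scala too. Here is a minimal Python sketch using only the standard library; ThreadPoolExecutor is shown for simplicity, though for genuinely CPU-bound work in CPython you would typically swap in ProcessPoolExecutor to sidestep the GIL:

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    # Stand-in for a CPU-intensive per-chunk computation.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(n, workers=4):
    # Split 0..n-1 into interleaved chunks, one per worker, then combine
    # the partial results -- the same split/compute/merge shape as .par.
    chunks = [range(i, n, workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))

print(parallel_sum_of_squares(1_000))
```

The point is the pattern, not the library: partition the work, compute partitions independently, merge, exactly what Spark then scales out across machines.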
Finally, Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we face for data normalization: a need for parallel processing on cheap hardware and software, with ease of use, instead of going through all the special programming of traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations; indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was to handle the tremendous amount of data and concurrent processing needed for search, with the "Big Data" analytics movement added on afterwards to find a more general-purpose business use case. Thus, Hadoop as commonly described is not targeted at the use case you and we have. But Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a specially built hardware environment, PSIKLOPS, that is powerful for small clusters (1-10 nodes), with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early August 2014. We see this capability as a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize the ease of launching multiple concurrent containers of custom Java.

Mahout: Visualizing clusters through the command line

I am doing some experiments with clustering, but now I want to visualize the data. As in https://cwiki.apache.org/confluence/display/MAHOUT/Visualizing+Sample+Clusters , is there a way to run the classes with arguments that accept custom cluster data? What is the best way to see cluster data?
The command I am using is: mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
Thank you
PS: I am using Mahout 0.9
Any realistic data set that can be visualized in 2 dimensions (and I don't think these classes can do much more than that) will easily fit into main memory. And if I'm not mistaken, these classes load all the data into memory anyway, because they are only for demonstration.
In that case you may as well use any non-Hadoop tool such as ELKI, WEKA, or SciPy. Mahout really only pays off when you have more data than fits into main memory; otherwise it will be a lot slower than a good single-host solution.
See e.g. this G+ post:
If your data is small enough to fit in main memory, don't run Hadoop.
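To make the small-data point concrete, here is a hedged single-host sketch of Lloyd's k-means in plain NumPy; in practice you would reach for one of the tools mentioned above (e.g. SciPy's `scipy.cluster.vq.kmeans2`), but the sketch shows that in-memory clustering of 2-D data needs no cluster at all:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: fine whenever all points fit in memory."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its points.
        centers = np.array([points[labels == c].mean(axis=0)
                            if (labels == c).any() else centers[c]
                            for c in range(k)])
    return labels, centers

# Two well-separated 2-D blobs as toy input.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
labels, centers = kmeans(pts, k=2)
print(len(pts), len(set(labels.tolist())))
```

A few hundred thousand 2-D points would still run in seconds on one machine, well below the threshold where Hadoop's job overhead pays for itself.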

Mahout recommendation engine: going distributed

Does anybody know how I could transform the code from the Mahout in Action book, regarding the recommendation engines, so that it works in a fully distributed Hadoop environment? My main difficulty is transforming my code (which currently reads from and writes to a local disk) so that it runs in a pseudo-distributed environment (such as Cloudera's). Is the solution to my problem as simple as this one, or should I expect something more complex?
A truly distributed computation is quite different than a non-distributed computation, even when computing the same result. The structure is not the same, and the infrastructure it uses is not the same.
If you are just asking how the pseudo-distributed solution works regarding local files: you would ignore the Hadoop input/output mechanism and write a Mapper that reads your input from somewhere on HDFS and copies to local disk.
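For the pseudo-distributed route, the "read from HDFS, copy to local disk" step could look like the following Hadoop Streaming-style sketch in Python; the function name, file layout, and record format here are illustrative assumptions, not code from the book or from Mahout:

```python
import os
import sys
import tempfile

def copy_to_local(lines, local_dir):
    """Identity 'mapper' sketch: Hadoop Streaming delivers input records on
    stdin; we write each record to the node's local disk and also echo it
    back, so downstream non-distributed code can read plain local files."""
    path = os.path.join(local_dir, "part-local")
    echoed = []
    with open(path, "w") as f:
        for line in lines:
            record = line.rstrip("\n")
            f.write(record + "\n")
            echoed.append(record)
    return path, echoed

if __name__ == "__main__":
    # In a real streaming job this would be copy_to_local(sys.stdin, scratch_dir).
    with tempfile.TemporaryDirectory() as scratch:
        path, records = copy_to_local(["user1,item9,4.5", "user2,item3,3.0"], scratch)
        print(len(records))
```

This is exactly the hack described above: each mapper materializes its share of the input locally, then runs the otherwise non-distributed code against those local files.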
If you are asking how you actually distribute the computation, then you would have to switch to the (completely different) distributed implementations in the project. These actually use Hadoop to split up the computation. The process above is a hack that just runs many non-distributed tasks within a Hadoop container. These implementations are, however, completely offline.
If you mean that you want a real-time recommender like in the Mahout .cf.taste packages, but also want to actually use Hadoop's distributed computing power, then you need more than Mahout. It's either one or the other in Mahout; there is code that does one or the other but they are not related.
This is exactly what Myrrix is, by the way. I don't mind advertising it here since it sounds like exactly what you may be looking for. It's an evolution of the work I began in this Mahout code. Among other things, it's a 2-tier architecture that has the real-time elements of Taste but can also transparently offload the computation to a Hadoop cluster.
