Are there any useful parallel data mining applications? - parallel-processing

I am working on a parallel data mining software development project.
I want to find some kind of a data mining software, which can run in a parallel environment with a Weka-like GUI.
Are there any useful parallel data mining applications?

What did you try? There are tons of data mining applications around. Isn't rapidminer parallel and uses Weka?
Try these applications:
http://www.kdnuggets.com/software/index.html

Related

How to Apply customize analytical model in OBIEE12c

I would like to integrate machine learning model developed by python with obiee 12c dashboard.
You should use In-Database Machine Learning then present it in the obiee.
Oracle Machine Learning for Python (OML4Py) is a Python API for performing statistical and machine learning analysis on data in your Oracle Autonomous Database.
With Embedded Python Execution, you can run user-defined Python functions in Python engines spawned by the Autonomous Database environment.
Automated Machine Learning (AutoML) enhances user productivity and machine learning results through automated algorithm and feature selection, as well as model selection and tuning.
For more information visit this link.

Qlikview is ETL testing tool or development tool?

As I am a manual tester I wanted to know Qlikview is ETL testing tool or development tool?
And is there any advantage of this tool in Automation of ETL testing?
Because QlikView offers both scripting and front end tools it could be, and is, used for both.
The scripting engine can do any ETL you might require and via the front end visualisation engine you could build tests for any cases you want to test for. The nature of QliView would also allows you easy access, and filtering of any individual data lines causing problems.
I've used it this way for large data migration and management projects.
There are no doubt more specialised tools, i.e. there are no inbuilt tools for common scenario testing, you have to build it all yourself and export via excel or csv, but it can be done.

ETL Tool Which is most configurable

I am looking for best suited ETL Tool for the following criteria.
Supports MongoDB
Accepts Metadata as input (Or accepts file and builds its metadata on the fly)
provides configurable Mapping. (mapping can be defined from outside development, using some file ot table)
Please suggest the tool which caters to the above needs.
Hmm, your questing is looking most configurable ETL tool. From past years of experience in ETL process, I can inform you that you will never find such tool that meets all your demands. Especially when you have Enterprise level data warehouse (needed because of high and complex reporting needs), the only one software solution is to build your own custom project based ETL software, which is often ungrateful.
But (big BUT), you can achieve at least 80% of needs with existing tools. Plugins, smart usage of scripts, good data-flow design and (if needed) small custom software in pair with scheduling could help you out to fulfill imagined process. ETL process doesn't seem to be different in compare to any other work - 80% of the work is done in 20% of time, and the rest of work (20%) is done in 80% of time.
My suggestion for you:
Pentaho Data Integration - free and open source
PDI is powerfull ETL tool, and surley can meet your demands. There is a plenty of plugins, solid level community and fine API if you're going to develop more plugins.
Pentaho Data Integration + Integration Server - Enterprise Edition - "cheap enough" for almost every medium size projects
Enterprise edition has everything like free edition, including more plugins (JMS producer for example), version control system, instaview's and ect.
Beside, it has it own Server so scheduling is software based (not OS based), logging, better management and most important thing - support!
Informatica or Microsoft SSIS - expensive and brilliant
I would not wasting words for this tools. Informatica is primary ETL oriented company that using Informatica on high level require deep understanding of DB/DWH design, ETL process, PL/SQL, dimensional modeling ect.
SSIS is primary constructed for SQL Server, so I don't see high usage needs if at least one of your source db or target db (DWH) is not running on SQL Server.
Conclusion
This is just a scratch of plenty tools that market provide to us. Someone else would probably not even mention these tools. Please look one of the lists.
Almost each BI system has it own ETL tool. Maybe the good choice would be to use it together, in that way you will be in possibility to use maximum from both.
Note: Good ETL project manager, or ETL developer can extend tool advantages to level that better/more expensive have!

Is Pentaho ETL and Data Analyzer good choice?

I was looking for ETL tool and on google found lot about Pentaho Kettle.
I also need a Data Analyzer to run on Star Schema so that business user can play around and generate any kind of report or matrix. Again PentaHo Analyzer is looking good.
Other part of the application will be developed in java and the application should be database agnostic.
Is Pentaho good enough or there are other tools I should check.
Pentaho seems to be pretty solid, offering the whole suite of BI tools, with improved integration reportedly on the way. But...the chances are that companies wanting to go the open source route for their BI solution are also most likely to end up using open source database technology...and in that sense "database agnostic" can easily be a double-edged sword. For instance, you can develop a cube in Microsoft's Analysis Services in the comfortable knowledge that whatver MDX/XMLA your cube sends to the database will be intrepeted consistently, holding very little in the way of nasty surprises.
Compare that to the Pentaho stack, which will typically end interacting with Postgresql or Mysql. I can't vouch for how Postgresql performs in the OLAP realm, but I do know from experience that Mysql - for all its undoubted strengths - has "issues" with the types of SQL that typically crops up all over the place in an OLAP solution (you can't get far in a cube without using GROUP BY or COUNT DISTINCT). So part of what you save in licence costs will almost certainly be used to solve issues arising from the fact the Pentaho doesn't always know which database it is talking to - robbing Peter to (at least partially) pay Paul, so to speak.
Unfortunately, more info is needed. For example:
will you need to exchange data with well-known apps (Oracle Financials, Remedy, etc)? If so, you can save a ton of time & money with an ETL solution that has support for that interface already built-in.
what database products (and versions) and file types do you need to talk to?
do you need to support querying of web-services?
do you need near real-time trickling of data?
do you need rule-level auditing & counts for accounting for every single row
do you need delta processing?
what kinds of machines do you need this to run on? linux? windows? mainframe?
what kind of version control, testing and build processes will this tool have to comply with?
what kind of performance & scalability do you need?
do you mind if the database ends up driving the transformations?
do you need this to run in userspace?
do you need to run parts of it on various networks disconnected from the rest? (not uncommon for extract processes)
how many interfaces and of what complexity do you need to support?
You can spend a lot of time deploying and learning an ETL tool - only to discover that it really doesn't meet your needs very well. You're best off taking a couple of hours to figure that out first.
I've used Talend before with some success. You create your translation by chaining operations together in a graphical designer. There were definitely some WTF's and it was difficult to deal with multi-line records, but it worked well otherwise.
Talend also generates Java and you can access the ETL processes remotely. The tool is also free, although they provide enterprise training and support.
There are lots of choices. Look at BIRT, Talend and Pentaho, if you want free tools. If you want much more robustness, look at Tableau and BIRT Analytics.

How can i connect two or more machines via tcp cable to form a network grid?

How can i connect two or more machines to form a network grid and how can i distribute work load to the two machines?
What operating systems do i need to run on the machines, and what application should i use to manage the load balancing?
NB: I read somewhere that google uses cheap machines to perform this fete, how do they connect two network cards( 'Teaming' ) and distribute load across the machines?
Good practical examples would serve me good, with actual code samples.
Pointers to some good site i might read this stuff will be highly appreciated.
An excellent place to start is with the Beowulf project. Basically an opensource cluster built on the Linux OS.
There are several software solutions in this expanding market. The term "cloud computing" is certainly gaining traction to describe what you want to do. Are you wanting a service, or do you want to run it in house?
I'm most familiar with Appistry EAF - Runs on commodity based hardware. Its available as a free download. Runs on windows or linux.
Another is GoGrid - I believe this is only available as a service, but I'm not as familiar with it.
There are many different approaches to parallel processing, and many types of system architectures you could use.
For commodity systems, there are clusters and grids, or you can even form a single system image from several pieces of commodity hardware. There is of course also load balancing, high availability, failover, etc.
It's pretty much impossible to answer this question without more detail. What exactly do you want to with these systems? The answer is very highly dependent on the application.
You might want to have a look at some of the stuff to do with Folding At Home and the SETI project, and some of their participants blogs, here is a pretty amazing cluster that a guy built in an IKEA cabinet:
http://helmer.sfe.se/
Might give you some ideas.
The question is too abstract.
One of the (imaginable) ways is to use MPI - a framework for parallel programming, the Wikipedia page includes examples in C++, and there are bindings for other languages.

Resources