Need your advice in choosing a distributed file system.
So, I need a distributed file system for storing many backups (regular files, SQL dumps, etc.).
Ideally it would be:
distributed
actively maintained (at least not dead)
quick failover (for geographically distributed nodes)
large community
Open Source
So far I have two candidates: XtreemFS and GlusterFS. The first seems cool, but it doesn't have a large community and development is generally slow (it's also Java-based).
GlusterFS has Red Hat backing and other nice things, but there are some negative reviews.
Need help with this :)
Have you considered Ceph?
It's maintained by Red Hat.
The source code is on GitHub, so it's open source.
For quick info, check their FAQs.
For WAN use specifically, see their documentation on multiple datacentres.
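If you do go with Ceph, note that besides the POSIX CephFS layer you can also talk to the cluster as an object store, which is often enough for dump-style backups. Here is a minimal sketch, assuming a running cluster and using the librados Python bindings; the pool name, file name and config path are placeholders for illustration, not something from your setup.

```python
# Minimal sketch, assuming a Ceph cluster is already up: store one backup file
# as an object in a (hypothetical) pool called "backups" via the librados
# Python bindings. Config path, pool and object names are placeholders.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('backups')        # pool name is an assumption
    try:
        with open('db_dump.sql', 'rb') as f:
            ioctx.write_full('db_dump.sql', f.read())
        size, mtime = ioctx.stat('db_dump.sql')  # confirm the object landed
        print('stored', size, 'bytes')
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```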
I am looking for the best-suited ETL tool for the following criteria.
Supports MongoDB
Accepts metadata as input (or accepts a file and builds its metadata on the fly)
Provides configurable mapping (mapping can be defined outside of development, using some file or table)
Please suggest a tool that caters to the above needs.
Hmm, your question is really asking for the most configurable ETL tool. From years of experience with ETL processes, I can tell you that you will never find a tool that meets all of your demands. Especially when you have an enterprise-level data warehouse (needed because of heavy and complex reporting needs), often the only solution is to build your own custom, project-based ETL software, which is frequently a thankless job.
But (big BUT), you can achieve at least 80% of your needs with existing tools. Plugins, smart use of scripts, good data-flow design and (if needed) a small piece of custom software paired with scheduling can help you realize the process you have in mind. ETL work is no different from any other work: 80% of it is done in 20% of the time, and the remaining 20% takes 80% of the time.
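To illustrate the "small custom software" route, here is a hedged sketch of a mapping-driven extract step against MongoDB: the field mapping lives in an external JSON file, so it can be changed without touching the code. The database, collection and file names are invented for the example.

```python
# Hypothetical sketch: a tiny mapping-driven extract/transform step.
# The mapping is kept outside the code (mapping.json), e.g.:
#   {"customer_name": "name", "customer_city": "address.city"}
# meaning: target column -> source field (dots for nested documents).
import json
from pymongo import MongoClient

def resolve(doc, dotted_path):
    """Walk a nested document using a dotted path like 'address.city'."""
    value = doc
    for part in dotted_path.split('.'):
        value = value.get(part) if isinstance(value, dict) else None
    return value

with open('mapping.json') as f:
    mapping = json.load(f)

client = MongoClient('mongodb://localhost:27017')
source = client['crm']['customers']          # assumed db/collection names

rows = []
for doc in source.find():
    rows.append({target: resolve(doc, source_field)
                 for target, source_field in mapping.items()})

# 'rows' can now be handed to whatever loader feeds your warehouse.
print(rows[:3])
```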
My suggestion for you:
Pentaho Data Integration - free and open source
PDI is a powerful ETL tool and can surely meet your demands. There are plenty of plugins, a solid community and a fine API if you're going to develop more plugins.
Pentaho Data Integration + Integration Server - Enterprise Edition - "cheap enough" for almost every medium-sized project
The Enterprise Edition has everything the free edition has, plus more plugins (a JMS producer, for example), a version control system, Instaview and so on.
Besides that, it has its own server, so scheduling is software-based (not OS-based), plus logging, better management and, most importantly, support!
Informatica or Microsoft SSIS - expensive and brilliant
I won't waste many words on these tools. Informatica is a primarily ETL-oriented company, and using Informatica at a high level requires a deep understanding of DB/DWH design, ETL processes, PL/SQL, dimensional modeling, etc.
SSIS is built primarily for SQL Server, so I don't see much point in it unless at least one of your source or target databases (DWH) is running on SQL Server.
Conclusion
This barely scratches the surface of the many tools the market provides. Someone else would probably not even mention these tools, so please look at one of the comparison lists as well.
Almost every BI system has its own ETL tool. A good choice might be to use them together, so that you can get the most out of both.
Note: a good ETL project manager or ETL developer can stretch a tool's advantages to the level of a better/more expensive one!
I am working with massive data; my input data is about 100 GB. I want to choose one of the Hadoop distributions, but I don't know whether to choose a MapR cluster or a Cloudera cluster. I want to use the free versions (MapR M3 and Cloudera CDH4, which uses Hadoop 0.20).
Which of them is better? Which configurations should I use so that they work best?
Thanks.
Frankly speaking, the answer to this question is the most common answer in the world: it depends. It's totally up to you and your requirements. One person might find a particular flavor more suitable for their needs, while you might find the same flavor less useful. It's also largely a matter of personal preference; I personally like Apache's Hadoop, for instance. All of them are good. It's just a question of which one fits your needs.
"Which of them is better?" is a controversial topic. Questions like this often end up as heated arguments; see this question for example. So I'm not going to list the advantages of one over the other. But there are certain differences among these flavors of Hadoop that could help you during your thought process.
The major difference between CDH (and Apache Hadoop as well) and MapR is that MapR uses its own proprietary file system, MapRFS, instead of HDFS. The M3 Edition is free and available for unlimited production use; support is provided on a community basis and through MapR's forums. CDH is 100% open source, and you can use the "Standard" version of Cloudera Manager without any charges. And Apache, well, it's Apache :). Do whatever you feel like.
MapR has even partnered recently with Canonical, the organization behind the Ubuntu operating system, in an effort to make Hadoop available as an integrated part of Ubuntu through its repositories. The partnership announced that MapR's M3 Edition for Apache Hadoop will be packaged and made available for download as an integrated part of the Ubuntu operating system (see this if you need more info). The source code is available on GitHub. CDH's codebase is the same as Apache's, with some patches of their own.
But the free edition lacks some good features like JobTracker HA, NameNode HA, mirroring, snapshots, etc. CDH4, being based on Hadoop 2.x, does provide the HA features, though. By virtue of its design, MapR doesn't have any SPOF the way CDH3 (or Hadoop 1.x) does. MapRFS stores data in volumes, conceptually a set of containers distributed across a cluster. Each container includes its own metadata, eliminating the central NameNode as a single point of failure. Still, the API is Apache Hadoop compatible (a short illustration follows below). MapR's setup requirements differ from Apache/CDH; for example, MapR requires raw volumes to be available for installation. Once you have the correct hardware and OS prerequisites, setup and evaluation times should be on the same order of magnitude as Apache/CDH.
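One practical consequence of that API compatibility: the same `hadoop fs` shell commands (and the same MapReduce client code) work whether MapRFS or HDFS sits underneath, so switching distributions shouldn't require touching your job scripts. A hedged sketch driving the standard shell from Python; the paths are made up:

```python
# Sketch: the same filesystem shell works on CDH/Apache (HDFS) and MapR (MapRFS),
# so a script like this doesn't care which distribution is underneath.
import subprocess

def put(local_path, remote_path):
    # 'hadoop fs -put' is part of the standard Hadoop shell on all distributions.
    subprocess.check_call(['hadoop', 'fs', '-put', local_path, remote_path])

def ls(remote_path):
    return subprocess.check_output(['hadoop', 'fs', '-ls', remote_path]).decode()

if __name__ == '__main__':
    put('input_part_000.txt', '/user/me/input/')   # example paths
    print(ls('/user/me/input/'))
```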
IMHO, M3 is not going to give you huge advantages over Apache/CDH, since some of the catchy MapR features, like NFS HA, snapshots, etc., are not present in the free M3 edition.
Being the first mover, Cloudera definitely has an edge in terms of experience and a solid customer base. But MapR has been more innovative in terms of significant changes to the MapReduce and HDFS components to improve performance.
I'll write some more after a while, as I'm on a call and you are waiting for the answer ;)
I work in a research group doing a lot of Machine Learning and Computational Biology.
We currently have a cluster, but it is poorly maintained, suffers from low I/O throughput, and most critically doesn't have any setup for scheduling or load-balancing. Therefore, to use it, you have to find a free node yourself, ssh into that node, run your script on the command line, and manually collect your results.
What is the best software stack to implement an easy to use scheduler and load-balancer, such that users can submit their job to a central queue, have it run automatically when resources are available, and easily get their results back?
There are a number of scheduler/resource manager options that are open source and well thought of:
Torque/Maui, descendants of the venerable PBS, now maintained by Adaptive Computing
Slurm, a newer project out of LLNL, which has the advantage that it scales very well (see the submission sketch at the end of this answer)
Open Grid Engine, née Sun Grid Engine
But there are also a number of entire software stacks that aim to make managing a cluster easier:
Warewulf, out of LBL
Rocks
I'm making this a community wiki for others who have suggestions.
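To give a feel for what "submit to a central queue" looks like once something like Slurm is in place, here is a hedged sketch that writes a batch script and submits it with `sbatch`. The partition-free resource numbers, script name and data path are invented and would depend on your cluster.

```python
# Sketch: submitting a job to a Slurm queue instead of ssh-ing to a node by hand.
# Resource requests and script contents are placeholders.
import subprocess
import tempfile

batch_script = """#!/bin/bash
#SBATCH --job-name=ml_experiment
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=result_%j.log

python train_model.py --data /shared/datasets/expression.csv
"""

with tempfile.NamedTemporaryFile('w', suffix='.sh', delete=False) as f:
    f.write(batch_script)
    script_path = f.name

# sbatch prints something like "Submitted batch job 12345"; Slurm then runs the
# job whenever resources free up and writes stdout to result_<jobid>.log.
print(subprocess.check_output(['sbatch', script_path]).decode().strip())
```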
I want to test the performance of a filesystem under different conditions.
Specifically, I want to test the performance of Windows virtual machines with and without compression, both on a "normal hard disk" and on a USB disk, as it would be interesting to see exactly what the difference is.
What I need is a program that can test different aspects of the filesystem (random access, sequential read/write, etc.) and make pretty graphs that go well with my blog. Preferably the application should be automatable so I can add it to startup; this way the timing is the same for each run and I can repeat the runs for verification.
I can post a link to the results here when I get around to testing it. Right now it's just in the planning phase.
Iometer is the I/O measurement tool. And it's free. From the website:
Iometer is an I/O subsystem measurement and characterization tool for single and clustered systems. It was originally developed by the Intel Corporation and announced at the Intel Developers Forum (IDF) on February 17, 1998 - since then it has become widespread within the industry.
Meanwhile Intel has discontinued work on Iometer and it was given to the Open Source Development Lab (OSDL). In November 2001, a project was registered at SourceForge.net and an initial drop was provided. Since the relaunch in February 2003, the project has been driven by an international group of individuals who are continuously improving, porting and extending the product.
The tool (the Iometer and Dynamo executables) is distributed under the terms of the Intel Open Source License. The iomtr_kstat kernel module, as well as other future independent components, are distributed under the terms of the GNU Public License.
You said you'd like pretty graphs for your blog. In my use of IOMeter, I've never seen it produce a graph. However, it is possible that I overlooked an existing feature.
Alternatively, (from the look of its website) iozone might give you graphs:
http://www.iozone.org/
Yet, it could be that iozone only collected the data used to create those graphs shown on its web site.
Regardless, this is still another option for I/O Benchmarking.
Additional server oriented disk benchmarks:
Diskspd
fio (see the sketch after this list)
vdbench
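Since you want graphs for the blog: fio can emit JSON, which makes it easy to script repeated runs and plot the numbers yourself. A hedged sketch follows; the job parameters are arbitrary examples, and the exact JSON field layout can vary between fio versions.

```python
# Sketch: run a few fio jobs, pull bandwidth out of the JSON output and plot it
# with matplotlib. Block sizes, sizes and runtimes are arbitrary examples.
import json
import subprocess
import matplotlib.pyplot as plt

def run_fio(name, rw, bs):
    out = subprocess.check_output([
        'fio', '--name=' + name, '--rw=' + rw, '--bs=' + bs,
        '--size=256M', '--runtime=30', '--time_based',
        '--output-format=json',
    ])
    job = json.loads(out)['jobs'][0]
    # Read-type workloads report under 'read', write-type under 'write';
    # field names may differ slightly between fio versions.
    return job['read']['bw'] if rw.endswith('read') else job['write']['bw']

tests = [('seq-read', 'read', '1M'), ('rand-read', 'randread', '4k'),
         ('seq-write', 'write', '1M'), ('rand-write', 'randwrite', '4k')]
results = {name: run_fio(name, rw, bs) for name, rw, bs in tests}

plt.bar(range(len(results)), list(results.values()))
plt.xticks(range(len(results)), list(results.keys()), rotation=30)
plt.ylabel('Bandwidth (KiB/s)')
plt.savefig('fio_results.png')
```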
I've been thinking about getting a little bit greener with my computers and using some low-power mini-ITX boards in my next computer. Some draw under 10 watts and are pretty inexpensive.
So I thought, if one is such low cost and low power, why not try to make a cluster out of them? However, I'm not really sure what I would need in terms of operating system or management software to make this happen.
Can anyone provide advice on existing software to do this or any ideas as to how to design my own?
What you actually want to do with your cluster largely decides what software you will need.
Do you need job scheduling?
Monitoring tools?
Do you need to deploy software across all nodes at once seamlessly?
One file system across all nodes (recommended).
You could just as easily install Linux or a *BSD on the boards and just use ssh to manage and run jobs across all the nodes; no other software is really required.
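For the plain-ssh route, even a few lines of scripting go a long way. A minimal sketch that runs the same command on every board and collects the output; the host names are placeholders and key-based ssh login is assumed:

```python
# Sketch: run one command on every node over ssh and collect the output.
# Assumes passwordless (key-based) ssh is already set up to each board.
import subprocess

nodes = ['node01', 'node02', 'node03', 'node04']   # placeholder host names

def run_everywhere(command):
    results = {}
    for node in nodes:
        out = subprocess.run(['ssh', node, command],
                             capture_output=True, text=True)
        results[node] = out.stdout.strip()
    return results

for node, uptime in run_everywhere('uptime').items():
    print(node, '->', uptime)
```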
Software you might find useful:
PBS (mostly job scheduling; Google it)
Kerrighed (single-system-image based, Linux distro)
Rocks (cluster-based distribution)
Mosix (cluster management; see also openMosix)
Ganglia (monitoring, probably overkill for you)
Lustre (super fast, open-source cluster filesystem from Sun)
Take a look at Beowulf to get started.
That being said, the best advice I can give is to carefully measure whether you are actually being more green with your cluster. I've been a little way down this road before, and in my experience, the losses involved in having many separate computers end up wiping out any energy savings. Keep in mind that every computer needs a power supply, which converts your household voltage down to a level that the computer wants. The conversion is inefficient, and wastes heat (this is why the power supplies have fans). The same can be said for each hard drive, RAM bank and motherboard that you need.
This isn't meant to discourage you from the project. Just be sure to profile. Exactly like writing software! :)
You can use Beowulf to run a cluster.
There's a lot to this question.
First, if you just want to get a cluster up and running, there are many suggestions listed already here. Once you have the cluster up and running, though, you're just starting.
At that point, you need to have software that will work correctly across the cluster. If you are working on your own software, you'll need to design it to be parallelized across a cluster, using something like MPI.
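For example, with Python bindings such as mpi4py, the "hello world" of cluster-parallel code looks roughly like the sketch below, launched with something like `mpirun -np 8 python script.py`. This is just an illustration of the idea, not a recipe for your particular cluster; the workload is a stand-in.

```python
# Sketch: a tiny MPI program - each process works on its own slice of the data
# and rank 0 gathers the partial results.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's id within the job
size = comm.Get_size()      # total number of processes in the job

data = list(range(1000))                 # stand-in for the real workload
my_slice = data[rank::size]              # simple round-robin partitioning
partial_sum = sum(my_slice)

totals = comm.gather(partial_sum, root=0)
if rank == 0:
    print('combined result:', sum(totals))
```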
Without software written to run across the cluster, though, the cluster is nothing but a highly customized box that doesn't do anything special...