This is not a technical question, but I would like suggestions from more experienced people regarding my career.
I have been working as a UNIX admin for the past 13 years, mostly on Solaris, plus a couple of years on Linux. Now I want to learn something that can advance my career. I have been hearing a lot about Hadoop/Big Data for quite some time. I have no programming or scripting knowledge, nor any knowledge of Apache or any database.
- I am assuming that there are two different job profiles, Developer and Admin. Am I understanding this correctly?
- Do I need to learn Apache, databases, and Java to learn Hadoop (even for the Admin job profile)?
- Training is expensive where I live. If I want to start studying with books, which book should I start with? The popular ones seem to be "Hadoop: The Definitive Guide" (O'Reilly) and "Big Data for Dummies". (I am asking at a beginner's level.)
Please help with my doubts. Your suggestions will help me make a decision.
(Moved from comment because too long.)
In order to administer Hadoop in any meaningful way you need to know a fair bit about (a) how Hadoop works, (b) how Hadoop runs its jobs, and (c) job-specific tuning.
I don't know what "learning Apache" means; Apache is a conglomerate of projects, unless you mean the web server itself.
"Learning databases" is too broad to be useful, and Hadoop isn't a database (HBase is).
You don't need any Java knowledge to administer a Java-based program, although knowing about JVM options, how to specify them, and generalities is certainly helpful.
There is a lot to digest; I would start very small, e.g., with intro books. Also, keep in mind that there are other solutions besides Hadoop, and a lot of different ways to actually use Hadoop.
The Kiji project is a good way to get Hadoop/HBase/etc up and running, though if you're interested in doing everything "from scratch", it's not the best path.
Hi all,
My goal is to analyze Hadoop log files, and there are two tools for this: Starfish (open source) and Splunk (a commercial product). Does anyone know the pros and cons of each, to help me choose?
I really appreciate your answer.
Thanks
Well,
the pros and cons are the same as for any open source vs commercial tool choice.
The main guideline should be, what are your prerequisites?
Splunk itself is a commercial product, but its free license allows you to index 500 MB/day,
probably its main advantage is providing a BI tool cheaper than other commercial ones,
it also has an impressive amount of plugins, including for Hadoop,
and, like Hadoop, it has relied on a (different) MapReduce implementation since Splunk 4.x.
It has both a Python and a Java SDK, which may come in handy.
Its approach is, install it and after (a minimal) setup, start playing with your data.
I don't know Starfish, though it does look promising,
it only seems to require JavaFX while Splunk comes with its own Python alternative installation.
But in the end, it all boils down to what are your most important prerequisites.
The barrier to entry is low for both. The best approach is to try both out for a while and see what works for you.
Depending on your use case each tool has different strengths. What is your use case?
Generally speaking Splunk is easy and modern with great community support. Answers are generally a few searches away.
I just started going through Hadoop introduction videos.
How can I practice on my own? Is there a recommended way to install it locally for practice?
I found that downloading and installing Hadoop, playing with it by working examples, making lots of mistakes and being ok with that worked well for practice.
If by "install on local" you mean "how do I install it on my local machine without using HDFS?", there's an excellent guide here.
If you want to learn about Hadoop and Big Data, look into bigdatauniversity.com. It's free, and they give instructions on how to install Hadoop locally on a virtual machine and/or in Amazon Web Services. BigDataUniversity provides labs and instructions to help guide your practice. I have found it helpful so far.
Recently Cloudera launched a new online platform where you can play with Hadoop and its ecosystem as much as you want. Here you go:
cloudera.com/live
I have been training people on Hadoop for 2 years now. Here are my two cents.
For the learning part, I would recommend the following sources (as others have also mentioned above):
Yahoo Blog
Hadoop Definitive Guide
HortonWorks Practice Tutorials
And for practicing, traditionally people have been using Hadoop Virtual Machines but this approach has its downsides:
The VMs are huge; for example, the Hortonworks VM is 9.9 GB.
You might have to upgrade your RAM to 8 GB.
Some BIOS configurations don't allow virtualization. You might have to change BIOS settings.
Some machines, such as office desktops/laptops, may not allow installations.
My students and I faced these problems too. So we set up a cluster for our students to practice Hadoop, Spark and related technologies, and we named it CloudxLab.com.
...I liked bigdatauniversity.com and also noted that MapR, Hortonworks, and Cloudera all offer a downloadable environment that you can use to gain familiarity with the Hadoop operating paradigm.
In fact, if you are studying this with an eye toward working with Hadoop at an Enterprise scale, it's a good idea to explore the products that are being deployed at that level.
I've now had a little chance to explore MapR's Hadoop environment hands-on and can commend it as a good way of looking into the matter.
I would suggest https://developer.yahoo.com/hadoop/tutorial/ for self-paced Hadoop study. It's a very comprehensive step-by-step guide, from beginner to advanced level.
You can install a VirtualBox image that has Hadoop included, but you may encounter some problems with it. I did that when I first started learning Hadoop, and after several problems (IP, internet, different configs) I decided to learn with a Linux install.
You can find a tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
So, I've got an idea for a website. I can start off using any platform and frameworks I want, but there are almost too many options.
OS Platform:
Windows, *nix
Web Framework:
Rails, ASP.NET, ASP.NET MVC, Django, Zend, Cake, others
Hosting:
EC2, Dedicated Server, Shared Hosting, VPS, App Engine, Azure, others
Persistence:
S3, MySQL, PostgreSQL, SQL Server, SimpleDB, CouchDB, others
How do you avoid decision paralysis and get started?
Firstly, your familiarity with a framework's language should dictate which framework you choose. Don't add the burden of learning another language on top of learning a framework.
Next, have a look at the remaining frameworks. Do they have good documentation? What about the community? (A good community can go a long way toward making up for any shortcomings of a given technology.) Does the framework solve the problems that you need solved?
Finally, just dive in and try something! Pick the one that makes the most sense to you and start writing code. Don't do too much hand-wringing over your decision. If it becomes obvious that you made the wrong choice, it should be obvious quite early. Learn from what you've accomplished so far and consider restarting with a different technology. (Just don't get several weeks down the road before you make this decision!)
I'm sure you don't like all of those technologies equally. Pick a framework that you like and get to work.
It depends on what your app is going to be doing. A handful of the technologies you listed are direct competitors (like Django vs. Rails), but some are completely different ways to do things (like MySQL vs. S3).
Questions to answer before you begin:
Will the app need to be horizontally partitioned in the near term? If so, using EC2, Google App Engine or Azure would be a good option.
Will your app fit into the constraints of Google App Engine? If so, it requires a lot less hassle on your part than running on bare metal (whether real or virtual).
What's your preferred web framework? If you want an MS framework, you'll need to run on a host that supports that.
What will your persistence and data access patterns look like? This will determine whether to use a database or something more exotic.
If you are running on EC2, the other AWS services are more appealing. Similarly, if you are using GAE, you have only one option for persistence. If you are using Rails, you may as well start with MySQL.
In answer to your question of how to reduce the number of options, the answer is to realize that many of the options are related, so you don't have as many choices to make as it first appears.
Some advice that was once given to me is, pick what your friends (or colleagues) are using. Having people around you that you can share ideas and the learning experience with is invaluable.
If you want to learn something new: I'd just go with your gut and get started. If it sucks then switch to something more familiar.
If you don't have much time: Go with what you know and forget about the other options. Just start coding.
Optimize for happiness. Pick the one that you like the most. Or the one that intrigues you the most.
I've worked in Microsoft shops, in Ruby on Rails, and in homegrown shops having Apache, Jetty, even Mason.
All frameworks have their warts, their idiosyncrasies that will keep you up until 3 AM, and their "tribal knowledge" vagaries that will be completely unexportable to other frameworks. (The last point is sometimes by design: the whole "platform entrenchment" business strategy.)
Listen to what the supporters of the frameworks say about the problems with the other frameworks (Google: X framework vs Y framework). Pick the framework that has the loudest supporters. If they are equally loud, make the decision with a dice roll.
With me it's simple.
I only know the MS stack and see no point in "checking out" all of those you mentioned.
No, actually I once tried to use JSF before excluding it from my list permanently.
Use what you are experienced in and where you can be more productive. The objective is to get your site up and running. Go for it.
One of the biggest factors in determining which platform/framework to use is your budget. You have to factor in the cost of licensing, software required to develop/maintain your website and other miscellaneous costs.
I suggest you begin with a scorecard of your own construction. Perhaps you can find different ones on the web, but if you do, modify them to meet YOUR needs. There should be a scorecard for each level in the stack (as you've described). Each scorecard should share some graded aspects with the other scorecards, but each will also have its own unique aspects.
Once constructed, weight each aspect graded according to your needs.
Once you've chosen the weights, pick the scales for grades.
At this point, promise yourself you won't mess with the weights or the scale, and then start collecting data on your options for each level in the stack.
You may also want to put a time limit on the collection period.
Make your decision based on the outcome of the scorecard.
The beauty of this approach is that the effort is made in constructing the scorecard, not in circular arguments of options. The effort in making the scorecard is vendor agnostic and focuses on the desired result, not the options. Thus you can avoid paralysis.
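To make the scorecard idea concrete, here is a minimal Python sketch for one level of the stack. Everything specific in it is made up for illustration: the criteria names, the weights, the 0 to 5 grading scale, and the two options ("Framework A" and "Framework B"). Substitute your own. It only shows the mechanics: fix the weights and the scale first, grade each option on every aspect, then rank by weighted score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance, chosen before any data is collected


def weighted_score(criteria, grades, scale_max=5):
    """Return a score in 0..1: the weighted average of grades on a 0..scale_max scale."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    total = 0.0
    for c in criteria:
        grade = grades[c.name]  # a KeyError here means an aspect was left ungraded
        if not 0 <= grade <= scale_max:
            raise ValueError(f"grade for {c.name!r} is outside 0..{scale_max}")
        total += c.weight * grade
    return total / (total_weight * scale_max)


def rank(criteria, options, scale_max=5):
    """Return (option name, score) pairs, best score first."""
    scores = {
        name: weighted_score(criteria, grades, scale_max)
        for name, grades in options.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scorecard for the "web framework" level of the stack.
CRITERIA = [
    Criterion("familiarity", 3),
    Criterion("documentation", 2),
    Criterion("community", 2),
    Criterion("cost", 1),
]

# Hypothetical grades collected during the (time-limited) research period.
OPTIONS = {
    "Framework A": {"familiarity": 5, "documentation": 3, "community": 4, "cost": 4},
    "Framework B": {"familiarity": 2, "documentation": 5, "community": 5, "cost": 5},
}

if __name__ == "__main__":
    for name, score in rank(CRITERIA, OPTIONS):
        print(f"{name}: {score:.2f}")
```

Keeping the weights in one fixed list, separate from the grades, is what makes it easy to keep the promise not to touch them once data collection starts.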
One more thing: my best scorecards have included sections addressing the availability of resources and other human-related things. Don't make the mistake of just looking at the technology.
Good luck.
Go for personal preferences.
One decision at a time:
First, I would begin with the type of language:
Script: PHP, Python
Serious: Java, .NET
The language will restrict your OS and platform, and will give you hints for the database decision. The database load is also important. Also, do you want logic in the database? How much data will you have?
One last piece of advice: try well-tested combinations, such as LAMP, WAMP, or Windows with SQL Server and .NET.
Evaluate each platform and technology for quality of tools for your needs. For example, if you are cost sensitive, you would value free operating systems and tools higher than costly ones. If you need performance, you would value tools which provide high performance higher than ones that don't.
It entirely depends on your situation. I spent several months evaluating stuff for a new commercial web site last year, and it was very easy to feel paralyzed. In the end, what helped was talking to several people who'd done similar things, and of course reading a lot of stuff online and from Amazon. I chose Java, since our team had a lot of experience in it, and it has good performance and extensive supporting technologies. Oracle is our database, but we used a persistence manager to make it easy to change later on. We used a half-dozen very good libraries to eliminate much of the boring and repetitive coding (Restlet, iBatis, Freemarker, XStream, jQuery, SLF4J). We used Glassfish as our web server.
Yours sounds like a small project with only you to work on it. In that case, pick a complete framework instead of a smorgasbord like we did. Pick something fun to work with, and something with good "return on resume". Look very hard at Ruby on Rails, Django (kind of a Python on Rails), and Groovy on Grails (a Rails-wannabe for the Java world). In your shoes I'd pick Ruby on Rails because there's a large and growing community and a good number of books and tutorials. Plus, Ruby looks like a worthwhile language to learn. For your database, just pick one. These frameworks make it easy to change your mind later. Pick MySQL unless you have another you like better.
And as other posters said, just do it! ;-)
Like others said, pick something you and your employees are familiar with. I highly doubt you are close to being industry ready with all those techs.
OS Platform: Windows, *nix
Shouldn't matter except for Windows licensing costs, and that is probably the least of your expenses.
Web Framework: Rails, ASP.NET, ASP.NET MVC, Django, Zend, Cake, others
Dependent on your favorite language
Hosting: EC2, Dedicated Server, Shared Hosting, VPS, App Engine, Azure, others
You should design your product to be movable, so you can scale among these. If you know for sure you are going big, then just start off with EC2. App Engine is extremely limiting; e.g., they don't let you form outbound connections.
Persistence: S3, MySQL, PostgreSQL, SQL Server, SimpleDB, CouchDB, others
You need to do the research yourself whether or not your product requires an RDBMS or a simple key/value store, and what features each of these have.
Just go for it! Your platform choice really is not all that important as long as you make a reasonable choice (Ruby + Rails, Python + Django, PHP + Cake/CodeIgniter). Any of these can be used to build successful sites. If your site really takes off, you'll be able to scale it fine.