hiall
My goal is to analyze log files of Hadoop and there are two tools starfish(open source) and splunk(commercial product). Does anyone know the pros and cons as to which one to choose.
I really appreciate your answer.
Thanks
Well,
the pros and cons are the same of any open source vs commercial tool choice.
The main guideline should be, what are your prerequisites?
Splunk core is opensource, the free license allows you to index 500Mb/day,
probably its main advantage is providing a BI tool cheaper than other comercial ones,
it also has an impressive amount of plugins, including for Hadoop,
and like Hadoop relies on a (different) MapReduce implementation since Splunk 4.x.
It both has a Python and Java SDK, which may come in handy.
Its approach is, install it and after (a minimal) setup, start playing with your data.
I don't know Starfish, though it does look promissing,
it only seems to require JavaFX while Splunk comes with its own Python alternative installation.
But in the end, it all boils down to what are your most important prerequisites.
Barriers to entry is low for both. Best is to try both out for a while and see what works for you.
Depending on your use case each tool has different strengths. What is your use case?
Generally speaking Splunk is easy and modern with great community support. Answers are generally a few searches away.
Lately, I am getting more engrossed in learning Oracle and Geospatial systems. I feel that mapping systems, combined with solid data structure are two technologies that are making their niche in today's market.
If you are starting to learn about these technologies, where would you recommend starting off? If I understand correctly, the best way to learn them would be through actual work (or hobby), but I can't seem to find good places to get the resources to do so.
I would appreciate any advice, tips, resources and information everyone could provide to jump-start my learning and understanding of these technologies.
Thanks.
Update:
Saw a nice PDF relating about this, but for a hobbyist wanting to learn it, are there free tools to start off with it?
http://download.oracle.com/otndocs/products/mapviewer/pdf/mv11g_spatialvis_inobiee.pdf
You appear to be interested in OLAP/BI combined with GIS/mapping.
See information on Spatial OLAP (aka SOLAP) at http://www.spatialbi.org/ , as well as this list of tools at http://spatialolap.scg.ulaval.ca/DevApproaches.asp
Also, see GeoKettle at http://www.spatialytics.org/
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I am a Web developer. I have experience in Web technologies like JavaScript , Jquery , Php , HTML . I know basic concepts of C. Recently I had taken interest in learning more about mapreduce and hadoop. So I enrolled my self in parallel data processing in mapreduce course in my university. Since I dont have any prior programing knowledge in any object oriented languages like Java or C++ , how should I go about learning map reduce and hadoop. I have started to read Yahoo hadoop tutorials and also OReilly's Hadoop The Definitive Guide 2nd.Edition.
I would like you guys to suggest me ways I could go about learning mapreduce and hadoop.
Here are some nice YouTube videos on MapReduce
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE
Also, here are nice tutorials on how to setup Hadoop on Ubuntu
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
You can access Hadoop from many different languages and a number of resources set up Hadoop for you. You could try Amazon's Elastic MapReduce (EMR), for instance, without having to go through the hassle of configuring the servers, workers, etc. This is a good way to get your head around MapReduce processing while delaying a bit the issues of learning how to use HDFS well, how to manage your scheduler, etc.
It's not hard to search for your favorite language & find Hadoop APIs for it or at least some tutorials on linking it with Hadoop. For instance, here's a walkthrough on a PHP app run on Hadoop: http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html
Answer 1 :
It is very desirable to know Java. Hadoop is written in Java. Its popular Sequence File format is dependent on Java.
Even if you use Hive or Pig, you'll probably need to write your own UDF someday. Some people still try to write them in other languages, but I guess that Java has more robust and primary support for them.
Most Hadoop tools are not mature enough (like Sqoop, HCatalog and so on), so you'll see many Java error stack traces and probably you'll want to hack the source code someday
Answer 2
It is not required for you to know Java.
As the others said, it would be very helpful depending on how complex your processing may be. However, there is an incredible amount you can do with just Pig and say Hive.
I would agree that it is fairly likely you will eventually need to write a user defined function (UDF), however, I've written those in Python, and it is very easy to write UDFs in Python.
Granted, if you have very stringent performance requirements, then a Java based MapReduce program would be the way to go. However, great advancements in performance are being made all of the time in both Pig and Hive.
So, the short answer to your question is, "No", it is not required for you to know Java in order to perform Hadoop development.
Source :
http://www.linkedin.com/groups/Is-it-must-Hadoop-Developer-988957.S.141072851
1) Learn Java. No way around that, sorry.
2) Profit! It'll be very easy after that -- Hadoop is pretty darn simple.
It sounds like you are on the right track. I recommend setting up some Virtual Machines on your home computer to start taking what you see in the books and implementing them in your VMs. As with many things the only way to become better at something is to practice it. Once you get into I am sure you will have enough knowledge to start a small project to implement Hadoop with. Here are some examples of things people have built with Hadoop: Powered by Hadoop
Go through the Yahoo Hadoop tutorial before going through Hadoop the definitive guide. The Yahoo tutorial gives you a very clean and easy understanding of the architecture.
I think the concepts are not arranged properly in the Book. That makes it a little difficult to study it.
So do not study it together. Go through the web tutorial first.
I just put together a paper on this topic. Great resources above, but I think you'll find some additional pointers here: http://images.globalknowledge.com/wwwimages/whitepaperpdf/WP_CL_Learning_Hadoop.pdf
Feel free to join my blog about Big Data - https://oyermolenko.blog. I’ve been working with Hadoop for a couple of years and in this blog want to share my experience from the early start. I came from .NET environment and faced a couple of challenges related to switching from one language into another. My blog is oriented on people who didn’t work with Hadoop but have some primary technical background like you. Step by step I want to cover the whole family of Big Data services, describe the concepts and common problems I met working with them. Hope you will enjoy it
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
So, I've got an idea for a website. I can start off using any platform and frameworks I want, but there are almost too many options.
OS Platform:
Windows, *nix
Web Framework:
Rails, ASP.NET, ASP.NET MVC, Django, Zend, Cake, others
Hosting:
EC2, Dedicated Server, Shared Hosting, VPS, App Engine, Azure, others
Persistence:
S3, MySql, PostreSql, Sql Server, SimpleDB, CouchDB, others
How do you avoid decision paralysis and get started?
Firstly, your familiarity with a framework's language should dictate which framework you choose. Don't add the burden of learning another language on top of learning a framework.
Next, have a look at the remaining frameworks. Do they have good documentation? What about the community. (A good community can go a long way to making up any shortcomings of a given technology.) Does the framework solve the problems that you need solved?
Finally, just dive in and try something! Pick the one that makes the most sense to you and start writing code. Don't do too much hand-wringing over your decision. If it becomes obvious that you made the wrong choice, it should be obvious quite early. Learn from what you've accomplished so far and consider restarting with a different technology. (Just don't get several weeks down the road before you make this decision!)
I'm sure you don't like all of those technologies equally. Pick a framework that you like and get to work.
It depends on what your app is going to be doing. A handful of the technologies you listed are direct competitors (like Django vs. Rails), but some are completely different ways to do things (like MySQL vs. S3).
Questions to answer before you begin:
Will the app need to be horizontally partitioned in the near term? If so, using EC2, Google App Engine or Azure would be a good option.
Will your app fit into the constraints of Google App Engine? If so, it requires a lot less hassle on your part than running on bare metal (whether real or virtual).
What's your preferred web framework? If you want an MS framework, you'll need to run on a host that supports that.
What will your persistence and data access patterns look like? This will determine whether to use a database or something more exotic.
If you are running on EC2, the other AWS services are more appealing. Similarly, if you are using GAE, you have only one option for persistence. If you are using Rails, may as well start with MySQL.
In answer to your question of how to reduce the number of options, the answer is to realize that many of the options are related, so you don't have as many choices to make as it first appears.
Some advice that was once given to me is, pick what your friends (or colleagues) are using. Having people around you that you can share ideas and the learning experience with is invaluable.
If you want to learn something new: I'd just go with your gut and get started. If it sucks then switch to something more familiar.
If you don't have much time: Go with what you know and forget about the other options. Just start coding.
Optimize for happiness. Pick the one that you like the most. Or the one that intrigues you the most.
I've worked in Microsoft shops, in Ruby on Rails, and in homegrown shops having Apache, Jetty, even Mason.
All frameworks have their warts, their idiosyncracies that will keep you up until 3 AM, and their "tribal knowledge" vagaries that will be completely unexportable to other frameworks. (The last point is sometimes by design, the whole "platform entrenchment" business strategy)
Listen to what the supporters of the frameworks say about the problems with the other frameworks (Google: X framework vs Y framework). Pick the framework that has the loudest supporters. If they are equally loud, make the decision with a dice roll.
With me it's simple.
I only know MS stack and see no point in "checking out" all of those you mentioned.
No, actually I once tried to use JSF before excluding it from my list permanently.
Use what you are experienced in and where you can be more productive. The objective is to get your site up and running. Go for it.
One of the biggest factors in determining which platform/framework to use is your budget. You have to factor in the cost of licensing, software required to develop/maintain your website and other miscellaneous costs.
I suggest you begin with a scorecard of your own construction. Perhaps you can find different ones on the web, but if you do, modify them to meet YOUR needs. There should be a scorecard for each level in the stack (as you've described). Each scorecard should share some aspects to grade with other scorecards but each will also have their unique aspects.
Once constructed, weight each aspect graded according to your needs.
Once you've chosen the weights, pick the scales for grades.
At this point promise yourself you wont mess with the weights or the scale and then start collecting data on your options for each level in the stack.
You may also want to put a time limit on the collection period.
Make your decision based on the outcome of the scorecard.
The beauty of this approach is that the effort is made in constructing the scorecard, not in circular arguments of options. The effort in making the scorecard is vendor agnostic and focuses on the desired result, not the options. Thus you can avoid paralysis.
One more thing, my best scorecards have included sections addressing the availability of resources and other human related things. Don't make the mistake of just looking at the technology.
good luck.
Go for personal preferences.
One decision at a time:
Firts I would begin with type of language:
Script: PHP, Python,
Serious: Java, .Net
The language will restrict your OS, plattform and will give you hints for the dataabse decission. The database load is also important. And, Do you want logic in the DDBB? how much data?
Last advice. Try combinations well tested. LAMP, WAMP, Windows with SQL Server and .NET.
Evaluate each platform and technology for quality of tools for your needs. For example, if you are cost sensitive, you would value free operating systems and tools higher than costly ones. If you need performance, you would value tools which provide high performance higher than ones that don't.
It entirely depends on your situation. I spent several months evaluating stuff for a new commercial web site last year, and it was very easy to feel paralized. In the end it was talking to several people who'd done similar things, and of course reading a lot of stuff online and from Amazon. I chose Java, since our team had a lot of experience in it, and it has good performance and extensive supporting technologies. Oracle is our database but we used a persistence manager to make it easy to change later on. We used a half-dozen very good libraries to eliminate much of the boring and repetitive coding (Restlet, iBatis, Freemarker, XStream, jQuery, SLF4J). We used Glassfish as our web server.
Yours sounds like a small project with only you to work on it. In that case, pick a complete framework instead of a smorgasbord like we did. Pick something fun to work with, and something with good "return on resume". Look very hard at Ruby on Rails, Django (kind of a Python on Rails), and Groovy on Grails (a Rails-wannabe for the Java world). In your shoes I'd pick Ruby on Rails because there's a large and growing community and a good number of books and tutorials. Plus, Ruby looks like a worthwhile language to learn. For your database, just pick one. These frameworks make it easy to change your mind later. Pick MySQL unless you have another you like better.
And as other posters said, just do it! ;-)
Like others said, pick something you and your employees are familiar with. I highly doubt you are close to being industry ready with all those techs.
OS Platform: Windows, *nix
Shouldn't matter except for Windows licensing costs, and that is probably the least of your expenses.
Web Framework: Rails, ASP.NET, ASP.NET MVC, Django, Zend, Cake, others
Dependent on your favorite language
Hosting: EC2, Dedicated Server, Shared Hosting, VPS, App Engine, Azure, others
You should design your product to be movable, so you can scale among these. If you know for sure you are going big, then just start off with EC2. App Engine is extremely limiting, ex. they don't let you form outbound connections.
Persistence: S3, MySql, PostreSql, Sql Server, SimpleDB, CouchDB, others
You need to do the research yourself whether or not your product requires an RDBMS or a simple key/value store, and what features each of these have.
Just go for it! Your platform choice really is not all that important as long as you make a reasonable choice (Ruby + Rails, Python + Django, PHP + Cake/CodeIgniter). Any of these can be used to build successful sites. If your site really takes off, you'll be able to scale it fine.
I work in a very small shop (2 people), and since I started a few months back we have been relying on Windows Scheduled tasks. Finally, I've decided I've had enough grief with some of its inabilities such as
No logs that I can find except on a domain level (inaccessible to machine admins who aren't domain admins)
No alerting mechanism (e-mail, for one) when the job fails.
Once again, we are a small shop. I'm looking to do the analogous scheduling system upgrade than I'm doing with source control (VSS --> Subversion). I'm looking for suggestions of systems that
Are able to do the two things outlined above
Have been community-tested. I'd love to be a guinae pig for exciting software, but job scheduling is not my day job.
Ability to remotely manage jobs a plus
Free a plus. Cheap is okay, but I have very little interest in going through a full blown sales pitch with 7 power point presentations.
Built-in ability to run common tasks besides .EXE's a (minor) plus (run an assembly by name, run an Excel macro by name a plus, run a database stored procedure, etc.).
I think you can look at :
http://www.visualcron.com/
Consider Cygwin and its version of "cron". It meets requirements #1 thru 4 (though without a nice UI for #3.)
Apologize for kicking up the dust here on a very old thread. But I couldn't disagree more with what's been presented here.
Scheduled tasks in Windows are AWESOME (a %^#% load better than writing services I might add). Yes, not without limitations. But still extremely powerful. I rely on them in earnest for a variety of different things.
If you even have a slight grasp on c# you can write as custom "task" (essentially a console application) to do, well, virtually anything. If persistent/accessible logging is what you're after, why not something like Serilog or NLog? Even at the time of writing, it had a very robust feature set. This tool in and of itself, in conjunction with some c#, could've solved both your problems very easily.
Perhaps I'm missing the point, but it seems to me that this isn't really a problem. At least not anymore...
If you're looking for a free tool there is plenty of implementations for the popular Cron tool for Windows, for example CRONw. It's pretty easy to configure and maintain. You could easily write add custom WSH scripts to send your emails and add log entries.
If you're going commercial way BMC Control-M is arguably one of the best but I don't believe that it is particularly cheap.
You may also consider some upcoming packages like JobScheduler
Pretty old question, but we use Jenkins. Yes its main purpose is for CI\CD, but its also a really nice UI for CRON with a ton of plugins and integrations.