Closed. This question is opinion-based. It is not currently accepting answers.
Hello guys,
I am new to Hadoop and everything around big data.
While researching social media data integration with big data, I found a lot about Hadoop.
But I know there is also Google Analytics if I want to observe social media and get some statistics.
So why are so many companies using Hadoop instead of Google Analytics?
What is the difference between those two?
Thank you for your answer :)
I will try to answer this as well as I can, as it's a strange question :)
The reason I say it's strange is that the two are not really related, and trying to find a basis for comparison is tricky.
GA - Typically used to track web behavior. Provides a nice UI and is digestible by non-technical people (marketing, etc.) looking for insights.
Hadoop - Hadoop at its core is a file system (think of a very large hard drive) that stores data in a distributed fashion (across n servers). Its claim to fame is map/reduce and the plethora of applications like Hive or Pig for analyzing data sitting in Hadoop.
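To make the map/reduce idea concrete, here is the classic word-count example written as a pair of Hadoop Streaming scripts in Python. This is only a minimal sketch; the script names and layout are illustrative, not tied to any particular cluster.

    #!/usr/bin/env python
    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py - sum the counts per word; Hadoop delivers the mapper
    # output sorted by key, so equal words arrive as consecutive lines
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)

    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

You would point the hadoop-streaming jar's -mapper and -reducer options at these two scripts; Hadoop takes care of distributing the input and shuffling the intermediate keys.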
A better comparison to the products you mentioned would be something like:
Why would I use Google Analytics vs Comscore? (web analytics)
Why would I use Hadoop vs Postgres? (data storage and data analyses)
Closed. This question needs to be more focused. It is not currently accepting answers.
I want to build a crawler that can update hundreds of thousands of links in several minutes.
Are there any mature ways to do the scheduling?
Is a distributed system needed?
What is the greatest barrier that limits the performance?
Thanks.
For Python you could go with Frontera by Scrapinghub
https://github.com/scrapinghub/frontera
https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst
They're the same guys that make Scrapy.
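If you just want to get a feel for Scrapy itself before putting a frontier like Frontera in front of it, a minimal spider looks roughly like this (the URLs and output fields are placeholders):

    import scrapy

    class LinkCheckSpider(scrapy.Spider):
        # Minimal sketch: revisit a list of known URLs and record their status.
        name = "linkcheck"

        def start_requests(self):
            # In a real crawler the URLs would come from a frontier (e.g. Frontera)
            # or a database, not a hard-coded list.
            urls = ["http://example.com/page1", "http://example.com/page2"]
            for url in urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {"url": response.url, "status": response.status}

You can run it with "scrapy runspider linkcheck.py" while prototyping; the scheduling and distribution questions are exactly what Frontera is meant to take over.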
There's also Apache Nutch, which is a much older project.
http://nutch.apache.org/
You would need a distributed crawler, but don't reinvent the wheel: use Apache Nutch. It was built exactly for that purpose, is mature and stable, and is used by a wide community to deal with large-scale crawls.
The amount of processing and memory required calls for distributed processing unless you are willing to compromise on speed. Remember that you'd be dealing with billions of links and terabytes of text and images.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I am new here, and I hope I can find answers to my questions related to open source reporting systems.
Is it possible to change the programming logic of Tableau Desktop? I am asking this because I need to make changes that enable me to log users' interactions with the system (Tableau Desktop).
Is it possible to perform Big Data analysis by combining Tableau Desktop with Hadoop or Spark?
If the answers to the above questions are no, then could you please recommend another open source (free) reporting system that satisfies these requirements?
Thank you in advance and best regards to all of you.
Tableau has drivers to connect to several "big data" NoSQL databases, and has added a Spark SQL driver as of Tableau version 8.3.
The full list of supported drivers can be found on Tableau's website at http://www.tableau.com/support/drivers
Your question about logging user interactions is not at all clear, but you might have better luck instituting logging at the database level instead of at the client level.
In response to your question regarding user interactions, I'd recommend you take a look at the views_stats table in the Tableau Server database.
Instructions for connecting to the 'workgroup' database: http://onlinehelp.tableau.com/current/server/en-us/adminview_postgres_connect.htm
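As a rough sketch of what querying that table can look like from Python, assuming you have enabled the readonly repository user as described in the linked instructions (the host, port and password below are placeholders):

    import psycopg2

    # Placeholders: use the host, port and readonly credentials configured
    # for your own Tableau Server repository ("workgroup" database).
    conn = psycopg2.connect(host="your-tableau-server", port=8060,
                            dbname="workgroup", user="readonly",
                            password="<readonly-password>")

    with conn.cursor() as cur:
        # Peek at the view-interaction records mentioned above.
        cur.execute("SELECT * FROM views_stats LIMIT 10")
        for row in cur.fetchall():
            print(row)

    conn.close()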
Versions 8 and 9 include a Spark connection.
As far as logging users goes, Tableau Desktop is designed as a single-license tool for developers and shouldn't need to be logged.
If you're interested in logging users, you may be thinking of Tableau Server, which has built-in functions for things like that as well as a REST API, which has some additional functions.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question answers part of my question but not completely.
How do I run a script that manages this? Is it from my local filesystem? Where exactly do things like MrJob or Dumbo come into the picture? Are there any other alternatives?
I am trying to run K-Means with Hadoop Streaming and Python, where each iteration's output (each iteration being a MapReduce job) will be the input to the next iteration.
I do not have much experience, and any information should help me make this work. Thanks!
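One common pattern is to drive the iterations from a small local script that launches one Hadoop Streaming job per iteration and feeds each job's output centroids into the next. The sketch below assumes that setup; the streaming jar location, HDFS paths and script names are placeholders, and mapper.py/reducer.py stand in for hypothetical scripts that assign points to centroids and recompute centroid means.

    import subprocess

    # Placeholders: adjust paths and the jar location for your own cluster.
    STREAMING_JAR = "/usr/lib/hadoop/hadoop-streaming.jar"
    INPUT = "/kmeans/points"            # HDFS path holding the data points
    ITERATIONS = 10

    centroids_local = "centroids.txt"   # initial centroids, prepared beforehand
    for i in range(ITERATIONS):
        out = "/kmeans/iter_%d" % (i + 1)
        subprocess.check_call([
            "hadoop", "jar", STREAMING_JAR,
            # ship the mapper, reducer and current centroids to every node
            "-files", "mapper.py,reducer.py,%s" % centroids_local,
            "-mapper", "python mapper.py",    # assign each point to its nearest centroid
            "-reducer", "python reducer.py",  # average the assigned points into new centroids
            "-input", INPUT,
            "-output", out,
        ])
        # pull this iteration's centroids back so they can be shipped into the next run
        subprocess.check_call(["hdfs", "dfs", "-getmerge", out, centroids_local])

MrJob essentially wraps this kind of loop for you (it can chain steps and submit them to Hadoop), so it is worth a look if you'd rather not manage the job submissions by hand.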
If you are not very tightly coupled with Python, then you have a very good option. There is a project from Cloudera called "Crunch" that lets you create pipelines of MR jobs easily. It's a Java library that provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google's FlumeJava library.
There is another non-Python option. GraphLab is an open source project to produce free implementations of scalable machine learning algorithms on multicore machines and clusters. A fast, scalable implementation of the K-Means++ algorithm is included in the package. See GraphLab for details.
The clustering API of GraphLab can be found here.
Seems like a good application for Spark. It also has a streaming option, though I'm afraid that only works with Scala; however, they have a Python API, so it is definitely worth a try. It is not that difficult to use (at least the tutorials aren't), and it can scale to large workloads.
It should be possible to use GraphLab Create (in Python) running on Hadoop to do what you describe. The clustering toolkit can help implement the K-Means part. You can coordinate/script it from your local machine and use the graphlab.deploy API to run the job on Hadoop.
Closed. This question is off-topic. It is not currently accepting answers.
Are there any reports or theses about the performance of Google App Engine or other cloud platforms?
I am writing an article about how to choose an appropriate cloud platform, and I want to reference some test data.
A little work with Google may bring up some material that others have found. For instance, the canonical resource for Azure benchmarking is here: http://azurescope.cloudapp.net/. However, there isn't much comparative material, because direct comparison really doesn't make much sense.
Comparing cloud platforms solely on performance is like comparing apples with bananas with oranges. Each has its own qualities that make it appropriate for a particular kind of application.
For example, in broad terms, for multi-platform use where you have control of the underlying OS, go EC2; for a managed Windows application platform go Azure; or for a managed Java/Python platform choose App Engine. Once you've chosen the platform you can pretty much then pay for the performance you need.
Bear in mind too that "performance" means different things for different applications. The application I'm working on, for instance, relies heavily on SQL database performance. That will have a very different performance profile from (say) an application that uses a key-value pair storage system, or an application that's mostly static HTML.
So, in practice, there isn't much in the way of performance benchmarks out there, because every application is different.
Closed. This question is opinion-based. It is not currently accepting answers.
We have been asked to provide a data reporting solution. The following are the requirements:
i. The client has a lot of data which is generated every day as an outcome of the tests they run. These tests are run at several sites, and the results get automatically backed up to a central server.
ii. They already have Perl scripts which post-process the data and generate Excel-based reports.
iii. They need a web based interface for comparing those reports and they need to mark and track issues which might be present in those data.
I am unsure whether we should build our own tool for this or go for an already existing tool (any suggestions?). Can you please provide supporting arguments for the decision you would suggest?
You need to narrow down your requirements (what kind of data needs to be compared, and in which format?). Then check whether there is already software available (commercial or free) that fulfills your needs. Based on that, decide whether it's better (i.e., cheaper) to implement the functionality yourself or to use the other software.
Don't reinvent the wheel.
There are quite a few tools out there that specialise in this sort of thing; my gut feeling is that you can find something ready-made that does what you need.
As a side note, that tool may also be a better solution for creating those Excel reports than the Perl scripts.