I am using Cloudera Hadoop and want to perform spatial analytics, so I need to connect it to Quantum GIS (QGIS) for geospatial purposes. How can I connect QGIS and Hadoop? Also, is there a way to connect other GIS systems besides ArcGIS?
There are a number of free and open-source offerings you can use to achieve your goals.
From the list of LocationTech projects, I'd note that GeoJinni (formerly Spatial Hadoop), GeoMesa, and GeoTrellis all work with Hadoop or distributed databases like Accumulo or Cassandra.
More generally, since working with Hadoop means using Java, I'd recommend the GeoTools project for processing geo/GIS data on the JVM. GeoTools is used as a library by GeoServer to publish geospatial data using open standards. GeoServer and MapServer are two of the open alternatives to the ArcGIS server products.
As you are looking for alternatives to Arc desktop products, QGIS and OpenJump are both options.
As a concrete, small example, I've used the GeoTools library to read shapefiles from HDFS for ingest into GeoMesa with no problems. Previously, I looked at serving up GeoTIFFs hosted on HDFS/S3 through GeoServer; a few small changes were necessary to wire that up through the stack, but I was able to do it since all the software involved is open source.
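As a rough illustration of that first step, here is a minimal GeoTools sketch that opens a shapefile and prints its features. It is only a sketch: the file path is hypothetical, it assumes the older GeoTools package layout (org.geotools.data.* / org.opengis.*), and reading directly from HDFS would additionally go through Hadoop's FileSystem API, for example by copying the file to a local path first.

    import java.io.File;

    import org.geotools.data.shapefile.ShapefileDataStore;
    import org.geotools.data.simple.SimpleFeatureCollection;
    import org.geotools.data.simple.SimpleFeatureIterator;
    import org.geotools.data.simple.SimpleFeatureSource;
    import org.opengis.feature.simple.SimpleFeature;

    public class ShapefileReadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical local copy of a shapefile (e.g. pulled down from HDFS first).
            File shp = new File("/tmp/roads.shp");
            ShapefileDataStore store = new ShapefileDataStore(shp.toURI().toURL());
            try {
                // A shapefile exposes exactly one feature type.
                String typeName = store.getTypeNames()[0];
                SimpleFeatureSource source = store.getFeatureSource(typeName);
                SimpleFeatureCollection features = source.getFeatures();

                // Iterate over the features and print each ID and geometry.
                SimpleFeatureIterator it = features.features();
                try {
                    while (it.hasNext()) {
                        SimpleFeature feature = it.next();
                        System.out.println(feature.getID() + " -> " + feature.getDefaultGeometry());
                    }
                } finally {
                    it.close();
                }
            } finally {
                store.dispose();
            }
        }
    }

From there, ingest into GeoMesa or another store is a matter of writing each feature back out through that store's own API.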
(In full disclosure, I work on GeoMesa and participate quite a bit in LocationTech projects, and to some extent in GeoTools/GeoServer.)
GeoWave is a project that matches the criteria.
The goal of the project is primarily to connect the Hadoop ecosystem with popular GIS software, which seems to fit well here. It enables storage/retrieval/analysis of raster/vector/point cloud data within various distributed key-value stores.
Also, the project has installers for Cloudera in addition to other vendors; the latest release supports Cloudera 5.12.1.
Please keep in mind that I am a GeoWave core contributor.
I have a Hadoop-based environment, using Flume, Hue and Cassandra. There is a lot of hype around Docker nowadays, so I would like to examine the pros and cons of dockerization in this case. I think it should be much more portable, but the environment can already be set up with a few clicks in Cloudera Manager. Is it maybe faster, and why is it worth it? What are the advantages?
Or should only the multi-node Cassandra cluster be dockerized?
Is it maybe faster, and why is it worth it?
It sounds like you already have a Hadoop cluster. So you have to ask yourself, how long does it take to reproduce this environment? How often do you need to reproduce this environment?
If you don't need a way to reproduce the environment repeatedly, or to contain dependencies that may conflict with other applications on the host, then I don't yet see a use case for you.
What are the advantages?
If you are running Hadoop in an environment where you may need mixed Java versions, then running it as a container could isolate the dependencies (in this case, Java) from the host system. In some cases, it would get you a more easily reproducible artifact to move around and set up. But Java apps are already fairly self-contained, with their dependencies included in the JAR.
Or should only the multi-node Cassandra cluster be dockerized?
I don't think it really comes down to whether it is a multi-node environment or not. It comes down to the problems it solves. It doesn't sound like you have any pain points in deploying or reproducing Hadoop environments (yet), so I don't see the need to "dockerize" something just because it is the hot new thing on the block.
When you do have the need to reproduce the Hadoop environment easily, you might look at Docker for some of the orchestration and management tools (Kubernetes, Rancher, etc.) which make deploying and managing clusters of applications on an overlay network much more appetizing than just regular Docker. Docker is just the tool in my eyes. It really starts to shine when you can leverage some of the neat overlay multi-host networking, discovery, and orchestration that other packages are building on top of it.
I am trying to evaluate Datameer and Alteryx for our big data analytics needs. What are the pros and cons of these two tools?
First off, full disclosure: I am Sr. Director for Technical Product Marketing at Datameer, so treat what I have to say with appropriate skepticism. For what it's worth, I also write about Big Data (but not about Datameer) for ZDNet, and I was Research Director for Big Data and Analytics at Gigaom Research. So I know a thing or two about the BI/Big Data market.
With that out of the way, let me say that Alteryx and Datameer are actually rather different products. Even if our messaging may sound similar at times, we do different things.
Alteryx does a great job of allowing its users to set up workflows, graphically, for data transformation, then run those workflows when the design is done. Alteryx connects to Hadoop via Hive and its ODBC driver, moving data out of Hadoop in order to process it.
Alteryx runs as a Windows desktop application, using a UI that looks much like an Integrated Development Environment (IDE). If you're a developer, or even a certain type of power user (for example, one who likes to write a little code now and then), you'll be right at home.
Datameer, on the other hand, can run on Hadoop natively. Instead of connecting via Hive and moving data from Hadoop into our engine, Hadoop in fact is our engine, where that makes the most sense. Rather than graphical workflows, we use a spreadsheet metaphor, allowing users to enter formulas in sheets in order to effect data transformation/shaping/cleansing. And instead of making you execute your whole workbook to see results, our Smart Sampling feature brings data in at design time, so you can work interactively with a subset of the data before you decide to execute the full workbook from end-to-end.
Datameer runs in a Web browser, not as a desktop application, allowing us to run cross-platform between Windows and Mac OS (for example), as well as on tablets running Android, iOS or Windows. Datameer can run on-prem or as a service, in various configurations. With our Personal and Workgroup products, with which you'd likely be processing smaller data volumes, we bypass Hadoop and execute your workbook in-memory.
We have premium modules that do some interesting things. Smart Execution can simplify some Hadoop decisions you'd otherwise have to make on your own, including whether to use MapReduce, Tez or our local in-memory engine. Our Smart Analytics module lets you use machine learning algorithms to understand your data better, and we make pretty short work of doing so.
Alteryx essentially wraps R to deliver machine learning services, and does so for predictive analytics, rather than for data discovery, per se. The ML capabilities in Alteryx are more comprehensive than ours, but they are based on R functions inserted into data flows whereas our ML feature is Wizard-driven. Our ML feature set is smaller and, we believe, simpler. The 80-20 rule applies, from our point of view.
Alteryx does an excellent job of integrating consumer and spatial data to calculate and visualize things like locations within a certain drive-time radius. Datameer does not have a comparable feature. On the other hand, we do have over 60 native connectors to various RDBMS, DW, NoSQL, social and SaaS databases and services, and they come in the box. The datasets that Alteryx can integrate with come at relatively high cost, per seat, at least in terms of list prices on the company's site (at http://www.alteryx.com/products/pricing).
Alteryx is a BI product with a rich heritage dating back a decade, and the company has done a good job of adding Big Data features as those have become relevant to the market. Datameer was designed from scratch around Big Data use cases and technologies. So, really, we are very different. Can you do the same sorts of things with the two products? Sure. You can also do the same sorts of things with Excel macros and coding in C; that's just how computation works. But our approaches are rather different.
I was thinking about social media applications like Facebook or LinkedIn. I have read lots of articles on websites like http://highscalability.com/ and didn't find the right answer.
The biggest apps today use their own custom systems: custom file systems, customized database engines, or customized web servers. They don't use stock IIS, Apache, MSSQL, MySQL, Windows or Linux. They use many programming languages for different problems. That is fine for them because of their load; they have to account for every bit. They started in small environments, ran into problems, saw bottlenecks, and so they built new solutions.
Now we can find articles about their current systems, but there is no answer about the best way to start.
I need to learn the answer to: what kind of architecture is the right start?
I have some ideas, but we need to be sure about them.
We think: use MySQL for the relational database, with a caching mechanism like memcached in front of MySQL, and a REST API for the business layer, which we are thinking of writing in Python. All of this would run on a suitable Linux distro. Once this environment is in place, we can use any language or system for the UIs: a PHP site for the web, or a native application for iOS or Android.
We need your advice. Thank you so much.
(I am a good reader but it's my first question. I hope there's no problem.)
Following a similar question last year, I compiled the techniques and technologies used by some of the larger social networking sites.
The following architecture concepts are prevalent among such sites:
Scalability
Caching (heavily, across multiple tiers and layers; see the cache-aside sketch after this list)
Data Sharding (preferably by data-locality criteria)
In-memory DBs for frequently referenced data
Efficient wire-level protocols (as opposed to what an enterprise typically considers state of the art)
Asynchronous processing
Flexibility
Service-oriented architecture as a baseline principle
Decoupled and layered components
Asynchronous processing
Reliability
Asynchronous processing
Replication
Cell architecture (independently operated subsets, e.g. by geographical criteria)
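To make the caching point concrete, the following is a minimal cache-aside sketch in Java. It is illustrative only: the in-memory map stands in for a remote cache such as memcached, and loadUserFromDatabase/saveUserToDatabase are hypothetical stubs for the MySQL access layer.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CacheAsideSketch {
        // Stand-in for memcached; in a real system this would be a remote cache client.
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        // Cache-aside read: check the cache first, fall back to the database on a miss,
        // then populate the cache so subsequent reads are cheap.
        public String getUserProfile(String userId) {
            String cached = cache.get(userId);
            if (cached != null) {
                return cached;
            }
            String profile = loadUserFromDatabase(userId);
            cache.put(userId, profile);
            return profile;
        }

        // Cache-aside write: update the source of truth, then invalidate the cached copy.
        public void updateUserProfile(String userId, String profile) {
            saveUserToDatabase(userId, profile);
            cache.remove(userId);
        }

        // Hypothetical database access, stubbed out to keep the sketch self-contained.
        private String loadUserFromDatabase(String userId) {
            return "profile-for-" + userId;
        }

        private void saveUserToDatabase(String userId, String profile) {
            // no-op stub
        }

        public static void main(String[] args) {
            CacheAsideSketch sketch = new CacheAsideSketch();
            System.out.println(sketch.getUserProfile("42")); // miss: loaded from the stub
            System.out.println(sketch.getUserProfile("42")); // hit: served from the cache
        }
    }

The same pattern repeats across tiers: a page cache in front of the web layer, an object cache like memcached in front of MySQL, and the database's own buffer pool below that.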
NB: If you start a new site, you are unlikely to have the kind of scaling or reliability requirements that these extremely large sites face. Hence the best advice is to start small but keep it flexible. One approach is to use an application framework that starts out simple but has flexibility to scale later, e.g. Django.
Is there an easy way to use Hadoop other than with the command line?
Which tools are you using and which one is the best?
Hue is pretty cool; new features are regularly pushed out, and it's open source.
From its website:
Hue features a File Browser for HDFS, a Job Designer/Browser for MapReduce, query editors for Hive, Pig, Cloudera Impala and Solr Search.
It also ships with an Oozie application for creating workflows, various shells and a collection of Hadoop APIs.
Although Enrico already answered the question, I would like to add a few points.
Hue is a really amazing tool, and we have been using it at Goibibo.com for the last year. We have exposed it to developers and business people for running their Hive queries and getting results.
We are also indexing log data, so Cloudera Search comes in pretty handy. With the new version of Hue (3.6), you can also run queries on RDBMS data using Hue itself.
I would really recommend it because it's really simple to use and provides a GUI for almost everything in the big data ecosystem.
If you are on Windows, you can use an open-source project called HDFS Explorer.
If you're on a Mac or Linux, you can mount Hadoop filesystems directly using FUSE, and then use Finder, Nautilus, or whatever you normally use for filesystem navigation. Check out the Hadoop wiki on how to set up the mounts: http://wiki.apache.org/hadoop/MountableHDFS
Each distribution provides a web-based GUI, in some cases Hue and in others based on the new Ambari Views framework, which provides access to file functionality.
You can look at data integration tools like Talend, CloverETL or Pentaho, which provide support for Hadoop. Talend's support is extensive; I don't have much information about the other tools' support for Hadoop.
If you are just looking for something a step up from the CLI for exploring, my installation has a web server that was installed with Hadoop and is accessible on port 50075. The port is configurable, but give that a try.
If you are using CDH, then Hue (the Hadoop user interface) comes with it, and it's a very good user interface for Hadoop. You can also install it separately. It supports all components of Hadoop.
HFX is a lightweight Hadoop file manager you can use. It has some essential features like drag and drop, upload, cut, copy and paste...
Just started going through Hadoop introduction videos.
How do you practice it on your own? Is there a recommended way to install it locally for practice?
I found that downloading and installing Hadoop, playing with it by working examples, making lots of mistakes and being OK with that, worked well for practice.
If by "install on local" you mean "how do I install it on my local machine without using HDFS?", there's an excellent guide here.
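Once a local installation is working, a good first exercise is the classic WordCount MapReduce job. The sketch below follows the standard Apache example closely; the input and output paths are whatever you pass on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every token in each input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In local (standalone) mode you can run it with hadoop jar against a plain directory of text files and inspect the output part files directly, which makes the mistakes mentioned above cheap to make and fix.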
If you want to learn about Hadoop and big data, look into bigdatauniversity.com. It's free, and they give instructions on how to install Hadoop locally on a virtual machine and/or in Amazon Web Services. BigDataUniversity provides labs and instructions to help guide your practice. I have found it helpful so far.
Recently Cloudera launched a new online platform where you can play with Hadoop and its ecosystem as much as you want. Here you go:
cloudera.com/live
I have been training people on Hadoop for 2 years now. Here are my two cents.
For the learning part, I would recommend the following sources (as others have mentioned above):
Yahoo Blog
Hadoop Definitive Guide
HortonWorks Practice Tutorials
For practicing, people have traditionally used Hadoop virtual machines, but this approach has its downsides:
The VMs are huge; for example, HortonWorks' VM is 9.9 GB.
You might have to upgrade your RAM to 8GB.
Some BIOSes don't allow virtualization; you might have to change BIOS settings.
Some machines, such as office desktops/laptops, may not allow installations.
My students and I faced these problems too, so we set up a cluster for our students to practice Hadoop, Spark and related technologies. We named it CloudxLab.com.
...I liked bigdatauniversity.com and also noted that MapR, Hortonworks, and Cloudera all offer a downloadable environment that you can use to gain familiarity with the Hadoop operating paradigm.
In fact, if you are studying this with an eye toward working with Hadoop at an Enterprise scale, it's a good idea to explore the products that are being deployed at that level.
I've now had a little chance to explore MapR's Hadoop environment hands-on and can recommend it as a good way of looking into the matter.
I would suggest https://developer.yahoo.com/hadoop/tutorial/ for Hadoop self-paced study. It's a very comprehensive guide, step by step, from beginner to advanced level.
You can install a virtual machine image that has Hadoop included, but you may encounter some problems with it. I did that first when I started learning Hadoop, and after several problems (IP, internet, different configs) I decided to learn with a Linux install.
You can find a tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/