Is Hadoop in Docker container faster/worth it? [closed] - hadoop

I have a Hadoop-based environment that also uses Flume, Hue, and Cassandra. There is a lot of hype around Docker nowadays, so I would like to examine the pros and cons of dockerizing this setup. I think it would be more portable, but the whole stack can already be set up with Cloudera Manager in a few clicks. Is it faster, or why else would it be worth it? What are the advantages?
Or should only the multi-node Cassandra cluster be dockerized?

Is it faster, or why else would it be worth it?
It sounds like you already have a Hadoop cluster, so you have to ask yourself: how long does it take to reproduce this environment, and how often do you need to reproduce it?
If you don't need a way to reproduce the environment repeatedly, or to isolate dependencies that may conflict with other applications on the host, then I don't yet see a use case for you.
What are the advantages?
If you are running Hadoop in an environment where you may need mixed Java versions, then running it as a container could isolate the dependencies (in this case, Java) from the host system. In some cases, it would also give you a more easily reproducible artifact to move around and set up. But Java apps are already fairly self-contained, with their dependencies bundled in the JAR.
Or should only the multi-node Cassandra cluster be dockerized?
I don't think it really comes down to whether it is a multi-node environment or not; it comes down to the problems it solves. It doesn't sound like you have any pain points in deploying or reproducing Hadoop environments (yet), so I don't see the need to "dockerize" something just because it is the hot new thing on the block.
When you do need to reproduce the Hadoop environment easily, you might look at Docker together with the orchestration and management tools built around it (Kubernetes, Rancher, etc.), which make deploying and managing clusters of applications on an overlay network much more appetizing than plain Docker. Docker is just the tool in my eyes; it really starts to shine when you can leverage the multi-host overlay networking, service discovery, and orchestration that other packages are building on top of it.

Related

How can I increase Tps in Google Cloud Platform? [closed]

I'm running a modded Minecraft server on a machine with 32 GB of RAM.
It's stable with 1–2 players, but when more people join the server, the TPS drops.
I don't think RAM is the problem; rather, too many packets are being transferred between the client and server.
How can I increase the TPS?
TPS drops are primarily caused by what you have going on in your world.
When adding mods or plugins, you should be thinking about the long-term effects of your choices.
For each modded block you add that provides some type of function, the server has to allocate resources to ensure that function is carried out. On its own, that one block is of little consequence. But if that block forms an array, as is typically done with solar panels, then the server will need to dedicate more resources to carry out that array's functions. When we break it down, we can get an idea of how much is really going on in the background.
Minecraft does not have a built-in method for checking RAM usage, but you can check it by installing the Essentials plugin and using the “/memory” command. You can take a look at this link for more information. The same command can also help you determine the current TPS.
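For a sense of what such a command does under the hood, here is a minimal, hypothetical Bukkit/Spigot command executor that reports the JVM's memory usage. The class and command names are made up for illustration; this is a sketch, not the actual Essentials implementation:

```java
import org.bukkit.command.Command;
import org.bukkit.command.CommandExecutor;
import org.bukkit.command.CommandSender;

// Hypothetical "/mem" command: reports JVM memory figures, similar in spirit
// to what Essentials' /memory shows. Plugin registration (plugin.yml and
// getCommand("mem").setExecutor(...)) is omitted for brevity.
public class MemoryCommand implements CommandExecutor {

    @Override
    public boolean onCommand(CommandSender sender, Command command, String label, String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long max = rt.maxMemory() / mb;         // -Xmx ceiling
        long allocated = rt.totalMemory() / mb; // heap currently allocated
        long free = rt.freeMemory() / mb;       // free space within the allocated heap
        sender.sendMessage("Max: " + max + " MB, allocated: " + allocated
                + " MB, free: " + free + " MB");
        return true;
    }
}
```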
Additionally, you will find some good recommendations in that last link that may help you resolve your problem:
Reduce view distance
Your Minecraft server runs at a view distance of 10 by default. We recommend changing the view distance to 6; this will not make any noticeable difference to players, but it can hugely help your server's performance. You can learn how to access your server settings here.
Setup automated restarts
Setting up automatic restarts can help your server run more smoothly by freeing up RAM. It can also reclaim RAM used by plugins and mods that have small memory leaks. You can view a tutorial on how to set up automated restarts here.
Run the latest version
We recommend using the latest version of Minecraft, plugins, and mods on your server. Most newer versions of software include bug fixes and performance improvements that will make your server run faster and more stably.
Remove unnecessary mods and plugins
Having unused plugins and mods on the server will consume server resources even when they are not being used, so it is a good idea to remove any unnecessary mods and plugins. If you think you may use a plugin in the future but are not using it right now, you can disable it by renaming the plugin's .jar file to end with “.disable”, e.g. Essentials.jar.disable. Remove “.disable” from the file name to enable the plugin again.
I also found this documentation explaining how to optimize the server's performance, which may help you troubleshoot your issue.
On the other hand, I recommend you review the following guides on asking questions: How do I ask a good question? and How to create a Minimal, Complete, and Verifiable example, in order to provide better context on what you are doing and what you want to achieve.

How secure is Vagrant/Puppet/Puphpet? [closed]

I am sorry if my question is stupid, but I think it is better to be safe than sorry. I am just a beginner when it comes to server configuration and DevOps.
I am looking at server configuration management tools like Vagrant/Puppet/PuPHPet. They look like extremely powerful tools, but I am worried about the security of using them in a production environment.
For example, when deploying to AWS, we need to specify the AWS access credentials (key, secret, and key pair). With PuPHPet, you actually have to enter them on the website to generate the script file. I downloaded the script as is and replaced the credentials in the code, but I still wonder how secure it is to trust these external tools (Vagrant/Puppet) to manage configuration on the server.
Am I just being paranoid, or is this a possible security risk?
Creator of PuPHPet here. Your configs are not saved to the server; everything is deleted.
I suggest you leave the entry blank in the GUI and manually type it into the YAML file afterward.
PuPHPet's source code is open source, and you are more than welcome to go through all the Puppet modules included in the zip file.
Vagrant, Puppet, and PuPHPet are all different from each other.
1. Vagrant helps you spawn VMs with pre-defined or custom boxes within seconds. The configuration from the "Vagrantfile" is applied to the boxes. You can bring up a server and apply your Puppet code through it.
2. PuPHPet does the same thing but has a nice GUI and a higher level of abstraction compared to Vagrant. It has various options for the kind of box you want.
3. Puppet is a configuration management tool with a descriptive language in which you write a module and apply it to your server to configure it.
Now, coming to security: if you have keys/passwords in your manifests, I would not suggest using an online tool. But you can install Vagrant on your local machine and use it. If your Puppet code is internal to the DevOps team you work for, it is reasonably safe to have passwords in it.
NOTE: NEVER SAVE PASSWORDS IN CODE.
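To illustrate that point, one common pattern is to read secrets from the environment at run time rather than committing them to a Vagrantfile, Puppet manifest, or any other versioned file. A minimal sketch in Java (the environment variable names follow the usual AWS convention; everything else is illustrative):

```java
// Minimal sketch: load AWS credentials from environment variables instead of
// hardcoding them in source or config that ends up in version control.
public class CredentialsFromEnv {
    public static void main(String[] args) {
        String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
        String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
        if (accessKey == null || secretKey == null) {
            throw new IllegalStateException("AWS credentials are not set in the environment");
        }
        // Hand the values to whatever provisioning step or SDK call needs them;
        // nothing sensitive ever lands in the repository.
        System.out.println("Loaded access key ending in "
                + accessKey.substring(Math.max(0, accessKey.length() - 4)));
    }
}
```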

Hadoop and geospatial connector [closed]

I am using Cloudera Hadoop and want to perform spatial analytics; I need to connect to Quantum GIS (QGIS) for geospatial purposes. I need to know how to connect QGIS and Hadoop. Also, is there any way to connect other GIS systems besides ArcGIS?
There are a number of free and open-source offerings you can use to achieve your goals.
From the list of LocationTech projects, I'd note that GeoJinni (formerly Spatial Hadoop), GeoMesa, and GeoTrellis all work with Hadoop or distributed databases like Accumulo or Cassandra.
More generally, since working with Hadoop means using Java, I'd recommend the GeoTools project for processing geo/gis data on the JVM. GeoTools is used as a library for GeoServer to publish geospatial data using open standards. GeoServer and MapServer are two of the open alternatives for Arc server products.
As you are looking for alternatives to Arc desktop products, QGIS and OpenJump are both options.
As a concrete, small example, I've used the GeoTools library to read shapefiles from HDFS for ingest into GeoMesa with no problems. Previously, I looked at serving up GeoTIFFs hosted on HDFS/S3 through GeoServer; a few small changes were necessary to wire that up through the stack, but I was able to do it since all the software involved is open source.
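As a rough illustration of the GeoTools side of that workflow, here is a minimal sketch that reads features from a shapefile. The file path is a placeholder, the imports assume a pre-30 GeoTools release (package names moved in later versions), and reading straight from HDFS would need the extra wiring mentioned above:

```java
import java.io.File;

import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;
import org.opengis.feature.simple.SimpleFeature;

public class ShapefileReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path to a local shapefile.
        File file = new File("/data/roads.shp");
        FileDataStore store = FileDataStoreFinder.getDataStore(file);
        try {
            SimpleFeatureSource source = store.getFeatureSource();
            SimpleFeatureCollection features = source.getFeatures();
            try (SimpleFeatureIterator it = features.features()) {
                while (it.hasNext()) {
                    SimpleFeature feature = it.next();
                    // Print each feature's geometry; real code would transform
                    // and ingest it (for example, into GeoMesa).
                    System.out.println(feature.getDefaultGeometryProperty().getValue());
                }
            }
        } finally {
            store.dispose();
        }
    }
}
```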
(In full-disclosure, I work on GeoMesa, participate quite a bit with LocationTech projects and some with GeoTools/GeoServer.)
GeoWave is a project that matches the criteria.
The primary goal of the project is to connect the Hadoop ecosystem with popular GIS software, which seems to fit well here. It enables storage, retrieval, and analysis of raster, vector, and point-cloud data within various distributed key-value stores.
The project also has installers for Cloudera in addition to other vendors; the latest release supports Cloudera 5.12.1.
Please keep in mind that I am a GeoWave core contributor.

GUI for using Hadoop [closed]

Is there an easy way to use Hadoop other than with the command line?
Which tools are you using and which one is the best?
Hue is pretty cool: new features are pushed out regularly, and it's open source.
From its website:
Hue features a File Browser for HDFS, a Job Designer/Browser for MapReduce, query editors for Hive, Pig, Cloudera Impala and Solr Search.
It also ships with an Oozie Application for creating workflows, various shells, and a collection of Hadoop APIs.
Although Enrico already answered the question, I would like to add a few points.
Hue is a really amazing tool, and we have been using it at Goibibo.com for the last year. We have exposed it to developers and business people for running their Hive queries and getting results.
We are also indexing log data, so Cloudera Search comes in pretty handy. With the new version of Hue (3.6), you can also run queries on RDBMS data from Hue itself.
I would really recommend using it because it's really simple to use and provides a GUI for almost everything in the big data ecosystem.
If you are on windows, you can use an open source project called HDFS Explorer.
If you're on a Mac or Linux, you can mount Hadoop filesystems directly using FUSE and then use Finder, Nautilus, or whatever you normally use for filesystem navigation. Check out the Hadoop wiki on how to set up the mounts: http://wiki.apache.org/hadoop/MountableHDFS
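If you'd rather script against HDFS than mount it, the Hadoop Java FileSystem API offers the same file operations programmatically. A minimal sketch (the NameNode URI and path are placeholders for your cluster):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder; use your NameNode address,
        // or omit the URI to pick up fs.defaultFS from core-site.xml.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("/user"))) {
                System.out.println((status.isDirectory() ? "d " : "- ") + status.getPath());
            }
        }
    }
}
```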
Each distribution provides a web-based GUI, in some cases Hue and in others built on the new Ambari Views framework, which provides access to file functionality.
You can look at data integration tools like Talend, CloverETL, or Pentaho; they provide support for Hadoop, and Talend's support is particularly extensive. I don't have much information about the other tools' Hadoop support.
If you are just looking for something a step up from the CLI for exploring, my installation includes a web server that was installed with Hadoop and is accessible at port 50075. The port is configurable, but give that a try.
If you are using CDH, then Hue (Hadoop User Experience) comes with it, and it's a very good user interface for Hadoop. You can also install it separately. It supports all components of Hadoop.
HFX is a lightweight Hadoop file manager. It has some essential features like drag and drop, upload, cut, copy, and paste.

How to Practice Hadoop Programming? [closed]

I've just started going through Hadoop introduction videos.
How do I practice it on my own? Is there a recommended way to install it locally to practice?
I found that downloading and installing Hadoop, playing with it by working through examples, making lots of mistakes, and being OK with that worked well for practice.
By "install on local", if you mean "how do I install it on my local machine without using HDFS?", there's an excellent guide here.
If you want to learn about Hadoop and big data, look into bigdatauniversity.com. It's free, and they give instructions on how to install Hadoop locally on a virtual machine and/or in Amazon Web Services. BigDataUniversity provides labs and instructions to help guide your practice. I have found it helpful so far.
Cloudera recently launched a new online platform where you can play with Hadoop and its ecosystem as much as you want. Here you go:
cloudera.com/live
I have been training people on Hadoop for two years now. Here are my two cents.
For the learning part, I would recommend the following sources (as others have mentioned above):
Yahoo Blog
Hadoop Definitive Guide
HortonWorks Practice Tutorials
And for practicing, people have traditionally used Hadoop virtual machines, but this approach has its downsides:
The VMs are huge; for example, HortonWorks' VM is 9.9 GB.
You might have to upgrade your RAM to 8 GB.
Some BIOSes don't allow virtualization; you might have to change your BIOS settings.
Some machines, such as office desktops/laptops, may not allow installations.
My students and I faced these problems too, so we set up a cluster for our students to practice Hadoop, Spark, and related technologies on. We named it CloudxLab.com.
I liked bigdatauniversity.com and also noted that MapR, Hortonworks, and Cloudera all offer a downloadable environment that you can use to gain familiarity with the Hadoop operating paradigm.
In fact, if you are studying this with an eye toward working with Hadoop at enterprise scale, it's a good idea to explore the products being deployed at that level.
I've now had a chance to explore MapR's Hadoop environment hands-on and can commend it as a good way of looking into the matter.
I would suggest https://developer.yahoo.com/hadoop/tutorial/ for Hadoop self-paced study. It's a very comprehensive, step-by-step guide, from beginner to advanced level.
You can install a virtual machine image that has Hadoop included, but you may encounter some problems with it. I tried that first when I started learning Hadoop, and after several problems (IP, internet access, different configs) I decided to learn with a Linux install.
You can find a tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Resources