Where is it better to install a Cassandra build on Mac? - macos

I have built Cassandra manually from the command line on my Mac.
The first time, I installed it under my ~/Download folder and it built quickly, but at that time I couldn't find /var/lib/cassandra, so I removed Cassandra from the ~/Download folder.
I then installed it under /, and this time many operations required sudo and everything became slower.
So where is it better to install Cassandra? I will later be loading a large amount of data into Cassandra, but I also want to keep the impact on my machine's performance as low as possible.

Install Cassandra in a location where you have full access to it. The problem you faced was that you left the data and commitlog directories at their defaults. Change those in the cassandra.yaml file and start Cassandra.
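For example, here is a minimal sketch assuming the Cassandra tarball is unpacked under your home directory; the exact paths are placeholders, use any location your user can write to:

    # In conf/cassandra.yaml, point the storage directories at user-writable
    # locations instead of the /var/lib/cassandra defaults, e.g.:
    #
    #   data_file_directories:
    #       - /Users/you/cassandra/data
    #   commitlog_directory: /Users/you/cassandra/commitlog
    #   saved_caches_directory: /Users/you/cassandra/saved_caches

    mkdir -p ~/cassandra/data ~/cassandra/commitlog ~/cassandra/saved_caches
    cd ~/apache-cassandra      # wherever the build/tarball was unpacked
    bin/cassandra -f           # run in the foreground, no sudo needed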

Related

Is it possible to install CDH on a RHEL7 server where Hadoop and a few other components are installed separately?

I have a RHEL7 server on which I am trying to create a common data lake platform for POC and learning purposes. I have set up Hadoop, Hive, Zookeeper, Kafka, Spark, and Sqoop separately.
Installing these components separately has turned out to be a tricky affair and is taking a lot of effort, even though this is for internal use and not production.
I am now trying to install the CDH package on this server.
Is it possible to do so? Will it overlap with the current installations?
How can this be achieved?
Note: the reason we went with separate installations is that the server had no internet access at the time.
The reason for going with CDH now is that internet access will be available for a few days after some approvals, plus CDH saves a lot of time and effort and includes the components required to set up a data lake.
Can someone please help me out here?
Yes, it is feasible to set up CDH with Docker without disturbing your existing configs. Check out the link below for the setup guide. I have tested this, and it works fine even with the individual tools already set up.
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html
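Roughly, the commands from that guide look like the sketch below; the image tag and port mappings shown are examples (8888 for Hue, 7180 for Cloudera Manager), so check the linked page for the exact options for your version:

    # Pull the single-node CDH quickstart image and start it; it runs
    # alongside, not on top of, whatever is already installed on the host.
    docker pull cloudera/quickstart:latest
    docker run --hostname=quickstart.cloudera --privileged=true -t -i \
        -p 8888:8888 -p 7180:7180 \
        cloudera/quickstart /usr/bin/docker-quickstart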

Undo deletion of files caused by rsync

I have a CentOS 6 VM running on my Mac. I was developing remotely over FTP directly on the VM, and I had the vagrant rsync-auto command running.
Not realizing that rsync-auto was running, I tried to copy the project I had created onto my Mac, into the same directory it was rsyncing. Apparently I lost all the files I created, a whole day's work.
Is it possible to get the files back?
This is a data recovery issue, not a programming issue, but I would venture a guess that the chance of recovery is close to 0%. Not only is the data gone/overwritten, but it was inside a VM, so the recovery would have to be of the VM disk at an unknown point in the past, which has since been overwritten by a newer machine state.
Perhaps this will be a good lesson in creating backups / cron jobs or using version control, like Git?
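For instance, even a throwaway local repository would have made the work recoverable; a minimal sketch, with the project path as a placeholder:

    # Snapshot the working tree before running anything that syncs/overwrites it.
    cd ~/projects/myapp          # hypothetical project directory
    git init
    git add -A
    git commit -m "snapshot before rsync"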

How to install Sqoop in Cygwin on Windows 7?

I've already installed Cygwin on Windows 7. Now I plan to add Sqoop to Cygwin for Hadoop, but I'm not getting it right...
Can anybody please suggest the correct way of doing this, or a link detailing it?
I think you should reconsider installing Hadoop on Windows; it is not very easy to do and is probably more trouble than it is worth, although I believe others have done it.
In any case, there are several other options you could consider with regard to Hadoop. First, there are two companies I know of that provide free VMs, and one of them has worked with Microsoft to try to integrate Hadoop into Windows. These are the links:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html
http://hortonworks.com/products/hortonworks-sandbox/#install
Otherwise you can try your luck with the default Apache installation, though I warn you: if you're new to Linux or don't like spending a lot of time changing configuration files, this is not the best route. I did my installation this way, and you have to modify a lot of files, and anything extra like Hive, Sqoop, HBase, etc. needs to be installed and configured separately as well.
Please don't make things complicated for yourself.
I can only recommend running Sqoop on Hadoop in a Linux virtual machine or on native Linux. Although I successfully ran Hadoop 0.20.0 on Windows XP + Cygwin and on Windows 7 + Cygwin, I once tried setting up a newer version of Hadoop on Windows 7 and failed miserably due to errors in Hadoop.
I have wasted days and weeks on this.
So my advice: run Hadoop on Linux if you can; you'll avoid a serious amount of problems.
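If you do go the Linux route, a rough sketch of a plain tarball install of Sqoop looks like this; the version, paths, and Hadoop location are assumptions, so match them to your environment:

    # Unpack Sqoop next to an existing Hadoop install and wire up the env vars.
    tar -xzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt
    export SQOOP_HOME=/opt/sqoop-1.4.7.bin__hadoop-2.6.0
    export PATH=$PATH:$SQOOP_HOME/bin
    export HADOOP_COMMON_HOME=/opt/hadoop    # wherever Hadoop lives
    export HADOOP_MAPRED_HOME=/opt/hadoop
    sqoop version                            # quick smoke test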

How to build and install Spark in an offline environment?

I am trying to install Spark 1.3.1 on an offline cluster (no internet at all, only LAN). However, I don't know how to build it from source, since building with either Maven or sbt requires a network connection. Can someone offer some help or possible solutions?
Thanks.
A simple (albeit somewhat hacky) solution would be to build it on a machine with internet access and then copy everything in ~/.ivy2 over to the machine with only LAN access, so that it can use the cached artifacts. Another, perhaps simpler, option would be to use a pre-built Spark, if that's an acceptable solution.
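A sketch of that cache-copy approach, assuming the standard sbt/Ivy cache location; the build command is just an example, use whatever you normally run for 1.3.1:

    # On a machine that has internet access: build once so the dependency
    # cache is populated, then tar it up.
    cd spark-1.3.1
    build/sbt assembly            # or sbt/sbt assembly, whichever your checkout provides
    tar -czf ivy-cache.tgz -C ~ .ivy2

    # Copy the tarball and the source tree over the LAN, then on the
    # offline machine unpack the cache and rebuild; sbt resolves locally.
    tar -xzf ivy-cache.tgz -C ~
    cd spark-1.3.1
    build/sbt assembly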

How to set up a low-cost cluster

At my house I have about 10 computers, all with different processors and speeds (all x86 compatible). I would like to cluster them. I have looked at openMosix, but since development on it has stopped I have decided against using it. I would prefer to use the latest or next-to-latest version of a mainstream Linux distribution (SUSE 11, SUSE 10.3, Fedora 9, etc.).
Does anyone know any good sites (or books) that explain how to get a cluster up and running using free, open-source applications that are common on most mainstream distributions?
I would like a load-balancing cluster for custom software I will be writing. I cannot use something like Folding@home because I need constant contact with every part of the application; for example, in a simulation one computer might control where rain is falling while another controls what my herbivores are doing.
I recently set up an OpenMPI cluster using Ubuntu. An existing write-up is at https://wiki.ubuntu.com/MpichCluster .
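A minimal sketch of that route on Ubuntu; the package names are the usual Ubuntu ones, while the hostfile contents and program name are placeholders:

    # Install OpenMPI on every node, then drive the run from one of them.
    sudo apt-get install openmpi-bin libopenmpi-dev
    printf 'node1 slots=4\nnode2 slots=4\n' > hosts
    mpirun -np 8 --hostfile hosts ./simulation   # hypothetical MPI program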
Your question is too vague. What cluster application do you want to use?
By far the easiest way to set up a "cluster" is to install Folding@home on each of your machines, but I doubt that's really what you're asking for.
I have set up clusters for music/video transcoding using simple bash scripts and ssh shared keys before.
I manage mail server clusters at work.
You only need a cluster if you know what you want to do. Come back with an actual requirement, and someone will suggest a solution.
Take a look at Rocks. It's a full-blown cluster "distribution" based on CentOS 5.1. It installs everything you need (libraries, applications, and tools) to run a cluster and is dead simple to install and use. You do all the tweaking and configuration on the master node, and it helps you kickstart all the other nodes. I recently installed a 1200+ node (over 10,000 cores!) cluster with it, and I would not hesitate to install it on a 4-node cluster, since installing the master is hardly any work!
You can either run applications written for cluster libraries such as MPI or PVM, use the queue system (Sun Grid Engine) to distribute any type of job, or use distcc to compile the code of your choice on all nodes!
And it's open source, GPL, free, everything you like!
I think he's looking for something similar to openMosix, some kind of general-purpose cluster on top of which any application can run distributed across the nodes. AFAIK there's nothing like that available; MPI-based clusters are the closest thing you can get, but I think you can only run MPI applications on them.
Linux Virtual Server
http://www.linuxvirtualserver.org/
I use PVM and it works. But even with just a nice SSH setup that allows logging in to the machines without entering a password, you can easily launch commands remotely on your different compute nodes.
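For example, a minimal sketch of fanning work out over SSH with key-based login; the host names and the command are placeholders:

    # Launch one step of the simulation on every node and wait for all of them.
    NODES="node1 node2 node3"
    for node in $NODES; do
        ssh "$node" "./simulation_step --node $node" &   # hypothetical command
    done
    wait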

Resources