How to have LZO compression in Hadoop MapReduce? - hadoop

I want to use LZO to compress map output, but I can't get it to run. The version of Hadoop I'm using is 0.20.2. I set:
conf.set("mapred.compress.map.output", "true")
conf.set("mapred.map.output.compression.codec",
"org.apache.hadoop.io.compress.LzoCodec");
When I run the jar file in Hadoop, it throws an exception saying it can't write the map output.
Do I have to install LZO?
What do I have to do to use LZO?

LZO's license (GPL) is incompatible with Hadoop's (Apache), so it cannot be bundled with Hadoop. You need to install LZO separately on the cluster.
The following steps were tested on Cloudera's Demo VM (CentOS 6.2, x64), which comes with the full stack of CDH 4.2.0 and CM Free Edition installed, but they should work on any Red Hat-based Linux.
The installation consists of the following steps:
Installing LZO
sudo yum install lzop
sudo yum install lzo-devel
Installing ANT
sudo yum install ant ant-nodeps ant-junit java-devel
Downloading the source
git clone https://github.com/twitter/hadoop-lzo.git
Compiling Hadoop-LZO
ant compile-native tar
For further instructions and troubleshooting see https://github.com/twitter/hadoop-lzo
Copying the Hadoop-LZO jar to the Hadoop libs
sudo cp build/hadoop-lzo*.jar /usr/lib/hadoop/lib/
Moving native code to Hadoop native libs
sudo mv build/hadoop-lzo-0.4.17-SNAPSHOT/lib/native/Linux-amd64-64/ /usr/lib/hadoop/lib/native/
cp /usr/lib/hadoop/lib/native/Linux-amd64-64/libgplcompression.* /usr/lib/hadoop/lib/native/
Adjust the version number to match the version you cloned.
When working with a real cluster (as opposed to a pseudo-cluster), you need to rsync these to the rest of the machines:
rsync /usr/lib/hadoop/lib/ to all hosts.
You can do a dry run first with rsync -n.
Log in to Cloudera Manager
Select from Services: mapreduce1->Configuration
Client->Compression
Add to Compression Codecs:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
Search "valve"
Add to MapReduce Service Configuration Safety Valve
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
mapred.child.env="JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64/"
Add to MapReduce Service Environment Safety Valve
HADOOP_CLASSPATH=/usr/lib/hadoop/lib/*
That's it.
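With that in place, the configuration from the question works once the codec class name points at the Twitter package; a minimal sketch using the old mapred API from 0.20.x (job setup details omitted):

import org.apache.hadoop.mapred.JobConf;

public class LzoMapOutputConf {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Compress intermediate map output
        conf.setBoolean("mapred.compress.map.output", true);
        // Note: after installing hadoop-lzo, the codec lives in
        // com.hadoop.compression.lzo, not org.apache.hadoop.io.compress
        // as in the question
        conf.set("mapred.map.output.compression.codec",
                "com.hadoop.compression.lzo.LzoCodec");
    }
}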
Your MapReduce jobs that use TextInputFormat will work seamlessly with .lzo files. However, if you choose to index the LZO files to make them splittable (using com.hadoop.compression.lzo.DistributedLzoIndexer), you will find that the indexer writes a .index file next to each .lzo file. This is a problem because your TextInputFormat will interpret these as part of the input. In this case you need to change your MR jobs to use LzoTextInputFormat.
As for Hive, as long as you don't index the LZO files, the change is also transparent. If you start indexing (to take advantage of better data distribution) you will need to update the input format to LzoTextInputFormat. If you use partitions, you can do it per partition.
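To make the input-format switch concrete for the MapReduce case, a minimal sketch with the new mapreduce API (the input path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "lzo-input");
        // LzoTextInputFormat uses the .index files written by
        // DistributedLzoIndexer to create splits, and does not treat
        // the .index files themselves as input
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));  // placeholder
    }
}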

Related

ElasticSearch - uninstall version 6.4.3, install version 6.4.2 - Linux Ubuntu

We have a 3-node cluster with Elasticsearch 6.4.3 on Ubuntu 16.04. There is nothing beyond the fresh install of ES: no indexes, no Kibana, no Beats, no Logstash, etc.
I have been asked to downgrade to version 6.4.2. I have limited Linux experience, but enough to run command-line commands and understand the output. Google has led me to bits and pieces about accomplishing this, but I'd feel a lot less anxiety if someone with ES experience could point me to something more step-by-step.
I do have this link to download 6.4.2, but one of the things I need to know is which file to download: https://www.elastic.co/downloads/past-releases/elasticsearch-6-4-2
Sure, here you go with a step-by-step guide; I did this for you, using your version.
Using the link you mentioned, https://www.elastic.co/downloads/past-releases/elasticsearch-6-4-2, download the tar file to your local system.
Use scp to transfer the .tar file to your Ubuntu instance; I used my AWS Ubuntu instance.
scp -i ~/your-identity-file ~/Desktop/elasticsearch-6.4.2.tar.gz ubuntu@aws-ec2-instance-ip:/home/ubuntu
Untar the file with the tar -xvf elasticsearch-6.4.2.tar.gz command.
Go to the config folder (cd elasticsearch-6.4.2/config/) and set the proper values in elasticsearch.yml.
Start Elasticsearch from the bin folder with the ./elasticsearch command.
Update: based on the chat with the OP, adding the official ES links https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html and https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html for detailed instructions.

How to get HDFS and YARN version programmatically?

I'm writing a Spark program that downloads different jars from Maven based on the environment it runs on, one for each Hadoop distribution (e.g. CDH, HDP, MapR).
This is necessary because some low-level APIs of HDFS and YARN are not shared between these distributions. However, I cannot find any public API of HDFS and YARN that reports their version.
Is it possible to do this in Java alone, or do I have to run an external shell command to find out?
In Java, org.apache.hadoop.util.VersionInfo.getVersion() should work.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/util/VersionInfo.html
For the CLIs, you can use:
$ hadoop version
$ hdfs version
$ yarn version
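A minimal sketch of the Java route (these VersionInfo methods are part of Hadoop's public API; note they report the version of the jars on the classpath, not necessarily the cluster's):

import org.apache.hadoop.util.VersionInfo;

public class PrintHadoopVersion {
    public static void main(String[] args) {
        // e.g. "2.6.0-cdh5.5.1" on a vendor build
        System.out.println("Version: " + VersionInfo.getVersion());
        // Build details can help tell vendor builds apart
        System.out.println("Build:   " + VersionInfo.getBuildVersion());
    }
}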

Greenplum installation error

While installing Greenplum, we get the error below after running the gpcheck command:
GPCHECK_ERROR: uname -r output is different among hosts.
On two machines we have installed CentOS 6, and on one machine we have installed CentOS 7.
Is it necessary for a Greenplum installation that all hosts have the same OS version?
Should we ignore this error and go ahead?
You must have the same OS version on all the cluster machines. The Greenplum home directory is used for installing gppkgs (add-ons), which are in fact packaged RPMs. Greenplum initializes an RPM database inside the GPDB home directory for managing add-ons. Whenever you run "gpseginstall" (installation, expansion), GPDB copies the contents of the GPDB home directory to the other hosts. However, an RPM database created on one version of the OS is not valid on another, so you would get errors trying to install/list/remove packages there.
In general, if you don't plan to use any gppkgs and use the cluster merely for PoC purposes, this should work, but I would strongly recommend using the same OS version on all the cluster hosts.
It is recommended to have the same OS (kernel). If it is not a production environment, you can try ignoring it; I have never tested that.

How to find the CDH version of Hadoop

When connecting to a Hadoop cluster, how can I know which version of Hadoop the cluster is running? In particular, this is important for properly configuring libraries when compiling and packaging Hadoop Java jobs with Maven.
The simplest way, if you have ssh access to a Hadoop node, is to run the command:
$ hadoop version
If you are looking for the CDH version, check /usr/lib/hadoop/cloudera/cdh_version.properties.
In CDH, in the cluster I am using, there is no cdh_version.properties (or I couldn't find it).
If your cluster uses "Parcels", you can check which version of CDH is used by looking at:
/opt/cloudera/parcels
You will see the version as the name of the folder:
CDH-5.5.1-1.cdh5.5.1.p0.11
Note: I know this is not a general rule for finding which CDH version is used. I am showing an alternative way that worked for me.
We can check the installed version with the following command:
cat /usr/lib/hadoop/cloudera/cdh_version.properties
Hope this helps.
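If you need the same information from code, the properties file can be read with java.util.Properties; a sketch, assuming the file exists at the path above (the key names inside vary by release, so it simply prints everything):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class CdhVersion {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // The same file the answers above inspect with cat
        try (FileInputStream in = new FileInputStream(
                "/usr/lib/hadoop/cloudera/cdh_version.properties")) {
            props.load(in);
        }
        // Key names vary by CDH release, so print every entry
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}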

How to install Cloudera Impala on EMR?

Is there any way I can install only Impala, without Cloudera Manager and without CDH? I will be using the Apache version of Hadoop.
Yes, it is absolutely possible. Add the repository to your sources.list file and update the package index after that.
deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib
deb-src http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib
After that, it's merely:
sudo apt-get install impala (Binaries for daemons)
sudo apt-get install impala-server (Service start/stop script)
sudo apt-get install impala-state-store (Service start/stop script)
But do not forget to meet all the prerequisites. For detailed info, you can go here.
You can view detailed instructions on how to install and use Impala with Amazon EMR here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-impala.html
EMR is based on an Amazon Hadoop distribution that runs on top of Debian Squeeze. So yes, it's possible using Cloudera's DEB repo.
You will need to SSH to your EMR master node; find the address on the EMR console.
You will also need to open rules on the security group you have assigned to your EMR cluster if you intend to connect to Impala using a JDBC/ODBC client from the outside world.
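On that last point, a hedged sketch of connecting with the Hive JDBC driver (the hostname is a placeholder; 21050 is Impala's default JDBC port, and auth=noSasl assumes security is disabled):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc driver and its dependencies on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://emr-master-public-dns:21050/;auth=noSasl";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}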
