How to unzip .gz files in a new directory in hadoop?

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?

I can think of three different ways to achieve this.
Using Linux command line
The following command worked for me:
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz
The output gets stored in /tmp/unzipped/Links.txt
Using Java program
In the Hadoop: The Definitive Guide book, there is a section on Codecs. In that section, there is a program to decompress the output using CompressionCodecFactory. I am reproducing that code here:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);

        // Infer the codec from the file extension (.gz -> GzipCodec)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        // The output path is the input path minus the codec's extension (.gz)
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute this as:
FileDecompressor <gzipped file name>
For example, when I executed it for my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location: /tmp/Links.txt
It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
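For instance, a minimal shell wrapper could look like the sketch below. It assumes the program has been modified to accept <input file path> <output folder> as described, and has been packaged into a jar; the jar name (hadooptests.jar) and the HDFS paths are only placeholders, not from the original post.
for f in $(hadoop fs -ls /tmp/gz_input/*.gz | awk '{print $NF}'); do
    # hadooptests.jar and the paths below are example names
    hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor "$f" /tmp/unzipped/
done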
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
Store A into '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 contains the unzipped file.
Hence, we need to explicitly rename it using the following command and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
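As an untested sketch: if you save the script as unzip.pig and replace the hard-coded paths with Pig parameters (the names unzip.pig, INPUT and OUTPUT below are just examples), the wrapper could be:
for f in $(hadoop fs -ls /tmp/gz_input/*.gz | awk '{print $NF}'); do
    out=$(basename "$f" .gz)
    pig -param INPUT="$f" -param OUTPUT="/tmp/unzipped/$out" unzip.pig
done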

Bash solution
In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure all files inside the zip files would be extracted onto HDFS.
I have created a simple bash script. Comments should give you a clue what is going on. There is a short description below.
#!/bin/bash

workdir=/tmp/unziphdfs/
cd $workdir

# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
for hdfsfile in $zips
do
    echo $hdfsfile
    # copy to temp folder to unpack
    hdfs dfs -copyToLocal $hdfsfile $workdir
    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")

    # unpack locally and remove
    unzip $zipname
    rm -rf $zipname

    # copy files back to hdfs
    files=$(ls $workdir)
    for file in $files; do
        hdfs dfs -copyFromLocal $file $hdfsdir
        rm -rf $file
    done

    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done
Description
Get all the *.zip files in an hdfs dir
One-by-one: copy zip to a temp dir (on filesystem)
Unzip
Copy all the extracted files to the dir of the zip file
Cleanup
I managed to have it working with a sub-dir structure for many zip files in each, using /mypath/*/*.zip.
Good luck :)

If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).
hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
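Since the question asks about a whole folder of .gz files, a small loop over the directory should work too. This is only a sketch with placeholder paths (/tmp/gz_input and /tmp/unzipped); it drops the .gz suffix from each output name.
hadoop fs -mkdir -p /tmp/unzipped
for f in $(hadoop fs -ls /tmp/gz_input/*.gz | awk '{print $NF}'); do
    name=$(basename "$f" .gz)
    hadoop fs -text "$f" | hadoop fs -put - "/tmp/unzipped/$name"
done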

You can do this using Hive (assuming it is text data).
create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;
The data will be uncompressed into a new set of files.
If you do not want to change the names and you have enough storage on the node where you are running, you can do this:
hadoop fs -get <your_source_directory> <directory_name>
This will create a directory wherever you run the hadoop command. cd into it and gunzip all the files.
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
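Put together, the whole sequence is a short sketch like this (gz_local is just an example name for the local working directory):
hadoop fs -get <your_source_directory> gz_local
cd gz_local
gunzip *.gz    # decompresses every .gz file in place, dropping the .gz suffix
cd ..
hadoop fs -moveFromLocal gz_local <target_hdfs_path>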

Providing the Scala code:

import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.io.IOUtils

val conf = new org.apache.hadoop.conf.Configuration()

def extractFile(sparkSession: SparkSession, compath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compath)
  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No valid codec found for $compath")
  }
  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  uncompPath
}

Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.

Related

Creating Merged Folder of Symlinks on macOS

I'm attempting to use the perl script below to create symlinks of all folders from three locations. My desired result would be something like:
Source
/TV720/GoT/Season00/
/TV1080/GoT/Season01/
/TV2160/GoT/Season02/
Destination (Symlink)
/TV/GoT/Season00/
/TV/GoT/Season01/
/TV/GoT/Season02/
However when I run the script, I get: /TV/GoT/Season00/ without the other folders found in other sources.
It appears the 2nd and 3rd source location sub-folders aren't symlinked and merged in the event of duplicate folder names.
#!/usr/bin/perl
use strict;
use warnings;

my @sourceList = (
    "/Volumes/Disk/TV720",
    "/Volumes/Disk/TV1080",
    "/Volumes/Disk/TV2160",
);
my $destinationFolder = "/Volumes/Disk/TV";

foreach my $currentSource (@sourceList) {
    opendir SDIR, $currentSource or do {
        warn "$0: can't opendir $currentSource: $!\n";
        next;
    };
    my @sourceFolders = grep { not /^\.{1,2}$/ } readdir SDIR;
    closedir SDIR;

    foreach my $currentFolder (sort @sourceFolders) {
        my $fromPath = $currentSource . '/' . $currentFolder;
        my $toPath   = $destinationFolder . '/' . $currentFolder;
        if (not -e $toPath) {
            # print "Creating $toPath as symlink to $fromPath\n";
            symlink $fromPath, $toPath
                or warn "$0: can't symlink $toPath to $fromPath: $!\n";
        }
    }
}
Given the information above, you might be able to accomplish what you want with union mounts on the file system. This is where you can overlay directories on top of other directories. It's more advanced than a symlink for sure, but it might do what you want.
If you want to continue with the symlinks, I would suggest modifying your script to loop through your sources and recursively iterate the subfolders in each directory.
Then for each subfolder, create an actual new folder in the destination, then iterate the files in the source subfolder and symlink each file individually.

Renaming files with a specific scheme

I have a FTP folder receiving files from a remote camera. The camera stores the video file name always as ./rec_YYYY-MM-DD_HH-MM.mkv. The video files are stored all in the same folder, the root folder from the FTP server.
I need to move these files to another folder, with this new scheme:
Remove rec_ from the file name.
Change date format to DD-MM-YY.
Remove date from the file name and make it a folder instead, where that same file and all the others in the same date will be stored in.
Final file path would be: ./DD-MM-YYYY/HH-MM.mkv.
The process would continue to all the files, putting them in the folder corresponding to the day it was created.
Summing up: ./rec_YYYY-MM-DD_HH-MM.mkv >> ./DD-MM-YYYY/HH-MM.mkv. The same should apply to all files that are in the same folder.
As I can't make it happen directly from the camera, this needs to be done with Bash on the server that is receiving the files.
So far, what I have is this script, which gets the file's creation date to make a folder, and then gets the creation time to move the file under its new name:
for f in *.mp4
do
    mkdir "$f" "$(date -r "$f" +"%d-%m-%Y")"
    mv -n "$f" "$(date -r "$f" +"%d-%m-%Y/%H-%M-%S").mp4"
done
I'm getting this output (with testfile 1.mp4):
It creates the folder based on the file's creation date;
it renames the file to its creation time;
Then, it returns mkdir: cannot create directory ‘1.mp4’: File exists
If there are two or more files, only one gets renamed and moved as described. The others stay the same and the terminal returns:
mkdir: cannot create directory ‘1.mp4’: File exists
mkdir: cannot create directory ‘2.mp4’: File exists
mkdir: cannot create directory ‘12-12-2018’: File exists
Could someone help me out? Better suggestions? Thanks!
Honestly I would just use Perl or Python for this. Here's how to embed either in a shell script.
Here's a perl script that doesn't use any libraries, even ones that ship with Perl (so it'll work without extra packages on distributions like CentOS that don't ship with the entire Perl library). The perl script launches one new process per file in order to perform the copy.
perl -e '
while (my $path = glob(q[*.mp4 *.mkv])) {
    # split on "-", "_" and "." ("-" goes first so it is not read as a range)
    my ($prefix, $year, $month, $day, $hour, $minute, $ext) =
        split /[-._]/, $path;
    my $sec = q[00];
    die "unexpected prefix ($prefix) in $path"
        unless $prefix eq q[rec];
    die "unexpected extension ($ext) in $path"
        unless $ext eq q[mp4] or $ext eq q[mkv];
    my $dir = "$day-$month-$year";
    my $name = "$hour-$minute-$sec" . q[.] . $ext;
    my $destpath = $dir . q[/] . $name;
    die "$destpath is unexpectedly a directory" if (-d $destpath);
    mkdir $dir unless -d $dir;
    system("cp", "--", $path, $destpath);
}
'
Here's a Python example; it uses only the standard library and does not spawn any additional processes. (The FileExistsError handling requires Python 3.)
python3 -c '
import os.path as path
import re
from glob import iglob
from itertools import chain
from os import mkdir
from shutil import copyfile

for p in chain(iglob("*.mp4"), iglob("*.mkv")):
    fields = re.split("[-]|[._]", p)
    prefix = fields[0]
    year = fields[1]
    month = fields[2]
    day = fields[3]
    hour = fields[4]
    minute = fields[5]
    ext = fields[6]
    sec = "00"
    assert prefix == "rec"
    assert ext in ["mp4", "mkv"]
    directory = "".join([day, "-", month, "-", year])
    name = "".join([hour, "-", minute, "-", sec, ".", ext])
    destpath = "".join([directory, "/", name])
    assert not path.isdir(destpath)
    try:
        mkdir(directory)
    except FileExistsError:
        pass
    copyfile(src=p, dst=destpath)
'
Finally, here's a bash solution. It splits paths using -, ., and _ and then extracts the subfields as positional parameters ($1 through $8) inside a function. That trick is portable, although the regex substitution on variables is a bash extension.
#!/bin/bash

#              $1   $2  $3   $4 $5 $6 $7 $8
#              path rec YYYY MM DD HH MM ext
process_file() {
    mkdir "$5-$4-$3" &> /dev/null
    cp -- "$1" "$5-$4-$3"/"$6-$7-00.$8"
}

for path in *.m{p4,kv}; do
    [ -e "$path" ] || continue
    # NOTE: two slashes are needed in the substitution to replace everything
    # read -a ARRAYVAR <<< ... reads the words of a string into an array
    IFS=' ' read -a f <<< "${path//[-_.]/ }"
    process_file "$path" "${f[@]}"
done
If you cd /to/some/directory/containing_your_files then you could use the following script
#!/usr/bin/env bash
for f in rec_????-??-??_??-??.m{p4,kv} ; do
    dir=${f:4:10}   # skip 4 chars ('rec_'), take 10 chars ('YYYY-MM-DD')
    fnm=${f:15}     # skip 15 chars, take the remainder
    test -d "$dir" || mkdir "$dir"
    mv "$f" "$dir"/"$fnm"
done
Note ① that I have not exchanged the years and the days; if you absolutely need to do the swap you can extract the year like this: year=${dir::4}, etc., and ② that this method of parameter substitution is a Bash-ism, e.g. it doesn't work in dash.
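If you do want the DD-MM-YYYY order, a sketch of the swap using the same substring trick (to be placed inside the loop above, after dir is set):
year=${dir::4}     # 'YYYY'
month=${dir:5:2}   # 'MM'
day=${dir:8:2}     # 'DD'
dir="$day-$month-$year"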
Your problem is that mkdir creates folders, but you are passing it the file name as one of its arguments, so it tries to create a directory named after an already existing file.
If you want to use the file name for folder creation, use it without the extension; as written, you are trying to create a folder with an already existing file name.

Source local file to HDFS sink by using flume

I am using Flume to move a local file to an HDFS sink; below is my conf:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/download/test_data/
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = fileName
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://172.16.10.5/user/admin/Data/
a1.sinks.k1.hdfs.filePrefix = %{fileName}
a1.sinks.k1.hdfs.idleTimeout=60
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 5000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
And I used the user 'flume' to execute this conf file.
time bin/flume-ng agent -c conf -f conf/hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
But it fails, reporting that it could not find the local file (permission denied):
Could not find file: /usr/download/test_data/sale_record0501.txt
java.io.FileNotFoundException: /usr/download/test_data/.flumespool/.flumespool-main.meta (Permission denied)
How to solve this?
Your flume user may not have permission under the spooling directory. The spooling directory is under /usr, which may require root permission to access.
First become root with sudo su and then execute, or replace your execution command with:
sudo bin/flume-ng agent -c conf -f conf/hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
On the other hand, you can give the flume user permission with:
cd /usr/download/
sudo chown -R flume:somegroup test_data
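As a quick sanity check (assuming the agent really runs as the flume user), you can verify that flume can both list the spooling directory and write into it, since the error above occurs while creating the .flumespool metadata directory there:
sudo -u flume ls -l /usr/download/test_data/
sudo -u flume touch /usr/download/test_data/.flume_perm_test && sudo -u flume rm /usr/download/test_data/.flume_perm_test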

How to extract a .gz file in hadoop cluster environment?

How can I extract a .gz file in a hadoop cluster environment via Java code, without copying the file to local (i.e. extracting directly in hadoop through code)?
"gunzip -k file.gz" is usually used to unpack a .gz file while keeping the original .gz as well; is that what you were looking for?
Assuming your .gz file has a single file in it, you can do it like this:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);

FSDataInputStream in = fs.open(inFile);
// let the codec factory pick the gzip codec based on the .gz extension
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(inFile);
InputStream gis = codec.createInputStream(in);
FSDataOutputStream out = fs.create(outFile);
doCopy(gis, out);

public static void doCopy(InputStream is, FSDataOutputStream os) throws Exception {
    int oneByte;
    while ((oneByte = is.read()) != -1) {
        os.write(oneByte);
    }
    os.close();
    is.close();
}
The other way of doing this is to create a shell script and run it whenever you need, or else invoke the terminal commands from your code.
If you want to do it from the terminal, the gzip/gunzip syntax is:
gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [ name ... ]
gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ... ]
source: https://www.techonthenet.com/linux/commands/gzip.php

Any command to get active namenode for nameservice in hadoop?

The command:
hdfs haadmin -getServiceState machine-98
Works only if you know the machine name. Is there any command like:
hdfs haadmin -getServiceState <nameservice>
which can tell you the IP/hostname of the active namenode?
To print out the namenodes use this command:
hdfs getconf -namenodes
To print out the secondary namenodes:
hdfs getconf -secondaryNameNodes
To print out the backup namenodes:
hdfs getconf -backupNodes
Note: These commands were tested using Hadoop 2.4.0.
Update 10-31-2014:
Here is a python script that will read the NameNodes involved in Hadoop HA from the config file and determine which of them is active by using the hdfs haadmin command. This script is not fully tested as I do not have HA configured. Only tested the parsing using a sample file based on the Hadoop HA Documentation. Feel free to use and modify as needed.
#!/usr/bin/env python
# coding: UTF-8
import xml.etree.ElementTree as ET
import subprocess as SP

if __name__ == "__main__":
    hdfsSiteConfigFile = "/etc/hadoop/conf/hdfs-site.xml"
    tree = ET.parse(hdfsSiteConfigFile)
    root = tree.getroot()
    hasHadoopHAElement = False
    activeNameNode = None
    for property in root:
        if "dfs.ha.namenodes" in property.find("name").text:
            hasHadoopHAElement = True
            nameserviceId = property.find("name").text[len("dfs.ha.namenodes")+1:]
            nameNodes = property.find("value").text.split(",")
            for node in nameNodes:
                # get the namenode machine address then check if it is active node
                for n in root:
                    prefix = "dfs.namenode.rpc-address." + nameserviceId + "."
                    elementText = n.find("name").text
                    if prefix in elementText:
                        nodeAddress = n.find("value").text.split(":")[0]
                        args = ["hdfs haadmin -getServiceState " + node]
                        p = SP.Popen(args, shell=True, stdout=SP.PIPE, stderr=SP.PIPE)
                        for line in p.stdout.readlines():
                            if "active" in line.lower():
                                print "Active NameNode: " + node
                                break
                        for err in p.stderr.readlines():
                            print "Error executing Hadoop HA command: ", err
                            break
    if not hasHadoopHAElement:
        print "Hadoop High-Availability configuration not found!"
Found this:
https://gist.github.com/cnauroth/7ff52e9f80e7d856ddb3
This works out of the box on my CDH5 namenodes, although I'm not sure other hadoop distributions will have http://namenode:50070/jmx available - if not, I think it can be added by deploying Jolokia.
Example:
curl 'http://namenode1.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'
{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeStatus",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "State" : "active",
    "NNRole" : "NameNode",
    "HostAndPort" : "namenode1.example.com:8020",
    "SecurityEnabled" : true,
    "LastHATransitionTime" : 1436283324548
  } ]
}
So by firing off one http request to each namenode (this should be quick) we can figure out which one is the active one.
It's also worth noting that if you talk WebHDFS REST API to an inactive namenode you will get a 403 Forbidden and the following JSON:
{"RemoteException":{"exception":"StandbyException","javaClassName":"org.apache.hadoop.ipc.StandbyException","message":"Operation category READ is not supported in state standby"}}
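For example, one way to turn this into a check is a WebHDFS call like the sketch below, run against each namenode in turn (the host is a placeholder): the active namenode returns a directory listing, while the standby returns the StandbyException JSON above.
curl -i "http://namenode1.example.com:50070/webhdfs/v1/?op=LISTSTATUS"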
In a High Availability Hadoop cluster, there will be 2 namenodes - one active and one standby.
To find the active namenode, we can try executing the test hdfs command on each of the namenodes and find the active name node corresponding to the successful run.
Below command executes successfully if the name node is active and fails if it is a standby node.
hadoop fs -test -e hdfs://<Name node>/
Unix script
active_node=''
if hadoop fs -test -e hdfs://<NameNode-1>/ ; then
active_node='<NameNode-1>'
elif hadoop fs -test -e hdfs://<NameNode-2>/ ; then
active_node='<NameNode-2>'
fi
echo "Active Dev Name node : $active_node"
You can do it in bash with hdfs CLI calls, too, with the caveat that this takes a bit more time since it makes a few calls to the API in succession; but this may be preferable to using a python script for some.
This was tested with Hadoop 2.6.0
get_active_nn(){
    ha_name=$1  # Needs the NameServiceID
    ha_ns_nodes=$(hdfs getconf -confKey dfs.ha.namenodes.${ha_name})
    active=""
    for node in $(echo ${ha_ns_nodes//,/ }); do
        state=$(hdfs haadmin -getServiceState $node)
        if [ "$state" == "active" ]; then
            active=$(hdfs getconf -confKey dfs.namenode.rpc-address.${ha_name}.${node})
            break
        fi
    done
    if [ -z "$active" ]; then
        >&2 echo "ERROR: no active namenode found for ${ha_name}"
        exit 1
    else
        echo $active
    fi
}
After reading all the existing answers, none seemed to combine the three steps of:
Identifying the namenodes from the cluster.
Resolving the node names to host:port.
Checking the status of each node (without requiring cluster admin privs).
The solution below combines hdfs getconf calls and a JMX service call for node status.
#!/usr/bin/env python

from subprocess import check_output
import urllib, json, sys

def get_name_nodes(clusterName):
    ha_ns_nodes = check_output(['hdfs', 'getconf', '-confKey',
                                'dfs.ha.namenodes.' + clusterName])
    nodes = ha_ns_nodes.strip().split(',')
    nodeHosts = []
    for n in nodes:
        nodeHosts.append(get_node_hostport(clusterName, n))
    return nodeHosts

def get_node_hostport(clusterName, nodename):
    hostPort = check_output(
        ['hdfs', 'getconf', '-confKey',
         'dfs.namenode.rpc-address.{0}.{1}'.format(clusterName, nodename)])
    return hostPort.strip()

def is_node_active(nn):
    jmxPort = 50070
    host, port = nn.split(':')
    url = "http://{0}:{1}/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus".format(
        host, jmxPort)
    nnstatus = urllib.urlopen(url)
    parsed = json.load(nnstatus)
    return parsed.get('beans', [{}])[0].get('State', '') == 'active'

def get_active_namenode(clusterName):
    for n in get_name_nodes(clusterName):
        if is_node_active(n):
            return n

clusterName = (sys.argv[1] if len(sys.argv) > 1 else None)
if not clusterName:
    raise Exception("Specify cluster name.")

print 'Cluster: {0}'.format(clusterName)
print "Nodes: {0}".format(get_name_nodes(clusterName))
print "Active Name Node: {0}".format(get_active_namenode(clusterName))
From the Java API, you can use HAUtil.getAddressOfActive(fileSystem).
You can run a curl command against the Ambari REST API to find out the active and standby NameNodes, for example:
curl -u username -H "X-Requested-By: ambari" -X GET http://cluster-hostname:8080/api/v1/clusters//services/HDFS
Regards
I found a couple of helpful commands when I simply typed 'hdfs', which could be useful for someone coming here looking for help.
hdfs getconf -namenodes
This above command will give you the service id of the namenode. Say, hn1.hadoop.com
hdfs getconf -secondaryNameNodes
This above command will give you the service id of the available secondary namenodes. Say, hn2.hadoop.com
hdfs getconf -backupNodes
This above command will get you the service id of backup nodes, if any.
hdfs getconf -nnRpcAddresses
This above command will give you info of name service id along with rpc port number. Say, hn1.hadoop.com:8020
You're Welcome :)
In HDFS 2.6.0, the one that worked for me:
ubuntu@platform2:~$ hdfs getconf -confKey dfs.ha.namenodes.arkin-platform-cluster
nn1,nn2
ubuntu@platform2:~$ sudo -u hdfs hdfs haadmin -getServiceState nn1
standby
ubuntu@platform2:~$ sudo -u hdfs hdfs haadmin -getServiceState nn2
active
Here is an example of bash code that returns the active name node even if you do not have a local hadoop installation.
It also works faster, as curl calls are usually faster than hadoop calls.
Checked on Cloudera 7.1.
#!/bin/bash
export nameNode1=myNameNode1
export nameNode2=myNameNode2
active_node=''
T1=`curl --silent --insecure --request GET https://${nameNode1}:9871/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus | grep "\"State\" : \"active\"" | wc -l`
if [ $T1 == 1 ]
then
    active_node=${nameNode1}
else
    T1=`curl --silent --insecure --request GET https://${nameNode2}:9871/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus | grep "\"State\" : \"active\"" | wc -l`
    if [ $T1 == 1 ]
    then
        active_node=${nameNode2}
    fi
fi
echo "Active Dev Name node : $active_node"
#!/usr/bin/python
import subprocess
import sys
import os, errno

def getActiveNameNode():
    cmd_string = "hdfs getconf -namenodes"
    process = subprocess.Popen(cmd_string, shell=True, stdout=subprocess.PIPE)
    out, err = process.communicate()
    NameNodes = out
    Value = NameNodes.split()
    for val in Value:
        cmd_str = "hadoop fs -test -e hdfs://" + val
        process = subprocess.Popen(cmd_str, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        # the test only runs cleanly (nothing on stderr) against the active namenode;
        # the standby reports a StandbyException
        if (err == ""):
            return val

def main():
    out = getActiveNameNode()
    print(out)

if __name__ == '__main__':
    main()
You can simply use the below command. I have tested this in Hadoop 3.0; you can check the reference here.
hdfs haadmin -getAllServiceState
It returns the state of all the NameNodes.
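Since it prints one line per NameNode with its host:port and state, a quick way to pull out just the active one is a sketch like the following (assuming that output format):
hdfs haadmin -getAllServiceState | grep -i active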
more /etc/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>namenode1353,namenode1357</value>
</property>
hdfs@:/home/ubuntu$ hdfs haadmin -getServiceState namenode1353
active
hdfs@:/home/ubuntu$ hdfs haadmin -getServiceState namenode1357
standby
