Hadoop Pig fs test command - hadoop

Wondering what does this line mean? Searched around but cannot find a reference for this command,
Pig.fs("test -e " + pathToCheck) == 0:
thanks in advance,
Lin

Use the command line tool and run hadoop fs -help and got :
-test -[defsz] <path> :
Answer various questions about <path>, with result via exit status.
-d return 0 if <path> is a directory.
-e return 0 if <path> exists.
-f return 0 if <path> is a file.
-s return 0 if file <path> is greater than zero bytes in size.
-z return 0 if file <path> is zero bytes in size, else return 1.

Related

baseDir issue with nextflow

This might be a very basic question for you guys, however, I am have just started with nextflow and I struggling with the simplest example.
I first explain what I have done and the problem.
Aim: I aim to make a workflow for my bioinformatics analyses as the one here (https://www.nextflow.io/example4.html)
Background: I have installed all the packages that were needed and they all work from the console without any error.
My run: I have used the same script as in example only by replacing the directory names. Here is how I have arranged the directories
location of script
~/raman/nflow/script.nf
location of Fastq files
~/raman/nflow/Data/T4_1.fq.gz
~/raman/nflow/Data/T4_2.fq.gz
Location of transcriptomic file
~/raman/nflow/Genome/trans.fa
The script
#!/usr/bin/env nextflow
/*
* The following pipeline parameters specify the refence genomes
* and read pairs and can be provided as command line options
*/
params.reads = "$baseDir/Data/T4_{1,2}.fq.gz"
params.transcriptome = "$baseDir/HumanGenome/SalmonIndex/gencode.v42.transcripts.fa"
params.outdir = "results"
workflow {
read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
INDEX(params.transcriptome)
FASTQC(read_pairs_ch)
QUANT(INDEX.out, read_pairs_ch)
}
process INDEX {
tag "$transcriptome.simpleName"
input:
path transcriptome
output:
path 'index'
script:
"""
salmon index --threads $task.cpus -t $transcriptome -i index
"""
}
process FASTQC {
tag "FASTQC on $sample_id"
publishDir params.outdir
input:
tuple val(sample_id), path(reads)
output:
path "fastqc_${sample_id}_logs"
script:
"""
fastqc "$sample_id" "$reads"
"""
}
process QUANT {
tag "$pair_id"
publishDir params.outdir
input:
path index
tuple val(pair_id), path(reads)
output:
path pair_id
script:
"""
salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
"""
}
Output:
(base) ntr#ser:~/raman/nflow$ nextflow script.nf
N E X T F L O W ~ version 22.10.1
Launching `script.nf` [modest_meninsky] DSL2 - revision: 032a643b56
executor > local (2)
executor > local (2)
[- ] process > INDEX (gencode) -
[28/02cde5] process > FASTQC (FASTQC on T4) [100%] 1 of 1, failed: 1 ✘
[- ] process > QUANT -
Error executing process > 'FASTQC (FASTQC on T4)'
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Command executed:
fastqc "T4" "T4_1.fq.gz T4_2.fq.gz"
Command exit status:
0
Command output:
(empty)
Command error:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
Work dir:
/home/ruby/raman/nflow/work/28/02cde5184f4accf9a05bc2ded29c50
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
I believe I have an issue with my baseDir understanding. I am assuming that the baseDir is the one where I have my file script.nf I am not sure what is going wrong and how can I fix it.
Could anyone please help or guide.
Thank you
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Nextflow complains when it can't find the declared output files. This can occur even if the command completes successfully, i.e. with exit status 0. The problem here is that fastqc simply skips files that don't exist or can't be read (e.g. permissions problems), but it does produce these warnings:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
The solution is to just make sure all files exist. Note that the fromFilePairs factory method produces a list of files in the second element. Therefore quoting a space-separated pair of filenames is also problematic. All you need is:
script:
"""
fastqc ${reads}
"""

Change back DPI settings in a bash script

I would like to run a program that does not properly support my desired resolution+DPI settings.
Also I want to change my default GTK theme to a lighter one.
What I currently have:
#!/bin/bash
xfconf-query -c xsettings -p /Xft/DPI -s 0
GTK_THEME=/usr/share/themes/Adwaita/gtk-2.0/gtkrc /home/unknown/scripts/ch_resolution.py --output DP-0 --resolution 2560x1440 beersmith3
This sets my DPI settings to 0, changes the gtk-theme, runs a python script that changes my resolution and runs the program, and on program exit changes it back. This is working properly.
Now I want to change back my DPI settings to 136 on program exit
xfconf-query -c xsettings -p /Xft/DPI -s 136
My guess is I need to use a while loop but have no idea how to do it.
ch_resolution.py
#!/usr/bin/env python3
import argparse
import re
import subprocess
import sys
parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True)
parser.add_argument('--resolution', required=True)
parser.add_argument('APP')
args = parser.parse_args()
device_context = '' # track what device's modes we are looking at
modes = [] # keep track of all the devices and modes discovered
current_modes = [] # remember the user's current settings
# Run xrandr and ask it what devices and modes are supported
xrandrinfo = subprocess.Popen('xrandr -q', shell=True, stdout=subprocess.PIPE)
output = xrandrinfo.communicate()[0].decode().split('\n')
for line in output:
# luckily the various data from xrandr are separated by whitespace...
foo = line.split()
# Check to see if the second word in the line indicates a new context
# -- if so, keep track of the context of the device we're seeing
if len(foo) >= 2: # throw out any weirdly formatted lines
if foo[1] == 'disconnected':
# we have a new context, but it should be ignored
device_context = ''
if foo[1] == 'connected':
# we have a new context that we want to test
device_context = foo[0]
elif device_context != '': # we've previously seen a 'connected' dev
# mode names seem to always be of the format [horiz]x[vert]
# (there can be non-mode information inside of a device context!)
if foo[0].find('x') != -1:
modes.append((device_context, foo[0]))
# we also want to remember what the current mode is, which xrandr
# marks with a '*' character, so we can set things back the way
# we found them at the end:
if line.find('*') != -1:
current_modes.append((device_context, foo[0]))
for mode in modes:
if args.output == mode[0] and args.resolution == mode[1]:
cmd = 'xrandr --output ' + mode[0] + ' --mode ' + mode[1]
subprocess.call(cmd, shell=True)
break
else:
print('Unable to set mode ' + args.resolution + ' for output ' + args.output)
sys.exit(1)
subprocess.call(args.APP, shell=True)
# Put things back the way we found them
for mode in current_modes:
cmd = 'xrandr --output ' + mode[0] + ' --mode ' + mode[1]
subprocess.call(cmd, shell=True)
edit:
Thanks #AndreLDM for pointing out that I do not need a separate python script to change the resolution, I don't know why I didn't think of that.
I changed it so I don't need the python script and it is working properly now. If I can improve this script please tell me!
#!/bin/bash
xrandr --output DP-0 --mode 2560x1440
xfconf-query -c xsettings -p /Xft/DPI -s 0
GTK_THEME=/usr/share/themes/Adwaita/gtk-2.0/gtkrc beersmith3
if [ $? == 0 ]
then
xrandr --output DP-0 --mode 3840x2160
xfconf-query -c xsettings -p /Xft/DPI -s 136
exit 0
else
xrandr --output DP-0 --mode 3840x2160
xfconf-query -c xsettings -p /Xft/DPI -s 136
exit 1
fi

DCOS Troubles with ip-detect script

I am trying to install Mesosphere 1.10 using the advanced instructions and I have created the following ip-detect script as per the examples:
#!/usr/bin/env bash
set -o nounset -o errexit -o pipefail
export PATH=/sbin:/usr/sbin:/bin:/usr/bin:$PATH
MASTER_IP=$(dig +short master.mesos || true)
MASTER_IP=${MASTER_IP:-192.168.24.20}
INTERFACE_IP=$(ip r g ${MASTER_IP} | \
awk -v master_ip=${MASTER_IP} '
BEGIN { ec = 1 }
{
if($1 == master_ip) {
print $7
ec = 0
} else if($1 == "local") {
print $6
ec = 0
}
if (ec == 0) exit;
}
END { exit ec }
')
Before installing any dcos files, I tested this script on the intended master node and it worked perfectly.
However after installing dcos on this same node, the exact same script returns the following error:
Error: ??? prefix is expected rather than ";;".
awk: fatal: cannot open file `timed' for reading (No such file or directory)
Any ideas why this is happening? Many thanks in advance...
It's because dig is querying remote servers and not finding an entry for "master.mesos". If master.mesos == 192.168.24.20, remove the two MASTER_IP lines and add one MASTER_IP=192.168.24.20.

how parse console on travis-ci and find the number of occured word

I use github + travis-ci for continuous integration. I have a maven project with lot of tests. I want parse all console and find a special word by xpath. If this word is present x times my job is OK else my job is KO.
how parse console on travis-ci and find the number of occured word by xpath or other method?
Given log.txt as input, and desired input lines like these:
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.957 sec - in TestSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0
Assuming we just want to test the "10", this might do:
n=5
awk -F '[ ,]*' '/^Tests run:/ \
{ if ($3>'$n') { print "OK found " $3 ; x=$3 ; exit} } \
END {if (x<'$n') print "Fail."} ' log.txt
Output:
OK found 10
I try with travis API =>
travis logs
But with this API is not possible because I have a infinite loop (this command copy logs to logs so copy logs to logs so copy copy ...). This API is good if you read the logs of others builds only!!!
And I find a solution:
in .travis.yml file:
script:
- test/run.sh
in run.sh file:
curl -s "https://api.travis-ci.org/jobs/${TRAVIS_JOB_ID}/log.txt?deansi=true" > nonaui.log
expectation=`sed -n 's:.*<EXPECTED_RESULTS>\(.*\)</EXPECTED_RESULTS>.*:\1:p' nonaui.log | head -n 1`
nb_expectation=`sed -n ":;s/$expectation//p;t" nonaui.log | sed -n '$='`
# 3 = 1 (real) + 2 counters (Excel and CSV)
if [ "$nb_expectation" == "3" ]; then
echo "******** All counter is SUCCESS"
else
echo "******** All counter is FAIL"
exit 255
fi
exit 0

Any command to get active namenode for nameservice in hadoop?

The command:
hdfs haadmin -getServiceState machine-98
Works only if you know the machine name. Is there any command like:
hdfs haadmin -getServiceState <nameservice>
which can tell you the IP/hostname of the active namenode?
To print out the namenodes use this command:
hdfs getconf -namenodes
To print out the secondary namenodes:
hdfs getconf -secondaryNameNodes
To print out the backup namenodes:
hdfs getconf -backupNodes
Note: These commands were tested using Hadoop 2.4.0.
Update 10-31-2014:
Here is a python script that will read the NameNodes involved in Hadoop HA from the config file and determine which of them is active by using the hdfs haadmin command. This script is not fully tested as I do not have HA configured. Only tested the parsing using a sample file based on the Hadoop HA Documentation. Feel free to use and modify as needed.
#!/usr/bin/env python
# coding: UTF-8
import xml.etree.ElementTree as ET
import subprocess as SP
if __name__ == "__main__":
hdfsSiteConfigFile = "/etc/hadoop/conf/hdfs-site.xml"
tree = ET.parse(hdfsSiteConfigFile)
root = tree.getroot()
hasHadoopHAElement = False
activeNameNode = None
for property in root:
if "dfs.ha.namenodes" in property.find("name").text:
hasHadoopHAElement = True
nameserviceId = property.find("name").text[len("dfs.ha.namenodes")+1:]
nameNodes = property.find("value").text.split(",")
for node in nameNodes:
#get the namenode machine address then check if it is active node
for n in root:
prefix = "dfs.namenode.rpc-address." + nameserviceId + "."
elementText = n.find("name").text
if prefix in elementText:
nodeAddress = n.find("value").text.split(":")[0]
args = ["hdfs haadmin -getServiceState " + node]
p = SP.Popen(args, shell=True, stdout=SP.PIPE, stderr=SP.PIPE)
for line in p.stdout.readlines():
if "active" in line.lower():
print "Active NameNode: " + node
break;
for err in p.stderr.readlines():
print "Error executing Hadoop HA command: ",err
break
if not hasHadoopHAElement:
print "Hadoop High-Availability configuration not found!"
Found this:
https://gist.github.com/cnauroth/7ff52e9f80e7d856ddb3
This works out of the box on my CDH5 namenodes, although I'm not sure other hadoop distributions will have http://namenode:50070/jmx available - if not, I think it can be added by deploying Jolokia.
Example:
curl 'http://namenode1.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'
{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=NameNodeStatus",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
"State" : "active",
"NNRole" : "NameNode",
"HostAndPort" : "namenode1.example.com:8020",
"SecurityEnabled" : true,
"LastHATransitionTime" : 1436283324548
} ]
So by firing off one http request to each namenode (this should be quick) we can figure out which one is the active one.
It's also worth noting that if you talk WebHDFS REST API to an inactive namenode you will get a 403 Forbidden and the following JSON:
{"RemoteException":{"exception":"StandbyException","javaClassName":"org.apache.hadoop.ipc.StandbyException","message":"Operation category READ is not supported in state standby"}}
In a High Availability Hadoop cluster, there will be 2 namenodes - one active and one standby.
To find the active namenode, we can try executing the test hdfs command on each of the namenodes and find the active name node corresponding to the successful run.
Below command executes successfully if the name node is active and fails if it is a standby node.
hadoop fs -test -e hdfs://<Name node>/
Unix script
active_node=''
if hadoop fs -test -e hdfs://<NameNode-1>/ ; then
active_node='<NameNode-1>'
elif hadoop fs -test -e hdfs://<NameNode-2>/ ; then
active_node='<NameNode-2>'
fi
echo "Active Dev Name node : $active_node"
You can do it in bash with hdfs cli calls, too. With the noted caveat that this takes a bit more time since it's a few calls to the API in succession, but this may be preferable to using a python script for some.
This was tested with Hadoop 2.6.0
get_active_nn(){
ha_name=$1 #Needs the NameServiceID
ha_ns_nodes=$(hdfs getconf -confKey dfs.ha.namenodes.${ha_name})
active=""
for node in $(echo ${ha_ns_nodes//,/ }); do
state=$(hdfs haadmin -getServiceState $node)
if [ "$state" == "active" ]; then
active=$(hdfs getconf -confKey dfs.namenode.rpc-address.${ha_name}.${node})
break
fi
done
if [ -z "$active" ]; then
>&2 echo "ERROR: no active namenode found for ${ha_name}"
exit 1
else
echo $active
fi
}
After reading all the existing answers none seemed to combine the three steps of:
Identifying the namenodes from the cluster.
Resolving the node names to host:port.
Checking the status of each node (without requiring
cluster admin privs).
Solution below combines hdfs getconf calls and JMX service call for node status.
#!/usr/bin/env python
from subprocess import check_output
import urllib, json, sys
def get_name_nodes(clusterName):
ha_ns_nodes=check_output(['hdfs', 'getconf', '-confKey',
'dfs.ha.namenodes.' + clusterName])
nodes = ha_ns_nodes.strip().split(',')
nodeHosts = []
for n in nodes:
nodeHosts.append(get_node_hostport(clusterName, n))
return nodeHosts
def get_node_hostport(clusterName, nodename):
hostPort=check_output(
['hdfs','getconf','-confKey',
'dfs.namenode.rpc-address.{0}.{1}'.format(clusterName, nodename)])
return hostPort.strip()
def is_node_active(nn):
jmxPort = 50070
host, port = nn.split(':')
url = "http://{0}:{1}/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus".format(
host, jmxPort)
nnstatus = urllib.urlopen(url)
parsed = json.load(nnstatus)
return parsed.get('beans', [{}])[0].get('State', '') == 'active'
def get_active_namenode(clusterName):
for n in get_name_nodes(clusterName):
if is_node_active(n):
return n
clusterName = (sys.argv[1] if len(sys.argv) > 1 else None)
if not clusterName:
raise Exception("Specify cluster name.")
print 'Cluster: {0}'.format(clusterName)
print "Nodes: {0}".format(get_name_nodes(clusterName))
print "Active Name Node: {0}".format(get_active_namenode(clusterName))
From java api, you can use HAUtil.getAddressOfActive(fileSystem).
You can do a curl command to find out the Active and secondary Namenode
for example
curl -u username -H "X-Requested-By: ambari" -X GET
http://cluster-hostname:8080/api/v1/clusters//services/HDFS
Regards
I found the below when i simply typed 'hdfs' and found a couple of helpful commands, which could be useful for someone who could maybe come here seeking for help.
hdfs getconf -namenodes
This above command will give you the service id of the namenode. Say, hn1.hadoop.com
hdfs getconf -secondaryNameNodes
This above command will give you the service id of the available secondary namenodes. Say , hn2.hadoop.com
hdfs getconf -backupNodes
This above command will get you the service id of backup nodes, if any.
hdfs getconf -nnRpcAddresses
This above command will give you info of name service id along with rpc port number. Say, hn1.hadoop.com:8020
You're Welcome :)
In HDFS 2.6.0 the one that worked for me
ubuntu#platform2:~$ hdfs getconf -confKey dfs.ha.namenodes.arkin-platform-cluster
nn1,nn2
ubuntu#platform2:~$ sudo -u hdfs hdfs haadmin -getServiceState nn1
standby
ubuntu#platform2:~$ sudo -u hdfs hdfs haadmin -getServiceState nn2
active
Here is example of bash code that returns active name node even if you do not have local hadoop installation.
It also works faster as curl calls are usually faster than hadoop.
Checked on Cloudera 7.1
#!/bin/bash
export nameNode1=myNameNode1
export nameNode2=myNameNode2
active_node=''
T1=`curl --silent --insecure -request GET https://${nameNode1}:9871/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus | grep "\"State\" : \"active\"" | wc -l`
if [ $T1 == 1 ]
then
active_node=${nameNode1}
else
T1=`curl --silent --insecure -request GET https://${nameNode2}:9871/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus | grep "\"State\" : \"active\"" | wc -l`
if [ $T1 == 1 ]
then
active_node=${nameNode2}
fi
fi
echo "Active Dev Name node : $active_node"
#!/usr/bin/python
import subprocess
import sys
import os, errno
def getActiveNameNode () :
cmd_string="hdfs getconf -namenodes"
process = subprocess.Popen(cmd_string, shell=True, stdout=subprocess.PIPE)
out, err = process.communicate()
NameNodes = out
Value = NameNodes.split(" ")
for val in Value :
cmd_str="hadoop fs -test -e hdfs://"+val
process = subprocess.Popen(cmd_str, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
if (err != "") :
return val
def main():
out = getActiveNameNode()
print(out)
if __name__ == '__main__':
main()
You can simply use the below command. I have tested this in hadoop 3.0 You can check the reference here -
hdfs haadmin -getAllServiceState
It returns the state of all the NameNodes.
more /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.ha.namenodes.nameservice1</name>
<value>namenode1353,namenode1357</value>
</property>
hdfs#:/home/ubuntu$ hdfs haadmin -getServiceState namenode1353
active
hdfs#:/home/ubuntu$ hdfs haadmin -getServiceState namenode1357
standby

Resources