How to access the built-in counters in Hadoop MapReduce

I have been trying to find the total number of records parsed by all the mappers using the MAP_INPUT_RECORDS counter.
Here is the code I am using:
Counters counters = job.getCounters();
for (CounterGroup group : counters) {
    System.out.println("* Counter Group: " + group.getDisplayName() + " (" + group.getName() + ")");
    System.out.println("  number of counters in this group: " + group.size());
    for (Counter counter : group) {
        System.out.println("  - " + counter.getDisplayName() + ": " + counter.getName() + ": " + counter.getValue());
    }
}
I also tried the following snippet:
Counters counters = job.getCounters();
// counters.getGroup("org.apache.hadoop.mapred.Task$Counter").findCounter("MAP_INPUT_RECORDS").getValue();
int recordCountData = (int) counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS")
        .getValue();
But every time it throws the following error:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.CounterGroup, but class was expected
at com.ssga.common.riskmeasures.validation.mr.RiskMeasuresValidationDriver.run(RiskMeasuresValidationDriver.java:169)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.ssga.common.riskmeasures.validation.mr.RiskMeasuresValidationDriver.main(RiskMeasuresValidationDriver.java:189)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
PS: I am trying to use the above-mentioned approaches after job.waitForCompletion(true) in the Driver class.
Any suggestions on how I can resolve this issue?
Thanks in advance.
Akhilesh

"The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class2. For example, the Mapper and Reducer interfaces in the old API are abstract classes in the new API."
"The new API is in the org.apache.hadoop.mapreduce package (and subpackages).
The old API can still be found in org.apache.hadoop.mapred."
- Hadoop: The Definitive Guide by Tom White, 3rd Edition, page 28
Check your mapper and reducer: make sure they extend the new-API abstract classes (org.apache.hadoop.mapreduce) rather than implementing the old-API interfaces (org.apache.hadoop.mapred), and that you compile against the same Hadoop version you run on; an IncompatibleClassChangeError typically means the two are mixed.
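For reference, a minimal sketch of reading that counter through the new API once the job has completed, assuming the job was submitted via org.apache.hadoop.mapreduce.Job on Hadoop 2.x, where the built-in task counters live in the TaskCounter enum:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReader {
    // Call after job.waitForCompletion(true) has returned
    public static long mapInputRecords(Job job) throws Exception {
        // TaskCounter.MAP_INPUT_RECORDS replaces the old
        // "org.apache.hadoop.mapred.Task$Counter" group/name string lookup
        return job.getCounters()
                  .findCounter(TaskCounter.MAP_INPUT_RECORDS)
                  .getValue();
    }
}
Using the enum also avoids hard-coding the counter group name, which differs between the old and new APIs.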

Related

How do I get the list of connections from processContext in NiFi 1.11.4

Our production instance of NiFi is version 1.8.0. We have a custom processor that continually looks at its downstream connections in order to route flow files based on each connection's queue size.
Here is the salient snippet of how we do this:
String processorId = this.getIdentifier();
ProcessGroupStatus processGroupStatus = ((EventAccess) getControllerService()).getControllerStatus();
Collection<ConnectionStatus> groupConnections = processGroupStatus.getConnectionStatus();
ArrayList connections = new ArrayList<>(groupConnections);
for (Object processorConnection : connections) {
    ConnectionStatus connection = (ConnectionStatus) processorConnection;
    if (connection.getSourceId().equals(processorId)) {
        // do stuff with connection.getQueuedCount() & connection.getQueuedBytes()
        break;
    }
}
Everything had been working as expected for the last couple of years. However, upgrading our NiFi instance to version 1.11.4 has broken this approach. The exception thrown is:
class org.apache.nifi.controller.service.StandardControllerServiceProvider cannot be cast to class org.apache.nifi.reporting.EventAccess
Is there another way to retrieve connections from processContext?
One approach that may be more upwardly compatible (and easier to maintain) than a custom Java processor would be to use the ExecuteGroovyScript processor.
The Groovy script in this case would look something like:
ff = session.get()
if (ff) {
    me = context.procNode
    processorId = me.identifier
    connections = me.processGroup.connections
    connections.each { connection ->
        if (connection.source.identifier.equals(processorId)) {
            ff[connection.identifier] = "I am the source " +
                    "[" + connection.flowFileQueue.size().objectCount + "]" +
                    "[" + connection.flowFileQueue.size().byteCount + "]"
        } else {
            ff[connection.identifier] = "I am NOT the source; my name is [" + connection.name + "]"
        }
    }
    REL_SUCCESS << ff
}
To find out what is available to the Groovy script, I use a combination of the NiFi JavaDocs (https://javadoc.io/static/org.apache.nifi/nifi-api/1.12.0/index.html) and the Github code for NiFi (https://github.com/apache/nifi/tree/c396927299586b896df4ebc745793b4c451f3898/nifi-api/src/main/java/org/apache/nifi).
As a side note, we converted our custom Java processors to Groovy script, because of an upgrade incompatibility when going to (ironically) 1.8.0. We have not had an issue with NiFi upgrades since then, and are currently running v 1.11.4.

[Snowflake-jdbc] It hangs when getting info from the ResultSet of connection.getMetaData().getColumns(...)

I am trying to test the JDBC connection to Snowflake with the code below:
Connection conn = .......
.......
ResultSet rs = conn.getMetaData().getColumns(null, "PUBLIC", "TAB1", null); // 1. set parameters to get metadata of table TAB1
while (rs.next()) { // 2. It hangs here if the first parameter above is null; otherwise (with the correct db name) it works fine
    System.out.println("precision:" + rs.getInt(7)
            + ",col type name:" + rs.getString(6)
            + ",col type:" + rs.getInt(5)
            + ",col name:" + rs.getString(4)
            + ",CHAR_OCTET_LENGTH:" + rs.getInt(16)
            + ",buf LENGTH:" + rs.getString(8)
            + ",SCALE:" + rs.getInt(9));
}
.......
I debugged the code above in IntelliJ IDEA and found that the debugger can't get the details of the object; it always shows "Evaluating...".
The JDBC driver I used is snowflake-jdbc-3.12.5.jar
Is it a bug?
When the catalog (database) argument is null, the JDBC code effectively runs the following SQL, which you can verify in your Snowflake account's Query History UIs/Views:
show columns in account;
This is an expensive metadata query to run due to no filters and the wide requested breadth (columns across the entire account).
Depending on how many databases exist in your organization's account, it may take several minutes or up to an hour to return results, which explains the apparent "hang". In a simple test with about 50k+ tables dispersed across 100+ databases and schemas, this took at least 15 minutes to return results.
Regarding "the debugger can't get the details of the object, it always shows 'Evaluating...'":
This may be a quirk of your IDE, but in a pinch you can use the Dump Threads option (Ctrl + Escape, or Ctrl + Break) in IDEA to capture a single thread dump. It should show that the JDBC client thread isn't hanging (it is not locked or starved); it is only waiting on the server to send back results.
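If the account-wide scan is the problem, the simplest workaround is to pass the catalog (and ideally the schema) explicitly so the metadata query is scoped to a single database. A minimal sketch, where the account URL, the credentials, and the TESTDB/PUBLIC/TAB1 names are placeholders for your own:
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ColumnsByCatalog {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own account URL and credentials
        Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://<account>.snowflakecomputing.com/", "<user>", "<password>");
        DatabaseMetaData metaData = conn.getMetaData();
        // A non-null catalog keeps the server-side metadata query scoped to one database
        // instead of scanning columns across the entire account
        ResultSet rs = metaData.getColumns("TESTDB", "PUBLIC", "TAB1", null);
        while (rs.next()) {
            System.out.println(rs.getString("COLUMN_NAME") + " : " + rs.getString("TYPE_NAME"));
        }
        conn.close();
    }
}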
There is no issue with the 3.12.5 jar. I just tested the same version in Eclipse and I can inspect all the objects, so it could be an issue with your IDE.
ResultSet columns = metaData.getColumns(null, null, "TESTTABLE123", null);
while (columns.next()) {
    System.out.print("Column name and size: " + columns.getString("COLUMN_NAME"));
    System.out.print("(" + columns.getInt("COLUMN_SIZE") + ")");
    System.out.println(" ");
    System.out.println("COLUMN_DEF : " + columns.getString("COLUMN_DEF"));
    System.out.println("Ordinal position: " + columns.getInt("ORDINAL_POSITION"));
    System.out.println("Catalog: " + columns.getString("TABLE_CAT"));
    System.out.println("Data type (integer value): " + columns.getInt("DATA_TYPE"));
    System.out.println("Data type name: " + columns.getString("TYPE_NAME"));
    System.out.println(" ");
}

HBase Exception: org.apache.hadoop.hbase.UnknownRowLockException

I'm running sample code I wrote to test the HBase lockRow() and unlockRow() methods. The sample code is below:
HTable table = new HTable(config, "test");
RowLock rowLock = table.lockRow(Bytes.toBytes(row));
System.out.println("Obtained rowlock on " + row + "\nRowLock: " + rowLock);
Put p = new Put(Bytes.toBytes(row));
p.add(Bytes.toBytes("colFamily"), Bytes.toBytes(colFamily), Bytes.toBytes(value));
table.put(p);
System.out.println("put row");
table.unlockRow(rowLock);
System.out.println("Unlocked row!");
When I execute my code, I get an UnknownRowLockException. The documentation says that this error is thrown when an unknown row lock is passed to the region servers. I'm not sure how this is happening & how to resolve it.
The stack trace is below:
Obtained rowlock on row2
RowLock: org.apache.hadoop.hbase.client.RowLock@15af33d6
put row
Exception in thread "main" org.apache.hadoop.hbase.UnknownRowLockException: org.apache.hadoop.hbase.UnknownRowLockException: 5763272717012243790
at org.apache.hadoop.hbase.regionserver.HRegionServer.unlockRow(HRegionServer.java:2099)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:604)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1055)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.translateException(HConnectionManager.java:1268)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1014)
at org.apache.hadoop.hbase.client.HTable.unlockRow(HTable.java:870)
at HelloWorld.Hello.HelloWorld.main(HelloWorld.java:41)
EDIT:
I just realized that I should be printing rowLock.getLockId() instead of rowLock. I did this and compared it to the rowlock in the stack trace, and they are the same, so I'm not sure why the UnknownRowLockException occurs.
Please change the 'file descriptor limit' on the underlying system.
On Linux you can do this with ulimit.
Note that HBase prints the ulimit it sees as the first line of its logs.
I was able to resolve this error as follows:
The rowLock being obtained needs to be passed as a parameter to the Put constructor.
HTable table = new HTable(config, "test");
RowLock rowLock = table.lockRow(Bytes.toBytes(row));
System.out.println("Obtained rowlock on " + row + "\nRowLock: " + rowLock);
Put p = new Put(Bytes.toBytes(row), rowLock);
p.add(Bytes.toBytes("colFamily"), Bytes.toBytes(colFamily), Bytes.toBytes(value));
table.put(p);
System.out.println("put row");
table.unlockRow(rowLock);
System.out.println("Unlocked row!");
In my earlier approach, a rowLock was obtained on a row of the table but never used (it was not passed to the Put constructor). When I then called the unlockRow method, it waited 60 seconds (the lock timeout) to check whether the lock had been used. After 60 seconds the lock expired, and I ended up with the UnknownRowLockException.

How to set system environment variables for the mapper?

First I tried "How to set system environment variable from Mapper Hadoop?", but mapred.map.child.env doesn't work for me.
I am using Hadoop 0.20.1. I want to pass all system environment variables from the class that starts the job to the mapper. Here is what I do:
StringBuilder envStr = new StringBuilder();
for (Entry<String, String> entry : System.getenv().entrySet()) {
    envStr.append(entry.getKey() + "=" + entry.getValue() + ",");
}
if (envStr.length() > 0) {
    envStr.deleteCharAt(envStr.length() - 1);
}
// System.out.println("Setting mapper child env to :" + envStr);
getConf().set("mapred.map.child.env", envStr.toString());
But it doesn't work. I also tried setting just one value, but that doesn't work either. In the mapper, System.getenv() doesn't contain the value, yet job.xml has the key and value. Is there any way to do this?
It seems that your Hadoop is too old. This is a bug in Hadoop 0.20.
Please upgrade to 0.21, or to the more stable 1.0.x.
See the related JIRA and the Hadoop 0.21.0 release notes for more information.
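For what it's worth, on a version where the property is honored, the usage is the same as in the question. A minimal sketch, with MY_VAR and OTHER_VAR as placeholder variable names:
import org.apache.hadoop.conf.Configuration;

public class ChildEnvSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Comma-separated NAME=VALUE pairs; a fixed TaskTracker exports these
        // into the environment of the map task's child JVM
        conf.set("mapred.map.child.env", "MY_VAR=my_value,OTHER_VAR=other_value");
        // ...submit the job with this configuration; inside the mapper the value
        // should then be visible via System.getenv("MY_VAR")
        System.out.println(conf.get("mapred.map.child.env"));
    }
}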

Tracking Hadoop job status via web interface? (Exposing Hadoop to internal clients in the company)

I want to develop a website that will allow analysts within the company to run Hadoop jobs (chosen from a set of predefined jobs) and see each job's status/progress.
Is there an easy way to do this (get running job statuses etc.) via Ruby/Python?
How do you expose your Hadoop cluster to internal clients at your company?
I have found one way to get information about jobs on JobTracker. This is the code:
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "URL");
JobClient client = new JobClient(new JobConf(conf));
JobStatus[] jobStatuses = client.getAllJobs();
for (JobStatus jobStatus : jobStatuses) {
    long lastTaskEndTime = 0L;
    TaskReport[] mapReports = client.getMapTaskReports(jobStatus.getJobID());
    for (TaskReport r : mapReports) {
        if (lastTaskEndTime < r.getFinishTime()) {
            lastTaskEndTime = r.getFinishTime();
        }
    }
    TaskReport[] reduceReports = client.getReduceTaskReports(jobStatus.getJobID());
    for (TaskReport r : reduceReports) {
        if (lastTaskEndTime < r.getFinishTime()) {
            lastTaskEndTime = r.getFinishTime();
        }
    }
    client.getSetupTaskReports(jobStatus.getJobID());
    client.getCleanupTaskReports(jobStatus.getJobID());
    System.out.println("JobID: " + jobStatus.getJobID().toString() +
            ", username: " + jobStatus.getUsername() +
            ", startTime: " + jobStatus.getStartTime() +
            ", endTime: " + lastTaskEndTime +
            ", Duration: " + (lastTaskEndTime - jobStatus.getStartTime()));
}
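If you also want live progress (for example, to show a progress bar on the website), the same JobClient can report it. A minimal sketch, with the JobTracker address left as the same "URL" placeholder as above:
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class JobProgressSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "URL"); // placeholder JobTracker address
        JobClient client = new JobClient(conf);
        // Report live map/reduce progress for every job the JobTracker knows about
        for (JobStatus status : client.getAllJobs()) {
            RunningJob job = client.getJob(status.getJobID());
            if (job != null) {
                System.out.println(job.getID()
                        + " map: " + (int) (job.mapProgress() * 100) + "%"
                        + ", reduce: " + (int) (job.reduceProgress() * 100) + "%"
                        + ", complete: " + job.isComplete());
            }
        }
    }
}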
Since version 'beta 2' of Cloudera's Hadoop Distribution you can, almost with no effort, use Hue (Hadoop User Experience), which was earlier called Cloudera Desktop.
Since that version it has grown enormously. It comes with a job designer, a Hive interface and much more. You should definitely check it out before deciding to build your own application.
A good place to start might be to take a look at Cloudera Desktop. It provides a web interface for cluster administration and job development tasks. It's free to download.
There is nothing like this that ships with Hadoop. It should be trivial to build this functionality. Some of it is available via the JobTracker's web pages, and some you will have to build yourself.
