WebHdfsFileSystem: local IP vs network IP (Hadoop)

I have a requirement to read from HDFS from outside the HDFS cluster. I stumbled upon WebHdfsFileSystem, and even though I get the idea, I could not make it work with the network address. For example, the code below works fine as long as I use 127.0.0.1 or localhost. But the moment I use the network IP address 192.168.. , I get "Retrying connect to server" messages followed by a ConnectException.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;

public class ReadHDFSFile {
    public static void main(String[] args) {
        Path p = new Path("hdfs://127.0.0.1:9000/user/hduser");
        WebHdfsFileSystem web = new WebHdfsFileSystem();
        try {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://127.0.0.1:9000/");
            web.setConf(conf);
            Configuration conf1 = web.getConf();

            FileSystem fs = FileSystem.get(web.getConf());
            System.out.println(fs.exists(p));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I am not sure what I am missing here.

I have a version of this working on Hadoop 2.4. I had to change two things relative to using the regular Hadoop FileSystem API:
the protocol changes from hdfs:// to webhdfs://
the port changes to the HTTP port (which on our Hortonworks cluster is 50070), not the default HDFS port (which might also be called the RPC port), which is 8020 on our system
Example code that works for me:
Configuration conf = new Configuration();
String conxUrl = String.format("webhdfs://%s:%s", NAMENODE_IP_ADDR, WEBHDFS_PORT);
conf.set("fs.defaultFS", conxUrl);
FileSystem fs = WebHdfsFileSystem.get(conf);
Path path = new Path("/path/to/my/file");
System.out.println(fs.exists(path));
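For completeness, here is a minimal self-contained sketch of the same approach; the namenode IP and the WebHDFS port are placeholders to substitute for your cluster (the HTTP port comes from dfs.namenode.http-address, 50070 by default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.web.WebHdfsFileSystem;

public class ReadViaWebHdfs {
    public static void main(String[] args) throws Exception {
        // Placeholder values; substitute the namenode's network IP and its HTTP port.
        String namenodeIp = "192.168.0.10";
        int webhdfsPort = 50070;

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", String.format("webhdfs://%s:%d", namenodeIp, webhdfsPort));

        // FileSystem.get picks the WebHDFS implementation from the webhdfs:// scheme.
        FileSystem fs = WebHdfsFileSystem.get(conf);
        Path path = new Path("/user/hduser");
        System.out.println(fs.exists(path));
        fs.close();
    }
}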

Related

Oracle Database JDBC driver cannot read wallet file from Spark

Objective
I'm trying to write to Oracle's ADWC (basically an Oracle database) from a Spark application running on YARN. The only way to connect to this database is by using an Oracle Wallet file, which is basically a Java keystore.
Problem
The problem arises when the JDBC driver tries to read the wallet from HDFS. If I include the hdfs:// prefix, the parser in the JDBC driver throws an error, and if I don't, it cannot find the file.
Previous Attempts
including the directory in the connect string (prefixed and non-prefixed): jdbc:oracle:thin:#luigi_low?TNS_ADMIN=/user/spark/wallet_LUIGI
including the directory in spark.driver.extraJavaOptions with -Doracle.net.tns_admin and -Doracle.net.wallet_location
All the code is on GitHub, and specifically, the error messages are here https://github.com/sblack4/kafka-scala-jdbc/blob/master/ERROR.md
I've got a working example of the same connection here https://github.com/sblack4/scala-jdbc-adwc
Help me, Stack Overflow, you're my only hope.
If you need any more clarification don't hesitate :)
Update (SparkFiles attempt)
The code is on a separate branch of the same repository, https://github.com/sblack4/kafka-scala-jdbc/tree/sparkfiles
This error message mystifies me, as it seems my JDBC library has stopped trying to read the wallet files. It may be unrelated to the previous problem:
Exception in thread "main" java.sql.SQLRecoverableException: IO Error: Invalid connection string format, a valid format is: "host:port:sid"
I've deleted the other JDBC libraries from my classpath through Ambari, as this error could be related to Spark picking up an older version of my JDBC library.
Here's some code that will help diagnose what the issue is.
It checks and configures everything required to connect.
Checks:
JDBC driver version
JCE installed
classpath dependencies
Configures:
tns_admin
SSL settings
trust/key stores
This is a slimmed-down version of what's in SQL Developer/SQLcl.
import java.security.NoSuchAlgorithmException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;
import java.util.Properties;

import javax.crypto.Cipher;

import oracle.jdbc.OracleConnection;

public class JDBCTest {

    public static void fail(String msg) {
        System.err.println(String.join("", Collections.nCopies(20, "*")));
        System.err.println(msg);
        System.err.println(String.join("", Collections.nCopies(20, "*")));
        System.exit(1);
    }

    public static void main(String[] args) throws SQLException {
        System.out.println("JDBC Driver Version:" + oracle.jdbc.OracleDriver.getDriverVersion());

        // Check JDBC driver version
        if (!oracle.jdbc.OracleDriver.getDriverVersion().startsWith("18.")) {
            fail(" DRIVER TOO OLD!!!");
        }

        // Check JCE installed (unlimited-strength policy)
        int maxKeySize = 0;
        try {
            maxKeySize = Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            // ignore; maxKeySize stays 0 and the check below fails
        }
        if (maxKeySize < 129) {
            fail(" JCE Policy not unlimited!!!");
        }

        // Check classpath
        String cp = System.getProperty("java.class.path");
        String[] cpFiles = {"ojdbc8.jar", "oraclepki.jar", "osdt_cert.jar", "osdt_core.jar"};
        for (String file : cpFiles) {
            if (cp.indexOf(file) == -1) {
                fail("CLASSPATH Missing:" + file);
            }
        }

        // Unzipped wallet location
        String unzippedWalletLocation = "/Users/klrice/workspace/12.2JDBC/wallet";

        String conString = "jdbc:oracle:thin:@sqldev_medium";

        Properties props = new Properties();
        props.setProperty("oracle.net.wallet_location", unzippedWalletLocation);
        props.setProperty(OracleConnection.CONNECTION_PROPERTY_THIN_NET_CONNECT_TIMEOUT, "2000");
        // the unzipped wallet includes a tnsnames.ora
        props.setProperty("oracle.net.tns_admin", unzippedWalletLocation);
        props.setProperty("javax.net.ssl.trustStore", "truststore.jks");
        props.setProperty("javax.net.ssl.trustStorePassword", "<password>");
        props.setProperty("javax.net.ssl.keyStore", "keystore.jks");
        props.setProperty("javax.net.ssl.keyStorePassword", "<password>");
        props.setProperty("oracle.net.ssl_server_dn_match", "true");
        props.setProperty("oracle.net.ssl_version", "1.2");
        props.setProperty("user", "ADMIN");
        props.setProperty("password", "<password>");

        try {
            // now connect
            Connection conn = DriverManager.getConnection(conString, props);
        } catch (Exception e) {
            e.printStackTrace();
            fail(e.getLocalizedMessage());
        }
        System.out.println("SUCCESS!!");
    }
}
Are you using the 18.3 JDBC driver? Passing TNS_ADMIN as part of the connection URL requires the 18.3 JDBC driver. Also, are you attempting to connect from within a corporate network? In that case, you will need to pass HTTPS_PROXY and HTTPS_PROXY_PORT in the connection URL. Let us know. Happy to help with the problem.
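For illustration only, a minimal sketch of that TNS_ADMIN-in-the-URL style with the 18.3 thin driver; the service alias, wallet path, user, and password are placeholders, and the HTTPS_PROXY/HTTPS_PROXY_PORT settings mentioned above would be added per the driver documentation if you are behind a corporate proxy:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class AdwcConnectSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder alias and wallet path; TNS_ADMIN points at the directory
        // holding the unzipped wallet (tnsnames.ora, keystores, ...).
        String url = "jdbc:oracle:thin:@luigi_low?TNS_ADMIN=/path/to/unzipped/wallet";
        // If connecting from inside a corporate network, the proxy settings
        // described above also need to be supplied.

        Properties props = new Properties();
        props.setProperty("user", "ADMIN");          // placeholder
        props.setProperty("password", "<password>"); // placeholder

        try (Connection conn = DriverManager.getConnection(url, props)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}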

phoenix jdbc doesn't work, no exceptions and stuck

I'm new to Phoenix and HBase. The HBase table and Phoenix view work well, and I can fetch data through Phoenix. But when I access Phoenix over JDBC, it gets stuck.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Phoenix {
    private static String driver = "org.apache.phoenix.jdbc.PhoenixDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driver);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        Statement stmt = null;
        ResultSet rs = null;
        System.out.println("start...");
        Connection con = DriverManager.getConnection("jdbc:phoenix:[my_cloud_server_ip]:2181");
        System.out.println(con);
        con.close();
    }
}
(Only one ZooKeeper server has a public internet IP, so I wrote that IP there; does that matter?)
It prints "start..." and then there is no more response.
But when the URL is "jdbc:phoenix:ip:2181" or "jdbc:phoenix:ip:2181/hbase",
I get no response,
and when I add some other word, for example "jdbc:phoenix:ip:2181/balabala",
I get a NullPointerException.
Why?
I hope you can understand what I am saying :)
Reproducing your example, I am getting a java.net.SocketTimeoutException after 60 seconds, caused by java.net.UnknownHostException: unknown host: <hostname_of_my_zk_server>. But maybe that's what you called "no response" if you were in a hurry (or if you have a custom hbase-site.xml with a larger timeout client-side).
This error seems like the one explained in this article (section "Zookeeper" at the end):
As in our case CDH was running in a test VM, we encountered this issue: http://stackoverflow.com/questions/18428722/hbase-java-client-unknown-host-localhost-localdomain
That was worked around by adding localhost.localdomain to the existing /etc/hosts entry for cluster1, which was already pointing to the right IP address.
This answer on SO summarizes the solution.
Basically, you need to add an entry to your client-side /etc/hosts:
<my_cloud_server_ip> <hostname_of_my_cloud_server_ip>
Moreover, you need to have an hbase-site.xml client-side (you can use the one on your server as a basis).
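As a quick client-side sanity check (the hostname below is a placeholder), you can verify that the ZooKeeper server's hostname actually resolves from the machine running the JDBC client before involving Phoenix at all:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) {
        // Placeholder; use the hostname your ZooKeeper/HBase server advertises
        // (the one appearing in the UnknownHostException).
        String zkHost = "hostname_of_my_cloud_server_ip";
        try {
            InetAddress addr = InetAddress.getByName(zkHost);
            System.out.println(zkHost + " resolves to " + addr.getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println(zkHost + " does not resolve; add it to /etc/hosts");
        }
    }
}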

spark job fails in yarn cluster

I have a Spark job which runs without any issues in spark-shell. I am currently trying to submit this job to YARN using Spark's API.
I am using the class below to run the Spark job:
import java.util.ResourceBundle;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;

public class SubmitSparkJobToYARNFromJavaCode {
    public static void main(String[] arguments) throws Exception {
        ResourceBundle bundle = ResourceBundle.getBundle("device_compare");
        String accessKey = bundle.getString("accessKey");
        String secretKey = bundle.getString("secretKey");

        String[] args = new String[] {
            // path to my application's JAR file
            // required in yarn-cluster mode
            "--jar",
            "my_s3_path_to_jar",

            // name of my application's main class (required)
            "--class", "com.abc.SampleIdCount",

            // comma separated list of local jars that want
            // SparkContext.addJar to work with
            // "--addJars", arguments[1]
        };

        // create a Hadoop Configuration object
        Configuration config = new Configuration();

        // identify that I will be using Spark as YARN mode
        System.setProperty("SPARK_YARN_MODE", "true");
        System.setProperty("spark.local.dir", "/tmp");

        // create an instance of SparkConf object
        SparkConf sparkConf = new SparkConf();
        sparkConf.set("fs.s3n.awsAccessKeyId", accessKey);
        sparkConf.set("fs.s3n.awsSecretAccessKey", secretKey);
        sparkConf.set("spark.local.dir", "/tmp");

        // create ClientArguments, which will be passed to Client
        ClientArguments cArgs = new ClientArguments(args);

        // create an instance of yarn Client client
        Client client = new Client(cArgs, config, sparkConf);

        // submit Spark job to YARN
        client.run();
    }
}
This is the Spark job I am trying to run:
package com.abc;

import java.util.ResourceBundle;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SampleIdCount {
    private static String accessKey;
    private static String secretKey;

    public SampleIdCount() {
        ResourceBundle bundle = ResourceBundle.getBundle("device_compare");
        accessKey = bundle.getString("accessKey");
        secretKey = bundle.getString("secretKey");
    }

    public static void main(String[] args) {
        System.out.println("Started execution");
        SampleIdCount sample = new SampleIdCount();
        System.setProperty("SPARK_YARN_MODE", "true");
        System.setProperty("spark.local.dir", "/tmp");

        SparkConf conf = new SparkConf().setAppName("SampleIdCount").setMaster("yarn-cluster");

        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", accessKey);
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secretKey);

        JavaRDD<String> installedDeviceIdsRDD = sc.emptyRDD();
        installedDeviceIdsRDD = sc.textFile("my_s3_input_path");
        installedDeviceIdsRDD.saveAsTextFile("my_s3_output_path");

        sc.close();
    }
}
When I run my Java code, the Spark job is submitted to YARN, but I face the error below:
Diagnostics: File file:/mnt/tmp/spark-1b86d806-5c8f-4ae6-a486-7b68d46c759a/__spark_libs__8257948728364304288.zip does not exist
java.io.FileNotFoundException: File file:/mnt/tmp/spark-1b86d806-5c8f-4ae6-a486-7b68d46c759a/__spark_libs__8257948728364304288.zip does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:616)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:829)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:431)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I thought the problem was that the folder /mnt was not available on the slave nodes, so I tried to change the Spark local directory to /tmp by doing the following:
I set the spark.local.dir system property to "/tmp" in my Java code
I set a system env variable by doing export LOCAL_DIRS=/tmp
I set a system env variable by doing export SPARK_LOCAL_DIRS=/tmp
None of these had any effect and I still face the very same error. None of the suggestions in other links helped me either. I am really stuck on this. Any help would be much appreciated. Thanks in advance. Cheers!

DRPC Server error in storm

I am trying to execute the code below and getting an error. Not sure if I am missing something here. Also, where would I see the output?
Error
java.lang.RuntimeException: No DRPC servers configured for topology
at backtype.storm.drpc.DRPCSpout.open(DRPCSpout.java:79)
at storm.trident.spout.RichSpoutBatchTriggerer.open(RichSpoutBatchTriggerer.java:58)
at backtype.storm.daemon.executor$fn__5802$fn__5817.invoke(executor.clj:519)
at backtype.storm.util$async_loop$fn__442.invoke(util.clj:434)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:744)
Code:
package com.**.trident.storm;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import storm.kafka.*;
import storm.kafka.trident.*;
import storm.trident.*;
import storm.trident.operation.builtin.*;
import storm.trident.testing.MemoryMapState;

import backtype.storm.*;
import backtype.storm.generated.StormTopology;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import backtype.storm.utils.DRPCClient;

public class EventTridentDrpcTopology {
    private static final String KAFKA_SPOUT_ID = "kafkaSpout";
    private static final Logger log = LoggerFactory.getLogger(EventTridentDrpcTopology.class);

    public static StormTopology buildTopology(OpaqueTridentKafkaSpout spout) throws Exception {
        TridentTopology tridentTopology = new TridentTopology();
        TridentState ts = tridentTopology.newStream("event_spout", spout)
                .name(KAFKA_SPOUT_ID)
                .each(new Fields("mac_address"), new SplitMac(), new Fields("mac"))
                .groupBy(new Fields("mac"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("maccount"))
                .parallelismHint(4);

        tridentTopology
                .newDRPCStream("mac_count")
                .each(new Fields("args"), new SplitMac(), new Fields("mac"))
                .stateQuery(ts, new Fields("mac"), new MapGet(), new Fields("maccount"))
                .each(new Fields("maccount"), new FilterNull())
                .aggregate(new Fields("maccount"), new Sum(), new Fields("sum"));

        return tridentTopology.build();
    }

    public static void main(String[] str) throws Exception {
        Config conf = new Config();
        BrokerHosts hosts = new ZkHosts("xxxx:2181,xxxx:2181,xxxx:2181");
        String topic = "event";
        //String zkRoot = topologyConfig.getProperty("kafka.zkRoot");
        String consumerGroupId = "StormSpout";

        DRPCClient drpc = new DRPCClient("xxxx", 3772);

        TridentKafkaConfig tridentKafkaConfig = new TridentKafkaConfig(hosts, topic, consumerGroupId);
        tridentKafkaConfig.scheme = new SchemeAsMultiScheme(new XScheme());
        OpaqueTridentKafkaSpout opaqueTridentKafkaSpout = new OpaqueTridentKafkaSpout(tridentKafkaConfig);

        StormSubmitter.submitTopology("event_trident", conf, buildTopology(opaqueTridentKafkaSpout));
    }
}
You have to configure the locations of the DRPC servers and launch them.
See Remote mode DRPC on http://storm.apache.org/releases/0.10.0/Distributed-RPC.html
Launch DRPC server(s)
Configure the locations of the DRPC servers
Submit DRPC topologies to Storm cluster
Launching a DRPC server can be done with the storm script and is just like launching Nimbus or the UI:
bin/storm drpc
Next, you need to configure your Storm cluster to know the locations of the DRPC server(s). This is how DRPCSpout knows from where to read function invocations. This can be done through the storm.yaml file or the topology configurations. Configuring this through the storm.yaml looks something like this:
drpc.servers:
- "drpc1.foo.com"
- "drpc2.foo.com"

How to append to created file on HDFS

I have a problem with appending to a file which was created by me. I don't have that problem with a file which was uploaded manually to HDFS. What is the difference between an uploaded file and a created one?
To append and create I use the code below:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Test {
    public static final String hdfs = "hdfs://192.168.15.62:8020";
    public static final String hpath = "/user/horton/wko/test.log";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", hdfs);
        conf.set("hadoop.job.ugi", "hdfs");
        FileSystem fs = FileSystem.get(conf);
        Path filenamePath = new Path(hpath);
        //FSDataOutputStream out = fs.create(filenamePath);
        FSDataOutputStream out = fs.append(filenamePath);
        out.writeUTF("TEST\n");
        out.close();
    }
}
I get this exception in the append case:
Exception in thread "main" java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.15.62:50010], original=[192.168.15.62:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
I had a similar problem that was fixed by adding conf.set("dfs.replication", "1").
In my case I had only one node in the cluster, and even though dfs.replication was set to 1 in hdfs-site.xml, it was still using the default value of 3.
Note that Hadoop will try to replicate the blocks of the file as soon as they get written to the first node, and since the default value for replication is 3, it will fail to reach other datanodes if you only have a one-node cluster.
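As a concrete illustration of that fix (a sketch, assuming the same single-node setup as in the question), the replication factor can be forced on the client-side Configuration before opening the append stream; the replace-datanode-on-failure policy named in the exception can also be relaxed there, which is reasonable only on very small clusters:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendWithSingleNodeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.15.62:8020");

        // Force replication 1 on the client, matching the single-node cluster.
        conf.set("dfs.replication", "1");
        // With only one datanode there is nothing to replace a failed node with,
        // so disable the replacement policy from the exception message.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.append(new Path("/user/horton/wko/test.log"));
        out.writeUTF("TEST\n");
        out.close();
    }
}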
