Why does H2O give different predictions on a Spark cluster than in Spark local mode? - h2o

H2O in Spark cluster mode is giving different predictions from Spark local mode, and the predictions in Spark local mode are better than those from the cluster. Why is this happening, and is it expected H2O behaviour?
Two data sets are being used: one for training the model and another for scoring.
trainingData.csv: 1.8 MB (2211 rows)
testingData.csv: 1.8 MB (2211 rows)
Driver memory: 1G
Executor memory: 1G
Number of executors: 1
The following command is used to submit the job to the cluster:
nohup /usr/hdp/current/spark2-client/bin/spark-submit --class com.inn.sparkrunner.h2o.GradientBoostingAlgorithm --master yarn --driver-memory 1G --executor-memory 1G --num-executors 1 --deploy-mode cluster spark-runner-1.0.jar > tool.log &
1) Main method:
public static void main(String args[]) {
    SparkSession sparkSession = getSparkSession();
    H2OContext h2oContext = getH2oContext(sparkSession);
    UnseenDataTestDRF(sparkSession, h2oContext);
}
2) The H2O context is created:
private static H2OContext getH2oContext(SparkSession sparkSession) {
    H2OConf h2oConf = new H2OConf(sparkSession.sparkContext()).setInternalClusterMode();
    H2OContext orCreate = H2OContext.getOrCreate(sparkSession.sparkContext(), h2oConf);
    return orCreate;
}
3) The Spark session is created:
public static SparkSession getSparkSession() {
    SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").master("yarn")
            .getOrCreate();
    return spark;
}
4) The GBM parameters are set:
private static GBMParameters getGBMParam(H2OFrame asH2OFrame) {
    GBMParameters gbmParam = new GBMParameters();
    gbmParam._response_column = "high";
    gbmParam._train = asH2OFrame._key;
    gbmParam._ntrees = 10;
    gbmParam._seed = 1;
    return gbmParam;
}

Related

Connect to HBase using Hadoop config in Spark

I am trying to create an HBase connection in a MapPartitionFunction in Spark, but I get:
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
I tried the following code:
SparkConf conf = new SparkConf()
        .setAppName("EnterPrise Risk Score")
        .setMaster("local");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.set("spark.kryo.registrationRequired", "true");
conf.registerKryoClasses(new Class<?>[] {
        Class.forName("org.apache.hadoop.conf.Configuration"),
        Class.forName("org.apache.hadoop.hbase.client.Table"),
        Class.forName("com.databricks.spark.avro.DefaultSource$SerializableConfiguration") });
SparkSession sparkSession = SparkSession.builder().config(conf)
        .getOrCreate();
Configuration hbaseConf = HBaseConfiguration.create(hadoopConf);
I am using sparkSession to create a dataset and pass hbaseConf to create connections to HBase.
Is there any way to connect to HBase?
You probably implicitly pass an HBase configuration to a Spark action like this:
Configuration hbaseConfiguration = HBaseConfiguration.create();
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
    // more valid code
});
Why don't you just create the Configuration right inside of it, like this:
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Configuration hbaseConfiguration = HBaseConfiguration.create();
    hbaseConfiguration.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
    Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
    // more valid code
});
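For what it's worth, the reason the second version works: org.apache.hadoop.conf.Configuration is not Serializable, so a configuration captured from the driver has to be shipped to the executors and fails with the NotSerializableException above. Building it inside mapPartitions means each executor creates its own copy, once per partition rather than once per record. A small variation of the sketch above that also releases the connection when the partition is done; the table name here is a placeholder, not from the original post:
sc.hadoopFile(inDirTrails, AvroInputFormat.class, AvroWrapper.class, NullWritable.class).mapPartitions(i -> {
    Configuration hbaseConfiguration = HBaseConfiguration.create();
    hbaseConfiguration.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
    // Connection and Table are Closeable, so try-with-resources releases them
    // once this partition has been processed.
    try (Connection connection = ConnectionFactory.createConnection(hbaseConfiguration);
         Table table = connection.getTable(TableName.valueOf("my_table"))) {
        // ... read/write this partition's records against the table ...
    }
    // ... return an iterator over whatever this partition produces ...
});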

Unable to execute "put" in the map function using HBase and Hadoop

Hi everybody. I'm using MapReduce to process some log files stored on HDFS. I want to retrieve some info from the files and store it in HBase,
so I launch the job:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar crm_hbase-1.0.jar /datastream/music/useraction/2014-11-30/music_useraction_20141130-230003072+0800.24576015364769354.00018022.lzo
If I just run the job as "hadoop jar xxxx" it fails with a "cannot find HBaseConfiguration" error.
My code is quite simple:
public int run(String[] strings) throws Exception {
    Configuration config = HBaseConfiguration.create();
    String kerbConfPrincipal = "ndir#HADOOP.HZ.NETEASE.COM";
    String kerbKeytab = "/srv/zwj/ndir.keytab";
    UserGroupInformation.loginUserFromKeytab(kerbConfPrincipal, kerbKeytab);
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    System.out.println(" auth: " + ugi.getAuthenticationMethod());
    System.out.println(" name: " + ugi.getUserName());
    System.out.println(" using keytab:" + ugi.isFromKeytab());
    HBaseAdmin.checkHBaseAvailable(config);
    // set job name
    Job job = new Job(config, "Import from file ");
    job.setJarByClass(LogRun.class);
    // set map class
    job.setMapperClass(LogMapper.class);
    // set output format and output table name
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "crm_data");
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.addDependencyJars(job);
But when I try to run this MR job, I cannot execute context.write(null, put); the map seems to halt at this line.
I think it is related to kerbKeytab; does that mean I need to "login" again when the map process runs?
After adding TableMapReduceUtil it works:
Job job = new Job(config, "Import from file ");
job.setJarByClass(LogRun.class);
// set map class
job.setMapperClass(LogMapper.class);
TableMapReduceUtil.initTableReducerJob(table, null, job);
job.setNumReduceTasks(0);
TableMapReduceUtil.addDependencyJars(job);
FileInputFormat.setInputPaths(job, input);
// FileInputFormat.addInputPath(job, new Path(strings[0]));
int ret = job.waitForCompletion(true) ? 0 : 1;
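The mapper itself isn't shown in the post. For reference, a minimal sketch of a mapper that writes Put objects the way context.write(null, put) implies; the class name LogMapper matches the job setup above, but the input types, parsing, row key, and column family are assumptions:
// Hypothetical sketch of LogMapper, not the poster's code.
public class LogMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String[] fields = value.toString().split("\t");        // delimiter is an assumption
        Put put = new Put(Bytes.toBytes(fields[0]));            // row key: first field (assumption)
        // Column family "cf" is a placeholder; use addColumn on newer HBase versions.
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("action"), Bytes.toBytes(fields[1]));
        // With TableOutputFormat the output key is ignored, so null is fine here.
        context.write(null, put);
    }
}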

Get the total mapping and reducing times in Hadoop programmatically

I am trying to calculate the individual total times of Mapping, Shuffling and Reducing by all tasks in my MR code.
I need help retrieving that information for each MapReduce Job.
Can someone post any code snippet that does that calculation?
You need to use the JobClient API, as shown below.
There are, however, some quirks to the API. Try it and let me know; I will help you out.
JobClient client = null;
Configuration configuration = new Configuration();
configuration.set("mapred.job.tracker", jobTrackerURL);
client = new JobClient(new JobConf(configuration));
while (true) {
    List<JobStatus> jobEntries = getTrackerEntries(jobName, client);
    for (JobStatus jobStatus : jobEntries) {
        JobID jobId = jobStatus.getJobID();
        String trackerJobName = client.getJob(jobId).getJobName();
        TaskReport[] mapReports = client.getMapTaskReports(jobId);
        TaskReport[] reduceReports = client.getReduceTaskReports(jobId);
        client.getJob(jobId).getJobStatus().getStartTime();
        int jobMapper = mapReports.length;
        mapNumber += jobMapper;
        int jobReducers = reduceReports.length;
        reduceNumber += jobReducers;
    }
}
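The snippet above only counts tasks; since the question asks for total mapping and reducing times, here is a sketch, under the same JobClient setup and for a jobId from the loop above, that sums per-task durations from the TaskReport timestamps. Shuffle time is not broken out by TaskReport, so it would have to come from task counters or the job history instead:
// Sum (finish - start) across all map and reduce tasks of one job.
long totalMapMillis = 0;
for (TaskReport report : client.getMapTaskReports(jobId)) {
    totalMapMillis += report.getFinishTime() - report.getStartTime();
}
long totalReduceMillis = 0;
for (TaskReport report : client.getReduceTaskReports(jobId)) {
    totalReduceMillis += report.getFinishTime() - report.getStartTime();
}
System.out.println("Total map time (ms): " + totalMapMillis
        + ", total reduce time (ms): " + totalReduceMillis);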

YCSB load and run on Elasticsearch

I'm trying to run the benchmark software YCSB against Elasticsearch.
The problem I'm having is that after the load, the data seems to get removed during cleanup.
I'm struggling to understand what is supposed to happen.
If I comment out the cleanup, it still fails because it cannot find the index during the "run" phase.
Can someone please explain what is supposed to happen in YCSB?
I would expect:
1. load phase: load, say, 1,000,000 records
2. run phase: query the records loaded during the "load phase"
Thanks,
Okay, I have discovered by running Couchbase in YCSB that the data shouldn't be removed.
Looking at cleanup() for ElasticSearchClient, I see no reason why the files would be deleted:
@Override
public void cleanup() throws DBException {
    if (!node.isClosed()) {
        client.close();
        node.stop();
        node.close();
    }
}
The init is as follows; is there any reason this would not persist on the filesystem?
public void init() throws DBException {
    // initialize OrientDB driver
    Properties props = getProperties();
    this.indexKey = props.getProperty("es.index.key", DEFAULT_INDEX_KEY);
    String clusterName = props.getProperty("cluster.name", DEFAULT_CLUSTER_NAME);
    Boolean newdb = Boolean.parseBoolean(props.getProperty("elasticsearch.newdb", "false"));
    Builder settings = settingsBuilder()
            .put("node.local", "true")
            .put("path.data", System.getProperty("java.io.tmpdir") + "/esdata")
            .put("discovery.zen.ping.multicast.enabled", "false")
            .put("index.mapping._id.indexed", "true")
            .put("index.gateway.type", "none")
            .put("gateway.type", "none")
            .put("index.number_of_shards", "1")
            .put("index.number_of_replicas", "0");
    // if properties file contains elasticsearch user defined properties
    // add it to the settings file (will overwrite the defaults).
    settings.put(props);
    System.out.println("ElasticSearch starting node = " + settings.get("cluster.name"));
    System.out.println("ElasticSearch node data path = " + settings.get("path.data"));
    node = nodeBuilder().clusterName(clusterName).settings(settings).node();
    node.start();
    client = node.client();
    if (newdb) {
        client.admin().indices().prepareDelete(indexKey).execute().actionGet();
        client.admin().indices().prepareCreate(indexKey).execute().actionGet();
    } else {
        boolean exists = client.admin().indices().exists(Requests.indicesExistsRequest(indexKey)).actionGet().isExists();
        if (!exists) {
            client.admin().indices().prepareCreate(indexKey).execute().actionGet();
        }
    }
}
Thanks,
Okay, what I am finding is as follows (any help from Elasticsearch users is much appreciated, because I'm obviously doing something wrong): even when the load shuts down leaving the data behind, the "run" phase still cannot find the data on startup.
ElasticSearch node data path = C:\Users\Pl_2\AppData\Local\Temp\/esdata
org.elasticsearch.action.NoShardAvailableActionException: [es.ycsb][0] No shard available for [[es.ycsb][usertable][user4283669858964623926]: routing [null]]
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:140)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:125)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:72)
at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:47)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:83)
The GitHub README has been updated.
It looks like you need to specify:
-p path.home=<path to folder to persist data>
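For reference, with that property the two phases might be invoked along these lines; the workload file and the path are placeholders, -P points at a workload file and -p sets a single property:
./bin/ycsb load elasticsearch -P workloads/workloada -p path.home=/var/ycsb/es-data
./bin/ycsb run elasticsearch -P workloads/workloada -p path.home=/var/ycsb/es-data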

How to get job counters with the Sqoop 1.4.4 Java API?

I'm using Sqoop 1.4.4 and its Java API to run an import job, and I'm having trouble figuring out how to access the job counters once the import has completed. I see suitable methods in the ConfigurationHelper class, like getNumMapOutputRecords, but I'm not sure how to pass the job to them.
Is there a way to get at the job from the SqoopTool or Sqoop objects?
My code looks something like this:
SqoopTool sqoopTool = new ImportTool();
SqoopOptions options = new SqoopOptions();
options.setConnectString(connectString);
options.setUsername(username);
options.setPassword(password);
options.setTableName(table);
options.setColumns(columns);
options.setWhereClause(whereClause);
options.setTargetDir(targetDir);
options.setNumMappers(1);
options.setFileLayout(FileLayout.TextFile);
options.setFieldsTerminatedBy(delimiter);
Configuration config = new Configuration();
config.set("oracle.sessionTimeZone", timezone.getID());
System.setProperty(Sqoop.SQOOP_RETHROW_PROPERTY, "1");
Sqoop sqoop = new Sqoop(sqoopTool, config, options);
String[] nullArgs = new String[0];
Sqoop.runSqoop(sqoop, nullArgs);
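From the post it is not clear that the Sqoop or SqoopTool objects expose the underlying MapReduce job at all. For what it's worth, here is a sketch of what reading the counters would look like if a handle to the completed org.apache.hadoop.mapreduce.Job were available (obtaining that handle is the open question; it might require patching ImportTool or going through the job history server). The counters shown are standard Hadoop TaskCounter values, not Sqoop-specific API:
// Hypothetical: 'job' is a handle to the completed import's MapReduce job.
Counters counters = job.getCounters();
long mapInputRecords = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
long mapOutputRecords = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
System.out.println("map input records = " + mapInputRecords
        + ", map output records = " + mapOutputRecords);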
