Understanding HBase Java Client - hadoop

I started learning HBase a few days ago and have been going through the material available online.
I have installed and configured HBase, and the shell commands are working fine.
I found an example Java client that reads data from an HBase table, and it ran successfully, but I do not understand how it works. Nowhere in the code do we specify the host or port of the HBase server, so how is it able to fetch data from the table?
This is my code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RetriveData {
    public static void main(String[] args) throws IOException {
        // Instantiating Configuration class
        Configuration config = HBaseConfiguration.create();
        // Instantiating HTable class
        @SuppressWarnings({ "deprecation", "resource" })
        HTable table = new HTable(config, "emp");
        // Instantiating Get class
        Get g = new Get(Bytes.toBytes("1"));
        // Reading the data
        Result result = table.get(g);
        // Reading values from Result class object
        byte[] value = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
        byte[] value1 = result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("city"));
        // Printing the values
        String name = Bytes.toString(value);
        String city = Bytes.toString(value1);
        System.out.println("name: " + name + " city: " + city);
    }
}
The output looks like:
name: raju city: hyderabad

I agree with Binary Nerd's answer. Here is some additional information for a better understanding.
Your question:
I could not understand how it is working? In the code nowhere we have
mentioned the port, host of Hbase server? How it able to fetch the
data from table?
You are running this program on a cluster node, so when you call
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
all the cluster properties are picked up from the configuration already present on that machine. You do not have to supply them yourself.
Now try running the same program a different way, for example from Eclipse on a remote Windows machine, with the connection properties set explicitly as below. This shows the difference between what you did earlier and a client running outside the cluster.
public static Configuration configuration; // this is a class variable
static { // fill in clusternode1, clusternode2, clusternode3 from your cluster
    configuration = HBaseConfiguration.create();
    configuration.set("hbase.zookeeper.property.clientPort", "2181");
    configuration.set("hbase.zookeeper.quorum",
            "clusternode1,clusternode2,clusternode3");
    configuration.set("hbase.master", "clusternode1:60000");
}
Hope this helps you understand.

If you look at the source code for HBaseConfiguration on GitHub, you can see what it does when you call create().
public static Configuration create() {
    Configuration conf = new Configuration();
    // In case HBaseConfiguration is loaded from a different classloader than
    // Configuration, conf needs to be set with appropriate class loader to resolve
    // HBase resources.
    conf.setClassLoader(HBaseConfiguration.class.getClassLoader());
    return addHbaseResources(conf);
}
Followed by:
public static Configuration addHbaseResources(Configuration conf) {
    conf.addResource("hbase-default.xml");
    conf.addResource("hbase-site.xml");
    checkDefaultsVersion(conf);
    HeapMemorySizeUtil.checkForClusterFreeMemoryLimit(conf);
    return conf;
}
So it is loading the configuration from your HBase configuration files, hbase-default.xml and hbase-site.xml, which it finds on the classpath.
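The layering described above can be illustrated with a small standalone sketch. It does not use Hadoop's Configuration class at all; it only mimics the idea with java.util.Properties, where a "site" layer overrides a "defaults" layer. The class name LayeredConfigSketch and the property values are assumptions made for illustration, not values taken from a real installation.

```java
import java.util.Properties;

class LayeredConfigSketch {
    // Builds a lookup in which site-specific values override library defaults.
    public static Properties build() {
        // Stand-in for hbase-default.xml: values shipped with the library (illustrative).
        Properties defaults = new Properties();
        defaults.setProperty("hbase.zookeeper.quorum", "localhost");
        defaults.setProperty("hbase.zookeeper.property.clientPort", "2181");

        // Stand-in for hbase-site.xml: values for one particular cluster (illustrative).
        // Passing 'defaults' to the constructor makes lookups fall back to it.
        Properties site = new Properties(defaults);
        site.setProperty("hbase.zookeeper.quorum", "clusternode1,clusternode2,clusternode3");
        return site;
    }

    public static void main(String[] args) {
        Properties conf = build();
        // Overridden by the site layer.
        System.out.println(conf.getProperty("hbase.zookeeper.quorum"));
        // Not set in the site layer, so the default is used.
        System.out.println(conf.getProperty("hbase.zookeeper.property.clientPort"));
    }
}
```

In this picture the client code never names a host or port because the lookup falls through to whichever layer supplies the value.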

Related

Dynamically generate Application.properties file in spring boot

I have come across a situation where I need to fetch a cron expression from a database and then schedule it in Spring Boot. I am fetching the data using JPA. The problem is that the @Scheduled annotation does not let me use the database value directly, because it accepts only constant values. So my plan is to generate a properties file dynamically and read the cron expression from it. But here I face another problem: the dynamically generated properties file is created in the target directory,
so I cannot use it at program load time.
Can anyone help me either read the dynamically generated file from the resource folder, or schedule a cron expression fetched from the database in Spring Boot?
If I place all the cron expression details in a static properties file, I can schedule the job.
My latest attempt at generating the properties file dynamically:
@Configuration
public class CronConfiguration {
    @Autowired
    private JobRepository jobRepository;
    @Autowired
    private ResourceLoader resourceLoader;

    @PostConstruct
    protected void initialize() {
        updateConfiguration();
    }

    private void updateConfiguration() {
        Properties properties = new Properties();
        List<Job> morningJobList = new ArrayList<Job>();
        List<String> morningJobCornExp = new ArrayList<String>();
        // Map<String,String> map=new HashMap<>();
        int num = 1;
        System.out.println("started");
        morningJobList = jobRepository.findByDescriptionContaining("Morning Job");
        for (Job job : morningJobList) {
            //morningJobURL.add(job.getJobUrl());
            morningJobCornExp.add(job.getCronExp());
        }
        for (String cron : morningJobCornExp) {
            properties.setProperty("cron.expression"+num+"=", cron);
            num++;
        }
        Resource propertiesResource = resourceLoader.getResource("classpath:application1.properties");
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(propertiesResource.getFile()))) {
            properties.store(out, null);
        } catch (Exception ex) {
            // Handle error
            ex.printStackTrace();
        }
    }
}
It still cannot write to the properties file under the resource folder.
Consider using the Quartz Scheduler framework. It stores scheduler information in the database, so there is no need to implement your own database communication; it is already provided.
I found this example: https://www.callicoder.com/spring-boot-quartz-scheduler-email-scheduling-example/
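Separately from the scheduling question, one detail in the question's code may be worth checking: each key is built as "cron.expression"+num+"=", so the "=" becomes part of the property name. The standalone sketch below uses only java.util.Properties to show the effect. The class name CronPropertyKeySketch and the cron value are hypothetical and stand in for whatever job.getCronExp() returns.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

class CronPropertyKeySketch {
    // Stores one entry in .properties format and parses the resulting text back in.
    public static Properties roundTrip(String key, String cron) {
        try {
            Properties written = new Properties();
            written.setProperty(key, cron);
            StringWriter out = new StringWriter();
            written.store(out, null);

            Properties read = new Properties();
            read.load(new StringReader(out.toString()));
            return read;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String cron = "0 0 6 * * *"; // hypothetical value
        Properties withEquals = roundTrip("cron.expression1=", cron);
        Properties plain = roundTrip("cron.expression1", cron);
        // The stored key is "cron.expression1=", so looking up the name without "=" finds nothing.
        System.out.println(withEquals.getProperty("cron.expression1"));
        // With a plain key the expression is found under the expected name.
        System.out.println(plain.getProperty("cron.expression1"));
    }
}
```

If the generated file is ever read through placeholders such as cron.expression1, dropping the trailing "=" from the key would be the likely fix.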

Hazelcast ClassNotFound using Near Cache in Client

I am trying to use Hazelcast (3.9.2 or 3.11, it makes no difference) in the following way:
I have Hazelcast servers (members). I run them dedicated, not embedded.
I do not want to teach the Hazelcast members the classes I want to store in them. I used the bundled hazelcast.xml file and made the following addition (3.9.2):
<replicatedmap name="default">
    <in-memory-format>BINARY</in-memory-format>
    <statistics-enabled>true</statistics-enabled>
</replicatedmap>
I also enabled TCP/IP discovery instead of multicast (enabled set to true and false, respectively).
Those are all the changes I made. I started with one member listening on 127.0.0.1:5701.
Then I try to attach Hazelcast clients to the member to store and retrieve maps (primarily ReplicatedMaps, but regular Maps do not work in my scenario either).
My client code looks like this (Cache is just a Serializable class with no attributes):
public class Main {
    public static final String HAZELCAST_INSTANCE_NAME = "HAZI";
    public static final String REPLICATEDMAP_NAME = "REP_MAP";
    public static final String MAP_NAME = "NORMAL_MAP";

    public static void main(String[] args) {
        init();
        HazelcastInstance instance = HazelcastClient.getHazelcastClientByName(HAZELCAST_INSTANCE_NAME);
        Map<String, Object> repMap = instance.getReplicatedMap(REPLICATEDMAP_NAME);
        repMap.put("MyKey", new Cache());
        System.err.println("Retrieve " + repMap.get("MyKey"));
        Map<String, Object> normalMap = instance.getReplicatedMap(MAP_NAME);
        normalMap.put("MyKey", new Cache());
        System.err.println("Retrieve " + normalMap.get("MyKey"));
        System.exit(1);
    }

    private static void init() {
        ClientConfig cfg = new ClientConfig();
        cfg.setInstanceName(HAZELCAST_INSTANCE_NAME);
        cfg.addNearCacheConfig(defineNearCache(REPLICATEDMAP_NAME));
        cfg.addNearCacheConfig(defineNearCache(MAP_NAME));
        // for analysis in the hazelcast management console
        cfg.getProperties().put("hazelcast.client.statistics.enabled", "true");
        cfg.getProperties().put("hazelcast.client.statistics.period.seconds", "60");
        cfg.getNetworkConfig().addAddress("127.0.0.1:5701");
        if (HazelcastClient.newHazelcastClient(cfg) == null) {
            System.err.println(" !!! ERROR in Cache Config !!!");
        }
    }

    private static NearCacheConfig defineNearCache(String mapName) {
        EvictionConfig evictionConfig = new EvictionConfig()
                .setMaximumSizePolicy(EvictionConfig.MaxSizePolicy.ENTRY_COUNT)
                .setSize(200);
        return new NearCacheConfig()
                .setName(mapName)
                .setInMemoryFormat(InMemoryFormat.BINARY)
                .setInvalidateOnChange(true)
                .setEvictionConfig(evictionConfig);
    }
}
My problem now is:
With this code I get a ClassNotFoundException when trying to put things into the replicated map or the regular map, but it occurs in the dedicated Hazelcast server (member), not on the client side.
SEVERE: [127.0.0.1]:5701 [dev] [3.9.2] hz._hzInstance_1_dev.event-3 caught an exception while processing task:com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher@eeed098
com.hazelcast.nio.serialization.HazelcastSerializationException: java.lang.ClassNotFoundException: de.empic.hazelwar.model.Cache
at com.hazelcast.internal.serialization.impl.JavaDefaultSerializers$JavaSerializer.read(JavaDefaultSerializers.java:224)
at com.hazelcast.internal.serialization.impl.StreamSerializerAdapter.read(StreamSerializerAdapter.java:48)
at com.hazelcast.internal.serialization.impl.AbstractSerializationService.toObject(AbstractSerializationService.java:185)
at com.hazelcast.map.impl.DataAwareEntryEvent.getValue(DataAwareEntryEvent.java:90)
at com.hazelcast.client.impl.protocol.task.replicatedmap.AbstractReplicatedMapAddEntryListenerMessageTask.handleEvent(AbstractReplicatedMapAddEntryListenerMessageTask.java:92)
at com.hazelcast.client.impl.protocol.task.replicatedmap.AbstractReplicatedMapAddEntryListenerMessageTask.entryAdded(AbstractReplicatedMapAddEntryListenerMessageTask.java:132)
at com.hazelcast.replicatedmap.impl.ReplicatedMapEventPublishingService.dispatchEvent(ReplicatedMapEventPublishingService.java:82)
at com.hazelcast.replicatedmap.impl.ReplicatedMapService.dispatchEvent(ReplicatedMapService.java:247)
at com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher.run(LocalEventDispatcher.java:64)
at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:225)
at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:208)
Caused by: java.lang.ClassNotFoundException: de.empic.hazelwar.model.Cache
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at com.hazelcast.nio.ClassLoaderUtil.tryLoadClass(ClassLoaderUtil.java:173)
at com.hazelcast.nio.ClassLoaderUtil.loadClass(ClassLoaderUtil.java:147)
at com.hazelcast.nio.IOUtil$ClassLoaderAwareObjectInputStream.resolveClass(IOUtil.java:591)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at com.hazelcast.internal.serialization.impl.JavaDefaultSerializers$JavaSerializer.read(JavaDefaultSerializers.java:219)
... 10 more
Whenever I remove the near cache config from the client config, everything works perfectly, except that I do not have a near cache, of course.
What am I missing here?
@magicroomy, I ran the same code on both 3.9.2 and 3.11. I can confirm that:
If you change the Replicated Map to a Map, it works with or without Near Cache.
When using a Replicated Map with Near Cache defined, the exception is thrown on the server side.
Without Near Cache, ReplicatedMap also works.
I created a GitHub issue as well: https://github.com/hazelcast/hazelcast/issues/14210
My problem is solved by using version 3.11.1 of Hazelcast.

Connecting to Accumulo inside a Mapper using Kerberos

I am moving some software from an older Hadoop cluster (which uses username/password authentication) to a newer one, 2.6.0-cdh5.12.0, which has Kerberos authentication enabled.
I have been able to get many of the existing Map/Reduce jobs that use Accumulo for their input and/or output to work fine using a DelegationToken set in the AccumuloInput/OutputFormat classes.
However, I have one job that uses AccumuloInput/OutputFormat for input and output but also connects to Accumulo via ZooKeeper inside its Mapper.setup() method, so that Mapper.map() can compare each key/value it processes to an entry in another Accumulo table.
I have included the relevant code below. It shows the setup() method connecting to ZooKeeper using a PasswordToken and then creating an Accumulo table Scanner, which is then used in the map method.
So the question is: how do I replace the PasswordToken with a KerberosToken when setting up the Accumulo scanner in the Mapper.setup() method? I can find no way to "get" the DelegationToken that I set on the AccumuloInput/OutputFormat classes.
I have tried context.getCredentials().getAllTokens() and looked for a token of type org.apache.accumulo.core.client.security.tokens.AuthenticationToken, but all of the tokens returned are of type org.apache.hadoop.security.token.Token.
Please note that I typed the code fragments in rather than cutting and pasting, because the code runs on a network that is not connected to the internet, so there may be a typo. :)
//****************************
// code in the M/R driver
//****************************
ClientConfiguration accumuloCfg = ClientConfiguration.loadDefault().withInstance("Accumulo1").withZkHosts("zookeeper1");
ZooKeeperInstance inst = new ZooKeeperInstance(accumuloCfg);
AuthenticationToken dt = conn.securityOperations().getDelegationToken(new DelegationTokenConfig());
AccumuloInputFormat.setConnectorInfo(job, username, dt);
AccumuloOutputFormat.setConnectorInfo(job, username, dt);
// other job setup and then
job.waitForCompletion(true);
//****************************
// this is inside the Mapper class of the M/R job
//****************************
private Scanner index_scanner;

public void setup(Context context) {
    Configuration cfg = context.getConfiguration();
    // properties set and passed from M/R Driver program
    String username = cfg.get("UserName");
    String password = cfg.get("Password");
    String accumuloInstName = cfg.get("InstanceName");
    String zookeepers = cfg.get("Zookeepers");
    String tableName = cfg.get("TableName");
    Instance inst = new ZooKeeperInstance(accumuloInstName, zookeepers);
    try {
        AuthenticationToken passwordToken = new PasswordToken(password);
        Connector conn = inst.getConnector(username, passwordToken);
        index_scanner = conn.createScanner(tableName, conn.securityOperations().getUserAuthorizations(username));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void map(Key key, Value value, Context context) throws IOException, InterruptedException {
    String uuid = key.getRow().toString();
    index_scanner.clearColumns();
    index_scanner.setRange(Range.exact(uuid));
    for (Entry<Key, Value> entry : index_scanner) {
        // do some processing in here
    }
}
The provided AccumuloInputFormat and AccumuloOutputFormat have a method to set the token in the job configuration: Accumulo*putFormat.setConnectorInfo(job, principal, token). You can also serialize the token to a file in HDFS using the AuthenticationTokenSerializer, and use the version of the setConnectorInfo method that accepts a file name.
If a KerberosToken is passed in, the job will create a DelegationToken to use, and if a DelegationToken is passed in, it will just use that.
The provided AccumuloInputFormat should handle its own scanner, so normally, you shouldn't have to do that in your Mapper if you've set the configuration properly. However, if you're doing secondary scanning (for something like a join) inside your Mapper, you can inspect the provided AccumuloInputFormat's RecordReader source code for an example of how to retrieve the configuration and construct a Scanner.

Specify job properties and override properties in hadoop jobs

I have a hadoop (2.2.0) map-reduce job which reads text from a specified path (say INPUT_PATH), and does some processing. I don't want to hardcode the input path (since it comes from some other source which changes each week).
I believe there should be a way in Hadoop to specify an XML properties file when running through the command line. How should I do it?
One way I thought was to set an environment variable which points to the location of the properties file and then read this env variable in code and subsequently read the property file. This could work because the value of the env variable can be changed each week without changing the code. But I feel this is an ugly way of loading properties and overrides.
Please let me know the least hacky way of doing this.
There is no built-in way to read an arbitrary configuration file for input/output paths.
One approach I can suggest is to implement a Java M/R driver program that does the following:
Reads the configuration (XML, properties, or anything else), which is probably generated or updated by the other process.
Sets the job properties.
Submits the job using your hadoop command (pass the configuration file as an argument).
Something like this:
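import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SampleMRDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Read the configuration file whose path is passed as the first argument
        Properties prop = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            prop.loadFromXML(in);
        }
        Job job = Job.getInstance(getConf(), "Test Job");
        job.setJarByClass(SampleMRDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);
        // getProperty returns a String, which is what the Path constructor needs
        FileInputFormat.setInputPaths(job, new Path(prop.getProperty("input_path")));
        FileOutputFormat.setOutputPath(job, new Path(prop.getProperty("output_path")));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new SampleMRDriver(), args));
    }
}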
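The driver above reads its paths with Properties.loadFromXML, which expects a specific XML layout. As a rough illustration of that layout, the sketch below writes two entries with storeToXML and reads them back, using only the JDK. The class name JobPropertiesXmlSketch and the example paths are assumptions; only the keys input_path and output_path come from the answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class JobPropertiesXmlSketch {
    // Serializes the two paths in the XML format that Properties.loadFromXML can read.
    public static byte[] writeXml(String inputPath, String outputPath) {
        Properties prop = new Properties();
        prop.setProperty("input_path", inputPath);
        prop.setProperty("output_path", outputPath);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            prop.storeToXML(out, "weekly job paths");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Parses the XML back into a Properties object, as the driver does with args[0].
    public static Properties readXml(byte[] xml) {
        Properties prop = new Properties();
        try {
            prop.loadFromXML(new ByteArrayInputStream(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return prop;
    }

    public static void main(String[] args) {
        // Hypothetical paths; in practice the other process would write this file each week.
        byte[] xml = writeXml("/data/week42/input", "/data/week42/output");
        System.out.println(new String(xml, StandardCharsets.UTF_8));
        System.out.println(readXml(xml).getProperty("input_path"));
    }
}
```

Printing the generated XML once is a quick way to see what the weekly process needs to produce for the driver to accept it.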

Hadoop setInputPathFilter error

I am using Hadoop 0.20.2 (that cannot be changed) and I want to add a filter to my input path. The data looks as follows:
/path1/test_a1
/path1/test_a2
/path1/train_a1
/path1/train_a2
and I only want to process all files with train in them.
A look at the FileInputFormat class suggests to use:
FileInputFormat.setInputPathFilter(Job job, Class<? extends PathFilter> filter)
and this is where my problem starts, since PathFilter is an interface - of course, I can extend the interface but then I still do not have an implementation. So instead, I implemented the interface:
class TrainFilter implements PathFilter
{
    public boolean accept(Path path)
    {
        return path.toString().contains("train");
    }
}
When I use TrainFilter as PathFilter the code compiles, however when I run it, I get an exception as the input path is screwed up. Without setting the filter, my code runs through all files that are below /path1, however, when setting the filter, it throws the error:
InvalidInputException: Input path does not exist hdfs://localhost:9000/path1
Here is how I set it up in the driver code:
job.setMapperClass(....class);
job.setInputFormatClass(....class);
job.setMapOutputKeyClass(...class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPathFilter(job, TrainFilter.class);
FileInputFormat.addInputPath(job, new Path("/path1/"));
FileOutputFormat.setOutputPath(job, new Path("/path2/"));
job.waitForCompletion(true);
Any suggestions of what I am doing wrong here?
EDIT: I found the problem. The first call to the PathFilter is always the directory itself (/path1) and since it does not contain ("train"), the directory itself is invalid and thus the exception is thrown. Which brings me to another question: how can I test if an arbitrary path is a directory? For all I know, I need a reference to the FileSystem, which is not one of the default parameters of PathFilter.
Alternatively, you may try to loop through all of the files in the given directory and check if the file names begin with train. E.g:
Job job = new Job(conf, "myJob");
List<Path> inputPaths = new ArrayList<Path>();
String basePath = "/user/hadoop/path";
FileSystem fs = FileSystem.get(conf);
// Expand the glob so that only entries whose names start with "train" are added
FileStatus[] listStatus = fs.globStatus(new Path(basePath + "/train*"));
for (FileStatus fstat : listStatus) {
    inputPaths.add(fstat.getPath());
}
FileInputFormat.setInputPaths(job,
        inputPaths.toArray(new Path[inputPaths.size()]));
As a quick fix, you can blacklist paths instead of whitelisting them, for example by returning false if the path contains "test".
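The difference between the two approaches can be seen in a small standalone sketch. It deliberately avoids the Hadoop API: plain strings stand in for Path objects and java.util.function.Predicate stands in for PathFilter, so the class name TrainPathFilterSketch and both predicates are illustrative assumptions only.

```java
import java.util.function.Predicate;

class TrainPathFilterSketch {
    // Whitelist on the full path: also rejects the input directory "/path1" itself.
    public static final Predicate<String> WHITELIST = path -> path.contains("train");

    // Blacklist: rejects only paths known to be unwanted, so the directory still passes.
    public static final Predicate<String> BLACKLIST = path -> !path.contains("test");

    public static void main(String[] args) {
        String[] paths = {"/path1", "/path1/test_a1", "/path1/train_a1"};
        for (String p : paths) {
            System.out.println(p + " whitelist=" + WHITELIST.test(p)
                    + " blacklist=" + BLACKLIST.test(p));
        }
    }
}
```

The blacklist only helps while every unwanted file name contains "test"; any other file under the directory would be accepted too.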
You can get a FileSystem instance by having your Filter implement the Configurable interface (or extend the Configured class), and create a fileSystem instance variable in the setConf method:
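class TrainFilter extends Configured implements PathFilter
{
    FileSystem fileSystem;
    public boolean accept(Path path)
    {
        // TODO: use fileSystem here to determine if path is a directory
        return path.toString().contains("train");
    }
    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf); // keep getConf() usable
        if (conf != null) {
            try {
                fileSystem = FileSystem.get(conf);
            } catch (IOException e) {
                // FileSystem.get declares IOException, which setConf cannot declare
                throw new RuntimeException(e);
            }
        }
    }
}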
I know this is a very old question, but it helped me find a way to exclude paths when all the other examples failed me, such as the one documented here.
I just want to add a warning about the answer that @ChrisWhite gave: I am working with the Hadoop 3.3.0 API, and using setConf threw a NullPointerException when I tried to pull something out of the configuration.
Instead, I found that it works to keep extends Configured in the PathFilter signature and call getConf().get(<your configuration parameter name>) wherever you need a value from the job configuration.
So, my excluding PathFilter looks like this:
public static class ExcludedPathsFilter extends Configured implements PathFilter {
    public boolean accept(Path includePathGlob) {
        //debugging
        System.out.println("excludedPath parameter is "+getConf().get("excludedPath")+", includePath parameter is "+includePathGlob.toString()+" and !includePathGlob.toString().contains(getConf().get(\"excludedPath\")) is "+!includePathGlob.toString().contains(getConf().get("excludedPath")));
        return !includePathGlob.toString().contains(getConf().get("excludedPath"));
    }
}

Resources