Input format for MaxEnt OpenNLP implementation? - opennlp

I'm trying to use the OpenNLP implementation of the Maximum Entropy classifier, but the documentation seems to be quite lacking. Although the library is apparently designed for ease of use, I cannot find a single example or specification of the input file format (i.e., the training set).
Does anybody know where to find this, or a minimal working example of training?

OpenNLP's format is quite flexible. If you want to use the MaxEnt classifier in OpenNLP there are a few steps involved.
Here is sample code with comments:
package example;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import opennlp.tools.ml.maxent.GISTrainer;
import opennlp.tools.ml.model.Event;
import opennlp.tools.ml.model.MaxentModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.FilterObjectStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
public class ReadData {
public static void main(String[] args) throws Exception{
// this is the data file ...
// the format is <LIST of FEATURES separated by spaces> <outcome>
// change the file to fit your needs
File f=new File("football.dat");
// we need to create an ObjectStream of events for the trainer..
// First create an InputStreamFactory -- given a file we can create an InputStream, required for resetting...
MarkableFileInputStreamFactory factory=new MarkableFileInputStreamFactory(f);
// create a PlainTextByLineStream -- Note: you can create your own stream that can handle binary files or data that
// -- crosses two lines...
ObjectStream<String> stream=new PlainTextByLineStream(factory, Charset.defaultCharset());
// Now you have a stream of string you need to convert it to a stream of events...
// I use a custom FilterObjectStream which simply takes a line, breaks it up into tokens,
// uses all except the last as the features [context] and the last token as the outcome class
ObjectStream<Event> eventStream=new FilterObjectStream<String, Event>(stream) {
@Override
public Event read() throws IOException {
String line=samples.read();
if (line==null) return null;
String[] parts=WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] context=Arrays.copyOf(parts, parts.length-1);
System.out.println(parts[parts.length-1]+" "+Arrays.toString(context));
return new Event(parts[parts.length-1], context);
}
};
TrainingParameters parameters=new TrainingParameters();
// By default OpenNLP uses a cutoff of 5 (a feature has to occur 5 times before it is used)
// use 1 for my small dataset
parameters.put(GISTrainer.CUTOFF_PARAM, 1);
GISTrainer trainer=new GISTrainer();
// the report map is supposed to mark when default values are assigned...
Map<String,String> reportMap=new HashMap<>();
// DONT FORGET TO INITIALIZE THE TRAINER!!!
trainer.init(parameters, reportMap);
MaxentModel model=trainer.train(eventStream);
// Now we have a model -- you should test on a test set, but
// this is a toy example... so I am just resetting the eventstream.
eventStream.reset();
Event evt=null;
while ( (evt=eventStream.read())!=null ){
System.out.print(Arrays.toString(evt.getContext())+": ");
// Evaluate the context from the event using our model.
// you would want to calculate summary statistics..
double[] p=model.eval(evt.getContext());
System.out.print(model.getBestOutcome(p)+" ");
if (model.getBestOutcome(p).equals(evt.getOutcome())){
System.out.println("CORRECT");
}else{
System.out.println("INCORRECT");
}
}
}
}
Football.dat:
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=man_united Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous man_united
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous tie
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=man_united Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal
home=arsenal Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
home=arsenal Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal
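The line format used above (space-separated features followed by the outcome as the last token) can also be checked without OpenNLP on the classpath. The following is a minimal sketch using only the JDK; the class name `EventLineParser` and its two methods are hypothetical helpers, not part of OpenNLP, and a plain regex split stands in for `WhitespaceTokenizer`.

```java
import java.util.Arrays;

public class EventLineParser {

    /** Returns the outcome: the last whitespace-separated token of the line. */
    public static String outcome(String line) {
        String[] parts = line.trim().split("\\s+");
        return parts[parts.length - 1];
    }

    /** Returns the context (features): every token except the last one. */
    public static String[] context(String line) {
        String[] parts = line.trim().split("\\s+");
        return Arrays.copyOf(parts, parts.length - 1);
    }

    public static void main(String[] args) {
        // A shortened line in the same shape as the rows of Football.dat.
        String line = "home=arsenal Henry=true arsenal_won_previous arsenal";
        System.out.println(outcome(line) + " " + Arrays.toString(context(line)));
    }
}
```

The two return values correspond to the two arguments passed to `new Event(outcome, context)` in the sample code.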
Hope it Helps

Related

Spring Boot test Kafka

I'm using Spring Boot version 2.1.8.RELEASE, and I get the following error when running my test.
Do you have a solution, please?
[Thread-2] ERROR o.a.k.t.TestUtils - Error deleting C:\Users\usr\AppData\Local\Temp\kafka-255644115154741962
java.nio.file.FileSystemException: C:\Users\usr\AppData\Local\Temp\kafka-255644115154741962\version-2\log.1:
The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at org.apache.kafka.common.utils.Utils$2.visitFile(Utils.java:734)
at org.apache.kafka.common.utils.Utils$2.visitFile(Utils.java:723)
at java.nio.file.Files.walkFileTree(Files.java:2670)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at org.apache.kafka.common.utils.Utils.delete(Utils.java:723)
at org.apache.kafka.test.TestUtils$1.run(TestUtils.java:184)
This is my test; I am using Windows 10 as the OS:
import org.I0Itec.zkclient.ZkClient;
import org.junit.Test;
import kafka.utils.ZKStringSerializer$;
import kafka.utils.ZkUtils;
import kafka.zk.EmbeddedZookeeper;
public class BaseTest {
private static final String ZKHOST = "127.0.0.1";
@Test
public void producerTest(){
// setup Zookeeper
EmbeddedZookeeper zkServer = new EmbeddedZookeeper();
String zkConnect = ZKHOST + ":" + zkServer.port();
ZkClient zkClient = new ZkClient(zkConnect, 30000, 30000, ZKStringSerializer$.MODULE$);
ZkUtils zkUtils = ZkUtils.apply(zkClient, false);
zkClient.close();
zkServer.shutdown();
}
}

What's the best practice for Kafka Streaming as ETL replacement?

I am new to kafka and currently looking at Kafka Streams, especially joining two streams.
The samples I browsed worked with rather simple messages/ text messages.
So I constructed another simple sample that is closer to traditional ETL.
Let's say, we have two "datasets": Contract (=Vertrag) and Cashflow, with a cardinality of 1 to n.
In my sample I created a topic for each and sent objects (Vertrag, Cashflow) to each.
And I managed a first join of them.
KStream<String, String> joined = srcVertrag.leftJoin(srcCashflow,
(leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue, /* ValueJoiner */
JoinWindows.of(5000),
Joined.with(
Serdes.String(), /* key */
Serdes.String(), /* left value */
Serdes.String()) /* right value */
);
The result looks like this:
left={"name":"Vertrag123","vertragId":"123"}, right={"buchungstag":1560715764709,"betrag":12.0,"vertragId":"123"}
Now my questions:
is this the right way to do this?
should I create Objects at all or rather process just Strings?
After your hints and further research, I came up with the following test.
- I created Pojos for "Vertrag" and "Cashflow"
- I created Serdes for each
- I stream them as objects
- Finally I try to join them into a wrapper class (and this is where I am stuck).
I can't find samples that do something like this. Is this so exotic?
package tki.bigdata.kafkaetl;
import java.time.Duration;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.ValueJoiner;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import tki.bigdata.domain.Cashflow;
import tki.bigdata.domain.Vertrag;
import tki.bigdata.serde.JsonPOJODeserializer;
import tki.bigdata.serde.JsonPOJOSerializer;
@ComponentScan(basePackages = { "tki.bigdata.domain", "tki.bigdata.config", "tki.bigdata.app" }, basePackageClasses = App.class)
@SpringBootApplication
@EnableScheduling
public class App implements CommandLineRunner {
private static String bootstrapServers = "tobi0179.westeurope.cloudapp.azure.com:9092";
@Autowired
private KafkaTemplate<String, Object> template;
// @Autowired
// ExcelReader excelReader;
public static void main(String[] args) {
SpringApplication.run(App.class, args).close();
}
private void populateSampleData() {
Vertrag v = new Vertrag();
v.setVertragId("123");
v.setName("Vertrag123");
template.send("Vertrag", "123", v);
//template.send("Vertrag", "124", "124;Vertrag12");
Cashflow c = new Cashflow();
c.setVertragId("123");
c.setBetrag(12);
c.setBuchungstag(new Date());
template.send("Cashflow", "123", c);
}
//@Override
public void run(String... args) throws Exception {
// Populate the topics with demo data
populateSampleData();
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
// TODO: the following can be removed with a serialization factory
Map<String, Object> serdeProps = new HashMap<>();
// prepare Serde for Vertrag
final Serializer<Vertrag> vertragSerializer = new JsonPOJOSerializer<Vertrag>();
serdeProps.put("JsonPOJOClass", Vertrag.class);
vertragSerializer.configure(serdeProps, false);
final Deserializer<Vertrag> vertragDeserializer = new JsonPOJODeserializer<Vertrag>();
serdeProps.put("JsonPOJOClass", Vertrag.class);
vertragDeserializer.configure(serdeProps, false);
final Serde<Vertrag> vertragSerde = Serdes.serdeFrom(vertragSerializer, vertragDeserializer);
// prepare Serde for Cashflow
final Serializer<Cashflow> cashflowSerializer = new JsonPOJOSerializer<Cashflow>();
serdeProps.put("JsonPOJOClass", Cashflow.class);
cashflowSerializer.configure(serdeProps, false);
final Deserializer<Cashflow> cashflowDeserializer = new JsonPOJODeserializer<Cashflow>();
serdeProps.put("JsonPOJOClass", Cashflow.class);
cashflowDeserializer.configure(serdeProps, false);
final Serde<Cashflow> cashflowSerde = Serdes.serdeFrom(cashflowSerializer, cashflowDeserializer);
// streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG,
// TestUtils.tempDir().getAbsolutePath());
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Vertrag> srcVertrag = builder.stream("Vertrag");
KStream<String, Cashflow> srcCashflow = builder.stream("Cashflow");
// print to sysout
//srcVertrag.print(Printed.toSysOut());
KStream<String, MyValueContainer> joined = srcVertrag.leftJoin(srcCashflow,
(leftValue, rightValue) -> new MyValueContainer(leftValue , rightValue), /* ValueJoiner */
JoinWindows.of(600),
Joined.with(
Serdes.String(), /* key */
vertragSerde, /* left value */
cashflowSerde) /* right value */
);
joined.to("Output");
final Topology topology = builder.build();
System.out.println(topology.describe());
final KafkaStreams streams = new KafkaStreams(topology, streamsConfiguration);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
}
When executed, it produces the error:
2019-06-17 22:18:31.892 ERROR 1599 --- [-StreamThread-1] o.a.k.s.p.i.AssignedStreamsTasks : stream-thread [streams-pipe-0638d359-94df-43bd-9ef7-eb6769ed8a1c-StreamThread-1] Failed to process stream task 0_0 due to the following error:
java.lang.ClassCastException: java.lang.String cannot be cast to tki.bigdata.domain.Vertrag
at org.apache.kafka.streams.kstream.internals.KStreamKStreamJoin$KStreamKStreamJoinProcessor.process(KStreamKStreamJoin.java:98) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:126) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:63) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:129) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:302) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:409) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:964) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:832) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736) [kafka-streams-2.0.1.jar:na]
is this the right way to do this?
Yes.
should I create Objects at all or rather process just Strings?
Yes, create objects. Look at Avro as a good example of a data format for serializing/deserializing your POJOs. Here, you are looking for an Avro "serde" (serializer/deserializer). Confluent provides such an Avro serde for Kafka Streams, for instance (this serde requires the use of Confluent Schema Registry).
what should I do with the above result?
It's unclear to me what your question is.
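To make the "serde" idea from the answer concrete: a serde is a pair of functions, one turning an object into bytes and one turning those bytes back into the same object. The following is a minimal JDK-only sketch of that round trip. It deliberately does not use Kafka's `Serde` interface or Avro; the class `SerdeSketch`, the `Contract` POJO (standing in for `Vertrag`) and the `id;name` wire format are all assumptions made for illustration, and the format only works when the id contains no semicolon.

```java
import java.nio.charset.StandardCharsets;

public class SerdeSketch {

    /** Hypothetical POJO standing in for the question's Vertrag class. */
    public static final class Contract {
        public final String id;
        public final String name;

        public Contract(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Serializer half: object to bytes, encoded as "id;name" in UTF-8. */
    public static byte[] serialize(Contract c) {
        return (c.id + ";" + c.name).getBytes(StandardCharsets.UTF_8);
    }

    /** Deserializer half: bytes back to an object, the inverse of serialize. */
    public static Contract deserialize(byte[] bytes) {
        // Split on the first semicolon only, so the name may contain semicolons.
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(";", 2);
        return new Contract(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        Contract copy = deserialize(serialize(new Contract("123", "Vertrag123")));
        System.out.println(copy.id + " " + copy.name);
    }
}
```

In Kafka Streams such a pair has to be supplied wherever records are read or written. In the question's topology the source streams are created with `builder.stream("Vertrag")` and no explicit serde, so the configured default (String) applies, which is consistent with the `ClassCastException` shown. Passing the custom serdes via `Consumed.with(...)` when creating the streams is the usual remedy.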

Read from HDFS and write to HBASE

The job has two Mappers, each reading files from a different location:
1) Articles visited by users (sorted by country)
2) Statistics per country
The output of both Mappers is Text, Text.
I am running the program on an Amazon cluster.
My aim is to read data from the two different data sets, combine the results, and store them in HBase.
HDFS to HDFS is working.
The job gets stuck at 67% of the reduce phase and fails with this error:
17/02/24 10:45:31 INFO mapreduce.Job: map 0% reduce 0%
17/02/24 10:45:37 INFO mapreduce.Job: map 100% reduce 0%
17/02/24 10:45:49 INFO mapreduce.Job: map 100% reduce 67%
17/02/24 10:46:00 INFO mapreduce.Job: Task Id : attempt_1487926412544_0016_r_000000_0, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length is 0
at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:565)
at org.apache.hadoop.hbase.client.Put.<init>(Put.java:110)
at org.apache.hadoop.hbase.client.Put.<init>(Put.java:68)
at org.apache.hadoop.hbase.client.Put.<init>(Put.java:58)
at com.happiestminds.hadoop.CounterReducer.reduce(CounterReducer.java:45)
at com.happiestminds.hadoop.CounterReducer.reduce(CounterReducer.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Driver class is
package com.happiestminds.hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Main extends Configured implements Tool {
/**
* @param args
* @throws Exception
*/
public static String outputTable = "mapreduceoutput";
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Main(), args);
System.exit(exitCode);
}
@Override
public int run(String[] args) throws Exception {
Configuration config = HBaseConfiguration.create();
try{
HBaseAdmin.checkHBaseAvailable(config);
}
catch(MasterNotRunningException e){
System.out.println("Master not running");
System.exit(1);
}
Job job = Job.getInstance(config, "Hbase Test");
job.setJarByClass(Main.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, ArticleMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, StatisticsMapper.class);
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.initTableReducerJob(outputTable, CounterReducer.class, job);
//job.setReducerClass(CounterReducer.class);
job.setNumReduceTasks(1);
return job.waitForCompletion(true) ? 0 : 1;
}
}
Reducer class is
package com.happiestminds.hadoop;
import java.io.IOException;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class CounterReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
public static final byte[] CF = "counter".getBytes();
public static final byte[] COUNT = "combined".getBytes();
@Override
protected void reduce(Text key, Iterable<Text> values,
Reducer<Text, Text, ImmutableBytesWritable, Mutation>.Context context)
throws IOException, InterruptedException {
String vals = values.toString();
int counter = 0;
StringBuilder sbr = new StringBuilder();
System.out.println(key.toString());
for (Text val : values) {
String stat = val.toString();
if (stat.equals("***")) {
counter++;
} else {
sbr.append(stat + ",");
}
}
sbr.append("Article count : " + counter);
Put put = new Put(Bytes.toBytes(key.toString()));
put.addColumn(CF, COUNT, Bytes.toBytes(sbr.toString()));
if (counter != 0) {
context.write(null, put);
}
}
}
Dependencies
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.2.2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.2</version>
</dependency>
</dependencies>
A good practice is to validate your values before submitting them anywhere. In your particular case, you can validate your key and sbr, or wrap them in a try-catch block with a proper notification policy. If they are not correct, you should write them to a log and update your unit tests with new test cases:
try
{
Put put = new Put(Bytes.toBytes(key.toString()));
put.addColumn(CF, COUNT, Bytes.toBytes(sbr.toString()));
if (counter != 0) {
context.write(null, put);
}
}
catch (IllegalArgumentException ex)
{
System.err.println("Error processing record - Key: " + key.toString() + ", values: " + sbr.toString());
}
According to the exception thrown by the program, the key length is clearly 0, so before writing to HBase you should check the key length and only perform the put if it is non-zero.
Why HBase does not support zero-length keys:
The HBase data model does not allow a 0-length row key; it must be at least 1 byte. The 0-byte row key is reserved for internal usage (to designate empty start and end keys).
Can you check whether you are inserting any null values?
The HBase data model does not allow a zero-length row key; it must be at least 1 byte.
Please check in your reducer code, before executing the put command, whether any of the values are null.
The error you get is quite self-explanatory. Row keys in HBase can't be empty (though values can be).
@Override
protected void reduce(Text key, Iterable<Text> values,
Reducer<Text, Text, ImmutableBytesWritable, Mutation>.Context context)
throws IOException, InterruptedException {
if (key == null || key.getLength() == 0) {
// Log a warning about the empty key.
return;
}
// Rest of your reducer follows.
}

Access Data from REST API in HIVE

Is there a way to create a hive table where the location for that hive table will be a http JSON REST API? I don't want to import the data every time in HDFS.
I encountered a similar situation in a project a couple of years ago. This is a fairly low-key way of ingesting data from a RESTful API into HDFS, after which you use Hive analytics to implement the business logic. I hope you are familiar with core Java and MapReduce (if not, you might look into Hortonworks DataFlow, HDF, which is a Hortonworks product).
Step 1: Your data ingestion workflow should not be tied to the Hive workflow that contains your business logic. It should be executed independently, on a schedule based on your requirements (volume and velocity of the data flow), and monitored regularly. I am writing this code in a text editor. WARNING: it is not compiled or tested!
The code below uses a Mapper that takes in a URL; you can tweak it to accept a list of URLs from the file system. The payload (the requested data) is stored as a text file in the specified job output directory (ignore the structure of the data for now).
Mapper Class:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class HadoopHttpClientMap extends Mapper<LongWritable, Text, Text, Text> {
private int file = 0;
private String jobOutDir;
private String taskId;
@Override
protected void setup(Context context) throws IOException,InterruptedException {
super.setup(context);
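// Assumed fix: resolve the job's output directory (the original read the output value class name)
jobOutDir = org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputPath(context).toString();
// Use the task attempt ID so that files written by different map tasks do not collide
taskId = context.getTaskAttemptID().toString();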
}
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
Path httpDest = new Path(jobOutDir, taskId + "_http_" + (file++));
InputStream is = null;
OutputStream os = null;
URLConnection connection;
try {
connection = new URL(value.toString()).openConnection();
//implement connection timeout logics
//authenticate.. etc
is = connection.getInputStream();
os = FileSystem.getLocal(context.getConfiguration()).create(httpDest,true);
IOUtils.copyBytes(is, os, context.getConfiguration(), true);
} catch(Throwable t){
t.printStackTrace();
}finally {
IOUtils.closeStream(is);
IOUtils.closeStream(os);
}
context.write(value, null);
//context.write(new Text (httpDest.getName()), new Text (os.toString()));
}
}
Mapper Only Job:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class HadoopHttpClientJob {
private static final String data_input_directory = "YOUR_INPUT_DIR";
private static final String data_output_directory = "YOUR_OUTPUT_DIR";
public HadoopHttpClientJob() {
}
public static void main(String... args) {
try {
Configuration conf = new Configuration();
Path test_data_in = new Path(data_input_directory, "urls.txt");
Path test_data_out = new Path(data_output_directory);
@SuppressWarnings("deprecation")
Job job = new Job(conf, "HadoopHttpClientMap" + System.currentTimeMillis());
job.setJarByClass(HadoopHttpClientJob.class);
FileSystem fs = FileSystem.get(conf);
fs.delete(test_data_out, true);
job.setMapperClass(HadoopHttpClientMap.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, test_data_in);
FileOutputFormat.setOutputPath(job, test_data_out);
job.waitForCompletion(true);
}catch (Throwable t){
t.printStackTrace();
}
}
}
Step 2: Create an external table in Hive based on the HDFS directory. Remember to use a Hive SerDe for the JSON data (in your case); then you can copy the data from the external table into managed master tables. This is the step where you implement your incremental logic, compression, and so on.
Step 3: Point your Hive queries (which you might have already created) to the master table to implement your business needs.
Note: If you are referring to real-time analysis or a streaming API, you might have to change your application's architecture. Since you have asked an architectural question, I am using my best educated guess to support you. Please go through this once. If you feel you can implement this in your application, you can then ask specific questions and I will try my best to address them.

Setting number of Reduce tasks using command line

I am a beginner in Hadoop. When I try to set the number of reducers on the command line using the Generic Options Parser, the number of reducers does not change. No property for the number of reducers is set in the configuration file "mapred-site.xml", which I think makes the number of reducers 1 by default. I am using Cloudera QuickVM and Hadoop version "Hadoop 2.5.0-cdh5.2.0".
Pointers appreciated. I would also like to know the order of precedence among the ways of setting the number of reducers:
1. Using the configuration file "mapred-site.xml":
mapred.reduce.tasks
2. By specifying it in the driver class:
job.setNumReduceTasks(4)
3. By specifying it at the command line using the Tool interface:
-Dmapreduce.job.reduces=2
Mapper :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
//Split the line into words
for(String word: line.split("\\W+"))
{
//Make sure that the word is legitimate
if(word.length() > 0)
{
//Emit the word as you see it
context.write(new Text(word), new IntWritable(1));
}
}
}
}
Reducer :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
//Initializing the word count to 0 for every key
int count=0;
for(IntWritable value: values)
{
//Adding the word count counter to count
count += value.get();
}
//Finally write the word and its count
context.write(key, new IntWritable(count));
}
}
Driver :
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool
{
public int run(String[] args) throws Exception
{
//Instantiate the job object for configuring your job
Job job = new Job();
//Specify the class that hadoop needs to look in the JAR file
//This Jar file is then sent to all the machines in the cluster
job.setJarByClass(WordCount.class);
//Set a meaningful name to the job
job.setJobName("Word Count");
//Add the path from where the file input is to be taken
FileInputFormat.addInputPath(job, new Path(args[0]));
//Set the path where the output must be stored
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//Set the Mapper and the Reducer class
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
//Set the type of the key and value of Mapper and reducer
/*
* If the Mapper output type and Reducer output type are not the same then
* also include setMapOutputKeyClass() and setMapOutputValueClass()
*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//job.setNumReduceTasks(4);
//Start the job and wait for it to finish. And exit the program based on
//the success of the program
System.exit(job.waitForCompletion(true)?0:1);
return 0;
}
public static void main(String[] args) throws Exception
{
// Let ToolRunner handle generic command-line options
int res = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(res);
}
}
And I have tried the following commands to run the job :
hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -Dmapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take13
and
hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -D mapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take14
To answer your question about the order: it would always be 2 > 3 > 1.
The option specified in your driver class takes precedence over the one you pass as an argument through GenericOptionsParser or the one you specify in your site-specific config.
I would recommend debugging the configuration inside your driver class by printing it out before you submit the job. That way, you can be sure what the configuration is right before you submit the job to the cluster.
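Configuration conf = getConf(); // available because the driver extends Configured
for (java.util.Map.Entry<String, String> entry : conf) {
    // Print every effective key/value pair
    System.out.println(entry.getKey() + " = " + entry.getValue());
}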
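The precedence described above follows from the order in which the value is written: the site file is loaded first, the `-D` option is applied on top of it, and a call in the driver overwrites both. The following is a rough JDK-only model of that "last write wins" behaviour. It does not use Hadoop's `Configuration` or `Job` classes; `ReducerPrecedenceSketch` and `resolve` are hypothetical names, and a `null` argument means that the source did not set the value.

```java
import java.util.Properties;

public class ReducerPrecedenceSketch {

    static final String KEY = "mapreduce.job.reduces";

    /**
     * Resolves the reducer count by applying the three sources in the order
     * they are written: site file, then -D option, then driver code.
     */
    public static int resolve(Integer siteFile, Integer commandLine, Integer driverCode) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "1"); // default when nothing else is set
        if (siteFile != null) {
            conf.setProperty(KEY, siteFile.toString());
        }
        if (commandLine != null) {
            conf.setProperty(KEY, commandLine.toString());
        }
        if (driverCode != null) {
            conf.setProperty(KEY, driverCode.toString());
        }
        return Integer.parseInt(conf.getProperty(KEY));
    }

    public static void main(String[] args) {
        System.out.println(resolve(3, 2, 4));       // all three set: driver code wins
        System.out.println(resolve(3, 2, null));    // no driver call: -D wins over the site file
        System.out.println(resolve(3, null, null)); // only the site file is set
    }
}
```

This is why leaving `job.setNumReduceTasks(...)` commented out in the driver is what allows a `-D` value to take effect at all.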
