Hive UDTF - unable to fetch field names via getFieldName, it is returning _col0,_col1 - hadoop

I am working on a Hive UDTF that transposes each row around a primary key. As part of the requirement, I need to associate the column name and its corresponding data with the key.
For example:
Source data:
Customer_id Customer_name Customer_type
1000000 ABCD Individual
The Hive UDTF should convert the data into the following:
Key att_name att_val
10000000 customer_name ABCD
10000000 customer_type Individual
I have written the UDTF and it works, but it currently produces the data below:
Key att_name att_val
10000000 _col0 ABCD
10000000 _col1 Individual
Here is the code, in which ((StructField) inputFields.get(i)).getFieldName() returns _col0 instead of customer_name.
Could this be a defect in Apache Hive, or is there another mapping from _col0 to the actual schema that I should refer to?
public class transposeUDTF extends GenericUDTF {

    private Map<Integer, String> tableMap = new HashMap<Integer, String>();
    private MetadataListStructObjectInspector metadataDetails;

    @Override
    public StructObjectInspector initialize(StructObjectInspector args) throws UDFArgumentException {
        // capture the input column names so process() can emit them as attribute names
        List<? extends StructField> inputFields = args.getAllStructFieldRefs();
        args.getTypeName();
        for (int i = 0; i < inputFields.size(); ++i) {
            tableMap.put(i + 1, inputFields.get(i).getFieldName());
        }
        return super.initialize(args);
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        // output schema: key, attribute name, attribute value
        List<String> fieldNames = new ArrayList<String>(3);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(3);
        fieldNames.add("key");
        fieldNames.add("AttrName");
        fieldNames.add("AttrVal");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] record) throws HiveException {
        // emit one (key, attribute name, attribute value) row per non-key column
        ArrayList<Object[]> results = new ArrayList<Object[]>();
        for (int i = 1; i < record.length; ++i) {
            results.add(new Object[] {record[0], tableMap.get(i), record[i]});
        }
        Iterator<Object[]> it = results.iterator();
        while (it.hasNext()) {
            Object[] r = it.next();
            forward(r);
        }
    }

    @Override
    public void close() throws HiveException {
        // do nothing
    }
}

Related

I want to write a Hadoop MapReduce Join in Java

I'm completely new to the Hadoop framework and I want to write a "MapReduce" program (HadoopJoin.java) that joins two tables R and S on the x attribute. The structure of the two tables is:
R (tag : char, x : int, y : varchar(30))
and
S (tag : char, x : int, z : varchar(30))
For example, for table R we have:
r 10 r-10-0
r 11 r-11-0
r 12 r-12-0
r 21 r-21-0
And for table S:
s 11 s-11-0
s 21 s-41-0
s 21 s-41-1
s 12 s-31-0
s 11 s-31-1
The result should look like:
r 11 r-11-0 s 11 s-11-0
etc.
Can anyone help me, please?
It is difficult to describe a join in MapReduce to someone who is new to the framework, but here is a working implementation for your situation, and I definitely recommend reading section 9 of Hadoop: The Definitive Guide, 4th Edition, which describes how to implement joins in MapReduce very well.
First of all, you might consider using higher-level frameworks such as Pig, Hive, or Spark, because they provide join operations as a core part of their implementation.
Secondly, there are many ways to implement a join in MapReduce, depending on the nature of your data; these include the map-side join and the reduce-side join. In this answer I have implemented a reduce-side join:
Implementation:
First of all, we need two different mappers for the two different datasets. Note that in your case the same mapper could be used for both datasets, but in many situations you need different mappers for different datasets, so I have defined two mappers to make this solution more general.
I have used a TextPair key with two attributes: one is the key used to join the data, and the other is a tag that specifies which dataset the record belongs to. If the record belongs to the first dataset the tag is 0; otherwise it is 1.
I have implemented TextPair.FirstComparator to ensure that, for each join key, the record from the first dataset is the first one received by the reducer, and all the records from the second dataset with that key are received after it. This line of code does the trick for us:
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
So in the reducer, the first record we receive is the record from dataset1, and after that we receive the records from dataset2. The only thing left to do is write those records out.
Mapper for dataset1:
public class JoinDataSet1Mapper
        extends Mapper<LongWritable, Text, TextPair, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // data[1] is the join attribute x; tag "0" marks records from the first dataset
        String[] data = value.toString().split(" ");
        context.write(new TextPair(data[1], "0"), value);
    }
}
Mapper for DataSet2:
public class JoinDataSet2Mapper
        extends Mapper<LongWritable, Text, TextPair, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // data[1] is the join attribute x; tag "1" marks records from the second dataset
        String[] data = value.toString().split(" ");
        context.write(new TextPair(data[1], "1"), value);
    }
}
Reducer:
public class JoinReducer extends Reducer<TextPair, Text, NullWritable, Text> {

    public static class KeyPartitioner extends Partitioner<TextPair, Text> {
        @Override
        public int getPartition(TextPair key, Text value, int numPartitions) {
            // partition on the join key only, so both tags for a key reach the same reducer
            return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    @Override
    protected void reduce(TextPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Iterator<Text> iter = values.iterator();
        // the first value in the group is the dataset1 record (guaranteed by the sort order
        // and the grouping comparator); join it with every following dataset2 record
        Text stationName = new Text(iter.next());
        while (iter.hasNext()) {
            Text record = iter.next();
            Text outValue = new Text(stationName.toString() + "\t" + record.toString());
            context.write(NullWritable.get(), outValue);
        }
    }
}
Custom key:
public class TextPair implements WritableComparable<TextPair> {

    private Text first;
    private Text second;

    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }

    public static class FirstComparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        public FirstComparator() {
            super(TextPair.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            try {
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
                return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            if (a instanceof TextPair && b instanceof TextPair) {
                return ((TextPair) a).first.compareTo(((TextPair) b).first);
            }
            return super.compare(a, b);
        }
    }
}
JobDriver:
public class JoinJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "Join two DataSet");
        job.setJarByClass(getClass());
        Path ncdcInputPath = new Path(getConf().get("job.input1.path"));
        Path stationInputPath = new Path(getConf().get("job.input2.path"));
        Path outputPath = new Path(getConf().get("job.output.path"));
        MultipleInputs.addInputPath(job, ncdcInputPath,
                TextInputFormat.class, JoinDataSet1Mapper.class);
        MultipleInputs.addInputPath(job, stationInputPath,
                TextInputFormat.class, JoinDataSet2Mapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setPartitionerClass(JoinReducer.KeyPartitioner.class);
        job.setGroupingComparatorClass(TextPair.FirstComparator.class);
        job.setMapOutputKeyClass(TextPair.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new JoinJob(), args);
        System.exit(exitCode);
    }
}
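Since the driver pulls its paths from configuration properties, it would typically be launched with something like hadoop jar join-job.jar JoinJob -D job.input1.path=/data/R -D job.input2.path=/data/S -D job.output.path=/data/out (the jar name and paths here are only placeholders); ToolRunner's option parsing turns the -D arguments into entries in the job configuration.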

Parsing Pipe delimited file and Storing data into DB using Spring/Java

I have a pipe-delimited file (Excel xlsx) that I need to parse for certain data. The data is all in column A. The first row has the date, the last row has the row count, and everything in between is the row data. I want to take the first three fields of each row, plus the date from the header, and store them in my H2 table. There is extra data in each row of my file. I need help creating code that will parse the file and insert it into my DB. I have an entity and some code written, but am stuck now.
My file
20200310|
Mn1223|w01192|windows|extra|extra|extra||
Sd1223|w02390|linux|extra|extra|extra||
2
My table
DROP TABLE IF EXISTS Xy_load;
CREATE TABLE Xy_load (
  account_name VARCHAR(250) NOT NULL,
  command_name VARCHAR(250) NOT NULL,
  system_name  VARCHAR(250) NOT NULL,
  CREATE_DT    DATE DEFAULT NULL
);
Entity class
public class ZyEntity {

    @Column(name = "account_name")
    private String accountName;

    @Column(name = "command_name")
    private String commandName;

    @Column(name = "system_name")
    private String systemName;

    @Column(name = "CREATE_DT")
    private int createDt;

    public ZyEntity(String accountName, String commandName, String systemName) {
        this.accountName = accountName;
        this.commandName = commandName;
        this.systemName = systemName;
    }

    public String getAccountName() {
        return accountName;
    }

    public void setAccountName(String accountName) {
        this.accountName = accountName;
    }

    public String getCommandName() {
        return commandName;
    }

    public void setCommandName(String commandName) {
        this.commandName = commandName;
    }

    public String getSystemName() {
        return systemName;
    }

    public void setSystemName(String systemName) {
        this.systemName = systemName;
    }

    public int getCreateDt() {
        return createDt;
    }

    public void setCreateDt(int createDt) {
        this.createDt = createDt;
    }
}
I was able to figure it out with some help:
List<ZyEntity> parseData(String filePath) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(filePath));
    // drop the date header and the trailing row count
    lines.remove(0);
    lines.remove(lines.size() - 1);
    return lines.stream()
            .map(s -> s.split("[|]"))
            .map(val -> new ZyEntity(val[0], val[1], val[2]))
            .collect(Collectors.toList());
}

public void insertZyData(List<ZyEntity> parseData) {
    String sql = "INSERT INTO Landing.load (account_name, command_name, system_name) " +
            "VALUES (:account_name, :command_name, :system_name)";
    for (ZyEntity zyInfo : parseData) {
        SqlParameterSource source = new MapSqlParameterSource("account_name", zyInfo.getAccountName())
                .addValue("command_name", zyInfo.getCommandName())
                .addValue("system_name", zyInfo.getSystemName());
        jdbcTemplate.update(sql, source);
    }
}
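If the CREATE_DT from the header row is also needed, a variation of parseData along these lines could capture it before the header and trailer lines are dropped. This is only a sketch; it assumes the header's first field is the yyyyMMdd date shown in the sample and that ZyEntity's int createDt is meant to hold it, and the INSERT statement would then also need a CREATE_DT column and parameter.
List<ZyEntity> parseDataWithDate(String filePath) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(filePath));
    // first field of the header row, e.g. "20200310"
    int headerDate = Integer.parseInt(lines.get(0).split("[|]")[0]);
    lines.remove(0);                  // drop the date header
    lines.remove(lines.size() - 1);   // drop the trailing row count
    return lines.stream()
            .map(s -> s.split("[|]"))
            .map(val -> {
                ZyEntity row = new ZyEntity(val[0], val[1], val[2]);
                row.setCreateDt(headerDate);
                return row;
            })
            .collect(Collectors.toList());
}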

FlatFileItemWriterBuilder-headerCallback() get number of rows written

Is it possible to get the total number of rows written from FlatFileItemWriter.headerCallback()?
I am a Spring Batch newbie, and I have looked at "putting count of lines into header of flat file" and "Spring Batch - Counting Processed Rows".
However, I can't seem to implement the logic using the advice given there. It makes sense that the write count will only be available after the file has been processed, but I am trying to get the row count just before the file is officially written.
I tried to look for a hook like @AfterStep to grab the total rows, but I keep going in circles.
@Bean
@StepScope
public FlatFileItemWriter<MyFile> generateMyFileWriter(Long jobId, Date eventDate) {
    String filePath = "C:\\MYFILE\\COMPLETED";
    Resource file = new FileSystemResource(filePath);
    DelimitedLineAggregator<MyFile> myFileLineAggregator = new DelimitedLineAggregator<>();
    myFileLineAggregator.setDelimiter(",");
    myFileLineAggregator.setFieldExtractor(getMyFileFieldExtractor());
    return new FlatFileItemWriterBuilder<MyFile>()
            .name("my-file-writer")
            .resource(file)
            .headerCallback(new MyFileHeaderWriter(file.getFilename()))
            .lineAggregator(myFileLineAggregator)
            .build();
}

private FieldExtractor<MyFile> getMyFileFieldExtractor() {
    final String[] fieldNames = new String[]{
            "typeRecord",
            "idSystem"
    };
    return item -> {
        BeanWrapperFieldExtractor<MyFile> extractor = new BeanWrapperFieldExtractor<>();
        extractor.setNames(fieldNames);
        return extractor.extract(item);
    };
}
Notice I am using the MyFileHeaderWriter class (below) in the headerCallback(new MyFileHeaderWriter(file.getFilename())) call above. I am trying to initialize the value of qtyRecordsCreated below.
class MyFileHeaderWriter implements FlatFileHeaderCallback {

    private final String header;
    private String dtxCreated;
    private String tmxCreated;
    private String fileName;          // 15-byte file name
    private String qtyRecordsCreated; // number of rows in the file, including the header row

    MyFileHeaderWriter(String sbfFileName) {
        SimpleDateFormat dateCreated = new SimpleDateFormat("yyDDD");
        SimpleDateFormat timeCreated = new SimpleDateFormat("HHmm");
        Date now = new Date();
        this.dtxCreated = dateCreated.format(now);
        this.tmxCreated = timeCreated.format(now);
        this.fileName = sbfFileName;
        this.qtyRecordsCreated = "";
        String[] headerValues = {dtxCreated, tmxCreated, fileName, qtyRecordsCreated};
        this.header = String.join(",", headerValues);
    }

    @Override
    public void writeHeader(Writer writer) throws IOException {
        writer.write(header);
    }
}
How can I get the number of rows in the header row?
Can FlatFileFooterCallback be used to fetch the number of rows and then update the header with the number of rows in the file afterwards?
You can achieve this in an ItemProcessor; try this, it worked for me:
public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {

    @Override
    public Employee process(Employee employee) throws Exception {
        return employee;
    }

    @AfterStep
    public void afterStep(StepExecution stepExecution) {
        ExecutionContext stepContext = stepExecution.getExecutionContext();
        stepContext.put("count", stepExecution.getReadCount());
        System.out.println("COUNT" + stepExecution.getReadCount());
    }
}
And in your writer, to get the value:
int count = stepContext.getInt("count");
Hope this works for you.
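On the follow-up question: a FlatFileFooterCallback is the natural place to emit the final count, because it runs when the writer is closed, after all items have been written. Below is a minimal sketch (the class name and trailer format are my own illustration, not from the code above) that also implements StepExecutionListener so it can read the write count from the StepExecution. Note that it writes the count as a trailer line rather than into the header, since the header is written before any rows exist.
public class MyFileFooterWriter implements FlatFileFooterCallback, StepExecutionListener {

    private StepExecution stepExecution;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return stepExecution.getExitStatus();
    }

    @Override
    public void writeFooter(Writer writer) throws IOException {
        // by the time the writer is closed, getWriteCount() holds the number of items written
        writer.write("TRAILER," + stepExecution.getWriteCount());
    }
}
It would be registered with .footerCallback(...) on the FlatFileItemWriterBuilder and as a listener on the step. Getting the count into the header itself would still require rewriting the header in a later step, because the header is flushed before the first chunk is written.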

Part of key changes when iterating through values when using composite key - Hadoop

I have implemented secondary sort on Hadoop, and I don't really understand the behavior of the framework.
I have created a composite key which contains the original key and the part of the value that is used for sorting.
To achieve this I have implemented my own partitioner:
public class CustomPartitioner extends Partitioner<CoupleAsKey, LongWritable> {

    @Override
    public int getPartition(CoupleAsKey couple, LongWritable value, int numPartitions) {
        return Long.hashCode(couple.getKey1()) % numPartitions;
    }
}
My own group comparator:
public class GroupComparator extends WritableComparator {

    protected GroupComparator() {
        super(CoupleAsKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CoupleAsKey c1 = (CoupleAsKey) w1;
        CoupleAsKey c2 = (CoupleAsKey) w2;
        return Long.compare(c1.getKey1(), c2.getKey1());
    }
}
And I defined the couple in the following way:
public class CoupleAsKey implements WritableComparable<CoupleAsKey> {

    private long key1;
    private long key2;

    public CoupleAsKey() {
    }

    public CoupleAsKey(long key1, long key2) {
        this.key1 = key1;
        this.key2 = key2;
    }

    public long getKey1() {
        return key1;
    }

    public void setKey1(long key1) {
        this.key1 = key1;
    }

    public long getKey2() {
        return key2;
    }

    public void setKey2(long key2) {
        this.key2 = key2;
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeLong(key1);
        output.writeLong(key2);
    }

    @Override
    public void readFields(DataInput input) throws IOException {
        key1 = input.readLong();
        key2 = input.readLong();
    }

    @Override
    public int compareTo(CoupleAsKey o2) {
        int cmp = Long.compare(key1, o2.getKey1());
        if (cmp != 0)
            return cmp;
        return Long.compare(key2, o2.getKey2());
    }

    @Override
    public String toString() {
        return key1 + "," + key2 + ",";
    }
}
And here is the driver:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(SSDriver.class);
job.setMapperClass(SSMapper.class);
job.setReducerClass(SSReducer.class);
job.setMapOutputKeyClass(CoupleAsKey.class);
job.setMapOutputValueClass(LongWritable.class);
job.setPartitionerClass(CustomPartitioner.class);
job.setGroupingComparatorClass(GroupComparator.class);
FileInputFormat.addInputPath(job, new Path("/home/marko/WORK/Whirlpool/input.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/marko/WORK/Whirlpool/output"));
job.waitForCompletion(true);
Now, this works, but what is really strange is that while iterating in the reducer for a key, the second part of the key (the value part) changes on each iteration. Why, and how?
@Override
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
    for (LongWritable value : values) {
        // key.key2 changes during iterations, why?
        context.write(key, value);
    }
}
The definition says that "if you want all your relevant rows within a partition of data sent to a single reducer you must implement a grouping comparator". This only ensures that that set of keys is sent to a single reduce call; it does not mean the key changes from the composite key to something containing only the part of the key on which the grouping was done.
However, when you iterate over the values, the corresponding key also changes. We normally do not observe this, because by default the values are grouped on the same (non-composite) key, so even as the value changes, the value of the key remains the same.
You can try printing the object reference of the key, and you will notice that with every iteration the object reference of the key also changes, like this:
IntWritable#1235ft
IntWritable#6635gh
IntWritable#9804as
Alternatively, you can try applying a grouping comparator on an IntWritable in the following way (you will have to write your own logic to do so):
Group1:
1 a
1 b
2 c
Group2:
3 c
3 d
4 a
and you will see that with every iteration over the values, the key also changes.
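In practice, this means that if you need the composite key exactly as it was for a particular value, you have to copy it inside the loop, because the framework re-fills the same key object before each value. A minimal sketch of the reducer above doing that (the use of WritableUtils.clone here is an illustration, not part of the original question):
@Override
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
    for (LongWritable value : values) {
        // the framework re-populates 'key' (including key2) in step with each value,
        // so take a snapshot if the current composite key needs to be kept
        CoupleAsKey snapshot = WritableUtils.clone(key, context.getConfiguration());
        context.write(snapshot, value);
    }
}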

Shouldn't Hadoop group <key, (list of values)> in reducer only based on hashCode?

I decided to create my own WritableComparable class to learn how Hadoop works with it. So I created an Order class with two instance variables (orderNumber, cliente) and implemented the required methods. I also used the Eclipse generators for getters/setters/hashCode/equals/toString.
In compareTo, I decided to use only the orderNumber variable.
I created a simple MapReduce job just to count the occurrences of each order in a dataset. By mistake, one of my test records is Ita instead of Itá, as you can see here:
123 Ita
123 Itá
123 Itá
345 Carol
345 Carol
345 Carol
345 Carol
456 Iza Smith
As I understand it, the first record should be treated as a different order, because record 1's hashCode is different from the hashCodes of records 2 and 3.
But in the reduce phase the 3 records are grouped together, as you can see here:
Order [cliente=Ita, orderNumber=123] 3
Order [cliente=Carol, orderNumber=345] 4
Order [cliente=Iza Smith, orderNumber=456] 1
I thought there should be a line for the Itá records with a count of 2, and Ita should have a count of 1.
Since I used only orderNumber in compareTo, I tried using the String cliente in this method as well (commented out in the code below), and then it worked as I was expecting.
So, is that the expected result? Shouldn't Hadoop use only hashCode to group a key and its values?
Here is the Order class (I omitted the getters/setters):
public class Order implements WritableComparable<Order>
{
    private String cliente;
    private long orderNumber;

    @Override
    public void readFields(DataInput in) throws IOException
    {
        cliente = in.readUTF();
        orderNumber = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException
    {
        out.writeUTF(cliente);
        out.writeLong(orderNumber);
    }

    @Override
    public int compareTo(Order o) {
        long thisValue = this.orderNumber;
        long thatValue = o.orderNumber;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        //return this.cliente.compareTo(o.cliente);
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((cliente == null) ? 0 : cliente.hashCode());
        result = prime * result + (int) (orderNumber ^ (orderNumber >>> 32));
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Order other = (Order) obj;
        if (cliente == null) {
            if (other.cliente != null)
                return false;
        } else if (!cliente.equals(other.cliente))
            return false;
        if (orderNumber != other.orderNumber)
            return false;
        return true;
    }

    @Override
    public String toString() {
        return "Order [cliente=" + cliente + ", orderNumber=" + orderNumber + "]";
    }
}
Here is the MapReduce code:
public class TesteCustomClass extends Configured implements Tool
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Order, LongWritable>
    {
        LongWritable outputValue = new LongWritable();
        String[] campos;
        Order order = new Order();

        @Override
        public void configure(JobConf job)
        {
        }

        @Override
        public void map(LongWritable key, Text value, OutputCollector<Order, LongWritable> output, Reporter reporter) throws IOException
        {
            campos = value.toString().split("\t");
            order.setOrderNumber(Long.parseLong(campos[0]));
            order.setCliente(campos[1]);
            outputValue.set(1L);
            output.collect(order, outputValue);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Order, LongWritable, Order, LongWritable>
    {
        @Override
        public void reduce(Order key, Iterator<LongWritable> values, OutputCollector<Order, LongWritable> output, Reporter reporter) throws IOException
        {
            LongWritable value = new LongWritable(0);
            while (values.hasNext())
            {
                value.set(value.get() + values.next().get());
            }
            output.collect(key, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), TesteCustomClass.class);
        conf.setMapperClass(Map.class);
        // conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setJobName("Teste - Custom Classes");
        conf.setOutputKeyClass(Order.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TesteCustomClass(), args);
        System.exit(res);
    }
}
The default partitioner is the HashPartitioner, which uses the hashCode method to determine which reducer to send the K,V pair to.
Once in the reducer (or in a Combiner, which runs on the map side), the compareTo method is used to sort the keys, and it is then also used (by default) to decide whether sequential keys should be grouped together and their associated values reduced in the same iteration.
If you use only the orderNumber variable, and not the cliente variable, in your compareTo method, then any keys with the same orderNumber will have their values reduced together - regardless of the cliente value (which is what you are currently observing).
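If the goal is for Ita and Itá to end up in separate groups, the fix the question already hints at is to make compareTo consistent with equals()/hashCode() by comparing both fields; a minimal sketch:
@Override
public int compareTo(Order o) {
    // compare the order number first, then the client, so that reducer grouping
    // matches equals()/hashCode()
    int cmp = Long.compare(this.orderNumber, o.orderNumber);
    if (cmp != 0) {
        return cmp;
    }
    return this.cliente.compareTo(o.cliente);
}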
