Apache Flink: Skewed data distribution on KeyedStream

I have this Java code in Flink:
env.setParallelism(6);
//Read from Kafka topic with 12 partitions
DataStream<String> line = env.addSource(myConsumer);
//Filter half of the records
DataStream<Tuple2<String, Integer>> line_Num_Odd = line_Num.filter(new FilterOdd());
DataStream<Tuple3<String, String, Integer>> line_Num_Odd_2 = line_Num_Odd.map(new OddAdder());
//Filter the other half
DataStream<Tuple2<String, Integer>> line_Num_Even = line_Num.filter(new FilterEven());
DataStream<Tuple3<String, String, Integer>> line_Num_Even_2 = line_Num_Even.map(new EvenAdder());
//Join all the data again
DataStream<Tuple3<String, String, Integer>> line_Num_U = line_Num_Odd_2.union(line_Num_Even_2);
//Window
DataStream<Tuple3<String, String, Integer>> windowedLine_Num_U_K = line_Num_U
.keyBy(1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.reduce(new Reducer());
The problem is that the window should be able to process with parallelism = 2, as there are two different groups of data, keyed by "odd" and "even" in the second String of the Tuple3. Everything runs with parallelism 6 except the window, which runs with parallelism = 1, and I need it to have parallelism = 2 because of my requirements.
The functions used in the code are the following:
public static class FilterOdd implements FilterFunction<Tuple2<String, Integer>> {
public boolean filter(Tuple2<String, Integer> line) throws Exception {
Boolean isOdd = (Long.valueOf(line.f0.split(" ")[0]) % 2) != 0;
return isOdd;
}
};
public static class FilterEven implements FilterFunction<Tuple2<String, Integer>> {
public boolean filter(Tuple2<String, Integer> line) throws Exception {
Boolean isEven = (Long.valueOf(line.f0.split(" ")[0]) % 2) == 0;
return isEven;
}
};
public static class OddAdder implements MapFunction<Tuple2<String, Integer>, Tuple3<String, String, Integer>> {
public Tuple3<String, String, Integer> map(Tuple2<String, Integer> line) throws Exception {
Tuple3<String, String, Integer> newLine = new Tuple3<String, String, Integer>(line.f0, "odd", line.f1);
return newLine;
}
};
public static class EvenAdder implements MapFunction<Tuple2<String, Integer>, Tuple3<String, String, Integer>> {
public Tuple3<String, String, Integer> map(Tuple2<String, Integer> line) throws Exception {
Tuple3<String, String, Integer> newLine = new Tuple3<String, String, Integer>(line.f0, "even", line.f1);
return newLine;
}
};
public static class Reducer implements ReduceFunction<Tuple3<String, String, Integer>> {
public Tuple3<String, String, Integer> reduce(Tuple3<String, String, Integer> line1,
Tuple3<String, String, Integer> line2) throws Exception {
Long sum = Long.valueOf(line1.f0.split(" ")[0]) + Long.valueOf(line2.f0.split(" ")[0]);
Long sumTS = Long.valueOf(line1.f0.split(" ")[1]) + Long.valueOf(line2.f0.split(" ")[1]);
Tuple3<String, String, Integer> newLine = new Tuple3<String, String, Integer>(String.valueOf(sum) +
" " + String.valueOf(sumTS), line1.f1, line1.f2 + line2.f2);
return newLine;
}
};
Thanks for your help!
SOLUTION: I have changed the content of the keys from "odd" and "even" to "odd0000" and "even1111" and it is working properly now.

Keys are distributed to worker threads by hash partitioning. This means that the key values are hashed and the thread is determined by modulo #workers. With two keys and two threads there is a good chance that both keys are assigned to the same thread.
You can try to use different key values whose hash values distribute across both threads.
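As a quick way to sanity-check candidate key values, a sketch like the one below shows where they land under a simple hash-then-modulo model. This is only an approximation (Flink's actual key-group assignment uses its own murmur-based hashing, so the real assignment can differ), but it illustrates why two key values may end up on the same thread and why picking other values can spread them out:
import java.util.Arrays;

public class KeyBucketCheck {
    public static void main(String[] args) {
        int parallelism = 2;
        // Candidate key values: the original ones and the adjusted ones from the solution
        for (String key : Arrays.asList("odd", "even", "odd0000", "even1111")) {
            // Simplified model: hash the key and take it modulo the parallelism
            int bucket = Math.abs(key.hashCode() % parallelism);
            System.out.println(key + " -> bucket " + bucket);
        }
    }
}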

Related

Convert Integer to JSON

I'm playing around in Spring Boot and I want to count the number of users in my database.
UserRepository
public interface UserRepository extends JpaRepository<User, Long> {
@Query(value = "select COUNT(*) from user", nativeQuery = true)
Integer findAllActiveUsers();
}
UserService
public Integer amountOfUsersInDB() {
return userRepository.findAllActiveUsers();
}
UserController
@GetMapping("/myusers")
public ResponseEntity amountOfUsers() {
System.out.println(userService.amountOfUsersInDB());
return ResponseEntity.ok(userService.amountOfUsersInDB());
}
When I make the HTTP GET call it returns the number of users in my database as an Integer. How can I make it return the value as JSON so I can later display it on my frontend?
When you have something like this:
@GetMapping(value = "/count", produces = MediaType.APPLICATION_JSON_VALUE)
public ResponseEntity<Integer> getCount() {
Integer count = 1;
return ResponseEntity.ok(count);
}
You'll have the following response payload which, by the way, is valid JSON:
1
Now, if you want to produce a JSON object, then you could use a Map<String, Object> for representing the payload:
@GetMapping(value = "/count", produces = MediaType.APPLICATION_JSON_VALUE)
public ResponseEntity<Map<String, Object>> getCount() {
Integer count = 1;
Map<String, Object> payload = new HashMap<>();
payload.put("count", count);
return ResponseEntity.ok(payload);
}
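With this, the response body will be a JSON object along the lines of {"count": 1} instead of a bare number.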
Or you could define a class representing the payload, create an instance of that class, and assign a value to the count field:
@Data
public class CountPayload {
private Integer count;
}
@GetMapping(value = "/count", produces = MediaType.APPLICATION_JSON_VALUE)
public ResponseEntity<CountPayload> getCount() {
Integer count = 1;
CountPayload payload = new CountPayload();
payload.setCount(count);
return ResponseEntity.ok(payload);
}

java 8 java.util.function.Function downcast

The Java function below is assignable (f(Order) to f(Object)):
Function<Order, Order> orderProcessor = (Order order) -> {
System.out.println("Processing Order:"
return order;
};
Function f = orderProcessor;
The question is: how do I cast this function back to Function<Order, Order>?
Or better, I would like to cast this function back to Function<SomeInterface, SomeInterface>.
I am storing these functions in a List<Function>, but ideally I would like to store them in a List<Function<SomeInterface, SomeInterface>>.
Is it possible?
You can define a List that takes Functions with two generic parameters bounded by ? extends SomeInterface; then no casting is needed at all. For example:
List<Function<? extends SomeInterface, ? extends SomeInterface>> functions =
new ArrayList<>();
Function<Order, Order> orderProcessor = (Order order) -> {
System.out.println("Processing Order:" + order);
return order;
};
functions.add(orderProcessor);
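One caveat (a minimal sketch, assuming Order implements SomeInterface): once a function is read back out of this list, its wildcard type means the compiler will not let you apply it directly, so getting a Function<Order, Order> back still needs an unchecked cast that only the caller can justify:
Function<? extends SomeInterface, ? extends SomeInterface> f = functions.get(0);
// f.apply(new Order());  // does not compile: the capture of ? extends SomeInterface is unknown

@SuppressWarnings("unchecked")
Function<Order, Order> recovered = (Function<Order, Order>) f;  // unchecked cast
recovered.apply(new Order());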
If the result type is the same as the parameter type, you can also use a UnaryOperator instead, for example:
List<UnaryOperator<? extends SomeInterface>> functions = new ArrayList<>();
UnaryOperator<Order> orderProcessor = (Order order) -> {
System.out.println("Processing Order:" + order);
return order;
};
functions.add(orderProcessor);
Note that you can't pass an arbitrary SomeInterface to such an operator, since a UnaryOperator<? extends SomeInterface> may be parameterized with any subtype, not only the Order class.
If you want your functions to accept any SomeInterface (including all of its subtypes), you must declare the list in write mode with ? super SomeInterface, and then you must change the UnaryOperator<Order> definition accordingly, for example:
List<UnaryOperator<? super SomeInterface>> functions = new ArrayList<>();
UnaryOperator<SomeInterface> orderProcessor = (SomeInterface item) -> {
if(item instanceof Order){
System.out.println("Processing Order:" + order);
}
return item;
};
functions.add(orderProcessor);
Here is a solution that wraps a Map in a Processors class; then you don't need any cast or instanceof check in the UnaryOperator body, for example:
Processors processors = new Processors();
UnaryOperator<Order> orderProcessor = (order) -> {
// no need instanceof & casting expression here
System.out.println("Processing Order:" + order);
return order;
};
processors.add(Order.class, orderProcessor);
processors.listing(Order.class).forEach(it -> it.apply(new Order()));
class Processors {
private final Map<Class<?>, List<UnaryOperator<?>>> registry = new HashMap<>();
public <T extends SomeInterface> void add(Class<T> type,
UnaryOperator<T> processor) {
registry.computeIfAbsent(type, aClass -> new ArrayList<>()).add(processor);
}
@SuppressWarnings("unchecked")
public <T extends SomeInterface>
Stream<UnaryOperator<T>> listing(Class<T> type){
return (Stream<UnaryOperator<T>>) lookup(type).stream();
}
private List<?> lookup(Class<?> type) {
if (!SomeInterface.class.isAssignableFrom(type))
return Collections.emptyList();
if (!registry.containsKey(type))
// fall back to the superclass registration, or an empty list if there is none
return registry.getOrDefault(type.getSuperclass(), Collections.emptyList());
return registry.get(type);
}
}

Does default sorting in MapReduce use the Comparator defined in the WritableComparable class or the compareTo() method?

How does the sort happen in MapReduce before the output is passed from the mapper to the reducer? If my output key from the mapper is of type IntWritable, does it use the comparator defined in the IntWritable class or the compareTo() method in the class? If yes, how is the call made? If not, how is the sort performed, and how is the call made?
Map job outputs are first collected and then sent to the Partitioner, responsible for determining to which Reducer the data will be sent (it's not yet grouped by reduce() call though). The default Partitioner uses the hashCode() method of the Key and a modulo with the number of Reducers to do that.
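For reference, the default partitioner essentially boils down to the following (a paraphrase of Hadoop's HashPartitioner):
// Paraphrased from org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then take it modulo the reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}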
After that, the Comparator will be called to perform a sort on the Map outputs. Flow looks like that:
Collector --> Partitioner --> Spill --> Comparator --> Local Disk (HDFS) <-- MapOutputServlet
Each Reducer will then copy the data from the mapper that has been assigned to it by the partitioner, and pass it through a Grouper that will determine how records are grouped for a single Reducer function call:
MapOutputServlet --> Copy to Local Disk (HDFS) --> Group --> Reduce
Before a reduce() call, the records also go through a sorting phase that determines the order in which they arrive at the reducer. The sorter (a WritableComparator) will call the compareTo() method of the Key (from the WritableComparable interface).
To give you a better idea, here is how you would implement a basic compareTo(), grouper and sorter for a custom composite key:
public class CompositeKey implements WritableComparable<CompositeKey> {
IntWritable primaryField = new IntWritable();
IntWritable secondaryField = new IntWritable();
// A no-argument constructor is required so Hadoop can instantiate the key via reflection
public CompositeKey() {
}
public CompositeKey(IntWritable p, IntWritable s) {
this.primaryField.set(p.get());
this.secondaryField.set(s.get());
}
public void write(DataOutput out) throws IOException {
this.primaryField.write(out);
this.secondaryField.write(out);
}
public void readFields(DataInput in) throws IOException {
this.primaryField.readFields(in);
this.secondaryField.readFields(in);
}
// Getters used by compareTo() and by the comparators below
public IntWritable getPrimaryField() {
return primaryField;
}
public IntWritable getSecondaryField() {
return secondaryField;
}
// Called by the partitioner to group map outputs for the same reducer instance
// If the hash source is simple (a primitive type or so), a simple call to its hashCode() method is good enough
public int hashCode() {
return this.primaryField.hashCode();
}
@Override
public int compareTo(CompositeKey other) {
if (this.getPrimaryField().equals(other.getPrimaryField())) {
return this.getSecondaryField().compareTo(other.getSecondaryField());
} else {
return this.getPrimaryField().compareTo(other.getPrimaryField());
}
}
}
public class CompositeGroupingComparator extends WritableComparator {
public CompositeGroupingComparator() {
super(CompositeKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.getPrimaryField().compareTo(second.getPrimaryField());
}
}
public class CompositeSortingComparator extends WritableComparator {
public CompositeSortingComparator() {
super (CompositeKey.class, true);
}
@Override
public int compare (WritableComparable a, WritableComparable b){
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.compareTo(second);
}
}
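For completeness, here is roughly how these classes would be wired into the job driver (a sketch using the standard setters of the org.apache.hadoop.mapreduce.Job API):
Job job = Job.getInstance(new Configuration(), "composite key example");
job.setMapOutputKeyClass(CompositeKey.class);
// Controls the order in which keys reach each reducer
job.setSortComparatorClass(CompositeSortingComparator.class);
// Controls which keys end up in the same reduce() call
job.setGroupingComparatorClass(CompositeGroupingComparator.class);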
After the Mapper, the framework takes care of comparing for us for all the default data types like IntWritable, DoubleWritable, etc. But if you have a user-defined key type you need to implement the WritableComparable interface.
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
Note that hashCode() is frequently used in Hadoop to partition keys. It's important that your implementation of hashCode() returns the same result across different instances of the JVM. Note also that the default hashCode() implementation in Object does not satisfy this property.
Example:
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
// Some data
private int counter;
private long timestamp;
public void write(DataOutput out) throws IOException {
out.writeInt(counter);
out.writeLong(timestamp);
}
public void readFields(DataInput in) throws IOException {
counter = in.readInt();
timestamp = in.readLong();
}
public int compareTo(MyWritableComparable o) {
int thisValue = this.counter;
int thatValue = o.counter;
return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
}
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + counter;
result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
return result;
}
}
From: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/WritableComparable.html

Hibernate CompositeUserType mapping has wrong number of columns

I am new to Hibernate and I am writing a CompositeUserType. When I run the code I am getting the error:
property mapping has wrong number of columns
Please help me, what am I missing?
My CompositeUserType goes as follows
public class EncryptedAsStringType implements CompositeUserType {
@Override
public String[] getPropertyNames() {
return new String[] { "stockId", "stockCode", "stockName","stockDescription" };
}
@Override
public Type[] getPropertyTypes() {
//stockId, stockCode,stockName,modifiedDate
return new Type[] {
Hibernate.INTEGER, Hibernate.STRING, Hibernate.STRING,Hibernate.STRING
};
}
@Override
public Object getPropertyValue(final Object component, final int property)
throws HibernateException {
Object returnValue = null;
final Stock auditData = (Stock) component;
if (0 == property) {
returnValue = auditData.getStockId();
} else if (1 == property) {
returnValue = auditData.getStockCode();
} else if (2 == property) {
returnValue = auditData.getStockName();
} return returnValue;
}
@Override
public void setPropertyValue(final Object component, final int property,
final Object setValue) throws HibernateException {
final Stock auditData = (Stock) component;
}
@Override
public Object nullSafeGet(final ResultSet resultSet,
final String[] names,
final SessionImplementor paramSessionImplementor, final Object paramObject)
throws HibernateException, SQLException {
//owner here is of type TestUser or the actual owning Object
Stock auditData = null;
final Integer createdBy = resultSet.getInt(names[0]);
//Deferred check after first read
if (!resultSet.wasNull()) {
auditData = new Stock();
System.out.println(">>>>>>>>>>>>"+resultSet.getInt(names[1]));
System.out.println(">>>>>>>>>>>>"+resultSet.getString(names[2]));
System.out.println(">>>>>>>>>>>>"+resultSet.getString(names[3]));
System.out.println(">>>>>>>>>>>>"+resultSet.getString(names[4]));
}
return auditData;
}
@Override
public void nullSafeSet(final PreparedStatement preparedStatement,
final Object value, final int property,
final SessionImplementor sessionImplementor)
throws HibernateException, SQLException {
if (null == value) {
} else {
final Stock auditData = (Stock) value;
System.out.println("::::::::::::::::::::::::::::::::"+auditData.getStockCode());
System.out.println("::::::::::::::::::::::::::::::::"+auditData.getStockDescription());
System.out.println("::::::::::::::::::::::::::::::::"+auditData.getStockId());
System.out.println("::::::::::::::::::::::::::::::::"+auditData.getStatus());
}
}
My domain class Stock has five attributes (stockId, stockCode, stockName, status, stockDescription). I need to declare the field stockDescription as a composite field type.
private Integer stockId;
private String stockCode;
private String stockName;
private String status;
private String stockDescription;
//Constructors
#Column(name = "STOCK_CC", unique = true, nullable = false, length = 20)
#Type(type="com.mycheck.EncryptedAsStringType")
#Columns(columns = { #Column(name="STOCK_ID"),
#Column(name="STOCK_CODE"),
#Column(name="STOCK_NAME")
})
public String getStockDescription() {
return stockDescription;
}
}
When I try to execute an insert for Stock, I am getting the error: Error creating bean with name
'sessionFactory' defined in class path resource [spring/config/../database/Hibernate.xml]:
Invocation of init method failed. nested exception is org.hibernate.MappingException:
property mapping has wrong number of columns: com.stock.model.Stock.stockDescription type:
com.mycheck.EncryptedAsStringType
Where am I going wrong?
One can extract the answer from the code samples and the comments to the original question, but to save everyone some reading, I've compiled a quick summary.
If you declare a CompositeUserType that maps a type to n columns, you have to declare n columns in @Columns besides the @Type annotation. Example:
public class EncryptedAsStringType implements CompositeUserType {
@Override
public String[] getPropertyNames() {
return new String[] { "stockId", "stockCode", "stockName","stockDescription" };
}
// ...
}
This CompositeUserType maps to 4 separate columns, therefore 4 separate @Column annotations have to be declared:
@Type(type="com.mycheck.EncryptedAsStringType")
@Columns(columns = {
@Column(name="STOCK_ID"),
@Column(name="STOCK_CODE"),
@Column(name="STOCK_NAME"),
@Column(name="STOCK_DESCRIPTION")
})
public String getStockDescription() {
return stockDescription;
}
That's it and Hibernate is happy.
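A side note, not required for the column-count error itself: getPropertyValue() in the question only handles property indices 0 to 2, so with four declared property names it presumably also needs to return the fourth one. A sketch of what that could look like (getStockDescription() is the getter from the question's Stock class):
@Override
public Object getPropertyValue(final Object component, final int property)
        throws HibernateException {
    final Stock auditData = (Stock) component;
    switch (property) {
        case 0: return auditData.getStockId();
        case 1: return auditData.getStockCode();
        case 2: return auditData.getStockName();
        case 3: return auditData.getStockDescription();
        default: return null;
    }
}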

Can Hadoop mapper produce multiple keys in output?

Can a single Mapper class produce multiple key-value pairs (of same type) in a single run?
We output the key-value pair in the mapper like this:
context.write(key, value);
Here's a trimmed down (and exemplified) version of the Key:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class MyKey extends ObjectWritable implements WritableComparable<MyKey> {
public enum KeyType {
KeyType1,
KeyType2
}
private KeyType keyTupe;
private Long field1;
private Integer field2 = -1;
private String field3 = "";
public KeyType getKeyType() {
return keyTupe;
}
public void settKeyType(KeyType keyType) {
this.keyTupe = keyType;
}
public Long getField1() {
return field1;
}
public void setField1(Long field1) {
this.field1 = field1;
}
public Integer getField2() {
return field2;
}
public void setField2(Integer field2) {
this.field2 = field2;
}
public String getField3() {
return field3;
}
public void setField3(String field3) {
this.field3 = field3;
}
@Override
public void readFields(DataInput datainput) throws IOException {
keyTupe = KeyType.valueOf(datainput.readUTF());
field1 = datainput.readLong();
field2 = datainput.readInt();
field3 = datainput.readUTF();
}
@Override
public void write(DataOutput dataoutput) throws IOException {
dataoutput.writeUTF(keyTupe.toString());
dataoutput.writeLong(field1);
dataoutput.writeInt(field2);
dataoutput.writeUTF(field3);
}
@Override
public int compareTo(MyKey other) {
if (getKeyType().compareTo(other.getKeyType()) != 0) {
return getKeyType().compareTo(other.getKeyType());
} else if (getField1().compareTo(other.getField1()) != 0) {
return getField1().compareTo(other.getField1());
} else if (getField2().compareTo(other.getField2()) != 0) {
return getField2().compareTo(other.getField2());
} else if (getField3().compareTo(other.getField3()) != 0) {
return getField3().compareTo(other.getField3());
} else {
return 0;
}
}
public static class MyKeyComparator extends WritableComparator {
public MyKeyComparator() {
super(MyKey.class);
}
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
return compareBytes(b1, s1, l1, b2, s2, l2);
}
}
static { // register this comparator
WritableComparator.define(MyKey.class, new MyKeyComparator());
}
}
And this is how we tried to output both keys in the Mapper:
MyKey key1 = new MyKey();
key1.settKeyType(KeyType.KeyType1);
key1.setField1(1L);
key1.setField2(23);
MyKey key2 = new MyKey();
key2.settKeyType(KeyType.KeyType2);
key2.setField1(1L);
key2.setField3("abc");
context.write(key1, value1);
context.write(key2, value2);
Our job's output format class is: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
I'm stating this because in other output format classes I've seen the output not being appended but just committed in their implementation of the write method.
Also, we are using the following classes for Mapper and Context:
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Context
Writing to the context multiple times in one map task is perfectly fine.
However, you may have several problems with your key class. Whenever you implement WritableComparable for a key, you should also implement equals(Object) and hashCode() methods. These aren't part of the WritableComparable interface, since they are defined in Object, but you must provide implementations.
The default partitioner uses the hashCode() method to decide which reducer each key/value pair goes to. If you don't provide a sane implementation, you can get strange results.
As a rule of thumb, whenever you implement hashCode() or any sort of comparison method, you should provide an equals(Object) method as well. You will have to make sure it accepts an Object as the parameter, as this is how it is defined in the Object class (whose implementation you are probably overriding).
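For illustration, a minimal sketch of what hashCode() and equals(Object) could look like for the MyKey class above (the field choice is an assumption; include whichever fields define key identity, and the fields are assumed non-null once the key is populated):
@Override
public int hashCode() {
    // ordinal() rather than Enum.hashCode(), which is identity-based and not stable across JVMs
    int result = keyTupe.ordinal();
    result = 31 * result + field1.hashCode();
    result = 31 * result + field2.hashCode();
    result = 31 * result + field3.hashCode();
    return result;
}

@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof MyKey)) return false;
    MyKey other = (MyKey) obj;
    return keyTupe == other.keyTupe
            && field1.equals(other.field1)
            && field2.equals(other.field2)
            && field3.equals(other.field3);
}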
