Custom Partitioner : N number of keys to N different files - hadoop

My requirement is to write a custom partitioner. I have N keys coming from the mapper, for example ('jsa', 'msa', 'jbac'). The length is not fixed; a key can in fact be any word. My requirement is to write a custom partitioner in such a way that it collects all the data for the same key into the same file. The number of keys is not fixed. Thank you in advance.
Thanks,
Sathish.

So you have multiple keys which the mapper is outputting, and you want a different reducer for each key and a separate file for each key.
First of all, writing a Partitioner can be a way to achieve that. By default Hadoop has its own internal logic that it applies to keys, and depending on that it calls reducers. So if you want to write a custom partitioner, you have to override that default behaviour with your own logic/algorithm. Unless you know exactly how your keys will vary, this logic won't be generic, and you have to figure out the logic based on the variations.
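For reference, Hadoop's default partitioner (HashPartitioner) does essentially the following, so a custom partitioner only needs to replace this logic (a simplified sketch of the built-in class, shown here only to illustrate the default behaviour):

import org.apache.hadoop.mapreduce.Partitioner;

// Simplified sketch of Hadoop's default HashPartitioner
// (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner).
public class DefaultHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the result always lies in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}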
I am providing a sample example here that you can refer to, but it is not generic.
public class CustomPartitioner extends Partitioner<Text, Text>
{
    // Note: partition numbers must lie in [0, numReduceTasks), so this assumes
    // the job is configured with at least 8 reduce tasks.
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        if (key.toString().contains("Key1")) {
            return 1;
        } else if (key.toString().contains("Key2")) {
            return 2;
        } else if (key.toString().contains("Key3")) {
            return 3;
        } else if (key.toString().contains("Key4")) {
            return 4;
        } else if (key.toString().contains("Key5")) {
            return 5;
        } else {
            return 7;
        }
    }
}
This should solve your problem. Just replace Key1, Key2, etc. with your key names...
In case you don't know the key names, you can write your own logic by referring to the following:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        return (key.toString().charAt(0)) % numReduceTasks;
    }
}
The above partitioner is just to illustrate how you can write your own logic. I have shown that if you take some property of the key (here its first character) and do a % operation with the number of reducers, you get a number between 0 and the number of reducers, so different reducers get called and write output to different files. But with this approach you have to make sure that two different keys do not end up with the same partition number.
This was about the customized partitioner.
Another solution is to override the MultipleOutputFormat class methods, which enables you to do the job in a generic way. With this approach you will also be able to generate customized file names for the reducer output files in HDFS.
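For illustration, a minimal sketch of that approach, assuming the older org.apache.hadoop.mapred API (where MultipleTextOutputFormat lives) and a directory-per-key naming scheme chosen just for this example:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: route each key's output under a directory named after the key.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // e.g. key "jsa" with default part name "part-00000" -> "jsa/part-00000"
        return key.toString() + "/" + name;
    }
}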
NOTE: Make sure you use consistent libraries. Don't mix the mapred and mapreduce libraries: org.apache.hadoop.mapred is the older API and org.apache.hadoop.mapreduce is the newer one.
Hope this will help.

I imagine the best way to do this, since it will give a more even split, would be:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        // Mask the sign bit so a negative hashCode cannot yield a negative partition number.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}


An algorithm to track the status of a number

Design an API with:
get(): it will return a random number; the number should not duplicate, meaning it is always unique.
put(randomValue): it will put back a random number previously generated by get(); once put back, get() can reuse this number as output.
It has to be efficient and should not use too many resources.
Is there any way to implement this algorithm? Using a HashMap is not recommended, because if this API serves billions of requests, saving the generated random numbers would still use too much space.
I could not work out this algorithm. Please help give a clue, thanks in advance!
I cannot think of any solution without extra space. With extra space, one option could be to use a TreeMap: first add all the elements to the TreeMap with the value false. When an element is handed out by get(), mark it as true. Similarly for put, change the value back to false.
Code snippet below...
import java.util.Random;
import java.util.TreeMap;

public class RandomNumber {
    public static final int SIZE = 100000;
    private static final Random rand = new Random();
    // Maps each candidate number to true if it is currently handed out, false otherwise.
    private static final TreeMap<Integer, Boolean> treeMap = new TreeMap<>();

    static {
        // Pre-populate every candidate number as "not in use".
        for (int i = 0; i < SIZE; i++) {
            treeMap.put(i, false);
        }
    }

    public static int getRandom() {
        while (true) {
            int random = rand.nextInt(SIZE);
            if (!treeMap.get(random)) {
                treeMap.put(random, true);
                return random;
            }
        }
    }

    public static void putRandom(int number) {
        treeMap.put(number, false);
    }
}
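A quick usage sketch of the class above (names as defined in the snippet):

int n = RandomNumber.getRandom();   // unique until it is put back
// ... use n ...
RandomNumber.putRandom(n);          // n may now be returned by getRandom() again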

How to handle duplicate messages using Kafka streaming DSL functions

My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic, using the Kafka Streams DSL API.
There is a possibility of the source system sending duplicate messages to the INPUT topic in case of any failures.
FLOW -
Source System --> INPUT Topic --> Kafka Streaming --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out of the payload, but flatMap is stateless, so it is not able to avoid duplicate message processing upon receiving from the INPUT topic.
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/values before sending to the OUTPUT topic.
I thought the exactly-once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it looks like it's not working; probably I did not understand the usage of exactly-once.
Could you please shed some light on it?
My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic, using the Kafka Streams DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
    // In this example, we assume that the record value as-is represents a unique event ID by
    // which we can perform de-duplication. If your records are different, adapt the extractor
    // function as needed.
    () -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
    storeName);
deduplicated.to(outputTopic);
and
/**
 * @param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
 *                                     ID), during the time of which any incoming duplicates of
 *                                     the event will be dropped, thereby de-duplicating the
 *                                     input.
 * @param idExtractor extracts a unique identifier from a record by which we de-duplicate input
 *                    records; if it returns null, the record will not be considered for
 *                    de-duping but forwarded as-is.
 */
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
    if (maintainDurationPerEventInMs < 1) {
        throw new IllegalArgumentException("maintain duration per event must be >= 1");
    }
    leftDurationMs = maintainDurationPerEventInMs / 2;
    rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
    this.idExtractor = idExtractor;
}

@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}

public KeyValue<K, V> transform(final K key, final V value) {
    final E eventId = idExtractor.apply(key, value);
    if (eventId == null) {
        return KeyValue.pair(key, value);
    } else {
        final KeyValue<K, V> output;
        if (isDuplicate(eventId)) {
            output = null;
            updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
        } else {
            output = KeyValue.pair(key, value);
            rememberNewEvent(eventId, context.timestamp());
        }
        return output;
    }
}

private boolean isDuplicate(final E eventId) {
    final long eventTime = context.timestamp();
    final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
        eventId,
        eventTime - leftDurationMs,
        eventTime + rightDurationMs);
    final boolean isDuplicate = timeIterator.hasNext();
    timeIterator.close();
    return isDuplicate;
}

private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
    eventIdStore.put(eventId, newTimestamp, newTimestamp);
}

private void rememberNewEvent(final E eventId, final long timestamp) {
    eventIdStore.put(eventId, timestamp, timestamp);
}

@Override
public void close() {
    // Note: The store should NOT be closed manually here via `eventIdStore.close()`!
    // The Kafka Streams API will automatically close stores when necessary.
}
}
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/values before sending to the OUTPUT topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
I thought the exactly-once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it looks like it's not working; probably I did not understand the usage of exactly-once.
As Matthias J. Sax mentioned in his answer, from Kafka's perspective these "duplicates" are not duplicates from the point of view of its exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out of the box for upstream data sources, which are a black box to Kafka.
Exactly-once can be used to ensure that consuming and processing an input topic does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates but two regular input messages.
To remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input record, you first check whether you find the corresponding key in the store. If not, you add it to the store and forward the message. If you do find it in the store, you drop the input as a duplicate. Note that this will only work with a 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try to deduplicate, Kafka Streams could re-introduce duplicates in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a punctuation to remove old data from the store if you are sure that no further duplicate can arrive in the input topic. One way to do this would be to store the record timestamp (or maybe the offset) in the store, too. This way, you can compare the current time with the stored record time within punctuate() and delete old records (i.e., you would iterate over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or you could also merge your flatMap() code into transform() directly).
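For illustration, here is a minimal sketch of that transform()-based de-duplication. It assumes a key-value store registered under the (hypothetical) name "dedup-store" has been added to the topology and that exactly-once processing is enabled; the class name, retention period, and punctuation interval are assumptions made for this example, not part of the original answer.

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch: drop records whose key was already seen within the retention period.
public class KeyDeduplicationTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private final long retentionMs;
    private ProcessorContext context;
    private KeyValueStore<K, Long> seenKeys;

    public KeyDeduplicationTransformer(final long retentionMs) {
        this.retentionMs = retentionMs;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        this.seenKeys = (KeyValueStore<K, Long>) context.getStateStore("dedup-store");
        // Punctuation: periodically purge entries older than the retention period.
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (final KeyValueIterator<K, Long> it = seenKeys.all()) {
                while (it.hasNext()) {
                    final KeyValue<K, Long> entry = it.next();
                    if (now - entry.value > retentionMs) {
                        seenKeys.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<K, V> transform(final K key, final V value) {
        if (seenKeys.get(key) != null) {
            return null; // key already seen: drop the duplicate
        }
        seenKeys.put(key, context.timestamp());
        return KeyValue.pair(key, value);
    }

    @Override
    public void close() { }
}

In this sketch the store would be created with Stores.keyValueStoreBuilder(...) and registered via StreamsBuilder#addStateStore, and the transformer wired in with stream.transform(() -> new KeyDeduplicationTransformer<>(retentionMs), "dedup-store") before the flatMap().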
It's achievable with the DSL only as well, using a SessionWindows changelog without caching.
Wrap the value with a duplicate flag
Turn the flag to true in reduce() within the time window
Filter out true flag values
Unwrap the original key and value
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
    .stream("input-topic", Consumed.with(keySerde, valueSerde))
    .mapValues(v -> new DedupValue<>(v, false))
    .groupByKey()
    .windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
    .reduce(
        (value1, value2) -> new DedupValue<>(value1.value(), true),
        Materialized
            .<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
            .withCachingDisabled()
    )
    .toStream()
    .filterNot((wk, dv) -> dv == null || dv.duplicate())
    .selectKey((wk, dv) -> wk.key())
    .mapValues(DedupValue::value)
    .to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {
    public DedupValueSerde(Serde<V> vSerde) {
        super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
    }

    private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
        @Override
        public byte[] serialize(String topic, DedupValue<V> data) {
            byte[] vBytes = vSerializer.serialize(topic, data.value());
            return ByteBuffer
                .allocate(vBytes.length + 1)
                .put(data.duplicate() ? (byte) 1 : (byte) 0)
                .put(vBytes)
                .array();
        }
    }

    private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
        @Override
        public DedupValue<V> deserialize(String topic, byte[] data) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            boolean duplicate = buffer.get() == (byte) 1;
            int remainingSize = buffer.remaining();
            byte[] vBytes = new byte[remainingSize];
            buffer.get(vBytes);
            V value = vDeserializer.deserialize(topic, vBytes);
            return new DedupValue<>(value, duplicate);
        }
    }
}

Building a multi-record flat file

I'm struggling to find a proper solution for generating a flat file.
Here are some criteria I need to take care of:
The file has a header with a summary of the records that follow it.
There could be multiple Collection Header Records, each with multiple Batch Header Records, which in turn contain multiple records of different types.
All records within a batch have a checksum which has to be added to a batch checksum. That one has to be added to the Collection Header checksum, and that again to the file checksum. Also, each entry in the file has a counter value.
So my plan was to create a class for each record type. But what now? I have the records and the "summary records"; the next step would be to bring them all into order, compute the sums and then set the counters.
How should I proceed from here? Should I put everything in a big SortedList? If so, how do I know where to add the latest record (it has to be added to its corresponding batch summary)?
My first idea was to do something like this:
SortedList<HeaderSummary, SortedList<BatchSummary, SortedList<string, object>>>();
But it is hard to navigate through the HeaderSummaries and BatchSummaries to add an object to the inner SortedList, bearing in mind that I may have to create and add a HeaderSummary / BatchSummary first.
Having several different lists, like one for headers, one for batches and one for the rest, gives me problems when combining them into a flat file because of the ordering and the counters that are yet to be set.
Do you have any clever solution for such a flat file?
Consider using classes to represent levels of your tree structure.
interface iBatch {
    int checksum { get; set; }
    void WriteRecord();
}

class BatchSummary {
    int batchChecksum;
    List<iBatch> records = new List<iBatch>();

    public void WriteBatch() {
        WriteBatchHeader();          // write the batch header/summary record (not shown)
        foreach (var record in records)
            record.WriteRecord();
    }

    public void Add(iBatch rec) {
        records.Add(rec);
    }
}

class CollectionSummary {
    int collectionChecksum;
    List<BatchSummary> batches = new List<BatchSummary>();

    public void WriteCollection() {
        WriteCollectionHeader();     // write the collection header/summary record (not shown)
        foreach (var batch in batches)
            batch.WriteBatch();
    }

    public void Add(int whichBatch, iBatch rec) {
        batches[whichBatch].Add(rec); // or however you find the appropriate batch
    }
}

class FileSummary {
    // ... file summary info
    int fileChecksum;
    List<CollectionSummary> collections = new List<CollectionSummary>();

    public void WriteFile() {
        WriteFileHeader();           // write the file header/summary record (not shown)
        foreach (var collection in collections)
            collection.WriteCollection();
    }

    public void Add(int whichCollection, int whichBatch, iBatch rec) {
        collections[whichCollection].Add(whichBatch, rec); // or however you find the appropriate collection
    }
}
Of course, you could use a common Summary class to be more DRY, if not necessarily more clear.

Use hashCode() for sorting objects in Java, not in a HashTable etc.

I need your help.
If I want to sort a PriorityQueue in Java, without reference to the elements' attributes, could I use the objects' hashCodes to compare them?
This is how I did it:
comp = new Comparator<Person>() {
    @Override
    public int compare(Person p1, Person p2) {
        if (p1.hashCode() < p2.hashCode()) return 1;
        if (p1.hashCode() == p2.hashCode()) return 0;
        return -1;
    }
};
collector = new PriorityQueue<Person>(comp);
It doesn't sound like a good approach.
Default hashCode() is typically implemented by converting the internal address of the object into an integer. So the order of objects will differ between application executions.
Also, two objects with the same set of attribute values will not return the same hashCode value unless you override the implementation, so equal-looking objects will not compare as equal. This effectively breaks the expectation that the comparison be consistent with equals.
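If the goal is just a deterministic ordering, comparing on an actual attribute is more reliable. A minimal sketch, assuming Person has a getName() accessor (a hypothetical name used only for illustration):

import java.util.Comparator;
import java.util.PriorityQueue;

// Order by a real attribute instead of hashCode(), so the ordering is stable
// across runs and equal attribute values compare as equal.
Comparator<Person> comp = Comparator.comparing(Person::getName);
PriorityQueue<Person> collector = new PriorityQueue<>(comp);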

Hadoop MultipleOutputFormat.generateFileNameForKeyValue with many keys

I am trying to play with MultipleOutputFormat.generateFileNameForKeyValue().
The idea is to create a directory for each of my keys.
This is the code:
static class MyMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String[] arr = key.toString().split("_");
        return arr[0] + "/" + name;
    }
}
This code works only if the emitted records are few. If I run the code against my real input, it just hangs on the reducer at around 70%.
What might be the problem here? It works on a small number of keys, but not on many.
