Can you please provide a sample code snippet to retrieve the cluster members of K-means?
The code below prints the cluster centers. I need the cluster members belonging to each center.
val clusters = KMeans.train(parsedData, numClusters, numIterations)
new VoidFunction<Vector>() {
public void call(Vector vector) throws Exception {
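In Spark MLlib the trained KMeansModel has a predict method that returns the cluster index for a vector, so pairing each input vector with its predicted index and grouping by that index should give the members of each cluster. The following is only a plain-Java sketch of that assignment step, without Spark; the class ClusterMembers and its methods are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark: groups points by their nearest center.
class ClusterMembers {

    // Index of the center with the smallest squared Euclidean distance to the point.
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Map from cluster index to the points assigned to that cluster.
    static Map<Integer, List<double[]>> membersByCluster(double[][] points, double[][] centers) {
        Map<Integer, List<double[]>> members = new TreeMap<>();
        for (double[] p : points) {
            members.computeIfAbsent(nearestCenter(p, centers), k -> new ArrayList<>()).add(p);
        }
        return members;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{1.0, 1.0}, {9.0, 9.5}, {0.5, -0.2}};
        membersByCluster(points, centers).forEach((cluster, pts) ->
            System.out.println("cluster " + cluster + " has " + pts.size() + " member(s)"));
    }
}
```

With an RDD the same idea would be a map to (predicted index, vector) pairs followed by a group-by on the index.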
I am running 3 Kafka Streams instances with exactly-once processing, but I am losing data when I restart one of the instances (while the other 2 re-balance).
If I restart the instance quickly, without the other 2 doing a re-balance, everything works as expected.
Input and output topics are created with 6 partitions.
Running 3 Kafka brokers.
Producing data with a single python producer in a loop (acks='all').
Outputting data to SQL with a single Kafka Connect configured with consumer.override.isolation.level=read_committed
I am expecting the aggregated data to have the same count as the output of my python loop. And this works just fine as long as Kafka Streams is not going into re-balance state.
In short the streams instance does:
Collects session data and updates a session state.
Delta updates on the session state are then re-partitioned and summed using a windowed aggregation.
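As a rough illustration of the two steps above, here is a plain-Java sketch with no Kafka Streams involved. The class DeltaAggregation, its two maps and the sample values are hypothetical; the maps only stand in for the session state store and for a single window of the aggregate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plain-Java model of the two steps; it does not use Kafka Streams.
class DeltaAggregation {
    // Step 1: last known value per session (stands in for the session state store).
    private final Map<String, Long> sessionState = new HashMap<>();
    // Step 2: running sum per media key (stands in for one window of the aggregate).
    private final Map<String, Long> aggregate = new HashMap<>();

    // Updates the session state and returns the delta against the previous value.
    long updateSession(String sessionId, long newValue) {
        Long previous = sessionState.put(sessionId, newValue);
        return newValue - (previous == null ? 0L : previous);
    }

    // Adds a delta to the aggregate for a key and returns the new total.
    long addDelta(String mediaKey, long delta) {
        return aggregate.merge(mediaKey, delta, Long::sum);
    }

    public static void main(String[] args) {
        DeltaAggregation agg = new DeltaAggregation();
        // First update for session X contributes its full value.
        System.out.println(agg.addDelta("media-1", agg.updateSession("X", 3)));
        // Replaying the same update contributes a delta of 0.
        System.out.println(agg.addDelta("media-1", agg.updateSession("X", 3)));
        // A later update contributes only the difference.
        System.out.println(agg.addDelta("media-1", agg.updateSession("X", 4)));
    }
}
```

In this model the sum is only correct while the session state and the aggregate stay consistent with each other, which is why a delta-based design is sensitive to how both pieces of state are restored.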
Grepping through my own debug output I'm inclined to believe the problem is related to transferring the aggregation state:
Record A which is an update to session X is adding 0 to the aggregation.
Output from the aggregation is now 6
Record B which is an update to session X is adding 1 to the aggregation.
Output from the aggregation is now 7
Update to session X (which may or may not be a replay of Record A) is adding 0 to the aggregation.
Output from the aggregation is now 6
Simplified and stripped out version of the code: (Not really a Java developer, so sorry for non-optimal syntax)
public static void main(String[] args) throws Exception {
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
final StoreBuilder<KeyValueStore<MediaKey, SessionState>> storeBuilder = Stores.keyValueStoreBuilder(
KStream<String, IncomingData> incomingData =
SESSION_TOPIC, Consumed.with(Serdes.String(), mediaDataSerde));
KGroupedStream<MediaKey, AggregatedData> mediaData = incomingData
.transform(new SessionProcessingSupplier(SESSION_STATE_STORE), SESSION_STATE_STORE)
KTable<Windowed<MediaKey>, AggregatedData> aggregatedMedia = mediaData
new Initializer<AggregatedData>() {...},
new Aggregator<MediaKey, AggregatedData, AggregatedData>() {
public AggregatedData apply(MediaKey key, AggregatedData input, AggregatedData aggregated) {
// ... Add stuff to "aggregated"
return aggregated;
Materialized.<MediaKey, AggregatedData, WindowStore<Bytes, byte[]>>as("aggregated-media")
.map(new KeyValueMapper<Windowed<MediaKey>, AggregatedData, KeyValue<MediaKey, PostgresOutput>>() {
public KeyValue<MediaKey, PostgresOutput> apply(Windowed<MediaKey> mediaidKey, AggregatedData data) {
// ... Some re-formatting and then
return new KeyValue<>(mediaidKey.key(), output);
.to(POSTGRES_TOPIC, Produced.with(mediaKeySerde, postgresSerde));
final Topology topology =;
final KafkaStreams streams = new KafkaStreams(topology, props);
// Shutdown hook
public class SessionProcessingSupplier implements TransformerSupplier<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> {
public Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> get() {
return new Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>>() {
public void init(ProcessorContext processorContext) {
this.context = processorContext;
this.stateStore = (KeyValueStore<String, Processing.SessionState>) context.getStateStore(sessionStateStoreName);
public KeyValue<String, Processing.AggregatedData> transform(String sessionid, Processing.IncomingData data) {
Processing.SessionState state = this.stateStore.get(sessionid);
// ... Update or create session state
return new KeyValue<String, Processing.AggregatedData>(sessionid, output);
My map looks like this:
private Map<String, LinkedHashSet<String>> map = new HashMap<>();
In the traditional approach I can add a value to the map with a key check, as below:
public void addEdge(String node1, String node2) {
    LinkedHashSet<String> adjacent = map.get(node1);
    if (adjacent == null) {
        adjacent = new LinkedHashSet<>();
        map.put(node1, adjacent);
    }
    adjacent.add(node2);
}
With Java 8 I can do something like this, and I get the same output:
map.compute(node1, (k, v) -> {
    if (v == null) {
        v = new LinkedHashSet<>();
    }
    v.add(node2);
    return v;
});
Is there a better way to do this with Java 8?
map.computeIfAbsent(node1, k -> new LinkedHashSet<>()).add(node2);
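If node1 is already found in the map, it will be equivalent to:
map.get(node1).add(node2);
If node1 is not already in the map, it will be equivalent to the following (written out in full, because map.put returns the previous value, not the new one):
LinkedHashSet<String> adjacent = new LinkedHashSet<>();
map.put(node1, adjacent);
adjacent.add(node2);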
This is exactly what you're looking for, and is even described as a use case in the documentation.
You can also use
map.merge(node1, new LinkedHashSet<>(), (v1, v2) -> v1 != null ? v1 : v2).add(node2);
and also
map.compute(node1, (k, v) -> v != null ? v : new LinkedHashSet<>()).add(node2);
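To compare the three idioms side by side, here is a small self-contained sketch. The class AdjacencyDemo, its method names and the sample edges are made up for illustration; only the Map methods themselves come from the JDK.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// Hypothetical demo class: three ways to add an edge to an adjacency map.
class AdjacencyDemo {

    // Creates the set only when the key is absent, then adds to it.
    static void addWithComputeIfAbsent(Map<String, LinkedHashSet<String>> map, String node1, String node2) {
        map.computeIfAbsent(node1, k -> new LinkedHashSet<>()).add(node2);
    }

    // Always allocates a fresh set, but keeps the existing one when the key is present.
    static void addWithMerge(Map<String, LinkedHashSet<String>> map, String node1, String node2) {
        map.merge(node1, new LinkedHashSet<>(), (oldSet, newSet) -> oldSet).add(node2);
    }

    // Recomputes the mapping on every call, reusing the existing set if there is one.
    static void addWithCompute(Map<String, LinkedHashSet<String>> map, String node1, String node2) {
        map.compute(node1, (k, v) -> v != null ? v : new LinkedHashSet<String>()).add(node2);
    }

    public static void main(String[] args) {
        String[][] edges = {{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b"}};
        Map<String, LinkedHashSet<String>> m1 = new HashMap<>();
        Map<String, LinkedHashSet<String>> m2 = new HashMap<>();
        Map<String, LinkedHashSet<String>> m3 = new HashMap<>();
        for (String[] e : edges) {
            addWithComputeIfAbsent(m1, e[0], e[1]);
            addWithMerge(m2, e[0], e[1]);
            addWithCompute(m3, e[0], e[1]);
        }
        System.out.println(m1.equals(m2) && m2.equals(m3));
    }
}
```

All three end up with the same map contents. The computeIfAbsent variant is the only one that allocates a new set solely when the key is missing.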
I have a dataset that consists of the classic UserID, ItemID and preference values; however, they are all strings.
I have managed to read the UserID and ItemID strings by overriding the readItemIDFromString() and readUserIDFromString() methods in the FileDataModel class (which is part of the Mahout library). However, there doesn't seem to be any support for converting the preference values, if I am not mistaken.
If anyone has input on how to approach this problem, I would greatly appreciate it.
To illustrate what I mean, here is an example of my UserID string "Conversion":
protected long readUserIDFromString(String value) {
if (memIdMigtr == null) {
memIdMigtr = new ItemMemIDMigrator();
long retValue = memIdMigtr.toLongID(value);
if (null == memIdMigtr.toStringID(retValue)) {
try {
} catch (TasteException e) {
return retValue;
String getUserIDAsString(long userId) {
return memIdMigtr.toStringID(userId);
And the implementation of the AbstractIDMigrator:
public class ItemMemIDMigrator extends AbstractIDMigrator {
private FastByIDMap<String> longToString;
public ItemMemIDMigrator() {
this.longToString = new FastByIDMap<String>(10000);
public void storeMapping(long longID, String stringID) {
longToString.put(longID, stringID);
public void singleInit(String stringID) throws TasteException {
storeMapping(toLongID(stringID), stringID);
public String toStringID(long longID) {
return longToString.get(longID);
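The idea behind such a migrator can be sketched without Mahout: hash the string to a long and remember the reverse mapping. The class StringIdMapper below is a hypothetical stand-in, and its hashing scheme (first 8 bytes of an MD5 digest) is an assumption chosen for the sketch; Mahout's own toLongID may hash differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an in-memory ID migrator; it does not use Mahout classes.
class StringIdMapper {
    private final Map<Long, String> longToString = new HashMap<>();

    // Hashes the string with MD5, packs the first 8 bytes into a long,
    // and stores the reverse mapping so the original string can be recovered.
    long toLongId(String stringId) {
        byte[] hash;
        try {
            hash = MessageDigest.getInstance("MD5")
                    .digest(stringId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        long id = 0L;
        for (int i = 0; i < 8; i++) {
            id = (id << 8) | (hash[i] & 0xFFL);
        }
        longToString.put(id, stringId);
        return id;
    }

    // Returns the original string, or null if the long ID was never stored.
    String toStringId(long longId) {
        return longToString.get(longId);
    }

    public static void main(String[] args) {
        StringIdMapper mapper = new StringIdMapper();
        long id = mapper.toLongId("user-42");
        System.out.println(id + " maps back to " + mapper.toStringId(id));
    }
}
```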
Mahout is deprecating the old recommenders based on Hadoop. We have a much more modern offering based on a new algorithm called Correlated Cross-Occurrence (CCO). It is built using Spark for 10x greater speed and gives real-time query results when combined with a query server.
This method ingests strings for user-id and item-id and produces results with the same ids, so you don't need to manage those anymore. You really should have a look at the new system; I'm not sure how long the old one will be supported.
Mahout docs are available for this method.
The entire system described, with SDK, input storage, training of the model and real-time queries, is part of the Apache PredictionIO project, which has docs for PIO and the "Universal Recommender".
I am doing a secondary sort in Hadoop 2.6.0, I am following this tutorial:
I have the exact same code, but now I am trying to improve performance so I have decided to add a combiner. I have added two modifications:
Main file:
Combiner file:
public class CombinerK extends Reducer<KeyWritable, KeyWritable, KeyWritable, KeyWritable> {
public void reduce(KeyWritable key, Iterator<KeyWritable> values, Context context) throws IOException, InterruptedException {
Iterator<KeyWritable> it = values;
System.err.println("combiner " + key);
KeyWritable first_value =;
System.err.println("va: " + first_value);
while (it.hasNext()) {
sum +=;
context.write(key, first_value);
But it seems that it is not run, because I can't find any log files containing the word "combiner". When I looked at the counters after the run, I saw:
Combine input records=4040000
Combine output records=4040000
The combiner seems to be executed, but it appears to be called once for each key, which would explain why the number of input records equals the number of output records.
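One possible explanation, offered as an assumption since the job setup is not shown: in the org.apache.hadoop.mapreduce API, Reducer.reduce takes an Iterable of values, and its default implementation writes every value through unchanged. A reduce method declared with Iterator is then an overload rather than an override, so the framework never calls it. That would match both the missing log lines and the identical combine input and output counts. The miniature below reproduces the effect in plain Java; MiniReducer and both combiner classes are hypothetical and are not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical miniature of a reducer base class; it is not Hadoop code.
class MiniReducer {
    final List<String> output = new ArrayList<>();

    // Default behaviour: pass every value through unchanged (identity).
    protected void reduce(String key, Iterable<Integer> values) {
        for (Integer v : values) {
            output.add(key + "=" + v);
        }
    }

    // The "framework" always calls the Iterable version.
    void run(String key, List<Integer> values) {
        reduce(key, values);
    }
}

// Declares reduce(String, Iterator): an overload, so run() never reaches it.
class IteratorCombiner extends MiniReducer {
    protected void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output.add(key + "=" + sum);
    }
}

// Declares reduce(String, Iterable) with @Override: this replaces the default.
class IterableCombiner extends MiniReducer {
    @Override
    protected void reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        output.add(key + "=" + sum);
    }

    public static void main(String[] args) {
        MiniReducer wrong = new IteratorCombiner();
        MiniReducer right = new IterableCombiner();
        wrong.run("k", Arrays.asList(1, 2, 3));
        right.run("k", Arrays.asList(1, 2, 3));
        System.out.println("Iterator version:  " + wrong.output);
        System.out.println("Iterable version:  " + right.output);
    }
}
```

Adding @Override to the Iterator version would turn the mistake into a compile error, which is a cheap way to check for it.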
I am not able to understand the use of the Tuple.getStringByField("ABC") in Apache Storm.
The following is the code:
public void execute(Tuple input) {
if (input.getSourceStreamId().equals("signals"))
String str = input.getStringByField("action");
if ("refresh".equals(str))
What exactly is input.getStringByField("action") doing here?
Thank you.
In Storm, both spouts and bolts emit tuples. The question is what each tuple contains. Each spout and bolt can use the method below to define the schema of the tuples it emits.
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    // tell storm the schema of the output tuple
    // tuple consists of columns called 'mycolumn1' and 'mycolumn2'
    outputFieldsDeclarer.declare(new Fields("mycolumn1", "mycolumn2"));
}
The subsequent bolt then can use getStringByField("mycolumn1") to retrieve the value based on column name.
getStringByField() is like getString(), except it looks up the field by its field name instead of its position.
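The relationship between the declared field names and the two lookup styles can be sketched in plain Java. SimpleTuple below is a hypothetical miniature, not Storm's Tuple class, and the field names "action" and "target" are sample values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of a tuple with a declared schema; it is not Storm's Tuple.
class SimpleTuple {
    private final List<String> fields;  // declared field names, in order
    private final List<Object> values;  // values, in the same order as the fields

    SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("fields and values must have the same size");
        }
        this.fields = fields;
        this.values = values;
    }

    // Positional lookup.
    String getString(int index) {
        return (String) values.get(index);
    }

    // Lookup by name: resolve the name to a position, then read by position.
    String getStringByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return getString(index);
    }

    public static void main(String[] args) {
        SimpleTuple tuple = new SimpleTuple(
                Arrays.asList("action", "target"),
                Arrays.<Object>asList("refresh", "cache"));
        System.out.println(tuple.getStringByField("action"));
        System.out.println(tuple.getString(1));
    }
}
```

Looking up by name keeps the receiving bolt independent of the order in which the emitting component declared its fields.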