Spark K-Means Retrieve Cluster Members - apache-spark-mllib

Can you please provide a sample code snippet to retrieve the cluster members of K-Means?
The code below prints the cluster centers. I need the cluster members belonging to each center.
val clusters = KMeans.train(parsedData, numClusters, numIterations)
clusters.clusterCenters.foreach(println)

parsedData.foreach(new VoidFunction<Vector>() {
    @Override
    public void call(Vector vector) throws Exception {
        System.out.println(clusters.predict(vector));
    }
});
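One way to collect the members themselves (a minimal sketch, assuming parsedData is a JavaRDD<Vector> and clusters is the trained KMeansModel from above): pair every point with the index of its predicted cluster, then group the points by that index.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.linalg.Vector;
import scala.Tuple2;

// Pair each point with its predicted cluster index, then group by that index.
JavaPairRDD<Integer, Iterable<Vector>> membersByCluster = parsedData
        .mapToPair(point -> new Tuple2<Integer, Vector>(clusters.predict(point), point))
        .groupByKey();

// Print the members of each cluster (collect() is only sensible for small datasets).
for (Tuple2<Integer, Iterable<Vector>> cluster : membersByCluster.collect()) {
    System.out.println("Cluster " + cluster._1() + ": " + cluster._2());
}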

Related

Kafka Streams exactly-once re-balance aggregation state data loss

I am running 3 Kafka Streams instances with exactly-once processing, but I am losing data when I restart one of the instances (the other 2 then re-balance).
If I restart the instance quickly (within session.timeout.ms), so that the other 2 do not re-balance, everything works as expected.
Input and output topics are created with 6 partitions.
Running 3 Kafka brokers.
Producing data with a single python producer in a loop (acks='all').
Outputting data to SQL with a single Kafka Connect configured with consumer.override.isolation.level=read_committed
I am expecting the aggregated data to have the same count as the output of my python loop, and this works just fine as long as Kafka Streams does not go into a re-balance.
In short, the streams instance does the following:
Collect session data and update a session state.
Delta updates on the session state are then re-partitioned and summed using windowed aggregation.
Grepping through my own debug output I'm inclined to believe the problem is related to transferring the aggregation state:
Record A, which is an update to session X, adds 0 to the aggregation.
Output from the aggregation is now 6.
Record B, which is an update to session X, adds 1 to the aggregation.
Output from the aggregation is now 7.
Re-balance.
An update to session X (which may or may not be a replay of Record A) adds 0 to the aggregation.
Output from the aggregation is now 6.
Simplified and stripped out version of the code: (Not really a Java developer, so sorry for non-optimal syntax)
public static void main(String[] args) throws Exception {
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
final StoreBuilder<KeyValueStore<MediaKey, SessionState>> storeBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(SESSION_STATE_STORE),
mediaKeySerde,
sessionStateSerde
);
builder.addStateStore(storeBuilder);
KStream<String, IncomingData> incomingData = builder.stream(
SESSION_TOPIC, Consumed.with(Serdes.String(), mediaDataSerde));
KGroupedStream<MediaKey, AggregatedData> mediaData = incomingData
.transform(new SessionProcessingSupplier(SESSION_STATE_STORE), SESSION_STATE_STORE)
.selectKey(...)
.groupByKey(...);
KTable<Windowed<MediaKey>, AggregatedData> aggregatedMedia = mediaData
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.aggregate(
new Initializer<AggregatedData>() {...},
new Aggregator<MediaKey, AggregatedData, AggregatedData>() {
@Override
public AggregatedData apply(MediaKey key, AggregatedData input, AggregatedData aggregated) {
// ... Add stuff to "aggregated"
return aggregated;
}
},
Materialized.<MediaKey, AggregatedData, WindowStore<Bytes, byte[]>>as("aggregated-media")
.withValueSerde(aggregatedDataSerde)
);
aggregatedMedia.toStream()
.map(new KeyValueMapper<Windowed<MediaKey>, AggregatedData, KeyValue<MediaKey, PostgresOutput>>() {
@Override
public KeyValue<MediaKey, PostgresOutput> apply(Windowed<MediaKey> mediaidKey, AggregatedData data) {
// ... Some re-formatting and then
return new KeyValue<>(mediaidKey.key(), output);
}
})
.to(POSTGRES_TOPIC, Produced.with(mediaKeySerde, postgresSerde));
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
// Shutdown hook
}
and:
public class SessionProcessingSupplier implements TransformerSupplier<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> {
@Override
public Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> get() {
return new Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>>() {
@Override
public void init(ProcessorContext processorContext) {
this.context = processorContext;
this.stateStore = (KeyValueStore<String, Processing.SessionState>) context.getStateStore(sessionStateStoreName);
}
@Override
public KeyValue<String, Processing.AggregatedData> transform(String sessionid, Processing.IncomingData data) {
Processing.SessionState state = this.stateStore.get(sessionid);
// ... Update or create session state
return new KeyValue<String, Processing.AggregatedData>(sessionid, output);
}
};
}
}

Java 8 map compute with set as value

My map looks like this
private Map<String, LinkedHashSet<String>> map = new HashMap<>();
With the traditional approach I can add a value to the map with a key check, as below:
public void addEdge(String node1, String node2) {
LinkedHashSet<String> adjacent = map.get(node1);
if (adjacent == null) {
adjacent = new LinkedHashSet<>();
map.put(node1, adjacent);
}
adjacent.add(node2);
}
With Java 8 I can do something like this, which gives the same output:
map.compute(node1, (k,v)-> {
if(v==null) {
v=new LinkedHashSet<>();
}
v.add(node2);
return v;
});
Is there any better way to do this with Java 8?
Use
map.computeIfAbsent(node1, k -> new LinkedHashSet<>()).add(node2);
If node1 is already found in the map, it will be equivalent to:
map.get(node1).add(node2);
If node1 is not already in the map, it will be equivalent to:
map.put(node1, new LinkedHashSet<>());
map.get(node1).add(node2);
This is exactly what you're looking for, and is even described as a use case in the documentation.
You can also use
map.merge(node1, new LinkedHashSet<>(), (v1, v2) -> v1 != null ? v1 : v2).add(node2);
and also
map.compute(node1, (k, v) -> v != null ? v : new LinkedHashSet<>()).add(node2);
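For completeness, here is addEdge rewritten with computeIfAbsent as a small self-contained example (the Graph class name and the sample edges are made up for illustration):
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

public class Graph {
    private final Map<String, LinkedHashSet<String>> map = new HashMap<>();

    public void addEdge(String node1, String node2) {
        // Creates the set on first use, then adds node2 to it in either case.
        map.computeIfAbsent(node1, k -> new LinkedHashSet<>()).add(node2);
    }

    public static void main(String[] args) {
        Graph g = new Graph();
        g.addEdge("a", "b");
        g.addEdge("a", "c");
        g.addEdge("b", "c");
        System.out.println(g.map); // e.g. {a=[b, c], b=[c]} (HashMap iteration order is not guaranteed)
    }
}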

Apache Mahout - Read preference value from String

I'm in a situation where I have a dataset that consists of the classical UserID, ItemID and preference values; however, they are all strings.
I have managed to read the UserID and ItemID strings by overriding the readItemIDFromString() and readUserIDFromString() methods in the FileDataModel class (which is part of the Mahout library); however, there doesn't seem to be any support for the conversion of preference values, if I am not mistaken.
If anyone has some input on what an approach to this problem could be, I would greatly appreciate it.
To illustrate what I mean, here is an example of my UserID string "conversion":
@Override
protected long readUserIDFromString(String value) {
if (memIdMigtr == null) {
memIdMigtr = new ItemMemIDMigrator();
}
long retValue = memIdMigtr.toLongID(value);
if (null == memIdMigtr.toStringID(retValue)) {
try {
memIdMigtr.singleInit(value);
} catch (TasteException e) {
e.printStackTrace();
}
}
return retValue;
}
String getUserIDAsString(long userId) {
return memIdMigtr.toStringID(userId);
}
And the implementation of the AbstractIDMigrator:
public class ItemMemIDMigrator extends AbstractIDMigrator {
private FastByIDMap<String> longToString;
public ItemMemIDMigrator() {
this.longToString = new FastByIDMap<String>(10000);
}
public void storeMapping(long longID, String stringID) {
longToString.put(longID, stringID);
}
public void singleInit(String stringID) throws TasteException {
storeMapping(toLongID(stringID), stringID);
}
public String toStringID(long longID) {
return longToString.get(longID);
}
}
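For the preference column itself there does not seem to be a comparable override hook (as noted in the question), so one workaround is to map the string preferences to numbers before the file reaches FileDataModel. This is only a hedged, hypothetical sketch; the label-to-score table and the comma-separated layout are assumptions about the data:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-processing step: rewrite "userID,itemID,label" lines into
// "userID,itemID,numericPreference" before the file is handed to FileDataModel.
public class PreferenceLabelMapper {

    // The labels and scores here are assumptions; adjust them to the actual data.
    private static final Map<String, Float> SCORES = new HashMap<>();
    static {
        SCORES.put("dislike", 1.0f);
        SCORES.put("neutral", 3.0f);
        SCORES.put("like", 5.0f);
    }

    public static void convert(File in, File out) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(in));
             PrintWriter writer = new PrintWriter(new FileWriter(out))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                // User and item IDs stay as strings and keep going through the ID migrator above.
                writer.println(parts[0] + "," + parts[1] + "," + SCORES.get(parts[2].trim()));
            }
        }
    }
}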
Mahout is deprecating the old recommenders based on Hadoop. We have a much more modern offering based on a new algorithm called Correlated Cross-Occurrence (CCO). It is built using Spark for 10x greater speed and gives real-time query results when combined with a query server.
This method ingests strings for user-id and item-id and produces results with the same ids, so you don't need to manage those anymore. You really should have a look at the new system; I'm not sure how long the old one will be supported.
Mahout docs here: http://mahout.apache.org/users/algorithms/recommender-overview.html and here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
The entire system described, with SDK, input storage, model training and real-time queries, is part of the Apache PredictionIO project; docs for PIO and the "Universal Recommender" are here: http://predictionio.incubator.apache.org/ and here: http://actionml.com/docs/ur

Hadoop KeyComposite and Combiner

I am doing a secondary sort in Hadoop 2.6.0, following this tutorial:
https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
I have the exact same code, but now I am trying to improve performance so I have decided to add a combiner. I have added two modifications:
Main file:
job.setCombinerClass(CombinerK.class);
Combiner file:
public class CombinerK extends Reducer<KeyWritable, KeyWritable, KeyWritable, KeyWritable> {
public void reduce(KeyWritable key, Iterator<KeyWritable> values, Context context) throws IOException, InterruptedException {
Iterator<KeyWritable> it = values;
System.err.println("combiner " + key);
KeyWritable first_value = it.next();
System.err.println("va: " + first_value);
while (it.hasNext()) {
sum += it.next().getSs();
}
first_value.setS(sum);
context.write(key, first_value);
}
}
But it seems that it is not run, because I can't find any log file that contains the word "combiner". When I looked at the counters after the run, I could see:
Combine input records=4040000
Combine output records=4040000
The combiner does seem to be executed, but it appears to receive a separate call for each record, and for this reason it has the same number of input records as output records.
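No answer is recorded here, but one likely explanation worth checking (a hedged guess): the reduce method above takes an Iterator, while org.apache.hadoop.mapreduce.Reducer.reduce expects an Iterable, so the method never overrides the base class and Hadoop runs the default identity reduce, which forwards every record unchanged and would produce exactly these counters. An @Override annotation would have made the compiler flag this. A sketch of the corrected signature, reusing the getSs()/setS() accessors from the code above (assumed to return/accept a numeric count):
import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerK extends Reducer<KeyWritable, KeyWritable, KeyWritable, KeyWritable> {

    @Override // with Iterable (not Iterator) this genuinely overrides Reducer.reduce
    public void reduce(KeyWritable key, Iterable<KeyWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        KeyWritable last = null;
        for (KeyWritable value : values) {
            sum += value.getSs();
            last = value; // note: Hadoop reuses this instance between iterations
        }
        if (last != null) {
            last.setS(sum); // carry the partial sum forward to the reducer
            context.write(key, last);
        }
    }
}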

what is use of Tuple.getStringByField("ABC") in Storm

I am not able to understand the use of the Tuple.getStringByField("ABC") in Apache Storm.
The following is the code:
public void execute(Tuple input) {
try {
if (input.getSourceStreamId().equals("signals"))
{
str = input.getStringByField("action");
if ("refresh".equals(str))
{....}
}
}...
What exactly is input.getStringByField("action") doing here?
Thank you.
In Storm, both spouts and bolts emit tuples, but the question is what is contained in each tuple. Each spout and bolt can use the method below to define the tuple schema.
@Override
public void declareOutputFields(
OutputFieldsDeclarer outputFieldsDeclarer)
{
// tell storm the schema of the output tuple
// tuple consists of columns called 'mycolumn1' and 'mycolumn2'
outputFieldsDeclarer.declare(new Fields("mycolumn1", "mycolumn2"));
}
A subsequent bolt can then use getStringByField("mycolumn1") to retrieve the value by column name.
getStringByField() is like getString(), except it looks up the field by its field name instead of its position.
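A minimal sketch of the emitting side (the class name and the sample values are made up; package names assume a 1.x+ org.apache.storm release, older versions use backtype.storm):
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class EmittingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // The positions in Values(...) line up with the field names declared below,
        // so a downstream bolt can call getStringByField("mycolumn1") or getString(0).
        collector.emit(new Values("valueForColumn1", "valueForColumn2"));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("mycolumn1", "mycolumn2"));
    }
}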
