MeterRegistry creating tags dynamically for gauge and updating data based on tag id for Prometheus - spring-boot

So I have a monitoring service that essentially monitors timestamps in my Kafka clusters. Each topic has some number of partitions, and in my tags I want to display the partition number as well as the topic name and cluster. Is there a way for me to create the tags dynamically, check whether a gauge with those tags already exists, and if it does, just update its value? If it does not exist, a new gauge with the appropriate tags should be created.
// Some pseudo-code in Java
Map<Integer, Integer> partitionMap = new HashMap<>(); // key is the partition, value is some arbitrary data
for (KafkaCluster kc : kafkaClusters) {
    for (Integer key : partitionMap.keySet()) {
        Tags someTags = Tags.of("topic", kc.topicName, "cluster", kc.clusterName, "partition", String.valueOf(key));
        if (/* a gauge named "name.of.query" with someTags is already registered */) {
            // update the existing gauge's value for this partition
            existingGaugeFor(someTags).set(partitionMap.get(key));
        } else {
            // otherwise register a new gauge with these tags
            AtomicLong myGauge = meterRegistry.gauge("name.of.query", someTags, new AtomicLong(partitionMap.get(key)));
        }
    }
}
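One common way to get this behavior, not from the original post but a sketch assuming Micrometer's MeterRegistry.gauge(...) and a hypothetical wrapper class, is to keep one AtomicLong per tag combination in a map, register it with the registry the first time that combination appears, and only call set() on later updates:
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper class, just to illustrate the pattern.
public class PartitionTimestampGauges {

    private final MeterRegistry meterRegistry;
    // Keeping strong references here also stops the gauges' state objects
    // from being garbage collected (Micrometer only holds them weakly).
    private final Map<Tags, AtomicLong> gauges = new ConcurrentHashMap<>();

    public PartitionTimestampGauges(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void record(String cluster, String topic, int partition, long value) {
        Tags tags = Tags.of("cluster", cluster, "topic", topic, "partition", Integer.toString(partition));
        // computeIfAbsent registers the gauge only the first time this tag
        // combination is seen; gauge(...) hands back the AtomicLong it was given.
        gauges.computeIfAbsent(tags,
                t -> meterRegistry.gauge("name.of.query", t, new AtomicLong(-1)))
              .set(value);
    }
}
Since Tags implements equals()/hashCode(), it can be used directly as the map key, so the existence check and the update collapse into the computeIfAbsent call.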

Related

kafka streams DSL: add an option parameter to disable repartition when using `map` `selectByKey` `groupBy`

According to the documentation, a stream is marked for repartitioning after map, selectKey, or groupBy is applied, even if the new key is already partitioned appropriately. Is it possible to add an optional parameter to disable repartitioning?
Here is my use case:
There is a topic that has already been partitioned by user_id.
# topic 'user', format '%key,%value'
partition-1:
user1,{'user_id':'user1', 'device_id':'device1'}
user1,{'user_id':'user1', 'device_id':'device1'}
user1,{'user_id':'user1', 'device_id':'device2'}
partition-2:
user2,{'user_id':'user2', 'device_id':'device3'}
user2,{'user_id':'user2', 'device_id':'device4'}
I want to count user_id/device_id pairs using the DSL as follows:
stream
    .groupBy((user_id, value) -> {
        JSONObject event = new JSONObject(value);
        String userId = event.getString("user_id");
        String deviceId = event.getString("device_id");
        return String.format("%s&%s", userId, deviceId);
    })
    .count();
The new key is in fact already partitioned indirectly, so there is no need to repartition again.
If you use .groupBy(), it always causes data re-partitioning. If possible, use groupByKey instead, which re-partitions data only if required.
In your case you are changing the keys anyway, so a re-partition topic will be created.
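As a rough illustration of the difference, reusing the stream variable and the JSON parsing from the question (the names and the exact API version are assumptions, not tested code):
// Counting by the existing key (user_id): no re-keying, so groupByKey()
// does not insert a repartition topic.
KTable<String, Long> perUserCounts = stream
        .groupByKey()
        .count();

// Any re-keying, e.g. to "user_id&device_id", marks the stream for
// repartitioning, so the grouping that follows still goes through a
// repartition topic even when groupByKey() is used.
KTable<String, Long> perUserDeviceCounts = stream
        .selectKey((userId, value) -> userId + "&" + new JSONObject(value).getString("device_id"))
        .groupByKey()
        .count();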

Enrich each existing value in a cache with the data from another cache in an Ignite cluster

What is the most performant way to update a field of each existing value in an Ignite cache with data from another cache in the same cluster (tens of millions of records, about a kilobyte each)?
Pseudo code:
try (mappings = getCache("mappings")) {
    try (entities = getCache("entities")) {
        entities.forEach((key, entity) ->
            entity.setInternalId(mappings.getValue(entity.getExternalId())));
    }
}
I would advise using compute and sending a closure to all the nodes in the cache topology. Then, on each node, you would iterate through the local primary set and do the updates. Even with this approach you would still be better off batching up the updates and issuing them with a putAll call (or perhaps using IgniteDataStreamer).
NOTE: for the example below it is important that the keys in the "mappings" and "entities" caches are either identical or collocated. More information on collocation is here:
https://apacheignite.readme.io/docs/affinity-collocation
The pseudo code would look something like this:
ClusterGroup cacheNodes = ignite.cluster().forCache("mappings");
IgniteCompute compute = ignite.compute(cacheNodes);

compute.broadcast(() -> {
    IgniteCache<K, V1> mappings = getCache("mappings");
    IgniteCache<K, V2> entities = getCache("entities");

    // Iterate over local primary entries only.
    entities.localEntries(CachePeekMode.PRIMARY).forEach((entry) -> {
        V1 mappingVal = mappings.get(entry.getKey());
        V2 entityVal = entry.getValue();
        V2 newEntityVal = /* do enrichment */;

        // It would be better to build up a batch and then call putAll(...);
        // a simple put is used here for brevity.
        entities.put(entry.getKey(), newEntityVal);
    });
});
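A rough sketch of the batched variant mentioned in the comments above, written against Apache Ignite's public API (Ignition.localIgnite(), forCacheNodes, putAll); the Long key type, the hypothetical Entity value type with its setInternalId method, and the batch size are assumptions for illustration:
import java.util.HashMap;
import java.util.Map;

import javax.cache.Cache;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CachePeekMode;

// ...

ignite.compute(ignite.cluster().forCacheNodes("entities")).broadcast(() -> {
    // Resolve the node-local Ignite instance inside the closure.
    Ignite local = Ignition.localIgnite();
    IgniteCache<Long, Long> mappings = local.cache("mappings");   // key -> internal id (assumed)
    IgniteCache<Long, Entity> entities = local.cache("entities"); // Entity is a hypothetical value type

    Map<Long, Entity> batch = new HashMap<>();
    int batchSize = 1_000; // arbitrary

    // Each node only touches the entries it owns as primary.
    for (Cache.Entry<Long, Entity> entry : entities.localEntries(CachePeekMode.PRIMARY)) {
        Entity enriched = entry.getValue();
        enriched.setInternalId(mappings.get(entry.getKey())); // enrichment, as in the pseudo code above

        batch.put(entry.getKey(), enriched);
        if (batch.size() >= batchSize) {
            entities.putAll(batch); // one bulk update per batch instead of one put per entry
            batch.clear();
        }
    }
    if (!batch.isEmpty())
        entities.putAll(batch);
});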

StateMap keys across different instances of the same processor

NiFi 1.2.0.
In a custom processor, an LSN is used to fetch data from a SQL Server db table.
Following are the snippets of the code used for:
Storing a key-value pair
final StateManager stateManager = context.getStateManager();
try {
    StateMap stateMap = stateManager.getState(Scope.CLUSTER);
    final Map<String, String> newStateMapProperties = new HashMap<>();
    String lsnUsedDuringLastLoadStr = Base64.getEncoder().encodeToString(lsnUsedDuringLastLoad);
    // Just a constant String used as the key
    newStateMapProperties.put(ProcessorConstants.LAST_MAX_LSN, lsnUsedDuringLastLoadStr);
    if (stateMap.getVersion() == -1) {
        stateManager.setState(newStateMapProperties, Scope.CLUSTER);
    } else {
        stateManager.replace(stateMap, newStateMapProperties, Scope.CLUSTER);
    }
} catch (final IOException e) {
    // error handling omitted
}
Retrieving the key-value pair
final StateManager stateManager = context.getStateManager();
final StateMap stateMap;
final Map<String, String> stateMapProperties;
byte[] lastMaxLSN = null;
try {
    stateMap = stateManager.getState(Scope.CLUSTER);
    stateMapProperties = new HashMap<>(stateMap.toMap());
    lastMaxLSN = (stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN) == null
            || stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN).isEmpty()) ? null
            : Base64.getDecoder()
                    .decode(stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN).getBytes());
} catch (final IOException e) {
    // error handling omitted
}
When a single instance of this processor is running, the LSN is stored and retrieved properly and the logic of fetching data from SQL Server tables works fine.
As per the NiFi documentation about state management ("Storing and Retrieving State"):
State is stored using the StateManager's getState, setState, replace, and clear methods. All of these methods require that a Scope be provided. It should be noted that the state that is stored with the Local scope is entirely different than state stored with a Cluster scope. If a Processor stores a value with the key of My Key using the Scope.CLUSTER scope, and then attempts to retrieve the value using the Scope.LOCAL scope, the value retrieved will be null (unless a value was also stored with the same key using the Scope.CLUSTER scope). Each Processor's state is stored in isolation from other Processors' state.
When two instances of this processor are running, only one is able to fetch the data. This has led to the following question:
Is the StateMap a 'global map' that must have unique keys across instances of the same processor and also across instances of different processors? In other words, whenever a processor puts a key into the StateMap, does that key have to be unique across all NiFi processors (and other services, if any, that use the State API)? If yes, can anyone suggest what unique key I should use in my case?
Note: I quickly glanced at the standard MySQL CDC processor class (CaptureChangeMySQL.java) and it has similar logic for storing and retrieving state, so am I overlooking something?
The StateMap for a processor is stored underneath the id of the component, so if you have two instances of the same type of processor (meaning you can see two processors on the canvas) you would have something like:
/components/1111-1111-1111-1111 -> serialized state map
/components/2222-2222-2222-2222 -> serialized state map
Assuming 1111-1111-1111-1111 is the UUID of processor 1 and 2222-2222-2222-2222 is the UUID of processor 2. So the keys in the StateMap don't have to be unique across all instances because they are scoped per component id.
In a cluster, the component id of each component is the same on all nodes. So if you have a 3 node cluster and processor 1 has id 1111-1111-1111-1111, then there is a processor with that id on each node.
If that processor is scheduled to run on all nodes and stores cluster state, then all three instances of the processor are going to be updating the same StateMap in the clustered state provider (ZooKeeper).
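Since all nodes update the same cluster-scoped StateMap, the optimistic replace(...) call can lose a race against another node. One way to handle that (a sketch only, reusing the names from the question's snippets) is to check its boolean return value and retry:
// Sketch: retry until either setState (no prior state) or the optimistic
// replace (state unchanged since we read it) succeeds.
private void storeLastMaxLsn(final ProcessContext context, final String lsnUsedDuringLastLoadStr)
        throws IOException {
    final StateManager stateManager = context.getStateManager();

    boolean stored = false;
    while (!stored) {
        final StateMap current = stateManager.getState(Scope.CLUSTER);
        final Map<String, String> newState = new HashMap<>(current.toMap());
        newState.put(ProcessorConstants.LAST_MAX_LSN, lsnUsedDuringLastLoadStr);

        if (current.getVersion() == -1) {
            stateManager.setState(newState, Scope.CLUSTER); // nothing stored yet
            stored = true;
        } else {
            // replace() returns false if another node changed the state in the meantime
            stored = stateManager.replace(current, newState, Scope.CLUSTER);
        }
    }
}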

Kafka Streams API: I am joining two KStreams of empmodel

final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);
// empModelinput.print();
// empModelStream.print();
empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        System.out.println("Model1 " + paramV1.getKey());
        System.out.println("Model2 " + paramV2.getKey());
        return paramV1;
    }
}, JoinWindows.of("2000L"));
I get error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams you must ensure that both have the same number of partitions. (cf. "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams)
If you use Kafka v0.10.1+, repartitioning will happen automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
either ensure that the original input topics have the same number of partitions,
or add a call to .through("my-repartitioning-topic") to one of the KStreams before the join. You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e., the same number of partitions as the second KStream's original input topic) before you start your Streams application (see the sketch below).
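A sketch of the second option, reusing the stream names and the window spec from the question above (the topic name is only an example and must be created up front with a matching partition count):
// Pipe one side through a pre-created topic so both join inputs are co-partitioned.
KStream<String, EmpModel> repartitioned = empModelStream.through("my-repartitioning-topic");

// The join itself stays as in the question, now operating on co-partitioned streams.
repartitioned.join(empModelinput, (left, right) -> left, JoinWindows.of("2000L"));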

How to get keySet() and size() for entire GridGain cluster?

GridCache.keySet(), .primarySize(), and .size() only return information for that node.
How do I get this information for the whole cluster?
Scanning the entire cluster "works", but all I need is the keys or the count, not the values.
The problem is that an SQL query works if I want to search on an indexed field, but I can't search on the grid cache entry key itself.
My workaround, which works but is far from elegant or performant, is:
Set<String> ruleIds = FluentIterable.from(cache.queries().createSqlFieldsQuery("SELECT property FROM YagoRule").execute().get())
.<String>transform((it) -> (String) it.iterator().next()).toSet();
This requires that the key be the same as one of the fields, and that field needs to be indexed for performance reasons.
Next release of GridGain (6.2.0) will have globalSize() and globalPrimarySize() methods which will ask the cluster for the sizes.
For now you can use the following code:
// Only grab nodes on which cache "mycache" is started.
GridCompute compute = grid.forCache("mycache").compute();

Collection<Integer> res = compute.broadcast(
    // This code will execute on every caching node.
    new GridCallable<Integer>() {
        @Override public Integer call() {
            return grid.cache("mycache").size();
        }
    }
).get();

int sum = 0;

for (Integer i : res)
    sum += i;
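If it's the keys rather than the count that you need, the same broadcast pattern can collect the node-local key sets and merge them on the caller (a sketch against the same GridGain 6.x-style API as above; the String key type is an assumption):
// Gather the node-local key sets from every caching node...
Collection<Set<String>> perNodeKeys = compute.broadcast(
    new GridCallable<Set<String>>() {
        @Override public Set<String> call() {
            // keySet() only reflects this node's entries, as noted in the question.
            return new HashSet<>(grid.<String, Object>cache("mycache").keySet());
        }
    }
).get();

// ...and merge them on the calling side.
Set<String> allKeys = new HashSet<>();

for (Set<String> keys : perNodeKeys)
    allKeys.addAll(keys);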
