There is a simple query that filters a list and gets a field value of the item found.
myList.getParents().stream()
.filter(x -> x.getSomeField() == 1)
.map(x -> x.getOtherField())
.findFirst();
Are the operations executed one after another, as written in the code: from the initial list we filter everything where someField is 1, then we build a new list with the values of the other field, and then we take the first element of that new list?
Let's imagine there are 1,000,000 items in the list and 1000 remain after filtering. Will it map all 1000 items just to take the first one?
If I change the order as below, will that improve performance, or is the stream smart enough by itself?
myList.getParents().stream()
.filter(x -> x.getSomeField() == 1)
.findFirst()
.map(x -> x.getOtherField());
No. In a Java 8 stream pipeline, each data item is processed in one single pass through all the stages; the pipeline is lazy and pulls elements one at a time. That enables short-circuit evaluation and gives more room to optimize.
For instance, in your case we take the first item and apply the filter. Suppose it satisfies the filter criterion: we then apply the mapping and hand that element to findFirst, which completes the pipeline. We never need to access any other element in the stream source, since each element is processed in a single pass. This short-circuit evaluation allows us to do more optimizations.
As for the reordered version: it compiles, but note that findFirst() is the terminal operation and ends the stream right there, returning an Optional; the trailing map is Optional.map, applied to at most one value, not a stream stage. Since the original pipeline is already lazy and short-circuiting, reordering buys you no performance -- either way, the terminal operation findFirst marks the end of the stream pipeline.
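You can see this laziness for yourself by adding print statements to each stage. A minimal self-contained sketch (the numbers are made up):

import java.util.Optional;
import java.util.stream.Stream;

public class LazyStreamDemo {
    public static void main(String[] args) {
        Optional<Integer> result = Stream.of(5, 1, 3, 1, 7)
                .filter(x -> { System.out.println("filter: " + x); return x == 1; })
                .map(x -> { System.out.println("map: " + x); return x * 10; })
                .findFirst();
        System.out.println(result); // Optional[10]
        // Prints "filter: 5", "filter: 1", "map: 1" -- elements after the
        // first match are never pulled through the pipeline.
    }
}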
I have a question regarding pattern matching on Redis keys. Currently I am storing a set of subscriptions, where each key is a composite of different event attributes. For example, if a subscription comes in as
S1 - {event:created, userId:1234, stateId:xyz}
it is stored in the cache for matching as follows (attributes are sorted before creating the key):
event:created#stateId:xyz#userId:1234 = {S1}
Now there can be other subscriptions for this exact combination. But an event can also come in with any of the three attributes, and it must then be matched against all keys that contain those attributes as a substring. Example:
event:created#stateId:xyz#userId:1234 = {S1,S2,S3}
event:started#stateId:xyz#userId:1234 = {S4,S5,S6}
event:created#stateId:abc#userId:1234 = {S7,S8,S9}
The following will be the event and subscription chart.
event:created -> S1,S2,S3,S7,S8,S9
event:started -> S4,S5,S6
stateId:xyz -> S1,S2,S3,S4,S5,S6
userId:1234 -> S1,S2,S3,S4,S5,S6,S7,S8,S9
stateId:abc and userId:1234 -> S1,S2,S3,S4,S5,S6,S7,S8,S9
I tried using a SCAN on Redis with a pattern match, but it takes a long time as my cache can have a lot of entries, and SCAN takes O(N) time.
Any idea how I can do this efficiently? Maybe by using a secondary structure in Redis like a Tree or something? Or any other Redis data structure I should look at?
Thanks
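For what it's worth, here is one direction sketched with sets as the secondary structure (a hedged sketch, not a drop-in solution; the idx: key names and the Jedis client are my assumptions). Instead of composite string keys, keep one Redis set per attribute value and combine sets at match time, which avoids SCAN entirely:

import java.util.Set;
import redis.clients.jedis.Jedis;

public class SubscriptionIndex {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Index every subscription under each attribute it was registered
            // with (values taken from the chart above).
            jedis.sadd("idx:event:created", "S1", "S2", "S3", "S7", "S8", "S9");
            jedis.sadd("idx:event:started", "S4", "S5", "S6");
            jedis.sadd("idx:stateId:xyz", "S1", "S2", "S3", "S4", "S5", "S6");
            jedis.sadd("idx:stateId:abc", "S7", "S8", "S9");
            jedis.sadd("idx:userId:1234", "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9");

            // Single-attribute match: one set lookup instead of a full SCAN.
            Set<String> byEvent = jedis.smembers("idx:event:created");

            // Multi-attribute match with union semantics, as in the chart:
            Set<String> byStateAndUser = jedis.sunion("idx:stateId:abc", "idx:userId:1234");

            System.out.println(byEvent);        // S1, S2, S3, S7, S8, S9 (in some order)
            System.out.println(byStateAndUser); // S1 through S9 (in some order)
        }
    }
}

The write path has to add a subscription to one set per attribute, but the read path drops from a scan over all keys to a handful of set operations.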
Summary
I wish to measure the time elapsed in milliseconds, on the GPU, for running the entire graphics pipeline. The goal: to be able to save benchmarks before/after optimizing the code (the next step would be mipmapping textures) to see improvements. This was really simple in OpenGL, but I'm new to Vulkan and could use some help.
I have browsed related existing answers (here and here), but they aren't really of much help. And I cannot find code samples anywhere, so I dare ask here.
Through documentation pages I have found a couple of functions that I think I should be using, so I have in place something like this:
1: Creating query pool
void CreateQueryPool()
{
    VkQueryPoolCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
    createInfo.pNext = nullptr; // Optional
    createInfo.flags = 0; // Reserved for future use, must be 0!
    createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
    createInfo.queryCount = mCommandBuffers.size() * 2; // REVIEW

    VkResult result = vkCreateQueryPool(mDevice, &createInfo, nullptr, &mTimeQueryPool);
    if (result != VK_SUCCESS)
    {
        throw std::runtime_error("Failed to create time query pool!");
    }
}
I had the idea of queryCount = mCommandBuffers.size() * 2 to leave space for a separate query timestamp before and after rendering, but I have no clue whether this assumption is correct.
2: Recording command buffers
// recording command buffer i:
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i);
// render pass ...
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i);
vkCmdCopyQueryPoolResults(/* many parameters here */);
I'm looking for a couple of clarifications:
What is the consequence of writing to the same query index? Do I need two separate query pools -- one for the before-render timestamps and one for the after-render timestamps?
How should I handle synchronization? I assume I need a separate query for each command buffer.
For the destination buffer containing the query results, is it good enough to use memory with the host-visible bit set, or do I need a staging buffer for device-local memory? I'm a bit lost on this one as well.
I have not been able to find any online examples of how to measure render time, but I just assume it's such a common task that surely there must be an example out there somewhere.
Thanks to @karlschultz, I managed to get something working. In case other people are looking for the same answer, I decided to post my findings here. For the Vulkan experts out there: please let me know if I make obvious mistakes, and I will correct them here!
Query Pool Creation
I fill out a VkQueryPoolCreateInfo struct as described in my question, and let its queryCount field equal twice the number of command buffers, leaving space for one query before and one after rendering.
Important here is to reset all entries in the query pool before using the queries, and to reset each query again after its result has been read. This necessitates a few changes:
1) Asking graphics queue if timestamps are supported
When picking the graphics queue family, the struct VkQueueFamilyProperties has a field timestampValidBits, which must be greater than 0; otherwise the queue family cannot be used for timestamp queries!
2) Determining the timestamp period
The physical device contains a special value which indicates the number of nanoseconds it takes for a timestamp query to be incremented by 1. This is necessary to interpret the query result as e.g. nanoseconds or milliseconds. That value is a float, and can be retrieved by calling vkGetPhysicalDeviceProperties and looking at the field VkPhysicalDeviceProperties.limits.timestampPeriod.
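The conversion itself is plain arithmetic; here is a minimal sketch (written as Java-style code for brevity, since it is language-neutral; the function name is mine):

// Convert a pair of raw 64-bit timestamp-query results into milliseconds,
// given timestampPeriod (nanoseconds per tick) read from
// VkPhysicalDeviceProperties.limits.timestampPeriod.
static double queryDeltaToMs(long beforeTicks, long afterTicks, float timestampPeriodNs) {
    long deltaTicks = afterTicks - beforeTicks;                 // raw tick delta
    double elapsedNs = deltaTicks * (double) timestampPeriodNs; // ticks -> nanoseconds
    return elapsedNs / 1_000_000.0;                             // nanoseconds -> milliseconds
}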
3) Asking for query reset support
During logical device creation, one must fill out a struct and add it to the pNext chain to enable the host query reset feature (core in Vulkan 1.2, otherwise provided by the VK_EXT_host_query_reset extension):
VkPhysicalDeviceHostQueryResetFeatures resetFeatures{};
resetFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_HOST_QUERY_RESET_FEATURES;
resetFeatures.pNext = nullptr;
resetFeatures.hostQueryReset = VK_TRUE;

VkDeviceCreateInfo createInfo{};
createInfo.pNext = &resetFeatures;
4) Recording command buffers
Timestamp queries should be outside the scope of the render pass, as seen below. Due to (potential) temporal overlap of pipeline stages, it is not possible to measure the running time of a single shader stage (e.g. the fragment shader) this way -- only the entire pipeline, or whatever lies between the two timestamps outside the render pass.
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i * 2);
vkCmdBeginRenderPass(/* ... */);
// render here...
vkCmdEndRenderPass(mCommandBuffers[i]);
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i * 2 + 1);
5) Retrieving query result
We have two methods for this: vkCmdCopyQueryPoolResults and vkGetQueryPoolResults. I chose to go with the latter, since it greatly simplifies the setup and does not require synchronization with GPU buffers.
Given a swapchain index (which in my scenario is the same as the command buffer index!), I have a setup like this:
void FetchRenderTimeResults(uint32_t swapchainIndex)
{
    uint64_t buffer[2];
    VkResult result = vkGetQueryPoolResults(mDevice, mTimeQueryPool, swapchainIndex * 2, 2,
                                            sizeof(uint64_t) * 2, buffer, sizeof(uint64_t),
                                            VK_QUERY_RESULT_64_BIT);
    if (result == VK_NOT_READY)
    {
        return;
    }
    else if (result == VK_SUCCESS)
    {
        mTimeQueryResults[swapchainIndex] = buffer[1] - buffer[0];
    }
    else
    {
        throw std::runtime_error("Failed to receive query results!");
    }

    // Queries must be reset after each individual use.
    vkResetQueryPool(mDevice, mTimeQueryPool, swapchainIndex * 2, 2);
}
The variable mTimeQueryResults refers to an std::vector<uint64_t> holding one result per swapchain image. I use it to calculate an average rendering time each second, using the timestamp period determined in step 2).
And one must not forget to clean up the query pool by calling vkDestroyQueryPool.
There are a lot of details omitted, and for a total Vulkan noob like me this setup was frightening and took several days to figure out. Hopefully this will spare someone else the headache.
More info in the documentation.
Writing to the same query index is bad because you are overwriting your "before" timestamp with the "after" timestamp at the same query index. You might want to change the last parameter in your write timestamp calls to i * 2 for the "before" call and to i * 2 + 1 for the "after". You are already allocating 2 timestamps for each command buffer, but only using half of them. This scheme ends up producing a pair of before/after timestamps for each command buffer i.
I don't have any experience using vkCmdCopyQueryPoolResults(). If you can idle your queue, then calling vkGetQueryPoolResults() after the idle will probably be much easier for what you are doing here. It copies the query results back into host memory, so you don't have to mess with synchronizing writes to another buffer and then mapping/reading it back.
I tried a little experiment, and I'm wondering how to explain what I'm seeing. The purpose of the experiment was to try to understand how Kafka Streams does multithreading. I created and populated an input topic with three partitions. Then I created a Streams graph that included the following, and configured it to run with three threads.
kstream = kstream.mapValues(tsdb_object -> {
    System.out.println("mapValues: Thread " + Thread.currentThread().getId());
    return tsdb_object;
});

// Add operator to print results to stdout:
Printed<Long, TsdbObject> printed = Printed.toSysOut();
kstream.print(printed);

KGroupedStream<Long, TsdbObject> kstream_grouped_by_key =
        kstream.groupByKey(Serialized.with(Serdes.Long(), TsdbObject.getSerde()));

KTable<Long, TsdbObject> summation =
        kstream_grouped_by_key.reduce((tsdb_object1, tsdb_object2) -> {
            System.out.println("reducer: Thread " + Thread.currentThread().getId());
            return tsdb_object1;
        });
I figured that the first print statement would print messages with three different thread ids, and that's what happened. However, I figured that the second print statement, being issued in the middle of an aggregation (reduce) operation, would print messages listing only one thread id, under the assumption that the reduction would NOT be multithreaded. This turned out not to be true: the second print produced messages listing three different thread ids.
Can someone please explain briefly how the aggregation (reducer) is running in three different threads? Are they running in parallel?
Yes, the aggregation is executed with 3 threads as well, and each thread does the aggregation for about 1/3 of all keys.
Why would you assume that the aggregation is not multithreaded? Note that it is an aggregation per key, so the result for each key is independent of the results for all other keys. This is what allows the computation to be parallelized.
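For context, the three threads in such an experiment come from the num.stream.threads setting; each thread is assigned one of the three input partitions (one task per partition) and therefore handles a disjoint third of the keys. A minimal sketch of that configuration (application id and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ThreadedStreamsConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "threading-experiment"); // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker
        // One stream thread per input partition; each thread owns the tasks
        // (and therefore the keys) of the partitions assigned to it.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 3);
        return props;
    }
}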
How do I implement a sliding window aggregation (or transformation) with a fixed-size count-based window?
For example, if I have stream data like the following:
input stream = 1,2,3,4,5,6,7,8...
Assume that time is not relevant here, and say my aggregate function is AVERAGE and the window size is fixed at 3 records (not 3 millis, 3 secs, 3 hours, etc.). I would like my output stream to be
output stream = avg(1,2,3), avg(2,3,4), avg(3,4,5), avg(4,5,6), avg(5,6,7)... = 2,3,4,5,6...
The windows documented in Kafka Streams are "time-based". Even the constructor of the base class Window has the following signature:
Window(long startMs, long endMs)
So I was not sure if it is the right tool for non-time-based windowed aggregation.
Apache Flink supports count-based sliding and tumbling windows. That's exactly what I need, but I'm looking for a similar feature in Kafka Streams.
If time-ordering is no concern for you, you can implement a custom Transformer with attached state.
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(...); // add KeyValueStore here
KStream result = builder.stream("topic").transform(...); // pass in name of your KeyValueStore, too
For your custom Transformer, you can maintain a List per key, with the list being your window -- as long as the list is smaller than your window size, you append the new record to it -- if it is exactly the size, you trigger the computation -- if it exceeds the size, you trim it first and trigger the computation afterwards.
See the docs for more details: https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html (Note that a Processor and a Transformer are basically the same thing.)
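A minimal sketch of such a Transformer, assuming Long keys, Double values, a window size of 3 as in the question, a recent Kafka Streams version (2.0+), and a state store registered under the hypothetical name "window-store" with a serde able to hold a List<Double>:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Count-based sliding average over the last WINDOW_SIZE values per key.
public class CountWindowAvg implements Transformer<Long, Double, KeyValue<Long, Double>> {
    private static final int WINDOW_SIZE = 3;
    private KeyValueStore<Long, List<Double>> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<Long, List<Double>>) context.getStateStore("window-store");
    }

    @Override
    public KeyValue<Long, Double> transform(Long key, Double value) {
        List<Double> window = store.get(key);
        if (window == null) {
            window = new ArrayList<>();
        }
        window.add(value);
        if (window.size() > WINDOW_SIZE) {
            window.remove(0); // slide: drop the oldest value
        }
        store.put(key, window);
        if (window.size() < WINDOW_SIZE) {
            return null; // window not full yet: emit nothing
        }
        double sum = 0;
        for (double v : window) {
            sum += v;
        }
        return KeyValue.pair(key, sum / WINDOW_SIZE);
    }

    @Override
    public void close() {}
}

It would be wired in with something like builder.stream("topic").transform(CountWindowAvg::new, "window-store").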
If you are willing to use Apache Storm, which is also a streaming engine, Kafka can be connected to it as a data source. Newer Storm versions provide a concept called a tumbling window, which delivers an exact number of tuples to your topology (count-based sliding windows are supported as well). This can easily be used to solve your problem.
For more have a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_storm-component-guide/content/storm-windowing-concepts.html
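For illustration, a hedged sketch of a count-based window bolt in Storm (the class name and the field position of the value are assumptions; withWindow(Count(3), Count(1)) yields a sliding window of 3 tuples advancing 1 tuple at a time, matching the averages in the question):

import java.util.List;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.windowing.TupleWindow;

// Prints the average of the values in each 3-tuple window.
public class AvgWindowBolt extends BaseWindowedBolt {
    @Override
    public void execute(TupleWindow window) {
        List<Tuple> tuples = window.get();
        double sum = 0;
        for (Tuple t : tuples) {
            sum += t.getDouble(0); // assumes the value is the tuple's first field
        }
        System.out.println("avg = " + sum / tuples.size());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This sketch only prints; nothing is emitted downstream.
    }
}

// Wiring it up (window of 3 tuples, sliding 1 tuple at a time):
// builder.setBolt("avg", new AvgWindowBolt()
//         .withWindow(new BaseWindowedBolt.Count(3), new BaseWindowedBolt.Count(1)), 1);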
I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning=true. However, that will not directly help in your case, because it assumes the keys are not changed.
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
If you were able to know definitively that none of the keys would move outside of their original partitions the above would work. But the values on the boundaries would likely not cooperate.
What is the scale of the data compared to the number of migrating keys? You may consider adding a postprocessing step: first construct a partition for all migrating keys, have your mapper output a special key value for keys needing to migrate, and then postprocess the results to append them to the standard partitions. That is extra hassle, so you would need to weigh the extra step against the added pipeline complexity.
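If you do try the custom partitioners mentioned in the question, a minimal Spark sketch might look like the following (Java; the 'a'-'d' / 'e'-'h' buckets mirror the example above and are otherwise hypothetical). Note that a partitioner only controls which partition a key lands in, not which node that partition runs on -- co-locating reducers with the map output remains up to the scheduler:

import org.apache.spark.Partitioner;

// Range-style partitioner that keeps neighbouring post-map keys in the same
// partition, so records that stay near their original position rarely need
// to cross the network during the shuffle.
public class AlphaRangePartitioner extends Partitioner {
    private final int numPartitions;

    public AlphaRangePartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        char first = ((String) key).charAt(0);
        int bucket = (first - 'a') / 4; // 'a'-'d' -> 0, 'e'-'h' -> 1, ...
        return Math.min(Math.max(bucket, 0), numPartitions - 1);
    }
}

// Usage: pairRdd.partitionBy(new AlphaRangePartitioner(2))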