Use case: our data structure looks like this:
tp1 "i1" : {object hash}, "i2" : {object hash}
tp2 "i3" : {object hash}, "i4" : {object hash}
tp1 and tp2 are hash keys written with HMSET; we refer to them as tp keys.
Each tp key can have 100-200 entries in it, and each hash value is 1-1.5 KB in size.
Below is our implementation with Spring Data:
public Map<String, Map<String, T>> getAllMulti(List<String> keys) {
    long start = System.currentTimeMillis();
    log.info("Redis pipeline fetch started with keys size :{}", keys.size());
    Map<String, Map<String, T>> responseMap = new HashMap<>();
    if (CollectionUtils.isNotEmpty(keys)) {
        List<Object> resultSet = redisTemplate.executePipelined((RedisCallback<T>) connection -> {
            for (String key : keys) {
                connection.hGetAll(key.getBytes());
            }
            return null;
        });
        responseMap = IntStream.range(0, keys.size())
                .boxed()
                .collect(Collectors.toMap(keys::get, i -> (Map<String, T>) resultSet.get(i)));
    }
    long timeTaken = System.currentTimeMillis() - start;
    log.info("Time taken in redis pipeline fetch: {}", timeTaken);
    return responseMap;
}
Objective: Our objective is to load the hashes of around 500-600 tp keys. We thought of using a Redis pipeline for this purpose, but as we increase the number of tp keys, the response time increases significantly, and it is not consistent either.
To improve the response time we have tried compression (MessagePack), but saw no benefit.
One more solution we have tried is partitioning our tp keys into multiple partitions and running the above implementation in parallel (a sketch of that variant is shown below). The observation is that if the number of tp keys is small, a batch takes less time; as the total number of tp keys increases, the time taken for a batch with the same number of keys also increases.
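For reference, a minimal sketch of that partitioned variant (the partition size and the use of the common ForkJoinPool via parallelStream are assumptions for illustration, not our exact code):

public Map<String, Map<String, T>> getAllMultiPartitioned(List<String> keys, int partitionSize) {
    // Split the tp keys into fixed-size partitions.
    List<List<String>> partitions = new ArrayList<>();
    for (int i = 0; i < keys.size(); i += partitionSize) {
        partitions.add(keys.subList(i, Math.min(i + partitionSize, keys.size())));
    }
    // Run one pipelined fetch per partition in parallel and merge the per-partition maps.
    return partitions.parallelStream()
            .map(this::getAllMulti)
            .flatMap(m -> m.entrySet().stream())
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
}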
Any help/lead will be appreciated. Thanks
Using Spring
private JdbcTemplate jdbcTemplate;
RowMapperResultSetExtractor<DataPoint<Arpu>> resultSetExtractor = new RowMapperResultSetExtractor<>(rowMapper);
In the ResultSetExtractor, the extractData(ResultSet rs) method looks like this:
@Override
public List<T> extractData(ResultSet rs) throws SQLException {
    System.out.println("Inside result set extractor. " + LocalDateTime.now());
    List<T> results = (this.rowsExpected > 0 ? new ArrayList<>(this.rowsExpected) : new ArrayList<>());
    long start = System.currentTimeMillis();
    int rowNum = 0;
    while (rs.next()) {
        results.add(this.rowMapper.mapRow(rs, rowNum++));
    }
    System.out.println("ResultSet Extractor time => " + (System.currentTimeMillis() - start) / 1000 + "s");
    System.out.println("Total size of the results: " + results.size());
    return results;
}
I am seeing that the ResultSet is an instance of SnowflakeResultSetV1 and fetches the data in chunks, which slows down the extractData() method inside the ResultSetExtractor.
How can I control the chunking in snowflake-jdbc, or fetch the whole data set at once, to improve the timing?
Thanks
There are two parameters that can be used to tune result set fetching:
CLIENT_PREFETCH_THREADS - Parameter that specifies the number of threads used by the client to pre-fetch large result sets. The driver will attempt to honor the parameter value, but defines the minimum and maximum values (depending on your system’s resources) to improve performance.
CLIENT_RESULT_CHUNK_SIZE - Parameter that specifies the maximum size of each set (or chunk) of query results to download (in MB). The JDBC driver downloads query results in chunks.
Be aware that playing with these parameters can impact the memory usage.
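If you want to set these from the Java side, one option (a sketch, assuming the JdbcTemplate from the question; the values are only illustrative) is to issue ALTER SESSION statements before running the query:

// Illustrative values; both are Snowflake session parameters.
// Note: with a connection pool, ALTER SESSION only affects the connection it runs on,
// so you may prefer to set these at the user or account level instead.
jdbcTemplate.execute("ALTER SESSION SET CLIENT_PREFETCH_THREADS = 8");
jdbcTemplate.execute("ALTER SESSION SET CLIENT_RESULT_CHUNK_SIZE = 160");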
I'm looking for a solution on how to assign a random UUID to a key only on its first occurrence in a stream.
Example:
time   key   value   assigned uuid
 |      1      A     fff17a1e-9943-11eb-a8b3-0242ac130003
 |      2      B     f01d2c42-9943-11eb-a8b3-0242ac130003
 |      3      C     f8f1e880-9943-11eb-a8b3-0242ac130003
 |      1      X     fff17a1e-9943-11eb-a8b3-0242ac130003 (same as above)
 v      1      Y     fff17a1e-9943-11eb-a8b3-0242ac130003 (same as above)
As you can see, fff17a1e-9943-11eb-a8b3-0242ac130003 is assigned to key "1" on its first occurrence. This UUID is subsequently reused on the second and third occurrences. The order doesn't matter, though. There is no seed for the generated UUID either.
My idea was to use a leftJoin() between a KStream and a KTable holding key/UUID mappings. If the right side of the leftJoin is null, I have to create a new UUID and add it to the mapping table. However, I think this does not work when several new entries with the same key arrive in a short period of time; I guess that would create several UUIDs for the same key.
Is there an easy solution for this, or is it simply not possible with streaming?
I don't think you need a join in your use case, because joins are for merging two different streams that arrive with equal IDs. You said that you receive just one stream of events, so your use case is an aggregation over one stream.
What I understood from your question is that you receive events A, B, C, ... and want to assign some ID to each. You say that the ID is random, which makes this uncertain: if it is truly random, how would you know that A -> fff17a1e-9943-11eb-a8b3-0242ac130003 and X -> fff17a1e-9943-11eb-a8b3-0242ac130003 (the same)? I suppose you might have a seed to generate this UUID, and that you then create the key based on this seed as well.
I suggest you start with this word count sample. Then, replace the first map:
.map((key, value) -> new KeyValue<>(value, value))
with your own map function, something like this:
.map((k, v) -> {
    if (v.equalsIgnoreCase("A")) {
        return new KeyValue<String, ValueWithUUID>("1", new ValueWithUUID(v));
    } else if (v.equalsIgnoreCase("B")) {
        return new KeyValue<String, ValueWithUUID>("2", new ValueWithUUID(v));
    } else {
        return new KeyValue<String, ValueWithUUID>("0", new ValueWithUUID(v));
    }
})
...
class ValueWithUUID {
    String value;
    String uuid;

    public ValueWithUUID(String value) {
        this.value = value;
        // generate your UUID based on the value. It is random, but as you show in your question it might have a seed.
        this.uuid = generateRandomUUIDWithSeed();
    }

    public String generateRandomUUIDWithSeed() {
        return "fff17a1e-9943-11eb-a8b3-0242ac130003";
    }
}
Then you decide whether you want a windowed aggregation (every 30 seconds, for instance) or a non-windowed aggregation that updates the result for every event that arrives. Here is one nice example.
You can aggregate the raw stream into a KTable, generating or reusing the UUID inside the aggregation, and then use the stream view of that KTable.
final KStream<String, String> streamWithoutUUID = builder.stream("topic_name");

KTable<String, String> tableWithUUID = streamWithoutUUID.groupByKey().aggregate(
        () -> "",
        (k, v, t) -> {
            if (!t.startsWith("uuid:")) {
                return "uuid:" + "call your buildUUID function here" + ";value:" + v;
            } else {
                return t.split(";", 2)[0] + ";value:" + v;
            }
        },
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("state_name")
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));

final KStream<String, String> streamWithUUID = tableWithUUID.toStream();
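From there you can, for example, write the enriched stream to an output topic (the topic name below is just a placeholder):

// Hypothetical output topic; any downstream processor could read streamWithUUID instead.
streamWithUUID.to("topic_name_with_uuid", Produced.with(Serdes.String(), Serdes.String()));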
I need to count the frequency of words in an array and then handle the result (I think it has to be an entrySet)... The order of the entrySet is also important, so I suppose the array must be converted to a LinkedHashMap...
Map<String, Integer> map = new LinkedHashMap<>();
for (String word : words) {
    Integer count = map.get(word);
    count = (count == null) ? 1 : ++count;
    map.put(word, count);
}
I found the following solution, but the order is not respected.
Also, is it possible to use that stream without a collect operation (with map or flatMap instead)?
Map<String, Long> collect =
        wordsList.stream().collect(groupingBy(Function.identity(), counting()));
Thank you.
You're close, but you'll need to use the groupingBy collector that takes three arguments, like this:
LinkedHashMap<String, Long> resultSet =
        wordsList.stream()
                 .collect(groupingBy(Function.identity(),
                                     LinkedHashMap::new,
                                     counting()));
The second argument to the groupingBy collector is the map supplier; since you want a LinkedHashMap, that's what you need to supply, as shown above.
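As a quick, self-contained check (the sample words are just an illustration), the keys come out in order of their first occurrence:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

public class WordCountDemo {
    public static void main(String[] args) {
        List<String> wordsList = Arrays.asList("b", "a", "b", "c", "a", "b");
        LinkedHashMap<String, Long> resultSet =
                wordsList.stream()
                         .collect(groupingBy(Function.identity(), LinkedHashMap::new, counting()));
        // Prints {b=3, a=2, c=1} - insertion order follows the first occurrence of each word.
        System.out.println(resultSet);
    }
}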
I have written a Spark job, which looks like this:
public class TestClass {
    public static void main(String[] args) {
        String masterIp = args[0];
        String appName = args[1];
        String inputFile = args[2];
        String output = args[3];
        SparkConf conf = new SparkConf().setMaster(masterIp).setAppName(appName);
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> rdd = sparkContext.textFile(inputFile);
        Integer[] keyColumns = new Integer[] {0, 1, 2};
        Broadcast<Integer[]> broadcastJob = sparkContext.broadcast(keyColumns);

        Function<Integer, Long> createCombiner = v1 -> Long.valueOf(v1);
        Function2<Long, Integer, Long> mergeValue = (v1, v2) -> v1 + v2;
        Function2<Long, Long, Long> mergeCombiners = (v1, v2) -> v1 + v2;

        JavaPairRDD<String, Long> pairRDD = rdd.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = -6293440291696487370L;

            @Override
            public Tuple2<String, Integer> call(String t) throws Exception {
                String[] record = t.split(",");
                Integer[] keyColumns = broadcastJob.value();
                StringBuilder key = new StringBuilder();
                for (int index = 0; index < keyColumns.length; index++) {
                    key.append(record[keyColumns[index]]);
                }
                key.append("|id=1");
                Integer value = new Integer(record[4]);
                return new Tuple2<String, Integer>(key.toString(), value);
            }
        }).combineByKey(createCombiner, mergeValue, mergeCombiners).reduceByKey((v1, v2) -> v1 + v2);

        pairRDD.saveAsTextFile(output);
    }
}
The program calculates the sum of values for each key.
As per my understanding, the local combiner should run on each node and add up the values for the same keys, so that the shuffle then happens with only a small amount of data.
But the Spark UI is showing a huge amount of shuffle read and shuffle write (almost 58 GB).
Am I doing anything wrong?
How can I tell whether the local combiner is working?
Cluster details:
20-node cluster
Each node has an 80 GB hard disk, 8 GB RAM, and 4 cores
Hadoop 2.7.2
Spark 2.0.2 (prebuilt-with-Hadoop-2.7.x distribution)
Input file details:
The input file is stored on HDFS
Input file size: 400 GB
Number of records: 16,129,999,990
Record columns: String(2 char), int, int, String(2 char), int, int, String(2 char), String(2 char), String(2 char)
Note:
The maximum number of distinct keys is 1,081,600.
In the Spark logs I see the tasks running with locality level NODE_LOCAL.
Let's decompose this problem and see what we get. To simplify the computations, let's assume that:
The total number of records is 1.6e8
The number of unique keys is 1e6
The split size is 128 MB (this seems to be consistent with the number of tasks in your UI).
With these values, the data will be split into ~3200 partitions (3125 in your case). This gives you around 51,200 records per split. Furthermore, if the distribution of the number of values per key is uniform, there should be ~160 records per key on average.
If the data is randomly distributed (for example, it is not sorted by key), you can expect the average number of records per key per partition to be close to one*. This is basically the worst-case scenario, where map-side combine doesn't reduce the amount of data at all.
Furthermore, you have to remember that the size of a flat file will typically be significantly lower than the size of the serialized objects.
With real-life data you can typically expect some type of order to emerge from the data collection process, so things should be better than what we calculated above, but the bottom line is that, if the data is not already grouped by partition, map-side combine may provide no improvement at all.
You could probably decrease the amount of shuffled data by using a somewhat larger split (256 MB would give you a bit over 100K records per partition), but it comes at the price of longer GC pauses and possibly other GC issues.
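For example, one way to request larger splits (a sketch; mapreduce.input.fileinputformat.split.minsize is the standard Hadoop 2.x property, and 268435456 bytes is 256 MB) is to raise the minimum split size on the Hadoop configuration before calling textFile:

// Ask the underlying input format for splits of at least 256 MB.
sparkContext.hadoopConfiguration()
        .set("mapreduce.input.fileinputformat.split.minsize", "268435456");
JavaRDD<String> rdd = sparkContext.textFile(inputFile);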
* You can either simulate this by taking samples with replacement:
import pandas as pd
import numpy as np

(pd
 .DataFrame({"x": np.random.choice(np.arange(3200), size=160, replace=True)})
 .groupby("x")
 .x.count()
 .mean())
or just think about the problem of randomly assigning 160 balls to 3200 buckets.
Is there a better way of sorting a collection in Java 8 without first checking whether the collection is empty or null?
if (institutions != null && !institutions.isEmpty()) {
    Collections.sort(institutions);
}
Though the question is old, just adding another way of doing it.
First of all, the collection shouldn't be null. Assuming it isn't:
institutions.sort(Comparator.comparing(Institutions::getId));
I can only think of 3 (4) ways:
Use a SortedSet (e.g. TreeSet) and insert into it. Elements will be sorted right away; however, insertion time may be bad. Also, you cannot have equal elements in there (e.g. 3x 1), so it might not be the best solution.
Then there is the normal Collections.sort(). You don't have to check that your list is empty, but you do have to make sure it is not null. Frankly though, do you ever have a use case where your list is null and you want to sort it? This sounds like it might be a bit of a design issue.
Finally, you can use streams to return sorted streams. I wrote a little test that measures the time of this:
public static void main(String[] args) {
    List<Integer> t1 = new ArrayList<>();
    List<Integer> t2 = new ArrayList<>();
    List<Integer> t3 = new ArrayList<>();
    for (int i = 0; i < 100_000_00; i++) {
        int tmp = new Random().nextInt();
        t1.add(tmp);
        t2.add(tmp);
        t3.add(tmp);
    }

    long start = System.currentTimeMillis();
    t1.sort(null); // equivalent to Collections.sort() - in place sort
    System.out.println("T1 Took: " + (System.currentTimeMillis() - start));

    start = System.currentTimeMillis();
    List<Integer> sortedT2 = t2.stream().sorted().collect(Collectors.toList());
    System.out.println("T2 Took: " + (System.currentTimeMillis() - start));

    start = System.currentTimeMillis();
    List<Integer> sortedT3 = t3.parallelStream().sorted().collect(Collectors.toList());
    System.out.println("T3 Took: " + (System.currentTimeMillis() - start));
}
Sorting 10 million random integers results in (times in ms, on my box obviously):
Collections.sort() -> 4163
stream.sorted() -> 4485
parallelStream().sorted() -> 1620
A few points:
Collections.sort() and List#sort will sort the existing list in place. The streaming API (both parallel and normal) will create new sorted lists.
Again - the stream can be empty, but it can't be null. It appears that parallel streams are the quickest; however, you have to keep in mind the pitfalls of parallel streams. Read some info, e.g. here: Should I always use a parallel stream when possible?
Finally, if you want to check for null first, you can write your own static helper, for example:
public static <T extends Comparable<? super T>> void saveSort(final List<T> myList) {
    if (myList != null) {
        myList.sort(null);
    }
}

public static <T> void saveSort(final List<T> myList, Comparator<T> comparator) {
    if (myList != null) {
        myList.sort(comparator);
    }
}
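Used with the institutions list from the question, that could look like this (Institutions::getId is borrowed from the earlier snippet):

saveSort(institutions); // natural ordering; silently skips a null list
saveSort(institutions, Comparator.comparing(Institutions::getId));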
I hope that helps!
Edit: Another Java 8 advantage for sorting is that you can supply your comparator as a lambda:
List<Integer> test = Arrays.asList(4,2,1,3);
test.sort((i1, i2) -> i1.compareTo(i2));
test.forEach(System.out::println);