Spark is shuffling large amount of data


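I have written a Spark job, which looks like this:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;
public class TestClass {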
public static void main(String[] args){
String masterIp = args[0];
String appName = args[1];
String inputFile = args[2];
String output = args[3];
SparkConf conf = new SparkConf().setMaster(masterIp).setAppName(appName);
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> rdd = sparkContext.textFile(inputFile);
Integer[] keyColumns = new Integer[] {0,1,2};
Broadcast<Integer[]> broadcastJob = sparkContext.broadcast(keyColumns);
Function<Integer,Long> createCombiner = v1 -> Long.valueOf(v1);
Function2<Long, Integer, Long> mergeValue = (v1,v2) -> v1+v2;
Function2<Long, Long, Long> mergeCombiners = (v1,v2) -> v1+v2;
JavaPairRDD<String, Long> pairRDD = rdd.mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = -6293440291696487370L;
@Override
public Tuple2<String, Integer> call(String t) throws Exception {
String[] record = t.split(",");
Integer[] keyColumns = broadcastJob.value();
StringBuilder key = new StringBuilder();
for (int index = 0; index < keyColumns.length; index++) {
key.append(record[keyColumns[index]]);
}
key.append("|id=1");
Integer value = new Integer(record[4]);
return new Tuple2<String, Integer>(key.toString(),value);
}}).combineByKey(createCombiner, mergeValue, mergeCombiners).reduceByKey((v1,v2) -> v1+v2);
pairRDD.saveAsTextFile(output);
}
}
The program calculates the sum of values for each key.
As I understand it, the local combiner should run on each node and add up the values for identical keys, so that only a small amount of data is shuffled afterwards.
But the Spark UI shows a huge amount of shuffle read and shuffle write (almost 58GB).
Am I doing anything wrong?
How can I tell whether the local combiner is working?
Cluster Details :-
20 Nodes cluster
Each Node having 80GB HardDisk, 8GB RAM, 4 cores
Hadoop-2.7.2
Spark-2.0.2(prebuild-with-Hadoop-2.7.x distribution)
Input file details :-
input file is stored on hdfs
input file size : 400GB
number of records : 16,129,999,990
record columns : String(2 char),int,int,String(2 char),int,int,String(2 char),String(2 char),String(2 char)
Note :
Max Number of distinct keys is 1081600.
In spark logs I see the task running with localitylevel NODE_LOCAL.

Let's decompose this problem and see what we get. To simplify the computations, let's assume that:
Total number of records is 1.6e8
Number of unique keys is 1e6
Split size is 128MB (this seems to be consistent with the number of tasks in your UI).
With these values the data will be split into ~3200 partitions (3125 in your case). This gives you around 51200 records per split. Furthermore, if the distribution of the number of values per key is uniform, there should be ~160 records per key on average.
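The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same simplifying assumptions (400GB input, 128MB splits, 1.6e8 records, 1e6 keys); the variable names are illustrative and not part of the original job.

```python
# Back-of-the-envelope figures under the simplifying assumptions above.
input_size_mb = 400 * 1024      # 400GB input file, expressed in MB
split_size_mb = 128             # assumed split size
total_records = 1.6e8           # assumed total number of records
unique_keys = 1e6               # assumed number of unique keys

partitions = input_size_mb // split_size_mb
records_per_partition = total_records / partitions
records_per_key = total_records / unique_keys

print(partitions)             # 3200
print(records_per_partition)  # 50000.0
print(records_per_key)        # 160.0
```

With the 3125 partitions reported in the UI instead of 3200, the same division gives the 51200 records per split quoted above.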
If the data is randomly distributed (it is not sorted by key, for example), you can expect that the average number of records per key per partition will be close to one*. This is basically the worst-case scenario, where map-side combine doesn't reduce the amount of data at all.
Furthermore, you have to remember that the size of a flat file will typically be significantly lower than the size of the serialized objects.
With real-life data you can typically expect some type of order emerging from the data collection process, so things should be better than what we calculated above. But the bottom line is that if the data is not already grouped by partition, map-side combine may provide no improvement at all.
You could probably decrease the amount of shuffled data by using a somewhat larger split (256MB would give you a bit over 100K records per partition), but it comes at the price of longer GC pauses and possibly other GC issues.
* You can either simulate this by taking samples with replacement:
import pandas as pd
import numpy as np
(pd
.DataFrame({"x": np.random.choice(np.arange(3200), size=160, replace=True)})
.groupby("x")
.x.count()
.mean())
or just think about the problem of randomly assigning 160 balls to 3200 buckets.
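The balls-and-buckets view also has a simple closed-form approximation. As a rough sketch (plain Python, uniform random placement assumed, function name illustrative), the number of records per key per partition, counted only over the partitions where the key actually appears, can be approximated by dividing the records per key by the expected number of occupied partitions:

```python
def expected_records_per_occupied_partition(records_per_key, partitions):
    """Approximate records of one key per partition, over partitions that hold the key."""
    # Probability that a given partition receives none of the key's records.
    p_empty = (1 - 1 / partitions) ** records_per_key
    # Expected number of partitions that receive at least one record of the key.
    expected_occupied = partitions * (1 - p_empty)
    # Ratio of totals: an approximation of the per-partition average.
    return records_per_key / expected_occupied


ratio = expected_records_per_occupied_partition(160, 3200)
print(round(ratio, 3))
```

For 160 records and 3200 partitions the result is only slightly above one, which matches the claim that map-side combine has almost nothing to merge.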

Related

Redis Bulk Fetch of 5-10 MB From HMSET

Use Case: our data structure is like below:
tp1 "i1" : {object hash}, "i2" : {object hash}
tp2 "i3" : {object hash}, "i4" : {object hash}
tp1 and tp2 are HMSET keys; we refer to them as tp keys.
Each tp key can have 100-200 records in it. And each hash has a size of 1-1.5 KB.
Below is our implementation with spring data:
public Map<String, Map<String, T>> getAllMulti(List<String> keys) {
long start = System.currentTimeMillis();
log.info("Redis pipeline fetch started with keys size :{}", keys.size());
Map<String, Map<String, T>> responseMap = new HashMap<>();
if (CollectionUtils.isNotEmpty(keys)) {
List<Object> resultSet = redisTemplate.executePipelined((RedisCallback<T>) connection -> {
for (String key : keys) {
connection.hGetAll(key.getBytes());
}
return null;
});
responseMap = IntStream.range(0, keys.size())
.boxed()
.collect(Collectors.toMap(keys::get, i -> (Map<String, T>) resultSet.get(i)));
}
long timeTaken = System.currentTimeMillis() - start;
log.info("Time taken in redis pipeline fetch: {}", timeTaken);
return responseMap;
}
Objective: We want to load the hashes of around 500-600 tp keys, and we thought of using a Redis pipeline for this. But as we increase the number of tp keys, the response time increases significantly, and it is also inconsistent.
To improve the response time we tried compression/MessagePack, with no benefit.
We also tried partitioning our tp keys into multiple partitions and running the above implementation in parallel. We observed that a batch takes less time when the total number of tp keys is small; as the total number of tp keys grows, the time taken for a batch with the same number of keys increases.
Any help/lead will be appreciated. Thanks
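For reference, the partition-and-run-in-parallel variant described above can be sketched as follows. This is a minimal Python illustration, not the Spring Data code from the question: fetch_batch is a hypothetical stand-in for one pipelined fetch per batch, and the batch size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(keys, size):
    """Split keys into consecutive batches of at most `size` items."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def fetch_batch(batch):
    # Hypothetical stand-in for one pipelined round trip that fetches every hash in the batch.
    return {key: {} for key in batch}


def fetch_all(keys, batch_size=50, workers=4):
    """Fetch all batches in parallel and merge the partial results."""
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(fetch_batch, chunk(keys, batch_size)):
            result.update(partial)
    return result


tp_keys = ["tp%d" % i for i in range(120)]
print(len(fetch_all(tp_keys)))  # 120
```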

Iteratively running queries on Apache Spark

I've been trying to execute 10,000 queries over a relatively large dataset (11M). More specifically, I am trying to transform an RDD using filter with some predicate and then compute how many records pass that filter by applying the count action.
I am running Apache Spark on my local machine, which has 16GB of memory and an 8-core CPU. I have set --driver-memory to 10G in order to cache the RDD in memory.
However, because I have to repeat this operation 10,000 times, it takes unusually long to finish. I am attaching my code in the hope that it makes things clearer.
First, loading the queries and the DataFrame I am going to query against:
//load normalized dimensions
val df = spark.read.parquet("/normalized.parquet").cache()
//load query ranges
val rdd = spark.sparkContext.textFile("part-00000")
Parallelizing the execution of queries
Here my queries are collected in a list and executed in parallel using par. I then extract the parameters that each query needs to filter the Dataset. The isWithin function tests whether the Vector contained in my dataset is within the bounds given by the query.
After filtering the dataset, I execute count to get the number of records in the filtered dataset and then create a string reporting that number.
val results = queries.par.map(q => {
val volume = q(q.length-1)
val dimensions = q.slice(0, q.length-1)
val count = df.filter(row => {
val v = row.getAs[DenseVector]("scaledOpen")
isWithin(volume, v, dimensions)
}).count
q.mkString(",")+","+count
})
Now, I realize that this task is generally hard given the size of my dataset and the fact that I am running it on a single machine. I know this could be much faster on something running on top of Spark or by utilizing an index. However, I am wondering if there is a way to make it faster as it is.

Just because you parallelize access to a local collection doesn't mean that anything is executed in parallel. The number of jobs that can be executed concurrently is limited by the cluster resources, not by driver code.
At the same time, Spark is designed for high-latency batch jobs. If the number of jobs goes into the tens of thousands, you just cannot expect things to be fast.
One thing you can try is to push the filters down into a single job. Convert the DataFrame to an RDD:
import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
import org.apache.spark.rdd.RDD
val vectors: RDD[org.apache.spark.mllib.linalg.DenseVector] = df.rdd.map(
_.getAs[MLlibVector]("scaledOpen").toDense
)
map vectors to {0, 1} indicators:
import breeze.linalg.DenseVector
// It is not clear what is the type of queries
type Q = ???
val queries: Seq[Q] = ???
val inds: RDD[breeze.linalg.DenseVector[Long]] = vectors.map(v => {
// Create {0, 1} indicator vector
DenseVector(queries.map(q => {
// Define as before
val volume = ???
val dimensions = ???
// Output 0 or 1 for each q
if (isWithin(volume, v, dimensions)) 1L else 0L
}): _*)
})
aggregate partial results:
val counts: breeze.linalg.DenseVector[Long] = inds
.aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)
and prepare final output:
queries.zip(counts.toArray).map {
case (q, c) => s"""${q.mkString(",")},$c"""
}
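The idea behind the single job (one scan of the data, one 0/1 indicator per query, indicators summed into a count vector) can be illustrated without Spark. The sketch below is plain Python; is_within is a hypothetical stand-in for the question's isWithin, and the query shape (a lower and an upper bound per dimension) is an assumption made for the example.

```python
def is_within(point, lower, upper):
    """Hypothetical stand-in for isWithin: True if every coordinate lies within the bounds."""
    return all(lo <= x <= hi for x, lo, hi in zip(point, lower, upper))


def count_matches(points, queries):
    """Count matching points for every query in a single pass over the data."""
    counts = [0] * len(queries)
    for point in points:                      # one scan of the dataset
        for i, (lower, upper) in enumerate(queries):
            if is_within(point, lower, upper):
                counts[i] += 1                # add this point's 0/1 indicator
    return counts


points = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.1)]
queries = [((0.0, 0.0), (0.6, 0.6)), ((0.8, 0.0), (1.0, 1.0))]
print(count_matches(points, queries))  # [2, 1]
```

The total work is still proportional to the number of records times the number of queries, but the per-job scheduling overhead is paid once instead of 10,000 times.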

partitioning not working in hadoop

In my code I partition the data into three parts, but in the output I only get the output returned by the 0th partition, even though I set the number of reducers to 3.
My code:
public static class customPartitioner extends Partitioner<Text,Text>{
public int getPartition(Text key, Text value, int numReduceTasks){
String country = value.toString();
if(numReduceTasks==0)
return 0;
if(key.equals(new Text("key1")) && !value.equals(new Text("valuemy")))
return 1%numReduceTasks;
if(value.equals(new Text("valueother")) && key.equals(new Text("key1")) )
return 0;
else
return 2%numReduceTasks;
}
}
and I set the number of reducers with
job.setNumReduceTasks(3);
It gives me the output of only the 0th partition, i.e. return 0.

I was making a very silly mistake. The partitioning was working fine in my code, but I assumed that all the output was in the part-r-00000 file. I thought the output was split into several files only to reduce load and that each file showed the combined result. I was wrong: the different partitions have different outputs.
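To see which records end up in which output file, the partitioning logic can be mirrored outside Hadoop. The following Python sketch re-implements the conditions of getPartition from the question with plain strings; the sample records and the output_file helper are illustrative assumptions, and the file name pattern simply extends the part-r-00000 name mentioned above.

```python
def get_partition(key, value, num_reduce_tasks):
    """Python mirror of the getPartition logic in the question."""
    if num_reduce_tasks == 0:
        return 0
    if key == "key1" and value != "valuemy":
        return 1 % num_reduce_tasks
    if value == "valueother" and key == "key1":
        return 0
    return 2 % num_reduce_tasks


def output_file(partition):
    # Each reducer writes its own file; partition N is assumed to map to part-r-0000N.
    return "part-r-%05d" % partition


records = [("key1", "valuemy"), ("key1", "a"), ("key2", "b")]
files = {}
for key, value in records:
    files.setdefault(output_file(get_partition(key, value, 3)), []).append((key, value))
print(sorted(files))
```

One thing this mirror makes visible: with a non-zero number of reducers, the branch that returns 0 is never reached, because a record with key "key1" and value "valueother" already satisfies the first condition.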

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads in a text file every day from my HDFS and extracts unique keys from each line in the text file. There are roughly 50000 keys in each text file. The same data is then filtered by the extracted key and saved to HDFS.
I want to create a directory in my HDFS with the structure hdfs://.../date/key that contains the filtered data. The problem is that writing to HDFS takes a very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
val filteredData = cleanedData.filter(line => line.contains(key))
filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)

import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner

// Writes every (key, line) pair of one partition to a per-key file.
def writeLines(iterator: Iterator[(String, String)]): Unit = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  val fs = FileSystem.get(new Configuration())
  try {
    while (iterator.hasNext) {
      val (key, line) = iterator.next()
      // Open one writer per key, the first time the key is seen.
      val writer = writers.getOrElseUpdate(key, {
        val path = args(1) + key // args(1): output base path, assumed to be in scope
        new BufferedWriter(new OutputStreamWriter(fs.create(new Path(path))))
      })
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}

class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

val inputData = sc.textFile() // input path elided in the original
val keyValue = inputData.map(line => (key, line)) // key: extracted from the line (extraction elided)
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)

I think the approach should be similar to "Write to multiple outputs by key Spark - one Spark job". The number of partitions has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with your customized version so that it saves to different directories.
Regarding scalability, it is not an issue with Spark but with HDFS. No matter how you implement it, as long as the requirements do not change, it is unavoidable. But I think HDFS is probably OK with 50,000 file handles.

You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories but that is really causing the bottlenecks here. Is there perhaps another way to architect this that simply lets you read (key,value) results?
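The difference between filtering once per key and grouping once can be shown on a small scale. The sketch below is plain Python on the local filesystem, not Spark or HDFS; extract_key is a hypothetical key extraction (first comma-separated field), and the date/key layout is reduced to one directory per key under a temporary base directory.

```python
import os
import tempfile
from collections import defaultdict


def extract_key(line):
    # Hypothetical key extraction: the first comma-separated field.
    return line.split(",")[0]


def write_grouped(lines, base_dir):
    """Group lines by key in one pass, then write one file per key directory."""
    grouped = defaultdict(list)
    for line in lines:                       # single scan instead of one filter per key
        grouped[extract_key(line)].append(line)
    for key, key_lines in grouped.items():
        key_dir = os.path.join(base_dir, key)
        os.makedirs(key_dir, exist_ok=True)
        with open(os.path.join(key_dir, "part-00000"), "w") as out:
            out.write("\n".join(key_lines) + "\n")
    return sorted(grouped)


base = tempfile.mkdtemp()
print(write_grouped(["a,1", "b,2", "a,3"], base))  # ['a', 'b']
```

The data is read once regardless of the number of keys, whereas the filter-per-key loop reads it once for every key.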

algorithm - equally fill different size containers based on two criterias

I am trying to wrap my head around an algorithm. I've never coded an algorithm before and am not sure how to go about this issue. Here is the gist of it:
I can have n containers (hosts). Each container has two numbers that are important to me: the amount of memory (x) and the number of logical processors (y). Each container can have different values.
Each virtual machine has an amount of memory (x) and a number of logical processors (y). I am trying to create an algorithm that will balance the load of memory (x) and logical processors (y) across all hosts in the cluster equally. It will not be truly equal among all hosts, but all hosts will be within +/- 10% of each other.
How would I go about this problem, mathematically I suppose?

If I understood your problem correctly, you want to minimize the relative load of the hosts, so that each one has a load that deviates no more than 10% from the others. So we want to optimize the "relative load" between hosts by finding a minimum value.
To do so, you could use some sort of Combinatorial Optimization to reach an acceptable or optimal solution. A classic metaheuristic like Simulated Annealing or Tabu Search would do the job.
Example generic steps for your problem:
1. Define an initial state by randomly assigning each VM to a host.
2. Find new states by iteratively swapping VMs between hosts until:
   - some acceptable solution is found, or
   - the number of iterations is exhausted, or
   - some other condition is met (like simulated annealing's "temperature").
3. Develop a compare function to decide when to switch states (solutions) in each iteration.
In your case, you should measure the relative load between all hosts and only swap states when the relative load of the new state is lower than that of the current state.
This of course assumes that you will run this algorithm on some form of logical representation and not on the actual VMs. Once you have found a solution by simulating your real conditions, you would apply it physically to your VM/host configuration.
Hope this helps!
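The generic steps above can be sketched in a few lines of Python. This is a simplified hill-climbing variant under stated assumptions, not a full Simulated Annealing or Tabu Search: a neighbour state moves one VM to another host rather than swapping two, host capacity limits are not enforced, and the "relative load" measure (the largest gap between host load fractions, over memory and CPU) is one possible choice among many. All names and sample figures are illustrative.

```python
import random


def spread(assignment, vms, hosts):
    """Largest gap between host load fractions, taking the worse of memory and CPU."""
    worst = 0.0
    for dim in (0, 1):  # 0 = memory, 1 = logical processors
        used = [0.0] * len(hosts)
        for vm, host in enumerate(assignment):
            used[host] += vms[vm][dim]
        fractions = [used[h] / hosts[h][dim] for h in range(len(hosts))]
        worst = max(worst, max(fractions) - min(fractions))
    return worst


def balance(vms, hosts, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Initial state: every VM on a randomly chosen host.
    current = [rng.randrange(len(hosts)) for _ in vms]
    best = spread(current, vms, hosts)
    for _ in range(iterations):
        candidate = list(current)
        # Neighbour state: move one randomly chosen VM to a random host.
        candidate[rng.randrange(len(vms))] = rng.randrange(len(hosts))
        score = spread(candidate, vms, hosts)
        if score < best:  # keep the new state only if it is better balanced
            current, best = candidate, score
    return current, best


# (memory, logical processors) per VM and per host; sample figures only.
vms = [(512, 1), (1024, 2), (1536, 5), (1024, 8), (1024, 1), (2048, 1), (2048, 2)]
hosts = [(4096, 4), (8192, 8), (3072, 8), (3072, 8)]
assignment, score = balance(vms, hosts)
print(assignment, round(score, 3))
```

A simulated-annealing version would additionally accept some worse states with a probability that shrinks as the "temperature" drops, which helps to escape local minima.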

You've probably moved on by now, but if you ever come back to this issue, this answer may be useful. If any part is confusing, let me know and I'll try to clarify.
Your problem is a case of 2D variable size bin packing without rotation. Your dimensions are Memory and CPU, rather than length and width (hence the lack of rotation).
I would use a simple offline packing algorithm. (offline means that your VMs and hosts are all known beforehand)
The simple packing I use is:
1. Sort your unassigned VMs by memory required.
2. Sort your set of Hosts by available memory.
3. Find the Host with the most available memory that the VM will fit on and assign it to that Host (be sure to check CPU capacity, too; the Host with the most available RAM may not have enough CPU resources).
4. Remove the VM from the list.
5. Reduce the Host's available memory and CPU capacity.
6. If you still have VMs, go to 2.
Here is how I defined VMs and Hosts:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
[DebuggerDisplay("{Name}: {MemoryUsage} | {ProcessorUsage}")]
class VirtualMachine
{
public int MemoryUsage;
public string Name;
public int ProcessorUsage;
public VirtualMachine(string name, int memoryUsage, int processorUsage)
{
MemoryUsage = memoryUsage;
ProcessorUsage = processorUsage;
Name = name;
}
}
[DebuggerDisplay("{Name}: {Memory} | {Processor}")]
class Host
{
public readonly string Name;
public int Memory;
public Host Parent;
public int Processor;
public Host(string name, int memory, int processor, Host parent = null)
{
Name = name;
Memory = memory;
Processor = processor;
Parent = parent;
}
public bool Fits(VirtualMachine vm) { return vm.MemoryUsage <= Memory && vm.ProcessorUsage <= Processor; }
public Host Assign(VirtualMachine vm) { return new Host(Name + "_", Memory - vm.MemoryUsage, Processor - vm.ProcessorUsage, this); }
}
The Host Fits and Assign methods are important for checking if a VM can fit, and reducing the Host available memory/CPU. I create a "Host-Prime" to represent the host with reduced resources, removing the original Host and inserting Host-Prime into the Host list.
Here is the bin pack solving algorithm. If you are running against a large data set, there should be plenty of opportunities for speeding up execution, but this is good enough for small data sets.
class Allocator
{
readonly List<Host> Bins;
readonly List<VirtualMachine> Items;
public Allocator(List<Host> bins, List<VirtualMachine> items)
{
Bins = bins;
Items = items;
}
public Dictionary<Host, List<VirtualMachine>> Solve()
{
var bins = new HashSet<Host>(Bins);
var items = Items.OrderByDescending(item => item.MemoryUsage).ToList();
var result = new Dictionary<Host, List<VirtualMachine>>();
while (items.Count > 0)
{
var item = items.First();
items.RemoveAt(0);
var suitableBin = bins.OrderByDescending(b => b.Memory).FirstOrDefault(b => b.Fits(item));
if (suitableBin == null)
return null;
bins.Remove(suitableBin);
var remainder = suitableBin.Assign(item);
bins.Add(remainder);
var rootBin = suitableBin;
while (rootBin.Parent != null)
rootBin = rootBin.Parent;
if (!result.ContainsKey(rootBin))
result[rootBin] = new List<VirtualMachine>();
result[rootBin].Add(item);
}
return result;
}
}
So you have a packing algorithm now, but you still don't have a load balancing solution. Since this algorithm will pack the VMs onto hosts without concern of balancing the memory usage, we need another level of solving. To achieve some rough memory balance, I take a brute force approach. Reduce the initial memory on each Host to represent a target usage goal. Then solve to see if your VMs fit into the reduced memory available. If no solution is found, relax the memory constraint. Repeat this until a solution is found, or none is possible (using the given algorithm). This should give a rough approximation of the optimal memory load.
class Program
{
static void Main(string[] args)
{
//available hosts, probably loaded from a file or database
var hosts = new List<Host> {new Host("A", 4096, 4), new Host("B", 8192, 8), new Host("C", 3072, 8), new Host("D", 3072, 8)};
var hostLookup = hosts.ToDictionary(h => h.Name);
//VMs required to run, probably loaded from a file or database
var vms = new List<VirtualMachine>
{
new VirtualMachine("1", 512, 1),
new VirtualMachine("2", 1024, 2),
new VirtualMachine("3", 1536, 5),
new VirtualMachine("4", 1024, 8),
new VirtualMachine("5", 1024, 1),
new VirtualMachine("6", 2048, 1),
new VirtualMachine("7", 2048, 2)
};
var solution = FindMinumumApproximateSolution(hosts, vms);
if (solution == null)
Console.WriteLine("No solution found.");
else
foreach (var hostAssigment in solution)
{
var host = hostLookup[hostAssigment.Key.Name];
var vmsOnHost = hostAssigment.Value;
var xUsage = vmsOnHost.Sum(itm => itm.MemoryUsage);
var yUsage = vmsOnHost.Sum(itm => itm.ProcessorUsage);
var pctUsage = (xUsage / (double)host.Memory);
Console.WriteLine("{0} used {1} of {2} MB {5:P2} | {3} of {4} CPU", host.Name, xUsage, host.Memory, yUsage, host.Processor, pctUsage);
Console.WriteLine("\t VMs: " + String.Join(" ", vmsOnHost.Select(vm => vm.Name)));
}
Console.ReadKey();
}
static Dictionary<Host, List<VirtualMachine>> FindMinumumApproximateSolution(List<Host> hosts, List<VirtualMachine> vms)
{
for (var targetLoad = 0; targetLoad <= 100; targetLoad += 1)
{
var solution = GetTargetLoadSolution(hosts, vms, targetLoad / 100.0);
if (solution == null)
continue;
return solution;
}
return null;
}
static Dictionary<Host, List<VirtualMachine>> GetTargetLoadSolution(List<Host> hosts, List<VirtualMachine> vms, double targetMemoryLoad)
{
//create an alternate host list that reduces memory availability to the desired target
var hostsAtTargetLoad = hosts.Select(h => new Host(h.Name, (int) (h.Memory * targetMemoryLoad), h.Processor)).ToList();
var allocator = new Allocator(hostsAtTargetLoad, vms);
return allocator.Solve();
}
}
