Performance tuning Spark's Word2Vec

I want to improve the performance of Spark's Word2Vec model on an EMR cluster. I have around 54 GB of cleaned patent text data, and I want to train Spark's Word2Vec on it. Training runs, but I think the performance can be improved. Can someone give me advice on how to do this?
Preprocessing steps taken:
Strip special characters from the texts and collapse unnecessary whitespace
Tokenize words
Remove stopwords from tokens
Lemmatize words
Remove words that occur too frequently (in more than 30% of the documents)
Sample of the cleaned data
+----------------------------------------------------------------------------------------------------+
|[water, cooling, cooled, type, pre, burning, present, invention, provides, kind, water, cooling, ...|
|[new, energetic, liquid, invention, discloses, kind, new, energetic, liquid, made, head, outlet, ...|
|[pre, assembly, pre, disclosed, pre, cylindrical, body, member, extending, axially, opposite, pre...|
|[part, feed, ozone, feed, form, difference, ozone, concentration, space, wise, time, wise, premix...|
|[homogeneous, charge, thereof, invention, discloses, homogeneous, type, thereof, cover, arranged,...|
|[gasoline, pre, plug, pre, communicating, plug, associated, pre, respectively, gasoline, injected...|
|[pre, pre, homogeneous, charge, hcci, mode, providing, pre, fluidly, creating, radical, pre, achi...|
|[pre, 105, 351, another, aspect, pre, equal, greater, main, 107, 355, ieast, prior, main, aspect,...|
|[energy, apparatus, energy, apparatus, presented, herein, energy, conversion, module, containing,...|
|[diesel, invention, provides, inlet, processing, diesel, diesel, inlet, treatment, diesel, charac...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
EMR hardware settings:
Master: m5.2xlarge
8 vCore, 32 GiB memory, EBS only storage
EBS Storage: 128 GiB
Core (10x): m5.4xlarge
16 vCore, 64 GiB memory, EBS only storage
EBS Storage: 256 GiB
spark-submit settings:
spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.executor.cores=4" --conf "spark.executor.memory=12g" --conf "spark.driver.memory=12g" --conf "spark.driver.maxResultSize=12g" --conf "spark.dynamicAllocation.enabled=false" run_program.py
Word2Vec settings (if not mentioned, I use default):
vectorSize=200
minCount=5
numIterations=15
numPartitions=120
Some more notes:
The cluster utilizes approximately 70% CPU during estimation
Total RAM usage is about 50-60% during estimation
Should I increase numPartitions to reach approximately 100% CPU utilization? How much (or to what extent) will that reduce the accuracy of the model? How should I set numIterations? What would be sufficient in this case?
Can anybody help me out?
Thanks in advance!
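A rough back-of-the-envelope check of the settings above, written as a small Python sketch. It assumes that Spark's Word2Vec runs one training task per partition (my understanding of the MLlib implementation, not something stated in the question), so the number of busy task slots during fitting is capped by numPartitions. The variable names are illustrative only.

```python
# Figures taken from the hardware, spark-submit and Word2Vec settings above.
core_nodes = 10
vcores_per_node = 16
executor_instances = 40
executor_cores = 4
num_partitions = 120  # Word2Vec numPartitions

cluster_vcores = core_nodes * vcores_per_node      # vCores on the core nodes
task_slots = executor_instances * executor_cores   # concurrent Spark tasks

# Assumption: one training task per Word2Vec partition, so at most
# num_partitions slots can be busy at the same time during fitting.
busy_slots = min(num_partitions, task_slots)
max_slot_utilization = busy_slots / task_slots

print(f"cluster vCores: {cluster_vcores}, task slots: {task_slots}")
print(f"slots busy at most: {busy_slots}")
print(f"upper bound on slot utilization: {max_slot_utilization:.0%}")
```

Under that assumption, 120 partitions can occupy at most 120 of the 160 slots, which is in the same range as the observed CPU utilization of roughly 70%. This says nothing about how a larger numPartitions affects model accuracy.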

Related

How to match CatBoost GPU metrics to CPU metrics?

I have been using CatBoost on CPU and got good results, but wanted to speed it up by using GPU. However, all metrics from GPU are worse than those from CPU. I did search around and found a suggestion that one could try to increase border_count to 255. I tried that but it did not help.
Some have claimed that GPU output would yield variations. While it might be true, I have found these variations are always worse than the CPU metrics.
Any pointers on what I can try? It is a binary classification project. It has 1.2M rows for training and 120K rows for validation. It has 218 features and 42 of them are categorical. The rest are floats or integers.

How to get detailed memory breakdown in the TensorFlow profiler?

I'm using the new TensorFlow profiler to profile memory usage in my neural net, which I'm running on a Titan X GPU with 12GB RAM. Here's some example output when I profile my main training loop:
==================Model Analysis Report======================
node name | requested bytes | ...
Conv2DBackpropInput 10227.69MB (100.00%, 35.34%), ...
Conv2D 9679.95MB (64.66%, 33.45%), ...
Conv2DBackpropFilter 8073.89MB (31.21%, 27.90%), ...
Obviously this adds up to more than 12GB, so some of these matrices must be in main memory while others are on the GPU. I'd love to see a detailed breakdown of what variables are where at a given step. Is it possible to get more detailed information on where various parameters are stored (main or GPU memory), either with the profiler or otherwise?
"Requested bytes" shows a sum over all memory allocations, but that memory can be allocated and de-allocated. So just because "requested bytes" exceeds GPU RAM doesn't necessarily mean that memory is being transferred to CPU.
In particular, for a feedforward neural network, TF will normally keep around the forward activations, to make backprop efficient, but doesn't need to keep the intermediate backprop activations, i.e. dL/dh at each layer, so it can just throw away these intermediates after it's done with these. So I think in this case what you care about is the memory used by Conv2D, which is less than 12 GB.
You can also use the timeline to verify that total memory usage never exceeds 12 GB.
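To illustrate the difference between a sum over all allocations and the peak of live memory, here is a minimal Python sketch. The allocation trace and tensor names are hypothetical and do not come from the profiler output above; the point is only that memory which is freed and reused counts several times in the requested total.

```python
# Hypothetical allocation trace: (operation, tensor name, size in MB).
events = [
    ("alloc", "act1", 3500),   # forward activation, kept for backprop
    ("alloc", "act2", 3500),   # forward activation, kept for backprop
    ("alloc", "grad2", 3500),  # intermediate gradient, short-lived
    ("free", "grad2", 3500),
    ("free", "act2", 3500),
    ("alloc", "grad1", 3500),  # reuses memory released above
    ("free", "grad1", 3500),
    ("free", "act1", 3500),
]

requested_mb = 0  # sum over every allocation ever made
live_mb = 0       # memory currently held
peak_mb = 0       # highest value live_mb ever reached

for op, _name, size in events:
    if op == "alloc":
        requested_mb += size
        live_mb += size
        peak_mb = max(peak_mb, live_mb)
    else:
        live_mb -= size

print(f"requested: {requested_mb} MB, peak live: {peak_mb} MB")
```

In this made-up trace the requested total is 14000 MB while the peak is only 10500 MB, so the total can exceed a 12 GB card without anything spilling to host memory.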

EBS baseline performance too high?

I am trying to benchmark an RDS instance (postgres) on AWS.
I created the instance with a 30 GB "general purpose" SSD volume ("gp2"). According to the AWS docs, this should provide a baseline performance of 100 IOPS:
Between a minimum of 100 IOPS (at 33.33 GiB and below) and a maximum
of 10,000 IOPS (at 3,334 GiB and above), baseline performance scales
linearly at 3 IOPS per GiB of volume size.
but in addition to that, there is burst performance:
When using General Purpose (SSD) storage, your DB instance receives an
initial I/O credit balance of 5.4 million I/O credits, which is enough
to sustain a burst performance of 3,000 IOPS for 30 minutes.
As I'm interested in sustained database performance (= the baseline case), I have to get rid of all I/O credits before starting my tests. I did this by running pgbench.
In the following screenshot, you can see that I start pgbench at 11:00, and around 3 hours later the burst balance is finally used up, and write IOPS drops off:
So far, so good. The timing makes sense -- 3 * 60 * 60 * 600 = 6.48 million (I/O credits are also refilled during the burst).
What I don't understand: why doesn't IOPS drop down to the baseline rate (100), but stay at 380 instead? Is the documented formula for baseline performance not valid any more?
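The arithmetic above can be written out as a short Python sketch. The baseline formula follows the quoted documentation. The drain model (credits consumed at the burst rate while being refilled at the baseline rate) and the 600 IOPS figure are assumptions taken from the calculation in this question, and the function names are illustrative only.

```python
def gp2_baseline_iops(size_gib: float) -> float:
    """Baseline per the quoted docs: 3 IOPS per GiB, clamped to [100, 10000]."""
    return max(100.0, min(10_000.0, 3.0 * size_gib))


def burst_seconds(credits: float, burst_iops: float, baseline_iops: float) -> float:
    """Seconds until the credit balance is empty.

    Assumes credits are spent at burst_iops and refilled at baseline_iops.
    """
    return credits / (burst_iops - baseline_iops)


baseline = gp2_baseline_iops(30)   # 30 GB volume from the question
initial_credits = 5.4e6            # initial balance per the quoted docs

# Assumption: about 600 IOPS sustained while bursting, as in the sum above.
hours = burst_seconds(initial_credits, 600, baseline) / 3600

print(f"baseline: {baseline:.0f} IOPS")
print(f"credits last about {hours:.1f} hours at 600 IOPS")
```

With these inputs the documented baseline is 100 IOPS and the credits last 3 hours, which matches the observed timing but not the 380 IOPS seen afterwards.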
UPDATE: I've shut down this test instance now, but here are the details:
Sorry for the delay in my response.
Why the extra performance?
With the db.m3.xlarge (which falls under Standard - Previous Edition header) - you have an extra 500 Mbps of additional, dedicated capacity for Amazon Elastic Block Store. This is per the chart and details at this link.
In the first section of Amazon EBS Performance Tips, it says to use EBS optimized instances for increased performance. So, I'd say this was the main reason you were getting the extra IOPS over the 100, after you exhausted your burst credits.
Cost Considerations:
According to the end of that paragraph, with your M3 you will incur extra cost for the extra performance. However, if you were to select the M4, the extra performance incurs no extra cost.
So in a cost analysis of sustained database performance, I would compare the base price of the M4 against the base price of the M3 plus the extra performance cost the M3 incurs.
Good luck.

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance. In my experiment, I measured the performance with increasing matrix dimension N, from 250 up to 10000.
In my algorithm I have applied caching (with tuned blocking factor), and data are always accessed with unit stride during computation, so cache performance is optimal; TLB and paging problem are eliminated;
I have 8GB available RAM, and the maximum memory footprint during experiment is under 800MB, so no swapping comes across;
During experiment, no resource demanding process like web browser is running at the same time. Only some really cheap background process is running to record CPU frequency as well as CPU temperature data every 2s.
I would expect the performance (in GFLOPs) to stay at around 10.5 for whatever N I am testing. But a significant performance drop is observed in the middle of the experiment, as shown in the first figure.
CPU frequency and CPU temperature are seen in the 2nd and 3rd figure. The experiment finishes in 400s. Temperature was at 51 degree when experiment started, and quickly rose up to 72 degree when CPU got busy. After that it grew slowly to the highest at 78 degree. CPU frequency is basically stable, and it did not drop when temperature got high.
So, my question is:
Since CPU frequency did not drop, why does performance suffer?
How exactly does temperature affect CPU performance? Does the increase from 72 degrees to 78 degrees really make things worse?
CPU info
System: Ubuntu 14.04 LTS
Laptop model: Lenovo-YOGA-3-Pro-1370
Processor: Intel Core M-5Y71 CPU @ 1.20 GHz * 2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0,1
Off-line CPU(s) list: 2,3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Stepping: 4
CPU MHz: 1474.484
BogoMIPS: 2799.91
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0,1
CPU 0, 1
driver: intel_pstate
CPUs which run at the same hardware frequency: 0, 1
CPUs which need to have their frequency coordinated by software: 0, 1
maximum transition latency: 0.97 ms.
hardware limits: 500 MHz - 2.90 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 500 MHz and 2.90 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.40 GHz.
boost state support:
Supported: yes
Active: yes
update 1 (control experiment)
In my original experiment, the CPU is kept busy working from N = 250 to N = 10000. Many people (primarily those who saw this post before re-editing) suspected that overheating of the CPU is the major reason for the performance hit. Then I went back and installed the lm-sensors Linux package to track such information, and indeed, CPU temperature rose.
But to complete the picture, I did another control experiment. This time, I give the CPU a cooling time between each N. This is achieved by asking the program to pause for a number of seconds at the start of each iteration of the loop through N.
for N between 250 and 2500, the cooling time is 5s;
for N between 2750 and 5000, the cooling time is 20s;
for N between 5250 and 7500, the cooling time is 40s;
finally for N between 7750 and 10000, the cooling time is 60s.
Note that the cooling time is much larger than the time spent for computation. For N = 10000, only 30s are needed for Cholesky factorization at peak performance, but I ask for a 60s cooling time.
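The cooling schedule and the "30s at peak" estimate can be sketched in Python as follows. The step of 250 between values of N is inferred from the range boundaries above, and the n^3 / 3 flop count for Cholesky factorization is the standard textbook figure rather than something measured here. Function names are illustrative.

```python
def cooling_seconds(n: int) -> int:
    """Pause before factorizing an n x n matrix, per the schedule above."""
    if n <= 2500:
        return 5
    if n <= 5000:
        return 20
    if n <= 7500:
        return 40
    return 60


def cholesky_seconds(n: int, gflops: float = 10.5) -> float:
    """Expected run time at peak, assuming roughly n^3 / 3 flops."""
    return (n ** 3 / 3) / (gflops * 1e9)


# Assumption: N advances in steps of 250 from 250 to 10000.
for n in range(250, 10001, 250):
    pause = cooling_seconds(n)
    work = cholesky_seconds(n)
    print(f"N={n:5d}  cooling={pause:2d}s  compute at peak={work:6.2f}s")
```

For N = 10000 this gives a little under 32 s of computation at 10.5 GFLOPs, consistent with the roughly 30 s quoted above and well below the 60 s pause.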
This is certainly a very uninteresting setting in high performance computing: we want our machine to work all the time at peak performance, until a very large task is completed. So this kind of halt makes no sense. But it helps to better know the effect of temperature on performance.
This time, we see that peak performance is achieved for all N, just as theory supports! The periodic feature of CPU frequency and temperature is the result of cooling and boost. Temperature still has an increasing trend, simply because as N increases, the work load is getting bigger. This also justifies more cooling time for a sufficient cooling down, as I have done.
The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying. Basically it says that computer will get tired in HPC, so we can't get expected performance gain. Then what is the point of developing HPC algorithm?
OK, here are the new set of plots:
I don't know why I could not upload the 6th figure. SO simply does not allow me to submit the edit when adding the 6th figure. So I am sorry I can't attach the figure for CPU frequency.
update 2 (how I measure CPU frequency and temperature)
Thanks to Zboson for adding the x86 tag. The following bash commands are what I used for measurement:
while true
do
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> cpu0_freq.txt ## parameter "freq0"
cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq >> cpu1_freq.txt ## parameter "freq1"
sensors | grep "Core 0" >> cpu0_temp.txt ## parameter "temp0"
sensors | grep "Core 1" >> cpu1_temp.txt ## parameter "temp1"
sleep 2
done
Since I did not pin the computation to 1 core, the operating system will alternately use two different cores. It makes more sense to take
freq[i] <- max (freq0[i], freq1[i])
temp[i] <- max (temp0[i], temp1[i])
as the overall measurement.
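A minimal Python sketch of that element-wise maximum is shown below. It assumes the two frequency logs contain one integer per line (scaling_cur_freq reports kHz) and that both were sampled in lockstep, as the loop above does. The sample values are made up for illustration; the temperature logs would first need the number extracted from each sensors line.

```python
def parse_samples(text: str) -> list[int]:
    """Turn the contents of a log such as cpu0_freq.txt into integers."""
    return [int(line) for line in text.splitlines() if line.strip()]


def overall(series0: list[int], series1: list[int]) -> list[int]:
    """Element-wise maximum of two series sampled at the same instants."""
    return [max(a, b) for a, b in zip(series0, series1)]


# Hypothetical log contents in kHz, standing in for the real files.
cpu0_log = "1400000\n2900000\n500000\n"
cpu1_log = "2600000\n1200000\n500000\n"

freq = overall(parse_samples(cpu0_log), parse_samples(cpu1_log))
print(freq)
```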
TL;DR: Your conclusion is correct. Your CPU's sustained performance is nowhere near its peak. This is normal: the peak perf is only available as a short term "bonus" for bursty interactive workloads, above its rated sustained performance, given the light-weight heat-sink, fans, and power-delivery.
You can develop / test on this machine, but benchmarking will be hard. You'll want to run on a cluster, server, or desktop, or at least a gaming / workstation laptop.
From the CPU info you posted, you have a dual-core-with-hyperthreading Intel Core M with a rated sustainable frequency of 1.20 GHz, Broadwell generation. Its max turbo is 2.9GHz, and its TDP-up sustainable frequency is 1.4GHz (at 6W).
For short bursts, it can run much faster and make much more heat than it requires its cooling system to handle. This is what Intel's "turbo" feature is all about. It lets low-power ultraportable laptops like yours have snappy UI performance in stuff like web browsers, because the CPU load from interactive use is almost always bursty.
Desktop/server CPUs (Xeon and i5/i7, but not i3) do still have turbo, but the sustained frequency is much closer to the max turbo. For example, a Haswell i7-4790k has a sustained "rated" frequency of 4.0GHz. At that frequency and below, it won't use (and convert to heat) more than its rated TDP of 88W. Thus, it needs a cooling system that can handle 88W. When power/current/temperature allow, it can clock up to 4.4GHz and use more than 88W of power. (The sliding window for calculating the power history to keep the sustained power within 88W is sometimes configurable in the BIOS, e.g. 20sec or 5sec. Depending on what code is running, 4.4GHz might not increase the electrical current demand to anywhere near peak, e.g. code with lots of branch mispredicts that's still limited by CPU frequency, but that doesn't come anywhere near saturating the 256b AVX FP units like Prime95 would.)
Your laptop's max turbo is a factor of 2.4x higher than rated frequency. That high-end Haswell desktop CPU can only upclock by 1.1x. The max sustained frequency is already pretty close to the max peak limits, because it's rated to need a good cooling system that can keep up with that kind of heat production. And a solid power supply that can supply that much current.
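The two headroom ratios are simple divisions of the frequencies quoted above. As a quick Python check (variable names are illustrative):

```python
# Frequencies in GHz, as quoted in this answer.
core_m_rated, core_m_turbo = 1.2, 2.9   # Core M-5Y71
i7_rated, i7_turbo = 4.0, 4.4           # Haswell i7-4790k

core_m_headroom = core_m_turbo / core_m_rated
i7_headroom = i7_turbo / i7_rated

print(f"Core M turbo headroom: {core_m_headroom:.1f}x")  # 2.4x
print(f"i7 turbo headroom:     {i7_headroom:.1f}x")      # 1.1x
```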
The purpose of Core M is to have a CPU that can limit itself to ultra low power levels (rated TDP of 4.5 W at 1.2GHz, 6W at 1.4GHz). So the laptop manufacturer can safely design a cooling and power delivery system that's small and light, and only handles that much power. The "Scenario Design Power" is only 3.5W, and that's supposed to represent the thermal requirements for real-world code, not max-power stuff like Prime95.
Even a "normal" ULV laptop CPU is rated for 15W sustained, and high power gaming/workstation laptop CPUs at 45W. And of course laptop vendors put those CPUs into machines with beefier heat-sinks and fans. See a table on wikipedia, and compare desktop / server CPUs (also on the same page).
The achievement of peak performance seems to rule out all effects
other than temperature. But this is really annoying. Basically it says
that computer will get tired in HPC, so we can't get expected
performance gain. Then what is the point of developing HPC algorithm?
The point is to run them on hardware that's not so badly thermally limited! An ultra-low-power CPU like a Core M makes a decent dev platform, but not a good HPC compute platform.
Even a laptop with an xxxxM CPU, rather than a xxxxU CPU, will do ok. (e.g. a "gaming" or "workstation" laptop that's designed to run CPU-intensive stuff for sustained periods). Or in Skylake-family, "xxxxH" or "HK" are the 45W mobile CPUs, at least quad-core.
Further reading:
Modern Microprocessors: A 90-Minute Guide! - general background, including the "power wall" that Pentium 4 ran into.
Power Delivery in a Modern Processor (https://www.realworldtech.com/power-delivery/) - really deep technical dive into CPU / motherboard design and the challenges of delivering stable low-voltage to very bursty demands, and reacting quickly to the CPU requesting more / less voltage as it changes frequency.

Understanding Negative Virtual Memory Pressure

I was re-reading Poul-Henning Kamp's paper entitled, "You're Doing It Wrong" and one of the diagrams confused me.
The x-axis of Figure 1 is labeled as "VM pressure in megabytes". The author clarifies the x-axis as being "measured in the amount of address space not resident in primary memory, because the kernel paged it out to secondary storage".
I can understand zero MB of VM pressure (all of the address space is resident in primary memory).
I can understand a positive VM pressure but I'm having a tough time picturing what negative 8 megabytes of VM pressure looks like (see the left of the x-axis of Figure 1). Putting negative 8 in the author's description leaves me with, "- 8 MB of address space not resident in primary memory". That doesn't make sense to me.
If I just conclude that the author accidentally negated positive numbers, the chart makes more sense but I'm not ready to conclude that the author has made the mistake. It's more likely that I have. But then as the pressure decreases, the runtime increases? That sounds counterintuitive.
I'm also not sure why there is a drastic change to the curves around -8 MB of VM memory pressure.
Thanks in advance!
Read "measured in the difference between amount of address space resident in primary memory and total required amount".
The word "not" somehow represents that minus sign.
