Hadoop 3.2.1 Erasure Coding ISA-L question

I am testing Hadoop 3 erasure coding.
For the test, I uploaded 100 GB to Hadoop 3.2.1 (5 datanodes), with the results shown below:
3x replication: 150 minutes
Erasure coding (RS-3-2-1024k): 250 minutes
To speed up EC by applying ISA-L, I set it up and ran the test again, but the speed came out the same. The native library check shows:
zlib: true /lib64/libz.so.1
zstd : false
snappy: false
lz4: true revision:10301
bzip2: false
openssl: false build does not support openssl.
ISA-L: true /lib64/libisal.so.2
(1) It is an old machine, so I wonder whether the CPU does not support it.
Where can I check the list of CPUs that support ISA-L?
CPU: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
(2) Please advise if there are any additional steps needed to apply ISA-L.

With respect to your question:
(1) It is an old machine, so I wonder if the CPU does not support it.
Where can I check the list of CPUs that support ISA-L?
I went through
Code Sample: Intel® ISA-L Erasure Code and Recovery
Intel(R) Intelligent Storage Acceleration Library
and haven't found a hint regarding specific CPUs.
Optimizing Storage Solutions Using the Intel® Intelligent Storage Acceleration Library
mentions only:
Depending on the platform capability, Intel ISA-L can run on various Intel® processor families.
Improvements are obtained by speeding up the computations through the use of the following instruction sets:
Intel® AES-NI – Intel® Advanced Encryption Standard - New Instruction
Intel® SSE – Intel® Streaming SIMD Extensions
Intel® AVX – Intel® Advanced Vector Extensions
Intel® AVX2 - Intel® Advanced Vector Extensions 2
Intel® ISA-L also includes unit tests, performance tests and samples written in C which can be used as usage examples.
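A hedged, practical addition (not from the Intel articles above): on Linux you can check which of those instruction sets the CPU actually exposes via /proc/cpuinfo. On an E5-2609 v2 (Ivy Bridge) you should see sse4_2 and avx but not avx2, so ISA-L can use its SSE/AVX code paths but not the faster AVX2 ones.
# List the SIMD-related flags this CPU exposes (run on the datanode hosts)
grep -o -w -E 'sse2|sse4_1|sse4_2|avx|avx2' /proc/cpuinfo | sort -u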

Related

OpenACC - What does -ta in pgcc compiler mean?

I am struggling with the "-ta" flag in the PGI compiler in order to use GPU acceleration with OpenACC. I did not find any comprehensive answer.
Yes, I know that it stands for "target accelerator" and uses information about the hardware for a performance boost. So, which -ta should I set if my GPU hardware is:
weugene@landau:~$ sudo lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104GL [10de:1bb1] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GP104GL [Quadro P4000] [10de:11a3]
Physical Slot: 4
Flags: bus master, fast devsel, latency 0, IRQ 46, NUMA node 0
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
CUDA versions for the PGI compiler (/opt/pgi/linux86-64/2019/cuda) are: 9.2, 10.0, 10.1
As you note, "-ta" stands for "target accelerator" and is a way for you to override the default target device when using "-acc" ("-acc" tells the compiler to use OpenACC, and using just "-ta" implies "-acc"). PGI currently supports two targets: "multicore" to target a multi-core CPU, or "tesla" to target an NVIDIA Tesla device. Other NVIDIA products such as Quadro and GeForce will also work under the "tesla" flag provided they share the same architecture as a Tesla product.
By default when using "-ta=tesla", the PGI compiler will create a unified binary supporting multiple NVIDIA architectures. The exact set of architectures will depend on the compiler version and the CUDA device driver on the build system. For example, with PGI 19.4 on a system with a CUDA 9.2 driver, the compiler will target the Kepler (cc35), Maxwell (cc50), Pascal (cc60), and Volta (cc70) architectures, where "cc" stands for compute capability. Note that if no CUDA driver can be found on the system, the 19.4 compiler defaults to using CUDA 10.0.
In your case, a Quadro P4000 uses the Pascal architecture (cc60), so it would be targeted by default. If you wanted the compiler to target only your device, as opposed to creating a unified binary, you'd use the option "-ta=tesla:cc60".
You can also override which CUDA version to use as a sub-option, for example "-ta=tesla:cuda10.1". For a complete list of sub-options, please run "pgcc -help -ta" from the command line or consult PGI's documentation.
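As a hedged illustration (the source file name saxpy.c is only a placeholder, not from the question), targeting just the P4000 with a specific CUDA toolchain might look like:
# Compile for Pascal (cc60) only, using the CUDA 10.1 toolchain, and print accelerator info
pgcc -acc -ta=tesla:cc60,cuda10.1 -Minfo=accel saxpy.c -o saxpy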
If you don't know the compute capability of the device, run the PGI utility "pgaccelinfo" which will give you this information. For example, here's the output for my system which has a V100:
% pgaccelinfo
CUDA Driver Version: 10010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.67 Sat Apr 6 03:07:24 CDT 2019
Device Number: 0
Device Name: Tesla V100-PCIE-16GB
Device Revision Number: 7.0
Global Memory Size: 16914055168
Number of Multiprocessors: 80
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1380 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 877 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 6291456 bytes
Max Threads Per SMP: 2048
Async Engines: 7
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc70
Hope this helps!

Tensorflow Object Detection Training Performance

I am training the ssd_mobilenet_v1_coco network on my own custom classes using the Object Detection API for TensorFlow.
I have used the CPU (i7-6700) and GPU (NVIDIA Quadro K620) to train:
Processor   Batch size   sec/step   sec/image
K620        1            0.45       0.450
K620        10           2.22       0.222
i7-6700     1            0.66       0.660
i7-6700     24           9.3        0.388
However, the GPU is only about 70% faster than the CPU.
I expected the GPU to be significantly faster.
Is this performance adequate for my hardware or is there something wrong?
Maybe you can try TensorFlow Serving.
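One hedged diagnostic (my addition, not part of the answer above): watch GPU utilization while training runs. If the K620 sits mostly idle, the input pipeline or the small batch size, rather than the GPU itself, is likely the bottleneck.
# Sample GPU utilization and memory use once per second during a training run
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1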

AVX-512 extensions supported on new Skylake-X (Core i9, 79xxX/XE) CPUs

The AVX-512 standard consists of many extensions, and only one (AVX-512F) is mandatory. What exactly is supported by the new Skylake-X (Core i9, 79xxX/XE) CPUs? The Wikipedia page about AVX has details about Skylake Xeon CPUs (E5-26xx V5), but not about the i9. Google also was not very helpful. I also tried to find a dump of /proc/cpuinfo for this CPU, but without luck.
AVX-512 has its own Wikipedia page, which says Skylake-X supports AVX-512 F, CD, BW, DQ, and VL.
The AIDA64 dump found here agrees.
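If you do get shell access to such a machine, a hedged way to confirm this yourself is to pull the AVX-512 feature flags out of /proc/cpuinfo; on Skylake-X you would expect avx512f, avx512cd, avx512bw, avx512dq, and avx512vl.
# List every AVX-512 sub-feature flag the kernel reports
grep -o -E 'avx512[a-z_0-9]*' /proc/cpuinfo | sort -u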

Which [neon/vfp/vfpv3] should I specify for -mfpu when evaluating and comparing float performance on ARM processors?

I want to evaluate the float performance of some different ARM processors. I use lmbench and pi_css5, and I am confused about the float test.
From cat /proc/cpuinfo (below), I guess there are 3 types of float features: neon, vfp, and vfpv3. From this question & answer, it seems it depends on the compiler.
Still, I don't know which one I should specify in the compile flag (-mfpu=neon/vfp/vfpv3), whether I should compile the program with each of them, or whether I should just not specify -mfpu at all.
cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 532.00
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 4
It might be even a little bit more complicated than you anticipated. The GCC ARM options page doesn't explain FPU versions, but ARM's manual for their compiler does. You should also note that Linux doesn't give the whole story about FPU features, only reporting vfp, vfpv3, vfpv3d16, or vfpv4.
Back to your question: you should select the greatest common denominator among them, compile your code for it, and compare the results. On the other hand, if one CPU has vfpv4 and another has vfpv3, which one would you think is better?
If your question is as simple as selecting between neon, vfp, or vfpv3: select neon (source).
-mfpu=neon selects VFPv3 with NEON coprocessor extensions.
From the gcc manual,
If the selected floating-point hardware includes the NEON extension
(e.g. -mfpu=neon), note that floating-point operations will
not be used by GCC's auto-vectorization pass unless
`-funsafe-math-optimizations' is also specified. This is because
NEON hardware does not fully implement the IEEE 754 standard for
floating-point arithmetic (in particular denormal values are
treated as zero), so the use of NEON instructions may lead to a
loss of precision.
See, for instance, Subnormal IEEE-754 floating point numbers support on iOS... for more on this topic.
I have tried each one of them, and it seems that using -mfpu=neon together with -march=armv7-a and -mfloat-abi=softfp is the proper configuration.
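As a hedged example of that configuration (bench.c is only a placeholder file name), the compile line would look something like this; -funsafe-math-optimizations is added because, per the GCC manual quoted above, auto-vectorization will not use NEON without it:
# Build for ARMv7-A with NEON, soft-float calling convention with hardware FP, NEON auto-vectorization allowed
gcc -O2 -march=armv7-a -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations bench.c -o bench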
Besides, a reference (ARM Cortex-A8 vs. Intel Atom) is very useful for ARM benchmarking.
Another helpful article covers ARM Cortex-A processors and GCC command lines; it clarifies the SIMD coprocessor configuration.

Unsupported Atomic Operation in OpenCL

I am using atomic operations in OpenCL. The same code works on an Intel CPU but gives an error on an NVIDIA GPU. I have enabled atomics for both 32-bit and 64-bit.
int cidx = idx % 10;             // not used in this fragment
int i = 1;
C[idx] = In1[idx] & In2[idx];    // bitwise AND of the two input buffers
atomic_add(R, i);                // atomically increment the shared counter R
This is just a portion of the overall code. It gives the build error "Unsupported Operation" when built for an NVIDIA Quadro GPU, whereas it works fine on Intel i3, Xeon, and AMD processors.
atomic_add did not appear in OpenCL 1.0; it was added in a later revision of the spec (in OpenCL 1.0, 32-bit global atomics were the extension functions atom_add and friends, guarded by cl_khr_global_int32_base_atomics). You might be running on two different implementations that conform to different OpenCL versions.
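A hedged way to confirm that (assuming the clinfo utility is installed) is to compare the OpenCL version each device reports:
# Show each device's name and the OpenCL C version its compiler implements
clinfo | grep -iE 'device name|opencl c version'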
