TensorFlow: different results on different GPUs

System information
What is the top-level directory of the model you are using:
Using unmodified pretrained coco models: faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017, faster_rcnn_resnet101_coco_11_06_2017, rfcn_resnet101_coco_11_06_2017
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
UPDATE: tested on two machines now, both reproduce it:
Machine 1: Linux Ubuntu 14.04.4 LTS
Machine 2: Linux Ubuntu 16.04.2 LTS
TensorFlow installed from (source or binary):
official docker container, with last commit 58fb6d7e257f28cd7934316d6ae7a81ec42a533a
docker version from 2017-08-24T02:37:57.51182742Z
TensorFlow version (use command below):
('v1.2.0-5-g435cdfc', '1.2.1')
Bazel version (if compiling from source):
N/A
CUDA/cuDNN version:
From official docker: CUDA 8.0, cuDNN 5.1.10
GPU model and memory:
Machine 1: Three NVIDIA GeForce GTX 1080, 12 GB
Machine 2: Two NVIDIA GeForce GTX 1080, 12 GB
Exact command to reproduce:
Running object_detection_tutorial.ipynb with different GPUs, either with export CUDA_VISIBLE_DEVICES=, or by setting it in the session config. A version that loops over the 3 GPUs several times and compares outputs is attached.
Describe the problem
Running on different GPUs yields different results, and GPUs 1 and 2 are not deterministic. This uses frozen pretrained networks from this repository's linked model zoo and the supplied object_detection_tutorial.ipynb, with no modifications other than setting the CUDA visible_device_list. The SSD frozen models, however, give identical outputs on all 3 GPUs from what I have seen.
I have also run cuda_memtest on all 3 GPUs; the logs are attached.
UPDATE: I just tested on a second machine with 2 GPUs, and reproduced the issue. GPU 0 is deterministic, GPU 1 is not (and often produces bad results).
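For reference, a minimal sketch of the two device-selection methods mentioned above (the device index "1" is illustrative; TF 1.x API):

# Option 1: restrict the process to one physical GPU before TensorFlow initializes
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf

# Option 2: select the device through the session config instead
config = tf.ConfigProto()
config.gpu_options.visible_device_list = "1"

with tf.Session(config=config) as sess:
    pass  # run the frozen detection graph here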
Source code / logs
I've attached the diff of the modified object_detection_tutorial.ipynb which loops over 3 GPUs 3 times and prints out the top box scores, which change depending on the run. Also attached is a PDF of that ipynb with detections drawn on it. Text output:
> Evaluating image 0
>
> Running on GPU 0
> Top 4 box scores:
> Iter 1: [ 0.99978215 0.99857557 0.95300484 0.91580492]
> Iter 2: [ 0.99978215 0.99857557 0.95300484 0.91580492]
> Iter 3: [ 0.99978215 0.99857557 0.95300484 0.91580492]
>
> Running on GPU 1
> Top 4 box scores:
> Iter 1: [ 0.68702352 0.16781448 0.13143283 0.12993629]
> Iter 2: [ 0.18502565 0.16854601 0.08074528 0.07859289]
> Iter 3: [ 0.18502565 0.16854601 0.05546702 0.05111229]
>
> Running on GPU 2
> Top 4 box scores:
> Iter 1: [ 0.68702352 0.16781448 0.13143283 0.12993629]
> Iter 2: [ 0.18941374 0.18502565 0.16854601 0.16230994]
> Iter 3: [ 0.18502565 0.16854601 0.05546702 0.05482833]
>
>
> Evaluating image 1
>
> Running on GPU 0
> Top 4 box scores:
> Iter 1: [ 0.99755412 0.99750346 0.99380219 0.99067008]
> Iter 2: [ 0.99755412 0.99750346 0.99380219 0.99067008]
> Iter 3: [ 0.99755412 0.99750346 0.99380219 0.99067008]
>
> Running on GPU 1
> Top 4 box scores:
> Iter 1: [ 0.96881998 0.96441168 0.96164131 0.96006596]
> Iter 2: [ 0.9377929 0.91686022 0.80374646 0.79758978]
> Iter 3: [ 0.90396696 0.89217037 0.85456908 0.85334581]
>
> Running on GPU 2
> Top 4 box scores:
> Iter 1: [ 0.9377929 0.91686022 0.80374646 0.79758978]
> Iter 2: [ 0.9377929 0.91686022 0.80374646 0.79758978]
> Iter 3: [ 0.9377929 0.91686022 0.80374646 0.79758978]
object_detection_tutorial.diff.txt
gpu_output_differences.pdf
Updated with longer run:
cuda_memtest.log.txt

Related

How to eliminate JIT overhead in a Julia executable (with MWE)

I'm using PackageCompiler hoping to create an executable that eliminates just-in-time compilation overhead.
The documentation explains that I must define a function julia_main to call my program's logic, and write a "snoop file": a script that calls the functions I wish to precompile. My julia_main takes a single argument, the location of a file containing the input data to be analysed. To keep things simple, my snoop file makes a single call to julia_main with a particular input file, so I'd hope to see the generated executable run nice and fast (no compilation overhead) when executed against that same input file.
But alas, that's not what I see. In a fresh Julia instance julia_main takes approx 74 seconds for the first execution and about 4.5 seconds for subsequent executions. The executable file takes approx 50 seconds each time it's called.
My use of the build_executable function looks like this:
julia> using PackageCompiler
julia> build_executable("d:/philip/source/script/julia/jsource/SCRiPTMain.jl",
"testexecutable",
builddir = "d:/temp/builddir4",
snoopfile = "d:/philip/source/script/julia/jsource/snoop.jl",
compile = "all",
verbose = true)
Questions:
Are the above arguments correct to achieve my aim of an executable with no JIT overhead?
Any other advice for me?
Here's what happens in response to that call to build_executable. The lines from Start of snoop file execution! to End of snoop file execution! are emitted by my code.
Julia program file:
"d:\philip\source\script\julia\jsource\SCRiPTMain.jl"
C program file:
"C:\Users\Philip\.julia\packages\PackageCompiler\CJQcs\examples\program.c"
Build directory:
"d:\temp\builddir4"
Executing snoopfile: "d:\philip\source\script\julia\jsource\snoop.jl"
Start of snoop file execution!
┌ Warning: The 'control file' contains the key 'InterpolateCovariance' with value 'true' but that is not supported. Pass a value of 'false' or omit the key altogether.
└ @ ValidateInputs d:\Philip\Source\script\Julia\JSource\ValidateInputs.jl:685
Time to build model 20.058000087738037
Saving c:/temp/SCRiPT/SCRiPTModel.jls
Results written to c:/temp/SCRiPT/SCRiPTResultsJulia.json
Time to write file: 3620 milliseconds
Time in method runscript: 76899 milliseconds
End of snoop file execution!
[ Info: used 1313 out of 1320 precompile statements
Build static library "testexecutable.a":
atexit_hook_copy = copy(Base.atexit_hooks) # make backup
# clean state so that any package we use can carelessly call atexit
empty!(Base.atexit_hooks)
Base.__init__()
Sys.__init__() #fix https://github.com/JuliaLang/julia/issues/30479
using REPL
Base.REPL_MODULE_REF[] = REPL
Mod = @eval module $(gensym("anon_module")) end
# Include into anonymous module to not pollute namespace
Mod.include("d:\\\\temp\\\\builddir4\\\\julia_main.jl")
Base._atexit() # run all exit hooks we registered during precompile
empty!(Base.atexit_hooks) # don't serialize the exit hooks we run + added
# atexit_hook_copy should be empty, but who knows what base will do in the future
append!(Base.atexit_hooks, atexit_hook_copy)
Build shared library "testexecutable.dll":
`'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root\mingw\bin\gcc.exe' --sysroot 'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root' -shared '-DJULIAC_PROGRAM_LIBNAME="testexecutable.dll"' -o testexecutable.dll -Wl,--whole-archive testexecutable.a -Wl,--no-whole-archive -std=gnu99 '-IC:\Users\philip\AppData\Local\Julia-1.2.0\include\julia' -DJULIA_ENABLE_THREADING=1 '-LC:\Users\philip\AppData\Local\Julia-1.2.0\bin' -Wl,--stack,8388608 -ljulia -lopenlibm -m64 -Wl,--export-all-symbols`
Build executable "testexecutable.exe":
`'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root\mingw\bin\gcc.exe' --sysroot 'C:\Users\Philip\.julia\packages\WinRPM\Y9QdZ\deps\usr\x86_64-w64-mingw32\sys-root' '-DJULIAC_PROGRAM_LIBNAME="testexecutable.dll"' -o testexecutable.exe 'C:\Users\Philip\.julia\packages\PackageCompiler\CJQcs\examples\program.c' testexecutable.dll -std=gnu99 '-IC:\Users\philip\AppData\Local\Julia-1.2.0\include\julia' -DJULIA_ENABLE_THREADING=1 '-LC:\Users\philip\AppData\Local\Julia-1.2.0\bin' -Wl,--stack,8388608 -ljulia -lopenlibm -m64`
Copy Julia libraries to build directory:
7z.dll
BugpointPasses.dll
libamd.2.4.6.dll
libamd.2.dll
libamd.dll
libatomic-1.dll
libbtf.1.2.6.dll
libbtf.1.dll
libbtf.dll
libcamd.2.4.6.dll
libcamd.2.dll
libcamd.dll
libccalltest.dll
libccolamd.2.9.6.dll
libccolamd.2.dll
libccolamd.dll
libcholmod.3.0.13.dll
libcholmod.3.dll
libcholmod.dll
libclang.dll
libcolamd.2.9.6.dll
libcolamd.2.dll
libcolamd.dll
libdSFMT.dll
libexpat-1.dll
libgcc_s_seh-1.dll
libgfortran-4.dll
libgit2.dll
libgmp.dll
libjulia.dll
libklu.1.3.8.dll
libklu.1.dll
libklu.dll
libldl.2.2.6.dll
libldl.2.dll
libldl.dll
libllvmcalltest.dll
libmbedcrypto.dll
libmbedtls.dll
libmbedx509.dll
libmpfr.dll
libopenblas64_.dll
libopenlibm.dll
libpcre2-8-0.dll
libpcre2-8.dll
libpcre2-posix-2.dll
libquadmath-0.dll
librbio.2.2.6.dll
librbio.2.dll
librbio.dll
libspqr.2.0.9.dll
libspqr.2.dll
libspqr.dll
libssh2.dll
libssp-0.dll
libstdc++-6.dll
libsuitesparseconfig.5.4.0.dll
libsuitesparseconfig.5.dll
libsuitesparseconfig.dll
libsuitesparse_wrapper.dll
libumfpack.5.7.8.dll
libumfpack.5.dll
libumfpack.dll
libuv-2.dll
libwinpthread-1.dll
LLVM.dll
LLVMHello.dll
zlib1.dll
All done
julia>
EDIT
I was afraid that creating a minimal working example would be hard, but it was straightforward:
TestBuildExecutable.jl contains:
module TestBuildExecutable
Base.@ccallable function julia_main(ARGS::Vector{String}=[""])::Cint
    @show sum(myarray())
    return 0
end
# Function which takes approx 8 seconds to compile. Returns a 500 x 20 array of 1s
function myarray()
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
# PLEASE EDIT TO INSERT THE MISSING 496 LINES, EACH IDENTICAL TO THE LINE ABOVE!
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
end
end #module
SnoopFile.jl contains:
module SnoopFile
currentpath = dirname(@__FILE__)
push!(LOAD_PATH, currentpath)
unique!(LOAD_PATH)
using TestBuildExecutable
println("Start of snoop file execution!")
TestBuildExecutable.julia_main()
println("End of snoop file execution!")
end # module
In a fresh Julia instance, julia_main takes 8.3 seconds for the first execution and half a millisecond for the second execution:
julia> @time TestBuildExecutable.julia_main()
sum(myarray()) = 10000
8.355108 seconds (425.36 k allocations: 25.831 MiB, 0.06% gc time)
0
julia> @time TestBuildExecutable.julia_main()
sum(myarray()) = 10000
0.000537 seconds (25 allocations: 82.906 KiB)
0
So next I call build_executable:
julia> using PackageCompiler
julia> build_executable("d:/philip/source/script/julia/jsource/TestBuildExecutable.jl",
"testexecutable",
builddir = "d:/temp/builddir15",
snoopfile = "d:/philip/source/script/julia/jsource/SnoopFile.jl",
verbose = false)
Julia program file:
"d:\philip\source\script\julia\jsource\TestBuildExecutable.jl"
C program file:
"C:\Users\Philip\.julia\packages\PackageCompiler\CJQcs\examples\program.c"
Build directory:
"d:\temp\builddir15"
Start of snoop file execution!
sum(myarray()) = 10000
End of snoop file execution!
[ Info: used 79 out of 79 precompile statements
All done
Finally, in a Windows Command Prompt:
D:\temp\builddir15>testexecutable
sum(myarray()) = 10000
D:\temp\builddir15>
That run took (by my stopwatch) 8 seconds, and it takes 8 seconds every time it's executed, not just the first time. This is consistent with the executable doing a JIT compile every time it runs, which is exactly what the snoop file is designed to avoid!
Version information:
julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 8
JULIA_EDITOR = "C:\Users\Philip\AppData\Local\Programs\Microsoft VS Code\Code.exe"
Looks like you are using Windows.
At some point PackageCompiler.jl will be mature for Windows, at which point you can try it.
The solution was indeed to wait for progress on PackageCompilerX, as suggested by @xiaodai.
On 10 Feb 2020 what was formerly PackageCompilerX became a new (version 1.0 of) PackageCompiler, with a significantly changed API, and more thorough documentation.
In particular, the MWE above (mutated for the new API to PackageCompiler) now works correctly without any JIT overhead.
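For anyone updating the call above, a minimal sketch of the equivalent under the PackageCompiler 1.0 API (the paths are illustrative, and create_app requires the code to be organized as a Julia project with a Project.toml; the old snoop file's role is played by the precompile_execution_file keyword):

using PackageCompiler

# Builds a standalone app directory containing the executable and its libraries.
create_app("d:/philip/source/script/julia/jsource/TestBuildExecutable",
           "d:/temp/TestBuildExecutableCompiled";
           precompile_execution_file = "d:/philip/source/script/julia/jsource/SnoopFile.jl")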

GRASS 7.4: r.cross unexpectedly produces category zero

I have a problem with the output of r.cross. I hope you can follow my description without an MWE:
I have 3 rasters I want to cross with the following characteristics:
GRASS 7.4.0 (Bengue):~ > r.stats soil_t,lcov,watermask -N
100%
4 8 0
4 8 1
4 9 0
[...]
I would expect r.cross to create a raster with a category for each line shown above. However, I get the following:
GRASS 7.4.0 (Bengue):~ > r.cross input=soil_t,lcov,watermask output=svc
GRASS 7.4.0 (Bengue):~ > r.category svc
0
1 category 4; category 8; category 1
2 category 4; category 9; category 0
[...]
Why is the first line just zero when one would rather expect something like: 1 category 4; category 8; category 0?
EDIT: Just noticed that under GRASS version 6.4 it runs as expected:
GRASS 6.4.6 (Bengue):~ > r.category svc
0
1 category 4; category 8; category 0
2 category 4; category 8; category 1
3 category 4; category 9; category 0
So, something must be wrong with the 7.4 version of r.cross?!
Thanks for your help!
System infos:
GRASS version 7.4.0
Ubuntu MATE 16.04 (xenial)
Just in case somebody comes across this post: the same question was raised on the mailing list shortly after this post by somebody else: https://lists.osgeo.org/pipermail/grass-user/2018-February/077934.html. It appears to be a bug, and it is not yet fixed in the latest release version of GRASS.
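Until it is fixed, a possible workaround is to encode the cross yourself with r.mapcalc (a sketch, assuming your category values stay below the chosen multipliers so that every soil_t/lcov/watermask combination maps to a unique value; the output raster then needs its own labels via r.category):

GRASS 7.4.0 (Bengue):~ > r.mapcalc expression="svc_manual = soil_t * 10000 + lcov * 100 + watermask"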

Julia parallel computing in IPython Jupyter

I'm preparing a small presentation in IPython where I want to show how easy it is to do parallel operations in Julia.
It's basically a Monte Carlo pi calculation described here.
The problem is that I can't make it work in parallel inside an IPython (Jupyter) Notebook; it only uses one core.
I started Julia as: julia -p 4
If I define the functions inside the REPL and run it there it works ok.
@everywhere function compute_pi(N::Int)
"""
Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2
Returns estimate of pi
"""
n_landed_in_circle = 0
for i = 1:N
x = rand() * 2 - 1 # uniformly distributed number on x-axis
y = rand() * 2 - 1 # uniformly distributed number on y-axis
r2 = x*x + y*y # radius squared, in radial coordinates
if r2 < 1.0
n_landed_in_circle += 1
end
end
return n_landed_in_circle / N * 4.0
end
 
function parallel_pi_computation(N::Int; ncores::Int=4)
"""
Compute pi in parallel, over ncores cores, with a Monte Carlo simulation throwing N total darts
"""
# compute sum of pi's estimated among all cores in parallel
sum_of_pis = @parallel (+) for i=1:ncores
compute_pi(int(N/ncores))
end
return sum_of_pis / ncores # average value
end
 
julia> @time parallel_pi_computation(int(1e9))
elapsed time: 2.702617652 seconds (93400 bytes allocated)
3.1416044160000003
But when I do:
using IJulia
notebook()
And try to do the same thing inside the Notebook it only uses 1 core:
In [5]: @time parallel_pi_computation(int(10e8))
elapsed time: 10.277870808 seconds (219188 bytes allocated)
Out[5]: 3.141679988
So, why isn't Jupyter using all the cores? What can I do to make it work?
Thanks.
Using addprocs(4) as the first command in your notebook should provide four workers for doing parallel operations from within your notebook.
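For example (in the Julia 0.3 era shown here, addprocs is available without any imports; on Julia 0.7 and later you would first need using Distributed):

addprocs(4)   # spawn four worker processes from within the notebook
nprocs()      # should now report 5: the master process plus 4 workers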
One way to solve this is to create a kernel that always uses 4 cores. For that some manual work is required. I assume that you are on a unix machine.
In the folder ~/.ipython/kernels/julia-0.x, you will find the following kernel.json file:
{
"display_name": "Julia 0.3.9",
"argv": [
"/usr/local/Cellar/julia/0.3.9_1/bin/julia",
"-i",
"-F",
"/Users/ch/.julia/v0.3/IJulia/src/kernel.jl",
"{connection_file}"
],
"language": "julia"
}
If you copy the whole folder (cp -r julia-0.x julia-0.x-p4) and modify the newly copied kernel.json file:
{
"display_name": "Julia 0.3.9 p4",
"argv": [
"/usr/local/Cellar/julia/0.3.9_1/bin/julia",
"-p",
"4",
"-i",
"-F",
"/Users/ch/.julia/v0.3/IJulia/src/kernel.jl",
"{connection_file}"
],
"language": "julia"
}
The paths will probably be different for you. Note that I only gave the kernel a new name and added the command-line arguments -p 4.
You should see a new kernel named Julia 0.3.9 p4 which should always use 4 cores.
Also note that this kernel file will not get updated when you update IJulia, so you have to update it manually whenever you update julia or IJulia.
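You can check that Jupyter picked up the new kernel (assuming the jupyter command is on your PATH):

jupyter kernelspec list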
You can add new kernels using this command:
using IJulia
#for 4 cores
installkernel("Julia_4_threads", env=Dict("JULIA_NUM_THREADS"=>"4"))
#or for 8 cores
installkernel("Julia_8_threads", env=Dict("JULIA_NUM_THREADS"=>"8"))
After restarting VS Code (or your notebook server), these options will appear in the kernel selection menu. Note that JULIA_NUM_THREADS controls Julia's multithreading, which is distinct from the -p worker processes that @parallel uses; for worker processes you would still use addprocs or the -p kernel argument shown above.

WinBUGS: adding time and school fixed effects to a hierarchical panel model

I am working with hierarchical panel data in WinBUGS. Assume data on school performance: the response is logs, with independent variables logp and rank. All schools are divided into three categories (cat), and I need a beta coefficient for each category (hence the hierarchical model). I want to account for time-specific and school-specific effects. One way would be to add dummy variables to the list of terms in mu[i], but that would get messy because I have up to 60 schools. I am sure there must be a better way to handle this.
My data looks like the following:
school time logs logp cat rank
1 1 4.2 8.9 1 1
1 2 4.2 8.1 1 2
1 3 3.5 9.2 1 1
2 1 4.1 7.5 1 2
2 2 4.5 6.5 1 2
3 1 5.1 6.6 2 4
3 2 6.2 6.8 3 7
#logs = log(score)
#logp = log(average hours of inputs)
#rank - rank of school
#cat = section red, section blue, section white in school (hierarchies)
My WinBUGS code is given below.
model {
  # N observations
  for (i in 1:n){
    logs[i] ~ dnorm(mu[i], tau)
    mu[i] <- bcons + bprice*(logp[i]) + brank[cat[i]]*(rank[i])
  }
  # C categories
  for (c in 1:C) {
    brank[c] ~ dnorm(beta, taub)
  }
  # priors
  bcons ~ dnorm(0,1.0E-6)
  bprice ~ dnorm(0,1.0E-6)
  bad ~ dnorm(0,1.0E-6)
  beta ~ dnorm(0,1.0E-6)
  tau ~ dgamma(0.001,0.001)
  taub ~ dgamma(0.001,0.001)
}
As you can see in the data sample above, I have multiple observations per school over time. How can I modify the code to account for time and school specific fixed effects? I have used Stata in the past, where the fe, be, and i.time options take care of fixed effects in panel data; but here I am lost.
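For what it's worth, a common BUGS pattern for this is sketched below (untested against this data: S and T are the numbers of schools and time periods, and school[i] and time[i] are index columns you would supply alongside the rest of the data; each effect's first level is pinned to zero as a corner constraint so the intercept remains identifiable):

model {
  for (i in 1:n){
    logs[i] ~ dnorm(mu[i], tau)
    mu[i] <- bcons + bprice*(logp[i]) + brank[cat[i]]*(rank[i]) +
             schooleff[school[i]] + timeeff[time[i]]
  }
  for (c in 1:C) { brank[c] ~ dnorm(beta, taub) }
  # school effects, corner-constrained so bcons stays identifiable
  schooleff[1] <- 0
  for (s in 2:S) { schooleff[s] ~ dnorm(0, 1.0E-6) }
  # time effects, same constraint
  timeeff[1] <- 0
  for (t in 2:T) { timeeff[t] ~ dnorm(0, 1.0E-6) }
  # priors as before
  bcons ~ dnorm(0,1.0E-6)
  bprice ~ dnorm(0,1.0E-6)
  beta ~ dnorm(0,1.0E-6)
  tau ~ dgamma(0.001,0.001)
  taub ~ dgamma(0.001,0.001)
}

Swapping the flat dnorm(0, 1.0E-6) priors on schooleff and timeeff for hierarchical ones (e.g. dnorm(0, tau.school) with a hyperprior on tau.school) would turn these into random effects, which usually scales better with 60 schools.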

Calculating average value of temp from lm-sensors using bash script

As the title says, I am trying to calculate the average temperature of the CPU to use in a Conky setup. The acpi command is strangely not giving me any temperature information on this laptop... so I am using lm-sensors.
cho:~$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +54.0°C (high = +95.0°C, crit = +105.0°C)
Core 2: +57.0°C (high = +95.0°C, crit = +105.0°C)
First, I am not sure what Core 0 and Core 2 represent... I am thinking they are the two cores of my dual-core CPU.
Would it be possible to have a one-liner that calculates the average of those temperatures and gives
55.5°C
as the output?
Thanks in advance.
You can pipe the sensors output through this awk:
sensors | awk '/^Core /{++r; gsub(/[^[:digit:]]+/, "", $3); s+=$3} END{print s/(10*r) "°C"}'
55.5°C
The gsub strips everything but the digits from the third field, so +54.0°C becomes 540; dividing the sum by 10*r then both restores the decimal place and averages over the r matching lines.
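Since the goal is a Conky display, one way to embed the one-liner in ~/.conkyrc (a sketch; ${execi N ...} re-runs the command every N seconds, which is cheaper than plain ${exec}, and the 10-second interval is an arbitrary choice):

CPU avg: ${execi 10 sensors | awk '/^Core /{++r; gsub(/[^[:digit:]]+/, "", $3); s+=$3} END{print s/(10*r) "°C"}'}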
