Set spark.local.dir to different drive - windows

I'm trying to set up standalone Spark on Windows 10. I would like to set spark.local.dir to D:\spark-tmp\tmp, because it currently appears to be using C:\Users\<me>\AppData\Local\Temp, which in my case is on an SSD that might not have enough space given the size of some datasets.
So I changed the file %SPARK_HOME%\conf\spark-defaults.conf to the following, without success
spark.eventLog.enabled true
spark.eventLog.dir file:/D:/spark-tmp/log
spark.local.dir file:/D:/spark-tmp/tmp
I also tried to run %HADOOP_HOME%\bin\winutils.exe chmod -R 777 D:/spark-tmp, but it didn't change anything.
The error that I get is the following:
java.io.IOException: Failed to create a temp directory (under file:/D:/spark-tmp/tmp) after 10 attempts!
If I start the path with file://D:/... (note the double slash), nothing changes. If I remove the scheme entirely, a different exception says that the scheme D: is not recognized.
I also noticed this warning:
WARN SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
So I tried to put the following line in %SPARK_HOME%\conf\spark-env.sh:
SPARK_LOCAL_DIRS=file:/D:/spark-tmp/tmp
If I add this line and comment out the spark.local.dir line in the .conf file, Spark works perfectly, but the temporary files are still saved in my AppData\Local\Temp folder. So the SPARK_LOCAL_DIRS line is not being read.
What's strange is that, if I let it run, it actually puts logs in D:/spark-tmp/log, which means that it's not a problem of syntax or permissions.

On Windows, you have to set these as environment variables rather than in spark-env.sh.
Add the key-value pair
SPARK_LOCAL_DIRS -> d:\spark-tmp\tmp
to your system's environment variables.
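If Spark is launched from a script rather than from a shell that already has the system variable, the same setting can be passed through the child process environment. The following is a minimal sketch in Python, assuming a plain filesystem path with no file: scheme (as in the answer above); the helper name spark_env_with_local_dir is made up for illustration.

```python
import os
from pathlib import Path


def spark_env_with_local_dir(scratch_dir, base_env=None):
    """Return a copy of the environment with SPARK_LOCAL_DIRS set.

    Hypothetical helper: creates the scratch directory if needed and
    leaves the calling process's own environment untouched.
    """
    path = Path(scratch_dir)
    # Make sure the directory exists before Spark tries to use it.
    path.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ if base_env is None else base_env)
    # Plain path, no "file:" scheme (assumption taken from the answer above).
    env["SPARK_LOCAL_DIRS"] = str(path)
    return env
```

The returned dictionary can be handed to subprocess.run(..., env=...) when starting the Spark launcher, so only that process tree sees the override.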


Using gitpython to get current hash does not work when using qsub for job submission on a cluster

I use python to do my data analysis and lately I came up with the idea to save the current git hash in a log file so I can later check which code version created my results (in case I find inconsistencies or whatever).
It works fine as long as I do it locally.
import git
import os
rep = git.Repo(os.getcwd(), search_parent_directories=True)
git_hash = rep.head.object.hexsha
with open('logfile.txt', 'w+') as writer:
    writer.write('Code version: {}'.format(git_hash))
However, I have a lot of heavy calculations that I run on a cluster to speed things up (run analyses of subjects parallel), using qsub, which looks more or less like this:
qsub -l nodes=1:ppn=12 analysis.py -q shared
This always results in a git.exc.InvalidGitRepositoryError.
EDIT
Printing os.getcwd() showed me that on the cluster the current working directory is always my $HOME directory, no matter where I submit the job from.
My next solution was to get the directory where the file is located using some of the solutions suggested here.
However, these solutions result in the same error because (that's how I understand it) my file is somehow copied to a directory deep in the root structure of the cluster's headnode (/var/spool/torque/mom_priv/jobs).
I could of course write down the location of my file as a hardcoded variable, but I would like a general solution for all my scripts.
So after I explained my problem to IT in detail, they could help me solve the problem.
Apparently the $PBS_O_WORKDIR variable stores the directory from which the job was submitted.
So I adjusted my access to the githash as follows:
import os
import git

# On the cluster, PBS_O_WORKDIR holds the directory the job was submitted from
try:
    script_file_directory = os.environ["PBS_O_WORKDIR"]
except KeyError:
    script_file_directory = os.getcwd()
try:
    rep = git.Repo(script_file_directory, search_parent_directories=True)
    git_hash = rep.head.object.hexsha
except git.InvalidGitRepositoryError:
    git_hash = 'not-found'
# create a log file that saves some information about the run script
with open('logfile.txt', 'w+') as writer:
    writer.write('Codeversion: {} \n'.format(git_hash))
I first check whether the PBS_O_WORKDIR variable exists (that is, whether the script is running as a job on the cluster). If it does, I get the git hash from that directory; if it doesn't, I use the current working directory.
Very specific, but maybe one day someone has the same problem...
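For scripts that should not depend on GitPython being installed on the compute nodes, the same lookup can be sketched with the git command line. This is only an illustration: it assumes git is on the PATH of the node, and the function names submit_dir and git_hash are made up here.

```python
import os
import subprocess


def submit_dir(environ=None):
    """Directory the job was submitted from, or the cwd when not under PBS."""
    env = os.environ if environ is None else environ
    return env.get("PBS_O_WORKDIR") or os.getcwd()


def git_hash(start_dir):
    """Return the HEAD commit hash for the repository containing start_dir.

    git itself searches parent directories for the repository. Any
    failure (no repository, missing directory, git not installed)
    yields 'not-found', mirroring the fallback used above.
    """
    try:
        result = subprocess.run(
            ["git", "-C", start_dir, "rev-parse", "HEAD"],
            check=True,
            capture_output=True,
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "not-found"
    return result.stdout.strip()
```

A call such as git_hash(submit_dir()) then behaves the same locally and inside a qsub job.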

Atom editor & Processing - 'processing-java' not recognised

I am trying to use bleikamp's Processing package to run Processing sketches from the Atom editor. The package has installed correctly, but running a sketch produces the below error:
'processing-java' is not recognized as an internal or external command, operable program or batch file.
I have added the path to my Processing directory to the PATH environment variable. Can anyone suggest why this is not working?
The problem is almost certainly to do with your PATH. As you've checked the Processing path is correct, there may be something preventing it from being recognised (there are some known issues, such as the PATH variable being too long or having invalid characters).
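Before changing anything, it can help to check what the PATH actually resolves to. The sketch below uses Python's standard shutil.which to look up a command the way a shell would, and lists PATH entries that do not exist as directories. It is a diagnostic aid only, and the helper names are made up for illustration.

```python
import os
import shutil


def find_executable(name, path=None):
    """Return the full path of `name` if it is found on the search path, else None."""
    return shutil.which(name, path=path)


def missing_path_entries(path=None):
    """List search-path entries that are not existing directories."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [entry for entry in raw.split(os.pathsep)
            if entry and not os.path.isdir(entry)]
```

If find_executable("processing-java") returns None from a freshly opened terminal, the Processing directory is not effectively on the PATH, whatever the environment variable dialog shows.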
Solution 1:
Try using FixPath to fix some of the more common problems.
Solution 2:
Try modifying the processing package itself to point directly to processing-java.exe (it points to the PATH variable by default).
In Atom's Settings > Packages, select the processing package and click on 'View Code'.
Make a backup of \lib\processing.coffee to be safe.
In \lib\processing.coffee, search for the following code (probably near the top):
module.exports = Processing =
  config:
    'processing-executable':
      type:"string",
      default:"processing-java"
Modify the value of default to point to the exact Processing directory and processing-java.exe, for example:
module.exports = Processing =
  config:
    'processing-executable':
      type:"string",
      default:"c:\\program files\\processing\\processing-java.exe"
As Chris rightly points out in the comments below, backslash \ is an escape character in JavaScript and CoffeeScript, so itself needs to be escaped in the file path (hence the double-backslashes \\).
You have to install the processing-java command-line tool.
In Processing, go to Tools -> Install "processing-java"

Reduce file path when calling a file from terminal

I'm using Lua in interactive mode on a Mac (thanks to rudix.org).
When I want to load a file I do:
dofile("/my/long/path/to/my/directory/file.lua")
I want to do a different thing, that is:
put all my files in a desktop directory myDirectory;
then call the file from the terminal this way dofile("file.lua");
Is this possible? How?
If the path is fixed, you can just redefine dofile:
local _dofile=dofile
local path=("/my/long/path/to/my/directory/")
function dofile(x)
  return _dofile(path..x)
end
You may put this (and other initializations) in a file and set the environment variable LUA_INIT to its location. After this, every invocation of lua will see the version of dofile redefined above and the users will be able to say simply dofile("foo.lua").
Alternatively, you can use require, which looks for modules in a list of paths in package.path or LUA_PATH.

No space left on device - write chef remote_file

I get a strange error when chef-client tries to execute a remote_file resource for a big local file.
From the stack trace I guess Ruby copies the file itself. My disk has a lot of free space, and the var and tmp folders each have at least 2 GB. If I do this job myself with the cp command, or replace the remote_file resource with an execute one, it works.
Chef complains about lack of disk space.
This resource fails for a 4 GB file with the message No space left on device.
remote_file "/tmp/copy.img" do
  source "file://tmp/origin.img"
  action :create
end
I made a workaround with an execute resource, and it works.
execute "disk-image" do
  command "cp /tmp/origin.img /tmp/copy.img"
  not_if "cmp /tmp/origin.img /tmp/copy.img"
end
It's not going to work. remote_file downloads the remote file to somewhere within /var/chef IIRC, then copies to its destination.
Since /var has only 2Gb of space and the file is 4Gb big, it correctly throws the No space left on device error.
Thank you @lamont for the explanation. To cut to the chase a bit, the only solution that worked for me was to add the following to my Chef recipe, prior to any calls to remote_file:
ENV['TMP'] = '/mytmp'
ENV['TMPDIR'] = '/mytmp'
where /mytmp is a directory on a volume with enough space to hold my file.
The promising feature of adding:
file_staging_uses_destdir true
to /etc/chef/client.rb currently does not work, due to this bug: https://tickets.opscode.com/browse/CHEF-5311.
9/20/2016: Chef 12.0 shipped with file_staging_uses_destdir defaulting to true, so this should no longer be an issue (the remote_file bug where it streams to /tmp may still exist).
First the real simple statement: If you've got a 4GB file in /tmp and you only have 2GB left in /tmp, then obviously copying the 4GB will fail, and nothing can help you. I'm assuming you have at least 4GB in /tmp and only 2GB left in /var which is the only interesting case to address.
As of 11.6.0 (to 11.10.2 at least) chef-client will create a tempfile using ruby's Tempfile.new and will copy the contents to that temp file and then will mv it into place. The tempfile location will be determined by ENV['TMPDIR'] and that differs based on your O/S distro (e.g. on a Mac that will be something like /var/folders/rs/wqwmj_1n59135dhhbvfdrlqh0001yj/T/ and not just /tmp or even /var/tmp), so it may not be obvious where the intermediate tempfile is created. You may be running into that problem. You should be able to see from the chef-client -l debug output what tempfile location chef is using and if you df -k that directory you might see that it is 100%.
Also, look at df -i to see if you've run out of inodes somehow which will also throw a no space left on device error.
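The two df checks can also be scripted. The following is a rough Python analogue, not Chef's own code: it reports free bytes and free inodes for whatever temp directory the process would use. Python's tempfile.gettempdir() consults TMPDIR first, much like the Ruby behaviour described above, but treat the match as an assumption. os.statvfs is Unix-only, and the function names are made up for illustration.

```python
import os
import shutil
import tempfile


def staging_dir_report(directory=None):
    """Report free space and free inodes for a staging directory.

    Defaults to the temp directory this process resolves from
    TMPDIR/TEMP/TMP, falling back to platform defaults.
    """
    target = directory or tempfile.gettempdir()
    usage = shutil.disk_usage(target)   # counterpart of `df -k`
    stats = os.statvfs(target)          # counterpart of `df -i` (Unix only)
    return {
        "directory": target,
        "free_bytes": usage.free,
        "free_inodes": stats.f_favail,
    }


def has_room_for(size_bytes, directory=None):
    """True if the staging directory has at least size_bytes free."""
    return staging_dir_report(directory)["free_bytes"] >= size_bytes
```

Running has_room_for(4 * 1024**3) against both the temp directory and the destination directory shows quickly which of the two filesystems is the one filling up.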
You can set chef-client globally to use the destination directory as the tmpdir for creating files via adding this to client.rb:
file_staging_uses_destdir true
Then if your destination dir is '/tmp' the tempfile will get created there and then will simply get renamed in the directory in order to deploy it. That ensures that if there's enough space on the target device to hold the result, then the resource should always succeed to write the tempfile. It also avoids the problem if /tmp and the destdir are on different filesystems that the mv to rename and deploy the file will get translated into a copy-and-unlink-src operation which can fail in several different ways.
The answer by @cassianoleal is not correct in stating that chef-client always uses /var/cache as a temp location. Changing file_cache_path will also not have an effect. That is confusing a common pattern of downloading remote_files into the Chef file_cache_path directory for how remote_file works internally -- those are not the same thing. There is no file_cache_path in the question, so there should not be any file_cache_path in the answer.
The behavior of remote_file with file:// URLs is a bit roundabout, but that is because they're necessary for all other URLs (as @cassianoleal correctly mentioned). The behavior with file_staging_uses_destdir is probably correct, however, since you do want to take into account edge conditions where you run out of room and truncate the file or the server crashes in the middle of a copy operation and you don't want a half-populated file left over. By writing to a tempfile, closing it, and then renaming it, a lot of those edge conditions are avoided.

Creating empty directories / folders in InstallAnywhere 2011

I have a script which collects together a number of files to be installed. This includes a number of empty directories.
Previously I would use the D flag in the manifest file which would copy empty directories. However due to the way I generate the manifest files (as part of our build process) I can sometimes end up with two D entries with the same destination folder. e.g:
D;${A_LIB}/all/pysys/${PYSYS_VERSION}/lib/python2.7/site-packages;./third_party/python/lib/python2.7/site-packages;COMMON;${UNIX}
D;${A_LIB_BT}/python/${PYTHON_VERSION};./third_party/python;COMMON;${ALL}
This causes InstallAnywhere to fail to build the installer.
To get around this I rewrote the manifest generation code to parse the directories previously pointed to by a D and replace the D entry with F entries for each file in the directory.
Unfortunately this will not include empty directories (which we may / may not need in the installer but in general it's just safer to create them than have some piece of code fail because they're not there).
I've tried the following in the manifest. Reference, Reference3 and Reference4 are empty, Reference2 contains a single directory (which is itself empty). Only Reference2 is present in the install - the other three which are empty directories seem to get excluded.
D,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference,./samples/pysys/cor_002/Reference
D,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference2,./samples/pysys/cor_002/Reference2
D,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference3/.,./samples/pysys/cor_002/Reference3/.
D,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference4/../Reference4,./samples/pysys/cor_002/Reference4/../Reference4
I've also tried increasing the log level but this has not revealed anything. Is there a way to increase this log level?
export LAX_DEBUG=true
Any suggestions?
DISCLAIMER: I've cross posted this to InstallAnywhere's forums but I will do my best to keep the answers in sync and spread the knowledge.
I can't speak to your manifest challenges. However, my first thought is to change the manifest generator to be sensitive to duplicate output locations -- maybe by storing them in a Map or Set -- and then handling collisions when they occur by failing the build or adjusting the output location(s).
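The collision check could be sketched as follows in Python. It assumes the semicolon-separated layout shown in the question, with the destination as the second field, and it compares destination strings literally without expanding variables such as ${A_LIB}. The function name is made up for illustration.

```python
from collections import defaultdict


def find_duplicate_destinations(manifest_lines, sep=";"):
    """Map each destination used by more than one D entry to its line numbers.

    Assumed layout: TYPE<sep>DESTINATION<sep>SOURCE[...]; only lines
    whose first field is exactly "D" are considered.
    """
    seen = defaultdict(list)
    for number, line in enumerate(manifest_lines, start=1):
        fields = line.strip().split(sep)
        if len(fields) < 3 or fields[0] != "D":
            continue
        seen[fields[1]].append(number)
    return {dest: numbers for dest, numbers in seen.items() if len(numbers) > 1}
```

The manifest generator can then fail the build, or rewrite one of the entries, whenever the returned dictionary is not empty.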
On the other hand, I can tell you how to increase the verbosity of your installer.
Make the installer more verbose by adding:
-Dlax.debug.all=true -Dlax.debug.level=3
to Project > JVM Settings > Installer Settings (tab) > Optional Installer Arguments > Additional Arguments. You'll want to remove these before you ship. You can also add these to the command line when you start the installer. Level values of 4 and 5 work, too, and are even more verbose.
You can also make your installer print its progress to the console by going to Project > JVM Settings > Log Settings. Here, uncheck Include debug output (stderr and stdout). Then enter the word console in Send stderr to: and Send stdout to:. Rather than console, you can also set a specific file name. You'll also want to undo these settings before you ship.
The solution turns out to be so blindingly simple that I never tried it.
To get EMPTY directories installed by Install Anywhere you have to specify the directories as files in the manifest. So with the following directory structure:
Reference <empty>
Reference2
    testdir <empty>
Reference3 <empty>
Reference4 <empty>
You need to specify the entries in the manifest as F. Specifying them as D will only result in the "Reference2" directory being included.
F,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference,./samples/pysys/cor_002/Reference
F,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference2,./samples/pysys/cor_002/Reference2
F,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference3/.,./samples/pysys/cor_002/Reference3/.
F,$IA_PROJECT_DIR$/samples/pysys/cor_002/Reference4/../Reference4,./samples/pysys/cor_002/Reference4/../Reference4
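Since the manifest is generated by a script, the empty directories that need such F entries could be discovered automatically. This is a minimal Python sketch under the assumption that the generator can walk the source tree; it only lists the directories, and how each one is turned into a manifest line is left to the generator. The function name is made up for illustration.

```python
import os


def empty_directories(root):
    """Return paths (relative to root) of directories with no files or subdirectories."""
    result = []
    for current, dirnames, filenames in os.walk(root):
        if not dirnames and not filenames:
            result.append(os.path.relpath(current, root))
    return sorted(result)
```

For the directory structure above, a directory like Reference2 would not be listed itself, but its empty child testdir would be.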
Sorry to answer my own question, really wasn't the plan!
