WindowsError when calling sc.parallelize() on Windows

I want to use the sc.parallelize() function, but whenever I try to call it I get the error below:
File "V:/PyCharmProjects/sample.py", line 9, in <module>
input_data = sc.parallelize(sc.textFile("C:\Users\Spider\Desktop\GM_coding\Sample Data.csv"))
File "V:\spark-2.2.0-bin-hadoop2.7\python\pyspark\context.py", line 497, in parallelize os.unlink(tempFile.name)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: u'C:\\Users\\Spider\\AppData\\Local\\Temp\\spark-fef6debd-ff91-4fb6-85dc-8c3a1da9690a\\pyspark-6ed523e7-358f-4e3c-ad83-a479fb8ecc52\\tmpxffhfi'

Not sure if it's relevant to your error (and I cannot test it on Windows), but you are trying to parallelize something that is already an RDD (i.e. already "parallelized"); from the docs:
textFile(name, minPartitions=None, use_unicode=True)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
You don't need (and shouldn't use) sc.parallelize() here; the output of sc.textFile is already an RDD. You should simply go for
input_data = sc.textFile("C:\Users\Spider\Desktop\GM_coding\Sample Data.csv")
See also the example in the quick start guide.
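For completeness, here is a minimal sketch of the intended flow; the SparkContext setup and the actions at the end are illustrative, and the raw-string prefix is just a precaution against backslash escapes in Windows paths:
from pyspark import SparkContext

sc = SparkContext("local", "example")  # illustrative local context

# textFile() already returns an RDD, so it can be used directly
input_data = sc.textFile(r"C:\Users\Spider\Desktop\GM_coding\Sample Data.csv")

print(input_data.count())  # e.g. number of lines in the CSV
print(input_data.take(2))  # e.g. the first two records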

Related

OSError: No space left on device for weird reason

I am running some code that isn't originally mine, using GPT2-XL; it was initially implemented for GPT-Neo-1.3B and T5-Large. While running an experiment with GPT2-XL, I got the following error:
File "/data/username/directory", line 631, in to_tsr_gpt_entity_inference _bleu_score = BLEU.compute(predictions=definition, File "/data/username/miniconda3/lib/python3.9/site-packages/evaluate/module.py", line 433, in compute self._finalize() File "/data/username/miniconda3/lib/python3.9/site-packages/evaluate/module.py", line 372, in _finalize self.writer.finalize() File "/data/username/miniconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 589, in finalize self.stream.close() File "/data/username/miniconda3/lib/python3.9/site-packages/fsspec/implementations/local.py", line 358, in close return self.f.close() OSError: [Errno 28] No space left on device
The issue here is that I'm not quite sure what is running out of space; I have checked df -h and df -i, for example, and I tried setting TMPDIR to a different location with export TMPDIR=/new_directory, but that didn't work, so I can only assume something else is causing this to run out of space. It appears to have something to do with BLEU from the Hugging Face evaluate module; however, the same code (including the BLEU part) works when running GPT-Neo/T5, so I'm not quite sure what the issue is.
I'd greatly appreciate any help in answering this question.
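One thing worth ruling out: [Errno 28] can also mean inode exhaustion rather than missing bytes (df -i checks this), and the evaluate/datasets libraries write to the Hugging Face cache directories, which may live on a different partition than TMPDIR. Here is a minimal sketch of redirecting those caches before the imports, assuming the caches are what fill up; the target paths are illustrative:
import os

# Redirect the Hugging Face cache locations BEFORE importing
# evaluate/datasets; the paths below are illustrative.
os.environ["HF_HOME"] = "/big_disk/hf_home"
os.environ["HF_DATASETS_CACHE"] = "/big_disk/hf_datasets"
os.environ["TMPDIR"] = "/big_disk/tmp"

import evaluate

bleu = evaluate.load("bleu")  # should now cache and write under /big_disk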

Failing to read a named pipe being written to

I have a stream of data that I’m writing to a named pipe:
named_pipe = '/tmp/pipe' # Location of named pipe
File.mkfifo(named_pipe) # Create named pipe
File.open(named_pipe, 'w+') # Necessary to not get a broken pipe when ⌃C from another process later on
system('youtube-dl', '--newline', 'https://www.youtube.com/watch?v=aqz-KE-bpKQ', out: named_pipe) # Output download progress, one line at a time
Trouble is, while I can cat /tmp/pipe and get the information, I’m unable to read the file from another Ruby process. I’ve tried File.readlines, File.read with seeking, File.open then reading, and other stuff I no longer remember. Some of those hang, others error out.
How can I get the same result as with cat, in pure Ruby?
Note I don’t have to use system to send to the pipe (Open3 would be acceptable), but any solution requiring external dependencies is a no-go.
It looks like File.readlines/IO.readlines and File.read/IO.read need to load the whole file first, so you don't see anything printed out.
Try File#each/IO.foreach, which process the file line by line and do not require the whole file to be loaded into memory:
File.foreach("/tmp/pipe") { |line| p line }
# or
File.open('/tmp/pipe','r').each { |line| p line }

Python UDF in Hive with a JSON file

I have two problems with the Hive view in Ambari.
Problem 1: I have this script:
DELETE FILE /user/admin/hive/scripts/MKT_UDF/fb_audit_ads_creatives.py;
ADD FILE /user/admin/hive/scripts/MKT_UDF/fb_audit_ads_creatives.py;
SELECT TRANSFORM (line) USING 'python fb_audit_ads_creatives.py' as (ad_id) FROM stg_fb_audit_ads_creatives_json where date_time='2018-05-05';
I have run it many times and it runs smoothly, but sometimes I get an error:
You can see it here: Log error
I think it's due to configuration (Hive, Ambari, a timeout, etc.).
Problem 2: I have a JSON file:
{
"body": "https://www.facebook.com/groupkiemhieptinhduyen #truongsinhquyet #sapramat #tsq #game3d",
"thumbnail_url": "https://external.xx.fbcdn.net/safe_image.php?d=AQDU01asRxdnCObW&w=64&h=64&url=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fv%2Ft15.0-10%",
"campaign_id": "23842841688740666"
}
I use the HQL script above with this Python UDF:
import sys
import json

for line in sys.stdin:
    data = json.loads(line)
    print(data)
    print(data['thumbnail_url'])
This runs okay.
But with this Python UDF:
import sys
import json

for line in sys.stdin:
    data = json.loads(line)
    print(data)
    print(data['body'])
I get an error: Log error
Can you help me?
Instead of working with Python, I recommend trying this UDTF, which allows working on JSON columns within Hive. It is then possible to manipulate large JSON documents and fetch the needed data in a distributed and optimized way.
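If you prefer to stay with a streaming UDF, a more defensive version may help narrow down problem 2. This is only a sketch, assuming Python 2 (the usual default python for Hive streaming at the time) and assuming the failing rows either lack the body key or contain characters that break the default output encoding:
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip empty rows
    try:
        data = json.loads(line)
    except ValueError:
        continue  # skip rows that are not valid JSON
    # .get() avoids a KeyError when a row has no "body" field,
    # and encoding explicitly avoids UnicodeEncodeError on non-ASCII text
    print(data.get('body', '').encode('utf-8'))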

Downloading a zipfile in binary format using python's ftplib

I have gone through quite a few posted questions that appear to be related, but none of them is quite the same issue I am having:
I am using Python's ftplib module along with zipfile to download a zip file from FTP in binary format. However, for some reason, the downloaded zip file appears to be in ASCII.
I have ensured that a leading / does not exist in the path of the file I am downloading (to match the zip specifications).
outFile = zipfile.ZipFile(local_file_path, 'w')
myftp.retrbinary('RETR %s' %i, outFile.write(i)) #i - target file path on ftp server
This code fails, giving me the following error:
st = os.stat(filename)
OSError: [Errno 2] No such file or directory: //$i
I tried adding the 'b' option for binary, but zipfile doesn't seem to like it:
outFile = zipfile.ZipFile(local_file_path, 'wb')
This raises error:
RuntimeError: ZipFile() requires mode "r", "w", or "a"
I am using Python 2.6.
What am I doing wrong, and how do I fix it?
According to the Python docs (http://docs.python.org/2/library/ftplib.html), retrbinary takes a callback as its second parameter:
>>> ftp.retrbinary('RETR README', open('README', 'wb').write)
'226 Transfer complete.'
>>> ftp.quit()
Documentation says:
FTP.retrbinary(command, callback[, maxblocksize[, rest]])
Retrieve a file in binary transfer mode. command should be an
appropriate RETR command: 'RETR filename'. The callback function is
called for each block of data received, with a single string argument
giving the data block. [...]
In your example it should be outFile.write (instead of outFile.write(i)), so that the method itself is passed as the callback rather than being called once up front.
>>> ftp.retrbinary('RETR %s' % i, outFile.write)
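It is also worth noting why the original code produced an os.stat error: ZipFile.write(name) expects the name of an existing file to add to the archive, so passing downloaded data to it makes zipfile try to stat that data as a path. For downloading, write to an ordinary binary file and only open it with zipfile afterwards, in read mode. A minimal sketch follows; the host, credentials, and paths are placeholders:
import ftplib
import zipfile

ftp = ftplib.FTP('ftp.example.com')  # placeholder host
ftp.login('user', 'password')        # placeholder credentials

remote_path = 'pub/archive.zip'      # placeholder remote file
local_file_path = 'archive.zip'

# Download into a plain binary file, not a ZipFile object
f = open(local_file_path, 'wb')
ftp.retrbinary('RETR %s' % remote_path, f.write)
f.close()
ftp.quit()

# Open with zipfile only after the transfer completes, in read mode
zf = zipfile.ZipFile(local_file_path, 'r')
print(zf.namelist())
zf.close()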

How to get R script line numbers at error?

If I am running a long R script from the command line (R --slave script.R), then how can I get it to give line numbers at errors?
I don't want to add debug commands to the script if at all possible; I just want R to behave like most other scripting languages.
This won't give you the line number, but it will tell you where the failure happens in the call stack, which is very helpful:
traceback()
[Edit:] When running a script from the command line you will have to skip one or two calls, see traceback() for interactive and non-interactive R sessions
I'm not aware of another way to do this without the usual debugging suspects:
debug()
browser()
options(error=recover) [followed by options(error = NULL) to revert it]
You might want to look at this related post.
[Edit:] Sorry...just saw that you're running this from the command line. In that case I would suggest working with the options(error) functionality. Here's a simple example:
options(error = quote({dump.frames(to.file=TRUE); q()}))
You can create as elaborate a script as you want on an error condition, so you should just decide what information you need for debugging.
Otherwise, if there are specific areas you're concerned about (e.g. connecting to a database), then wrap them in a tryCatch() function.
Doing options(error=traceback) provides a little more information about the content of the lines leading up to the error. It causes a traceback to appear if there is an error, and for some errors it has the line number, prefixed by #. But it's hit or miss; many errors won't get line numbers.
Support for this will be forthcoming in R 2.10 and later. Duncan Murdoch just posted to r-devel on Sep 10, 2009 about findLineNum and setBreakpoint:
I've just added a couple of functions to R-devel to help with
debugging. findLineNum() finds which line of which function
corresponds to a particular line of source code; setBreakpoint() takes
the output of findLineNum, and calls trace() to set a breakpoint
there.
These rely on having source reference debug information in the code.
This is the default for code read by source(), but not for packages.
To get the source references in package code, set the environment
variable R_KEEP_PKG_SOURCE=yes, or within R, set
options(keep.source.pkgs=TRUE), then install the package from source
code. Read ?findLineNum for details on how to tell it to search
within packages, rather than limiting the search to the global
environment.
For example,
x <- " f <- function(a, b) {
if (a > b) {
a
} else {
b
}
}"
eval(parse(text=x)) # Normally you'd use source() to read a file...
findLineNum("<text>#3") # <text> is a dummy filename used by
parse(text=)
This will print
f step 2,3,2 in <environment: R_GlobalEnv>
and you can use
setBreakpoint("<text>#3")
to set a breakpoint there.
There are still some limitations (and probably bugs) in the code; I'll
be fixing those.
You do it by setting
options(show.error.locations = TRUE)
I just wonder why this setting is not the default in R; it should be, as it is in every other language.
Specifying the global R option for handling non-catastrophic errors worked for me, along with a customized workflow for retaining info about the error and examining this info after the failure. I am currently running R version 3.4.1.
Below, I've included a description of the workflow that worked for me, as well as some code I used to set the global error handling option in R.
As I have it configured, the error handling also creates an RData file containing all objects in working memory at the time of the error. This dump can be read back into R using load() and then the various environments as they existed at the time of the error can be inspected interactively using debugger(errorDump).
I will note that I was able to get line numbers in the traceback() output from any custom functions within the stack, but only if I used the keep.source=TRUE option when calling source() for any custom functions used in my script. Without this option, setting the global error handling option as below sent the full output of the traceback() to an error log named error.log, but line numbers were not available.
Here are the general steps I took in my workflow and how I was able to access the memory dump and error log after a non-interactive R failure.
I put the following at the top of the main script I was calling from the command line. This sets the global error handling option for the R session. My main script was called myMainScript.R. The various lines in the code have comments after them describing what they do. Basically, with this option, when R encounters an error that triggers stop(), it will create an RData (*.rda) dump file of working memory across all active environments in the directory ~/myUsername/directoryForDump and will also write an error log named error.log with some useful information to the same directory. You can modify this snippet to add other handling on error (e.g., add a timestamp to the dump file and error log filenames, etc.).
options(error = quote({
setwd('~/myUsername/directoryForDump'); # Set working directory where you want the dump to go, since dump.frames() doesn't seem to accept absolute file paths.
dump.frames("errorDump", to.file=TRUE, include.GlobalEnv=TRUE); # First dump to file; this dump is not accessible by the R session.
sink(file="error.log"); # Specify sink file to redirect all output.
dump.frames(); # Dump again to be able to retrieve error message and write to error log; this dump is accessible by the R session since not dumped to file.
cat(attr(last.dump,"error.message")); # Print error message to file, along with simplified stack trace.
cat('\nTraceback:');
cat('\n');
traceback(2); # Print full traceback of function calls with all parameters. The 2 passed to traceback omits the outermost two function calls.
sink();
q()}))
Make sure that from the main script and any subsequent function calls, anytime a function is sourced, the option keep.source=TRUE is used. That is, to source a function, you would use source('~/path/to/myFunction.R', keep.source=TRUE). This is required for the traceback() output to contain line numbers. It looks like you may also be able to set this option globally using options(keep.source=TRUE), but I have not tested this to see if it works. If you don't need line numbers, you can omit this option.
From the terminal (outside R), call the main script in batch mode using Rscript myMainScript.R. This starts a new non-interactive R session and runs the script myMainScript.R. The code snippet given in step 1 that has been placed at the top of myMainScript.R sets the error handling option for the non-interactive R session.
Encounter an error somewhere within the execution of myMainScript.R. This may be in the main script itself, or nested several functions deep. When the error is encountered, handling will be performed as specified in step 1, and the R session will terminate.
An RData dump file named errorDump.rda and an error log named error.log are created in the directory specified by '~/myUsername/directoryForDump' in the global error handling option setting.
At your leisure, inspect error.log to review information about the error, including the error message itself and the full stack trace leading to the error. Here's an example of the log that's generated on error; note the numbers after the # character are the line numbers of the error at various points in the call stack:
Error in callNonExistFunc() : could not find function "callNonExistFunc"
Calls: test_multi_commodity_flow_cmd -> getExtendedConfigDF -> extendConfigDF
Traceback:
3: extendConfigDF(info_df, data_dir = user_dir, dlevel = dlevel) at test_multi_commodity_flow.R#304
2: getExtendedConfigDF(config_file_path, out_dir, dlevel) at test_multi_commodity_flow.R#352
1: test_multi_commodity_flow_cmd(config_file_path = config_file_path,
spot_file_path = spot_file_path, forward_file_path = forward_file_path,
data_dir = "../", user_dir = "Output", sim_type = "spot",
sim_scheme = "shape", sim_gran = "hourly", sim_adjust = "raw",
nsim = 5, start_date = "2017-07-01", end_date = "2017-12-31",
compute_averages = opt$compute_averages, compute_shapes = opt$compute_shapes,
overwrite = opt$overwrite, nmonths = opt$nmonths, forward_regime = opt$fregime,
ltfv_ratio = opt$ltfv_ratio, method = opt$method, dlevel = 0)
At your leisure, you may load errorDump.rda into an interactive R session using load('~/path/to/errorDump.rda'). Once loaded, call debugger(errorDump) to browse all R objects in memory in any of the active environments. See the R help on debugger() for more info.
This workflow is enormously helpful when running R in some type of production environment where you have non-interactive R sessions being initiated at the command line and you want information retained about unexpected errors. The ability to dump memory to a file you can use to inspect working memory at the time of the error, along with having the line numbers of the error in the call stack, facilitate speedy post-mortem debugging of what caused the error.
First run options(show.error.locations = TRUE) and then traceback(). The error line number will be displayed after the # character.
