I am loading about 200k text files in Spark using input = sc.wholeTextFiles("hdfs://path/*")
I then run a println(input.count)
It turns out that my spark shell outputs a ton of text (the path of every file), and after a while it just hangs without returning my result.
I believe this may be due to the amount of text output by wholeTextFiles. Do you know of any way to run this command silently? Or is there a better workaround?
Thanks!
How large are your files?
From the wholeTextFiles API:
Small files are preferred, large files are also allowable, but may
cause bad performance.
In conf/log4j.properties, you can suppress excessive logging, like this:
# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
That way, you'll get back only the res result in the REPL, just like in the plain Scala (the language) REPL.
Here are all the other logging levels you can play with: log4j API.
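If you'd rather not touch log4j.properties at all, newer Spark releases (1.4 and later) also let you set the log level from the running shell session; the same one-liner works in both the Scala and Python shells:

    sc.setLogLevel("ERROR")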
I've been scratching my head over this issue. It seems that JMeter won't write to one of my empty CSV files and throws this error, but another CSV file in the same thread doesn't replicate the error and writes properly.
This is how I call my CSV and how I print it.
I don't see anything in your partial screenshot which could cause the error. If you need further assistance, consider:
Adding a Debug Sampler to visualize all the variables which are being used in the script
Adding the full code as text, not as a screenshot, so the issue can be reproduced.
Going forward, it might be a better idea to use e.g. the Flexible File Writer instead of Groovy for writing the variables into the file: if you run your script with 2+ threads, you might run into a race condition with several threads concurrently writing into the file, which will most probably result in data corruption/loss.
I am currently reindexing Elasticsearch indices in a shell script using curl to do REST API calls.
To improve performance, I am running this script with xargs over 10 processes.
However, all of the processes are writing into a single log, making it useless/incomprehensible, since the output is written asynchronously and ends up interleaved and unordered.
The client would like to see the reindexing progress in the logs (i.e. they want to know that 50/100 indices are done when they look at the log), and we would like to keep the curl output for debugging purposes.
What are some ways I can make comprehensible logs?
You could try specifying a separate log file for each process, then concatenating the logs at the end.
But if you really want them all to access the same logfile, then I'm afraid your only way to do that is with a lock, which is not easy with bash/xargs.
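Just to illustrate the per-process-log pattern, here's a minimal Python sketch (your pipeline uses curl and xargs, so treat the worker function, file names and counts below as placeholders rather than a drop-in):

    import glob
    import multiprocessing
    import os

    def reindex(index_name):
        # Each worker process appends to its own log file (named by pid),
        # so concurrent workers never interleave writes inside one file.
        with open("worker-%d.log" % os.getpid(), "a") as log:
            log.write("starting %s\n" % index_name)
            # ... the actual curl / REST call would go here ...
            log.write("finished %s\n" % index_name)

    if __name__ == "__main__":
        indices = ["index-%d" % i for i in range(100)]  # placeholder list
        pool = multiprocessing.Pool(processes=10)
        pool.map(reindex, indices)
        pool.close()
        pool.join()

        # Afterwards, stitch the per-worker logs into one readable log.
        with open("reindex-full.log", "w") as full:
            for part in sorted(glob.glob("worker-*.log")):
                with open(part) as f:
                    full.write(f.read())

The same idea carries over to the xargs version: have each invocation redirect its own output to a file derived from its PID or index, then cat the pieces together once everything has finished.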
I want to check whether several files exist in HDFS before loading them with SparkContext.
I use pyspark. I tried
os.system("hadoop fs -test -e %s" %path)
but as I have a lot of paths to check, the job crashed.
I also tried sc.wholeTextFiles(parent_path) and then filtering by keys, but it crashed as well because the parent_path contains a lot of sub-paths and files.
Could you help me?
It's as Tristan Reid says:
...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check whether a particular file exists.
I hope it can help somehow.
Have you tried using pydoop? The exists function should work
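Something like this minimal sketch, assuming pydoop is installed and exposes the exists check under pydoop.hdfs.path (the paths below are placeholders):

    import pydoop.hdfs as hdfs

    # Placeholder paths; replace with the HDFS paths you actually want to load.
    paths = ["hdfs:///data/part-00000", "hdfs:///data/part-00001"]
    existing = [p for p in paths if hdfs.path.exists(p)]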
One possibility is that you can use hadoop fs -lsr your_path to get all the paths, and then check if the paths you're interested in are in that set.
Regarding your crash, it's possible it was a result of all the calls to os.system, rather than being specific to the hadoop command. Sometimes calling an external process can result in issues related to buffers that are never getting released, in particular I/O buffers (stdin/stdout).
One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths in the script, write it, then execute.
It may also be a good idea to switch to Python's subprocess module, which gives you more granular control over handling subprocesses. Here's a rough equivalent of the os.system call:
from subprocess import PIPE, Popen

process = Popen(
    args=your_script,
    stdout=PIPE,
    shell=True
)
output, errors = process.communicate()
Note that you can switch stdout to something like a file handle if that helps you with debugging or making the process more robust. Also you can switch that shell=True argument to False unless you're going to call an actual script or use shell-specific things like pipes or redirection.
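Putting the listing idea and the subprocess idea together, here's a rough sketch that makes a single external call and then checks membership in Python (the parent path and file names are placeholders; on newer Hadoop versions the recursive listing is spelled hadoop fs -ls -R):

    import subprocess

    # One external call to list everything under the parent directory,
    # instead of one hadoop call per path.
    listing = subprocess.check_output(
        "hadoop fs -lsr /your/parent/path",
        shell=True
    ).decode("utf-8")

    # The path is the last whitespace-separated field on each listing line.
    existing = set(line.split()[-1] for line in listing.splitlines() if line.strip())

    wanted = ["/your/parent/path/a.txt", "/your/parent/path/b.txt"]  # placeholders
    missing = [p for p in wanted if p not in existing]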
I have close to a million files over which I want to run a shell script and append the result to a single file.
For example suppose I just want to run wc on the files.
So that it runs fast, I can parallelize it with xargs. But I do not want the scripts to step over each other when writing the output. It is probably better to write to a few separate files rather than one and then cat them later, but I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is that always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but I have to deal with the temporaries and clean up. So I was wondering if there is a utility which does that.
GNU parallel claims that it:
makes sure output from the commands is the same output as you would get had you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.
Use the -k option to maintain the order of the output.
Update: (non-Perl solution)
Another alternative would be prll, which is implemented with shell functions with some C extensions. It is less feature-rich compared to GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.
so it should meet your needs as long as the order of output is not important.
However, note the following statement on its page:
prll generates a lot of status information on STDERR which makes it harder to use the STDERR output of the job directly as input for another program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.
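If you end up driving this from Python instead of the shell, the same pattern (run a command per file in parallel, but let a single writer append the results in order) can be sketched with concurrent.futures; the file list and output name below are placeholders:

    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def run_wc(path):
        # Run wc on one file and return its output as text.
        return subprocess.check_output(["wc", path]).decode("utf-8")

    if __name__ == "__main__":
        files = ["a.txt", "b.txt"]  # placeholder for the real list of files
        with ProcessPoolExecutor(max_workers=8) as pool:
            # map() yields results in submission order, so a single writer can
            # append them to one output file with no locking and no interleaving.
            with open("wc-results.txt", "w") as out:
                for line in pool.map(run_wc, files):
                    out.write(line)

Because map() returns results in submission order, the output file reads as if the commands had been run sequentially, which is essentially the guarantee GNU parallel advertises.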
I am running a batch job that has been going for many many hours, and the log file it is generating is increasing in size very fast and I am worried about disk space.
Is there any way through the command line, or otherwise, that I could hollow out that text file (set its contents back to nothing) with the utility still having a handle on the file?
I do not wish to stop the job and am only looking to free up disk space via this file.
I'm on Vista, 64 bit.
Thanks for the help,
Well, it depends on how the job actually works. If it's a good little boy and it pipes its log info out to stdout or stderr, you could redirect the output to a program that you write, which could then write the contents out to disk and manage the sizes.
If you have access to the job's code, you could essentially tell it to close the file after each write (hopefully it's an append) operation, and then you would have a timeslice in which you could actually wipe the file.
If you don't have either one, it's going to be a bit tough. If someone has an open handle to the file, there's not much you can do, IMO, without asking the developer of the application to find you a better solution, or just plain clearing out disk space.
Depends on how it is writing the log file. You cannot just delete the start of the file, because the file handle has an offset of where to write next. It will still be writing at 100 MB into the file even though you just deleted the first 50 MB.
You could try renaming the file and hoping it just creates a new one. This is usually how rolling logs work.
You can use a rolling log class, which will wrap the regular file class but silently seek back to the beginning of the file when the file reaches a maximum designated size.
It is a very simple wrapper; either write it yourself or try finding an implementation online.
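As a rough sketch of that idea in Python (the class name and size cap are made up; it wraps a plain file object and rewinds once the cap is reached):

    class RollingLogFile(object):
        """Sketch of a size-capped writer: once the file grows past max_bytes,
        writing silently continues from the beginning of the file."""

        def __init__(self, path, max_bytes=10 * 1024 * 1024):
            self._f = open(path, "w")
            self._max_bytes = max_bytes

        def write(self, text):
            if self._f.tell() + len(text) > self._max_bytes:
                self._f.seek(0)  # wrap around instead of growing the file
            self._f.write(text)
            self._f.flush()

        def close(self):
            self._f.close()

The trade-off is that the newest lines end up somewhere in the middle of the file, so renaming/rotating the log, as suggested above, usually stays easier to read.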