mclapply and spark_read_parquet - sparklyr

I am relatively new as an active user on this forum, but I first have to thank you all for your contributions, because I have been looking up answers here for years...
Today I have a question that nobody seems to have solved, or at least I have not been able to find an answer to...
I am trying to read files in parallel from S3 (AWS) into Spark (on my local computer) as part of a test system. I have used mclapply, but when I set more than 1 core, it fails...
Example: (the same code works when using one core, but fails when using 2)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 1)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 2)
Warning message:
In mclapply(seq(file_paths), function(i) { :
all scheduled cores encountered errors in user code
Any suggestions???
Thanks in advance.

Just read everything into one table via a single spark_read_parquet() call; that way Spark handles the parallelization for you. If you need separate tables you can split them afterwards, assuming there's a column that tells you which file the data came from. In general you shouldn't need mclapply() when using Spark with R.
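For example, here is a rough sparklyr sketch of that idea (untested; the S3 path, table names, and the filter value are placeholders, and input_file_name() is a Spark SQL built-in that sparklyr passes through):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # One call covering all the files: Spark parallelizes the read itself.
    all_data <- spark_read_parquet(
      sc,
      name = "all_data",
      path = "s3a://my-bucket/my-prefix/"   # a directory (or glob) containing every parquet file
    )

    # If you still need per-file tables afterwards, tag each row with its
    # source file and filter on that column.
    tagged   <- all_data %>% mutate(source_file = input_file_name())
    one_file <- tagged %>% filter(source_file == "s3a://my-bucket/my-prefix/part-0001.parquet")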

Related

Is there a way to parallelize spark.read.load(string*) when reading many files?

I noticed that in spark-shell (Spark 2.4.4), when I do a simple spark.read.format(xyz).load("a","b","c",...), it looks like Spark uses a single IPC client (or "thread") to load the files a, b, c, ... sequentially (they are paths on HDFS).
Is this expected?
The reason I am asking is, for my case, I am trying to load 50K files, and the sequential load takes a long time.
Thanks
PS: I am trying to find this in the source code, but I'm not sure whether this is the relevant place:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L180
This might not be an exact "answer" to my original question, but I found the reason in my particular case: the name node's audit log showed some runaway jobs pegging the name node, which greatly slowed down the RPC calls. After killing these bad jobs, Spark's load speed improved dramatically.

Hive query: Is there a way to use UDTF with `cluster by`?

I know I'm not supposed to use cluster by after a UDTF, so select myudtf("stringValue") cluster by rand() wouldn't work.
But my UDTF outputs 7000+ (and growing) rows every hour, so I really need to distribute the subsequent processing across all of my Hadoop cluster's slave nodes.
And I imagine I don't get that without using cluster by rand(), so I tried the following cheat:
First I wrap the result in another table, select key from (select myudtf("stringValue") as key) t limit 1; and it gives the correct result:
OK
some/key/value/string
Time taken: 0.035 seconds, Fetched: 1 row(s)
Then I add the cluster by part, select key from (select myudtf("stringValue") as key) t cluster by rand() limit 1, and I get an error:
WARNING: Hive-on-MR is deprecated in Hive ...
....
Task with the most failures(4):
-----
Task ID:
task_....
URL:
http:....
....
-----
Diagnostic Messages for this Task:
Error: tried to access class sun.security.ssl.SSLSessionContextImpl from class sun.security.ssl.SSLSessionContextImplConstructorAccess
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I did this to try to trick Hive into treating the subquery t as a "normal" table that I can apply cluster by to, hoping it would distribute the workload to all the Hadoop slaves, but unfortunately Hive is clever enough to see through my poorly attempted trick.
So, could someone please help me clarify my misconceptions, or give me a hint about the correct way to do this?
FYI, I asked a highly experienced engineer at my company for help. He thinks it may be a deeper system-level bug; he tried to trace the problem for twenty-something minutes before he left work and did find some library version issues, but couldn't fix it in the end. ...And I just guess it must be something I did wrong.
It turns out to be a mistake in my UDTF. I found a fix, but I don't quite understand why it works. When I was first implementing the UDTF, Eclipse suggested that initialize is deprecated, but I got an error if I skipped it, so I implemented it anyway. I put a variable initialization in that method, guessing the init would only be done once. The jar worked for some simpler scenarios, but if I used the UDTF output with a UDF, and then used the UDF output to do something else, like the cluster by cheat or an insert, I got the error mentioned above. An engineer friend of mine found out that initialize actually gets executed more than once. So I moved the initialization into process, with an if checking whether the variable is null, initializing it if it is. Then everything works fine, and my cheat also works. Still, if someone can give me a more specific explanation, I would be most grateful.
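Roughly, the pattern looks like this (a minimal, hypothetical GenericUDTF sketch; the class, field, and output names are made up, and the expensive state is built lazily in process() behind a null check instead of in initialize()):

    import java.util.Arrays;

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    public class MyUdtf extends GenericUDTF {

        // Placeholder for the expensive state that should only be built once per task.
        private transient StringBuilder buffer;

        @Override
        public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
            // initialize() may run more than once, so only describe the output
            // schema here; do not rely on it for one-time setup.
            return ObjectInspectorFactory.getStandardStructObjectInspector(
                    Arrays.asList("key"),
                    Arrays.asList((ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector));
        }

        @Override
        public void process(Object[] args) throws HiveException {
            // Lazy, guarded initialization: safe even when initialize() was called repeatedly.
            if (buffer == null) {
                buffer = new StringBuilder();
            }
            buffer.setLength(0);
            buffer.append("some/key/").append(args[0]);
            forward(new Object[] { buffer.toString() });
        }

        @Override
        public void close() throws HiveException {
            // nothing to clean up in this sketch
        }
    }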

Loading Data into the application from GUI using Ruby

Problem:
Hi everyone, I am currently building an automation suite using Ruby, Selenium WebDriver, and Cucumber to load data into the application through its GUI. I take input from mainframe .txt files. The scenarios are, for example, to create a customer and then load multiple accounts for them according to the data provided in the inputs.
Current Approach
Execute the scenario using the rake task, passing the line number as a parameter, so the script runs for only one set of data.
To read the data for a particular line, I'm using the code below:
File.readlines("#{file_path}")[line_number.to_i - 1]
My purpose in loading line by line is to keep the execution running even if one line fails to load.
Shortcomings
Suppose I have to load 10 accounts for a single customer. My current script will run 10 times, loading one account each time. I want something that can load the accounts in a single go.
What I am looking for
To overcome the above shortcoming, I want to capture all the data for a single customer from the file (accounts, etc.) and load it into the application in a single execution.
I also have to keep track of execution time and memory allocation.
Please share your thoughts on this approach; any suggestions or improvements are welcome. (Sorry for the long post.)
The first thing I'd do is break this down into steps -- as you said in your comment, but more formally here:
1) Get the data to apply to all records. Put up a page with the necessary information (or support command-line specification if not too much?).
2) For each line in the file, do the following (automated):
   - Get the web page for inputting its data;
   - Fill in the fields;
   - Submit the form.
Given this, I'd say the 'for each line' instruction should definitely be reading a line at a time from the file using File.foreach or similar.
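To sketch the grouping idea (untested; the field separator, layout, and the load_account step are placeholders you'd replace with the real mainframe format and your existing page-object steps):

    # Group every line in the file by customer, then load all of a customer's
    # accounts in one execution instead of one rake run per line.
    customer_accounts = Hash.new { |hash, key| hash[key] = [] }

    File.foreach(file_path) do |line|
      fields = line.chomp.split("|")           # placeholder separator
      customer_id = fields.first
      customer_accounts[customer_id] << fields.drop(1)
    end

    customer_accounts.each do |customer_id, accounts|
      accounts.each do |account|
        begin
          load_account(customer_id, account)   # hypothetical step that drives the GUI form
        rescue StandardError => e
          warn "Failed to load an account for #{customer_id}: #{e.message}"
          # keep going so one bad line doesn't stop the whole run
        end
      end
    end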
Is there anything beyond this that needs to be taken into account?

Hadoop Spark (MapR) - how does addFile work?

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files which I want to process with Spark.
The book Fast Data Processing with Spark says:
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this: will Spark create a copy of the file on each node?
What I want is for it to read the files present in a directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
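Putting it together, a sketch of using the distributed copy from inside a transformation might look like this (the input path is a placeholder; SparkFiles.get resolves the node-local path of the file that addFile shipped out):

    import com.maxmind.geoip.LookupService
    import org.apache.spark.SparkFiles

    sc.addFile("/path/to/GeoIP.dat")

    val countries = sc.textFile("hdfs:///path/to/ip-list")   // placeholder input, one IP per line
      .mapPartitions { ips =>
        // Resolve the node-local copy that addFile distributed and build the
        // lookup once per partition rather than once per record.
        val localPath = SparkFiles.get("GeoIP.dat")
        val geoIp = new LookupService(localPath, LookupService.GEOIP_MEMORY_CACHE)
        ips.map(ip => (ip, geoIp.getCountry(ip).getName))
      }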
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Hadoop Load and Store

When I try to run a Pig script which has two "store" statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
it hangs; I mean, it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running map-reduce programs, the output file is opened once, the data is written, and then it is closed. With this approach you cannot write data simultaneously to the same file.
Try writing to separate output locations and check whether the map-reduce programs still hang. If they do, then there are other issues.
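For example (the subdirectory names are just placeholders):

    store Alert_Message_Count into 'out/alerts';
    store Warning_Message_Count into 'out/warnings';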
You can then look at the results and the map-reduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file. The HDFS append feature is a work in progress.
To work around this you can do two things:
1) If you have the same schema content in both Alert_Message_Count and Warning_Message_Count, you could use union as suggested by Chris.
2) Do post-processing when the schemas are not the same, i.e. write a map-reduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess this isn't possible either (seeing as Pig translates the commands into a series of M/R steps), but I would expect some form of error message rather than it just hanging.
If you open the cluster's job tracker and look at the logs for the task, do they contain anything of note that could help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';
