What jar does SparkR use to parse R syntax?

I am a newbie to SparkR, so I would like to ask: what jar does SparkR use to parse R syntax?

I hope I understood you correctly and that your assumption is that R code gets translated into Java code. My short answer is that SparkR uses no jar to parse R syntax.
The longer explanation:
SparkR is written in R. When you create a sparkR.session(), a local Spark driver JVM is started and a socket connection is established between your R instance and that JVM. SparkR functions are more or less wrappers around Spark SQL methods written in Scala, which live on the Spark side and are called over the socket connection. The function arguments, or whole functions, are serialized in a custom format, sent to the JVM, and eventually the Spark executors execute the (deserialized) R code in an R instance used by the Spark cluster.
The serialization and deserialization code is located in /R/pkg/R/serialize.R and /R/pkg/R/deserialize.R of your Spark distribution. R/pkg/inst/worker/worker.R executes the R code in the Spark cluster.
An older architecture overview with more details can be found here:
https://cs.stanford.edu/~matei/papers/2016/sigmod_sparkr.pdf
There are two kinds of RPCs we support in the SparkR JVM backend: method invocation and creating new objects. Method invocations are called using a reference to an existing Java object (or class name for static methods) and a list of arguments to be passed on to the method. The arguments are serialized using our custom wire format which is then deserialized on the JVM side. We then use Java reflection to invoke the appropriate method. In order to create objects, we use a special method name init and then similarly invoke the appropriate constructor based on the provided arguments. Finally, we use a new R class 'jobj' that refers to a Java object existing in the backend. These references are tracked on the Java side and are automatically garbage collected when they go out of scope on the R side.
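As a rough illustration of the reflection-based "method invocation" RPC described above, here is a minimal, hypothetical Java sketch; the ReflectiveInvoker class and its setup are illustrative assumptions, not the actual SparkR backend code.

import java.lang.reflect.Method;

// Hypothetical sketch of the "method invocation" RPC: given a target object
// (or a class name for static methods), a method name, and already-deserialized
// arguments, look the method up via reflection and invoke it.
public class ReflectiveInvoker {

    public static Object invoke(Object target, String methodName, Object... args)
            throws Exception {
        Class<?>[] argTypes = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            argTypes[i] = args[i].getClass();
        }
        // Resolve the method on the target's class and call it reflectively.
        Method method = target.getClass().getMethod(methodName, argTypes);
        return method.invoke(target, args);
    }

    public static void main(String[] args) throws Exception {
        // Pretend an R client asked the backend to call concat("SparkR") on a String.
        Object result = invoke("Hello, ", "concat", "SparkR");
        System.out.println(result); // prints "Hello, SparkR"
    }
}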

Related

CacheLoader is not getting called while trying to find an entity using GemfireRepository

CacheLoader is not getting called while trying to find an entity using GemfireRepository.
As a workaround, I am using Region<K,V> for the lookup, which does call the CacheLoader. So I wanted to know whether there is any restriction in the Spring Data Repository that prevents the CacheLoader from being called when the entry is not present in the cache.
Also, is there any alternative? I have one more scenario where my cache key is a combination of id1 & id2 and I want to get all entries based on id1; if there is no entry present in the cache, it should call the CacheLoader to load all entries from the Cassandra store.
There are no limitations or restrictions in SDG when using the SD Repository abstraction (and SDG's Repository extension) that would prevent a CacheLoader from being invoked, so long as the CacheLoader was properly registered on the target Region. Once control is handed over to GemFire/Geode to complete the data access operation (CRUD), it is out of SDG's hands.
However, you should know that GemFire/Geode only invokes CacheLoaders on gets (i.e. Region.get(key) operations), never on (OQL) queries. OQL queries are invoked from derived query methods or custom, user-defined query methods using @Query-annotated methods declared in the application Repository interface.
NOTE: See Apache Geode CacheLoader Javadoc and User Guide for more details.
For a simple CrudRepository.findById(key) call, the call stack follows from...
SimpleGemfireRepository.findById(key)
GemfireTemplate.get(key)
And then, Region.get(key).
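As a minimal sketch (not from the original answer), here is what a CacheLoader registered on the target Region might look like, assuming the Apache Geode API; the Long/String key/value types and the backing map are placeholders standing in for a real external store such as Cassandra.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

// Hypothetical loader: Geode invokes load(...) only on Region.get(key) misses,
// which is exactly the path CrudRepository.findById(key) goes down; OQL queries
// never trigger it.
public class ExternalStoreCacheLoader implements CacheLoader<Long, String> {

    // Placeholder for the real external data source (e.g. Cassandra).
    private final Map<Long, String> externalStore = new ConcurrentHashMap<>();

    @Override
    public String load(LoaderHelper<Long, String> helper) throws CacheLoaderException {
        return externalStore.get(helper.getKey());
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }
}

The loader still has to be registered on the Region (for example via RegionFactory.setCacheLoader or the corresponding SDG region factory bean) for the fall-through to happen.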
By way of example, and to illustrate this behavior, I added the o.s.d.g.repository.sample.RepositoryDataAccessOnRegionUsingCacheLoaderIntegrationTests to the SDG test suite as part of DATAGEODE-308. You can provide additional feedback in this JIRA ticket, if necessary.
Cheers!

How does the Context object work in Hadoop? [duplicate]

What exactly is this keyword Context in the Hadoop MapReduce world, in new-API terms?
It's extensively used to write output pairs out of Maps and Reduces, but I am not sure whether it can be used somewhere else and what exactly happens whenever I use context. Is it an Iterator with a different name?
What is the relation between the classes Mapper.Context, Reducer.Context, and Job.Context?
Can someone please explain this, starting in layman's terms and then going into detail? I am not able to understand much from the Hadoop API documentation.
Thanks for your time and help.
The Context object allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
to update Counters
to indicate they are alive
to get values stored in the job configuration across the map/reduce phases
The new API makes extensive use of Context objects that allow user code to communicate with the MapReduce system. Context unifies the roles of JobConf, OutputCollector, and Reporter from the old API.
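As a rough illustration (not part of the original answer), here is a minimal Mapper sketch showing the typical uses of Context listed above; the counter names and the configuration key are made-up examples.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a Mapper that uses Context to read job configuration,
// emit output pairs, and update an application-level counter.
public class WordLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private int minLength;

    @Override
    protected void setup(Context context) {
        // Read a value stored in the job configuration (the key is hypothetical).
        minLength = context.getConfiguration().getInt("wordlength.min", 1);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (word.length() >= minLength) {
                // Emit an output pair to the MapReduce framework.
                context.write(new Text(word), new IntWritable(word.length()));
            } else {
                // Update an application-level counter (names are illustrative).
                context.getCounter("WordLengthMapper", "SKIPPED_WORDS").increment(1);
            }
        }
    }
}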

Mixed Write for GSON

I am using GSON's object model access to construct JSON to be used as the body of my POST web service calls in JMeter.
Now I frequently encounter a GC out-of-memory exception, with the error pointing to the code section gson.toJson(objectToSerialize).
Past posts suggested using GSON serialization with the streaming access model.
My current code does this: it creates an object of a class by populating its variables, passes the object to the GSON serializer, gets back the constructed JSON as a string, and uses it.
Could the experts suggest whether there is a way to integrate the streaming access model into my code without much rework? Would this be memory-efficient?
PS: I took a look at the mixed-writes example in this link but could not figure out how to construct JSON by passing one object of the class, as we do in the object model:
https://sites.google.com/site/gson/streaming
Thank you!
Why don't you just use these variables in the "Body Data" mode of the HTTP Request sampler?
If your JSON payload is large you may have to amend the Java heap size, as the default allocation is just 512 MB and may not be enough for a more or less large load. If you don't have enough free RAM to fit JSON data size * number of virtual users, you may have to consider Distributed Testing.
Another possibility is that you are using a not-very-efficient scripting test element. It is recommended to use JSR223 Test Elements with Groovy as the language, as the other options do not perform as well.
See the Beanshell vs JSR223 vs Java JMeter Scripting: The Performance-Off You've Been Waiting For! guide for more information.
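For completeness, here is a minimal, hypothetical sketch of the mixed-writes approach the question links to, in case streaming is still wanted; the Payload class and the OutputStream source are assumptions, and the memory benefit only materializes if you can write to a stream instead of building one large String.

import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.google.gson.Gson;
import com.google.gson.stream.JsonWriter;

// Stream the outer JSON structure with JsonWriter and delegate each element
// to Gson's object model, instead of accumulating one large String.
public class MixedWriteExample {

    static class Payload { // hypothetical POJO
        String name;
        int value;
        Payload(String name, int value) { this.name = name; this.value = value; }
    }

    public static void writePayloads(OutputStream out, List<Payload> payloads) throws IOException {
        Gson gson = new Gson();
        try (JsonWriter writer = new JsonWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.beginArray();
            for (Payload payload : payloads) {
                // Serialize each object directly into the stream.
                gson.toJson(payload, Payload.class, writer);
            }
            writer.endArray();
        }
    }

    public static void main(String[] args) throws IOException {
        writePayloads(System.out, List.of(new Payload("a", 1), new Payload("b", 2)));
    }
}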

I am trying to get Hadoop pseudo-cluster details using the API in Java

I am learning Hadoop.
I am trying to get my pseudo Hadoop cluster details, like name node, data nodes, live nodes, dead nodes, etc., from Java code using the Hadoop API, i.e. whatever we can see on the HTTP port 50070.
For this I have tried using the FSNamesystem class, but I am not able to call its constructor, as it is private to a different package. So I am wondering whether I can somehow do dependency injection for the NameNode and Configuration classes, which are the arguments of this constructor, so that I can use it as shown below:
FSNamesystem f = FSNamesystem.getFSNamesystem();
f.getLiveNode() and all the remaining methods
Can anyone suggest how I can implement this?
Thanks in advance.
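A hedged sketch, under assumptions: rather than constructing the NameNode-internal FSNamesystem, client code can usually read a datanode report through DistributedFileSystem; the fs.defaultFS address (hdfs://localhost:9000) is an assumed pseudo-distributed setting, not something from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Prints per-datanode statistics, similar to what the NameNode web UI
// (port 50070) shows for live nodes.
public class ClusterReport {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed address

        try (FileSystem fs = FileSystem.get(conf)) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName()
                        + " capacity=" + dn.getCapacity()
                        + " used=" + dn.getDfsUsed()
                        + " remaining=" + dn.getRemaining());
            }
        }
    }
}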
