Difference between client and executor in dask

Executor is the primary entry point for users of distributed. Similarly, Client is the primary entry point for users of dask.distributed.
So both seem identical.
In dask, can both be used interchangeably?
If so, what is the use case for using either Executor or Client?

The short and simple answer: the executor is a deprecated term for the thing that is now known as the client. Refer only to the client. Your documentation page is for version 1.9, but distributed is now at version 1.22.

Related

What is the keyword Context in the Hadoop programming world?

What exactly is this keyword Context in the Hadoop MapReduce world, in terms of the new API?
It's used extensively to write output pairs out of Maps and Reduces, but I am not sure whether it can be used anywhere else or what exactly happens whenever I use context. Is it an Iterator with a different name?
What is the relation between the classes Mapper.Context, Reducer.Context, and Job.Context?
Can someone please explain this, starting in layman's terms and then going into detail? I'm not able to understand much from the Hadoop API documentation.
Thanks for your time and help.
The Context object allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
to update Counters
to indicate they are alive
to get the values that are stored in the job configuration across the map/reduce phases.
The new API makes extensive use of Context objects that allow the user code to communicate with the MapReduce system.
It unifies the roles of JobConf, OutputCollector, and Reporter from the old API.
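For concreteness, here is a rough sketch of how a Mapper typically uses its Context in the new org.apache.hadoop.mapreduce API; the class name, the wordcount.lowercase setting, and the counter names are illustrative only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Read a job-level setting (the role JobConf played in the old API).
        Configuration conf = context.getConfiguration();
        boolean lowerCase = conf.getBoolean("wordcount.lowercase", true);

        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(lowerCase ? token.toLowerCase() : token);

            // Emit an output pair (the role OutputCollector played in the old API).
            context.write(word, ONE);

            // Update a counter (the role Reporter played in the old API).
            context.getCounter("wordcount", "tokens").increment(1);
        }
        // Report status so the framework knows the task is alive.
        context.setStatus("processed input offset " + key.get());
    }
}

The same pattern applies in a Reducer: Reducer.Context exposes the same configuration, counter, and write methods, just typed for the reduce side.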

How to restore bolt state during failover

I'm trying to figure out how to restore the state of a Storm bolt instance during failover. I can persist the state externally (DB or file system), however once the bolt instance is restarted I need to point to the specific state of that bolt instance to recover it.
The prepare method of a bolt receives a context, documented here http://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
What is not clear to me is this: is there any piece of this context that uniquely identifies the specific bolt instance, so I can work out which persistent state to point to? Is that ID preserved during failover? Alternatively, is there any variable/object I can set for the specific bolt/instance that is preserved during failover? Any help appreciated!
br
Sib
P.S.
New to Stack Overflow, so please bear with me...
You can probably look at Trident. It's basically an abstraction built on top of Storm. The documentation says:
Trident has first-class abstractions for reading from and writing to stateful sources. The state can either be internal to the topology – e.g., kept in-memory and backed by HDFS – or externally stored in a database like Memcached or Cassandra
In case of any failover, it says:
Trident manages state in a fault-tolerant way so that state updates are idempotent in the face of retries and failures.
You can go through the documentation for any further clarification.
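To give a feel for what that looks like, here is a rough sketch adapted from the Trident word-count tutorial; the spout contents are dummy data and MemoryMapState is the in-memory test state (a real topology would plug in a Memcached- or Cassandra-backed StateFactory instead):

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentWordCount {

    // Splits a sentence tuple into one word tuple per token.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split("\\s+")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentTopology build() {
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        // persistentAggregate keeps the running counts in Trident-managed state;
        // Trident makes the updates idempotent across replays using transaction ids.
        TridentState wordCounts = topology.newStream("spout1", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));
        // wordCounts could later be queried with stateQuery from a DRPC stream.
        return topology;
    }
}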
Thanks (and credit) to the Storm user group!
http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3C74083558E4509844944FF5CF2BA7B20F1060FD0E#ESESSMB305.ericsson.se%3E
In the original Storm, both spouts and bolts are stateless. Storm manages to restart nodes, but it requires some effort to restore the states of the nodes. There are two solutions I can think of:
If a message fails to process, Storm will replay it from the ROOT of the topology, and the replay logic has to be implemented by the user. So in this case I would put more state information (e.g. the ID of some external state storage and the ID of this task) in the messages.
Or you can use Trident. It provides a txid for each transaction and simplifies the storage process.
I'm OK with the first solution because my app doesn't require transactional operations and I have a better understanding of the original Storm (Storm generates simpler logs than Trident does).
You can use the task ID.
Task IDs are assigned at topology creation and are static. If a task dies/restarts or gets reassigned somewhere else, it will still have the same ID.
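Putting that together with external persistence, here is a minimal sketch; the StateStore is a hypothetical stand-in for whatever DB or file-system persistence you use, and only getThisComponentId()/getThisTaskId() come from the actual TopologyContext API:

import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class StatefulBolt extends BaseRichBolt {

    // Hypothetical external store; in practice this would hit a DB or the file system.
    static class StateStore {
        static Map<String, Long> load(String key) { return new HashMap<String, Long>(); }
        static void save(String key, Map<String, Long> state) { /* persist externally */ }
    }

    private OutputCollector collector;
    private Map<String, Long> state;
    private String stateKey;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Task IDs are stable across restarts/reassignment, so component id + task id
        // makes a usable key for locating this instance's persisted state.
        this.stateKey = context.getThisComponentId() + "-" + context.getThisTaskId();
        this.state = StateStore.load(stateKey);
    }

    @Override
    public void execute(Tuple tuple) {
        // ... update this.state here and periodically StateStore.save(stateKey, state) ...
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This sketch emits nothing downstream.
    }
}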

Read issue in MongoDB asynchronous replication

I'm new to MongoDB. I created a Java app using MongoDB as the database.
I configured 3 servers in a replica set.
My pseudocode:
{
createUser
getUser
updateUser
}
Here createUser creates the user successfully, but getUser sometimes fails to return that user.
When I analysed it, the cause was data replication latency.
How can I overcome this issue?
Is there any way to replicate data immediately when it is created?
Is there any other way to get the user without fail?
Thanks in advance!
If you are certain that the issue is due to replication latency, one thing you can do is make sure your writes are safe by using the w flag. That way, MongoDB will wait until data has replicated to at least n nodes before returning. You can do this from the client driver as well.
MongoDB getLastError
Are you reading with slaveOk=true? If you read from the replica set primary, this shouldn't be an issue either.
The slaveOk property is now known as ReadPreference (.SECONDARY in this case) in newer Mongo Java driver versions. This can be set at the Mongo/DB/Collection level. Note that when you set ReadPreference at these levels, it applies for all callers (i.e. these objects are shared across threads).
Another approach is to try the ReadPreference.SECONDARY and if it fails, try without it and go to the master. This logic can be isolated to your repository layer, so the service layer doesn't have to deal with it. If you are doing this, you may want to set the ReadPreference at the DBQuery object, which is on a per-use basis.
I am not familiar with the Java driver, but there are w and j options.
The w option confirms that write operations have replicated to the specified number of replica set members, including the primary.
The j option confirms the write operation only after it has been written to the journal.
It looks like you need to use WriteConcern.
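To tie the suggestions above together, here is a minimal sketch against the legacy com.mongodb Java driver API; the collection and field names are illustrative. It writes with a majority WriteConcern and reads from the primary, so a read issued right after the write will not hit a lagging secondary:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;

public class UserRepository {

    private final DBCollection users;

    public UserRepository(MongoClient mongo) {
        DB db = mongo.getDB("app");
        this.users = db.getCollection("users");
        // Wait until the write has replicated to a majority of replica set
        // members before returning, so a follow-up read can see it.
        this.users.setWriteConcern(WriteConcern.MAJORITY);
        // Read from the primary to sidestep replication lag entirely.
        this.users.setReadPreference(ReadPreference.primary());
    }

    public void createUser(String id, String name) {
        users.insert(new BasicDBObject("_id", id).append("name", name));
    }

    public DBObject getUser(String id) {
        return users.findOne(new BasicDBObject("_id", id));
    }
}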

Hadoop map/reduce chaining

I want to chain 2 Map/Reduce jobs. I am trying to use JobControl to achieve this. My problem is:
JobControl needs org.apache.hadoop.mapred.jobcontrol.Job, which in turn needs org.apache.hadoop.mapred.JobConf, which is deprecated. How do I get around this problem to chain my Map/Reduce jobs?
Does anyone have any better ideas for chaining (other than Cascading)?
You could use Riffle; it allows you to chain arbitrary processes together (anything you stick its Annotations on).
It has a rudimentary dependency scheduler, so it will order and execute your jobs for you. And it's Apache licensed. It's also on the Conjars repo if you're a Maven user.
I'm the author, and wrote it so Mahout and other custom applications would be able to have a common tool that was also compatible with Cascading Flows.
I'm also the author of Cascading. But MapReduceFlow + Cascade in Cascading works quite well for most raw MR job chaining.
Cloudera has a workflow tool called Oozie that can help with this sort of chaining. Might be overkill for just getting one job to run after another.
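On the JobConf issue itself: depending on your Hadoop version, a new-API job control package (org.apache.hadoop.mapreduce.lib.jobcontrol) may be available, which chains org.apache.hadoop.mapreduce.Job instances directly. A rough sketch, assuming a version that provides Job.getInstance and that package (e.g. 2.x), with mapper/reducer setup and paths left as placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job first = Job.getInstance(conf, "first-pass");
        // first.setJarByClass(...), setMapperClass(...), setReducerClass(...), etc.
        FileInputFormat.addInputPath(first, new Path(args[0]));
        FileOutputFormat.setOutputPath(first, new Path(args[1]));

        Job second = Job.getInstance(conf, "second-pass");
        // second.setJarByClass(...), setMapperClass(...), setReducerClass(...), etc.
        FileInputFormat.addInputPath(second, new Path(args[1]));
        FileOutputFormat.setOutputPath(second, new Path(args[2]));

        ControlledJob firstJob = new ControlledJob(first, null);
        ControlledJob secondJob = new ControlledJob(second, null);
        secondJob.addDependingJob(firstJob); // second runs only after first succeeds

        JobControl control = new JobControl("chain");
        control.addJob(firstJob);
        control.addJob(secondJob);

        // JobControl is a Runnable; drive it from a thread and poll until done.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}

If all you need is one job after another, simply calling first.waitForCompletion(true) and then submitting the second job also works.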
