Parallel Bilby: How to transfer generated data to a cluster? - cluster-computing

Doing inference with parallel bilby on a computer cluster requires me to run parallel_bilby_generation first.
Doing this on the cluster head is unfriendly as it blocks a lot of CPU.
I would prefer to do it on my home computer instead, where it is also much faster.
However, the file structure to provide in the config.ini file is different on my home computer than on the cluster.
Changing the respective filenames in the outdir/config_complete.ini before transfer apparently is not sufficient.
This does not really come as a surprise as the actual run command seems to only make use of the outdir/data/data_dump.pickle file.
That one is not human-readable though.
What should I do instead? How can I make a previously generated set-up work on a cluster?
Is there an alternative to running the parallel_bilby_generation directly on the cluster at all?

Related

Sync config files between nodes on hadoop cluster

I have a hadoop cluster consisting of 4 nodes on which I am running a pyspark script. I have a config.ini file which contains details like locations of certificates, passwords, server names etc which are needed by the script. Each time this file is updated I need to sync the changes across all 4 nodes. Is there a way to avoid that?
I have needed to sync or update changes to my script. Making them on just one node and running it from there is enough. Is the same possible for the config file?
The most secure answer is likely to learn how to use a keystore with spark.
A little less secure but still good. Have you considered you could just put the file in HDFS and then just reference it? (lower security but easier to use)
Unsecure methods that are easy to use:
You can also pass it as a file to spark-submit to transfer the file for you.
Or you could add the values to your spark submit.

Any performance difference between loading ML Module (Xquery / Javascript) from physical disk localation and loading from ML Module DB inside ML?

By default, ML HTTP server will use the Module DB inside ML.
(It seems all ML training materials refer to that type of configuration.)
Any changes in the XQuery programs will need to upload into the Module DB first. That could be accomplished by using mlLoadModules or mlReloadModules ml-gradle commands.
CI/CD does not access the ML cluster directly. Everything is via ml-gradle from a machine dedicated from code deployment to different ML enviroments like dev/uat/prod etc.
However it is also possible to configure the ML app server to use the XQuery program from physical disk location like below screenshot.
With that configuration, it is not required to reload the programs into ML Module DB.
The changes in the program have to be in the ML server itself. CI/CD will need to access to the ML cluster directly. One advantage of this way is that developer will easily see whether the changes in the program have been indeed deployed, as all changes are sitting as physical readable text files in the disk.
Questions:
Which way is better? Why?
Any ML query perforemance difference between these two different approaches?
For the physical file approach, does it mean that CI/CD will need to deploy the program changes to all the ML hosts in the ML cluster? (I guess it is not a concern if HTTP server refers XQuery programs from Module DB inside ML. ML cluster will auto sync the code among different hosts.)
In general, it's recommended to deploy modules to a database rather than the filesystem.
This makes deployment more simple and easy, as you only have to load the module once into the modules database, rather than putting the file on every single host. If you use the filesystem, then you need to put those files on every host in the cluster.
With a modules database, if you were to add nodes to the cluster, you don't have to also deploy the modules. You can then also take advantage of High Availability, backup and restore, and all the other features of a database.
Once a module is read, it is loaded into caches, so the performance impact should be negligible.
If you plan to use REST extensions, then you would need a modules database so that the configurations can be installed in that database.
Some might look to use filesystem for simple development on a single node, in which changes saved to the filesystem are made available without re-deploying. However, you could use something like the ml-gradle mlWatch task to auto-deploy modules as they are modified on the filesystem and achieve effectively the same thing using a modules database.

How to archive data stored in HDFS files on another (non-distributed) server?

I have a project folder containing approx. 50 GB of parquet files on a hadoop cluster (CDH 5.14), which I need to archive and move to another host (non-distributed with Windows or Linux). This is only a one time job - I do not plan to bring the data back to HDFS any time soon, however there should be a way to deploy it back to a distributed file system. What would be the optimal way to do it? Unfortunately, I don't have another hadoop cluster or a cloud environment where I could place this data.
I would appreciate any hints.
The optimal solution can depend on the actual data (e.g. Tables, many/few flat files). If you know how they got in there, looking at the inverse could be a logical first step.
For example, if you just use put to place the files, consider using get.
If you use Nifi to get it in, try Nifi to get it out.
After the data is on your Linux box, you can use SCP or something like FTP or a mounted drive to move it to the desired computer.

HDFS configuration & what is the user directory for?

I am currently "playing around" with Hadoop in a VM (CDH4.1.3 image from cloudera). What I am wondering about is the following (and the documentation did not really help me in that regard).
Following the tutorial, I would format a NameNode first - OK, that is already done if one uses the cloudera image. Likewise the HDFS file structure is already present. In the hdfs-site.xml the datanode data dir is set to:
/var/lib/hadoop-hdfs/cache/${user.name}/dfs/data
which is obviously where the blocks are supposed to be copied to in a real distributed setting. In the cloudera tutorial, one is told to create hdfs "home directories" for each user (/users/<username>), which I do not understand what they are for. Are they just for local test-runs in a single-node setup?
Say I really had petabytes of data on type not fitting into my local storage. This data would have to be distributed straight away, rendering a local "home directory" entirely useless.
Could someone tell me, just to give me an intuition, how a real Hadoop workflow with massive data would look like? What kind of distinct nodes would I have running for a start?
There's the master (JobTracker) with its slave file (where would I put that) allowing the master to resolve all the DataNodes. Then there is my NameNode that keeps track of where the block IDs are stored. The DataNodes are also carry TaskTracker responsibility. In the config files, the NameNode's URI is included -- am I correct so far? Then there is still the ${user.name} variable in the configuration which apparently, if I understood it right, has something to do with WebHDFS, which would also be great if someone could explain to me. In the running examples, the directions tend to be hardcoded to
/var/lib/hadoop-hdfs/cache/1/dfs/data, /var/lib/hadoop-hdfs/cache/2/dfs/data and so on.
So, back to the example: Say, I have my tape and want to import data into my HDFS (and I am required to stream data into the filesystem because I lack the local storage to save it locally on a single machine). Where would I start with the migration process? On an arbitrary DataNode? On the NameNode that distributes the chunks? After all, I cannot assume the data just to "be there", because the name node has to be aware of the block IDs.
It would be great if someone could shortly elaborate on these topics:
What is the home directory really for?
Do I migrate data to the home directory first and to the real distributed system afterwards?
How does WebHDFS work and what role does it play with regards to the user.name variable
How would I migrate "big data" into my HDFS on the fly - or even if it's not big data, how do I populate my file system in a proper way (meaning, that the chunks get randomly distributed across the cluster?
What is the home directory really for?
You have a small confusion here. Just like /home exists for local filesystems on Linux, where users are given their own storage space, /users is a home mount ON the HDFS (Distributed FS). The tutorial needs you to administratively create a home directory for the user you wish to later be running data loads and queries as, such that they get adequate permissions and storage access onto the HDFS. The tutorial is not asking you to create these directories locally.
Do I migrate data to the home directory first and to the real distributed system afterwards?
I believe my above answer should clarify this for you. You should create your home directory on the HDFS, and then load all your data inside of that directory.
How does WebHDFS work and what role does it play with regards to the user.name variable
WebHDFS is one of the various ways to access HDFS. Regular clients to talk to HDFS require use of Java APIs. WebHDFS (and also HttpFs) techniques were added to HDFS to let other languages have their own set of APIs by providing a REST front-end to HDFS. WebHDFS allows user-authentication, to help persist the permission and security models.
How would I migrate "big data" into my HDFS on the fly - or even if it's not big data, how do I populate my file system in a proper way (meaning, that the chunks get randomly distributed across the cluster?
The large part of problem HDFS solves for you is that of managing distribution of data. When loading files or data streams to HDFS (via CLI tools, sinks from Apache Flume, etc.), the blocks are spread in an ideal distribution by HDFS itself, and the chunking is managed by it as well. All you need to do is use the user-side regular FileSystem style APIs and forget about what goes where underneath - its all managed for you.

Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after a job is completed? If they are deleted, which i presume they are, is there a way to make the cache remain for multiple jobs? Does this work the same way on Amazon's Elastic Mapreduce?
I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute when their reference count drops to zero. The TaskRunner explicitly releases all its files at the end of a task. Maybe you should edit TaskRunner to not do this, and control the cache through more explicit means yourself?
I cross posted this question at the AWS forum and got a good recommendation to use hadoop fs -get to transfer files in a way that persists across jobs.

Resources