Migrating 50TB data from local Hadoop cluster to Google Cloud Storage - google-api

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.
I have explored GSUtil and it seems that it is the recommended option to move big data sets to GCS. It seems that it can handle huge datasets. It seems though that GSUtil can only move data from Local machine to GCS or S3<->GCS, however cannot move data from local Hadoop cluster.
What is a recommended way of moving data from local Hadoop cluster to GCS ?
In case of GSUtil, can it directly move data from local Hadoop cluster(HDFS) to GCS or do first need to copy files on machine running GSUtil and then transfer to GCS?
What are the pros and cons of using Google Client Side (Java API) libraries vs GSUtil?
Thanks a lot,

Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.
Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:
cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/
You may need to also add the following to your hadoop/conf/hadoop-env.sh file if youre running 0.20.x:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar
Then, you'll likely want to use service-account "keyfile" authentication since you're on an on-premise Hadoop cluster. Visit your cloud.google.com/console, find APIs & auth on the left-hand-side, click Credentials, if you don't already have one click Create new Client ID, select Service account before clicking Create client id, and then for now, the connector requires a ".p12" type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g:
cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12
Add the following entries to your core-site.xml file in your Hadoop conf dir:
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>your-ascii-google-project-id</value>
</property>
<property>
<name>fs.gs.system.bucket</name>
<value>some-bucket-your-project-owns</value>
</property>
<property>
<name>fs.gs.working.dir</name>
<value>/</value>
</property>
<property>
<name>fs.gs.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>your-service-account-email#developer.gserviceaccount.com</value>
</property>
<property>
<name>fs.gs.auth.service.account.keyfile</name>
<value>/path/to/hadoop/conf/gcskey.p12</value>
</property>
The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files, you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
Question 2: While the GCS connector for Hadoop is able to copy direct from HDFS without ever needing an extra disk buffer, GSUtil cannot since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or as you said, GCS/S3 files.
Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc, but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open-source, you can see what kinds of things it takes to make it work smoothly here in its source code on GitHub : https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java

Look like few property names are changed in recent versions.
`String serviceAccount = "service-account#test.gserviceaccount.com";
String keyfile = "/path/to/local/keyfile.p12";
hadoopConfiguration.set("google.cloud.auth.service.account.enable", true);
hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);`

Related

How can I use NiFi to read/write directly from ADLS without HDInsight

We would like to use NiFi to connect with ADLS (using PutHDFS and FetchHDFS) without having to install HDInsight. Subsequently we want to use Azure DataBricks to run Spark jobs, and hoping that it can be done using NiFi's ExecuteSparkInteractive processor. From all the examples I could find, invariably HDP or HDInsight seem to be required.
Can anyone share the pointers how it can be done without needing HDP or HDInsight?
Thanks in advance.
As far as I can tell, ADLS won't work well (or work at all) with *HDFS processors available in Apache NiFi. There was a feature request made - NIFI-4360 and a subsequent PR raised for the same - #2158 but it was briefly reviewed but now not much progress is there. You can fork that or copy pasta that code-base and hopefully review it.
I did a test-setup more than a year ago. The PutHDFS processor worked with some additional classpath resources. The following dependencies have been required:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar
hadoop-azure-datalake-2.0.0-SNAPSHOT.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
See also the following Blog for more details. You can copy the libs, the core-site.xml and hdfs-site.xml from an HDInsight setup to the machine where NiFi is running. You also should set the dfs.adls.home.mountpoint properly, directing to root or a data directory. Be aware that this is not officially supported, so phps. you should also consider Azure Data Factory or StreamSets as an option for Data Ingest.
PutHDFS does not expect a classic hadoop cluster in the first place. It expects core-site.xml only for conventional reasons. As you will see in the below example a minimalist config file to have PutHDFS work with ADLS.
Using NiFi PutHDFS processor to ingress into ADLS is simple. Below steps will lead to the solution
Have ADLS Gen1 set up(ADLS has been renamed as ADLS Gen1)
Additionally have OAUTH authentication set up for your ADLS account. See here
Create an empty core-site.xml for configuring PuHDFS processor
Update core-site.xml with the following properties(I am using Client keys mode for auth in this example)
fs.defaultFS = adl://<yourADLname>.azuredatalakestore.net
fs.adl.oauth2.access.token.provider.type = ClientCredential
fs.adl.oauth2.refresh.url = <Your Azure refresh endpoint>
fs.adl.oauth2.client.id = <Your application id>
fs.adl.oauth2.credential = <Your key>
Update your NiFi PutHDFS processor to refer to the core-site.xml and additional ADLS libraries(hadoop-azure-datalake-3.1.1.jar and azure-data-lake-store-sdk-2.3.1.jar) created in previous step as shown below.
Update the upstream processors and test.

How to renew a delegation token for a long running applications besides the time set in the hadoop cluster

I have an Apache Apex application which runs on my Hadoop environment.
I have no problem with the application except that, it is failing after 7days. And, i realized that it is because of the cluster level setting for any application.
Is there any way, i can renew the delegation token perodically at some interval to ensure job runs continously without failing!!
I could find any resources online for on how to renew a hdfs delegation tokens!! Can someone please share your knowledge ?
The problem is mentioned in the Apex documentation.
Also it offers 2 solution in detail. Non-intrusive for the Hadoop system would be to choose the 'Auto-refresh approach'.
Basically you need to copy your keytab file into HDFS and configure
<property>
<name>dt.authentication.store.keytab</name>
<value>hdfs-path-to-keytab-file</value>
</property>
in your dt-site.xml.
HTH

synchronization issues about hadoop federation

I have some questions about hadoop federation.
As far as I know, it has multiple masters(namenode) running at same time.
So my question is that if a client has a request, how to determine which master to serve the request from client.
Another question is that whether the metadata stored in every master is concurrent with each other or not.
If the data in masters is concurrent, while two clients have requests at same time at two different master, how to deal with the synchronization issues.
Hope I make my question clear.
I only read web on apache hadoop. Any material and tutorial are very grateful.
And comment and correction are very appreciated.
Using client side mount tables we can map file paths to namenodes (core-site.xml configuration below)
<property>
<name>fs.viewfs.mounttable.default.link./namenode1</name>
<value>hdfs://namenode1:9001/home</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./namenode2</name>
<value>hdfs://namenode2:9001/home</value>
</property>}
example during put operation we can specify path and request will go to namenode1
bin/hadoop fs -put file.txt /namenode1/input
In HDFS Federation each namenode manages its own metadata .

Deploying hdfs core-site.xml with cloudera manager

I'm trying to add lzo support to my configuration files using the cloudera manager (CDH5b2).
If I add the io.compression.codecs to the service-wide hdfs configuration, and deploy the configuration file, /etc/hadoop/conf.cloudera.hdfs/core-site.xml now contains the new value.
However, /etc/hadoop/conf.cloudera.yarn/core-site.xml has a higher priority (update-alternatives --display hadoop-conf), the hdfs core-site.xml values are not used when I start a MR job.
Obviously, I can simply modify the yarn core-site.xml file manually, but I don't understand how to do deploy the hdfs core-site.xml file properly using cloudera manager.
There is a MapReduce Client Environment Safety Valve, also known as 'MapReduce Service Advanced Configuration Snippet (Safety Valve) for core-site.xml' found in the gui under mapreduce's configuration ->Service-Wide->Advanced will allow you to add any value that doesn't fit elsewhere. (There is also one for core-site.xml as well.)
Having said that, details can be found on Cloudera's site at: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_cdh5_install.html

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a hadoop application on a local system.
As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.
The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.
The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.
So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.
Has anyone tried this before?
Can it work or is this idea much too far off the intended use?
Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?
Thanks for your insights.
EDIT 2:
This is the config I created for hadoop 0.18.3
conf/hadoop-site.xml using the answer provided by bajafresh4life.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:33301</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>localhost:33302</value>
<description>
The job tracker http server address and port the server will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>localhost:33303</value>
<description>
The task tracker http server address and port.
If the port is 0 then the server will start on a free port.
</description>
</property>
</configuration>
Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.
Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh . You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.

Resources