Load files from Google Cloud Storage to an on-premises Hadoop cluster

I am trying to load Google Cloud Storage files into an on-premises Hadoop cluster. I developed a workaround (a program) that downloads the files to a local edge node and then uses DistCp to copy them into Hadoop, but this two-step workaround is not very elegant. I have gone through a few websites (link1, link2) that describe using the Hadoop Google Cloud Storage connector for this, but that needs infrastructure-level configuration, which is not possible in all cases.
Is there any way to copy files directly from Cloud Storage to Hadoop programmatically, using Python or Java?

To do this programmatically, you can use the Cloud Storage API client libraries to download files from Cloud Storage and save them to HDFS.
However, it will be much simpler and easier to install the Cloud Storage connector on your on-premises Hadoop cluster and use DistCp to copy files from Cloud Storage to HDFS.
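If you do go the programmatic route, a minimal sketch could look like the following, assuming the google-cloud-storage Python client on the edge node and an hdfs command on the PATH; the bucket, prefix, and HDFS directory names are placeholders:

```python
import subprocess
from google.cloud import storage

# Placeholders: replace with your own bucket, prefix, and HDFS target directory.
BUCKET = "my-bucket"
PREFIX = "incoming/"
HDFS_DIR = "/data/landing"

client = storage.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if blob.name.endswith("/"):  # skip "directory" placeholder objects
        continue
    hdfs_path = f"{HDFS_DIR}/{blob.name.split('/')[-1]}"
    # Stream the object straight into HDFS via `hdfs dfs -put -` (stdin),
    # so nothing has to land on the edge node's local disk.
    put = subprocess.Popen(
        ["hdfs", "dfs", "-put", "-f", "-", hdfs_path],
        stdin=subprocess.PIPE,
    )
    blob.download_to_file(put.stdin)
    put.stdin.close()
    if put.wait() != 0:
        raise RuntimeError(f"hdfs put failed for {blob.name}")
```

Streaming through stdin avoids the intermediate copy on the edge node, which is the part of the original workaround that felt clumsy.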

Related

What is the difference between HDFS and ADLS?

I am confused about how Azure Data Lake Store is different from HDFS. Can anyone please explain it in simple terms?
HDFS is a file system. HDFS stands for Hadoop Distributed File System. It is part of the Apache Hadoop ecosystem. Read more on HDFS
ADLS is an Azure storage offering from Microsoft. ADLS stands for Azure Data Lake Storage. It provides distributed storage for bulk data processing needs.
ADLS has an internal distributed file system driver called the Azure Blob File System (ABFS). In addition, it provides a Hadoop-like file system interface (API) for addressing files and directories inside ADLS using a URI scheme. This makes it easier for applications built on HDFS to migrate to ADLS without code changes. Clients that access HDFS through the HDFS driver get a similar experience when accessing ADLS through the ABFS driver.
Azure Data Lake Storage Gen2 URI
The Hadoop Filesystem driver that is compatible with Azure Data Lake Storage Gen2 is known by its scheme identifier abfs (Azure Blob File System). Consistent with other Hadoop Filesystem drivers, the ABFS driver employs a URI format to address files and directories within a Data Lake Storage Gen2 capable account.
More on Azure Data Lake Storage
Hadoop compatible access: Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver is available within all Apache Hadoop environments, including Azure HDInsight, Azure Databricks, and Azure Synapse Analytics, to access data stored in Data Lake Storage Gen2.
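As a rough illustration of how little changes for client code, a Spark read against ADLS Gen2 only swaps the path for an abfs:// URI; the container, storage account, and file paths below are made up, and the ABFS driver and account credentials are assumed to be configured already:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abfs-uri-demo").getOrCreate()

# Hypothetical names - the point is only the URI shape:
#   abfs://<container>@<storage-account>.dfs.core.windows.net/<path>
# An HDFS path such as hdfs://namenode:8020/data/events/events.csv becomes:
adls_path = "abfs://mycontainer@mystorageacct.dfs.core.windows.net/data/events/events.csv"

# Once the ABFS driver and account credentials are configured, the same
# read works against either filesystem; only the URI changes.
df = spark.read.option("header", "true").csv(adls_path)
df.show(5)
```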
UPDATE
Also, read about the Hadoop Compatible File System (HCFS) specification, which ensures that a distributed file system's API (such as Azure Blob Storage's) meets a set of requirements for working with the Apache Hadoop ecosystem, similar to HDFS. More on HCFS
ADLS can be thought of as Microsoft-managed HDFS. Essentially, instead of setting up your own HDFS on Azure, you can use their managed service (without modifying any of your analytics or downstream code).

Kubernetes distributed filesystem

Well, my company is considering moving from Hadoop to Kubernetes. We can find Kubernetes solutions for tools such as Cassandra, Spark, etc. So the last problem for us is how to store a massive amount of files in Kubernetes, say 1 PB. FYI, we DO NOT want to use online storage services such as S3.
As far as I know, HDFS is rarely used in Kubernetes, and there are a few replacement products such as Torus and Quobyte. So my question is: any recommendation for a filesystem on Kubernetes? Or any better solution?
Many thanks.
You can use a Hadoop-compatible file system such as Ceph or MinIO, both of which offer S3-compatible REST APIs for reading and writing. In Kubernetes, Ceph can be deployed using the Rook project.
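For example, client code talking to the S3-compatible API looks the same whether the backend is MinIO or Ceph's RADOS Gateway; the endpoint, credentials, and bucket below are placeholders for your in-cluster deployment:

```python
import boto3

# Placeholders: point these at your in-cluster MinIO or Ceph RGW service.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.storage.svc.cluster.local:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload and download go through the same S3 API, just against your own
# self-hosted endpoint rather than an online storage service.
s3.upload_file("local/part-0000.parquet", "datalake", "events/part-0000.parquet")
s3.download_file("datalake", "events/part-0000.parquet", "local/copy.parquet")
```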
Overall, though, running HDFS in Kubernetes would require stateful services like the NameNode and DataNodes, with proper affinity and network rules in place. The Hadoop Ozone project is a recognition that object storage is more common for microservice workloads than HDFS block storage, since trying to analyze petabytes of data with distributed microservices wasn't really feasible. (I'm only speculating.)
The alternative is to use the Docker support in Hadoop & YARN 3.x.

How can I securely transfer my data from on-prem HDFS to Google Cloud Storage?

I have a bunch of data in an on-prem HDFS installation. I want to move some of it to Google Cloud (Cloud Storage) but I have a few concerns:
How do I actually move the data?
I am worried about moving it over the public internet
What is the best way to move data securely from my HDFS store to Cloud Storage?
To move data from an on-premises Hadoop cluster to Google Cloud Storage, you should probably use the Google Cloud Storage connector for Hadoop. You can install the connector on any cluster by following the install directions. As a note, Google Cloud Dataproc clusters have the connector installed by default.
Once the connector is installed, you can use DistCp to move the data from HDFS to Cloud Storage. This will transfer data over the (public) internet unless you have a dedicated interconnect set up with Google Cloud. To address that concern, you can route the transfer through a Squid proxy and configure the Cloud Storage connector to use it.
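As a sketch, the DistCp step itself is a single command once the connector is on the cluster; the bucket, paths, and proxy address below are placeholders, and the proxy property name should be checked against your connector version's documentation:

```python
import subprocess

# Placeholders: source HDFS directory, destination bucket, and proxy address.
SRC = "hdfs:///user/me/warehouse/events"
DST = "gs://my-backup-bucket/warehouse/events"
PROXY = "squid.internal.example.com:3128"  # only if you route through a proxy

subprocess.run(
    [
        "hadoop", "distcp",
        # Connector proxy setting (verify the property name for your connector
        # version); omit this flag if you have a direct or interconnect route.
        f"-Dfs.gs.proxy.address={PROXY}",
        SRC, DST,
    ],
    check=True,
)
```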

Downloading files from Google Cloud Storage straight into HDFS and Hive tables

I'm working on the Windows command line, as problems with Unix and firewalls prevent gsutil from working. I can read my Google Cloud Storage files and copy them over to other buckets (which I don't need to do). What I'm wondering is how to download them directly into HDFS (which I'm SSHing into). Has anyone done this? Ideally this is part one; part two is creating Hive tables for the Google Cloud Storage data so we can use HiveQL and Pig.
You can use the Google Cloud Storage connector, which provides an HDFS-compatible interface to your data already in Google Cloud Storage, so you don't even need to copy it anywhere; just read from and write directly to your Google Cloud Storage buckets/objects.
Once you set up the connector, you can also copy data between HDFS and Google Cloud Storage with the hdfs tool, if necessary.
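As a small sketch of that copy step (the bucket and paths are placeholders, and the connector must already be configured on the node you SSH into):

```python
import subprocess

# Placeholders: a Cloud Storage object and an HDFS destination directory.
SRC = "gs://my-bucket/raw/2021/01/events.json"
DST = "hdfs:///user/me/raw/2021/01/"

# With the connector configured, gs:// paths are just another Hadoop
# filesystem, so the ordinary copy command works in either direction.
subprocess.run(["hdfs", "dfs", "-cp", SRC, DST], check=True)
```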

Analytics for Apache Hadoop service - Could not load external files

The Bluemix Big Analytics tutorial mentions importing files, but when I launched BigSheets from the Bluemix Analytics for Apache Hadoop service, I could not see any option to load external files into a BigSheets workbook. Is there any other way to do it? Please help us proceed.
You would first upload your data to HDFS for your Analytics for Hadoop service using the webHDFS REST API, and then it should be available in BigSheets via the DFS Files tab shown in your screenshot.
The data you upload will be under /user/biblumix in HDFS, as this is the username you are provided when you create an Analytics for Hadoop service in Bluemix.
To use the webHDFS REST API, see these instructions.
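A rough sketch of such an upload using webHDFS's two-step CREATE flow from Python; the endpoint URL, file name, and credentials are placeholders for the values your service instance gives you (see the linked instructions for the real endpoint):

```python
import requests

# Placeholders: use the webHDFS endpoint and credentials that your
# Analytics for Hadoop service instance provides.
BASE = "https://your-cluster-host:8443/gateway/default/webhdfs/v1"
HDFS_PATH = "/user/biblumix/mydata.csv"
AUTH = ("biblumix", "your-password")

# Step 1: ask the NameNode to create the file; it answers with a redirect
# to the location that will actually receive the data.
r1 = requests.put(
    f"{BASE}{HDFS_PATH}",
    params={"op": "CREATE", "overwrite": "true"},
    auth=AUTH,
    allow_redirects=False,
)
location = r1.headers["Location"]

# Step 2: send the file contents to the redirected location.
# (If the service uses its own certificate, point requests' verify= at it.)
with open("mydata.csv", "rb") as f:
    r2 = requests.put(location, data=f, auth=AUTH)
r2.raise_for_status()  # expect 201 Created
```

After the upload, the file should show up under the DFS Files tab and can be brought into a BigSheets workbook from there.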
