The Bluemix Big Analytics tutorial mentions importing files, but when I launched BigSheets from the Bluemix Analytics for Apache Hadoop service, I could not see any option to load external files into a BigSheets workbook. Is there another way to do it? Please help us proceed.
You would first upload your data to HDFS for your Analytics for Hadoop service using the webHDFS REST API, and then it should be available to you in BigSheets via the DFS Files tab shown in your screenshot.
The data you upload will be under /user/biblumix in HDFS, as this is the username you are provided when you create an Analytics for Hadoop service in Bluemix.
To use the webHDFS REST API, see these instructions.
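For illustration, here is a minimal Python sketch of the two-step webHDFS CREATE call using the requests library. The host, port, gateway path, and credentials below are placeholders, not the actual values of your service; take the real endpoint and login from the credentials of your Analytics for Hadoop service.

    import requests

    # Placeholder endpoint and credentials -- substitute the values from your
    # Analytics for Hadoop service credentials in Bluemix.
    WEBHDFS_BASE = "https://your-cluster-host:8443/gateway/default/webhdfs/v1"
    AUTH = ("biblumix", "your-password")
    LOCAL_FILE = "sales.csv"
    HDFS_PATH = "/user/biblumix/sales.csv"

    # Step 1: issue CREATE without following the redirect; webHDFS replies
    # with the URL of the node that will actually receive the data.
    resp = requests.put(
        WEBHDFS_BASE + HDFS_PATH,
        params={"op": "CREATE", "overwrite": "true"},
        auth=AUTH,
        allow_redirects=False,
    )
    upload_url = resp.headers["Location"]

    # Step 2: send the file contents to the returned location.
    with open(LOCAL_FILE, "rb") as f:
        requests.put(upload_url, data=f, auth=AUTH).raise_for_status()

    print("Uploaded", LOCAL_FILE, "to", HDFS_PATH)

Once the file is under /user/biblumix, it should show up when you create a new workbook in BigSheets and browse the DFS Files tab.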
I am trying to load Google Cloud Storage files into an on-premises Hadoop cluster. I developed a workaround (a program) that downloads the files to a local edge node and then uses DistCp to copy them into Hadoop, but this is a two-step workaround and not very satisfying. I have gone through a few websites (link1, link2) that describe using the Hadoop Google Cloud Storage connector for this, but that requires infrastructure-level configuration, which is not possible in all cases.
Is there any way to copy files directly from Cloud Storage to Hadoop programmatically using Python or Java?
To do this programmatically, you can use the Cloud Storage API client libraries directly to download files from Cloud Storage and save them to HDFS.
But it will be much simpler and easier to install the Cloud Storage connector on your on-premises Hadoop cluster and use DistCp to download files from Cloud Storage to HDFS.
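As a rough sketch of the first (programmatic) option, the Python below assumes the google-cloud-storage client library, a service-account key exported via GOOGLE_APPLICATION_CREDENTIALS, and placeholder bucket/path names. It stages each object on the edge node and then pushes it into HDFS with the hdfs command:

    import subprocess
    from google.cloud import storage  # pip install google-cloud-storage

    # Placeholder names -- replace with your bucket, object prefix, and HDFS target.
    BUCKET = "my-bucket"
    PREFIX = "exports/"
    HDFS_TARGET = "/user/hadoop/imports/"

    client = storage.Client()  # reads GOOGLE_APPLICATION_CREDENTIALS for auth

    for blob in client.list_blobs(BUCKET, prefix=PREFIX):
        local_name = blob.name.split("/")[-1]
        if not local_name:
            continue  # skip "directory" placeholder objects
        blob.download_to_filename(local_name)  # Cloud Storage -> edge node
        subprocess.run(
            ["hdfs", "dfs", "-put", "-f", local_name, HDFS_TARGET],
            check=True,
        )  # edge node -> HDFS

Note that this still stages data on the edge node, which is why the connector-plus-DistCp route is usually preferable for large volumes.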
I have a bunch of data in an on-prem HDFS installation. I want to move some of it to Google Cloud (Cloud Storage) but I have a few concerns:
How do I actually move the data?
I am worried about moving it over the public internet
What is the best way to move data securely from my HDFS store to Cloud Storage?
To move data from an on-premises Hadoop cluster to Google Cloud Storage, you should probably use the Google Cloud Storage connector for Hadoop. You can install the connector in any cluster by following the install directions. As a note, Google Cloud Dataproc clusters have the connector installed by default.
Once the connector is installed, you can use DistCp to move the data from your HDFS to Cloud Storage. This will transfer the data over the (public) internet unless you have a dedicated interconnect set up with Google Cloud. To address the security concern, you can route the transfer through a Squid proxy and configure the Cloud Storage connector to use it.
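As a sketch of what the transfer step can look like once the connector is configured (the bucket and paths below are placeholders), DistCp can be driven from Python via a plain subprocess call:

    import subprocess

    # Placeholder paths -- adjust the HDFS source and the destination bucket.
    SRC = "hdfs://namenode:8020/data/warehouse"
    DST = "gs://my-archive-bucket/warehouse"

    # -update copies only files that are missing or changed on the destination,
    # which makes it safe to re-run the job after a partial transfer.
    subprocess.run(["hadoop", "distcp", "-update", SRC, DST], check=True)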
I am using Big SQL from Analytics for Apache Hadoop in Bluemix and would like to look into the logs in order to debug (e.g. the MapReduce job history log, usually available under http://my-mapreduce-server.com:19888/jobhistory, and bigsql.log from the Big SQL worker nodes).
Is there a way in Bluemix to access those logs?
Log files for most IOP components (e.g. the MapReduce Job History log and the Resource Manager log) are accessible from the Ambari console's Quick Links; just navigate to the respective service page. Log files for Big SQL are currently not available. Since the cluster is not hosted as Bluemix apps, the logs cannot be retrieved using the Bluemix cf command.
I'm working on the Windows command line, as problems with Unix and firewalls prevent gsutil from working. I can read my Google Cloud Storage files and copy them over to other buckets (which I don't need to do). What I'm wondering is how to download them directly into HDFS (which I'm SSHing into). Has anyone done this? Ideally this is part one; part two is creating Hive tables for the Google Cloud Storage data so we can use HiveQL and Pig.
You can use the Google Cloud Storage connector, which provides an HDFS-API-compatible interface to your data already in Google Cloud Storage, so you don't even need to copy it anywhere; you can just read from and write directly to your Google Cloud Storage buckets/objects.
Once you set up the connector, you can also copy data between HDFS and Google Cloud Storage with the hdfs command-line tool, if necessary.
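For example (the bucket and paths below are placeholders), once the connector is configured, gs:// paths can be addressed with the ordinary Hadoop filesystem commands, here driven from Python:

    import subprocess

    # Placeholder bucket and paths for illustration.
    BUCKET_PATH = "gs://my-bucket/raw/"
    HDFS_PATH = "/user/hive/warehouse/raw/"

    # gs:// behaves like any other Hadoop filesystem once the connector is set up.
    subprocess.run(["hadoop", "fs", "-ls", BUCKET_PATH], check=True)

    # Copy from Cloud Storage into HDFS only if you really need a local copy;
    # otherwise you can point Hive or Pig at the gs:// location directly.
    subprocess.run(["hadoop", "fs", "-cp", BUCKET_PATH + "*", HDFS_PATH], check=True)

For the Hive part of the question, an external table whose LOCATION is a gs:// path works the same way as one on HDFS, provided the connector is available on the nodes running Hive.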
Is it possible to connect my Hadoop cluster to multiple Google Cloud projects at once?
I can easily use any Google Storage bucket in a single Google project via the Google Cloud Storage Connector, as explained in this thread: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage. But I can't find any documentation or example of how to connect to two or more Google Cloud projects from a single map-reduce job. Do you have any suggestion/trick?
Thanks a lot.
Indeed, it is possible to connect your cluster to buckets from multiple different projects at once. Ultimately, if you're following the instructions for using a service-account keyfile, the GCS requests are performed on behalf of that service account, which can be treated more or less like any other user. You can either add the service account email your-service-account-email#developer.gserviceaccount.com to all the different cloud projects owning the buckets you want to process (using the permissions section of cloud.google.com/console and simply adding that email address like any other member), or you can set GCS-level access to add that service account like any other user.
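If you prefer the GCS-level route, here is a Python sketch (the bucket name and service-account email are placeholders) that grants the connector's service account read access to a single bucket owned by another project, using the google-cloud-storage client library:

    from google.cloud import storage  # pip install google-cloud-storage

    # Placeholder identifiers -- use the bucket from the other project and the
    # service-account email configured in your connector's keyfile settings.
    BUCKET = "bucket-in-other-project"
    MEMBER = "serviceAccount:your-service-account-email@developer.gserviceaccount.com"

    client = storage.Client()
    bucket = client.bucket(BUCKET)

    # Grant read access to this one bucket instead of adding the service
    # account as a member of the whole project.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {"role": "roles/storage.objectViewer", "members": {MEMBER}}
    )
    bucket.set_iam_policy(policy)

This per-bucket grant is the narrower of the two options; adding the service account as a project member gives it access to every bucket in that project.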