Data Ingestion Into HDFS by unique technique - hadoop

I want to transfer unstructured and semi-structured data (MS Word/PDF/JSON) from a remote computer into Hadoop. The transfer could be batch or near real-time, but not streaming.
I have to make sure the data moves quickly from the remote location into HDFS or onto my local machine, and I am working with low bandwidth.
For example, Internet Download Manager has this amazing technique of opening several connections to the FTP server, making better use of a low-bandwidth link by using more connections.
Does the Hadoop ecosystem provide such a tool to ingest data into Hadoop, or is there a self-made technique?
Which tool or technique would be better?

You could use the WebHDFS API: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Document_Conventions
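To make the "several connections" idea from the question concrete, here is a minimal sketch of splitting a file into byte ranges that separate connections could fetch in parallel. It only computes the ranges and builds the request pieces; it does not open any connection. The host name, port and path are made-up placeholders, and whether your remote server honours HTTP Range requests is an assumption you would have to check. WebHDFS itself accepts offset and length parameters on its OPEN operation, which is what open_url uses.

```python
from urllib.parse import quote, urlencode


def split_ranges(file_size, parts):
    """Split [0, file_size) into at most `parts` contiguous (offset, length) pairs."""
    if file_size <= 0 or parts <= 0:
        return []
    chunk = -(-file_size // parts)  # ceiling division
    return [(off, min(chunk, file_size - off)) for off in range(0, file_size, chunk)]


def range_header(offset, length):
    """HTTP Range header value for a generic HTTP source (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def open_url(namenode, path, offset, length):
    """WebHDFS OPEN URL for one slice; `namenode` is a hypothetical 'host:port'."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return f"http://{namenode}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Hypothetical 10-byte file fetched over 3 connections.
    for offset, length in split_ranges(10, 3):
        print(range_header(offset, length),
              open_url("nn.example:9870", "/data/report.pdf", offset, length))
```

Each slice could then be handed to its own worker thread or process. More connections only help when a single connection is throttled or latency-bound; they do not add bandwidth.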

Related

Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for fetching data from HDFS to a local system. StreamSets was suggested to us, but no one on the team is familiar with it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?
Just an additional question.
I have set up StreamSets locally, for example on local IP xxx.xx.x.xx:18630, and it works fine on that machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications such as Shiny Server work fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question, Data Collector listens on all addresses by default. There is a http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, *, in front of the 18630 in the output means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
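If you want to test reachability from the other machine without a browser, a plain TCP connect is enough to separate "nothing is listening or a firewall is blocking" from an application-level problem. This is a minimal sketch using only the standard library; the address in the usage lines is a placeholder, and 18630 is the port from the question.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the machine running Data Collector.
    print(can_connect("192.0.2.10", 18630))
```

If this returns False from the remote machine but True on the machine running Data Collector, look at the firewall or the VM/Docker network config rather than at Data Collector itself.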
I believe by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an NFS Gateway. Or you can probably use StreamSets to write to FTP / NFS.
It's not clear what data you're trying to get, but many BI tools can perform CSV exports, and Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is a minimalist way to get data from HDFS to local. However, Hadoop typically stores many TB worth of data in the ideal cases, and if you're working with anything smaller, dumping those results into a database is typically a better option than moving around flat files.

Import live data from server to hdfs instantly?

Is it possible to load a live feed from a server into HDFS? The live feed data arrives via the server, and I have to load it onto HDFS immediately, without any delay.
There are lots of technologies that consume data in real time (or near real time) and have write connectors to HDFS. The ones I have used are:
Flume
NiFi
StreamSets

Hdfs put VS webhdfs

I'm loading a 28 GB file into Hadoop HDFS using WebHDFS, and it takes ~25 minutes to load.
I tried loading the same file using hdfs put, and it took ~6 minutes. Why is there so much difference in performance?
Which is recommended? If somebody can explain or point me to a good link, it would be really helpful.
Below is the command I'm using:
curl -i --negotiate -u: -X PUT "http://$hostname:$port/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"
This redirects to a DataNode address, which I use in the next step to write the data.
Hadoop provides several ways of accessing HDFS.
All of the following support almost all features of the filesystem:
1. FileSystem (FS) shell commands: Provide easy access to Hadoop file system operations as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS.
This needs the Hadoop client to be installed and involves the client writing blocks directly to one DataNode. Not all versions of Hadoop support all options for copying between filesystems.
2. WebHDFS: Defines a public HTTP REST API, which permits clients to access Hadoop from multiple languages without installing Hadoop. The advantage is that it is language agnostic (curl, PHP, etc.).
WebHDFS needs access to all nodes of the cluster, and when some data is read, it is transmitted from the source node directly, but **there is an overhead of HTTP** compared with (1) FS Shell. It works agnostically, though, with no problems across different Hadoop clusters and versions.
3. HttpFS: Read and write data to HDFS in a cluster behind a firewall. A single node acts as a gateway node through which all the data is transferred. Performance-wise I believe this can be even slower, but it is preferred when you need to pull data from a public source into a secured cluster.
So choose wisely! Going down the list will always be an alternative when the choice above it is not available to you.
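For reference, the two-step WebHDFS write from the question (CREATE on the NameNode, then PUT of the data to the redirected DataNode) can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not a drop-in replacement for the curl command: it uses simple user.name authentication instead of the Kerberos --negotiate from the question, and the host, port and paths in the usage lines are placeholders.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def create_url_path(hdfs_path, user, overwrite=True):
    """Path and query string for the WebHDFS CREATE call on the NameNode."""
    query = urlencode({
        "op": "CREATE",
        "overwrite": str(overwrite).lower(),
        "user.name": user,  # assumption: simple auth, not Kerberos
    })
    return f"/webhdfs/v1{quote(hdfs_path)}?{query}"


def upload(namenode_host, namenode_port, local_path, hdfs_path, user):
    """Two-step upload; returns the HTTP status of the DataNode PUT."""
    # Step 1: ask the NameNode where to write. No file data is sent here.
    nn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=30)
    try:
        nn.request("PUT", create_url_path(hdfs_path, user))
        resp = nn.getresponse()
        resp.read()
        if resp.status != 307:
            raise RuntimeError(f"expected 307 redirect, got {resp.status}")
        location = resp.getheader("Location")
    finally:
        nn.close()

    # Step 2: stream the file body to the DataNode named in the redirect.
    target = urlsplit(location)
    dn = http.client.HTTPConnection(target.hostname, target.port, timeout=30)
    try:
        with open(local_path, "rb") as body:
            dn.request(
                "PUT",
                f"{target.path}?{target.query}",
                body=body,  # sent in blocks, not loaded into memory at once
                headers={"Content-Length": str(os.path.getsize(local_path))},
            )
        resp = dn.getresponse()
        resp.read()
        return resp.status
    finally:
        dn.close()


if __name__ == "__main__":
    # Placeholder cluster address and paths.
    print(upload("nn.example", 9870, "big.bin", "/tmp/big.bin", "alice"))
```

Note that the whole file goes over a single HTTP connection to a single DataNode here, which is one place where the overhead relative to hdfs put can come from.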
Hadoop provides a FileSystem Shell API to support file system operations such as creating, renaming, or deleting files and directories, and opening, reading, or writing files.
The FileSystem shell is a Java application that uses the Java FileSystem class to provide FileSystem operations. The FileSystem Shell API creates an RPC connection for the operations.
If the client is within the Hadoop cluster, this is useful because it uses the hdfs URI scheme to connect to the Hadoop distributed FileSystem, so the client makes a direct RPC connection to write data into HDFS.
This is good for applications running within the Hadoop cluster, but there may be use cases where an external application needs to manipulate HDFS, for example to create directories and write files to them, or to read the content of a file stored on HDFS. Hortonworks developed an API called WebHDFS, based on standard REST functionality, to support these requirements.
WebHDFS provides REST API functionality through which any external application can connect to the DistributedFileSystem over an HTTP connection, no matter whether the external application is written in Java or PHP.
The WebHDFS concept is based on HTTP operations like GET, PUT, POST and DELETE.
Operations like OPEN, GETFILESTATUS and LISTSTATUS use HTTP GET; others like CREATE, MKDIRS, RENAME and SETPERMISSION rely on HTTP PUT.
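The mapping between WebHDFS operations and HTTP methods can be written down as a small lookup table. The sketch below only builds the method and URL for a request and sends nothing; the operation names follow the WebHDFS REST API (where the permission operation is spelled SETPERMISSION), and the host and port in the example are placeholders.

```python
from urllib.parse import quote, urlencode

# HTTP method used by each WebHDFS operation (a subset, for illustration).
HTTP_METHOD_BY_OP = {
    "OPEN": "GET",
    "GETFILESTATUS": "GET",
    "LISTSTATUS": "GET",
    "CREATE": "PUT",
    "MKDIRS": "PUT",
    "RENAME": "PUT",
    "SETPERMISSION": "PUT",
    "APPEND": "POST",
    "DELETE": "DELETE",
}


def build_request(host_port, path, op, **params):
    """Return (http_method, url) for a WebHDFS operation on `path`."""
    op = op.upper()
    method = HTTP_METHOD_BY_OP[op]  # KeyError for operations not in the table
    query = urlencode({"op": op, **params})
    return method, f"http://{host_port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Placeholder NameNode address.
    print(build_request("nn.example:9870", "/tmp/dir", "mkdirs", permission="755"))
    print(build_request("nn.example:9870", "/tmp/dir", "liststatus"))
```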
WebHDFS provides secure read-write access to HDFS over HTTP. It is basically intended as a replacement for HFTP (read-only access over HTTP) and HSFTP (read-only access over HTTPS). It uses the webhdfs URI scheme to connect to the distributed file system.
If the client is outside the Hadoop cluster and trying to access HDFS, WebHDFS is useful. Also, if you are trying to connect two different versions of Hadoop clusters, WebHDFS is useful because it uses a REST API and is therefore independent of the MapReduce or HDFS version.
The difference between HDFS access and WebHDFS is scalability, due to the design of HDFS and the fact that an HDFS client decomposes a file into splits living on different nodes. When an HDFS client accesses file content, under the covers it goes to the NameNode and gets a list of file splits and their physical locations on the Hadoop cluster.
It can then go to the DataNodes living at all those locations to fetch the blocks in the splits in parallel, piping the content directly to the client.
WebHDFS is a proxy living in the HDFS cluster and it layers on HDFS, so all data needs to be streamed to the proxy before it gets relayed on to the WebHDFS client. In essence it becomes a single point of access and an IO bottleneck.
You can use the traditional Java client API (which is used internally by the hdfs Linux commands).
From what I have read here, the Java client and the REST-based approach have similar performance.

Hadoop - streaming data from HTTP upload (PUT) into HDFS directly

I have the following application deployment:
a web front-end taking data from clients through HTTP/FTP
a Hadoop cluster
I need to store the clients' data on HDFS. What is the best way of doing that? Is it possible to stream data to HDFS directly, without first saving all of the client's data to a local drive and then putting it into HDFS?
The feasible options I can think of right now are:
HttpFS
WebHDFS
FTP client over HDFS
HDFS over WebDAV
Choosing the "best" one is totally up to you, based on your convenience and ease.
Personally, if you want low-latency access to HDFS, your best bet is HBase. You can put and get values very easily since it is just a key-value store. We are using the same thing in our application(s) and it works fabulously.

Download a file from HDFS cluster

I am developing an API for using HDFS as distributed file storage. I have made a REST API that allows a server to mkdir, ls, create and delete files in the HDFS cluster using WebHDFS. But since WebHDFS does not support downloading a file, are there any solutions for achieving this? I mean, I have a server that runs my REST API and communicates with the cluster. I know the OPEN operation just supports reading a text file's content, but suppose I have a file which is 300 MB in size: how can I download it from the HDFS cluster? Do you have any possible solutions? I was thinking of directly pinging the DataNodes for a file, but this solution is flawed, because a 300 MB file would put a huge load on my proxy server. So is there a streaming API to achieve this?
As an alternative you could make use of streamFile provided by DataNode API.
wget http://$datanode:50075/streamFile/demofile.txt
It won't read the file as a whole, so the burden will be low, IMHO. I have tried it, but only on a pseudo-distributed setup, and it works fine. You can give it a try on your fully distributed setup and see if it helps.
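If the server itself has to fetch the file, the same idea (never hold the whole file in memory) can be sketched in Python by copying the HTTP response to disk in fixed-size chunks. This uses only the standard library; the URL in the usage lines is the streamFile endpoint from the wget example above with a placeholder host, and whether your Hadoop version exposes that endpoint is not something this sketch checks.

```python
import os
import shutil
from urllib.request import urlopen


def stream_to_file(url, dest_path, chunk_size=1024 * 1024):
    """Copy the response body to dest_path chunk by chunk; return bytes written."""
    with urlopen(url) as resp, open(dest_path, "wb") as out:
        # copyfileobj reads at most chunk_size bytes at a time.
        shutil.copyfileobj(resp, out, chunk_size)
    return os.path.getsize(dest_path)


if __name__ == "__main__":
    # Placeholder DataNode host; path as in the wget example.
    print(stream_to_file("http://datanode.example:50075/streamFile/demofile.txt",
                         "demofile.txt"))
```

Memory use stays around chunk_size regardless of whether the file is 300 MB or much larger. Disk space on the receiving side is still a limit.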
One way that comes to mind is to use a proxy worker that reads the file using the Hadoop FileSystem API and creates a normal local file, and then provide a download link to this file. The downsides are:
Scalability of the proxy server
Files may theoretically be too large to fit on the disk of a single proxy server.
