Apache Spark Streaming from folder (not HDFS) - hadoop

I was wondering if there is a reliable way to create Spark streams from a local (physical) directory rather than HDFS. I was using 'textFileStream', but it seems to be intended mainly for files in HDFS. The definition of the function says "Create an input stream that monitors a Hadoop-compatible filesystem".

Are you implying that HDFS is not a physical location? There are datanode directories that physically exist...
You should be able to use textFileStream with a file:// URI, but you need to ensure all nodes in the cluster can read from that location.
From the definition of a Hadoop-compatible filesystem:
The selection of which filesystem to use comes from the URI scheme used to refer to it: the prefix hdfs: on a file path means it refers to an HDFS filesystem; file: to the local filesystem; s3: to Amazon S3; ftp: to FTP; swift: to OpenStack Swift; and so on.
There are other filesystems that provide explicit integration with Hadoop through the relevant Java JAR files, native binaries, and configuration parameters needed to add a new scheme to Hadoop.
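The scheme-based selection described above can be illustrated with plain java.net.URI. This is only a hedged sketch of the idea, not Hadoop's actual resolution code (Hadoop maps each scheme to an implementation class via its configuration, and falls back to fs.defaultFS when a path has no scheme):

```java
import java.net.URI;

public class SchemeDemo {
    // Return the filesystem scheme a path would resolve to, defaulting to
    // "file" when no scheme is present (a simplification of Hadoop's
    // fs.defaultFS fallback).
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "file" : scheme;
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("hdfs://namenode:8020/data/input")); // hdfs
        System.out.println(schemeOf("file:///var/log/app"));             // file
        System.out.println(schemeOf("/var/log/app"));                    // file
    }
}
```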

Related

nifi putHDFS writes to local filesystem

Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and want to write to HDFS using NiFi.
On the NiFi cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates in success. However, rather than seeing a file dropped on my HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How to make sure PutHDFS writes to HDFS, rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect)
I don't use Kerberos on the HDFS cluster (I do use it on the NiFi cluster)
I did not see anything that looks like an error in the NiFi app log (which makes sense, as it successfully writes, just in the wrong place)
Both clusters were newly generated on Amazon AWS with Cloudbreak, and opening all nodes to all traffic did not help
Can you make sure that you are able to move a file from the NiFi node to Hadoop using the command below:
hadoop fs -put
If you are able to move your file using the above command, then you should check the Hadoop config files that you are passing to your PutHDFS processor.
Also, check that you don't have any other flow running, to make sure that no other flow is processing that file.
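One common cause of this symptom is that the core-site.xml handed to PutHDFS still has fs.defaultFS pointing at the local filesystem (file:///), in which case the processor "succeeds" against the local FS. A hedged example of what the relevant property should look like (the namenode host and port below are placeholders for your cluster):

```xml
<!-- core-site.xml passed to PutHDFS; host and port are placeholders -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://your-namenode-host:8020</value>
  </property>
</configuration>
```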

How to download Hadoop files (on HDFS) via FTP?

I would like to implement an SSIS job that is able to download large CSV files that are located on a remote Hadoop cluster. Of course, having just a regular FTP server on the Hadoop system does not expose HDFS files, since it uses the local filesystem.
I would like to know whether there is an FTP server implementation that sits on top of HDFS. I would prefer this approach rather than copying files from HDFS to the local FS and having the FTP server serve them, because I would need to allocate more storage space.
I forked from an open-source project that works as expected: https://github.com/jamesattard/maroodi

Effective ways to load data from hdfs to local system?

I'm trying to load terabytes of data from HDFS to the local filesystem using hadoop fs -get, but it takes hours to complete. Is there a faster, more effective way to get data from HDFS to local?
How fast you can copy to a local filesystem depends on many factors, including:
Are you copying in parallel or in serial?
Is the file splittable? (Can a mapper potentially deal with a block of data rather than the whole file? This is usually a problem if you have certain kinds of compressed files on HDFS.)
Network bandwidth, of course, because you will likely be pulling from many DataNodes
Option 1: DistCp
In any case, since you state your files are on HDFS, we know each Hadoop slave node can see the data. You can try to use the DistCp command (distributed copy), which will turn your copy operation into a parallel MapReduce job for you, with one major caveat!
MAJOR CAVEAT: This will be a distributed copy process, so the destination you specify on the command line needs to be a place visible to all nodes. To do this you can mount a network share (NFS, Samba, other) on all nodes and specify a directory in that share as the destination for your files. This may require getting a system admin involved, but the result may be a faster file copy operation, so the cost-benefit is up to you.
DistCp documentation is here: http://hadoop.apache.org/docs/r0.19.0/distcp.html
DistCp example: YourShell> hadoop distcp -i -update /path/on/hdfs/to/directoryOrFileToCopy file:///LocalpathToCopyTo
Option 2: Multi-threaded Java Application with HDFS API
As you found, hadoop fs -get is a sequential operation. If your Java skills are up to the task, you can write your own multithreaded copy program using the Hadoop FileSystem API calls.
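A hedged sketch of the threading pattern such a program might use, built around an ExecutorService. To keep it self-contained it copies local files with java.nio as a stand-in; in a real program each task would instead open the source through Hadoop's FileSystem API (for example FileSystem.copyToLocalFile):

```java
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;

public class ParallelCopy {
    // Copy each source file into destDir, one worker thread per task.
    // Stand-in for HDFS: a real implementation would call
    // FileSystem.copyToLocalFile(...) inside the task instead of Files.copy.
    static void copyAll(List<Path> sources, Path destDir, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new java.util.ArrayList<>();
            for (Path src : sources) {
                futures.add(pool.submit(() -> {
                    Files.copy(src, destDir.resolve(src.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                    return null;
                }));
            }
            for (Future<?> f : futures) f.get(); // propagate any copy failure
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats hadoop fs -get depends on the factors listed above; with a single disk on the receiving end, parallelism can even hurt.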
Option 3: Multi-threaded Program in any language with HDFS REST API
If you know a language other than Java, you can similarly write a multithreaded program that accesses HDFS through the HDFS REST API (WebHDFS) or via an NFS mount.

Writing to local file during map phase in hadoop

Hadoop writes the intermediate results to the local disk and the results of the reducer to HDFS. What does HDFS mean? What does it physically translate to?
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video of how HDFS and the Map/Reduce framework works and will serve you as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/

How to use a third-party API in Hadoop to read files from HDFS if that API only accepts local file system paths?

I have large mbox files and I am using a third-party API, mstor, to parse messages from the mbox files using Hadoop. I have uploaded those files to HDFS. The problem is that this API only accepts a local file system path, similar to what is shown below:
MessageStoreApi store = new MessageStoreApi("file location in local file system");
I could not find a constructor in this API that would initialize from a stream, so I cannot read an HDFS stream and initialize from it.
Now my question is: should I copy my files from HDFS to the local file system and initialize from a local temporary folder? That is what I have been doing for now:
Currently my Map function receives the path of the mbox file.
Map(key = path_of_mbox_file_in_hdfs, value = null) {
    String local_temp_file = CopyToLocalFile(path_of_mbox_file_in_hdfs);
    MessageStoreApi store = new MessageStoreApi(local_temp_file);
    // process file
}
Or is there some other solution? For example, what if I increase the block size so that a single file fits in one block, and somehow get the location of those blocks in my map function? Since map tasks mostly execute on the same node where their blocks are stored, I might not always have to download to the local file system. But I am not sure that will always work :)
Suggestions and comments are welcome!
For local filesystem path-like access, HDFS offers two options: HDFS NFS (via NFSv3 mounts) and FUSE-mounted HDFS.
The former is documented under the Apache Hadoop docs (CDH users may follow this instead)
The latter is documented at the Apache Hadoop wiki (CDH users may find relevant docs here instead)
The NFS feature is currently better maintained upstream than the FUSE option.
