I have a directory tree with two directories and synchronized files in them:
home/dirMaster/file1.txt
home/dirMaster/file2.txt
home/dirSlave/file1-slave.txt
home/dirSlave/file2-slave.txt
Based on the file names, file1-slave.txt has records corresponding to file1.txt.
I want to move them to HDFS using Flume, but based on my reading so far I have the following problems:
Flume will not preserve my file names, so I lose the synchronization.
Flume does not guarantee that the files from the source will match the destination; e.g. a source file might get split into several destination files.
Is this correct? Can flume support this scenario?
A Flume agent moves data from a source to a sink, using a channel to buffer the data before it is rolled into the sink.
One of Flume's sinks is the HDFS sink, which rolls data into HDFS based on the following criteria:
hdfs.rollSize
hdfs.rollInterval
hdfs.rollCount
It rolls the data based on a combination of the above parameters, and the file names follow a predefined pattern. We can also control the file names using sink parameters, but the pattern is the same for every file rolled by the agent; you cannot get different file path patterns from a single Flume agent.
agent.sinks.sink.hdfs.path=hdfs://<namenode>:9000/<pattern>
The pattern can be a static or a dynamic path.
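For illustration, a sink configuration along these lines might look like the sketch below; the agent and sink names, the namenode address, and the path are placeholders rather than values from the question:
# illustrative rolling settings: roll at ~128 MB or every 5 minutes, never on event count
agent.sinks.sink.type=hdfs
agent.sinks.sink.hdfs.path=hdfs://<namenode>:9000/flume/events/%Y/%m/%d
agent.sinks.sink.hdfs.filePrefix=events
agent.sinks.sink.hdfs.rollSize=134217728
agent.sinks.sink.hdfs.rollInterval=300
agent.sinks.sink.hdfs.rollCount=0
# needed so the %Y/%m/%d escapes resolve without a separate timestamp interceptor
agent.sinks.sink.hdfs.useLocalTimeStamp=true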
Flume also produces many files, depending on the rolling criteria.
So Flume is not suitable for your requirement; it is best suited to streaming data ingestion.
DistCp: a distributed, parallel data-loading utility for HDFS.
It is a map-only MapReduce program and will produce a number of part files (equal to the number of maps) in the destination directory.
So DistCp is also not suitable for your requirement.
So it is better to use hadoop fs -put to load the data into HDFS; it preserves your file names:
hadoop fs -put /home/dirMaster/ /home/dirSlave/ /home/
Related
I want to configure a flume flow so that it takes in a CSV file as a source, checks the data, and dynamically separates each row of data into folders by year/month in HDFS. Is this possible?
I might suggest you look at using Nifi instead. I feel like it's the natural replacement for Flume.
Having said that, you might want to consider using the spooling directory source and a Hive sink (instead of HDFS). Hive partitions (on year/month) would enable you to land the data in the manner you are suggesting.
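A rough sketch of that idea is below; the agent, source, channel, and sink names, the spool directory, the metastore URI, and the table layout are all assumptions, and note that the Flume Hive sink needs a transactional (bucketed, ORC) Hive table partitioned on the same columns:
agent.sources.src.type=spooldir
agent.sources.src.spoolDir=/data/incoming/csv
# add a timestamp header so the %Y,%m escapes in the partition values can resolve
agent.sources.src.interceptors=ts
agent.sources.src.interceptors.ts.type=timestamp
agent.sources.src.channels=ch

agent.channels.ch.type=memory

agent.sinks.snk.type=hive
agent.sinks.snk.channel=ch
agent.sinks.snk.hive.metastore=thrift://<metastore-host>:9083
agent.sinks.snk.hive.database=default
agent.sinks.snk.hive.table=events
agent.sinks.snk.hive.partition=%Y,%m
agent.sinks.snk.serializer=DELIMITED
agent.sinks.snk.serializer.delimiter=","
agent.sinks.snk.serializer.fieldnames=id,name,value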
How can I read a tar.bzip2 file in Spark in parallel?
I have created a custom Java Hadoop reader that reads the tar.bzip2 file, but it takes too much time because only one core is used, and after some time the application fails because a single executor gets all the data.
As we know, bzip2 files are splittable, so when reading a bzipped file into an RDD the data gets distributed across the partitions. However, the underlying tar file also gets distributed across the partitions, and tar is not splittable, so if you try to perform an operation on a partition you will just see a lot of binary data.
To solve this I simply read the bzipped data into an RDD with a single partition. I then wrote this RDD out to a directory, so you end up with a single file containing all the tar file data. I then pulled this tar file from HDFS down to my local file system and untarred it.
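A rough Java Spark sketch of those steps, assuming the archive's contents are plain text (the paths and application name here are made up):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CollapseTarBz2 {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("collapse-tar-bz2"));
        // bzip2 is decompressed transparently; coalesce(1) forces a single partition,
        // so saveAsTextFile writes one part file holding all of the tar data
        sc.textFile("hdfs:///data/archive.tar.bz2")
          .coalesce(1)
          .saveAsTextFile("hdfs:///data/archive-merged");
        sc.stop();
        // then, outside Spark: hdfs dfs -get /data/archive-merged/part-00000 . && tar -xf part-00000
    }
}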
I want to move some files from one location to another location [both the locations are on HDFS] and need to verify that the data has moved correctly.
In order to compare the data moved, I was thinking of calculating a hash of both files and then checking whether they are equal. If they are equal, I would consider the data movement correct; otherwise the data movement has not happened correctly.
But I have a couple of questions regarding this.
Do I need to use the hashing technique at all in the first place? I am using the MapR distribution, and I read somewhere that data movement, when done, hashes the data in the backend to make sure it has been transferred correctly. So is it guaranteed that when data is moved inside HDFS it will be consistent and no anomaly will be introduced during the move?
Is there any other way that I can use in order to make sure that the data moved is consistent across locations?
Thanks in advance.
You are asking about data copying. Just use DistCp.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
# sample usage
$ hadoop distcp hdfs://nn1:8020/foo/bar \
      hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
EDIT
After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both MapReduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy.
EDIT
The common method I used to check the source and destination files was to compare the number of files and the size of each file. This can be done by generating a manifest at the source, then checking both the count and the sizes at the destination.
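For example, a quick count-and-size comparison with the HDFS Java API could look like the following sketch (the directory paths are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareCounts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ContentSummary src = fs.getContentSummary(new Path("/data/source"));
        ContentSummary dst = fs.getContentSummary(new Path("/data/target"));
        // matching file counts and total bytes is a cheap first sanity check
        boolean matches = src.getFileCount() == dst.getFileCount()
                && src.getLength() == dst.getLength();
        System.out.println(matches ? "counts and sizes match" : "MISMATCH");
    }
}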
In HDFS, a move doesn't physically move the data (blocks) across the data nodes; it only changes the namespace in the HDFS metadata. For copying data from one HDFS location to another, we have two ways:
copy
parallel copy (distcp)
A general copy doesn't check the integrity of the blocks. If you want data integrity while copying a file from one location to another in the same HDFS cluster, use checksums, either by modifying the FsShell.java class or by writing your own class with the HDFS Java API (see the sketch below).
In the case of distcp, HDFS checks data integrity while copying data from one HDFS cluster to another.
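A minimal sketch of the per-file checksum comparison suggested above, using the HDFS Java API (the paths are hypothetical, and the built-in checksum depends on block size and bytes-per-checksum, so both files must use the same settings for the comparison to be meaningful):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareChecksums {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // getFileChecksum returns an MD5-of-block-CRCs checksum for the whole file
        FileChecksum src = fs.getFileChecksum(new Path("/data/source/file1.txt"));
        FileChecksum dst = fs.getFileChecksum(new Path("/data/target/file1.txt"));
        System.out.println(src.equals(dst) ? "checksums match" : "MISMATCH");
    }
}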
I have an ingest pipeline created using Spark Streaming, and I would like to store the RDDs in Hadoop as a large unstructured (JSONL) data file to simplify future analysis.
What is the best approach for persisting a stream to Hadoop without ending up with very large numbers of small files? (Hadoop is not good with those, and they complicate analysis workflows.)
First, I would suggest using a persistence layer that can handle this, like Cassandra. But if you are dead set on HDFS, then the mailing list already has an answer:
You can use the FileUtil.copyMerge API (from org.apache.hadoop.fs) and pass it the folder where saveAsTextFiles is writing the part text files.
Suppose your directory is /a/b/c; use something like:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// merge the part files under /a/b/c into the single file /a/b/c.txt;
// the "true" flag deletes the original directory after merging
FileUtil.copyMerge(fs, new Path("/a/b/c"), fs, new Path("/a/b/c.txt"), true, conf, null);
In the Hadoop MapReduce programming model, when we are processing files, is it mandatory to keep the files in HDFS, or can I keep the files in other file systems and still get the benefit of the MapReduce programming model?
Mappers read input data from an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or HDFS. (By default, data is read from HDFS and the results of the MapReduce job are stored in HDFS as well.) You can write a custom InputFormat when you want your data to be read from an alternative data source other than HDFS.
TableInputFormat would read data records directly from HBase and DBInputFormat would access data from relational databases. You could also imagine a system where data is streamed to each machine over the network on a particular port; the InputFormat reads data from the port and parses it into individual records for mapping.
However, in your case, you have data on an ext4 filesystem on one or more servers. In order to conveniently access this data within Hadoop you'd have to copy it into HDFS first. This way you benefit from data locality when the file chunks are processed in parallel.
I strongly suggest reading the tutorial from Yahoo! on this topic for detailed information. For collecting log files for mapreduce processing also take a look at Flume.
You can keep the files elsewhere but you'd lose the data locality advantage.
For example, if you're using AWS, you can store your files on S3 and access them directly from MapReduce code, Pig, Hive, etc.
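For instance, a job driver can point its input and output directly at S3 paths; the bucket name below is made up, and the S3 connector must be on the classpath and configured with credentials:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3JobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "job-on-s3");
        // mapper/reducer/jar settings omitted; only the I/O paths differ from an HDFS job
        FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/output/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}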
In order to use Apache Hadoop you must have your files in HDFS, the Hadoop file system. Though there are different abstract types of HDFS, like AWS S3, these are all at their basic level HDFS storage.
The data needs to be in HDFS because HDFS distributes the data across your cluster. During the mapping phase each mapper goes through the data stored on its node and then sends it to the proper node running the reducer code for the given chunk.
You can't have Hadoop MapReduce without using HDFS.