Calculating HashCode function in HDFS - hadoop

I want to move some files from one location to another location [both the locations are on HDFS] and need to verify that the data has moved correctly.
In order to compare the data moved, I was thinking of calculating hash code on both the files and then comparing if they are equal. If equal, I would term the data movement as correct else the data movement has not happened correctly.
But I have a couple of questions regarding this.
Do I need to use the hashCode technique at all in first place? I am using MapR distribution and I read somewhere that data movement when done, implement hashing of the data at the backend and make sure that it has been transferred correctly. So is it guaranteed that when data will be moved inside HDFS, it will be consistent and no anomaly will be inserted while movement?
Is there any other way that I can use in order to make sure that the data moved is consistent across locations?
Thanks in advance.

You are asking about data copying. Just use DistCp.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
#sample example
$hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
EDIT
DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting.
After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to ·verify that the copy was truly successful·. Since DistCp employs both MapReduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy.
EDIT
The common method I used to check the source and dist files was check the number of files and the specified size of each file. This can be done by generate a manifest at the source, then check at the dist both the number and size.

In HDFS, move doesn't physically move the data (blocks) across the data nodes. It actually changes the name space in HDFS metadata. Where as copying data from one HDFS location to another HDFS location, we have two ways;
copy
parallel copy distcp
In General copy, it doesn't check the integrity of blocks. If you want data integrity while copying file from one location to another location in same HDFS cluster use CheckSum concept by modifying the FsShell.java class or write your own class using HDFS Java API.
In Case of distCp, HDFS checks the data integrity while copying data from One HDFS cluster to another HDFS Cluster.

Related

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top os those directories we have 4-5 Impala (2.1) tables. The process writing those files in HDFS is Spark Structured Streaming (2.3.1)
Right now, we are running some DDL queries as soon as we get the files written to HDFS:
ALTER TABLE table1 RECOVER PARTITONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITIONS (partition1=X, partition2=Y), using all the keys for each partition.
Right now, this DDL is taking a bit too long and they are getting queued in our system, damaging the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using the ALTER TABLE .. RECOVER PARTITONS but as per the documentation, it only refreshes new partitions.
Tried to use REFRESH .. PARTITON ... with multiple partitions at once, but the statement syntaxis does not allow to do that.
Tried batching the queries but the Hive JDBC drives does not support batching queries.
Shall we try to do those updates in parallel given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: The way in which we know what partitions need refreshed is by using HDFS events as with Spark Structured Streaming we don´t know exactly when the files are written.
Note #2: Also, the files written in HDFS are sometimes small, so it would be great if it could be possible to merge those files at the same time.
Since nobody seems to have the answer for my problem, I would like to share the approach we took to make this processing more efficient, comments are very welcome.
We discovered (doc. is not very clear on this) that some of the information stored in the Spark "checkpoints" in HDFS is a number of metadata files describing when each Parquet file was written and how big was it:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
w-r--r-- 3 hdfs 68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
w-r--r-- 3 hdfs 68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
So, what we did was:
Build a Spark Streaming Job polling that _spark_metadata folder.
We use a fileStream since it allow us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder (which maps to each Impala partition) they belong to.
For each folder:
Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files)
Calculate how many blocks to write (using the size field in the JSON and a target block size)
Coalesce the dataframe to the desired number of partitions and write it back to HDFS
Execute the DDL REFRESH TABLE myTable PARTITION ([partition keys derived from the new folder]
Finally, delete the source files
What we achieved is:
Limit the DDLs, by doing one refresh per partition and batch.
By having batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or less resources to the Spark Streaming job (executors, cores, memory, etc.) and also we can start/stop it (using its own checkpointing system).
We are also studying the possibily of applying some data repartitioning, while doing this process, to have partitions as close as possible to the most optimum size.

Small files in hadoop

I am trying to combine small files on hdfs. This is simply for historical purposes, if needed the large file(s) would be disassembled and ran through the process to create the data for the hadoop table. Is there a way to achieve this simply? For example, day one receive 100 small files, combine into a file, then day two add/append more files into the previously created file, etc...
If the files are all the same "schema", let's say, like CSV or JSON. Then, you're welcome to write a very basic Pig / Spark job to read a whole folder of tiny files, then write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size.
You've also mentioned Hive, so use an external table for the small files, and use a CTAS query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
IMO, if possible, the optimal solution is to setup a system "upstream" of Hadoop, which will batch your smaller files into larger files, and then dump them out to HDFS. Apache NiFi is a useful tool for this purpose.

Persisting unstructured data to hadoop using spark streaming

I have an ingest pipeline created using spark streaming, and I would like to store the RDDs in hadoop as a large unstructured (JSONL) datafile to simplify future analysis.
What is the best approach for persisting astream to hadoop without ending up with very large numbers of small files? (since hadoop is not good with those, and they complicate analysis workflows)
First, I would suggest using a persistance layer that can handle this like Cassandra. But, if you are deadset on HDFS, then the mailing list has an answer already
You can use FileUtil.copyMerge (from the hadoop fs) API and specify the path to the folder where saveAsTextFiles is saving the part text file.
Suppose your directory is /a/b/c/ use
FileUtil.copyMerge(FileSystem of source, a/b/c,
FileSystem of destination, Path to the merged file say (a/b/c.txt),
true(to delete the original dir,null))

Add a entire directory to hadoop file system (hdfs)

I have data that is stored within the sub directories and would like to put the parent directory in the HDFS. The data is always present at the last directory and the directory structure extends upto 2 levels.
So the structure is [parent_dir]->[sub_directories]->[sub_directories]->data
I tried to add the entire directory by doing
hadoop fs -put parent_dir input
This takes a long long time ! The sub directories are possibly 258X258. And this eventually fails with
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(X.X.X.245:50010, storageID=DS-262356658-X.X.X.245-50010-1394905028736, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on devic
I can see the required space on the nodes. What am I doing wrong here ?
Also I the way I was planning to access my files was
hadoop jar Computation.jar input/*/* output
This worked well for small data set.
That error message is usually fundamentally correct. You may not be taking into account the replication factor for the HDFS filesystem. If your replication factor is set to 3, which is the default, then you need 300GB of storage available to store a 100GB dataset.
There are a couple of things you can do to help get around the issue:
1) Decrease your replication factor (dfs.replication), and your maximum blocks (dfs.replication.max) to 2 in your hdfs-site.xml
2) Compress your datasets. Hadoop can operate on bzip and gzip compressed files (though you need to be careful of splitting)

HDFS:How to distribute files of small sizes across?

I have very large number of small files to be stored in HDFS. Based on the file name I want to store them in different data nodes. This way I can achieve file names starting with certain alphabets to go into specific data nodes. How to do this in Hadoop?
Not a very good choice. Reasons :
Hadoop is not very good at handling very large number of small files.
Storing one complete file in a single node is against one of the fundamental principles of HDFS, distributed storage.
I would like to know what benefit will you get with this approach.
In response to your comment :
HDFS doesn't do any kind of sorting like HBase does. When you put a file into HDFS, it gets split into small blocks first and then gets stored(each block on a different node). So there is nothing like sending a whole file to a single node. Your file(blocks) reside on multiple nodes.
What you could do is create a directory hierarchy as per you needs and store files in those directories(in case your intention is to fetch the files directly based on their location). For example,
/dirA
/dirA/A.txt
/dirA/B.txt
/dirB
/dirB/P.txt
/dirB/Q.txt
/dirC
/dirC/Y.txt
/dirC/Z.txt
But, if you really want to send the blocks of a particular file to some specific nodes then you need to implement your own block placement policy and which is not very easy. See this for more details.

Resources