How to ignore empty parquet files when reading using Hive - hadoop

I am using Hive 3.1.0, and my query reads a bunch of parquet files from a certain path every hour. I don't have control over how these files are generated, as they are created by an external process. In some rare cases a parquet file with zero size may exist within the specified path. I would like Hive to ignore this, but my Hive queries fail with the following error:
<filename>.parquet is not a Parquet file (too small length: 0)
How do I avoid this? There could be many files landing in an hour, so it would be overkill to build automation to detect and delete empty files. I believe there should be some simpler option in Hive to make it just ignore such files.

Try to use the property $file_size: if it is more than 0, then process the data load. It would help if you could share the query you are using to access the data.

I don't know how to do this with a Hive property. If possible, you might want to handle empty files in a separate directory before pushing to final storage, using:
find ./your-directory -type f -empty -print -delete
or, if that is not possible, delete the files in your final storage.
List the files to be deleted first as a sanity check.
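If you can run a sweep before the Hive query, a quick structural screen is possible: a valid Parquet file starts and ends with the 4-byte magic PAR1, so any file shorter than 12 bytes (header magic + footer length + footer magic) cannot be valid. A minimal sketch in plain Python over a local staging directory (paths and the staging-directory approach are assumptions; this checks only size and magic bytes, not the footer contents):

```python
import os

PARQUET_MAGIC = b"PAR1"
MIN_PARQUET_SIZE = 12  # 4-byte header magic + 4-byte footer length + 4-byte footer magic

def is_plausible_parquet(path):
    """Cheap screen: file size and magic bytes only, no footer parsing."""
    if os.path.getsize(path) < MIN_PARQUET_SIZE:
        return False
    with open(path, "rb") as f:
        if f.read(4) != PARQUET_MAGIC:
            return False
        f.seek(-4, os.SEEK_END)
        return f.read(4) == PARQUET_MAGIC

def find_bad_parquet(directory):
    """Return .parquet paths under `directory` that fail the screen (e.g. zero-byte files)."""
    bad = []
    for root, _, files in os.walk(directory):
        for name in files:
            if name.endswith(".parquet"):
                p = os.path.join(root, name)
                if not is_plausible_parquet(p):
                    bad.append(p)
    return bad
```

Files flagged this way can be moved aside before the Hive query runs, which avoids both the query failure and deleting anything blindly.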

Related

update Parquet file format

My requirement is to read that data and generate another set of parquet files in a different ADLS folder.
Do I need to load this into Spark DataFrames and perform upserts?
Parquet is like any other file format: you have to overwrite the files to perform inserts, updates and deletes. It does not have ACID properties like a database.
1 - We can use set operations with Spark DataFrames to accomplish what you want. However, they compare at both the row and column level, which is not as nice as ANSI SQL.
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html
2 - We can save the data in the target directory as a Delta table. Most people are using Delta since it has ACID properties like a database. Please see the MERGE statement; it allows for updates and inserts.
https://docs.delta.io/latest/delta-update.html
Additionally, we can implement soft deletes by reversing the match.
The nice thing about a Delta file (table) is that we can partition by date for a daily file load, and then use time travel to see what happened yesterday versus today.
https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
3 - If you do not care about history and soft deletes, the easiest way to accomplish this task is to archive the old files in the target directory, then copy over the new files from the source directory to the target directory.
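For option 2, the Delta MERGE mentioned above looks roughly like this in Spark SQL (table, alias and join-column names are made up for illustration):

```sql
-- upsert the new ADLS parquet data into the Delta target table
MERGE INTO target t
USING updates s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

`UPDATE SET *` / `INSERT *` copy all columns by name, which keeps the statement short when source and target share a schema.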

What should be the minimum size of a valid ORC file with snappy compression

The scenario I am dealing with: each hour, 10k ORC files are generated in HDFS by a Spark streaming application, and at the end of the hour a Spark merge job runs, merges those small files into bigger chunks, and writes them to the Hive landing path for an external table to pick up. Sometimes a corrupt ORC file causes the merge job to fail. The task is to find the corrupt ORC files, move them to a badrecordspath, and then let the Spark merge job begin.
After going through the theory of the ORC format, it seems a valid ORC file ends with the string "ORC" followed by one more byte. How do I check that in an optimised way, so that validating those 10k ORC files doesn't take much time? I thought of writing a bash shell script, but it seems to take a good amount of time to validate ORC files over HDFS. My idea is to narrow down the validation if I know the minimum size of a valid ORC file, because most of our corrupt files are very tiny (mostly 3 bytes). Any suggestion would be very helpful.
PS: I can't use set spark.sql.files.ignoreCorruptFiles=true because I have to track the files and move them to the bad records path.
I found a solution. We can use set spark.sql.files.ignoreCorruptFiles=true and then track the ignored files using the method below:
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

// listOfCompleteFiles holds every file under the input path,
// collected before reading (e.g. via FileSystem.listStatus)
def trackIgnoreCorruptFiles(df: DataFrame, listOfCompleteFiles: List[Path]): List[Path] = {
  val listOfFilesAfterIgnore = df.withColumn("file_name", input_file_name)
    .select("file_name")
    .distinct()
    .collect()
    .map(x => new Path(x(0).toString))
    .toList
  listOfCompleteFiles.diff(listOfFilesAfterIgnore)
}
input_file_name is a built-in Spark function that returns the complete path of the file, and we capture it as a column of the DataFrame df. The DataFrame gives the files that remain after Spark's ignore step; diffing the complete file list against them yields the actual list of files ignored by Spark, which is what the method returns. You can then easily move those files to a badRecordsPath for future analysis.
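The fast check contemplated in the question is also workable without Spark: an ORC file begins with the 3-byte magic "ORC" and ends with a one-byte postscript length immediately preceded by the same magic, so a few bytes of I/O per file are enough to reject the tiny corrupt files. A sketch in plain Python (shown against local paths; reading over HDFS with a client library is an assumption about your tooling, and the exact minimum size of a valid snappy-compressed ORC file depends on the writer, so this only checks structure):

```python
import os

ORC_MAGIC = b"ORC"

def looks_like_valid_orc(path):
    """Cheap structural check: header magic plus trailing magic and postscript-length byte.

    A well-formed ORC file starts with "ORC" and ends with a 1-byte postscript
    length preceded by the same magic, so 3-byte corrupt files are rejected
    without parsing anything.
    """
    size = os.path.getsize(path)
    if size < 8:  # header magic (3) + trailing magic (3) + length byte (1) cannot fit below this
        return False
    with open(path, "rb") as f:
        if f.read(3) != ORC_MAGIC:
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    ps_len = tail[3]  # last byte of the file is the postscript length
    return tail[:3] == ORC_MAGIC and 0 < ps_len < size

def partition_by_validity(paths):
    """Split paths into (ok, bad) so the bad ones can go to a badrecordspath."""
    ok, bad = [], []
    for p in paths:
        (ok if looks_like_valid_orc(p) else bad).append(p)
    return ok, bad
```

Because each check reads only seven bytes, scanning 10k files is dominated by metadata round-trips, not data transfer.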

How to merge HDFS small files into one large file?

I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge is based on date: the original folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source))
  .map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))
This example assumes text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target as well.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
If you're working across HDFS and S3, this may also be helpful. You might even use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular-expression pattern. So if the date is in the file name, you can group based on that pattern.
You can develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame out to the big file in append mode.
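The date-filtered merge described above can also be sketched without Spark: pick only the files whose names carry the given date, then append them into one target file. A minimal plain-Python sketch (a date-in-filename convention like `2024-01-15` is an assumption; adapt the match to your Kafka file names):

```python
import os
import shutil

def merge_files_for_date(source_dir, target_path, date_str):
    """Concatenate every file in source_dir whose name contains date_str
    into a single target file, leaving other dates' files untouched."""
    names = sorted(n for n in os.listdir(source_dir) if date_str in n)
    with open(target_path, "wb") as out:
        for name in names:
            with open(os.path.join(source_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return names  # the files that were merged, in order
```

Sorting the names first makes the output deterministic, which matters if downstream consumers compare merged files across runs.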

What is the best place to store multiple small files in Hadoop

I will be having multiple small text files, around 10 KB in size, and I am confused about where to store them: HBase or HDFS. Which will be the optimized storage?
To store them in HBase I need to parse each file first and then save it against some row key.
In HDFS I can directly create a path and save the file at that location.
But everything I have read so far says you should not have many small files; instead, create fewer big files.
But I cannot merge those files, so I can't create big files out of the small ones.
Kindly suggest.
A large number of small files doesn't fit well with Hadoop, since each file is an HDFS block and each block requires one Mapper to process it by default.
There are several options/strategies to minimize the impact of small files; all of them require processing the small files at least once and "packaging" them in a better format. If you plan to read these files several times, pre-processing them could make sense, but if you will use them just once then it doesn't matter.
To process small files my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat uses one Mapper to process several files, but it may require transferring files to the DataNode where the map is running in order to put them together, and it can perform badly with speculative tasks, though you can disable those if your cluster is stable enough.
Alternatives for repackaging small files are:
Create sequence files where each record contains one of the small files. With this option you keep the original files.
Use IdentityMapper and IdentityReducer with fewer reducers than files. This is the easiest approach, but it requires that every line in the files be uniform and independent (no headers or metadata at the beginning of a file needed to understand the rest of it).
Create an external table in Hive and then insert all the records from this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach has the same limitations as option two and requires Hive; the advantage is that you don't need to write a MapReduce job.
If you cannot merge files as in options 2 or 3, my suggestion is to go with option 1.
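Option 3 above can be sketched in Hive (table and column names are made up; the external table's LOCATION points at the directory of small files):

```sql
-- external table over the many small files
CREATE EXTERNAL TABLE small_files_ext (line STRING)
LOCATION '/data/small-files';

-- compacted copy; Hive writes far fewer, larger files
CREATE TABLE compacted (line STRING);
INSERT INTO TABLE compacted SELECT * FROM small_files_ext;
```

Dropping the external table afterwards removes only its metadata, so the original small files stay in place until you clean them up explicitly.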
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
It's no problem to have many small, different files. If, for example, you have a Hive table backed by many very small files in HDFS, that's not optimal: better to merge those files into fewer big ones, because reading the table will otherwise create a lot of mappers. If your files are completely different, like 'apples' and 'employees', and cannot be merged, then just store them as-is.

Querying data from har archives - Apache Hive

I am using Hadoop and facing the dreaded problem of large numbers of small files. I need to be able to create HAR archives out of existing Hive partitions and query them at the same time. However, Hive apparently supports archiving partitions only in managed tables, not external tables, which is pretty sad. I am trying to find a workaround by manually archiving the files inside a partition's directory using Hadoop's archive tool. I now need to configure Hive to be able to query the data stored in these archives along with the unarchived data stored in other partition directories. Please note that we only use external tables.
The namespace for accessing the files in the created partition HAR corresponds to the HDFS path of the partition directory.
For example, a file in HDFS:
hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt
can after archiving be accessed as:
har:///user/user1/data/db1/tab1/ds=2016_01_01.har/f1.txt
Would it be possible for hive to query the har archives from the external table? Please suggest a way if yes.
Best Regards
In practice, the line between "managed" and "external" tables is very thin.
My suggestion:
create a "managed" table
add partitions explicitly for some days in the future, but with ad hoc locations -- i.e. the directories your external process expects to use
let the external process dump its files directly at the HDFS level -- they are automagically exposed in Hive queries, "managed" or not (the Metastore does not track individual files and blocks; they are detected on each query; as a side note, you can run backup & restore operations at the HDFS level if you wish, as long as you don't mess with the directory structure)
when a partition is "cold" and you are pretty sure there will never be another file dumped there, you can run a Hive command to archive the partition, i.e. move the small files into a single HAR and flag the partition as "archived" in the Metastore
Bonus: it's easy to unarchive your partition within Hive (whereas there is no hadoop unarchive command AFAIK).
Caveat: it's a "managed" table so remember not to DROP anything unless you have safely moved your data out of the Hive-managed directories.
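The archive and unarchive steps in the suggestion above are standard Hive statements (archiving must be enabled first; the table and partition names below mirror the ds=2016_01_01 example and are illustrative):

```sql
SET hive.archive.enabled=true;

-- pack the partition's small files into one HAR and flag it in the Metastore
ALTER TABLE tab1 ARCHIVE PARTITION (ds='2016_01_01');

-- the reverse, which a plain `hadoop archive` workflow lacks
ALTER TABLE tab1 UNARCHIVE PARTITION (ds='2016_01_01');
```

Queries against an archived partition keep working; Hive resolves the har:// paths transparently, at some read-performance cost.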
