Solution to small files bottleneck in hdfs - hadoop

I have hundreds of thousands of small CSV files in HDFS. Before merging them into a single dataframe, I need to add an id to each file individually; otherwise, after the merge, it won't be possible to distinguish data that came from different files.
Currently I rely on YARN to distribute the processes I create, which add the id to each file and convert it to Parquet format. No matter how I tune the cluster (size, executors, memory), throughput is capped at 2000-3000 files/h.
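from multiprocessing.pool import ThreadPool
from pyspark.sql import functions as F

# spark, logger, fileList, numBatches, batchSize and numNodes are defined elsewhere

def addIdCsv(x):
    # x is a record holding the id to assign and the path of one CSV file
    logId = x['logId']
    filePath = x['filePath']
    fLogRaw = spark.read.option("header", "true").option('inferSchema', 'true').csv(filePath)
    fLogRaw = fLogRaw.withColumn('id', F.lit(logId))
    # write the tagged file next to the original, in Parquet format
    fLogRaw.write.mode('overwrite').parquet(filePath + '_added')

p = ThreadPool(numNodes)  # one pool, reused for every batch
for i in range(numBatches):
    fileSlice = fileList[i*batchSize:(i+1)*batchSize]
    logger.info('\n\n\n --------------- \n\n\n')
    logger.info('Starting Batch : ' + str(i))
    logger.info('\n\n\n --------------- \n\n\n')
    p.map(addIdCsv, fileSlice)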
You can see that my cluster is underperforming on CPU. But on the YARN manager it is given 100% access to resources.
What is the best way to solve this part of a data pipeline? What is the bottleneck?
Update 1
The jobs are evenly distributed as you can see in the event timeline visualization below.

As per @cricket_007's suggestion, NiFi provides a good, easy solution to this problem that is more scalable and integrates better with other frameworks than plain Python. The idea is to read the files into NiFi before writing them to HDFS (in my case they are in S3). Reading from and writing to S3 is still an inherent bottleneck, but throughput is around 45k files/h.
The flow looks like this.
Most of the work is done in the ReplaceText processor that finds the end of line character '|' and adds the uuid and a newline.
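To make the ReplaceText step concrete, here is a minimal Python sketch of the same transformation on one file's text. It is not the NiFi processor itself; the function name `append_id_to_lines`, the sample data, and the choice to append the id as a last `|`-delimited field are assumptions for illustration.

```python
import uuid

def append_id_to_lines(text, file_id, delimiter="|"):
    """Append `delimiter + file_id` to every non-empty line of `text`."""
    out = []
    for line in text.splitlines():
        if line:  # skip blank lines so no id-only rows are emitted
            out.append(line + delimiter + file_id)
    return "\n".join(out) + "\n"

sample = "a|b\nc|d\n"          # hypothetical file content
file_id = str(uuid.uuid4())    # one id per file, not per row
tagged = append_id_to_lines(sample, file_id)
```

If the files have a header row, that row would need a column name appended rather than the id; the sketch does not handle that case.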

Related

Save and Process huge amount of small files with spark

I'm new to big data! I have some questions about how to store and process a large number of small files (PDF and PPT/PPTX) with Spark on EMR clusters.
My goal is to save the data (PDF and PPTX) into HDFS (or some other datastore on the cluster), then extract the content of these files with Spark and save it in Elasticsearch or a relational database.
I have read about the small files problem when saving data in HDFS. What is the best way to store a large number of PDF and PPTX files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archives), but I don't understand exactly how either of them works, and I can't figure out which is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file as a separate task, because that would bottleneck the cluster.
Thanks!
1) If you use an object store (like S3) instead of HDFS, there is no need to change or convert your files: each one can be stored as a single object or blob. This also means they stay readable with standard tools and needn't be unpacked or reformatted with custom classes or code.
You can then read the files using Python tools like boto (for S3) or, if you are working with Spark, using the wholeTextFiles or binaryFiles method and then wrapping the content in a BytesIO (Python) / ByteArrayInputStream (Java) to read it with standard libraries.
2) When processing the files, distinguish between items and partitions. If you have 10000 files, you can create 100 partitions containing 100 files each. Each file will still need to be processed one at a time, since the header information is relevant and likely different for each file.
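The items-versus-partitions idea can be sketched in plain Python without a cluster. The helper names `partition_files` and `process_file` and the file names are hypothetical, and `process_file` is only a placeholder for a real PDF/PPTX parser; in Spark the grouping would come from the partitioning of the RDD returned by `binaryFiles` rather than from hand-written code like this.

```python
import io

def partition_files(paths, num_partitions):
    """Split `paths` into `num_partitions` contiguous groups of near-equal size."""
    size, extra = divmod(len(paths), num_partitions)
    groups, start = [], 0
    for i in range(num_partitions):
        # the first `extra` groups take one additional file
        end = start + size + (1 if i < extra else 0)
        groups.append(paths[start:end])
        start = end
    return groups

def process_file(raw_bytes):
    """Placeholder parser: wrap the bytes in a file-like object and read them."""
    stream = io.BytesIO(raw_bytes)
    return len(stream.read())

paths = ["file_%05d.pdf" % i for i in range(10000)]  # hypothetical names
groups = partition_files(paths, 100)                 # 100 groups of 100 files
```

Each group is one unit of scheduling, while the files inside it are still parsed one by one.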
Meanwhile, I found some solutions for that small files problem in HDFS. I can use the following approaches:
HDFS Federation helps distribute the load across NameNodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your files are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is object storage like S3 but runs on-premises. At the time of writing, as far as I know, Ozone is not production ready. https://hadoop.apache.org/ozone/

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic understanding. Please help with the following questions to get me started:
If the file I am trying to copy into HDFS is too big to fit on the commodity hardware available in my Hadoop cluster, what can be done? Will the copy wait until space becomes free, or will there be an error?
How can I detect well in advance, or predict, that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
How many blocks does a node have? Assume a node is a machine with storage (HDD, 500 GB), RAM (1 GB) and a dual-core processor. Is it then 500 GB / 64 MB, assuming each block is configured to hold 64 MB?
If I copyFromLocal a 1TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find which record/row of the input file ends up in which of the multiple pieces split by Hadoop?
What is the purpose of each configured XML file (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest ways to find answers to all of the above.
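The block-count arithmetic in the question can be worked through with a short Python sketch. It assumes the intended figures are a 500 GB disk and a 64 MB block size, and it ignores replication and any disk space used outside HDFS; the function names are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

def blocks_for_file(file_size, block_size):
    """Number of HDFS blocks a file is split into (the last one may be partial)."""
    return (file_size + block_size - 1) // block_size  # integer ceiling

def max_full_blocks_on_disk(disk_size, block_size):
    """Upper bound on full-size block replicas that fit on one DataNode disk."""
    return disk_size // block_size

blocks_1tb = blocks_for_file(1 * TB, 64 * MB)                  # 16384
blocks_on_500gb = max_full_blocks_on_disk(500 * GB, 64 * MB)   # 8000
```

Under these assumptions a 1 TB file is split into 16384 blocks, which is more than a single 500 GB node could hold, so the blocks have to be spread over several DataNodes. Note that a block is not a preallocated slot: a file smaller than the block size only occupies its actual size on disk.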

One file database with HDFS and MapReduce

Let's imagine I want to store a large number of URLs with associated metadata
URL => Metadata
in a file
hdfs://db/urls.seq
I would like this file to grow (if new URLs are found) after every run of MapReduce.
Would that work with Hadoop? As I understand it, MapReduce writes its output to a new directory. Is there any way to take that output and append it to the file?
The only idea that comes to mind is to create a temporary urls.seq and then replace the old one. It works, but it feels wasteful. Also, from my understanding, Hadoop likes the "write once" approach, and this idea seems to conflict with that.
As blackSmith has explained, you can easily append to an existing file in HDFS, but it will hurt performance because HDFS is designed around a "write once" strategy. My suggestion is to avoid this approach unless there is no other option.
One approach to consider is to create a new file for every MapReduce output. If each output is large enough, this technique pays off, because writing a new file does not hurt performance the way appending does. And if each MapReduce job reads the output of the previous one, reading a new file won't affect performance as much as appending does.
So there is a trade-off; it depends on whether you want performance or simplicity.
( Anyways Merry Christmas !)
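The "new file per run" approach can be illustrated with a small Python sketch in which a local temporary directory stands in for an HDFS directory. The file naming scheme, the tab-separated layout and the example URLs are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile

def write_run_output(db_dir, run_id, records):
    """Write one run's (url, metadata) pairs to a new file; old files are untouched."""
    path = os.path.join(db_dir, "urls-%05d.tsv" % run_id)
    with open(path, "x") as f:  # "x" mode fails if the file already exists
        for url, meta in records:
            f.write("%s\t%s\n" % (url, meta))
    return path

def read_all(db_dir):
    """Read every run's file in name order and return a url -> metadata dict."""
    result = {}
    for name in sorted(os.listdir(db_dir)):
        with open(os.path.join(db_dir, name)) as f:
            for line in f:
                url, meta = line.rstrip("\n").split("\t", 1)
                result[url] = meta  # later runs win on duplicate URLs
    return result

db_dir = tempfile.mkdtemp()  # stand-in for an HDFS directory
write_run_output(db_dir, 1, [("http://a.example", "m1")])
write_run_output(db_dir, 2, [("http://b.example", "m2"), ("http://a.example", "m1b")])
urls = read_all(db_dir)
```

The "database" is then the whole directory rather than a single urls.seq file, and the next job takes the directory as its input.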

hadoop/HDFS: Is it possible to write from several processes to the same file?

For example, create a 20-byte file:
the 1st process writes bytes 0 to 4,
the 2nd writes bytes 5 to 9,
etc.
I need this to create big files in parallel using my MapReduce job.
Thanks.
P.S. Maybe it is not implemented yet but is possible in general; if so, please point me to where I should dig.
Are you able to explain what you plan to do with this file after you have created it?
If you need to get it out of HDFS before using it, you can let Hadoop M/R create separate files and then use a command like hadoop fs -cat /path/to/output/part* > localfile to combine the parts into a single file saved on the local file system.
Otherwise, there is no way you can have multiple writers open to the same file - reading and writing to HDFS is stream based, and while you can have multiple readers open (possibly reading different blocks), multiple writing is not possible.
Web downloaders request parts of the file using the Range HTTP header in multiple threads, and then either use tmp files before merging the parts together later (as Thomas Jungblut suggests), or they might be able to make use of random IO, buffering the downloaded parts in memory before writing them to the output file at the correct location. Unfortunately, you don't have the ability to perform random output with Hadoop HDFS.
I think the short answer is no. The way you accomplish this is to write your multiple 'preliminary' files to Hadoop and then M/R them into a single consolidated file. Basically, use Hadoop, don't reinvent the wheel.
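Both answers come down to the same pattern: each writer produces its own part file, and the parts are concatenated afterwards. A minimal local sketch in Python follows; the directory, the part-file names and the 5-byte chunks are made up for illustration. On a real cluster, the hadoop fs -cat command shown above or hadoop fs -getmerge does the equivalent job.

```python
import glob
import os
import shutil
import tempfile

def merge_parts(output_dir, merged_path):
    """Concatenate part-* files in name order into one file."""
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)
    return parts

# Simulate two writers, each producing its own part file.
out_dir = tempfile.mkdtemp()
for i, chunk in enumerate([b"01234", b"56789"]):
    with open(os.path.join(out_dir, "part-r-%05d" % i), "wb") as f:
        f.write(chunk)

merged_path = os.path.join(out_dir, "merged.bin")
merge_parts(out_dir, merged_path)
```

The zero-padded part names make lexical order match write order, which is why a plain sort is enough to restore the byte ranges in sequence.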

hadoop - How can i use data in memory as input format?

I'm writing a MapReduce job, and the input that I want to pass to the mappers is in memory.
The usual way to pass input to the mappers is via HDFS, with SequenceFileInputFormat or TextInputFormat. These input formats need files in HDFS, which are loaded and split among the mappers.
I can't find a simple way to pass, let's say, a List of elements to the mappers.
I find myself having to write these elements to disk and then use FileInputFormat.
Any solution?
I'm writing the code in Java, of course.
Thanks.
An input format does not have to load data from disk or from a file system.
There are also input formats that read data from other systems, such as HBase (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html), where the data is not assumed to sit on disk. It only has to be available via some API on all nodes of the cluster.
So you need to implement an input format that splits the data using your own logic (since there are no files, this is your job) and chops the data into records.
Please note that your in-memory data source should be distributed and run on all nodes of the cluster. You will also need an efficient IPC mechanism to pass data from your process to the Mapper process.
I would also be glad to know what use case leads to this unusual requirement.
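The two responsibilities described above, splitting the data and chopping each split into records, can be sketched in Python as a toy model. This is not Hadoop code: in Java the corresponding pieces would be a custom InputFormat with its getSplits method and a RecordReader, and the class name `ListInputFormat` and its method names here are hypothetical.

```python
class ListInputFormat:
    """Toy analogue of an InputFormat backed by an in-memory list."""

    def __init__(self, items, num_splits):
        self.items = list(items)
        self.num_splits = num_splits

    def get_splits(self):
        """Return (start, end) index ranges, one per split."""
        n = len(self.items)
        bounds = [n * i // self.num_splits for i in range(self.num_splits + 1)]
        return [(bounds[i], bounds[i + 1]) for i in range(self.num_splits)]

    def record_reader(self, split):
        """Yield (key, value) records for one split; the key is the item's index."""
        start, end = split
        for idx in range(start, end):
            yield idx, self.items[idx]

fmt = ListInputFormat(["a", "b", "c", "d", "e"], num_splits=2)
splits = fmt.get_splits()
records = [list(fmt.record_reader(s)) for s in splits]
```

The sketch leaves out the hard part the answer points to: each mapper runs in its own process, possibly on another node, so the list itself still has to be made reachable from there.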
