I'm using a custom output format that writes a new sequence file per mapper per key, so you end up with something like this:
Input
Key1 Value
Key2 Value
Key1 Value
Files
/path/to/output/Key1/part-00000
/path/to/output/Key2/part-00000
I've noticed a huge performance hit. It usually takes around 10 minutes simply to map the input data, but after two hours the mappers weren't even halfway done, although they were outputting rows. I expect the number of unique keys to be around half the number of input rows, roughly 200,000.
Has anyone done anything like this, or can anyone suggest something that might help the performance? I'd like to keep this key-splitting process within Hadoop if possible.
Thanks!
I believe you should revisit your design. I don't believe HDFS scales well beyond 10M files. I suggest reading more about Hadoop, HDFS and Map/Reduce. A good place to start would be http://www.cloudera.com/blog/2009/02/the-small-files-problem/.
Good luck!
EDIT 8/26: Based on @David Gruzman's comment, I looked deeper into the issue. Indeed, the penalty for storing a large number of small files applies only to the NameNode; there is no additional space penalty on the data nodes. I removed the incorrect part of my answer.
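To put rough numbers on the NameNode cost, here is a back-of-the-envelope sketch in Python. It assumes the rule of thumb from the linked Cloudera post that every file, directory and block takes about 150 bytes of NameNode memory. The 200,000 keys come from the question; the mapper count of 50 is a made-up figure, and the result is an upper bound because not every mapper necessarily sees every key.

```python
# Rough NameNode memory estimate for a "one file per mapper per key" layout.
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file/directory/block object


def namenode_bytes(num_files, num_dirs=0, blocks_per_file=1,
                   bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap used by the metadata of many small files."""
    objects = num_files + num_dirs + num_files * blocks_per_file
    return objects * bytes_per_object


unique_keys = 200_000   # from the question
mappers = 50            # hypothetical; depends on the number of input splits

files = unique_keys * mappers  # at most one part file per mapper per key
estimate = namenode_bytes(files, num_dirs=unique_keys)
print(f"{files:,} files -> about {estimate / 1e9:.2f} GB of NameNode memory")
```

Under these assumptions the job lands at around 10 million files and roughly 3 GB of NameNode heap, which is the scale the answer above warns about.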
It sounds like writing the output to a key-value store might help a lot.
For example, HBase might suit your needs, since it is optimized for a large number of writes and lets you reuse part of your Hadoop infrastructure.
There is an existing output format that writes directly to HBase: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
Related
I am trying to read a subset of a dataset by using a pushdown predicate.
My input dataset consists of 1.2 TB in 43436 Parquet files stored on S3. With the pushdown predicate I am supposed to read 1/4 of the data.
In the Spark UI I see that the job actually reads 1/4 of the data (300 GB), but there are still 43436 partitions in the first stage of the job. Only 1/4 of these partitions have data; the other 3/4 are empty (check the median input data in the attached screenshots).
I was expecting Spark to create only non-empty partitions. I am seeing a 20% performance overhead when reading the whole dataset with the pushdown predicate compared to directly reading a dataset prefiltered by another job (1/4 of the data). I suspect that this overhead is due to the huge number of empty partitions/tasks in my first stage, so I have two questions:
Is there any workaround to avoid these empty partitions?
Can you think of any other reason for the overhead? Maybe executing the pushdown filter is naturally a little slow?
Thank you in advance
Using S3 Select, you can retrieve only a subset of data.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
Otherwise, S3 acts as an object store, in which case an entire object has to be read. In your case you have to read all the content from all the files and filter it on the client side.
There is actually a very similar question, where testing showed that:
The input size was always the same as the Spark job that processed all of the data
You can also see this question about optimizing reads of Parquet files from S3.
It seems your files are rather small: 1.2TB / 43436 ≈ 30MB. So you may want to try increasing spark.sql.files.maxPartitionBytes to see if it reduces the total number of partitions. I don't have much experience with S3, so I'm not sure whether it's going to help, given this note in its description:
The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
Empty partitions: It seems that Spark (2.4.5) really does try to build partitions of size ≈ spark.sql.files.maxPartitionBytes (default 128MB) by packing many files into one partition; source code here.
However, it does this work before running the job, so it can't know that 3/4 of the files will produce no data once the pushed-down predicate is applied. Partitions that happened to receive only files whose rows are all filtered out ended up empty. This also explains why my max partition size is 44MB and not 128MB: no partition happened to contain only files that passed the pushdown filter.
20% overhead: In the end this is not due to the empty partitions. I managed to get far fewer empty partitions by setting spark.sql.files.maxPartitionBytes to 1gb, but it didn't improve read time. I think the overhead is due to opening many files and reading their metadata.
Spark estimates that opening a file is equivalent to reading 4MB (spark.sql.files.openCostInBytes). So the cost of opening many files, even ones that the filter means will never be read, is not negligible.
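The packing behaviour described above can be modelled in a few lines of plain Python. This is a simplified sketch of my understanding of the planner, not Spark's actual code: it ignores the splitting of large files and the additional cap Spark derives from the bytes available per core, and the uniform 30MB file size is an assumption rather than the real size distribution.

```python
# Simplified model of how file-based sources pack files into read partitions.
MB = 1024 * 1024


def pack_files(file_sizes, max_split_bytes=128 * MB, open_cost=4 * MB):
    """Greedily pack files (largest first) into partitions.

    Each file is charged its size plus a fixed open cost. A partition is
    closed when adding the next file's size would exceed max_split_bytes.
    """
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current_size += size + open_cost
        current.append(size)
    if current:
        partitions.append(current)
    return partitions


files = [30 * MB] * 43_436  # assumed uniform sizes, for illustration only
default = pack_files(files)                             # 128MB target
larger = pack_files(files, max_split_bytes=1024 * MB)   # 1GB target
print(len(default), len(larger))
```

Because the packing happens on file sizes alone, before any predicate is evaluated, raising the target size reduces the partition count in this model but does nothing about the number of files that must be opened.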
I'm trying to use SSDs to improve Hive performance.
SSDs offer high-speed random access. To take advantage of that, I am trying to change the MapReduce code that Hive executes.
My current idea is to simplify or eliminate the shuffle step.
Is this possible? If so, where would I make the change?
P.S. Please also explain what happens while Hive is running and where its temporary files are stored.
I do not know English well. I'm sorry.
Thank you.
In theory you can write your own partitioner and send the data to a reducer that runs on the same node where the mapper ran.
However, doing so means you will never get an "unsplit" output file, so avoiding the shuffle is not a good idea.
If you have a fast disk, as an SSD can be, you can increase the block size.
Usually the block size is chosen so that the seek time is no more than 1% of the time needed to transfer the whole block.
This will also reduce the number of mappers used, since there are fewer splits. To some extent, fewer mappers also means less shuffling.
Using a compressed file format for the intermediate files also speeds up the work.
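The 1% rule of thumb for block size mentioned above can be written out as a small calculation. This is only a sketch: the 10 ms seek time and 100 MB/s transfer rate are illustrative numbers for a spinning disk, not measurements of any particular device.

```python
MB = 1_000_000


def block_size_for_seek_ratio(seek_time_s, transfer_rate_bytes_per_s,
                              seek_fraction=0.01):
    """Block size at which seeking costs `seek_fraction` of the transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_bytes_per_s * transfer_time_s


# Assumed figures: 10 ms seek, 100 MB/s sustained transfer.
hdd_block = block_size_for_seek_ratio(0.010, 100 * MB)
print(f"{hdd_block / MB:.0f} MB")
```

In this formula a higher transfer rate pushes the block size up, while a lower seek time pushes it down, so the right value for an SSD depends on which effect dominates for the hardware in question.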
The task is to process a HUGE number (around 10,000,000) of small files (each around 1MB) independently (i.e. the result of processing file F1 is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that, in my case, the files are processed independently. As far as I understand MR, it works best when the output depends on many individual files (for example, counting the frequency of each word across many documents, since a word might appear in any document in the input). But in my case, I just need a lot of independent CPUs/cores.
I was wondering if you have any advice on this.
Side note: There is another issue, which is that MR works best for "huge files rather than a huge number of small files". There seem to be solutions for that, though, so I am ignoring it for now.
It is possible to use MapReduce for your needs. MapReduce has two phases, Map and Reduce, but the reduce phase is not mandatory. For your situation you could write a map-only MapReduce job and put all the calculations on a single file into a customised Map function.
However, I haven't processed such a huge number of files in a single job, so I have no idea how efficient it is. Try it yourself, and share with us :)
This is quite easy to do. In such cases, the data for the MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names, which is on the order of a couple of gigs max.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.
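As a rough sketch of the file-list approach in plain Python: the job input is a text listing with one path per line, and each map task walks its fragment of the listing and handles every file on its own. The 100-byte average path length and the `process_file` callback are assumptions made for illustration; a real mapper would open each path on S3 or HDFS and do the actual work there.

```python
def estimate_listing_bytes(num_files, avg_path_len=100):
    """Size of a text listing with one path per line (path plus newline)."""
    return num_files * (avg_path_len + 1)


def run_mapper(paths, process_file):
    """What a single map task does: handle each listed file independently."""
    for path in paths:
        path = path.strip()
        if path:  # skip blank lines in the listing
            yield path, process_file(path)


# Hypothetical fragment of the listing handed to one mapper.
fragment = ["s3://bucket/in/f1\n", "s3://bucket/in/f2\n", "\n"]
results = list(run_mapper(fragment, process_file=len))  # len stands in for real work
print(estimate_listing_bytes(10_000_000), results)
```

With these assumed path lengths the whole listing for 10M files comes to about 1 GB, in line with the estimate above, and no reducer is needed because each result depends on a single file.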
(FWIW, I would suggest Qubole (where I am a founder) instead of EMR, because it would save you a ton of money with auto-scaling and spot integration.)
I am new to Hadoop and I'm working with a large number of small files in the wordcount example.
It takes a lot of map tasks, which slows down my execution.
How can I reduce the number of map tasks?
If the best solution to my problem is concatenating the small files into a larger file, how can I cat them?
If you're using something like TextInputFormat, the problem is that each file has at least 1 split, so the number of files is a lower bound on the number of maps. In your case, with many very small files, you end up with many mappers that each process very little data.
To remedy that, you should use CombineFileInputFormat, which packs multiple files into the same split (I think up to the block size limit). With that format the number of mappers is independent of the number of files and simply depends on the amount of data.
You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined (let's call it CombinedInputFormat, as in the link), you can tell your job to use it by doing:
job.setInputFormatClass(CombinedInputFormat.class);
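To see why this changes the mapper count, here is a small Python model comparing one split per file with files combined into splits. It is a sketch of the idea only: the real CombineFileInputFormat also takes node and rack locality into account, and the 128MB split limit and the 10,000 files of 100KB each are assumed numbers.

```python
KB, MB = 1024, 1024 * 1024


def combine_into_splits(file_sizes, max_split_bytes=128 * MB):
    """Pack consecutive files into a split until the size limit is reached."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


small_files = [100 * KB] * 10_000          # hypothetical input
mappers_per_file = len(small_files)        # at least one split per file
mappers_combined = len(combine_into_splits(small_files))
print(mappers_per_file, mappers_combined)
```

In this model the combined layout needs only a handful of map tasks for the same data, because the count follows the total bytes rather than the number of files.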
Cloudera posted a blog on the small files problem some time back. It's an old entry, but the suggested method still applies.
I am trying to optimize the performance of a search engine by precomputing all the results. We have around 50k search terms. I am planning to run the searches for these 50k terms beforehand and keep the results in memory (memcached/redis). Searching for all 50k terms takes more than a day in my case, since we do deep semantic search, so I am planning to distribute the searching (preprocessing) over several nodes. I was considering using Hadoop. My input is very small, probably less than 1MB even though there are over 50k search terms, but searching each term takes over a minute, i.e. the job is more computation-oriented than data-oriented. So I am wondering whether I should use Hadoop or build my own distributed system. I remember reading that Hadoop is mainly used when the input is very large. Please suggest how to go about this.
I also read that Hadoop reads data in blocks, i.e. 64MB for each JVM/mapper. Is it possible to split by number of lines instead of block size? For example, could every mapper get 1000 lines instead of 64MB?
Hadoop can definitely handle this task. Yes, much of Hadoop was designed to handle jobs with very large input or output data, but that's not its sole purpose. It can work well for almost any type of distributed batch processing. You'll want to take a look at NLineInputFormat; it lets you split your input exactly the way you wanted, by number of lines.
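As a quick illustration of line-based splitting, the sketch below groups an input of 50k lines into chunks of 1000 lines, which is roughly what each mapper would receive. It is a plain-Python model, not Hadoop code, and the generated term names are placeholders; in a real job I believe the chunk size is set with NLineInputFormat.setNumLinesPerSplit.

```python
def n_line_splits(lines, lines_per_split):
    """Group input lines into consecutive chunks of at most lines_per_split."""
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]


terms = [f"term-{i}" for i in range(50_000)]  # placeholder search terms
splits = n_line_splits(terms, 1000)
print(len(splits))  # 50 chunks, i.e. 50 map tasks in this model
```

Since each term takes over a minute to search, a smaller chunk size gives more map tasks and therefore more parallelism, at the price of more task start-up overhead.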