When reading data partitioned by column in Spark, a call like spark.read.json("/A=1/B=2/C=3/D=4/E=5/") will scan only the files in the folder E=5.
But let's say I am interested in reading the partitions where C = my_value across the whole data source. The call would be spark.read.json("/*/*/C=my_value/").
What happens computationally under the hood in the described scenario? Will Spark just list through the partition values of A and B, or will it scan through all the leaves (the actual files) too?
Thank you for an interesting question. Apache Spark uses Hadoop's FileSystem abstraction to deal with wildcard patterns. In the source code they're called glob patterns.
The org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path) method is used to return "an array of paths that match the path pattern". This function then calls org.apache.hadoop.fs.Globber#glob, which implements the actual file-matching algorithm for the glob pattern. globStatus is called by org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary. You can add some breakpoints to see how it works under the hood.
But long story short:
What happens computationally under the hood in the described scenario? Will Spark just list through the partition values of A and B, or will it scan through all the leaves (the actual files) too?
Spark will split your glob into 3 parts: ["*", "*", "C=my_value"]. It will then list the files at every level using Hadoop's org.apache.hadoop.fs.FileSystem#listStatus(org.apache.hadoop.fs.Path) method. For every file it will build a path and try to match it against the current pattern. The matching files are kept as "candidates" and are filtered only at the last step, when the algorithm looks for "C=my_value".
Unless you have a lot of files, this operation shouldn't hurt you. That's probably one of the reasons why you should keep fewer but bigger files (the famous data engineering problem of "too many small files").
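To make the level-by-level listing concrete, here is a minimal Python sketch of the same idea; it is not Spark's or Hadoop's actual code, and the directory layout is just an example:

```python
import fnmatch
import os

def glob_candidates(root, pattern_parts):
    """Level-by-level matching, similar in spirit to Hadoop's Globber: list the
    children at each level, keep the names matching the current pattern
    component, and descend only into the survivors."""
    candidates = [root]
    for part in pattern_parts:                    # e.g. ["*", "*", "C=my_value"]
        next_level = []
        for path in candidates:
            if not os.path.isdir(path):
                continue
            for name in os.listdir(path):         # one listing call per surviving directory
                if fnmatch.fnmatch(name, part):
                    next_level.append(os.path.join(path, name))
        candidates = next_level
    return candidates

# glob_candidates("/data", ["*", "*", "C=my_value"]) lists /data, then every A=...
# directory, then every B=... directory, keeping only paths that end in C=my_value.
```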
Related
I'm reading in tens of thousands of files into an RDD via something like sc.textFile("/data/*/*/*"). One problem is that most of these files are tiny, whereas others are huge. That leads to imbalanced tasks, which causes all sorts of well-known problems.
Can I break up the largest partitions by instead reading in my data via sc.textFile("/data/*/*/*", minPartitions=n_files*5), where n_files is the number of input files?
As covered elsewhere on Stack Overflow, minPartitions gets passed way down the Hadoop rabbit hole and is used in org.apache.hadoop.mapred.TextInputFormat.getSplits. My question is whether this is implemented such that the largest files are split first. In other words, is the splitting strategy one that tries to produce evenly sized partitions?
I would prefer an answer that points to wherever the splitting strategy is actually implemented in a recent version of Spark/Hadoop.
Nobody's posted an answer so I dug into this myself and will post an answer to my own question:
It appears that, if your input file(s) are splittable, then textFile will indeed try to balance partition size if you use the minPartitions option.
The partitioning strategy is implemented here, i.e., in the getSplits method of org.apache.hadoop.mapred.TextInputFormat. The strategy is fairly involved: it first sets goalSize, which is simply the total size of the input divided by numSplits (minPartitions is passed down to set the value of numSplits). It then splits up the files in a way that tries to ensure that each partition's size (in terms of its input's byte size) is as close as possible to goalSize.
If your input file(s) are not splittable, then this splitting will not take place: see the source code here.
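For the splittable case, the relevant arithmetic can be restated in a few lines. The following is a rough Python sketch of the logic, not Hadoop's actual code, and it assumes every file shares the same block size:

```python
SPLIT_SLOP = 1.1  # a split may exceed the target size by up to 10%, as in Hadoop

def plan_splits(file_sizes, num_splits, block_size=128 * 1024 * 1024, min_size=1):
    """Rough re-statement of FileInputFormat.getSplits for splittable files.
    file_sizes: input file lengths in bytes; num_splits: the minPartitions hint."""
    total = sum(file_sizes)
    goal_size = total // max(num_splits, 1)                 # goalSize in the Java code
    split_size = max(min_size, min(goal_size, block_size))  # computeSplitSize(...)
    splits = []
    for length in file_sizes:
        remaining = length
        while remaining / split_size > SPLIT_SLOP:          # big files get cut up
            splits.append(split_size)
            remaining -= split_size
        if remaining:
            splits.append(remaining)                        # small files stay whole
    return splits

# With one 10 GB file, many 1 MB files, and a large num_splits hint, the 10 GB file is
# cut into many goal-sized pieces while each tiny file still becomes a single small split.
```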
My question is: Are there any standard compression formats which can ensure that a certain delimiter sequence does not occur in the compressed data stream?
We want to design a binary file format, containing chunks of sequential data (3D coordinates + other data, not really important for the question). Each chunk should be compressed using a standard compression format, like GZIP, ZIP, ...
So, the file structure will be like:
FileHeader
ChunkDelimiter Chunk1_Header compress(Chunk1_Data)
ChunkDelimiter Chunk2_Header compress(Chunk2_Data)
...
Use case is the following: The files should be read in splits in Hadoop, so we want to be able to start at an arbitrary byte position in the file, and find the start of the next chunk by looking for the delimiter sequence. -> The delimiter sequence should not occur within the chunks.
I know that we could post-process the compressed data, "escaping" the delimiter sequence in case it occurs in the compressed output. But we would rather avoid this, since the "reverse escaping" would be required in the decoder, adding complexity.
Some more facts why we chose this file format:
Should be easily readable by third parties -> standard compression algorithm preferred.
Large files; streaming operation: amount of data and number of chunks is not known when starting to write the file -> Difficult to write start-of-chunk byte positions in the header.
I won't answer your question with a compression scheme name, but I will give you a hint of how others solved the same issue.
Let's take a look at Avro. Basically, it has similar requirements: files must be splittable and each data block can be compressed (you can even choose your compression scheme).
From the Avro Specification we learn that splittability is achieved with the help of a synchronization marker ("Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing."). We also discover that the synchronization marker is a 16-byte randomly-generated value ("The 16-byte, randomly-generated sync marker for this file.").
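To make the mechanism concrete, here is a minimal Python sketch, not Avro's implementation, of how a writer could generate such a marker and how a reader could jump to an arbitrary byte offset and resynchronize on the next one:

```python
import os

SYNC_LEN = 16

def new_sync_marker():
    """A random 16-byte marker, generated once per file and stored in its header."""
    return os.urandom(SYNC_LEN)

def next_block_start(f, sync, start_offset, io_chunk=1 << 20):
    """Seek to an arbitrary offset and return the offset just past the next
    sync marker, or None if no marker occurs before end of file."""
    f.seek(start_offset)
    tail = b""
    pos = start_offset                       # file offset where the next read begins
    while True:
        buf = f.read(io_chunk)
        if not buf:
            return None
        window = tail + buf                  # overlap so a marker split across reads is found
        i = window.find(sync)
        if i != -1:
            return pos - len(tail) + i + SYNC_LEN
        tail = window[-(SYNC_LEN - 1):]      # keep the last 15 bytes for the next round
        pos += len(buf)
```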
How does it solve your issue? Well, since Martin Kleppmann provided a great answer to this question a few years ago, I will just copy-paste his message:
On 23 January 2013 21:09, Josh Spiegel wrote:
As I understand it, Avro container files contain synchronization markers
every so often to support splitting the file. See:
https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
(1) Why isn't the synchronization marker the same for every container file?
(i.e. what is the point of generating it randomly every time)
(2) Is it possible, at least in theory, for naturally occurring data to
contain bytes that match the sync marker? If so, would this break
synchronization?
Thanks,
Josh
Because if it was predictable, it would inevitably appear in the actual data sometimes (e.g. imagine the Avro documentation, stating
what the sync marker is, is downloaded by a web crawler and stored in
an Avro data file; then the sync marker will appear in the actual
data). Data may come from malicious sources; making the marker random
makes it unfeasible to exploit.
Possibly, but extremely unlikely. The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data
is about 10^-23. It's more likely that your data center is wiped out
by a meteorite
(http://preshing.com/20110504/hash-collision-probabilities).
If the sync marker appears in your data, it only breaks reading the file if you happen to also seek to that place in the file. If you just
read over it sequentially, nothing happens.
Martin
Link to the Avro mailing list archive
If it works for Avro, it will work for you too.
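As a quick sanity check on Martin's estimate: a petabyte of data has roughly 10^15 positions at which a 16-byte window can start, and each window matches a fixed 16-byte marker with probability 2^-128, which is about 2.9 * 10^-39. The expected number of accidental matches is therefore about 10^15 * 2.9 * 10^-39, i.e. roughly 3 * 10^-24, consistent with the ~10^-23 figure quoted above.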
No. I know of no standard compression format that guarantees a particular sequence of bits can never occur somewhere within the compressed data. To do otherwise would (slightly) degrade compression, going against the original purpose of a compression format.
The solutions are a) post-process the sequence to use a specified break pattern, inserting escapes if the break pattern accidentally appears in the compressed data (this is guaranteed to work, but you don't like this solution), or b) trust that the universe is not conspiring against you and use a long break pattern whose length assures that it is incredibly unlikely to appear accidentally in all the sequences this is applied to anytime from now until the heat death of the universe.
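For option a), a minimal byte-stuffing sketch in Python; the 4-byte delimiter and the escape byte are arbitrary illustration values, and the delimiter must not contain the escape byte for this simple scheme to hold:

```python
DELIM = b"\xde\xad\xbe\xef"   # chunk delimiter written between compressed chunks
ESC = b"\x1b"                 # escape byte; must not occur inside DELIM

def escape(data: bytes) -> bytes:
    """Guarantee that DELIM never appears in the returned stream."""
    return data.replace(ESC, ESC + b"\x01").replace(DELIM, ESC + b"\x02")

def unescape(data: bytes) -> bytes:
    """Reverse escape(); applied by the reader before decompression."""
    return data.replace(ESC + b"\x02", DELIM).replace(ESC + b"\x01", ESC)

# Round-trip check on data that contains both the delimiter and the escape byte.
assert unescape(escape(DELIM + b"payload" + ESC)) == DELIM + b"payload" + ESC
```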
For b) you can protect somewhat against the universe conspiring against you by selecting a random pattern for each file, and providing the random pattern at the start of the file. For the truly paranoid, you could go even further and generate a new random pattern for each successive break, from the previous pattern.
Note that the universe can conspire against you for a fixed pattern. If you make one of these compressed files with a fixed break pattern, and then you include that file in another compressed archive also using that break pattern, that archive will likely not be able to compress this already compressed file and will simply store it, leaving exposed the same fixed break pattern as is being used by the archive.
Another protection for b) would be to detect the decompression failure caused by an incorrect break: the piece before the false break will not terminate cleanly, so you handle that special case by putting that piece and the following piece back together and trying the decompression again. You would very likely detect the problem on the following piece as well, since its decompression would also fail.
I need to read and process a file as a single unit, not line by line, and it's not clear how you'd do this in a Hadoop MapReduce application. What I need to do is to read the first line of the file as a header, which I can use as my key, and the following lines as data to build a 2-D data array, which I can use as my value. I'll then do some analysis on the entire 2-D array of data (i.e. the value).
Below is how I'm planning to tackle this problem, and I would very much appreciate comments if this doesn't look reasonable or if there's a better way to go about this (this is my first serious MapReduce application so I'm probably making rookie mistakes):
My text file inputs contain one line with station information (name, lat/lon, ID, etc.) and then one or more lines containing a year value (i.e. 1956) plus 12 monthly values (i.e. 0.3 2.8 4.7 ...) separated by spaces. I have to do my processing over the entire array of monthly values [number_of_years][12] so each individual line is meaningless in isolation.
Create a custom key class, making it implement WritableComparable. This will hold the header information from the initial line of the input text files.
Create a custom input format class in which a) the isSplitable() method returns false, and b) the getRecordReader() method returns a custom record reader that knows how to read a file split and turn it into my custom key and value classes.
Create a mapper class which does the analysis on the input value (the 2-D array of monthly values) and outputs the original key (the station header info) and an output value (a 2-D array of analysis values). There'll only be a wrapper reducer class since there's no real reduction to be done.
It's not clear that this is a good/correct application of the map reduce approach a) since I'm doing analysis on a single value (the data array) mapped to a single key, and b) since there is never more than a single value (data array) per key then no real reduction will ever need to be performed. Another issue is that the files I'm processing are relatively small, much less than the default 64MB split size. With this being the case perhaps the first task is instead to consolidate the input files into a sequence file, as shown in the SmallFilesToSequenceFileConverter example in the Definitive Hadoop O'Reilly book (p. 194 in the 2nd Edition)?
Thanks in advance for your comments and/or suggestions!
It looks like your plan regarding coding is spot on, I would do the same thing.
You will benefit from Hadoop if you provide a lot of input files to the job, as each file will have its own InputSplit, and in Hadoop the number of mappers executed is the same as the number of input splits.
Too many small files will cause excessive memory use on the HDFS NameNode. To consolidate the files you can use SequenceFiles or Hadoop Archives (the Hadoop equivalent of tar); see the docs. Note that with HAR files (Hadoop Archives), each small file will still have its own mapper.
I need to merge about 30 gzipped text files, each about 10-15 GB compressed, each containing multi-line records, each sorted by the same key. The files reside on an NFS share, I have access to them from several nodes, and each node has its own /tmp filesystem. What would be the fastest way to go about it?
Some possible solutions:
A. Leave it all to sort -m. To do that, I need to pass every input file through awk/sed/grep to collapse each record into a line and extract a key that would be understood by sort. So I would get something like
sort -m -k [...] <(preprocess file1) [...] <(preprocess filen) | postprocess
B. Look into python's heapq.merge.
C. Write my own C code to do this. I could merge the files in small batches, make an OMP thread for each input file, one for the output, and one actually doing the merging in RAM, etc.
Options for all of the above:
D. Merge a few files at a time, in a tournament.
E. Use several nodes for this, copying intermediate results in between the nodes.
What would you recommend? I don't have much experience with secondary storage efficiency, and as such, I find it hard to estimate how any of these would perform.
If you go for your solution B involving heapq.merge, then you will be delighted to know that Python 3.5 will add a key parameter to heapq.merge(), according to docs.python.org, bugs.python.org and github.com. This will be a great solution to your problem.
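A minimal sketch of option B using that key parameter, assuming each record has already been collapsed to a single line whose sort key is the first tab-separated field (the file names and key layout are made up for illustration):

```python
import gzip
import heapq

def lines(path):
    """Lazily yield lines from a gzipped text file; the file stays open
    only while the generator is being consumed by the merge."""
    with gzip.open(path, "rt") as f:
        yield from f

paths = ["part01.gz", "part02.gz", "part03.gz"]   # hypothetical input files

merged = heapq.merge(*(lines(p) for p in paths),
                     key=lambda line: line.split("\t", 1)[0])

with open("merged.txt", "w") as out:
    out.writelines(merged)
```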
What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, The Art of Computer Programming Vol. 2, Knuth, Addison-Wesley, ISBN 0-201-03822-6 (v. 2)
A standard merge sort approach will work (a minimal sketch follows the steps below). The common schema is:
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
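A minimal Python sketch of this schema for the CSV example above, sorting by the 4th column; the chunk size and temporary-file handling are arbitrary choices, not a reference implementation:

```python
import csv
import heapq
import itertools
import os
import tempfile

def external_sort(in_path, out_path, key_col=3, chunk_rows=1_000_000):
    """Sort a large CSV by one column: split into sorted runs on disk, then merge."""
    run_paths = []
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            chunk = list(itertools.islice(reader, chunk_rows))   # one in-memory run
            if not chunk:
                break
            chunk.sort(key=lambda row: row[key_col])
            tmp = tempfile.NamedTemporaryFile("w", suffix=".csv",
                                              newline="", delete=False)
            csv.writer(tmp).writerows(chunk)
            tmp.close()
            run_paths.append(tmp.name)

    def rows(path):
        with open(path, newline="") as f:
            yield from csv.reader(f)

    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(heapq.merge(*(rows(p) for p in run_paths),
                                     key=lambda row: row[key_col]))

    for p in run_paths:
        os.remove(p)
```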
No need to sort. Read the file ALL.CSV and append each line read to a per-day file, like 19841231.CSV. Then, for each existing day with data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible, for example by processing the original file more than once or by recording which days actually occur in ALL.CSV.
So a line containing "1985/02/28" would be added to the file 19850228.CSV, and 19850228.CSV would be appended to NEW.CSV right after 19850227.CSV. The numerical ordering avoids the use of any sort algorithm, although it could torture the file system.
In reality the file ALL.CSV could be split into one file per year, for example: 1984.CSV, 1985.CSV, and so on.
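A sketch of that bucketing idea in Python; the per-row open/append is deliberately naive (a real version would batch writes or cache file handles), and the file names follow the convention above:

```python
import csv
import os

def bucket_by_birthdate(in_path, out_path, date_col=3, bucket_dir="buckets"):
    """Distribute rows into one file per date, then concatenate the buckets
    in date order. No comparison sort of the whole data set is needed."""
    os.makedirs(bucket_dir, exist_ok=True)
    days = set()
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            day = row[date_col].replace("/", "")          # "1984/01/01" -> "19840101"
            days.add(day)
            with open(os.path.join(bucket_dir, day + ".csv"), "a", newline="") as h:
                csv.writer(h).writerow(row)               # naive: open/append per row

    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        for day in sorted(days):                          # e.g. 19840101 before 19841231
            with open(os.path.join(bucket_dir, day + ".csv"), newline="") as f:
                writer.writerows(csv.reader(f))
```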