How does SparkContext.textFile work under the covers?

I am trying to understand the textFile method deeply, but I think my
lack of Hadoop knowledge is holding me back here. Let me lay out my
understanding and maybe you can correct anything that is incorrect.
When sc.textFile(path) is called, defaultMinPartitions is used,
which is really just math.min(taskScheduler.defaultParallelism, 2). Let's
assume we are using the SparkDeploySchedulerBackend, where this is
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
So, now let's say the default is 2. Going back to textFile, this is
passed in to HadoopRDD. The true size is determined in getPartitions() using
inputFormat.getSplits(jobConf, minPartitions). But, from what I can find,
the number of partitions is merely a hint and is in fact mostly ignored, so you will
probably get the total number of blocks.
OK, this fits with expectations. However, what if the default is not used and
you provide a partition size that is larger than the block size? If my
research is right and the getSplits call simply ignores this parameter, then
wouldn't the provided minimum end up being ignored and you would still just get
the block size?
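For concreteness, the call I am reasoning about looks something like this (a minimal sketch using the Java API; the path and numbers are made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileQuestion {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("textfile-question"));
        // Pass an explicit minPartitions hint - is it honored, or do I still get one partition per block?
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log", 100);
        System.out.println(lines.getNumPartitions());
        sc.stop();
    }
}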
Cross-posted to the Spark mailing list.

Short Version:
The split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize. If it is bigger than HDFS's blockSize, multiple blocks inside the same file will be combined into a single split.
Detailed Version:
I think you are right in understanding the procedure before inputFormat.getSplits.
Inside inputFormat.getSplits, more specifically inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that ultimately determines the split size. (I'm not sure which one takes effect in Spark; I tend to believe the former.)
Let's look at the code (FileInputFormat from Hadoop 2.4.0):
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
  Path path = file.getPath();
  long length = file.getLen();
  if (length != 0) {
    FileSystem fs = path.getFileSystem(job);
    BlockLocation[] blkLocations;
    if (file instanceof LocatedFileStatus) {
      blkLocations = ((LocatedFileStatus) file).getBlockLocations();
    } else {
      blkLocations = fs.getFileBlockLocations(file, 0, length);
    }
    if (isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
        splits.add(makeSplit(path, length-bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        String[] splitHosts = getSplitHosts(blkLocations, length
            - bytesRemaining, bytesRemaining, clusterMap);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            splitHosts));
      }
    } else {
      String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
      splits.add(makeSplit(path, 0, length, splitHosts));
    }
  } else {
    //Create empty hosts array for zero length files
    splits.add(makeSplit(path, 0, length, new String[0]));
  }
}
Inside the for loop, makeSplit() is used to generate each split, and splitSize is the effective split size. The computeSplitSize function that produces splitSize:
protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
Therefore, if minSplitSize > blockSize, each output split is actually a combination of several blocks within the same HDFS file; on the other hand, if minSplitSize < blockSize, each split corresponds to one HDFS block.
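In Spark terms, this means you can get fewer, larger partitions by raising the minimum split size in the Hadoop configuration before calling textFile. Here is a minimal sketch of that idea using the Java API (my own addition, not from the original answer; the path is made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LargeSplitsDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("large-splits"));
        // With a 128 MB block size, a 256 MB minimum split size combines two blocks per split.
        sc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);
        JavaRDD<String> lines = sc.textFile("hdfs:///data/big.log");
        System.out.println(lines.getNumPartitions());
        sc.stop();
    }
}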

I will add a few more points, with examples, to Yijie Shen's answer.
Before we go into the details, let's establish the following.
Assume that we are working on a Spark Standalone local setup with 4 cores, and that in the application the master is configured as
new SparkConf().setMaster("local[*]")
Then:
defaultParallelism: 4 (taskScheduler.defaultParallelism, i.e. the number of cores) // Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD)
defaultMinPartitions: 2 // Default min number of partitions for Hadoop RDDs when not given by user. Notice that we use math.min, so defaultMinPartitions cannot be higher than 2.
The logic to find defaultMinPartitions is:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
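A quick way to check these two values yourself (a hedged sketch using the Java API; the class name is mine):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DefaultsCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[*]").setAppName("defaults-check"));
        System.out.println(sc.defaultParallelism());    // 4 on a 4-core machine
        System.out.println(sc.defaultMinPartitions());  // 2
        sc.stop();
    }
}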
The actual split (and hence partition) size is determined by the following formula in the method FileInputFormat.computeSplitSize:
package org.apache.hadoop.mapred;

public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
  protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }
}
where:
minSize is the Hadoop parameter mapreduce.input.fileinputformat.split.minsize (default: 1 byte)
blockSize is the value of dfs.block.size in cluster mode (the default in Hadoop 2.0 is 128 MB) and fs.local.block.size in local mode (default: 32 MB, i.e. blockSize = 33554432 bytes; 33554432 / 1024 = 32768 KB, 32768 / 1024 = 32 MB)
goalSize = totalInputSize / numPartitions
where:
totalInputSize is the total size in bytes of all the files in the input path
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions); if not provided, it defaults to defaultMinPartitions, i.e. 2 if the master is set to local[*]
Ex1: Suppose our file size is 91 bytes and we call sc.textFile(inputPath, 12).
minSize = 1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize / numPartitions = 91 (file size) / 12 (partitions provided as the 2nd parameter to sc.textFile) = 7
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) => Math.max(1, Math.min(7, 33554432)) = 7 // 33554432 is the block size in local mode
Splits = 91 (file size in bytes) / 7 (splitSize) => 13
FileInputFormat: Total # of splits generated by getSplits: 13
Note: when computing splitSize, if goalSize is greater than the 32 MB block size, then the split size is capped at the default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes.
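Here is a small self-contained re-check of the arithmetic in Ex1 (my own sketch; it just replays the formula above):

public class Ex1Check {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalInputSize = 91L;       // file size in bytes
        long numPartitions = 12L;        // 2nd argument to sc.textFile
        long minSize = 1L;               // mapreduce.input.fileinputformat.split.minsize
        long blockSize = 33554432L;      // fs.local.block.size, 32 MB in local mode
        long goalSize = totalInputSize / numPartitions;                       // 7
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);      // 7
        long splits = (long) Math.ceil((double) totalInputSize / splitSize);  // 13
        System.out.println(splitSize + " bytes per split, " + splits + " splits");
    }
}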

Related

What do the overflow_xxxx.bin files mean while training GloVe?

I'm training a word embedding model based on the GloVe method. While running, the algorithm shows log output like:
$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
The home directory of GloVe is filled with files called overflow_0534.bin. Can someone tell me whether all is going well?
Thanks
Everything is OK.
You can view the source code of the GloVe cooccur program on GitHub.
At line 57 of the file:
long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
If your corpus has too many co-occurrence records, some of the data will be written into temporary .bin files on disk.
while (1) {
    if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
The variable overflow_length depends on your memory settings.
Line 463:
if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
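To see where the number in the log comes from, here is a rough cross-check (my own sketch, assuming -memory 4.0 and sizeof(CREC) == 16 bytes, i.e. two ints plus a double):

public class OverflowLengthCheck {
    public static void main(String[] args) {
        double memoryLimit = 4.0;                                  // -memory 4.0
        double rlimit = 0.85 * memoryLimit * 1073741824.0 / 16.0;  // line 467
        long overflowLength = (long) (rlimit / 6.0);               // line 470
        System.out.println(overflowLength);  // prints 38028356, matching "overflow length" in the log
    }
}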

Get correct file size from Spring MultipartFile

How do I find the correct file size from org.springframework.web.multipart.MultipartFile?
double bytes = m.getSize();
double kilobytes = (bytes / 1024);
double megabytes = (kilobytes / 1024);
System.out.println(megabytes);
but it's not giving the correct size.
Here m.getSize() returns 2264855, and using this method it reports the file size as roughly 2 MB, but the actual size of the file on my system is 8 MB.
private HashMap add(@RequestParam(value = "files", required = false) MultipartFile[] mPartFile) {
    if (mPartFile != null && mPartFile.length > 0) {
        for (MultipartFile m : mPartFile) {
            if (!CoreServices.isSupportedFile(m.getContentType())) {
                hm.put("status", MConstants.OPERATION_FAILED);
                hm.put("reason", "File: " + m.getOriginalFilename() + " is not supported. Upload only jpeg, png, gif image files.");
                return hm;
            }
            long size = m.getSize();
            if (size > MConstants.MAX_ALLOWED_FILE_SIZE) {
                hm.put("status", MConstants.OPERATION_FAILED);
                hm.put("reason", "File: " + m.getName() + " is not acceptable, its size is more than 5MB.");
                return hm;
            }
        }
    }
}
Here the value of MAX_ALLOWED_FILE_SIZE is 5000000 for 5MB limitation
You are using the getSize() method of MultipartFile, which returns the size in bytes.
So to convert that to MB you have to multiply file.getSize() by 0.00000095367432 (that is, divide by 1024 * 1024). This will give you the size in MB.
I am using the same approach. I can see that the size of a file in Windows is 18.9 KB, which is the actual size, not the size on disk. In my case I needed the value in KB, so I multiplied by 0.0009765625 (i.e. divided by 1024) and got 18.9462890625 KB, which is close to the file size; Windows is probably just rounding it.
I believe multiplying file.getSize() by 0.00000095367432 will work.
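A minimal sketch of the conversion (the class and method names are mine); dividing by 1024 avoids the rounded constants:

public final class FileSizes {
    public static double toKilobytes(long bytes) {
        return bytes / 1024.0;
    }

    public static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long reported = 2264855L;                   // the value m.getSize() returned in the question
        System.out.println(toMegabytes(reported));  // ~2.16 MB, so getSize() itself reports ~2 MB, not 8 MB
    }
}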

Config for flume to write files ~100mb (close to 120mb hdfs file size)

I'm trying to configure Flume so it writes files at least close to the HDFS block size, which is 128 MB in my case. This is my config, which is writing about 10 MB per file:
###############################
httpagent.sources = http-source
httpagent.sinks = k1
httpagent.channels = ch3
# Define / Configure Source (multiport seems to support newer "stuff")
###############################
httpagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
httpagent.sources.http-source.channels = ch3
httpagent.sources.http-source.port = 5140
httpagent.sinks = k1
httpagent.sinks.k1.type = hdfs
httpagent.sinks.k1.channel = ch3
httpagent.sinks.k1.hdfs.path = hdfs://r3608/hadoop/hdfs/data/flumechannel3/0.5/
httpagent.sinks.k1.hdfs.fileType = DataStream
httpagent.sinks.HDFS.hdfs.writeFormat = Text
httpagent.sinks.k1.hdfs.rollCount = 0
httpagent.sinks.k1.hdfs.batchSize = 10000
httpagent.sinks.k1.hdfs.rollSize = 0
httpagent.sinks.log-sink.channel = memory
httpagent.sinks.log-sink.type = logger
# Channels
###############################
httpagent.channels = ch3
httpagent.channels.ch3.type = memory
httpagent.channels.ch3.capacity = 100000
httpagent.channels.ch3.transactionCapacity = 80000
So the problem is I can't get it to write ~100 MB files. I would expect it to write at least around 100 MB if I change the config like this:
httpagent.sinks = k1
httpagent.sinks.k1.type = hdfs
httpagent.sinks.k1.channel = ch3
httpagent.sinks.k1.hdfs.path = hdfs://r3608/hadoop/hdfs/data/flumechannel3/0.4test/
httpagent.sinks.k1.hdfs.fileType = DataStream
httpagent.sinks.HDFS.hdfs.writeFormat = Text
httpagent.sinks.k1.hdfs.rollSize = 100000000
httpagent.sinks.k1.hdfs.rollCount = 0
But then the files get even smaller, around 3-8 MB per file... Since it's not really possible to aggregate files once they are in HDFS, I really want to get these files bigger. Is there something I'm not getting about the rollSize parameter? Or is there some default value that prevents it from ever writing files that big?
You need to override rollInterval to 0 so the sink never rolls based on a time interval (by default the HDFS sink rolls every 30 seconds):
httpagent.sinks.k1.hdfs.rollInterval = 0
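Putting the roll settings together (a hedged example reusing the sink name from your config), the sink should then roll only when the size threshold is reached:
httpagent.sinks.k1.hdfs.rollInterval = 0
httpagent.sinks.k1.hdfs.rollCount = 0
httpagent.sinks.k1.hdfs.rollSize = 100000000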

Aparapi add sample

I'm studying Aparapi (https://code.google.com/p/aparapi/) and I see strange behaviour in one of the included samples.
The sample is the first one, "add". Building and executing it is fine. I also added the following code to test whether the GPU is really used:
if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)) {
    System.out.println("Kernel did not execute on the GPU!");
}
and it works fine.
But, if I try to change the size of the array from 512 to a number greater than 999 (for example 1000), I have the following output:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1000, localSize[0] = 128
Apr 18, 2013 1:31:01 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
JTP
Kernel did not execute on the GPU!
Here's my code:
final int size = 1000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
    a[i] = (float) (Math.random() * 100);
    b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];

Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];
    }
};

Range range = Range.create(size);
kernel.execute(range);

System.out.println(kernel.getExecutionMode());
if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)) {
    System.out.println("Kernel did not execute on the GPU!");
}
kernel.dispose();
I tried specifying the size using
Range range = Range.create(size, 128);
as suggested in a Google group, but nothing changed.
I'm currently running on Mac OS X 10.8 with Java 1.6.0_43. Aparapi version is the latest (2012-01-23).
Am I missing something? Any ideas?
Thanks in advance
Aparapi inherits a 'Grid Style' of implementation from OpenCL. When you specify a range of execution (say 1024), OpenCL will break this 'range' into groups of equal size. Possibly 4 groups of 256, or 8 groups of 128.
The group size must be a factor of range (so assert(range%groupSize==0)).
By default Aparapi internally selects the group size.
But you are choosing to fully specify the range and the group size using
Range r = Range.create(n, 128)
In that case you are responsible for ensuring that n % 128 == 0.
From the error, it looks like you chose Range.create(1000, 128).
Sadly 1000 % 128 != 0, so this range will fail.
If you specify
Range r = Range.create(n)
Aparapi will choose a valid group size for you (a factor of n).
Try dropping the 128 as the second arg.
Gary
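If you do want to keep an explicit group size of 128, a common workaround (my own sketch, not part of the answer above) is to pad the global size up to a multiple of the group size and guard the extra work items inside the kernel:

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class PaddedAdd {
    public static void main(String[] args) {
        final int size = 1000;
        final int groupSize = 128;
        // Round the global size up to the next multiple of the group size (1024 here).
        final int globalSize = ((size + groupSize - 1) / groupSize) * groupSize;

        final float[] a = new float[size];
        final float[] b = new float[size];
        final float[] sum = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int gid = getGlobalId();
                if (gid < size) {   // guard: the padded work items do nothing
                    sum[gid] = a[gid] + b[gid];
                }
            }
        };
        kernel.execute(Range.create(globalSize, groupSize));

        System.out.println(kernel.getExecutionMode());
        kernel.dispose();
    }
}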

Determining All Possibilities for a Random String?

I was hoping someone with better math capabilities would assist me in figuring out the total number of possibilities for a string given its length and character set.
i.e. [a-f0-9]{6}
What are the possibilities for this pattern of random characters?
It is equal to the number of characters in the set raised to the 6th power.
In Python (3.x) interpreter:
>>> len("0123456789abcdef")
16
>>> 16**6
16777216
>>>
EDIT 1:
Why 16.7 million? Well, 000000 ... 999999 = 10^6 = 1M, 16/10 = 1.6 and
>>> 1.6**6
16.77721600000000
EDIT 2:
To create a list in Python, do: print(['{0:06x}'.format(i) for i in range(16**6)])
However, this is too huge. Here is a simpler, shorter example:
>>> ['{0:06x}'.format(i) for i in range(100)]
['000000', '000001', '000002', '000003', '000004', '000005', '000006', '000007', '000008', '000009', '00000a', '00000b', '00000c', '00000d', '00000e', '00000f', '000010', '000011', '000012', '000013', '000014', '000015', '000016', '000017', '000018', '000019', '00001a', '00001b', '00001c', '00001d', '00001e', '00001f', '000020', '000021', '000022', '000023', '000024', '000025', '000026', '000027', '000028', '000029', '00002a', '00002b', '00002c', '00002d', '00002e', '00002f', '000030', '000031', '000032', '000033', '000034', '000035', '000036', '000037', '000038', '000039', '00003a', '00003b', '00003c', '00003d', '00003e', '00003f', '000040', '000041', '000042', '000043', '000044', '000045', '000046', '000047', '000048', '000049', '00004a', '00004b', '00004c', '00004d', '00004e', '00004f', '000050', '000051', '000052', '000053', '000054', '000055', '000056', '000057', '000058', '000059', '00005a', '00005b', '00005c', '00005d', '00005e', '00005f', '000060', '000061', '000062', '000063']
>>>
EDIT 3:
As a function:
def generateAllHex(numDigits):
    assert(numDigits > 0)
    ceiling = 16**numDigits
    for i in range(ceiling):
        formatStr = '{0:0' + str(numDigits) + 'x}'
        print(formatStr.format(i))
This will take a while to print at numDigits = 6.
I recommend dumping this to file instead like so:
def generateAllHex(numDigits, fileName):
    assert(numDigits > 0)
    ceiling = 16**numDigits
    with open(fileName, 'w') as fout:
        for i in range(ceiling):
            formatStr = '{0:0' + str(numDigits) + 'x}'
            fout.write(formatStr.format(i) + '\n')
If you are just looking for the number of possibilities, the answer is (charset.length)^(length). If you need to actually generate a list of the possibilities, just loop through each character, recursively generating the remainder of the string.
e.g.
void generate(char[] charset, int length)
{
    generate("", charset, length);
}

void generate(String prefix, char[] charset, int length)
{
    for (int i = 0; i < charset.length; i++)
    {
        if (length == 1)
            System.out.println(prefix + charset[i]);
        else
            generate(prefix + charset[i], charset, length - 1);
    }
}
The number of possibilities is the size of your alphabet raised to the power of the length of your string (in the general case, of course).
Assuming your string length is 4: _ _ _ _ and your alphabet = { 0, 1 }: there are 2 possibilities for putting 0 or 1 in the first place, 2 for the second place, and so on.
So it all works out to alphabet_size^string_size (here 2^4 = 16).
first: 000000
last: ffffff
These are exactly the 6-digit hexadecimal numbers.
For any given set of possible values, the number of permutations is the number of possibilities raised to the power of the number of items.
In this case, that would be 16 to the 6th power, or 16777216 possibilities.
