I'm training a word embedding model based on the GloVe method. While it runs, the algorithm prints log output like this:
$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
The home directory of GloVe is filling up with files named like overflow_0534.bin. Can someone tell me whether everything is going well?
Thanks
Everything is OK.
You can view the source code of the GloVe cooccur program on GitHub.
At line 57 of the file:
long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
If your corpus has too many co-occurrence records to hold in memory, some of the data is written to temporary .bin files on disk:
while (1) {
    if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
The variable overflow_length depends on your memory settings.
Line 463:
if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
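To see how the -memory setting turns into the "overflow length" printed in the log, here is a small Python sketch of the arithmetic from lines 467 and 470. It assumes sizeof(CREC) is 16 bytes (two 4-byte ints plus an 8-byte double, which is typical but platform dependent); the function name is only for illustration.

```python
GIB = 1073741824   # bytes per gigabyte, as in the C source
CREC_SIZE = 16     # assumption: sizeof(CREC) = int + int + double on a typical platform


def overflow_length(memory_gb, crec_size=CREC_SIZE):
    """Mirror the C arithmetic: rlimit is a float, the cast truncates, then integer division by 6."""
    rlimit = 0.85 * memory_gb * GIB / crec_size
    return int(rlimit) // 6


print(overflow_length(4.0))  # 38028356, the "overflow length" shown in the log for -memory 4.0
```

With -memory 4.0 this reproduces the value in the question's log, so each overflow file simply marks one full buffer of about 38 million records being flushed to disk.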
I am trying to understand the textFile method deeply, but I think my
lack of Hadoop knowledge is holding me back here. Let me lay out my
understanding, and maybe you can correct anything that is wrong.
When sc.textFile(path) is called, defaultMinPartitions is used,
which is really just math.min(taskScheduler.defaultParallelism, 2). Let's
assume we are using the SparkDeploySchedulerBackend, where defaultParallelism is
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(),
2)).
So, let's say the default is 2. Going back to textFile, this value is
passed in to HadoopRDD. The true number of partitions is determined in getPartitions() using
inputFormat.getSplits(jobConf, minPartitions). But, from what I can find,
the partition count is merely a hint and is in fact mostly ignored, so you will
probably get one partition per block.
OK, this fits with expectations. However, what if the default is not used and
you provide a partition size that is larger than the block size? If my
research is right and the getSplits call simply ignores this parameter, then
wouldn't the provided min end up being ignored, so that you would still just get
splits of the block size?
Cross-posted to the Spark mailing list.
Short Version:
Split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize. If it is bigger than the HDFS block size, multiple blocks in the same file are combined into a single split.
Detailed Version:
I think your understanding of the procedure before inputFormat.getSplits is right.
Inside inputFormat.getSplits, more specifically inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that ultimately determines the split size. (I'm not sure which one is effective in Spark; I would guess the former.)
Let's see the code: FileInputFormat from Hadoop 2.4.0
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
  FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
  Path path = file.getPath();
  long length = file.getLen();
  if (length != 0) {
    FileSystem fs = path.getFileSystem(job);
    BlockLocation[] blkLocations;
    if (file instanceof LocatedFileStatus) {
      blkLocations = ((LocatedFileStatus) file).getBlockLocations();
    } else {
      blkLocations = fs.getFileBlockLocations(file, 0, length);
    }
    if (isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
        splits.add(makeSplit(path, length-bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        String[] splitHosts = getSplitHosts(blkLocations, length
            - bytesRemaining, bytesRemaining, clusterMap);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            splitHosts));
      }
    } else {
      String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
      splits.add(makeSplit(path, 0, length, splitHosts));
    }
  } else {
    //Create empty hosts array for zero length files
    splits.add(makeSplit(path, 0, length, new String[0]));
  }
}
Inside the for loop, makeSplit() generates each split, and splitSize is the effective split size. It is computed by the computeSplitSize function:
protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
Therefore, if minSize > blockSize, each output split is actually a combination of several blocks of the same HDFS file. On the other hand, if minSize < blockSize, splitSize is min(goalSize, blockSize), so each split covers at most one HDFS block.
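To make the two cases concrete, here is a minimal Python sketch of the same max/min formula. The file size, block size and minimum sizes below are made-up illustration values, not taken from any real cluster.

```python
MB = 1024 * 1024


def compute_split_size(goal_size, min_size, block_size):
    """Python transcription of FileInputFormat.computeSplitSize."""
    return max(min_size, min(goal_size, block_size))


total_size = 1024 * MB   # hypothetical 1 GB input file
block_size = 128 * MB    # hypothetical HDFS block size

# Default min size (1 byte), 2 requested partitions: the block size wins.
print(compute_split_size(total_size // 2, 1, block_size) // MB)          # 128

# Min size raised above the block size: two blocks are combined per split.
print(compute_split_size(total_size // 2, 256 * MB, block_size) // MB)   # 256

# Many requested partitions: goalSize drops below the block size and wins.
print(compute_split_size(total_size // 16, 1, block_size) // MB)         # 64
```

Note the third case: the requested partition count is only ignored when it would produce splits larger than a block; asking for more partitions than blocks does produce smaller splits.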
I will add more points, with examples, to Yijie Shen's answer.
Before we go into details, let's understand the following.
Assume that we are working on a Spark Standalone local system with 4 cores.
If the master is configured in the application like this:
new SparkConf().setMaster("local[*]")
then:
defaultParallelism : 4 (taskScheduler.defaultParallelism, i.e. the number of cores)
/* Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */
defaultMinPartitions : 2 // Default min number of partitions for Hadoop RDDs when not given by user
* Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
The logic that computes defaultMinPartitions is:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
The actual partition size is defined by the following formula in the method FileInputFormat.computeSplitSize:
package org.apache.hadoop.mapred;
public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
  protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }
}
where:
minSize is the Hadoop parameter mapreduce.input.fileinputformat.split.minsize (default: 1 byte)
blockSize is the value of dfs.block.size in cluster mode (the default in Hadoop 2.0 is 128 MB) and of fs.local.block.size in local mode (default: 32 MB, i.e. 33554432 bytes)
goalSize = totalInputSize/numPartitions
where:
totalInputSize is the total size in bytes of all the files in the input path
numPartitions is the optional parameter passed to sc.textFile(inputPath, numPartitions); if it is not provided, defaultMinPartitions is used, i.e. 2 if the master is set to local[*]
blockSize in local mode = 33554432 bytes
33554432/1024 = 32768 KB
32768/1024 = 32 MB
Ex1: the file size is 91 bytes
minSize = 1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize/numPartitions
goalSize = 91 (file size) / 12 (partitions passed as the 2nd parameter to sc.textFile) = 7
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) => Math.max(1, Math.min(7, 33554432)) = 7 // 33554432 is the block size in local mode
Splits = 91 (file size in bytes) / 7 (splitSize) => 13
FileInputFormat: Total # of splits generated by getSplits: 13
=> When computing splitSize, if goalSize is larger than 32 MB (a large file with few requested partitions), the split size is capped at the default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes.
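The example can be checked with a rough Python simulation of the getSplits loop for a single splittable file. This is a sketch, not Hadoop's code: the function name is made up, and SPLIT_SLOP = 1.1 is the constant Hadoop's FileInputFormat uses so that a trailing remainder of up to 10% is folded into the last split.

```python
SPLIT_SLOP = 1.1                # Hadoop's FileInputFormat allows the last split to be 10% oversized
LOCAL_BLOCK_SIZE = 33554432     # fs.local.block.size default (32 MB)


def count_splits(file_len, num_partitions, block_size=LOCAL_BLOCK_SIZE, min_size=1):
    """Count the splits getSplits would create for one splittable file (simplified)."""
    if file_len == 0:
        return 1                                    # zero-length files still get one empty split
    goal_size = file_len // (num_partitions or 1)
    split_size = max(min_size, min(goal_size, block_size))
    remaining = file_len
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        remaining -= split_size
        splits += 1
    if remaining != 0:
        splits += 1                                 # the tail becomes the final split
    return splits


print(count_splits(91, 12))                  # 13, matching Ex1
print(count_splits(100 * 1024 * 1024, 2))    # 4: goalSize is 50 MB, so the 32 MB block size caps it
```

The second call shows the capping case: a 100 MB local file read with the default 2 partitions ends up as three 32 MB splits plus a 4 MB tail.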
I was trying things out with my home server and wrote a little Ruby program that fills up the RAM by a given amount. But I actually have to halve the number of bytes I want to put into RAM. Am I missing something here, or is this a bug?
Here is the code:
class RAM
  def initialize
    @b = ''
  end

  def fill_ram(size)
    puts 'Choose if you want to set the size in bytes, megabytes or gigabytes.'
    answer = ''
    valid = ['bytes', 'megabytes', 'gigabytes']
    until valid.include?(answer)
      answer = gets.chomp.downcase
      if answer == 'bytes'
        size = size * 0.5
      elsif answer == 'megabytes'
        size = size * 1024 * 1024 * 0.5
      elsif answer == 'gigabytes'
        size = size * 1024 * 1024 * 1024 * 0.5
      else
        puts 'Please choose between bytes, megabytes or gigabyte.'
      end
    end
    size1 = size
    if @b.bytesize != 0
      size1 = size + @b.bytesize
    end
    until @b.bytesize == size1
      @b << '0' * size
    end
    size = 0
  end

  def clear_ram
    exit
  end

  def read_ram
    puts 'At the moment this program fills ' + @b.bytesize.to_s + ' bytes of RAM'
  end
end
Just imagine that the "* 0.5" on each line weren't there.
I tested it in IRB: I created a new RAM object and filled it with 1000 megabytes of data. In my case it actually filled the RAM with 2000 megabytes of data, so I added the "* 0.5" to each line, but that can't be the solution.
When I run it I get:
Choose if you want to set the size in bytes, megabytes or gigabytes.
bytes
At the moment this program fills 512 bytes of RAM
I think the problem is a missing check for the encoding.
I ran my test in US-ASCII (one character = 1 byte).
If you run it in UTF-16, that would explain your problem.
Can you run the following code to check your encoding?
p Encoding.default_internal
p Encoding.default_external
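To illustrate why the encoding could matter, here is a small sketch (in Python rather than Ruby, purely for illustration) showing that the same 1000 characters take twice as many bytes in UTF-16 as in ASCII. Whether this is really what happens in your IRB session is an assumption that the two lines above would confirm or rule out.

```python
text = "0" * 1000

ascii_bytes = len(text.encode("ascii"))       # one byte per character
utf16_bytes = len(text.encode("utf-16-le"))   # two bytes per character, no byte-order mark

print(ascii_bytes, utf16_bytes)  # 1000 2000
```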
After reading the comment:
The result of your script depends on the parameter of RAM.fill_ram. How do you start your script - and how often do you call RAM.fill_ram?
Please provide the full code.
I called my example with
r = RAM.new
r.fill_ram(1024)
r.read_ram
I was recently trying to track down some bugs in a program I am working on using valgrind, and one of the errors I got was:
==6866== Invalid write of size 4
==6866== at 0x40C9E2: superneuron::read(_IO_FILE*) (superneuron.cc:414)
The offending line (#414) reads:
amplitudes__[points_read] = 0x0;
and amplitudes__ is defined earlier as
uint32_t * amplitudes__ = (uint32_t* ) amplitudes;
Now, obviously a uint32_t is 4 bytes long, so this is the write size, but could someone tell me why it's invalid?
points_read is most likely out of bounds: you're writing past (or before) the memory you allocated for amplitudes.
A typical mistake new programmers make that triggers this warning is:
struct a *many_a;
many_a = malloc(sizeof *many_a * size + 1);
and then trying to read or write to the memory at index 'size':
many_a[size] = ...;
Here the allocation should be:
many_a = malloc(sizeof *many_a * (size + 1));
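The difference between the two allocations is easy to see with the arithmetic spelled out. This is a small Python sketch with made-up numbers (4-byte elements, size = 10); the helper names are hypothetical.

```python
def bytes_allocated_wrong(elem_size, size):
    """sizeof *p * size + 1: only ONE extra byte, not one extra element."""
    return elem_size * size + 1


def bytes_allocated_right(elem_size, size):
    """sizeof *p * (size + 1): room for size + 1 whole elements."""
    return elem_size * (size + 1)


def write_fits(allocated, elem_size, index):
    """True if writing element `index` stays inside `allocated` bytes."""
    return index >= 0 and (index + 1) * elem_size <= allocated


elem_size, size = 4, 10
print(bytes_allocated_wrong(elem_size, size))   # 41
print(bytes_allocated_right(elem_size, size))   # 44
print(write_fits(bytes_allocated_wrong(elem_size, size), elem_size, size))  # False
print(write_fits(bytes_allocated_right(elem_size, size), elem_size, size))  # True
```

Writing element 10 touches bytes 40 to 43, so with the 41-byte allocation the last three bytes fall outside the block, which is what valgrind reports as an invalid write of size 4.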
Is there a way to find out the size of a PE Header without reading all of it or the entire file?
You can calculate the total size of the PE header like this:
sizeof(Signature) + sizeof(FileHeader) + sizeof(OptionalHeader) + sizeof(SectionTable)
The file header always has the same size but the OptionalHeader's size can differ, as can the section table size.
The OptionalHeader's size is stored in FileHeader.SizeOfOptionalHeader, and the section table size equals FileHeader.NumberOfSections * sizeof(IMAGE_SECTION_HEADER).
And some C code:
DWORD SizeOfPEHeader(const IMAGE_NT_HEADERS * pNTH)
{
    return (offsetof(IMAGE_NT_HEADERS, OptionalHeader)
            + pNTH->FileHeader.SizeOfOptionalHeader
            + (pNTH->FileHeader.NumberOfSections * sizeof(IMAGE_SECTION_HEADER)));
}
All you have to do is read the DOS header, get the PE offset (e_lfanew) and read PE.Signature + PE.FileHeader into memory. That's two reading operations of fixed size and you have all the info you need.
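The two fixed-size reads could look roughly like this in Python. It is a sketch under a few assumptions: the function and helper names are made up, the file is a well-formed PE image, and the sizes used are the standard ones from the PE format (e_lfanew at offset 0x3C of the DOS header, a 20-byte file header, 40-byte section headers). The fake header at the end exists only to exercise the function without a real executable.

```python
import io
import struct

SIGNATURE_SIZE = 4         # "PE\0\0"
FILE_HEADER_SIZE = 20      # sizeof(IMAGE_FILE_HEADER)
SECTION_HEADER_SIZE = 40   # sizeof(IMAGE_SECTION_HEADER)


def pe_header_size(f):
    """Return signature + file header + optional header + section table size, using two reads."""
    f.seek(0)
    dos_header = f.read(64)                                   # read 1: DOS header
    if len(dos_header) < 64 or dos_header[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", dos_header, 0x3C)

    f.seek(e_lfanew)
    head = f.read(SIGNATURE_SIZE + FILE_HEADER_SIZE)          # read 2: signature + file header
    if len(head) < SIGNATURE_SIZE + FILE_HEADER_SIZE or head[:4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    (num_sections,) = struct.unpack_from("<H", head, SIGNATURE_SIZE + 2)
    (opt_header_size,) = struct.unpack_from("<H", head, SIGNATURE_SIZE + 16)

    return (SIGNATURE_SIZE + FILE_HEADER_SIZE + opt_header_size
            + num_sections * SECTION_HEADER_SIZE)


def fake_pe(num_sections, opt_header_size, e_lfanew=0x80):
    """Build a minimal in-memory DOS header + PE file header for testing (hypothetical helper)."""
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, e_lfanew)
    file_header = struct.pack("<HHIIIHH", 0x8664, num_sections, 0, 0, 0, opt_header_size, 0)
    return bytes(dos) + b"\0" * (e_lfanew - 64) + b"PE\0\0" + file_header


print(pe_header_size(io.BytesIO(fake_pe(3, 240))))  # 384 = 4 + 20 + 240 + 3 * 40
```

For a real file you would pass open(path, "rb") instead of the BytesIO object.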