Why is Hadoop SequenceFile writing much slower than reading?

I am converting some custom files of mine into Hadoop SequenceFiles using the Java API.
I read byte arrays from a local file and append them to a sequence file as Index (Integer) - Data (Byte[]) pairs:
InputStream in = new BufferedInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");
IntWritable key = new IntWritable();
BytesWritable value = new BytesWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());
for (int i = 1; i <= nz; i++) {
    byte[] imageData = new byte[nx * ny * 2];
    in.read(imageData);
    key.set(i);
    value.set(imageData, 0, imageData.length);
    writer.append(key, value);
}
IOUtils.closeStream(writer);
in.close();
I do exactly the opposite when I want to bring the files back to the initial format:
for (int i = 1; i <= nz; i++) {
    reader.next(key, value);
    int byteLength = value.getLength();
    byte[] tempValue = value.getBytes();
    out.write(tempValue, 0, byteLength);
    out.flush();
}
I noticed that writing to the SequenceFile takes almost an order of magnitude longer than reading it back. I expect writing to be slower than reading, but is this difference normal? Why?
More Info:
The byte arrays I read are 2 MB each (nx = ny = 1024 and nz = 128).
I am testing in pseudo-distributed mode.

You are reading from local disk and writing to HDFS. When you write to HDFS your data is probably being replicated, so it is physically written two or three times depending on the replication factor you have set.
So you are not only writing, you are writing two or three times the amount of data you are reading, and your writes go over the network while your reads do not.
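If you want to rule replication out, one way is to force a replication factor of 1 for the test (a minimal sketch; dfs.replication is the standard HDFS setting, and in a pseudo-distributed setup it may already be 1):
// Sketch: request replication factor 1 for files created with this Configuration,
// then re-run the timing to see how much of the slowdown replication accounts for.
Configuration conf = new Configuration();
conf.set("dfs.replication", "1");
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
// ...then create the SequenceFile.Writer exactly as before and repeat the measurement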

Are nx and ny constants?
One reason you could be seeing this is that each iteration of your for loop creates a new byte array. This requires the JVM to allocate some heap space. If the array is sufficiently large, this is going to be expensive, and eventually you're going to run into the GC. I'm not sure what HotSpot might do to optimize this away, however.
My suggestion would be to create a single BytesWritable:
// use DataInputStream so you can call readFully()
DataInputStream in = new DataInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");
IntWritable key = new IntWritable();
// create a BytesWritable which can hold the maximum possible number of bytes
BytesWritable value = new BytesWritable(new byte[maxPossibleSize]);
// grab a reference to the value's underlying byte array
byte byteBuf[] = value.getBytes();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());
for (int i = 1; i <= nz; i++) {
    // work out how many bytes to read - if this is a constant, move it outside the for loop
    int imageDataSize = nx * ny * 2;
    // read bytes into the byte array
    in.readFully(byteBuf, 0, imageDataSize);
    key.set(i);
    // set the actual number of bytes used in the BytesWritable object
    value.setSize(imageDataSize);
    writer.append(key, value);
}
IOUtils.closeStream(writer);
in.close();

Related

Why is an Unrolled LinkedList semi-filled?

Going through the articles on the implementation of Unrolled LinkedLists, it seems that based on the capacity we derive a threshold, and if the number of elements grows beyond it we create another node and put the element in that node.
Threshold = (capacity/2)+1
But why are we not filling the array in an Unrolled LinkedList to its full capacity and then creating another node? Why do we need a threshold and keep the array semi-filled?
Quoted from Geeks for Geeks - insertion in Unrolled Linked List:
/* Java program to show the insertion operation
 * of Unrolled Linked List */
import java.util.Scanner;
import java.util.Random;

// class for each node
class UnrollNode {
    UnrollNode next;
    int num_elements;
    int array[];

    // Constructor
    public UnrollNode(int n)
    {
        next = null;
        num_elements = 0;
        array = new int[n];
    }
}

// Operation of Unrolled Function
class UnrollLinkList {
    private UnrollNode start_pos;
    private UnrollNode end_pos;
    int size_node;
    int nNode;

    // Parameterized Constructor
    UnrollLinkList(int capacity)
    {
        start_pos = null;
        end_pos = null;
        nNode = 0;
        size_node = capacity + 1;
    }

    // Insertion operation
    void Insert(int num)
    {
        nNode++;
        // Check if the list starts from NULL
        if (start_pos == null) {
            start_pos = new UnrollNode(size_node);
            start_pos.array[0] = num;
            start_pos.num_elements++;
            end_pos = start_pos;
            return;
        }
        // Attaching the elements into nodes
        if (end_pos.num_elements + 1 < size_node) {
            end_pos.array[end_pos.num_elements] = num;
            end_pos.num_elements++;
        }
        // Creation of new Node
        else {
            UnrollNode node_pointer = new UnrollNode(size_node);
            int j = 0;
            for (int i = end_pos.num_elements / 2 + 1;
                 i < end_pos.num_elements; i++)
                node_pointer.array[j++] = end_pos.array[i];
            node_pointer.array[j++] = num;
            node_pointer.num_elements = j;
            end_pos.num_elements = end_pos.num_elements / 2 + 1;
            end_pos.next = node_pointer;
            end_pos = node_pointer;
        }
    }

    // Display the Linked List
    void display()
    {
        System.out.print("\nUnrolled Linked List = ");
        System.out.println();
        UnrollNode pointer = start_pos;
        while (pointer != null) {
            for (int i = 0; i < pointer.num_elements; i++)
                System.out.print(pointer.array[i] + " ");
            System.out.println();
            pointer = pointer.next;
        }
        System.out.println();
    }
}

/* Main Class */
class UnrolledLinkedList_Check {
    // Driver code
    public static void main(String args[])
    {
        Scanner sc = new Scanner(System.in);
        // create instance of Random class
        Random rand = new Random();
        UnrollLinkList ull = new UnrollLinkList(5);
        // Perform Insertion Operation
        for (int i = 0; i < 12; i++) {
            // Generate random integers in range 0 to 99
            int rand_int1 = rand.nextInt(100);
            System.out.println("Entered Element is " + rand_int1);
            ull.Insert(rand_int1);
            ull.display();
        }
    }
}
why are we not filling the array in Unrolled LinkedList to its full capacity and then creating another node?
Actually, the code that is provided does fill the array to full capacity:
if (end_pos.num_elements + 1 < size_node) {
    end_pos.array[end_pos.num_elements] = num;
    end_pos.num_elements++;
}
The threshold is not used here. Only when the array reaches its capacity does the threshold play a role, as a new array gets created:
UnrollNode node_pointer = new UnrollNode(size_node);
int j = 0;
for (int i = end_pos.num_elements / 2 + 1; i < end_pos.num_elements; i++)
    node_pointer.array[j++] = end_pos.array[i];
Here we see that the second half of the full array is copied into the new array. We can imagine this process as splitting a block into two blocks, much like what happens in a B-tree.
This allows for fast insertion the next time a value needs to be inserted (not at the end, but) at a specific offset in that array. If the array were left full, it would trigger a new block at each insertion into it. By leaving slack space in an array, we ensure fast insertion for at least a few of the future insertions that happen to land in that array.
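For example, with the capacity of 5 used in the driver code above (so size_node = 6): a node fills up to 5 elements, and on the next insert end_pos keeps elements 0..2 (3 elements) while the new node receives elements 3..4 plus the inserted value (also 3 elements), so both nodes end up about half full.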
I think it is to ensure having at least 50% utilization.
The algorithm not only splits a node in half when an insert is done, but also redistributes content if a node is utilized at less than 50%. I think the second part is key: if you don't do the first part, you can't add a check that redistributes when a node is underutilized, because your newly created node would immediately violate that check.
If you don't do the redistribution at all you might end up with a situation where you have a lot of underutilized nodes.
My first intuition was the same as the other comment (to avoid creating new nodes every time), but this wouldn't necessarily be an issue if you always check the next node before creating a new node, so I'm not sure that's sufficient as a reason? (But maybe I'm missing something here.)

Chunking algorithm for any type of data

I want to chunk large files of any type (audio, video, image...) into small ones. I have tried many algorithms but I'm unable to do this. Can anyone suggest a working algorithm?
Just copy the chunks into small files using the following start positions:
N = FileSize / ChunkSize        // integer division
RestSize = FileSize % ChunkSize // integer modulo
for i = 0 to N - 1
    Copy ChunkSize bytes from position i * ChunkSize into ChunkFile[i]
if RestSize > 0
    Copy RestSize bytes from position N * ChunkSize into ChunkFile[N]
Example: divide a 7-byte file into 2-byte chunks. N = 3, RestSize = 1, so you get three 2-byte files and one 1-byte file.
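If you want this in Java, a minimal sketch of the same idea might look like the following (the file names input.bin and chunk.N are just placeholders):
import java.io.*;

public class SimpleChunker {
    public static void main(String[] args) throws IOException {
        int chunkSize = 2;                   // 2-byte chunks, as in the example above
        byte[] buffer = new byte[chunkSize];
        int chunkIndex = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.bin"))) {
            int bytesRead;
            // read() reports how many bytes were actually read; the last chunk may be shorter
            while ((bytesRead = in.read(buffer)) > 0) {
                try (OutputStream out = new FileOutputStream("chunk." + chunkIndex++)) {
                    out.write(buffer, 0, bytesRead);
                }
            }
        }
    }
}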
You can't read a big file in one go, even if you have that much memory. Basically, for each split you can read a fixed-size byte array, which should be feasible in terms of both performance and memory.
import java.io.*;

// wrapper class and import added so the snippet compiles as-is
public class FileSplitter {
    public static void main(String[] args) throws Exception
    {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; // from user input, extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize / numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; // 8KB
        for (int destIx = 1; destIx <= numSplits; destIx++) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + destIx));
            if (bytesPerSplit > maxReadBufferSize) {
                long numReads = bytesPerSplit / maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for (int i = 0; i < numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if (numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        // any bytes left over go into one extra split file
        if (remainingBytes > 0) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + (numSplits + 1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        // readFully() keeps reading until the buffer is filled; a plain read() may return fewer bytes
        raf.readFully(buf);
        bw.write(buf);
    }
}
You should also look at some of the related discussions, for example https://coderanch.com/t/458202/java/Approach-split-file-chunks.
Happy coding.

Conversion of PNG image to base64 in Windows phone7.1

I want to convert a PNG image found at a path to base64 for an HTML page in Windows Phone 7.1. How can it be done?
Stream imgStream;
imgStream = Assembly.GetExecutingAssembly().GetManifestResourceStream("NewUIChanges.Htmlfile.round1.png");
byte[] data = new byte[(int)imgStream.Length];
int offset = 0;
while (offset < data.Length)
{
    int bytesRead = imgStream.Read(data, offset, data.Length - offset);
    if (bytesRead <= 0)
    {
        throw new EndOfStreamException("Stream wasn't as long as it claimed");
    }
    offset += bytesRead;
}
The fact that it's a PNG image is actually irrelevant - all you need to know is that you've got some bytes that you need to convert into base64.
Read the data from a stream into a byte array, and then use Convert.ToBase64String. Reading a byte array from a stream can be slightly fiddly, depending on whether the stream advertises its length or not. If it does, you can use:
byte[] data = new byte[(int) stream.Length];
int offset = 0;
while (offset < data.Length)
{
    int bytesRead = stream.Read(data, offset, data.Length - offset);
    if (bytesRead <= 0)
    {
        throw new EndOfStreamException("Stream wasn't as long as it claimed");
    }
    offset += bytesRead;
}
If it doesn't, the simplest approach is probably to copy it to a MemoryStream:
using (MemoryStream ms = new MemoryStream())
{
    byte[] buffer = new byte[8 * 1024];
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        ms.Write(buffer, 0, bytesRead);
    }
    return ms.ToArray();
}
So once you've used either of those bits of code (or anything else suitable) to get a byte array, just use Convert.ToBase64String and you're away.
There are probably streaming solutions which will avoid ever having the whole byte array in memory - e.g. building up a StringBuilder of base64 data as it goes - but they would be more complicated. Unless you're going to deal with very large files, I'd stick with the above.

random writing Markov Model efficiency

Here is my implementation. However, it is a bit slow when analyzing the text file.
Does anyone have a better idea or a better data structure for implementing random writing?
I'm not using the STL library, so don't worry about the syntax: instead of push_back, the vector here uses .add, and randomInteger generates a random integer within the given range.
I would like to produce 2000 characters if possible.
I think the slowest part is reading the file char by char?
void generateText(int order, string initSeed, string filename){
    Map<string, Vector<char> > model;
    char ch;
    string key;
    ifstream input(filename.c_str());
    for(int i = 0; i < order; i++){
        input.get(ch);
        key += ch;
    }
    while(input.get(ch)){
        model[key].add(ch);
        key = key.substr(1, key.length()-1) + ch;
    }
    string result;
    string seed = initSeed;
    for(int i = 0; i < 2000; i++){
        if (model[seed].size() > 0) {
            ch = model[seed][randomInteger(0, model[seed].size()-1)];
            cout << ch;
            seed = seed.substr(1, seed.length()-1) + ch;
        }
        else
            return;
    }
}
You need to determine that it is taking too long. (How is this code not running in less than a second on an average laptop?)
If it is, you need to profile.
For example, a likely candidate is the cost of generating random numbers...
You'll only disprove me by profiling ;)
I think it is a bit slow because it creates lots of temporary strings during the analysis phase.
for(int i = 0; i < order; i++){
    input.get(ch);
    key += ch;                                 // key = key + ch, at least one new string created
}
while(input.get(ch)){
    model[key].add(ch);                        // key copied to hash table
    key = key.substr(1, key.length()-1) + ch;  // a couple of temp strings created
}
You could instead do something like this:
char key[order + 1]; // pseudo code, won't work because order is not constant
key[order] = 0;      /* NUL terminate */
for (int i = 0; i < order; i++) {
    input.get(key[i]);
}
while (input.get(ch)) {
    model[key].add(ch);
    // rotate the key: drop the first character and append the one just read
    for (int j = 0; j < order - 1; j++) {
        key[j] = key[j + 1];
    }
    key[order - 1] = ch;
}
Here the only string that is actually created is the string that ends up as a key in the hash table. The key is rotated in a simple character array, avoiding string temporaries.

Running times for sorting methods over multiple arrays

I have various sorting methods that are all sorting the same 100,000 random number array.
I'm using the following to find the runtime of each:
long insertionStart = System.currentTimeMillis();
arr.Clone(iniArr);
arr.insertionSort();
long insertionFinal = System.currentTimeMillis() - insertionStart;
And the following for the random number array:
int maxSize = 100000; // array size
Sortarr arr, iniArr; // reference to array
arr = new Sortarr(maxSize); // create the array
iniArr = new Sortarr(maxSize);
// insert random numbers
Random generator = new Random();
for (int i = 0; i < maxSize; i++) iniArr.insert(generator.nextInt());
How can I modify this so that each of them sorts 100 arrays rather than just one, and the time of each run is recorded? E.g. Run1 - 23ms; Run2 - 25ms; ... Run100 - 22ms
EDIT:
I have one final thing to do.
So each iteration sorts the array a few ways, let's say insertion, merge, and quick sort.
So say insertion = 300ms, merge = 200ms, and quick = 100ms. I need to, for each iteration, find which method sorted the fastest.
I know this is a simple min/max type thing that you do a thousand times in lower programming classes.
Would it be easier to throw each value into an array and use an array.min call? (Whatever that actually is; I'm new to Java syntax.)
Currently, it looks like you are creating the array and then repeatedly sorting using different functions.
You simply need to put all of that in a loop.
int maxRuns = 100;
int maxSize = 100000; // array size
for (int run = 0; run < maxRuns; run++) {
    Sortarr arr, iniArr;        // reference to array
    arr = new Sortarr(maxSize); // create the array
    iniArr = new Sortarr(maxSize);

    // insert random numbers
    Random generator = new Random();
    for (int i = 0; i < maxSize; i++) iniArr.insert(generator.nextInt());

    long insertionStart = System.currentTimeMillis();
    arr.Clone(iniArr);
    arr.insertionSort();
    long insertionFinal = System.currentTimeMillis() - insertionStart;

    /* <more code goes here> */
}
You can use the index run when printing out your results.
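To get output in the format from the question (e.g. Run1 - 23ms), a line like this inside the loop would do (a sketch; insertionFinal is the value computed above):
System.out.println("Run" + (run + 1) + " - " + insertionFinal + "ms");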
You probably would be doing something like:
// note: 'try' is a reserved word in Java, so use a different loop variable name
for (int run = 0; run < 100; run++) {
    iniArr = new Sortarr(maxSize);
    // insert random numbers
    Random generator = new Random();
    for (int i = 0; i < maxSize; i++) iniArr.insert(generator.nextInt());

    long insertionStart = System.currentTimeMillis();
    arr.Clone(iniArr);
    arr.insertionSort();
    long insertionFinal = System.currentTimeMillis() - insertionStart;

    // print out the time, and/or add up the total
}
You'd still need the initialization beforehand. I'm not sure why the array is cloned before it is sorted; can you sort that array directly?
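Regarding the edit about finding the fastest method in each iteration, a minimal sketch, assuming you time the other sorts the same way as insertionFinal (mergeFinal and quickFinal here are hypothetical variables you would compute yourself):
// pick the smallest of the three timings for this iteration
long[] times = { insertionFinal, mergeFinal, quickFinal };
String[] names = { "insertion", "merge", "quick" };
int fastest = 0;
for (int i = 1; i < times.length; i++) {
    if (times[i] < times[fastest]) {
        fastest = i;
    }
}
System.out.println("Fastest this run: " + names[fastest] + " (" + times[fastest] + " ms)");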
