save array of bigint to file - text-files

I have a large array of big integers which I need to save to disk for the next user session. I don't have a database or the option to use one.
Will large numbers take more space in memory when I write them to a text file?
What is the best way to store this array for later use?

I'd just go for it using a normal text file. On disk, the integer takes as much space as its string representation, so 2759275918572192759185721 as a big integer will take the same space as 2759275918572192759185721 as a string (which isn't that much).
When reading them from the file, you simply parse them again.
IMPORTANT: There is NO error handling in this code! You absolutely MUST add try-catch-finally to catch IOException and NumberFormatException!
File file = new File("C:\\Users\\Phiwa\\Desktop\\test.txt");
if (!file.exists())
    file.createNewFile();

// Write the BigInteger to the file as its decimal string
FileWriter fw = new FileWriter(file);
BigInteger bint1 = new BigInteger("999999999999999999");
fw.write(bint1.toString());
fw.flush();
fw.close();
// BigInteger has been written to the file

// Read it from the file again
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
String str = br.readLine();
br.close();
// Parse with the BigInteger(String) constructor; going through Long.parseLong
// would overflow for values larger than Long.MAX_VALUE
BigInteger bint2 = new BigInteger(str);
System.out.println(bint2);
This works; have you already tried it?
Things are a bit different if you have the integer in the format 0x999999999999999; in that case you would use
BigInteger bint2 = new BigInteger(str.replace("0x", ""), 16);
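Since the question is about a whole array, here is a minimal sketch of the same idea extended to many values, one decimal string per line. The file name and variable names are placeholders of mine, and again there is no error handling:
import java.io.*;
import java.math.BigInteger;
import java.util.*;

// Write each BigInteger on its own line (no error handling here either; add try-catch-finally)
List<BigInteger> numbers = Arrays.asList(
        new BigInteger("2759275918572192759185721"),
        new BigInteger("999999999999999999"));
PrintWriter out = new PrintWriter(new FileWriter("bigints.txt"));
for (BigInteger n : numbers)
    out.println(n.toString());
out.close();

// Read the lines back and parse them into BigIntegers again
List<BigInteger> loaded = new ArrayList<BigInteger>();
BufferedReader in = new BufferedReader(new FileReader("bigints.txt"));
String line;
while ((line = in.readLine()) != null)
    loaded.add(new BigInteger(line));
in.close();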

Related

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of Java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
The code above is part of a FileSenderClient class which is for sending files from a client to a server using java.io and java.net.Socket.
My question is: in the above code, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
To put it another way: what is the point of having a "len" parameter in the OutputStream.write() method?
It seems that both versions work fine.
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
It seems that both versions work fine.
No, they don't.
The two do not actually work the same way.
Most likely you tested with a very small text file. If you look carefully, you will find a lot of extra padding at the end of the file you received, and the received file is larger than the file you sent.
The reason is that you created a byte array of size 1024, but you don't have that much data to read() into it, so the tail of the byte array is left filled with NUL bytes. When you write the whole array out, those NUL bytes are written to the file too and show up as spaces " " in Windows Notepad...
If you use a more capable text editor like Notepad++ or Sublime Text to view the file you received, you will see these NUL characters.
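A minimal sketch of the difference, with placeholder file names that are not from the question: the exact copy ends up the same size as the input, while the padded copy is rounded up to a multiple of 1024 because the unused tail of the buffer gets written out as well.
import java.io.*;

try (InputStream in = new FileInputStream("input.bin");
     OutputStream exact = new FileOutputStream("copy-exact.bin");
     OutputStream padded = new FileOutputStream("copy-padded.bin")) {
    byte[] buffer = new byte[1024];
    int len;
    while ((len = in.read(buffer)) != -1) {
        exact.write(buffer, 0, len);  // only the bytes actually read in this iteration
        padded.write(buffer);         // always the full 1024-byte buffer, junk tail included
    }
}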

xz-javadoc > What is the meaning of "Wrap it in BufferedInputStream if you need to read lots of data one byte at a time"

I am new to XZ for Java and am trying to use XZInputStream to read decompressed bytes, so I am reading the xz-javadoc (http://tukaani.org/xz/xz-javadoc/org/tukaani/xz/XZInputStream.html).
On the doc page, there is the following text in the description of the read() method:
Reading lots of data with read() from this input stream may be inefficient. Wrap it in BufferedInputStream if you need to read lots of data one byte at a time.
What is the meaning of this? Wrap this input stream in a BufferedInputStream?
What is the meaning of this? Wrap this input stream in a BufferedInputStream?
It means this:
InputStream is = new BufferedInputStream(new XZInputStream(new FileInputStream(file)));
int by;
while ((by = is.read()) != -1)
{
    // do stuff with "by"
}
is.close();
So although you're reading byte by byte, your input is buffered. There's also a longer explanation here.
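Alternatively, you can avoid the per-byte overhead altogether by reading into your own byte array with read(byte[]); a quick sketch (the file name and buffer size are arbitrary), which needs no BufferedInputStream:
import java.io.FileInputStream;
import java.io.InputStream;
import org.tukaani.xz.XZInputStream;

InputStream in = new XZInputStream(new FileInputStream("data.xz"));
byte[] buf = new byte[8192];
int n;
while ((n = in.read(buf)) != -1) {
    // process buf[0 .. n-1], the decompressed bytes returned by this call
}
in.close();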

Splits in hadoop with variable-length/non-delimited binary file

I've just started working on a Hadoop-based ingester for OpenStreetMap data. There are a few formats, but I've been targeting a Protocol Buffers based format (note - it's not pure pb).
It looks to me like it would be more efficient to pre-split the file into a sequence file, as opposed to handling the variable-length encoding in a custom record reader / input format, but I would like a sanity check.
The format is described in more detail at PBF Format Description
But basically it's a collection of [BlobHeader,Blob] blocks.
There's a Blob Header
message BlobHeader {
    required string type = 1;
    optional bytes indexdata = 2;
    required int32 datasize = 3;
}
And then the Blob (the size of which is defined by the datasize parameter in the header)
message Blob {
    optional bytes raw = 1; // No compression
    optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
    optional bytes zlib_data = 3;
    // optional bytes lzma_data = 4; // PROPOSED.
    // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
}
There's more structure once you get down into the blob, obviously, but I would handle that in the mapper. What I would like to do initially is have one blob per mapper (later it might be some multiple of blobs per mapper).
Some of the other input formats/record readers use a "big enough" split size and then seek backwards/forwards to a delimiter, but since there is no delimiter that would let me know the offset of the blobs/headers, and no index that points to them either, I can't see any way to get my split points without first streaming through the file.
Now I wouldn't need to actually read the entire file off of disk - I could start by reading the header, use that info to seek past the blob, set that as the first split point, then repeat. But that's about the only alternative to pre-splitting into a sequence file I can come up with.
Is there a better way to handle this - or if not, thoughts on the two suggestions?
Well, I went with parsing the binary file in the getSplits method - and since I'm skipping over 99% of the data it's plenty fast (~20 seconds for the 22GB planet-osm world file). Here's the getSplits method if anyone else stumbles along.
@Override
public List<InputSplit> getSplits(JobContext context) {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    FileSystem fs = null;
    Path file = OSMPBFInputFormat.getInputPaths(context)[0];
    FSDataInputStream in = null;
    try {
        fs = FileSystem.get(context.getConfiguration());
        in = fs.open(file);
        long pos = 0;
        while (in.available() > 0) {
            int len = in.readInt();           // 4-byte length of the BlobHeader
            byte[] blobHeader = new byte[len];
            in.readFully(blobHeader);         // readFully, so a short read() can't hand parseFrom a partial array
            BlobHeader h = BlobHeader.parseFrom(blobHeader);
            FileSplit split = new FileSplit(file, pos, len + h.getDatasize(), new String[] {});
            splits.add(split);
            pos += 4;                         // length field
            pos += len;                       // BlobHeader
            pos += h.getDatasize();           // Blob
            in.skip(h.getDatasize());         // jump to the next length field without reading the Blob
        }
    } catch (IOException e) {
        sLogger.error(e.getLocalizedMessage());
    } finally {
        if (in != null) { try { in.close(); } catch (Exception e) {} }
        if (fs != null) { try { fs.close(); } catch (Exception e) {} }
    }
    return splits;
}
Working fine so far - though I haven't ground-truthed the output yet. It's definitely faster than copying the pbf to HDFS, converting to a sequence file in a single mapper, then ingesting (copy time dominates). It's also ~20% faster than having an external program copy to a sequence file in HDFS and then running a mapper against HDFS (I scripted the latter).
So no complaints here.
Note that this generates a mapper for every block - which is ~23k mappers for the planet world file. I'm actually bundling up multiple blocks per split - just loop some number of times before a split gets added to the collection, as sketched below.
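A rough sketch of that bundling idea, following the structure of the getSplits body above; blocksPerSplit, blocksInSplit, splitStart, and splitLength are illustrative names, not the exact code in production:
int blocksPerSplit = 32;        // hypothetical tuning knob: how many [BlobHeader, Blob] blocks per split
int blocksInSplit = 0;
long splitStart = 0;
long splitLength = 0;

while (in.available() > 0) {
    int len = in.readInt();
    byte[] blobHeader = new byte[len];
    in.readFully(blobHeader);
    BlobHeader h = BlobHeader.parseFrom(blobHeader);

    splitLength += 4 + len + h.getDatasize();   // length field + BlobHeader + Blob
    blocksInSplit++;

    if (blocksInSplit == blocksPerSplit) {
        splits.add(new FileSplit(file, splitStart, splitLength, new String[] {}));
        splitStart += splitLength;
        splitLength = 0;
        blocksInSplit = 0;
    }
    in.skip(h.getDatasize());
}
if (blocksInSplit > 0) {        // don't lose a trailing partial bundle
    splits.add(new FileSplit(file, splitStart, splitLength, new String[] {}));
}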
For the BlobHeader I just compiled the protobuf .proto file from the OSM wiki link above. You can also pull it pre-generated from the OSM-binary artifact if you want - the Maven fragment is:
<dependency>
    <groupId>org.openstreetmap.osmosis</groupId>
    <artifactId>osmosis-osm-binary</artifactId>
    <version>0.43-RELEASE</version>
</dependency>

process a bunch of strings efficiently

I need to read data from a file in chunks of 128M, and then for each line I will do some processing. The naive way is to use split to convert the string into a collection of lines and then process each line, but maybe that is not efficient, since it creates a collection that only holds a temporary result, which could be costly. Is there a way with better performance?
The file is huge, so I kicked off several threads; each thread picks up a 128M chunk. In the following snippet rawString is a 128M chunk.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines = rawString.split("\n")
for (line <- lines) {
  ...
}
It'd be better to read text line by line:
import scala.io.Source
for(line <- Source.fromFile("file.txt").getLines()) {
...
}
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out; this solution captures everything delimited on both sides by \n.
Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with just handling Unix line encodings, you would want to
def lines(bs: Array[Byte]): Array[String] = {
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i < bs.length) {
    if (bs(i) == '\n') xs += i
    i += 1
  }
  val ix = xs.result
  val ss = new Array[String](0 max (ix.length - 1))
  i = 1
  while (i < ix.length) {
    ss(i - 1) = new String(bs, ix(i - 1) + 1, ix(i) - ix(i - 1) - 1)
    i += 1
  }
  ss
}
Of course this is rather long and messy code, but if you're really worried about performance, this sort of thing (heavy use of low-level operations on primitives) is the way to go. It also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data), since you don't need the full string representation around: with split you hold the byte array (1x), the rawString (about 2x, since Java chars are two bytes each), and the split lines (about another 2x), whereas here you only hold the byte array and the resulting line strings.

outofmemory exception when reading xml from file

I am working with Twitter API data, and after storing the stream results in text files I feed the data into a parser application. I planned for large data files, so I read the content in using the delimiter "]}" to separate the individual posts and avoid the potential for errors. A backup function reads the data using a buffer and then snips it into individual posts.
The problem is that in some cases a single post causes a memory exception. When I look at the individual post it does not seem particularly large, but the text contains foreign characters or some encoding that I guess causes the memory exception. I have not figured out whether it is exactly this yet, but I thought I would get some input or advice here...
myreader.TextFieldType = FileIO.FieldType.Delimited
myreader.SetDelimiters("]}}")
Dim currentRow As String()
Try
    While Not myreader.EndOfData
        Try
            currentRow = myreader.ReadFields()
            Dim currentField As String
            For Each currentField In currentRow
                data = data + currentField
                counter += 1
                If counter = 1000 Then
                    Dim pt As New parsingUtilities
                    If Not data = "" Then
                        pt.getNodes(data)
                        counter = 0
                    End If
                End If
            Next
        Catch ex As Exception
            If ex.Message.Contains("MemoryException") Then
                fileBKup()
            End If
        End Try
The other time a memory exception occurs is when I try to split the content into different posts:
Dim sampleResults() As String
Dim stringSplitter() As String = {"}}"}
' split the file content based on the closing entry tag
sampleResults = Nothing
Try
    sampleResults = post.Split(stringSplitter, StringSplitOptions.RemoveEmptyEntries)
Catch ex As Exception
    appLogs.constructLog(ex.Message.ToString, True, True)
    moveErrorFiles(form1.infile)
    Exit Sub
End Try
I expect the problem is the strings.
Strings are immutable, meaning that every time you think you're changing a string by doing this
data = data + currentField
you're actually creating another new string in memory. So if you do that thousands of times it can cause a problem because they mount up and you get an OutOfMemoryException.
If you're building up strings you should use a StringBuilder instead.
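The question is in VB.NET, where System.Text.StringBuilder plays the same role; here is the idea as a minimal Java sketch with made-up sample data:
// Repeated concatenation copies the whole accumulated string on every +=
String[] currentRow = { "field1", "field2", "field3" };
String data = "";
for (String currentField : currentRow) {
    data = data + currentField;            // allocates a new, ever-larger String each iteration
}

// A StringBuilder appends into a growable internal buffer instead
StringBuilder sb = new StringBuilder();
for (String currentField : currentRow) {
    sb.append(currentField);
}
String built = sb.toString();              // one final String only when you actually need it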
