OutOfMemory exception when reading XML from file - visual-studio-2010

I am working with Twitter API data. After storing the stream results in text files, I feed the data into a parser application. I planned for large data files, so I read the content in using the delimiter "]}" to separate the individual posts and avoid the potential for errors. A backup function reads the data using a buffer and then snips it into individual posts.
The problem is that in some cases a memory exception occurs for a single post. When I look at the individual post it does not seem especially large, but the text contains foreign characters or some encoding that I guess causes the memory exception. I have not figured out whether this is exactly the cause yet, but I thought I would get some input or advice here...
myreader.TextFieldType = FileIO.FieldType.Delimited
myreader.SetDelimiters("]}}")
Dim currentRow As String()
Try
    While Not myreader.EndOfData
        Try
            currentRow = myreader.ReadFields()
            Dim currentField As String
            For Each currentField In currentRow
                data = data + currentField
                counter += 1
                If counter = 1000 Then
                    Dim pt As New parsingUtilities
                    If Not data = "" Then
                        pt.getNodes(data)
                        counter = 0
                    End If
                End If
            Next
        Catch ex As Exception
            If ex.Message.Contains("MemoryException") Then
                fileBKup()
            End If
        End Try
The other time a memory exception occurs is when I try to split the content into different posts:
Dim sampleResults() As String
Dim stringSplitter() As String = {"}}"}
' split the file content based on the closing entry tag
sampleResults = Nothing
Try
    sampleResults = post.Split(stringSplitter, StringSplitOptions.RemoveEmptyEntries)
Catch ex As Exception
    appLogs.constructLog(ex.Message.ToString, True, True)
    moveErrorFiles(form1.infile)
    Exit Sub
End Try

I expect the problem is the strings.
Strings are immutable, meaning that every time you think you're changing a string by doing this
data = data + currentField
you're actually creating a new string in memory. If you do that thousands of times, the discarded copies mount up and you get an OutOfMemoryException.
If you're building up strings you should use a StringBuilder instead.
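For example, a minimal sketch of the loop above using System.Text.StringBuilder (parsingUtilities and getNodes are taken from the question; it assumes each 1000-field batch should be handed to getNodes and then discarded):
Dim sb As New System.Text.StringBuilder()
Dim counter As Integer = 0
While Not myreader.EndOfData
    For Each currentField As String In myreader.ReadFields()
        sb.Append(currentField)
        counter += 1
        If counter = 1000 Then
            Dim pt As New parsingUtilities
            If sb.Length > 0 Then
                pt.getNodes(sb.ToString())
                sb.Clear() ' discard the processed batch so memory does not keep growing
                counter = 0
            End If
        End If
    Next
End While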

Related

Huge memory consumption in Map Task in Spark

I have a lot of files that contain roughly 60.000.000 lines. All of my files use the format {timestamp}#{producer}#{messageId}#{data_bytes}\n
I walk through my files one by one and also want to build one output file per input file.
Because some of the lines depend on previous lines, I grouped them by their producer. Whenever a line depends on one or more previous lines, their producer is always the same.
After grouping up all of the lines, I give them to my Java parser.
The parser then will contain all parsed data objects in memory and output it as JSON afterwards.
To visualize how I think my Job is processed, I threw together the following "flow graph". Note that I did not visualize the groupByKey shuffling process.
My problems:
I expected Spark to split up the files, process the splits with separate tasks and save each task output to a "part"-file.
However, my tasks run out of memory and get killed by YARN before they can finish: Container killed by YARN for exceeding memory limits. 7.6 GB of 7.5 GB physical memory used
My Parser is throwing all parsed data objects into memory. I can't change the code of the Parser.
Please note that my code works for smaller files (for example two files with 600.000 lines each as the input to my Job)
My questions:
How can I make sure that Spark will create a result for every file-split in my map task? (Maybe they will if my tasks succeed but I will never see the output as of now.)
I thought that my map transformation val lineMap = lines.map ... (see Scala code below) produces a partitioned rdd. Thus I expect the values of the rdd to be split in some way before calling my second map task.
Furthermore, I thought that calling saveAsTextFile on this rdd lineMap would produce an output task that runs after each of my map tasks has finished. If my assumptions are correct, why do my executors still run out of memory? Is Spark doing several (too) big file splits and processing them concurrently, which leads to the Parser filling up the memory?
Is repartitioning the lineMap rdd to get more (smaller) inputs for my Parser a good idea?
Is there somewhere an additional reducer step which I am not aware of? Like results being aggregated before getting written to file or similar?
Scala code (I left out the irrelevant code parts):
def main(args: Array[String]) {
  val inputFilePath = args(0)
  val outputFilePath = args(1)
  val inputFiles = fs.listStatus(new Path(inputFilePath))
  inputFiles.foreach( filename => {
    processData(filename.getPath, ...)
  })
}

def processData(filePath: Path, ...) {
  val lines = sc.textFile(filePath.toString())
  val lineMap = lines.map(line => (line.split(" ")(1), line)).groupByKey()
  val parsedLines = lineMap.map{ case(key, values) => parseLinesByKey(key, values, config) }
  //each output should be saved separately
  parsedLines.saveAsTextFile(outputFilePath.toString() + "/" + filePath.getName)
}

def parseLinesByKey(key: String, values: Iterable[String], config : Config) = {
  val importer = new LogFileImporter(...)
  importer.parseData(values.toIterator.asJava, ...)
  //importer from now contains all parsed data objects in memory that could be parsed
  //from the given values.
  val jsonMapper = getJsonMapper(...)
  val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)
  (key, jsonStringData)
}
def parseLinesByKey(key: String, values: Iterable[String], config : Config) = {
val importer = new LogFileImporter(...)
importer.parseData(values.toIterator.asJava, ...)
//importer from now contains all parsed data objects in memory that could be parsed
//from the given values.
val jsonMapper = getJsonMapper(...)
val jsonStringData = jsonMapper.getValueFromString(importer.getDataObject)
(key, jsonStringData)
}
I fixed this by removing the groupByKey call and implementing a new FileInputFormat as well as a RecordReader to remove my limitation that lines depend on other lines. For now, I implemented it so that each split contains a 50.000-byte overhead from the previous split. This ensures that all lines that depend on previous lines can be parsed correctly.
I will now go ahead and still look through the last 50.000 bytes of the previous split, but only copy over lines that actually affect the parsing of the current split. That way, I minimize the overhead and still get a highly parallelizable task.
The following links pointed me in the right direction. Because the topic of FileInputFormat/RecordReader is quite complicated at first sight (it was for me, at least), it is worth reading through these articles to understand whether this approach is suitable for your problem:
https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
http://www.ae.be/blog-en/ingesting-data-spark-using-custom-hadoop-fileinputformat/
Relevant code parts from the ae.be article, in case the website ever goes down. The author (@Gurdt) uses this to detect whether a chat message contains an escaped line return (a line ending with "\") and appends the escaped lines together until an unescaped \n is found. This allows him to retrieve messages that span two or more lines. The code is written in Scala:
Usage
val conf = new Configuration(sparkContext.hadoopConfiguration)
val rdd = sparkContext.newAPIHadoopFile("data.txt", classOf[MyFileInputFormat],
classOf[LongWritable], classOf[Text], conf)
FileInputFormat
class MyFileInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext):
    RecordReader[LongWritable, Text] = new MyRecordReader()
}
RecordReader
class MyRecordReader() extends RecordReader[LongWritable, Text] {
  var start, end, pos = 0L
  var reader: LineReader = null
  var key = new LongWritable
  var value = new Text

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    // split position in data (start one byte earlier to detect if
    // the split starts in the middle of a previous record)
    val split = inputSplit.asInstanceOf[FileSplit]
    start = 0.max(split.getStart - 1)
    end = start + split.getLength

    // open a stream to the data, pointing to the start of the split
    val stream = split.getPath.getFileSystem(context.getConfiguration)
      .open(split.getPath)
    stream.seek(start)
    reader = new LineReader(stream, context.getConfiguration)

    // if the split starts at a newline, we want to start yet another byte
    // earlier to check if the newline was escaped or not
    val firstByte = stream.readByte().toInt
    if(firstByte == '\n')
      start = 0.max(start - 1)
    stream.seek(start)

    if(start != 0)
      skipRemainderFromPreviousSplit(reader)
  }
  def skipRemainderFromPreviousSplit(reader: LineReader): Unit = {
    var readAnotherLine = true
    while(readAnotherLine) {
      // read next line
      val buffer = new Text()
      start += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
      pos = start
      // detect if delimiter was escaped
      readAnotherLine = buffer.getLength >= 1 && // something was read
        buffer.charAt(buffer.getLength - 1) == '\\' && // newline was escaped
        pos <= end // seek head hasn't passed the split
    }
  }
  override def nextKeyValue(): Boolean = {
    key.set(pos)
    // read newlines until an unescaped newline is read
    var lastNewlineWasEscaped = false
    while (pos < end || lastNewlineWasEscaped) {
      // read next line
      val buffer = new Text
      pos += reader.readLine(buffer, Integer.MAX_VALUE, Integer.MAX_VALUE)
      // append newly read data to previous data if necessary
      value = if(lastNewlineWasEscaped) new Text(value + "\n" + buffer) else buffer
      // detect if delimiter was escaped
      lastNewlineWasEscaped = buffer.charAt(buffer.getLength - 1) == '\\'
      // let Spark know that a key-value pair is ready!
      if(!lastNewlineWasEscaped)
        return true
    }
    // end of split reached?
    return false
  }
}
Note: You might need to implement getCurrentKey, getCurrentValue, close and getProgress in your RecordReader as well.
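For reference, minimal versions of those overrides could look something like this (an untested sketch based on the fields used above):
  override def getCurrentKey(): LongWritable = key
  override def getCurrentValue(): Text = value
  override def getProgress(): Float =
    if (end == start) 1.0f else math.min(1.0f, (pos - start).toFloat / (end - start))
  override def close(): Unit = if (reader != null) reader.close()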

Parse word document in VBScript

I got a weird mission from a friend: to parse through a bunch of Word files and write certain parts of them to a text file for further processing.
VBScript is not my cup of tea, so I'm not sure how to fit the pieces together.
The documents look like this:
Header
A lot of not interesting text
Table
Header
More boring text
Table
I want to parse the documents and get all the headers and table contents out of them. I'm stepping through the document with
For Each wPara In wd.ActiveDocument.Paragraphs
And I think I know how to get the headers
If Left(wPara.Range.Style, Len("Heading")) = "Heading" Then
But I'm unsure of how to do the
Else if .. this paragraph belongs to a table..
So, any hint on how I could determine if a paragraph is part of a table or not would be nice.
Untested, because I have no access to MS Word right now.
Option Explicit

Dim FSO, Word, textfile, doc, para
' start Word instance, open doc ...
' start FileSystemObject instance, open textfile for output...

For Each para In doc.Paragraphs
    If IsHeading(para) Or IsInTable(para) Then
        SaveToFile textfile, para
    End If
Next

Function IsHeading(para)
    IsHeading = para.OutlineLevel < 10
End Function

Function IsInTable(para)
    Dim p, dummy
    IsInTable = False
    Set p = para.Parent
    ' at some point p and p.Parent will both be the Word Application object
    Do While Not (p Is p.Parent)
        ' dirty check: if p is a table, calling a table object method will work
        On Error Resume Next
        Set dummy = p.Cell(1, 1)
        If Err.Number = 0 Then
            IsInTable = True
            Exit Do
        Else
            Err.Clear
        End If
        On Error GoTo 0
        Set p = p.Parent
    Loop
End Function
Obviously SaveToFile is something you'd implement yourself.
Since "is in table" is naturally defined as "the object's parent is a table", this is a perfect situation to use recursion (deconstructed a little further):
Function IsInTable(para)
    IsInTable = IsTable(para.Parent)
    If Not (IsInTable Or para Is para.Parent) Then
        IsInTable = IsInTable(para.Parent)
    End If
End Function

Function IsTable(obj)
    Dim dummy
    On Error Resume Next
    Set dummy = obj.Cell(1, 1)
    IsTable = (Err.Number = 0)
    Err.Clear
    On Error GoTo 0
End Function

Lotusscript : How to sort the field values (an array of words) by their frequency

I would like to sort the field values (strings) by their frequency in LotusScript.
Has anyone an idea to solve this?
Thanks a lot.
Personally I would avoid LotusScript if you can help it. You are going to run into limitations that cannot be worked around.
Regardless of which route you do take, from a performance point of view it is better to have the View indexes do the work.
So you would create a view. The first column would be as follows.
Column Value: The field you want to check.
Sort: Ascending
Type: Categorized
After this you can access the data using the NotesViewNavigator. The related method call is getNextCategory. This will give you a view entry object which you can call ChildCount on to get totals.
For example (Disclaimer: Code written from memory, not guaranteed to run):
Dim sess As New NotesSession
Dim db As NotesDatabase
Dim vw As NotesView
Dim nav As NotesViewNavigator
Dim entryA As NotesViewEntry
Dim entryB As NotesViewEntry

Set db = sess.CurrentDatabase
Set vw = db.GetView("testView")
vw.AutoUpdate = False
Set nav = vw.CreateViewNav
Set entryA = nav.GetFirst

While Not (entryA Is Nothing)
    Set entryB = nav.GetNextCategory(entryA)
    If Not (entryB Is Nothing) Then
        ' Do your processing.
        ' entryB.ChildCount will give the total.
    End If
    Set entryA = entryB
Wend

vw.AutoUpdate = True
This way the heavy lifting (string sorting, counting) is handled by the View index. So you only need to process the final results.
To answer the OP's (old) question directly, the way to do this in LotusScript is both simple and easy:
Dim para As String
Dim words As Variant
Dim fq List As Long

'get the text to freq-count
para = doc.text '(or $ from somewhere)

'tidy up para by removing/replacing characters you don't want
para = Replace(para, Split(". , : ; - [ ] ()"), "")
words = Split(para) 'or Split(words, "delim") - default is space

ForAll w In words
    If IsElement(fq(w)) Then
        fq(w) = fq(w) + 1
    Else
        fq(w) = 1
    End If
End ForAll

'you now have a count of each individual word in the fq list
'to get the words out and the word frequencies (not sorted):
ForAll x In fq
    Print ListTag(x) & " = " & x
End ForAll
That's it. No issue with LotusScript - quick and easy (and lists can be massive). To get a sorted list, you would have to move the entries into an array and sort that, or move them to a field and let @Sort do the job somehow.
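For illustration, an untested sketch (reusing the fq list built above) that copies the list into two parallel arrays and bubble-sorts them by descending frequency:
Dim arrWords() As String
Dim arrCounts() As Long
Dim n As Long
Dim i As Long
Dim j As Long
Dim tmpWord As String
Dim tmpCount As Long

'copy the frequency list into two parallel arrays
n = 0
ForAll c In fq
    ReDim Preserve arrWords(n)
    ReDim Preserve arrCounts(n)
    arrWords(n) = ListTag(c)
    arrCounts(n) = c
    n = n + 1
End ForAll

'simple bubble sort, highest frequency first
For i = 0 To n - 2
    For j = 0 To n - 2 - i
        If arrCounts(j) < arrCounts(j + 1) Then
            tmpCount = arrCounts(j)
            arrCounts(j) = arrCounts(j + 1)
            arrCounts(j + 1) = tmpCount
            tmpWord = arrWords(j)
            arrWords(j) = arrWords(j + 1)
            arrWords(j + 1) = tmpWord
        End If
    Next
Next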

Delete a record in a Random File in Vb6

I'm trying to manage a quite small database with VB6 and Notepad.
I collect all the records, in Random mode, into the Notepad file (.dat).
I use the Get and Put commands for getting the records I stored and inserting new ones.
Now I'd like to have the possibility to DELETE a record I entered (maybe the latest).
I thought that:
Delete #FileNumber1, LatestRec, MyRec
was a good chance to get it.
LatestRec is the number of the latest record (ex: 5 means the 5th).
MyRec is my record variable.
Any suggestions?
The Delete statement you note above doesn't apply to random access files. Unfortunately, VB6 random access files provide no direct mechanism for record deletion, primarily because deletion leads to a rat's nest of other issues, such as file contraction (filling the empty space) and fragmentation (unused empty space), to name a couple. If you truly need to delete a record, about the only option you have is to copy all the other records to a temporary file, delete the old file, and rename the temp file to the "original" name - and, sadly, that's right from Microsoft.
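For illustration, a rough, untested sketch of that copy-and-rename approach (CompactDelete and its arguments are hypothetical names; it assumes a fixed-length record type like the SampleRecord shown in the edit below):
' Physically removes record number recordToDelete by copying every other
' record into a temp file and then swapping the files.
Sub CompactDelete(fileName As String, recordToDelete As Long)
    Dim rec As SampleRecord            ' same Type as in the edit below
    Dim srcFile As Integer, tmpFile As Integer
    Dim i As Long, total As Long, outPos As Long

    srcFile = FreeFile
    Open fileName For Random As srcFile Len = Len(rec)
    tmpFile = FreeFile
    Open fileName & ".tmp" For Random As tmpFile Len = Len(rec)

    total = LOF(srcFile) \ Len(rec)
    outPos = 1
    For i = 1 To total
        If i <> recordToDelete Then
            Get #srcFile, i, rec
            Put #tmpFile, outPos, rec
            outPos = outPos + 1
        End If
    Next i

    Close #srcFile
    Close #tmpFile
    Kill fileName                       ' delete the original file
    Name fileName & ".tmp" As fileName  ' rename the temp file to the original name
End Sub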
One thing you can do, which I'll admit up front isn't ideal, is to add a "deleted" field to your random-access file, defaulting to 0, but changing to true, 1, or some other relevant value, to indicate that the record is no longer valid.
You could even get into writing routines to reuse deleted records, but if you're getting into file semantics that much, you might be better served by considering a move of the application to a more robust database environment, such as SQL Server.
EDIT: Here is a very rough/crude/untested chunk of sample VB6 code that shows how you would delete/add a record with the "deleted field" concept I described above. Caveat: tweaks might be needed to get this code perfect, but the point is to illustrate the concept for you:
Type SampleRecord
    UserID As Long
    lastName As String * 25
    firstName As String * 25
    Deleted As Boolean
End Type
' This logically deletes a record by setting
' its "Deleted" member to True
Sub DeleteRecord(recordId As Long)
    Dim targetRecord As SampleRecord
    Dim fileNumber As Integer
    fileNumber = FreeFile
    Open "SampleFile" For Random As fileNumber Len = Len(targetRecord)
    Get #fileNumber, recordId, targetRecord
    targetRecord.Deleted = True
    Put #fileNumber, recordId, targetRecord
    Close #fileNumber
End Sub
Sub AddRecord(lastName As String, firstName As String)
    Dim newRecord As SampleRecord
    Dim fileNumber As Integer
    Dim newRecordPosition As Long
    newRecord.firstName = firstName
    newRecord.lastName = lastName
    newRecord.Deleted = False
    newRecord.UserID = 123 ' assume an algorithm for assigning this value
    fileNumber = FreeFile
    Open "SampleFile" For Random As fileNumber Len = Len(newRecord)
    newRecordPosition = LOF(fileNumber) / Len(newRecord) + 1
    Put #fileNumber, newRecordPosition, newRecord
    Close #fileNumber
End Sub

How to append binary values in VBScript

If I have two variables containing binary values, how do I append them together as one binary value? For example, if I use WMI to read two REG_BINARY values from the registry, I then want to be able to concatenate the values.
VBScript complains of a type mismatch when you try to join with the '&' operator.
A REG_BINARY value will be returned as an array of bytes. VBScript may hold an array of bytes in a variable and pass it either as a variant to another function or as a reference to an array of bytes, but VBScript itself can do nothing with the array.
You are going to need some other component to do some form of concatenation:
Function ConcatByteArrays(ra, rb)
    Dim oStream : Set oStream = CreateObject("ADODB.Stream")
    oStream.Open
    oStream.Type = 1 ' Binary
    oStream.Write ra
    oStream.Write rb
    oStream.Position = 0
    ConcatByteArrays = oStream.Read(LenB(ra) + LenB(rb))
    oStream.Close
End Function
In the above code I'm using the ADODB.Stream object which is ubiquitous on currently supported platforms.
If you actually have multiple arrays that you want to concatenate, you could use the following class:
Class ByteArrayBuilder
    Private moStream

    Sub Class_Initialize()
        Set moStream = CreateObject("ADODB.Stream")
        moStream.Open
        moStream.Type = 1
    End Sub

    Public Sub Append(rabyt)
        moStream.Write rabyt
    End Sub

    Public Property Get Length
        Length = moStream.Size
    End Property

    Public Function GetArray()
        moStream.Position = 0
        GetArray = moStream.Read(moStream.Size)
    End Function

    Sub Class_Terminate()
        moStream.Close
    End Sub
End Class
Call Append as many times as you have arrays and retrieve the resulting array with GetArray.
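For example, a quick usage sketch (valueA and valueB stand in for byte arrays you have already read):
Dim builder, combined
Set builder = New ByteArrayBuilder
builder.Append valueA   ' first REG_BINARY value
builder.Append valueB   ' second REG_BINARY value
combined = builder.GetArray()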
For the record, I wanted VBScript code for a large userbase as a logon script that has the least chance of failing. I like the ADO objects, but there are so many mysterious ways ADO can be broken, so I shy away from ADODB.Stream.
Instead, I was able to write conversion code to convert binary to hex encoded strings. Then, to write back to a REG_BINARY value, I convert it to an array of integers and give it to the SetBinaryValue WMI method.
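Purely as an illustration, the conversion helpers might look something like this (an untested sketch; BytesToHex and HexToBytes are hypothetical names, and it assumes the registry data is available as a variant array of integer byte values, 0-255):
' Convert an array of byte values (integers 0-255) into a hex string
Function BytesToHex(arrBytes)
    Dim i, s
    s = ""
    For i = LBound(arrBytes) To UBound(arrBytes)
        s = s & Right("0" & Hex(arrBytes(i)), 2)
    Next
    BytesToHex = s
End Function

' Convert a hex string back into an array of integers suitable for SetBinaryValue
Function HexToBytes(sHex)
    Dim i, arr()
    ReDim arr(Len(sHex) \ 2 - 1)
    For i = 0 To UBound(arr)
        arr(i) = CInt("&H" & Mid(sHex, i * 2 + 1, 2))
    Next
    HexToBytes = arr
End Function

' Concatenation then becomes plain string work, e.g.:
' combinedHex = BytesToHex(valueA) & BytesToHex(valueB)
' and the result of HexToBytes(combinedHex) is handed to the SetBinaryValue WMI method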
Note: WshShell can only handle REG_BINARY values containing 4 bytes, so it's unusable.
Thank you for the feedback.
Perhaps...
result = CStr(val1) & CStr(val2)
