Less memory-intensive way to count lines in a text file in VB - visual-studio-2010

I have text files that may contain up to 40,000,000 lines of data (records).
I need a memory-efficient way to count how many lines are in a file.
I have tried:
' File here is System.IO.File; ReadAllLines loads every line into a string
' array at once, which is what exhausts memory on very large files.
Dim this As Integer
this = File.ReadAllLines(txtInPath.Text).Length
MsgBox(this)
where txtInPath is a text box on my form.
This code throws an OutOfMemoryException.
Thanks
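A minimal sketch of a streaming alternative, assuming the goal is only the line count: read the file with a StreamReader so that only one line is held in memory at a time (File.ReadLines, new in .NET 4, could be used the same way).
' Counts lines without loading the whole file; txtInPath is the same
' text box referenced above. Long is used in case the count grows very large.
Dim lineCount As Long = 0
Using reader As New IO.StreamReader(txtInPath.Text)
    Do While reader.ReadLine() IsNot Nothing
        lineCount += 1
    Loop
End Using
MsgBox(lineCount.ToString())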

Related

How to omit useless whitespace-only Text elements from parsed XML using Nokogiri?

I'm working on an application that uses Nokogiri to ingest XML from a large number of small XML files, and a lot of memory is consumed by whitespace in those files, in the form of text nodes whose values are whitespace-only, e.g. " \n". Is there a convenient solution already available for this? I tried deleting all text elements containing only whitespace, but it added a considerable amount of runtime. I'm now thinking of processing the text read from the files and removing the whitespace (possibly by simply calling strip on all the lines) before passing the text into the parser. What do you think? Is there a better way? The software is sometimes run on appliances with limited RAM, so the memory savings would be helpful.
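A minimal sketch of the pre-processing idea described above, assuming the files are small enough to read in one go ('input.xml' is a placeholder name); Nokogiri's noblanks parse option is shown as an alternative that drops blank text nodes during parsing.
require 'nokogiri'

# Strip each line before handing the text to the parser so whitespace-only
# text nodes never make it into the document tree.
xml_text = File.readlines('input.xml').map(&:strip).join
doc = Nokogiri::XML(xml_text)

# Alternative: ask libxml2 to drop blank text nodes at parse time.
doc = Nokogiri::XML(File.read('input.xml')) { |config| config.noblanks }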

FB_FileGets vs FB_FileRead in TwinCAT

There are two similar functions for reading files in TwinCAT software from Beckhoff: FB_FileGets and FB_FileRead. I would appreciate it if someone could explain the differences between these functions and when to use each of them. Do they have the same prerequisites, and are they used the same way in programs? Which one is faster (for reading different file formats), and is there any other information that makes the difference clear for better programming?
The FB_FileGets reads the file line by line. So when you call it, you always get one line of the text file as a string. The maximum length of a line is 255 characters. Using this function block it's very easy to read all the lines of a file: no need for buffers and memory copying, if the 255-character line length limit is OK.
The FB_FileRead reads a given number of bytes from the file. So you can read files with, for example, 65000 characters on a single line.
I would use FB_FileGets in all cases where you know that the lines are shorter than 255 characters and you handle the data line by line. It's very simple to use. If you have no idea of the line sizes, need all the data at once, or the file is very big, I would use FB_FileRead.
I haven't tested, but I think that FB_FileRead is probably faster, as it just copies the bytes into a buffer. And you can read the whole file at once, not line by line.
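For reference, a rough sketch of the usual line-by-line pattern with FB_FileGets (a small state machine called cyclically); error handling is omitted and the parameter names are from the Tc2_System library as I recall them, so check them against your library version.
PROGRAM ReadFileLines
VAR
    fbOpen  : FB_FileOpen;      // Tc2_System
    fbGets  : FB_FileGets;
    fbClose : FB_FileClose;
    hFile   : UINT;
    sLine   : T_MaxString;      // one line, at most 255 characters
    nStep   : INT := 0;
END_VAR

CASE nStep OF
    0:  // trigger the open request
        fbOpen(sPathName := 'C:\Data\example.txt',
               nMode := FOPEN_MODEREAD OR FOPEN_MODETEXT,
               ePath := PATH_GENERIC, bExecute := TRUE, tTimeout := T#5S);
        nStep := 1;
    1:  // wait for the open to finish
        fbOpen(bExecute := FALSE);
        IF NOT fbOpen.bBusy AND NOT fbOpen.bError THEN
            hFile := fbOpen.hFile;
            nStep := 2;
        END_IF
    2:  // trigger reading the next line
        fbGets(hFile := hFile, bExecute := TRUE, tTimeout := T#5S);
        nStep := 3;
    3:  // wait for the line; loop back until end of file
        fbGets(bExecute := FALSE);
        IF NOT fbGets.bBusy AND NOT fbGets.bError THEN
            sLine := fbGets.sLine;              // process the line here
            IF fbGets.bEOF THEN nStep := 4; ELSE nStep := 2; END_IF
        END_IF
    4:  // trigger closing the file
        fbClose(hFile := hFile, bExecute := TRUE, tTimeout := T#5S);
        nStep := 5;
    5:  // wait for the close to finish
        fbClose(bExecute := FALSE);
        IF NOT fbClose.bBusy THEN nStep := 6; END_IF
END_CASE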

Accessing memory location using pseudo "file handle" in MATLAB

There are lots of questions about dealing with large data sets by avoiding loading the whole thing into memory. My question is kind of the opposite: I've written code that reads files line by line to avoid memory overflow problems. However, I've just been given access to a powerful workstation with several hundred GB of memory, removing that problem and making disk access the bottleneck.
Thing is, my code is written to access data files line by line using functions like fgetl. Is it possible for me to somehow replace the file handle f = fopen('datafile.txt') with something else that acts in exactly the same way with respect to functions reading from a file, but instead of reading from the disk just returns values stored in memory?
I'm thinking, for example, of having a large cell array with the contents of the file split by line, where fgetl just returns the next one. If I have to write my own wrapper for this, how can I go about doing it?
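A minimal sketch of that wrapper idea, assuming the whole file fits in memory; memlinereader and its getl field are made-up names, not an existing MATLAB API.
function reader = memlinereader(filename)
% MEMLINEREADER  Sketch of an in-memory stand-in for line-by-line file reads.
% Loads the whole file once and returns a struct whose getl field behaves
% like fgetl: each call returns the next line, or -1 after the last one.
    lines = regexp(fileread(filename), '\r?\n', 'split');
    idx = 0;
    reader.getl = @nextline;

    function tline = nextline()
        idx = idx + 1;
        if idx > numel(lines)
            tline = -1;              % fgetl returns -1 at end of file
        else
            tline = lines{idx};
        end
    end
end
Existing loops can then swap f = fopen('datafile.txt') / fgetl(f) for r = memlinereader('datafile.txt') / r.getl() with no other changes; anything that also relies on fseek or frewind would need its own wrapper.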

Write to specific part of preallocated file

I am currently trying to write to different locations of a pre-allocated file.
I first allocated my file like so:
File.open("file", "wb") { |file| file.truncate(size) }
size being the total size of the file.
Afterwards I receive data of size XX which fits into location Y of that file. Keep in mind this portion of the process is forked. Each fork has its own unique socket, opens its own file handle, writes to the file, then closes it, like so:
data = socket.read(256)
File.open("file", "wb") do |output|
  output.seek(location * 256, IO::SEEK_SET)  # location is the index of this 256-byte chunk
  output.write(data)
end
This should in turn allow the forked processes to open a file handle, seek to the correct location (if location is 2 and the data size is 256, the write covers bytes 512 to 768), and write the chunk of data they received.
What it actually does, though, is beyond my comprehension. I monitor the file's size as it is being populated and it bounces around between different sizes, which should not be changing.
When analyzing the file with a hex editor, the top of the file, where the data header should be, is filled with null bytes (likewise with about a quarter of the file). However, if I limit the forked processes to writing only one chunk each and then exiting, the writes are fine and land at their proper locations.
I have done some other testing, such as dumping the chunk locations and the start positions of the data, and my equation for seeking to the correct location in the file seems to be correct as well.
Is there something I am missing here, or is there another way to have multiple threads/processes open a file handle to a file, seek to a specific location, and then write a chunk of data?
I have also attempted to use flock on the file, and it yields the same results, likewise when using the main process instead of forking.
I have tested the same application, but rather than opening/closing the file handle each time I need to write data in rapid succession (transferring close to 70 MB/s), I created one file handle per forked process and kept it open. This fixed the problem, resulting in a 1:1 duplication of the file with matching checksums.
So the question is: why does opening/writing/closing file handles to a file in rapid succession cause this behavior?
It's your file mode.
File.open("file", "wb")
"wb" means "upon opening, truncate the file to zero length".
I suggest "r+b", which means "reading and writing, no truncation". Read more about available modes here: http://ruby-doc.org/core-2.2.2/IO.html#method-c-new
BTW, "b" in those modes means "binary" (as opposed to default "t" (text))

Is it possible to know the serial number of the block of input data on which map function is currently working?

I am a novice in Hadoop and here I have the following questions:
(1) As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
(2) Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
Any help would be appreciated.
As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
No. A block (a split, to be precise) gets processed by only one mapper.
Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
You can get some valuable info, like the file containing the split's data and the position of the first byte in the file to process, with the help of the FileSplit class. You might find it helpful.
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
You can do that by extending the FileInputFormat class. To begin with, you could do this:
In your getSplits() method, maintain a counter. As you read the file line by line, keep tokenizing each line. Collect each token and increase the counter by 1. Once the counter reaches the desired value, emit the data read up to that point as one split. Reset the counter and start on the next split.
HTH
If you define a small max split size you can actually have multiple mappers processing a single HDFS block (say a 32 MB max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of the block (the same records).
MapContext.getInputSplit() can usually be cast to a FileSplit, and then you have the Path, offset, and length of the file/block being processed.
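A minimal sketch of both points using the org.apache.hadoop.mapreduce API; the class name SplitInfoExample and the 32 MB figure are only illustrative.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitInfoExample {

    // Mapper that reports which part of which file its split covers.
    public static class SplitInfoMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // For text input the split is a FileSplit: file, byte offset, length.
            FileSplit split = (FileSplit) context.getInputSplit();
            Path file = split.getPath();
            long start = split.getStart();     // first byte of this split in the file
            long length = split.getLength();   // number of bytes in this split
            System.out.printf("Processing %s [%d, %d)%n", file, start, start + length);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SplitInfoExample.class);
        job.setMapperClass(SplitInfoMapper.class);
        job.setNumReduceTasks(0);
        // Cap the split size so a single HDFS block yields several splits
        // (e.g. 32 MB splits on a 128 MB block give four mappers per block).
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}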
If your input files are true text files, then you can use the method suggested by Tariq, but note this is highly inefficient for larger data sources, as the job client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format) and write the records to disk with a fixed number of words per file (using MultipleOutputs to get a file per fixed number of words, but this again is inefficient). Maybe if you shared the use case for why you want a fixed number of words, we could better understand your needs and come up with alternatives.
