How to view an extremely large file on a server: Freebase data dump

I'm trying to look at the Freebase data dump, which is stored on a server that I access through ssh. The trouble is that I don't know how to view it in a way that doesn't take forever or make things freeze or crash. I had been trying to view it with nano, and it exhibits precisely the behaviour just described.
The operating system is Darwin.
How can I examine this data?

Basically, you can use the more or less command to scroll through the file. If you know which lines of the file you are interested in, say lines 3000 to 3999, you can print just those with sed -n '3000,3999p' your_file_name.
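For example, a couple of minimal invocations (freebase-dump.tsv is just a placeholder for whatever your dump file is actually called):

less -N freebase-dump.tsv               # page through with line numbers; /pattern searches, q quits
sed -n '3000,3999p' freebase-dump.tsv   # print only lines 3000-3999 to the terminal

less only reads the part of the file it needs to display, so it stays responsive even on multi-gigabyte files, which is exactly where editors like nano fall over.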

Related

Two programs accessing one file

New to this forum - looks great!
I have some Processing code that periodically reads data wirelessly from remote devices and writes that data as bytes to a file, e.g. data.dat. I want to write an Objective C program on my Mac Mini using Xcode to read this file, parse the data, and act on the data if data values indicate a problem. My question is: can my two different programs access the same file asynchronously without a problem? If this is a problem can you suggest a technique that will allow these operations?
Thanks,
Kevin H.
Multiple processes can read from the same file at the same time without any problem. A process can also read from a file while another writes to it, although you'll have to take care to ensure that you read in any new data that was written. Multiple processes should not write to the same file at the same time, though. The OS will let you do it, but the ordering of the data will be undefined and you'll likely overwrite data; in general, you're going to have a bad time if you do that. So you should take care to ensure that only one process writes to the file at a time.
The simplest way to protect a file so that only one process can write to it at a time is with the C function flock(), although that function is admittedly a bit rudimentary and may or may not suit your use case.
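As a rough illustration (not from the original answer; data.dat is the file name the question mentions, but the payload and error handling here are made up), the writing side of such a scheme might look something like this:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    /* Open (or create) the shared data file for appending. */
    int fd = open("data.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Block until we hold an exclusive lock; other cooperating
       processes calling flock() on this file will wait here. */
    if (flock(fd, LOCK_EX) == -1) {
        perror("flock");
        close(fd);
        return EXIT_FAILURE;
    }

    const char bytes[] = { 0x01, 0x02, 0x03 };   /* placeholder payload */
    if (write(fd, bytes, sizeof bytes) == -1)
        perror("write");

    /* Release the lock so the reader (or another writer) can proceed. */
    flock(fd, LOCK_UN);
    close(fd);
    return EXIT_SUCCESS;
}

Keep in mind that flock() provides advisory locking: it only coordinates processes that also call flock() on the same file, so all cooperating writers have to use it. A reader that wants a consistent snapshot can likewise take a shared lock with LOCK_SH.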

Storing and processing large XML files with Heroku?

I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?
Most of the time we prefer to parse the entire file after pulling it into memory, because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.
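To make the SAX approach concrete, here is a minimal sketch of a handler; the element name 'record' and the file name 'big.xml' are invented placeholders, not anything from the real feed:

require 'nokogiri'

# Collects the text of each <record> element without ever loading the
# whole document into memory.
class RecordHandler < Nokogiri::XML::SAX::Document
  def initialize
    @inside_record = false
    @buffer = ''
  end

  def start_element(name, attrs = [])
    if name == 'record'
      @inside_record = true
      @buffer = ''
    end
  end

  def characters(string)
    @buffer << string if @inside_record
  end

  def end_element(name)
    if name == 'record'
      @inside_record = false
      puts @buffer.strip       # act on one record at a time here
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(RecordHandler.new)
parser.parse_file('big.xml')

The parser calls start_element, characters, and end_element as it streams through the document, so memory use stays roughly constant no matter how large the file is.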
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2, or use a hosted provider such as https://hostedftp.com/.

In Haskell, in Windows 7, can I read a file that is already write-locked by another program?

I have a 3rd party program that is running continuously, and is logging events in a text file. I want to write a small Haskell program that reads the text file while the other program is running and warns me when certain events are logged.
I looked around and it seems as if, for Windows, readFile allows either a single writer or multiple readers - it does not allow a single writer along with readers.
Is there some way for me to work around this constraint on locks? The log file is only appended, and I am only looking for specific rows in the file, so I really don't mind if I don't get the most recent write, as I am interested in eventual consistency and will keep checking the file.
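Nothing in the thread above settles this, but one common workaround is to bypass readFile and open the file yourself with share flags that tolerate a concurrent writer. A rough sketch, assuming the Win32 package's System.Win32 bindings are available; the log path and the "ERROR" marker below are made-up placeholders:

import System.Win32
import Data.Bits ((.|.))
import Foreign.Marshal.Alloc (allocaBytes)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Read the whole log via the Win32 API, sharing the file with the
-- program that is still writing to it.
readSharedLog :: FilePath -> IO B.ByteString
readSharedLog path = do
  h <- createFile path gENERIC_READ
         (fILE_SHARE_READ .|. fILE_SHARE_WRITE)   -- let the logger keep writing
         Nothing oPEN_EXISTING fILE_ATTRIBUTE_NORMAL Nothing
  contents <- readAll h []
  closeHandle h
  return contents
  where
    chunkSize = 65536
    readAll h acc = allocaBytes chunkSize $ \buf -> do
      n <- win32_ReadFile h buf (fromIntegral chunkSize) Nothing
      if n == 0
        then return (B.concat (reverse acc))
        else do
          bs <- B.packCStringLen (buf, fromIntegral n)
          readAll h (bs : acc)

main :: IO ()
main = do
  logText <- readSharedLog "C:\\logs\\events.log"            -- placeholder path
  mapM_ BC.putStrLn                                          -- print the rows of interest
        (filter (BC.isInfixOf (BC.pack "ERROR")) (BC.lines logText))

Note that this only helps if the logging program itself opened the file with read sharing enabled; if it opened the log exclusively, no other process can read it until it closes the handle. Polling something like readSharedLog periodically and scanning the result matches the eventual-consistency approach described in the question.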

How to get the data displayed in SoftICE's windows into a text file

I can't copy the information in SoftICE to disk/file. I am aware of IceExt, but every time I execute the command to dump the screen to disk (such as "!DumpScreen \??\c:\test.raw") it crashes my system entirely. When I try to copy with the mouse, the cursor only makes it possible to copy one line. I have already read through the SoftICE manual. I just need a way to retrieve data from SoftICE. Any help would be appreciated. I am using XP Professional.
It turns out that no add-ons are required to accomplish this. Use the command "u cs:eip L 1000" from within SoftICE. You will then see a duplicate of the data from SoftICE's screen displayed in the command window.
The u command 'unassembles' code at the address cs:eip (the current EIP), and the L specifies a length of 1000 bytes; you might need more than 1000, so adjust accordingly. Once you've done this, exit SoftICE and select File / Save SoftICE History As from Symbol Loader; with any luck, the resulting file will contain your code dump. You may have to use F10 to step through in order to get the data into SoftICE's history log.
Using this method, I successfully dumped the entire code window and data window. The main drawback is that the resulting text will contain a lot of noise (unnecessary data). I haven't figured out how to get around this. This is an adaptation of Woodmann's method.

Programmatically empty out large text file when in use by another process

I am running a batch job that has been going for many, many hours, and the log file it is generating is increasing in size very fast; I am worried about disk space.
Is there any way through the command line, or otherwise, that I could hollow out that text file (set its contents back to nothing) with the utility still having a handle on the file?
I do not wish to stop the job and am only looking to free up disk space via this file.
I'm on Vista, 64-bit.
Thanks for the help,
Well, it depends on how the job actually works. If it's a good little boy and it pipes its log info out to stdout or stderr, you could redirect the output to a program that you write, which could then write the contents out to disk and manage the sizes.
If you have access to the job's code, you could essentially tell it to close the file after each write (hopefully it's an append) operation, and then you would have a timeslice in which you could actually wipe the file.
If you don't have either one, it's going to be a bit tough. If someone has an open handle to the file, there's not much you can do, IMO, without asking the developer of the application to find you a better solution, or just plain clearing out disk space.
It depends on how it is writing the log file. You cannot just delete the start of the file, because the file handle has an offset of where to write next. It will still be writing at 100 MB into the file even though you just deleted the first 50 MB.
You could try renaming the file and hoping it just creates a new one. This is usually how rolling logs work.
You can use a rolling log class, which will wrap the regular file class but silently seek back to the beginning of the file when the file reaches a maximum designated size.
It is a very simple wrapper; either write it yourself or try to find an implementation online.
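For what it's worth, here is a minimal sketch of that wrapper in C. The names are invented, and this only helps if you can rebuild the job around it; it will not fix the batch that is already running:

#include <stdio.h>

/* A tiny "rolling" log: once max_bytes have been written, seek back
   to the start and overwrite old entries instead of growing the file. */
typedef struct {
    FILE *fp;
    long  max_bytes;
    long  written;
} rolling_log;

int rolling_log_open(rolling_log *log, const char *path, long max_bytes)
{
    log->fp = fopen(path, "w+");   /* create/truncate, read+write */
    if (log->fp == NULL)
        return -1;
    log->max_bytes = max_bytes;
    log->written = 0;
    return 0;
}

void rolling_log_write(rolling_log *log, const char *line)
{
    if (log->written >= log->max_bytes) {
        fseek(log->fp, 0L, SEEK_SET);   /* wrap around to the start */
        log->written = 0;
    }
    log->written += fprintf(log->fp, "%s\n", line);
    fflush(log->fp);
}

void rolling_log_close(rolling_log *log)
{
    fclose(log->fp);
}

The trade-off is that once the log wraps, old entries are overwritten in place, so the file is no longer a straight chronological record.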

Resources