Text editor to open big (giant, huge, large) text files [closed] - windows

I mean 100+ MB big; such text files can push the envelope of editors.
I need to look through a large XML file, but cannot if the editor is buggy.
Any suggestions?

Free read-only viewers:
Large Text File Viewer (Windows) – Fully customizable theming (colors, fonts, word wrap, tab size). Supports horizontal and vertical split views, file following, and regex search. Very fast, simple, and has a small executable.
klogg (Windows, macOS, Linux) – A maintained fork of glogg. Its main feature is regular expression search. It supports monitoring file changes (like tail), bookmarks, highlighting patterns using different colors, and has serious optimizations built in. But from a UI standpoint, it's rather minimal.
LogExpert (Windows) – "A GUI replacement for tail." It's really a log file analyzer, not a large file viewer, and in one test it required 10 seconds and 700 MB of RAM to load a 250 MB file. But its killer features are the columnizer (parse logs that are in CSV, JSONL, etc. and display in a spreadsheet format) and the highlighter (show lines with certain words in certain colors). Also supports file following, tabs, multifiles, bookmarks, search, plugins, and external tools.
Lister (Windows) – Very small and minimalist. It's one executable, barely 500 KB, but it still supports searching (with regexes), printing, a hex editor mode, and settings.
Free editors:
Your regular editor or IDE. Modern editors can handle surprisingly large files. In particular, Vim (Windows, macOS, Linux), Emacs (Windows, macOS, Linux), Notepad++ (Windows), Sublime Text (Windows, macOS, Linux), and VS Code (Windows, macOS, Linux) support large (~4 GB) files, assuming you have the RAM.
Large File Editor (Windows) – Opens and edits TB+ files, supports Unicode, uses little memory, has XML-specific features, and includes a binary mode.
GigaEdit (Windows) – Supports searching, character statistics, and font customization. But it's buggy – with large files, it only allows overwriting characters, not inserting them; it doesn't respect LF as a line terminator, only CRLF; and it's slow.
Builtin programs (no installation required):
less (macOS, Linux) – The traditional Unix command-line pager tool. Lets you view text files of practically any size. Can be installed on Windows, too.
Notepad (Windows) – Decent with large files, especially with word wrap turned off.
MORE (Windows) – This refers to the Windows MORE, not the Unix more. A console program that allows you to view a file, one screen at a time.
Web viewers:
readfileonline.com – An HTML5 large file viewer. Supports search.
Paid editors/viewers:
010 Editor (Windows, macOS, Linux) – Opens giant (as large as 50 GB) files.
SlickEdit (Windows, macOS, Linux) – Opens large files.
UltraEdit (Windows, macOS, Linux) – Opens files of more than 6 GB, but the configuration must be changed for this to be practical: Menu » Advanced » Configuration » File Handling » Temporary Files » Open file without temp file...
EmEditor (Windows) – Handles very large text files nicely (officially up to 248 GB, but as much as 900 GB according to one report).
BssEditor (Windows) – Handles large files and very long lines. Doesn't require installation. Free for non-commercial use.
loxx (Windows) – Supports file following, highlighting, line numbers, huge files, regex, multiple files and views, and much more. The free version cannot process regexes, filter files, synchronize timestamps, or save changed files.

Tips and tricks
less
Why are you using editors to just look at a (large) file?
Under *nix or Cygwin, just use less. (There is a famous saying – "less is more, more or less" – because less replaced the earlier Unix command more, adding the ability to scroll back up.) Searching and navigating under less is very similar to Vim, but there is no swap file and little RAM is used.
There is a Win32 port of GNU less. See the "less" section of the answer above.
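For quick reference, a few of the stock less commands that cover most large-file work (these are standard bindings; check your version's man page):

/pattern      search forward for a regex; n / N jump to the next / previous match
G             jump to the end of the file; g jumps back to the start
F             follow the file as it grows, like tail -f (interrupt with Ctrl+C)
less -N file  show line numbers (can be slow on huge files, since it must count lines)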
Perl
Perl is good for quick scripts, and its .. (range flip-flop) operator makes for a nice selection mechanism to limit the crud you have to wade through.
For example:
$ perl -n -e 'print if ( 1000000 .. 2000000)' humongo.txt | less
This will extract everything from line 1 million to line 2 million, and allow you to sift the output manually in less.
Another example:
$ perl -n -e 'print if ( /regex one/ .. /regex two/)' humongo.txt | less
This starts printing when the first regular expression finds the start of an interesting block, and stops when the second regular expression finds its end. It may find multiple blocks. Sift the output...
logparser
This is another useful tool you can use. To quote the Wikipedia article:
logparser is a flexible command line utility that was initially written by Gabriele Giuseppini, a Microsoft employee, to automate tests for IIS logging. It was intended for use with the Windows operating system, and was included with the IIS 6.0 Resource Kit Tools. The default behavior of logparser works like a "data processing pipeline", by taking an SQL expression on the command line, and outputting the lines containing matches for the SQL expression.
Microsoft describes Logparser as a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows operating system such as the Event Log, the Registry, the file system, and Active Directory. The results of the input query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart.
Example usage:
C:\>logparser.exe -i:textline -o:tsv "select Index, Text from 'c:\path\to\file.log' where Index > 1000 and Index < 2000"
C:\>logparser.exe -i:textline -o:tsv "select Index, Text from 'c:\path\to\file.log' where Text like '%pattern%'"
The relativity of sizes
100 MB isn't too big. 3 GB is getting kind of big. I used to work at a print-and-mail facility that created about 2% of U.S. first-class mail. One of the systems for which I was the tech lead accounted for 15+% of those pieces of mail. We had some big files to debug here and there.
And more...
Feel free to add more tools and information here. This answer is community wiki for a reason! We all need more advice on dealing with large amounts of data...

Related

Converting text file with spaces between CR & LF

I've never seen this line ending before and I am trying to load the file into a database.
The lines all have a fixed width. After the CSV text containing the data (whose length varies line by line), there is a CR followed by multiple spaces and then an LF; the spaces pad each line out to the fixed width. For example (line-ending bytes shown in hex):
Line1,Data 1,Data 2,Data 3,4,5       [0D 20 20 20 20 20 0A]
Line2,Data 11,Data 21,Data 31,41,51  [0D 20 20 20 0A]
Line3,Data12,Data22,Data 32,42,52    [0D 20 20 20 20 0A]
I am about to handle this with a stream reader/writer in C#, but 40 of these files come in each month, and if there is a way to convert them all at once rather than one line at a time, I would rather do that.
Any thoughts?
Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.
When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
But you may not even have to go through a preprocessing step at all. You didn't indicate what database you will be using or how you plan to load the data, but many databases do include utilities to process fixed-length records. In the past, fixed-format files came with every imaginable kind of bizarre format (and contained all kinds of stuff that had to be stripped out or converted). As a result those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.
If your database doesn't have good bulk import tools, there are many open-source or freeware utilities already written that do pretty much exactly what you need; you can find them on GitHub and elsewhere (for example, the npm replace package, or zzzprojects' FindAndReplace).
For a quick-and-dirty approach that lets you preview all the changes while you develop a more robust solution, many text editors can find and replace across multiple files. I've used that approach successfully in the past; Notepad++'s Find in Files dialog, for example, lets you use regexes to remove or change whatever you like in all files matching defined criteria.
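If you have Perl around (see the Perl section in the first answer above), a one-liner can batch-convert a whole monthly drop in place. A minimal sketch, assuming the padding is always CR + spaces + LF, that the files match *.txt (a hypothetical glob; adjust to your naming), and that plain LF endings are acceptable to your loader (on Windows you may also need to account for Perl's CRLF I/O layers and shell quoting):

$ perl -i.bak -pe 's/\r +\n$/\n/' *.txt

The -i.bak switch edits each file in place and keeps the original with a .bak extension.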

How can an executable be this small in file size?

I've been generating payloads with Metasploit and experimenting with the different templates, and one of the templates you can use for your payload is exe-small. The payload I've been generating is windows/meterpreter/reverse_tcp; using the normal exe template it has a file size of around 72 KB, but exe-small outputs a payload of about 2.4 KB. Why is this? And how could I apply this to my own programming?
The smallest possible PE file is just 97 bytes, and it does nothing (it just returns).
The smallest runnable executable today is 133 bytes, because Windows requires kernel32 to be loaded; executing a PE file with no imports is not possible.
Even at that size, it can already download a payload from the Internet by specifying a UNC path in the import table.
To achieve such a small executable, you have to:
implement it in assembler, mainly to get rid of the C runtime
decrease the file alignment, which is 1024 by default
remove the DOS stub that prints the message "This program cannot be run in DOS mode"
merge some of the PE parts into the MZ header
remove the data directory
The full description is available in a larger research blog post called TinyPE.
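As a gentler starting point than hand-rolled assembly, recent Microsoft linkers expose switches for some of these steps. A hedged sketch (the flags are real link.exe options, but tiny.obj is a hypothetical CRT-free object file with a main entry point; exact savings vary by toolchain version, and sub-page /ALIGN values draw a linker warning that the image may not run everywhere):

C:\>link tiny.obj kernel32.lib /NODEFAULTLIB /ENTRY:main /SUBSYSTEM:CONSOLE /ALIGN:16 /FILEALIGN:16 /MERGE:.rdata=.text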
For EXEs this small, most of the space is typically taken up by the icon. An icon resource usually contains several sizes and color depths, which you can strip out if you don't mind an "old, rusty" icon, or no icon at all.
Signing the EXE also adds about 4 KB.
As an example of a small EXE, see Never10 by GRC. There is a details page that highlights the above points in its last paragraph:
https://www.grc.com/never10/details.htm
A final note: I'm a bit annoyed that “Never10” is as large as it is at 85 kbyte. The digital signature increases the application's size by 4k, but the high-resolution and high-color icons Microsoft now requires takes up 56k! So without all that annoying overhead, the app would be a respectable 25k. And, yes, of course I wrote it in assembly language.
Disclaimer: I am not affiliated with grc in any way.
There is little need for an executable to be big, except when it contains what I call code spam: code not actually critical to the functionality of the program. This is valid for other files too. Compare a manually written HTML page with one produced by FrontPage; that's code spam.
I remember my good old DOS files that were all a few KB in size and yet performed practically any task needed in the OS. One of my .exe files (actually a .com) was only 20 bytes.
Think of it this way: just as a large majority of the files in a Windows installation can be removed with the OS still functioning perfectly, the same holds for .exe files: large parts of the code are useless, serve purposes unrelated to the program's objective, or are added intentionally (see below).
The peak of this aberration is the code added nowadays to the .exe files of some games that use advanced copy protection, which can make the files as large as dozens of megabytes; the code actually needed to run the game is often under 10% of the total.
A file size of 72 KB, as in your example, can be quite sufficient to do practically anything on a Windows OS.
To apply this to your programming, i.e., to make very small .exe files, keep things simple. Don't add unnecessary code just for the looks of it, or because you think you might use that part of the program at some point.

Is dual mode executable possible?

A bit of history... I have 3 systems that I spend time on: a DOS 6.22 system, a Windows 95 system, and a modern Windows 7 (64-bit) system. When I upgraded to Win7-64, some of my favorite command-line utilities stopped working, so I decided to rewrite them myself. The only 2 compilers I have are Borland Turbo C++ 3.0 and Visual Studio 2008, and they worked fine for building 2 versions: a DOS 16-bit one and a Windows 7 32-bit one (I could have built 64-bit too, I guess). The problem came with my Win95 system. The DOS version works fine there, but since I had spent the time to support LFNs in the Win7 build, I wanted them on my Win95 system too. So, after a lot of research, I found and purchased Visual Studio 6 (the last one with Win95 support, according to my research), copied the code over (having to rewrite sections, of course), and it compiled just fine, and works :)
The problem occurred the next time I had to boot my Win95 system into DOS mode. The program stopped working (of course), because Win95 wasn't loaded. I don't really want to have 2 copies of the program installed (needing 2 different file names), so I was hoping there was a way to link the 2 versions together into one file. If I execute it in DOS, instead of saying it requires Windows, it would just jump to the DOS section of the program. That way it would be a single program, with LFN support if Win95 is loaded, and without if it isn't. Since the Win95 version also works fine in Win7-64, it would probably also produce a single version that works on all 3 systems (which would be an added bonus).
I did some web searches and couldn't find anything germane to what I'm looking for, so I have no idea if it is even possible. I may have to get yet another compiler, but considering how old it would have to be, I could probably afford it. My web searches did turn up information leading me to believe that it "should" be possible, though. It would just require a different exe header than the one Windows compilers put in. It may require that I rewrite the DOS version for 32-bit and use a DOS extender (for protected mode, assuming I can't find a way to include it in the file itself). That would be acceptable (though not ideal). I would much rather have 16-bit code in the DOS section and 32-bit code in the Windows section, for the most compatibility.
Does anyone have any information about something like this? If you could just point me in the right direction it would be greatly appreciated.
I don't know if it has been continued in Windows 7 executables, but back in Win95 the executable (EXE) format actually had two entry points: one "normal" one that DOS would find, and a second one that Windows would use. The DOS entry point was usually a very simple default that would just print "This is a Windows program" and exit. You can actually override this default and have the linker use your own code; however, it is very limited.
What I'd recommend doing is adding logic to your DOS 6.22 version (e.g., "sed") that checks the OS level and, if it meets the right criteria, passes the parameters along to a second executable (e.g., "sedx") that uses features from the "newer" OS.
The Visual Studio 6 documentation describes the /STUB linker option; simply point it at the DOS version of your program.
I don't have VS6 handy, so I can't be too specific, but in the project settings GUI, there should be an "additional options" setting in the linker section.
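For reference, the switch itself is simple. A sketch with hypothetical file names (mytool-dos.exe being the 16-bit DOS build, mytool.obj the Windows objects):

C:\>LINK /STUB:mytool-dos.exe /OUT:mytool.exe mytool.obj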
Well, the answer is the /STUB option in the linker you use for your Windows code. Some additional information for anyone who finds this question later: it took several days of web searches to establish that there doesn't appear to be another answer to my particular problem.
/STUB requires that the DOS-mode executable have a header of at least 40h (64) bytes. After fighting with multiple compilers that do give you a header of the right size (Borland Turbo C++ won't) and not being able to convert my code, I had to get sneaky/fancy. BTW, Visual C 1.52c (the last Visual C that supports DOS) will make a correct header, as will Open Watcom.
If you are faced with the same issue I was (the compiler you used won't make the correct-size header, and your code is too compiler-specific to convert easily), you can do what I ended up doing. I used Open Watcom to write a tiny ("Hello World") Windows program, using my exe with the short (Borland-created) header as the stub; Open Watcom adjusts the header automatically. I then used a hex editor to read the header information to get the ending address of the stub, and a partial file copier to copy only that part of the program to a file I named "stub.exe" (stripping off the Windows code). Using the same hex editor, I zeroed out the PE pointer in the header. I now had a working DOS exe that would also work as a stub. I took that stub to my Windows compiler and linked it in. It works great, all features fully realized :)
FYI, here is the information needed to strip the Windows portion and zero the PE pointer:
The first byte is offset 0 (of course, but some people may not realize that and think it's byte 1). Also remember that most hex editors, by their very name, show numbers in hexadecimal.
Offsets 2 and 3 hold the number of bytes in the last block of the DOS portion of the file, in low-byte/high-byte order; that is, offset 2 is the low byte and 3 the high byte. Swap them and you get a number from 0-511 (0-1FFh). 0 means the entire block of 512 (200h) bytes is used.
Offsets 4 and 5 (again low/high) hold the number of 512-byte (200h) blocks in the DOS portion. Remember to swap the bytes, and that the last block may be only partially used. So: subtract one from the block count, multiply by 512 (200h), and add the number from offsets 2-3; that is how many bytes are in the DOS portion. Since offsets start at 0, subtract 1, and you now know to copy only bytes 0 through that offset to your stub exe (a worked example of this arithmetic follows the list).
Offsets 60-61 (3C-3Dh) hold the pointer to the start of the PE (Portable Executable) portion of the code, the part that Windows jumps to. It should sit just past the end of the DOS portion (mine was padded with a few zeroes). The value isn't important at this point, since we are turning those bytes into 0's anyway (the PE portion has been stripped), but you can use it to confirm that you picked the correct "end of DOS" offset.
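A worked example of the arithmetic: if offsets 4-5 hold 03 00 (3 blocks) and offsets 2-3 hold 7A 01 (017Ah = 378 bytes), the DOS portion is (3 - 1) * 512 + 378 = 1402 bytes, so you copy offsets 0 through 1401.

The manual steps can also be scripted. Here is a minimal, untested Perl sketch of the same procedure (it assumes a well-formed MZ header of at least 64 bytes, and it zeroes all 4 bytes of the PE pointer field; file names are passed on the command line):

use strict;
use warnings;

my ($in, $out) = @ARGV;
open my $fh, '<:raw', $in or die "open $in: $!";
read($fh, my $hdr, 64) == 64 or die "short MZ header";

# offsets 2-3: bytes used in the last 512-byte block; offsets 4-5: block count
my ($e_cblp, $e_cp) = unpack 'x2 v v', $hdr;
my $size = ($e_cp - 1) * 512 + ($e_cblp == 0 ? 512 : $e_cblp);

seek $fh, 0, 0 or die "seek: $!";
read($fh, my $stub, $size) == $size or die "short read";
substr($stub, 0x3C, 4) = "\0\0\0\0";   # zero the PE pointer (e_lfanew)

open my $oh, '>:raw', $out or die "open $out: $!";
print $oh $stub;

Run it as: perl strip_stub.pl mytool.exe stub.exe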
The tools I used are:
Open WatCom at http://www.openwatcom.org/index.php/Main_Page
and
Part Copy at http://www.virtualobjectives.com.au/utilitiesprogs/partcopy.htm
I have no idea where to find the hex editor I used. It's CEdit, a DOS program I really like but have been unable to find on the net; I have to run it in DOSBox, as Win7 won't run it. There are probably other compilers that do the same thing, and probably tons of partial file copiers available; these are just the tools I used.

Does Vim read the whole file into memory?

When I open a file, does vim read it all into memory? I experienced significant slowdowns when I open large files. Or is it busy computing something (e.g., line number)?
Disabling features like syntax highlighting, cursorline, line numbers and so on will greatly reduce the load and make Vim snappier in these cases.
There's even a plugin to handle that for you and a Vim tip for some background info.
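If you'd rather flip the switches by hand than install a plugin, these stock Vim settings cover the usual suspects:

:syntax off
:set nonumber nocursorline noswapfile
:set undolevels=-1

(undolevels=-1 disables undo entirely, which also saves memory; only do that for throwaway viewing.)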
Yes, Vim loads the whole file into memory.
If you run htop in another pane you can watch it happen in real time.
If you don't have enough memory available, it will start hitting the swap which makes it take even longer.
You can disable plugins and heavy features (like syntax highlighting) to get better performance (-u NONE tells Vim to skip your ~/.vimrc and plugins):
vim -u NONE mysqldump.sql
However, unless you really need to edit the file, I prefer to just use a different tool. I typically use less. I mostly search the files in vim with / and less supports that just fine.
less mysqldump.sql

Is there a good reason to limit Windows filename extensions to three characters?

I am creating a utility that will store data in a flat file in a specific binary format.
I want the filename extension to be specific to my application. Is there any reason, other than the old 8.3 filename limit, for restricting the extension to 3 characters, and if not, what is the limit? Can I have myfilename.MyExtensionSoHandsOffEverybodyElse?
This is a holdover from the old Windows 3.x/MS-DOS days. Today, there are plenty of file names with extensions longer than 3 characters.
If I remember correctly, Windows XP had a maximum path length (including the file name) of 260 characters (MAX_PATH).
In my experience, having seen a few non-3-character extensions, I'd say it's a matter of tradition, and you're perfectly welcome to use myfilename.MyExtensionSoHandsOffEverybodyElse.
The only good reason for doing this is if you plan to support Windows 9x. If you're only targeting XP and later, as with most projects nowadays, the 8.3 limit is irrelevant.
In fact, Windows itself uses long extensions in Vista and later; for example, .search-ms for saved searches.
No, there isn't a good reason to limit the extension to 3 characters. However, a shorter, descriptive name is better if a user has to remember it. For example, most people know what a .html or .doc file would contain.
As long as you make a reasonable attempt to avoid naming collisions with major software, there shouldn't be an issue. The corollary is that unless you create some insanely long extension that will only ever be unique to your software (and even then, it's not guaranteed), the extension you choose will always be subject to collisions from other people's software when they choose their data file extensions, just as you are doing here.
