Is MATLAB performance better than bash scripts for text manipulation?

I have a program (for simulating some physical systems) that produces very big (more than 1 GB) text files as output. I have to extract the desired results (numbers) from these text files. Currently I've written bash scripts for this purpose that, for example, search the text files for some expression and write the number that follows it to a separate file, e.g.:
grep $EXP | awk '{print $14}' > tmp
Unfortunately, these bash scripts are very time-consuming for large input text files. So I am considering using another language to search the text files. As there are many scripts that I would have to rewrite, would writing the scripts in MATLAB give me a considerable speed-up?
As a side question, are there better options than MATLAB (perhaps compiled languages like C)?
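For what it's worth, much of the cost of a pipeline like the one above comes from running a separate grep and awk process and making a full pass over the file for every expression; collapsing the two into a single awk pass is often worth trying before switching languages. A minimal sketch, assuming the field number from the example and a placeholder input file name:
awk -v pat="$EXP" '$0 ~ pat {print $14}' results.txt > tmp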

Related

xzgrep and other Compressed Pattern Matching Tools

I'm experimenting with compressed pattern matching utilities, or more specifically with searching for text patterns within LZW-compressed text files.
I'm wondering whether the xzgrep Linux utility applies a special algorithm to achieve that, or whether it is just equivalent to regular decompression followed by grepping, e.g.
uncompress -c LARGE_TEXT_FILE.Z | grep "My Pattern"
Also, are there any other utilities or programs that apply compressed pattern-matching algorithms to LZW-compressed text files (like http://tandem.bu.edu/papers/let.sleeping.files.lie.jcss.1996.pdf), preferably with the source code available?
Thank you!
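(For context: as far as I can tell, xzgrep from xz-utils is a wrapper script rather than a smarter algorithm; it decompresses with xz and pipes the stream into grep, so the two commands below should behave the same. The file name is just a placeholder.)
xzgrep "My Pattern" LARGE_TEXT_FILE.xz
xz -dcq LARGE_TEXT_FILE.xz | grep "My Pattern"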

Merge two text files

I'm using Windows and Notepad++ to work with text files. I have two files that I have to merge side by side, line by line, for my data analysis.
Here is the example:
file1.txt
Abcdefghijk
abcdefghijk
file2.txt
123456
123456
then the output I want is like this:
Abcdefghijk123456
abcdefghijk123456
in a new output file. Does anybody here know how to do this?
Your question is answered here by TheMadTechnician. Using PowerShell, you read both source files (1 and 2) as arrays of lines and then run a simple loop: "merge line x from file1 with line x from file2 for as long as file1 has lines".
Unfortunately, it's impossible with pure cmd.
@riki, you could also write a batch program to do this programmatically. There should probably be no limit on the number of lines.
It may depend on how many lines you have in each file. I suggest simply copy-pasting if it is fewer than 50 lines.
Otherwise, use a more powerful language like Python, C, or PHP, and run it before performing the data analysis.
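For reference, if a Unix-style toolchain is available on Windows (for example the Cygwin/Babun environments mentioned further down this page), the paste utility does this line-by-line merge in one command; a minimal sketch using the example file names above (merged.txt is a placeholder output name):
paste -d '\0' file1.txt file2.txt > merged.txt
Here -d '\0' is the POSIX spelling for an empty delimiter, so each pair of lines is joined with nothing in between.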
There is a free utility you can download and run on your computer, called txtcollector. I read about it here. I used it because I had a whole folder of files to concatenate. It was a breeze. The only slight imperfection I noticed was that I couldn't paste in the path to the specific folder in the first step (choosing the folder where the files to be concatenated were). However, I could do this when choosing where to save the result.

How to extract specific lines from a huge data file?

I have a very large data file, about 32 GB. The file is made up of about 130k lines, each of which mainly contains numbers but also a few characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this sort of operation, or do I need to write some code?
If there is no direct way of doing that, I was thinking that a possible approach is to first extract small blocks of the original file (so that each block contains one or more of the lines to extract) and then use a standard editor to find the lines within each block. In this case, the question would be: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split which works very well with large files, but it can only split by size, not by line.
Install[1] Babun Shell (or Cygwin, but I recommend Babun), and then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun really just means unzipping it somewhere, so you don't need Administrator rights on the server.
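To make that concrete, a hedged sed sketch (the line numbers and file names below are placeholders, not values from the question): -n suppresses normal output, each address prints one wanted line, and {p;q} on the last wanted line makes sed quit instead of reading the rest of the 32 GB file.
sed -n -e '1000p' -e '250000p' -e '1234567{p;q}' bigfile.txt > extracted.txt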

Text editor to view giant log files

As I have not yet set up a log-rotation solution, I have a 3 GB (38-million line) log file in which I need to find some information from a certain date. As using cat | grep is horribly slow, and using my current editor (Large Text File Viewer) is equally slow, I was wondering: is there any text editor that works well for viewing >35-million-line log files? I could just use the cat | grep solution and leave it running overnight, but with millions of errors to sort through there has to be a better way.
You might want to try using grep by itself:
grep 2011-04-09 logfile.txt
instead of needlessly using cat:
cat logfile.txt | grep 2011-04-09
When dealing with large amounts of data, this can make a difference.
Interesting reading is a Usenet posting from last year: why GNU grep is fast.
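A related tip, though your mileage may vary: GNU grep is often noticeably faster for plain literal searches when it runs in the C locale rather than a UTF-8 one, so on a 3 GB file it can be worth forcing:
LC_ALL=C grep -F 2011-04-09 logfile.txt
(-F treats the date as a fixed string rather than a regular expression, which avoids regex overhead as well.)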
Since you are on Windows, you should really try multiple implementations of grep. Not all implementations of grep are equal. There are some truly awful implementations.
It is not necessary to use cat: Grep can read directly from the log file, unless it is locked against being shared with readers.
grep pattern logfile > tmpfile
should do the trick. Then you can use almost any editor to examine the selected records, assuming the pattern is reasonably selective.
I don't think you're going to get any faster than grep alone (as others have noted, you don't need the cat).
I personally find "more" and "less" useful (for smaller files). The reason is that sometimes a pattern will get you into the general vicinity of where you want to be (e.g. a date and time), and then you can scroll through the file from that point.
The "/" key is the search command for regular expressions in more.

Very long lines - Windows grep character-based (not line-based) tool

Is there a grep-like tool for Windows where I can restrict the number of characters it outputs from a line in which a searched-for pattern is found?
One of the upstream software systems generates huge text files which we then feed as the input to our system.
Sometimes the input files get corrupted and I need to do a quick textual search to find out whether particular bits of data are missing or not. To make it even worse, the input file is just one very, very long line of text, and when I use grep or findstr, the result of the search is a huge chunk of text.
I am wondering: how can I limit the number of characters grep shows before/after the pattern I searched for?
Cheers.
Two things spring to my mind:
Call grep with the --only-matching option so that only the text that matches is emitted. Depending on your regex, this may or may not help.
Write a very simple executable, call it trunc, which reads from stdin line by line and outputs the first n characters to stdout. Then simply pipe the output from grep to trunc.
The latter option is relatively simple. If you didn't want to go the whole hog and produce a proper native exe, it could quite easily be achieved with a Perl/Python/Ruby etc. script.
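As a rough sketch of both suggestions with standard GNU tools (available on Windows via Cygwin or similar), where PATTERN, the context width, and the file name are placeholders:
grep -oE ".{0,40}PATTERN.{0,40}" hugefile.txt
grep "PATTERN" hugefile.txt | cut -c 1-200
The first command keeps up to 40 characters of context on either side of each match; the second is the trunc idea done with cut, keeping the first 200 characters of each matching line (which, with a single-line input, only shows the start of that line).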
