Can I access Gnuplot "stats" command's sorted output?

According to the Gnuplot stats help,
...
Data values are sorted to find the median and quartile boundaries.
...
I am wondering if I can access this sorted data? For example, can I access the "10th smallest" value, not only the minimum value? (My viewers feel that the absolute minimum may be an outlier and that the 10th value from the extreme may be more representative of the situation.)
Some of this analysis would be easy in Perl, but I haven't found a Perl module that gives full-featured access to Gnuplot. So I'm trying to do the analysis in Gnuplot.

No, you cannot access those sorted data values beyond the ones that are stored in variables. Run show variables all after executing stats to see which are saved.
In your case you must use an external tool to achieve this. An easy variant is to use some Unix command-line tools, which you can call from gnuplot with the system function:
tenth = system('sort -n data.dat | head -n 10 | tail -1')

Related

How can I sort a very large log file, too large to load into main memory?

Given that I have a very large log file, too large to be loaded into main memory, and I want to sort it somehow, what would be the most recommended sorting technique and algorithm?
If you have GNU sort, use it. It knows how to deal with large files. For details, see the answers to How to sort big files on Unix SE. You will of course need sufficient free disk space.
If you are looking for an algorithm, you could apply merge sort.
Essentially you split your data into smaller chunks and sort each chunk. Then you take two sorted chunks and merge them (this can be done in a streaming fashion: just take the smaller of the two current values and advance in the chunk it came from); this results in a bigger chunk. Keep doing this until you have merged all chunks.
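For illustration, here is a minimal sketch of that streaming two-way merge in Python, assuming the chunks are plain text files already sorted line by line and that lines compare correctly as strings (e.g. ISO timestamps); the file names are placeholders:
def merge_two_chunks(path_a, path_b, out_path):
    # Stream-merge two sorted chunk files: hold one line from each in memory,
    # always write the smaller, and advance only in the file it came from.
    with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
        line_a, line_b = a.readline(), b.readline()
        while line_a and line_b:
            if line_a <= line_b:
                out.write(line_a)
                line_a = a.readline()
            else:
                out.write(line_b)
                line_b = b.readline()
        # One chunk is exhausted; copy whatever is left of the other.
        out.write(line_a or "")
        out.writelines(a)
        out.write(line_b or "")
        out.writelines(b)
Repeating this pairwise (or using a k-way merge such as heapq.merge) over all chunks gives the fully sorted file while never holding more than a handful of lines in memory.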
This depends on the OS. If you are on Linux/Unix, you can use the sed command to print specific lines:
sed -n -e 120p /var/log/syslog
This prints line 120 of the syslog file. You could also use head:
head -n 15 /var/log/syslog
This prints the first 15 lines of the syslog file. There are also grep, tail, etc. for viewing portions of a large file. More details on these and others:
http://www.thegeekstuff.com/2009/08/10-awesome-examples-for-viewing-huge-log-files-in-unix
For Windows, there is Large Text File Viewer

UNIX sort unique guaranteed to give first

I like to use the -u option of the UNIX sort utility to get unique lines based on a particular subset of columns, e.g. sort -u -k1,1 -k4,4
I have looked extensively in UNIX sort and GNU sort documentation, and I cannot find any guarantee that the -u option will return the first instance (like the uniq utility) after sorting by the specified keys.
It seems to work as desired in practice (sort by keys, then give first instance of each unique key combination), but I was hoping for some kind of guarantee in the documentation to put my paranoia at ease.
Does anyone know of such a guarantee?
I think the code for such a small utility is likely the only place you'll find such a guarantee. You can also enable debugging output (GNU sort has a --debug option) if you'd like to see how it is working.
If you look through the code for GNU sort, it appears that the uniqueness testing happens after all sorting is completed, when it is iterating through the sorted contents of the temporary files created by the sorting process.
This happens in a while loop that compares the previously saved line (savedline) with smallest, the next smallest input line that would be output.
Thus, my opinion is that it processes your sorting criteria first and only then uniques the output, as the last step.
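For what it's worth, the behavior being relied on, sort by the keys and then keep the first input line of each distinct key, is easy to state explicitly if you'd rather not depend on the utility's internals. A minimal Python sketch of those semantics (the field indices 0 and 3 mirror the -k1,1 -k4,4 example, and whitespace-delimited fields are an assumption):
from itertools import groupby

def first_per_key(lines, key_fields=(0, 3)):
    # Build the sort key from the chosen fields of each line.
    keyfn = lambda line: tuple(line.split()[i] for i in key_fields)
    # sorted() is stable, so lines with equal keys keep their input order;
    # taking the first line of each group therefore yields the first instance.
    ordered = sorted(lines, key=keyfn)
    return [next(group) for _, group in groupby(ordered, key=keyfn)]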

Searching a file non-sequentially

Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? Or for example, a search between line l1 and line l2 without having to go through the first l1-1 lines?
You can use tail -n +N file | grep to begin a grep at a given line offset.
You can combine head with tail to search over just a fixed range.
However, this still must scan the file for end of line characters.
In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.
For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at about 1 KB per page, so 1000 pages would still be only 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.
You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any kind of Unicode translation.
If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index for the search. For code there are tools like Exuberant Ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpora, but that's probably overkill for you. You could even load the file into a database like PostgreSQL that supports full-text search and have it build an index for you.
Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an IO throughput issue; you could see that yourself by simply moving the file to a temporary RAM disk that you create. That removes all potential IO. If that's still not fast enough for you, then you're going to have to pursue an entirely different solution.
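One shape such a custom tool could take is a line-offset index. Here is a minimal Python sketch, which pays the sequential scan once so that any later range l1..l2 can be read with a single seek (the function names are illustrative, and the index is simply kept in memory here):
def build_line_index(path):
    # One sequential pass that records the byte offset where every line starts.
    offsets = [0]
    with open(path, "rb") as f:
        for line in f:
            offsets.append(offsets[-1] + len(line))
    return offsets  # offsets[i] is where line i (0-based) begins

def read_lines(path, offsets, l1, l2):
    # Return lines l1..l2 (1-based, inclusive) without reading the lines before l1.
    with open(path, "rb") as f:
        f.seek(offsets[l1 - 1])
        return f.read(offsets[l2] - offsets[l1 - 1]).splitlines()
Building the index costs one full pass, so this only pays off when the same file is searched repeatedly, which is exactly the reverse-index advice above.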
If your lines are fixed length, you can use dd to read a particular section of the file:
dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line> | other_commands
Note that dd will read from disk using the block size specified for input (bs). That might be slow, but it can be batched by reading a group of lines at once so that you pull at least 4 KB from disk per read. In that case you want to look at the skip_bytes and count_bytes flags to be able to start and end at lines that are not multiples of your block size.
Another interesting option is the output block size obs, which could benefit from being either the same as the input block size or a single line.
The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)

Merge sorted files efficiently

I need to merge about 30 gzipped text files, each about 10-15 GB compressed, each containing multi-line records, and each sorted by the same key. The files reside on an NFS share, I have access to them from several nodes, and each node has its own /tmp filesystem. What would be the fastest way to go about it?
Some possible solutions:
A. Leave it all to sort -m. To do that, I need to pass every input file through awk/sed/grep to collapse each record into a line and extract a key that would be understood by sort. So I would get something like
sort -m -k [...] <(preprocess file1) [...] <(preprocess filen) | postprocess
B. Look into Python's heapq.merge.
C. Write my own C code to do this. I could merge the files in small batches, make an OMP thread for each input file, one for the output, and one actually doing the merging in RAM, etc.
Options for all of the above:
D. Merge a few files at a time, in a tournament.
E. Use several nodes for this, copying intermediate results in between the nodes.
What would you recommend? I don't have much experience with secondary-storage efficiency, and as such, I find it hard to estimate how any of these would perform.
If you go for your solution B involving heapq.merge, then you will be delighted to know that Python 3.5 will add a key parameter to heapq.merge(), according to docs.python.org, bugs.python.org and github.com. This will be a great solution to your problem.
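For illustration, a minimal sketch of that approach, assuming each record has already been collapsed to one line whose sort key is the first tab-separated field (the same preprocessing option A describes); the file names and key layout are assumptions:
import gzip
import heapq

def merge_sorted_gzips(in_paths, out_path):
    # Open every compressed input as a lazy line iterator; only one line
    # per file is held in memory at any time.
    streams = [gzip.open(p, "rt") for p in in_paths]
    try:
        with open(out_path, "w") as out:
            # Python 3.5+: heapq.merge accepts a key, so the inputs only need
            # to be sorted by that key (here, the first tab-separated field).
            out.writelines(heapq.merge(*streams, key=lambda line: line.split("\t", 1)[0]))
    finally:
        for s in streams:
            s.close()
Because everything is streamed, the 30 inputs go straight to the output without a large memory footprint; the decompression cost is essentially the same as in option A.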

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, vol. 2 of The Art of Computer Programming, Donald E. Knuth, Addison-Wesley, ISBN 0-201-03822-6 (v. 2)
A standard merge sort approach will work. The common schema, sketched in code below, is
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
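A minimal Python sketch of that schema for the file above, sorting on the 4th column; the chunk size is a placeholder to be tuned to available memory, the file handling is illustrative, and the csv module takes care of the quoted, variable-length fields:
import csv
import heapq
import itertools
import tempfile

CHUNK_ROWS = 1_000_000      # rows held in memory at once; tune to the machine
SORT_COLUMN = 3             # 0-based index of BirthDate

def sort_big_csv(in_path, out_path):
    chunk_files = []
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        # Steps 1 and 2: cut the input into memory-sized parts, sort each part,
        # and spill it to a temporary file.
        while True:
            rows = list(itertools.islice(reader, CHUNK_ROWS))
            if not rows:
                break
            rows.sort(key=lambda r: r[SORT_COLUMN])
            tmp = tempfile.TemporaryFile(mode="w+", newline="")
            csv.writer(tmp).writerows(rows)
            tmp.seek(0)
            chunk_files.append(tmp)
    # Step 3: k-way merge of the sorted parts, streaming one row at a time.
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        parts = [csv.reader(f) for f in chunk_files]
        writer.writerows(heapq.merge(*parts, key=lambda r: r[SORT_COLUMN]))
    for f in chunk_files:
        f.close()
Sorting the chunks in memory can additionally be spread over the machine's several processors (e.g. with multiprocessing), which is typically where the extra cores help most, since the final merge is largely IO-bound.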
No need to sort. Read the file ALL.CSV and append each line you read to a per-day file, like 19841231.CSV. Then, for each existing day with data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible by, for example, processing the original file more than once or by recording the days actually occurring in the file ALL.CSV.
So a line containing "1985/02/28" would be appended to the file 19850228.CSV, and the file 19850228.CSV would be appended to NEW.CSV right after the file 19850227.CSV. The numerical order avoids the use of any sort algorithm, albeit it could torture the file system.
In reality the file ALL.CSV could first be split into one file per, for example, year: 1984.CSV, 1985.CSV, and so on.
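For illustration, a minimal Python sketch of that bucketing idea, following the answer's ALL.CSV / YYYYMMDD.CSV / NEW.CSV naming and assuming the date is the 4th comma-separated field as in the example:
import csv

def bucket_by_date(in_path="ALL.CSV", out_path="NEW.CSV"):
    buckets = {}    # date string "YYYYMMDD" -> (open file, csv writer)
    # Pass 1: append every row to its per-day bucket file.
    # Note: this keeps one file open per distinct day, so a very long date
    # range may hit the open-file limit (split by year first, as suggested above).
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            day = row[3].replace("/", "")       # "1984/01/01" -> "19840101"
            if day not in buckets:
                f = open(day + ".CSV", "w", newline="")
                buckets[day] = (f, csv.writer(f))
            buckets[day][1].writerow(row)
    for f, _ in buckets.values():
        f.close()
    # Pass 2: concatenate the buckets in numerical (i.e. chronological) order.
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerow(header)
        for day in sorted(buckets):
            with open(day + ".CSV", newline="") as bucket:
                dst.write(bucket.read())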
