Convert multiple .txt files to multiple ASCII files fast - possible in MATLAB? - macos

I have over 120 .txt files (all named like s1.txt, s2.txt, ..., s120.txt) that I need to convert to .ascii files to use in MATLAB.
My .txt files (comma-delimited) look like the following:
20080102,43.0300,3,9.493,569.567,34174.027,34174027
20080102,43.0600,3,9.498,569.897,34193.801,34193801
In MATLAB I wish to use code similar to the following:
for i = svec;
%# where svec = [1 2 13 15] some random number between 1 and 120.
eval(['load %mydirectory', eval(['s',int2str(i)]),'.ascii']);
end;
If I am not mistaken, I can't use the above command with .txt files, and therefore I must use ASCII files.
Since I have a lot of files to convert and they are large, is there a quick way to convert them all via MATLAB, or perhaps a good conversion tool available for Mac on the web? Would anyone have a better suggestion than the code above?

Adding to nrz's answer:
I'm not sure what you want to do exactly, but know that you can open any file in MATLAB, either as text (ASCII) or in binary mode. The latter can be achieved using fread.
As a side note, you also asked for a better suggestion for your code.
Well, what did you try to achieve with the two eval invocations? Why not call the commands directly? Do this instead:
for i = svec
load (['%mydirectory\s', int2str(i), '.txt'], '-ascii');
end
I also took the liberty to add a backslash that I think you had omitted.
In most cases, you'd be better off without using eval. Check the alternatives...

Can you show an example file? Not every text file is valid for the load command. If your file is not in a valid format, changing the extension from .txt to .ascii doesn't help at all. In that case, the data must either be converted to a format that load accepts or loaded into MATLAB by some other means, e.g. using fscanf or xlsread. The file structure is needed for both approaches.
See also load command in matlab loading blank file.
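If the files really are plain comma-delimited numbers, as in the sample above, one quick route is to batch-convert them outside MATLAB first, since load -ascii expects whitespace-separated numeric rows. A minimal sketch in Python (the directory name echoes the question; everything else is an assumption):

import glob

# Convert every s*.txt from comma-delimited to space-delimited .ascii:
# MATLAB's `load -ascii` wants whitespace-separated numeric rows.
for txt_path in glob.glob('mydirectory/s*.txt'):
    ascii_path = txt_path[:-4] + '.ascii'      # strip '.txt', append '.ascii'
    with open(txt_path) as src, open(ascii_path, 'w') as dst:
        for line in src:
            dst.write(line.replace(',', ' '))

After that, load('mydirectory/s1.ascii', '-ascii') should work unchanged.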

A slightly cleaner way:
for i = 1:120
    fname = fullfile('mydirectory', sprintf('s%d.txt', i));
    X = load(fname, '-ascii');   % process X here before the next iteration
end

Related

Maximum number of input files for Ghostscript (gs)

I simply want to combine multiple EPS files into one big file using the gs command.
The command works flawlessly, except when I specify more than 20 input files: somehow the command ignores input files starting from the 21st.
Has anyone experienced the same behavior? Is there a cap on the number of input files specified anywhere?
I looked through the site and couldn't find one.
Sample command:
gs -o output.eps -sDEVICE=eps2write file1.eps file2.eps .... file21.eps
Thank you.
Edit: added sample command
Almost certainly you have simply reached the maximum length of the command line for your operating system. You can use the @filename syntax for Ghostscript to supply a file containing the command-line arguments instead.
https://www.ghostscript.com/doc/current/Use.htm#Input_control
Note that the EPS files will not be placed appropriately using that command, and this does not actually combine EPS files, it creates a new EPS file whose marking content should be the same as the input(s).
If you actually want to combine the EPS files, it's easy enough, but it will require a small amount of programming to parse the EPS file headers and produce appropriate scale/translate operations, as well as stripping off any bitmap previews (which will also happen when you run them through Ghostscript).
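For the command-line-length problem itself, here is a minimal sketch of the @-file approach (the file names and output device come from the question; driving it from Python is just one convenient option):

import subprocess

# Hypothetical list of inputs; in practice collect them with glob.
eps_files = ['file%d.eps' % i for i in range(1, 22)]

# Write the input file names into a response file, one per line.
with open('args.txt', 'w') as f:
    f.write('\n'.join(eps_files))

# Ghostscript reads @args.txt as if its contents appeared on the
# command line, sidestepping the OS limit on command-line length.
subprocess.call(['gs', '-o', 'output.eps', '-sDEVICE=eps2write', '@args.txt'])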

Use Python to generate input to XeTeX

I am wondering what a good way is to use Python to generate an arithmetic worksheet consisting only of 2-digit numbers squared. Specifically, I want to be able to call a Python program that asks me for parameters such as the range of numbers it can draw from to square and the number of questions to generate. Once that is done, the program will generate the numbers and then automatically open a .tex file (already with preamble and customizations) and basically write a block like this for each question:
\begin{exer}
$n^2$
\end{exer}
%%%%%Solution%%%%%%
\begin{solution}
$n^2=n^2$
\end{solution}
for some integer n.
Once it is done writing the .tex file, it will run XeTeX and output a PDF ready to use and print. Any help or suggestions? Python is preferred but not mandatory.
Actually, your problem is simple enough that it doesn't require any special magic. However, I would suggest that you don't append your generated content to the file that already contains your preamble; good practice is to leave that file untouched and include it (you can either copy it at generation time or use TeX's \include).
Now, for the generation itself: Python's string formatting is your friend here. Use the example you've given as a template and write the filled-in product to the file on every iteration. Don't forget to escape "{" braces by doubling them, as they are special characters for the formatter.
At the end, you can use the subprocess module to launch XeTeX; depending on your needs, call() is enough, or use Popen().
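Putting that together, a minimal sketch (the exer and solution environments come from your snippet; preamble.tex, the file names, and the parameters are assumptions, and the preamble is pulled in via \input here):

import random
import subprocess

def make_worksheet(tex_path, n_questions, lo=10, hi=99):
    # One block per question; literal braces are doubled so that
    # str.format() passes them through unchanged.
    template = ('\\begin{{exer}}\n'
                '$ {n}^2 $\n'
                '\\end{{exer}}\n'
                '%%%%%Solution%%%%%%\n'
                '\\begin{{solution}}\n'
                '$ {n}^2 = {sq} $\n'
                '\\end{{solution}}\n')
    with open(tex_path, 'w') as tex:
        tex.write('\\documentclass{article}\n'
                  '\\input{preamble}  % your untouched preamble file\n'
                  '\\begin{document}\n')
        for _ in range(n_questions):
            n = random.randint(lo, hi)     # 2-digit number to square
            tex.write(template.format(n=n, sq=n * n))
        tex.write('\\end{document}\n')
    subprocess.call(['xelatex', tex_path])  # compile to PDF

make_worksheet('worksheet.tex', n_questions=20)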

How to extract specific lines from a huge data file?

I have a very large data file, about 32 GB. The file is made up of about 130k lines, each of which mainly contains numbers but also a few characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this sort of operation, or do I need to write some code?
If there is no direct way of doing that, I was thinking that a possible approach is to first extract small blocks of the original file (so that each block contains one or more lines to extract) and then use a standard editor to find the lines within each block. In this case, the question would be: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split which works very well with large files, but it can only split by size, not by line count.
Install[1] Babun Shell (or Cygwin, but I recommend Babun), and then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun means actually just unzipping it somewhere, so you don't have to have the Administrator rights on the server.
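If you'd rather not install anything, the same job is a few lines of Python: stream the file once and keep only the wanted line numbers (the file names and line numbers below are made up):

# Hypothetical 1-based line numbers to extract.
wanted = {5, 120, 4017, 99998}

with open('huge.dat') as src, open('extracted.txt', 'w') as dst:
    for lineno, line in enumerate(src, start=1):
        if lineno in wanted:
            dst.write(line)
            wanted.discard(lineno)
            if not wanted:   # every wanted line written; stop reading early
                break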

How to save output from a Mathematica file into some other format and then use it later?

I have a large program in Mathematica that generates many outputs. However, I don't want those outputs to be visible in my program; I want to save them in text or some other format.
Moreover, I also want to call up any of the outputs later and perform specific operations on it (plotting, squaring, etc.).
Please guide me in this respect.
You need to use the Export function. If you want the output suppressed, use semicolons. See the help file on Export for more.
(I don't know what else to say..)
The Export function is the way to go.
For example, if you ever needed 1 billion digits of Pi:
Export["pi-to-1B.txt", N[[Pi], 1000000000]];

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MByte -- for simple text in tables that's huuuuge, unless you have 200,000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, somewhat less convenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages but only pages 1-10 (as a proof of concept, to see if it works at all). To extract every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question describing the details of your findings so far. Then you will need to resort to bigger calibres to shoot down the problem.
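If the text extraction works, turning the .txt into database rows is the easy part. A rough sketch of the continuation-line handling (Python here for brevity, the same logic ports directly to Ruby; the input file name is taken from the pdftotext step above, and the pipe-separated five-column layout is an assumption based on the question):

rows = []
pending = None   # holds a partial row waiting for its continuation line

with open('first-10-pages-from-Monster-PDF-sanitized.txt') as f:
    for raw in f:
        fields = [col.strip() for col in raw.split('|')]
        if pending is not None:
            fields = pending + fields   # merge with the overflowed row
            pending = None
        if len(fields) < 5:             # Name overflowed into Address:
            pending = fields            # the rest is on the next line
        else:
            rows.append(fields[:5])     # Name, Address, Cash, Year, Holder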
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine what is a row and what is a column, but that may depend on the PDF itself.
The Java libraries are the most robust and may do more than just extract text, so I would look into JRuby with iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?
