When was a file used by another program - macOS

I have a folder with a series of 750 CSV files. I wrote a Stata program that runs through each of these files and performs a single task. The files are quite big, and my program has been running for more than 12 hours.
Is there a way to know which was the last of the CSV files used by Stata? I would like to know if the code is somewhere near finishing. I thought that sorting the 750 files by "last used" would do the trick, but it does not.
Next time I should be more careful about signalling how the process is going...
Thank you

From the OS X terminal, cd to the directory containing the CSV files, and run the command
ls -lut | head
which should show your files sorted by most recent access time, limited to the ten most recently accessed.
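If you'd rather see the access times themselves, an alternative sketch using BSD stat (assuming the volume is actually recording access times, which macOS does not always update eagerly):
stat -f '%a %N' *.csv | sort -n | tail
Here %a prints the last-access time as epoch seconds and %N the file name, so sort -n puts the most recently read files at the bottom.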

At the most basic level, you can use display and log your session:
clear
set more off
local myfiles file1 file2 file3
foreach f of local myfiles {
    display "processing `f'"
    // <do-something-with-file>
}
See also help log.
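For the next run, a minimal sketch of also capturing those progress messages in a log file (the log name is just a placeholder):
log using progress.log, replace text
* ... the loop above goes here ...
log close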

Related

Maximum amount of lines that can be copy/pasted into command prompt

I'm using the AWS CLI tool to download hundreds of thousands of files. I have almost a million of these one-liners, generated from a SQL query, each with a different file path that I need to go through:
aws s3 cp s3://[myS3FilePath]/17802c9-6d3b-4eef-855a-a6ae0039c7ff/ C:\[MyLocalFilePath]\17802c9-6d3b-4eef-855a-a6ae0039c7ff\ --recursive
I've been taking ~1000 lines at a time, pasting them into the command prompt, and waiting for them to be iterated through. Works great!
It's quite a waste of time doing it in 1000-line batches, though. What's the maximum number of lines that I could paste into CMD without losing any of my download commands?
Could I paste 1,000,000 lines into the command prompt, for example, and trust that it will iterate through all of them?
The easy answer is to put all of the one-liners into a .bat script and run that script instead.
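For illustration, the .bat file is nothing more than the generated one-liners saved as plain text; the bucket and local paths below are placeholders:
@echo off
REM download_all.bat - generated aws s3 cp commands, one per line
aws s3 cp s3://myBucket/0001/ C:\Downloads\0001\ --recursive
aws s3 cp s3://myBucket/0002/ C:\Downloads\0002\ --recursive
REM ... the remaining ~1,000,000 generated lines ...
Running download_all.bat > download.log 2>&1 from the prompt then works through every line with no paste limit to worry about, and leaves a log you can check afterwards.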

Loop Over Files as Input for Program, Rename and Write Output to Different Directory

I have a problem writing the output of a program to a different directory when I loop over different files as inputs. I run this on the command line. The problem is that I do not know how to "tell" the program to put the output, with a changed filename, into a directory other than the input directory.
Here is the command, although it is a bioinformatics tool which requires specific input file formats, so I am sorry that I could not give a better example. The program is called computeMatrix, from a software toolbox called deepTools2.
command:
for f in ~/my/path/*spc_files*; do computeMatrix reference-point --referencePoint center --regionsFileName /target/region.bed --binSize 500 --scoreFileName "$f" --outFileName "$f.matrix"; done
So far, I have tried to use the command basename to get just the filename and then put a different directory in front of it. However, I could not figure out:
whether this can be combined with the loop
what the correct order of the commands would be (e.g. outputFile=$(basename "$f"), then "~/new/targetDir/$outputFile")
Probably there are other options to solve the problem which I could not think of or find.
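For what it's worth, here is a sketch of one way the pieces could combine, assuming a hypothetical target directory ~/new/targetDir and reusing the computeMatrix options from the command above:
outdir=~/new/targetDir          # hypothetical output directory
mkdir -p "$outdir"
for f in ~/my/path/*spc_files*; do
    name=$(basename "$f")       # drop the input directory, keep the file name
    computeMatrix reference-point --referencePoint center \
        --regionsFileName /target/region.bed --binSize 500 \
        --scoreFileName "$f" \
        --outFileName "$outdir/$name.matrix"
done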

concatenate fastq files in a directory

I have a file uploader, resumable.js, which takes a file, breaks it into 1MB 'chunks', and then sends the file over 1MB at a time. So after an upload I have a directory with thousands, sometimes millions, of individual fastq chunks. I can concatenate all of these 'chunks' back into the file's original state with this line of code:
cat file_name.* > merged.fastq
How would I go about concatenating the files back into their original state without manually running this command? Should I set up a bash script to handle this, maybe a cron job? Any ideas to solve this issue are greatly appreciated.
ANSWER: For what it's worth, I used this npm module and it works great.
https://www.npmjs.com/package/joiner
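If a plain shell approach is preferred over the npm module, a rough sketch (directory names are placeholders, and it assumes chunks are named <original_name>.<chunk_number> as in the cat one-liner above) could be run by hand or from a cron job:
#!/bin/bash
upload_dir=/path/to/uploads    # placeholder
merged_dir=/path/to/merged     # placeholder
mkdir -p "$merged_dir"
cd "$upload_dir" || exit 1
# Recover each original file name by stripping the trailing chunk number,
# then concatenate its chunks exactly as the one-liner above does.
for base in $(ls | sed 's/\.[0-9][0-9]*$//' | sort -u); do
    # Note: the glob expands lexicographically, so chunk numbers should be
    # zero-padded (or sorted numerically) for the pieces to land in order.
    cat "$base".* > "$merged_dir/$base"
done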

How can I find a directory-diff of millions of files to script maintenance?

I have been working on how to verify that millions of files that were on file system A have in fact been moved to file system B. While working on a system migration, it became evident that all the files needed to be audited to prove that they had been moved. The files were initially moved via rsync, which does provide logs, although not in a format that is helpful for an audit. So, I wrote this script to index all the files on system A:
#!/bin/bash
# Get directories and file list to be used to verify proper file moves have worked successfully.
LOGDATE=$(/usr/bin/date +%Y-%m-%d)
FILE_LIST_OUT=/mounts/A_files_$LOGDATE.txt
MOUNT_POINTS="/mounts/AA /mounts/AB"
touch "$FILE_LIST_OUT"
echo "TYPE,USER,GROUP,BYTES,OCTAL,OCTETS,FILE_NAME" > "$FILE_LIST_OUT"
for directory in $MOUNT_POINTS; do
    # format: type,user,group,bytes,octal,octets,file_name
    gfind "$directory" -mount -printf "%y,%u,%g,%s,%m,%p\n" >> "$FILE_LIST_OUT"
done
The file indexing works fine and takes about two hours to index ~30 million files.
Side B is where we run into issues. I have written a very simple shell script that reads the index file, tests to see if each file is there, and then counts how many files are there, but it runs out of memory while looping through the 30 million lines of indexed file names. Effectively it is doing the little bit of code below inside a while loop, with counters that increment for files found and not found.
if [ -f "$TYPE" "$FILENAME" ] ; then
print file found
++
else
file not found
++
fi
My questions are:
Can a shell script do this type of reporting from such a large list? A 64-bit Unix system ran out of memory while trying to execute this script. I have already considered breaking up the input file into smaller chunks to make it faster. Currently it can…
If a shell script is inappropriate, what would you suggest?
You just used rsync; use it again:
--ignore-existing
This tells rsync to skip updating files that already exist on the destination (this does not ignore existing directories, or nothing would get done). See also --existing.
This option is a transfer rule, not an exclude, so it doesn’t affect the data that goes into the file-lists, and thus it doesn’t affect deletions. It just limits the files that the receiver requests to be transferred.
This option can be useful for those doing backups using the --link-dest option when they need to continue a backup run that got interrupted. Since a --link-dest run is copied into a new directory hierarchy (when it is used properly), using --ignore-existing will ensure that the already-handled files don't get tweaked (which avoids a change in permissions on the hard-linked files). This does mean that this option is only looking at the existing files in the destination hierarchy itself.
That will actually fix any problems (at least in the same sense that any diff list built from file-exist tests could fix problems). Using --ignore-existing means rsync only does the file-exist tests (so it'll construct the diff list as you request and use it internally). If you just want information on the differences, check --dry-run and --itemize-changes.
Let's say you have two directories, foo and bar. bar has three files: 1, 2, and 3. bar also has a directory quz, which contains a file 1. The directory foo is empty.
Now, here is the result,
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/
>f+++++++++ 1
>f+++++++++ 2
>f+++++++++ 3
cd+++++++++ quz/
>f+++++++++ quz/1
Note, you're not interested in the cd+++++++++ line -- in the itemized output that just means the directory quz/ would be created on the destination. Now, let's add a file called 1 in foo, and use grep to remove the directory lines:
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/ | grep -v '^cd'
>f+++++++++ 2
>f+++++++++ 3
>f+++++++++ quz/1
f is for file. The +++++++++ means the file doesn't exist in the DEST dir.
Here's the bonus: remove --dry-run and it'll go ahead and make the changes for you.
Have you considered a solution such as kdiff3, which will diff directories of files?
Note this feature from version 0.9.84:
Directory-Comparison: Option "Full Analysis" allows to show the number
of solved vs. unsolved conflicts or deltas vs. whitespace-changes in
the directory tree.
There is absolutely no problem reading a 30 million line file in a shell script. The reason why your process failed was most likely that you tried to read the file entirely into memory, e.g. by doing something wrong like for i in $(cat file).
The correct way of reading a file is:
while IFS= read -r line
do
echo "Something with $line"
done < someFile
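Applied to the existence check from the question, a sketch of that streaming approach with counters might look like this (it assumes the file name is the last comma-separated field of each index line and contains no commas itself):
#!/bin/bash
index="$1"     # path to the index file produced on system A
found=0
missing=0
# Skip the CSV header, then stream the index one line at a time.
while IFS= read -r line; do
    filename=${line##*,}               # everything after the last comma
    if [ -e "$filename" ]; then
        found=$((found + 1))
    else
        missing=$((missing + 1))
        echo "missing: $filename"
    fi
done < <(tail -n +2 "$index")
echo "found=$found missing=$missing"
The process substitution keeps the loop in the current shell, so the counters are still set after the loop finishes.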
A shell script is inappropriate, yes. You should be using a diff tool:
diff -rNq /original /new
If you're not particular about the solution being a script, you could also look into meld, which would let you diff directory trees quite easily and you can also set ignore patterns if you have any.

saving entire file in VIM

I have a very large CSV file, over 2.5GB, that, when importing into SQL Server 2005, gives an error message "Column delimiter not found" on a specific line (82,449).
The issue is with double quotes within the text of that column; in this instance, it's a note field where someone wrote "Transferred money to ""MIKE"", Thnks".
Because the file is so large, I can't open it up in Notepad++ and make the change, which brought me to find VIM.
I am very new to VIM and I reviewed the tutorial document, which taught me how to change the file using 82449G to find the line, l to move over to the spot, and x to delete the double quotes.
When I save the file using :saveas c:\Test VIM\Test.csv, it seems to save only a portion of the file. The original file is 2.6GB and the newly saved one is 1.1GB. The original file has 9,389,222 rows and the new one has 3,751,878. I tried using the G command to get to the bottom of the file before saving, which increased the size quite a bit, but still didn't save the whole file; before using G, the saved file was only 230 MB.
Any ideas as to why I'm not saving the entire file?
You really need to use a "stream editor", something similar to sed on Linux, that lets you pipe your text through it without trying to keep the entire file in memory. In sed I'd do something like:
sed 's/""MIKE""/"MIKE"/' < source_file_to_read > cleaned_file_to_write
There is a sed for Windows.
As a second choice, you could use a programming language like Perl, Python or Ruby to process the text line by line, writing each line out as you search for the doubled quotes, changing the one line in question, and continuing to write until the file has been completely processed.
VIM might be able to load the file if your machine has enough free RAM, but it'll be a slow process. If it does load, you can search from normal mode using:
:/""MIKE""/
and manually remove a doubled-quote, or have VIM make the change automatically using:
:%s/""MIKE""/"MIKE"/g
In either case, write, then close, the file using:
:wq
In VIM, normal mode is the default state of the editor, and you can get back to it using your ESC key.
You can also split the file into smaller, more manageable chunks and then combine them back together afterwards. Here's a bash script that can split the file into equal parts:
#!/bin/bash
fspec=the_big_file.csv
num_files=10 # how many mini-files you want
total_lines=$(wc -l < "${fspec}")
((lines_per_file = (total_lines + num_files - 1) / num_files))
split --lines=${lines_per_file} "${fspec}" part.
echo "Total Lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l part.*
I just tested it on a 1GB file with 61,151,570 lines, and each resulting file was almost 100 MB.
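To combine the pieces back together after editing, the default suffixes that split produces (part.aa, part.ab, ...) sort correctly with a plain glob, so something like this rebuilds the file (the output name is a placeholder):
cat part.* > the_big_file_fixed.csv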
Edit:
I just realized you are on Windows, so the above may not apply. You can use a utility like Simple Text Splitter, a Windows program which does the same thing.
When you're able to open the file without errors like E342: Out of memory!, you should be able to save the complete file, too. There should at least be an error on :w; a partial save without an error is a severe loss of data and should be reported as a bug, either on the vim_dev mailing list or at http://code.google.com/p/vim/issues/list
Which exact version of Vim are you using? Using GVIM 7.3.600 (32-bit) on Windows 7/x64, I wasn't able to open a 1.9 GB file without running out of memory. I was able to successfully open, edit, and save (fully!) a 3.9 GB file with the 64-bit version 7.3.000 from here. If you're not using that native 64-bit version yet, give it a try.
