GNU Parallel does not run on all files

I am using GNU Parallel to run the same command on multiple files in a directory. Following the example provided in the documentation, I use
find input -name '*.json' -print0 | parallel -0 "context --file={} --result={/.}.pdf sometex.tex"
This command should produce a PDF file for every JSON file in the directory. However, when I run it, I get different results every time: out of 1000 JSON files, sometimes I get 490 PDF files, sometimes 800. I have also tried running the command sequentially, and the sequential run produced all 1000 PDF files. Does anyone know why this happens and how to solve it?

After doing some investigation, I realised the issue is not with GNU Parallel but with the fact that every job was accessing the same sometex.tex file in this command:
find input -name '*.json' -print0 | parallel -0 "context --file={} --result={/.}.pdf sometex.tex"
This probably led to a race condition. I solved it by making a copy of the .tex file for each input, and things ran smoothly.
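For reference, a minimal sketch of that workaround (assuming sometex.tex sits in the working directory, and that basenames are unique): each job copies the .tex file under a name derived from its JSON input, runs context on its own copy, then cleans up.
find input -name '*.json' -print0 |
  parallel -0 'cp sometex.tex {/.}.tex && context --file={} --result={/.}.pdf {/.}.tex && rm -f {/.}.tex'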

Related

Copying a list of files with wildcards into a new folder

I don't have much experience with the command line, but essentially I have a list of files in a single folder as follows:
file1_a_1
file1_a_2
file2_b_1
file2_b_2
file3_c_1
file3_c_2
And I also have a text file with the files I want. However, this list does not have the full file path, instead, it looks like this:
file1_a file3_c
because I want to move all files that start with any of 30 or so specific codes (i.e. everything that starts with file1_a, file3_c, and so on).
I have tried:
cp file1_a* file3_c* 'dir/dest'
but this does not work. I have also tried the find command. I think I have to use a loop to do this but I cannot find any help on looping through files with a wildcard on the end.
Thanks in advance! I am working on a Linux machine in bash.
You can use the find command together with xargs and a pipe, for example:
find . -name 'xxxxx*' | xargs -I{} cp {} dir/dest/
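Since the prefixes live in a text file, a plain loop also works. A rough sketch, assuming the list is whitespace-separated and saved as codes.txt (the file name is a placeholder; dir/dest is the destination from your cp attempt):
for code in $(cat codes.txt); do
  cp "${code}"* dir/dest/
done
Each iteration expands the unquoted * against the current directory, so file1_a picks up file1_a_1, file1_a_2, and so on.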

Pass a list of files to sftp get

I am looking for an 'sftp' alternative to the following command:
cat list_of_files_to_copy.txt | xargs -I % cp -r % -t /target/folder/
That is: read a text file containing the folder paths to be copied, and pass each line (here via xargs) to a cp command to process them one by one.
I want to do this so I can parallelize the copying process: by partitioning the full set of folders into several text files, I can feed each partition to a separate copy command in its own terminal (if this will not work the way I expect, please comment).
For some reason, the copy command is very slow on my system (even when I don't try to parallelize), whereas sftp get seems more efficient.
Is there any way I can implement this using sftp get?
scp is the non-interactive version of sftp, so why don't you just create a loop like this:
for F in $(<list_of_files_to_copy.txt); do
  scp -r "user@remotehost:$F" /target/folder/   # replace user@remotehost with your own host
done
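If you want sftp itself rather than scp, batch mode is one route. A sketch under the assumption that the remote host is user@remotehost and that the paths in list_of_files_to_copy.txt are remote folder paths; split the list first (e.g. with split -l) if you want several batches running in parallel terminals:
awk '{print "get -r " $0 " /target/folder/"}' list_of_files_to_copy.txt > batch.txt
sftp -b batch.txt user@remotehost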

Loop Over Files as Input for Program, Rename and Write Output to Different Directory

I have a problem with writing the output of a program to a different directory when I loop over different files as inputs. I run this on the command line. The problem is that I do not know how to "tell" the program to put the output, with a changed filename, into a directory other than the input directory.
Here is the command. It is a bioinformatics tool that requires specific input file formats, so I am sorry that I could not give a better example. The program is called computeMatrix and belongs to a software toolbox called deeptools2.
command:
for f in ~/my/path/*spc_files*; do computeMatrix reference-point --referencePoint center --regionsFileName /target/region.bed --binSize 500 --scoreFileName "$f" --outFileName "$f.matrix"; done
So far, I tried to use the basename command to get just the filename and then put a different directory in front of it. However, I could not figure out:
whether this can be combined with the loop above
what the correct order of the commands is (e.g. outputFile=$(basename "$f"), or "~/new/targetDir/$(basename "$f")")
Probably there are other options to solve the problem which I could not think of or find.
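One way to do it, as a rough sketch that reuses the flags from your command (~/new/targetDir is taken from your attempt and is just a placeholder): strip the input path with basename and prepend the target directory when building --outFileName.
outdir=~/new/targetDir
for f in ~/my/path/*spc_files*; do
  name=$(basename "$f")   # filename without the input path
  computeMatrix reference-point --referencePoint center --regionsFileName /target/region.bed --binSize 500 --scoreFileName "$f" --outFileName "$outdir/$name.matrix"
done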

Need to remove *.xml from unknown directories that are older than x days

We have a directory:
/home/httpdocs/
In this directory there may be directories, sub-directories, sub-directories of sub-directories, and so on, that contain XML files (files that end in .xml). We do not know which directories contain the XML files, and these directories hold a massive number of files.
We want to archive all of the older files and remove them from the actual directories, so that the directories mentioned above only retain the last 7 days' worth of XML files.
It was mentioned to me that logrotate would be a good option for this. Is that the best way to do it, and if so, how would we set it up?
Also, if not using logrotate, can this be scripted? Can the script be run during production hours, or will it bog down the system?
find . -name "*.xml" -mtime +7 -print0 | tar -cvzf yourArchive.tar.gz --remove-files --null --files-from -
This will create a gzip-compressed tar file, yourArchive.tar.gz, containing all *.xml files in the current directory (and any depth of subdirectory) that were not changed during the last 7*24 hours; after the files are added to the tar archive, they are deleted.
Edit:
Can this script be run during production hours or will it bog down the system?
Depends on your system, actually. This does create a lot of I/O load. If your production system is I/O-heavy and you don't happen to have a fantastic I/O subsystem (like a huge RAID connected over Fibre Channel or the like), then this will have a noticeable impact on performance. How bad it is depends on further details, though.
If system load is an issue, then you could create a small database that keeps track of the files, maybe using inotify, which can run in the background over a longer period of time and be less noticeable.
You can also try to set the priority of the processes involved using renice, but since the problem is I/O and not CPU (unless your CPU is weak and your I/O is unusually fast for some reason), this might not have the desired effect. The next best option would be to write your own script that crawls the file tree and is sprinkled with sleeps. It will take some time to complete but will have less impact on your production system. I would not recommend any of this unless you really are under pressure to act.
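A rough sketch of that crawl-the-tree-with-sleeps idea (the backup folder and the sleep interval are placeholders); it moves old files out one at a time and pauses between them to keep the I/O load down:
find /home/httpdocs -name '*.xml' -mtime +7 -print0 |
while IFS= read -r -d '' f; do
  mv -b -t /path/to/backup/folder "$f"   # archive by moving; compress afterwards if needed
  sleep 0.2                              # throttle between files
done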
Use find /home/httpdocs -name "*.xml" -mtime +7 -exec archive {} \; where archive is a program that archives and removes an XML file.
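A minimal sketch of what such an archive helper could look like (the backup location is an assumption):
#!/bin/bash
# archive: gzip each file given on the command line into the backup folder,
# preserving its path, then remove the original.
set -eu
dest=/path/to/backup/folder
for f in "$@"; do
  mkdir -p "$dest/$(dirname "$f")"
  gzip -c "$f" > "$dest/$f.gz" && rm -- "$f"
done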
It'll probably be easiest to do this with find and a cron job.
The find command:
find /home/httpdocs -name \*.xml -ctime +7 -exec mv -b -t /path/to/backup/folder {} +
This will move any file ending in .xml within the /home/httpdocs tree to the backup folder you provide, making a backup of any file that would be overwritten (-b).
Now, to set this up as a cron job, run crontab -e as a user who has write permissions on both the httpdocs and backup folders (probably root, so sudo crontab -e). Then add a line like the following:
14 3 * * * find /home/httpdocs -name \*.xml -ctime +7 -exec mv -b -t /path/to/backup/folder {} +
This will run the command at 3:14am every day (change the 3 and 14 for different times). You could also put the find command into a script and run that, just to make the line shorter.

Find file in directory from command line

In editors/IDEs such as Eclipse and TextMate, there are shortcuts to quickly find a particular file in a project directory.
Is there a similar tool to do full path completion on filenames within a directory (recursively), in bash or another shell?
I have projects with a lot of directories, and deep ones at that (sigh, Java).
Hitting Tab in the shell only cycles through files in the immediate directory; that's not enough =/
find /root/directory/to/search -name 'filename.*'
# Directory is optional (defaults to cwd)
Standard UNIX globbing is supported. See man find for more information.
If you're using Vim, you can use:
:e **/filename.cpp
Or :tabnew, or any Vim command which accepts a filename.
If you're looking to do something with a list of files, you can use find combined with the bash $() construct (better than backticks since it's allowed to nest).
For example, say you're at the top level of your project directory and you want a list of all C files starting with "btree". The command:
find . -type f -name 'btree*.c'
will return a list of them. But this doesn't really help with doing something with them.
So, let's further assume you want to search all those files for the string "ERROR", or edit them all. You can execute one of:
grep ERROR $(find . -type f -name 'btree*.c')
vi $(find . -type f -name 'btree*.c')
to do this.
When I was in the UNIX world (using tcsh (sigh...)), I used to have all sorts of "find" aliases/scripts set up for searching for files. I think the default "find" syntax is a little clunky, so I used to have aliases/scripts that pipe "find . -print" into grep, which lets you use regular expressions for searching:
# finds all .java files starting in the current directory
find . -print | grep '\.java$'
# finds all .java files whose name contains "Message"
find . -print | grep 'Message.*\.java$'
Of course, the above examples can be done with plain old find, but if you have a more specific search, grep can help quite a bit. This works pretty well, unless "find . -print" has too many directories to recurse through... then it gets pretty slow. (For example, you wouldn't want to do this starting at the root "/".)
I use ls -R, piped to grep like this:
$ ls -R | grep -i "pattern"
where -R means recursively list all the files, and -i means case-insensitive. Finally, the pattern could be something like "std*.h" or "^io" (anything that starts with "io" in the file name).
I use this script to quickly find files across directories in a project. I have found it works great and takes advantage of Vim's autocomplete by opening and closing a new buffer for the search. It also smartly completes as much as possible for you, so you can usually just type a character or two and open the file from any directory in your project. I started using it specifically because of a Java project and it has saved me a lot of time. You just build the cache once when you start your editing session by typing :FC (directory names). You can also just use . to get the current directory and all subdirectories. After that you just type :FF (or :FS to open a new split) and it will open a new buffer to select the file you want. After you select the file, the temporary buffer closes and you are inside the requested file and can start editing. In addition, here is another link on Stack Overflow that may help.
http://content.hccfl.edu/pollock/Unix/FindCmd.htm
The Linux/Unix "find" command.
Yes, bash has filename completion mechanisms. I don't use them myself (too lazy to learn, and I don't find it necessary often enough to make it urgent), but the basic mechanism is to type the first few characters and then press Tab; this will extend the name as far as it can (perhaps not at all) as long as the name is unambiguous. There is a boatload of Emacs-style commands related to completion in the good ol' man page.
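If you just want recursive matching from the shell, bash 4+ also has the globstar option (not mentioned above), which lets ** match across directory levels; MyClass.java here is only an example name:
shopt -s globstar
ls **/MyClass.java      # or: vim **/MyClass.java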
locate <file_pattern>
find will certainly work, and can target specific directories. However, it is slower than the locate command. On a Linux system, a database containing a list of all directories and files is typically rebuilt daily, and the locate command searches this database efficiently, so if you are searching for files that weren't created today, this is the fastest way to accomplish the task.
