XARGS: Nesting Utilities within Utility call - bash

I am trying to build a YAML file for a large database by piping in a list of names to printf with xargs.
I would like to call ls in the printf command to get files specific to each name in my list; however, calls to ls nested within a printf command don't seem to work.
The following command
cat w1.nextract.list | awk '{print $1}' | xargs -I {} printf "'{}':\n\tw1:\n\t\tanatomical_scan:\n\t\t\tAnat: $(ls $(pwd)/{})\n"
just produces the following error:
ls: cannot access '/data/raw/long/{}': No such file or directory
Followed by an output that looks like:
'149959':
    w1:
        anatomical_scan:
            Anat:
I'd like to be able to use the standard input to xargs within the nested utility command, to give me the complete path to the necessary files, i.e.:
'149959':
    w1:
        anatomical_scan:
            Anat: /data/raw/long/149959/test-1/test9393.txt
Anyone have any ideas?

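The root cause: the command substitution $(ls $(pwd)/{}) is expanded by the shell when the pipeline is parsed, before xargs ever runs, so ls is called once with the literal argument /data/raw/long/{}. One fix is to skip xargs entirely and let a shell loop do the per-line work. A minimal sketch, assuming (like the answer below) that the first field of each line is a path such as 149959/test-1/test9393.txt relative to the current directory:
while read -r rel _; do
printf "'%s':\n\tw1:\n\t\tanatomical_scan:\n\t\t\tAnat: %s\n" "${rel%%/*}" "$PWD/$rel"
done < w1.nextract.list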
A safer way that has several caveats would be:
% cat w1.nextract.list | \
sed -e 's#^\([^/]*\)/\(.*\)$#'\''\1'\'':\n w1:\nANA-SCAN\nANAT-PWD\1/\2#' \
-e "s#ANAT-PWD# Anat: `pwd`/#" \
-e 's/ANA-SCAN/ anatomical_scan:/'
There are restrictions on the contents of the w1.nextract.list file:
None of the lines may contain a hash ('#') character.
Any other special characters on a line may be unsafe.
For testing, I created the w1.nextract.list file with one entry:
149959/test-1/test9393.txt
The resulting output is here:
'149959':
 w1:
 anatomical_scan:
 Anat: /data/raw/long/149959/test-1/test9393.txt
Can you explain in more detail? What makes this so fragile?
Feeding xargs input into a printf format string can lead to unexpected results if the input file contains special characters or escape sequences, since anything in the data that looks like a printf directive gets interpreted. A bad actor could modify your input file to exploit this. Best practice is to avoid it.
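A quick demonstration of the escape-sequence problem (a sketch with a hostile input line):
% printf '%s\n' '149959\n%s' | xargs -I {} printf "'{}':\n"
'149959
':
The \n and %s in the data were interpreted as printf directives rather than printed literally.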
Fragility also comes from maintaining the w1.nextract.list file by hand. You could auto-generate the file to reduce this issue:
cd /data/raw/long/; find * -type f -print
What is a real YAML implementation?
The yq command is an example YAML implementation. You could use it to craft the .yaml file.
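For example, a minimal sketch assuming the Go-based yq (v4); that is an assumption, since several different tools share that name. Here strenv() reads an environment variable, and ID/ANAT are just illustrative names:
while read -r rel _; do
ID=${rel%%/*} ANAT="$PWD/$rel" yq -n '.[strenv(ID)].w1.anatomical_scan.Anat = strenv(ANAT)'
done < w1.nextract.list
This emits one properly quoted and indented YAML document per entry, so unsafe characters in the filenames are yq's problem rather than yours.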
I haven't worked with these types of packages before, so this would be a first-time approach for me.
Using python, perl, or even php would allow you to craft the file without worrying about unsafe characters in the filenames.

Related

sed command to change names for few files in different directories at once

I have a few folders such as S1S, S2S, S3S, ... In each of these folders there is a file1.
This file1 in each folder consists of:
1990.A.BHT_S1S.dat
1994.I.BHT_S1S.dat
1995.K.BHT_S1S.dat
likewise, the S1S part changes according to the folder.
I'm trying to change these names into the form 1990.A.BHT for all folders, using this command:
for dir in S*
do
cd $dir
sed -i 's/_${dir}\.dat//g' file1 > file2
cd ../
done
but I get an empty file for file2.
Can someone help me figure out my mistake, please?
This might work for you (GNU sed and parallel):
parallel sed 's/_{}\.dat//' {}/file1 \> {}/file2 ::: S*S
Create a new file file2 in each directory S1S S2S S3S ... from file1 with the string _SnS.dat removed (where SnS represents the current directory).
There are several problems here. First, as konsolebox said in a comment, sed -i modifies the original file rather than producing output that can be redirected with >, so you need to remove that option.
Second, variables don't expand in single-quoted strings, so 's/_${dir}\.dat//g' doesn't use the dir variable, it just treats that whole thing as a literal string.
The third is probably ok, but using cd in a script is dangerous, because if it fails for some reason the rest of the script will run in unexpected places, with possibly very bad results. It's generally better to use explicit paths, like sed ... "$dir/file1" instead of cding to $dir and then using sed ... file1.
Finally (again probably ok here) is that you should almost always put double-quotes around variable references, to avoid weird parsing of some characters.
So here's how I'd rewrite the script snippet:
for dir in S*
do
sed "s/_${dir}\.dat//g" "$dir/file1" > "$dir/file2"
done
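For the S1S sample file1 shown in the question, S1S/file2 then contains the names with the suffix stripped:
1990.A.BHT
1994.I.BHT
1995.K.BHT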
p.s. shellcheck.net is good at spotting common mistakes in shell scripts; it spots three of the four problems I saw (all but the sed -i problem). I recommend running your scripts through it as a check.

xargs -a [file] mv -t [new-directory] gives me mv: cannot stat `filename*': No such file or directory error

I have been trying to run this command (that I have run before in a different directory), and everything I've read on the message boards has not solved my unknown issue.
Of note: 1) the files exist in this directory; 2) I have proper permissions to move these files around; 3) I have run this exact line of code before and it has worked; 4) I tried listing files with and without '*' to capture all the files (see below); 5) I also tried to list each file as 'Sample1', but that did not work.
xargs -a [filename.txt] mv -t [new-directory]
I have file beginnings (I have ~5 files for each beginning), and I want to move all the files associated with that beginning.
Example: Sample1.bam, Sample1.sorted.bam, etc.
The lines in the file are listed as such:
Sample1*
Sample2*
Sample3* ...etc.
What am I doing incorrectly and how can I fix it?
TIA!
When you execute a command using xargs, arguments are passed directly to the called program (mv in your case). Wildcard patterns in the input are not expanded: 'Sample1*' is passed as-is to mv, which issues an error message about there being no file named 'Sample1*'.
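You can see exactly what mv receives by substituting echo for it (a sketch):
$ printf '%s\n' 'Sample1*' | xargs echo mv -t NEW-FOLDER
mv -t NEW-FOLDER Sample1*
No shell is involved, so the asterisk reaches mv unexpanded.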
To get file name expansion, you want to use the shell. One way to handle this situation is
xargs -a FILENAME.TXT -I__ sh -c "mv -t NEW-FOLDER -- __"
Security note: the code provides some protection against command-line injection (e.g., a file name starting with '-'). However, other attacks are still possible. A safer version is
cat FILENAME.txt | grep '^[A-Za-z0-9][A-Za-z0-9._-]*$' | xargs -I__ sh -c "mv -t NEW-FOLDER -- __"
which limits the input to names built from alphanumeric characters plus '.', '_', and '-'. The grep pattern can be extended as needed; note that, as written, it also rejects glob lines such as 'Sample1*' from the question, so you would have to add '*' to the bracket expression here.
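To see what the filter lets through, try it on a few test lines (a sketch):
$ printf '%s\n' 'Sample1*' 'good-name.txt' '-rf' | grep '^[A-Za-z0-9][A-Za-z0-9._-]*$'
good-name.txt
'Sample1*' is rejected because of the asterisk, and '-rf' because the first character must be alphanumeric.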
With GNU Parallel you would do something like:
cat FILENAME.txt | parallel mv {} NEW-FOLDER
One of the benefits of GNU Parallel is that it deals correctly with file names like:
My brother's 12" records cost > $1000.txt

Merging large number of files into one

I have around 30K files. I want to merge them into one. I used cat but I am getting this error:
cat *.n3 > merged.n3
-bash: /usr/bin/xargs: Argument list too long
How can I increase the limit for the "cat" command? Please help me if there is an iterative method to merge a large number of files.
Here's a safe way to do it, without the need for find:
printf '%s\0' *.n3 | xargs -0 cat > merged.txt
(I've also chosen merged.txt as the output file, as @MichaelDautermann soundly advises; rename to merged.n3 afterward).
Note: The reason this works is:
printf is a bash shell builtin, whose command line is not subject to the length limitation of command lines passed to external executables.
xargs is smart about partitioning the input arguments (passed via a pipe and thus also not subject to the command-line length limit) into multiple invocations so as to avoid the length limit; in other words: xargs makes as few calls as possible without running into the limit.
Using \0 as the delimiter paired with xargs' -0 option ensures that all filenames - even those with, e.g., embedded spaces or even newlines - are passed through as-is.
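To see the NUL-delimiter handling in action (a sketch in an empty test directory):
$ touch 'plain.n3' 'with space.n3'
$ printf '%s\0' *.n3 | xargs -0 printf '[%s]\n'
[plain.n3]
[with space.n3]
Each name arrives as a single argument, space and all.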
The traditional way:
> merged.txt        # truncate/create the output file first
for file in *.n3
do
cat "$file" >> merged.txt
done
(Writing to merged.txt rather than merged.n3 keeps the output file from matching the *.n3 glob and being appended to itself; rename it afterward.)
Try using "find":
find . -name \*.n3 -exec cat {} > merged.txt \;
This "finds" all the files with the "n3" extension in your directory and then passes each result to the "cat" command.
And I set the output file name to be "merged.txt", which you can rename to "merged.n3" after you're done appending, since you likely do not want your new "merged.n3" file appending within itself.
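If your find supports the POSIX {} + terminator, you can also batch many files per cat invocation instead of running cat once per file (a sketch):
find . -name '*.n3' -exec cat {} + > merged.txt
This behaves much like the xargs approach above: find collects as many filenames as fit on one command line per cat call.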

Extract part of a filename shell script

In bash I would like to extract part of many filenames and save that output to another file.
The files are formatted as coffee_{SOME NUMBERS I WANT}.freqdist.
#!/bin/sh
for f in $(find . -name 'coffee*.freqdist')
That code will find all the coffee_{SOME NUMBERS I WANT}.freqdist files. Now, how do I make an array containing just {SOME NUMBERS I WANT} and write that to a file?
I know that to write to file one would end the line with the following.
> log.txt
I'm missing the middle part though of how to filter the list of filenames.
You can do it natively in bash as follows:
filename=coffee_1234.freqdist
tmp=${filename#*_}
num=${tmp%.*}
echo "$num"
This is a pure bash solution. No external commands (like sed) are involved, so this is faster.
Append these numbers to a file using:
echo "$num" >> file
(You will need to delete/clear the file before you start your loop.)
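To apply this to every file the question's find turns up and collect the numbers in log.txt, a sketch (it assumes the number part contains no further underscores or dots, and that no filename contains a newline):
: > log.txt                         # start with an empty file
find . -name 'coffee_*.freqdist' | while IFS= read -r f; do
tmp=${f##*_}                        # strip everything through the last underscore
echo "${tmp%%.*}" >> log.txt        # drop the extension, keep the number
done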
If the intention is just to write the numbers to a file, you do not need the find command:
ls coffee*.freqdist
coffee112.freqdist coffee12.freqdist coffee234.freqdist
The below should do it which can then be re-directed to a file:
$ ls coffee*.freqdist | sed 's/coffee\(.*\)\.freqdist/\1/'
112
12
234
The previous answers have indicated some necessary techniques. This answer organizes the pipeline in a simple way that might apply to other jobs as well. (If your sed doesn't support ‘;’ as a separator, replace ‘;’ with ‘|sed’.)
$ ls */c*; ls c*
fee/coffee_2343.freqdist
coffee_18z8.x.freqdist coffee_512.freqdist coffee_707.freqdist
$ find . -name 'coffee*.freqdist' | sed 's/.*coffee_//; s/[.].*//' > outfile
$ cat outfile
512
18z8
2343
707

Find and replace html code for multiple files within multiple directories

I have a very basic understanding of shell scripting, but what I need to do requires more complex commands.
For one task, I need to find and replace html code within the index.html files on my server. These files are in multiple directories with a consistent naming convention. ([letter][3-digit number]) See the example below.
files: index.html
path: /www/mysite/board/today/[rsh][0-9]/
string to find: (div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I hope you don't mind the pseudo-regex. The folders containing my target index.html files look similar to r099, s017, h123. And suffice it to say, the HTML code I'm trying to replace is relatively long, but it's still just a string.
The second task is similar to the first, only the filename changes as well.
files: [rsh][0-9].html
path: www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/
string: (div id="id")[code](/div)<--include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I've seen other examples on SO and elsewhere on the net that simply show scripts modifying files under a single directory to find & replace a string without any special characters, but I haven't seen an example similar to what I'm trying to do just yet.
Any assistance would be greatly appreciated.
Thank You.
You have three separate sub-problems:
replacing text in a file
coping with special characters
selecting files to apply the transformation to
1. The canonical text replacement tool is sed:
sed -e 's/PATTERN/REPLACEMENT/g' <INPUT_FILE >OUTPUT_FILE
If you have GNU sed (e.g. on Linux or Cygwin), pass -i to transform the file in place. You can act on more than one file in the same command line.
sed -i -e 's/PATTERN/REPLACEMENT/g' FILE OTHER_FILE…
If your sed doesn't have the -i option, you need to write to a different file and move that into place afterwards. (This is what GNU sed does behind the scenes.)
sed -e 's/PATTERN/REPLACEMENT/g' <FILE >FILE.tmp
mv FILE.tmp FILE
2. If you want to replace a literal string by a literal string, you need to prefix all special characters by a backslash. For sed patterns, the special characters are .\[^$* plus the separator for the s command (usually /). For sed replacement text, the special characters are \& and newlines. You can use sed to turn a string into a suitable pattern or replacement text.
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')
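For instance, running the pattern-escaping step on a string full of metacharacters (a sketch with a made-up string):
$ printf %s 'a.b*c/d' | sed -e 's![.\[^$*/]!\\&!g'
a\.b\*c\/d
Each character that sed would otherwise treat specially is now preceded by a backslash.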
3. To act on multiple files directly in one or more directories, use shell wildcards. Your requirements don't seem completely consistent; I think these are the patterns you're looking for, but be sure to review them.
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
This will match files like /www/mysite/board/today/r012/index.html and /www/mysite/person/4/5/6/card/2011/h7.html, but not /www/mysite/board/today/subdir/s012/index.html or /www/mysite/board/today/r1234/index.html.
If you need to act on files in subdirectories recursively, use find. It doesn't seem to be in your requirements and this answer is long enough already, so I'll stop here.
4. Putting it all together:
string_to_replace='(div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)'
replacement_string='(div id="id")<--include="(path)"-->(/div)'
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')
sed -i -e "s/$pattern/$replacement/g" \
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html \
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
Final note: you seem to be working on HTML with regular expressions. That's often not a good idea.
Finding the files can easily be done using find -regex:
find www/mysite/board/today -regex ".*[rsh][0-9][0-9][0-9]/index\.html"
find www/mysite/person -regex ".*[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9]\.html"
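One way to sidestep the manual escaping described in part 2 of the previous answer is perl's \Q...\E (quotemeta), which treats an interpolated string as literal text. A sketch combining it with these find commands (FROM and TO are illustrative names holding the literal strings from the question):
export FROM='(div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)'
export TO='(div id="id")<--include="(path)"-->(/div)'
find www/mysite/board/today -regex '.*[rsh][0-9][0-9][0-9]/index\.html' \
-exec perl -i -pe 's/\Q$ENV{FROM}\E/$ENV{TO}/g' {} +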
Due to the nature of HTML, replacing the content might not be very easy with sed, so I would suggest using an HTML or XML parsing library in a Perl script. Can you provide a short sample of an actual HTML file and the result of the replacements?
