Cat content of files to .txt files with common pattern name in bash - bash

I have a series of .dat files and a series of .txt files that have a common matching pattern. I want to cat the content of the .dat files into each respective .txt file with the matching pattern in the file name, in a loop. Example files are:
xfile_pr_WRF_mergetime_regionA.nc.dat
xfile_pr_GFDL_mergetime_regionA.nc.dat
xfile_pr_RCA_mergetime_regionA.nc.dat
#
yfile_pr_WRF_mergetime_regionA.nc.dat
yfile_pr_GFDL_mergetime_regionA.nc.dat
yfile_pr_RCA_mergetime_regionA.nc.dat
#
pr_WRF_mergetime_regionA_final.txt
pr_GFDL_mergetime_regionA_final.txt
pr_RCA_mergetime_regionA_final.txt
What I have tried so far is the following (I am trying to cat the content of all files starting with "xfile" into the respective model .txt file):
#
find -name 'xfile*' | sed 's/_mergetime_.*//' | sort -u | while read -r pattern
do
echo "${pattern}"*
cat "${pattern}"* >> "${pattern}".txt
done

Let me make some assumptions:
All filenames contain _mergetime_* substring.
The pattern is the portion such as pr_GFDL and is essential to
identify the file.
Then please try the following:
declare -A map # create an associative array
for f in xfile_*.dat; do # loop over xfile_* files
pattern=${f%_mergetime_*} # remove _mergetime_* substring to extract pattern
pattern=${pattern#xfile_} # remove xfile_ prefix
map[$pattern]=$f # associate the pattern with the filename
done
for f in *.txt; do # loop over *.txt files
pattern=${f%_mergetime_*} # extract the pattern
[[ -f ${map[$pattern]} ]] && cat "${map[$pattern]}" >> "$f"
done

If I understood you correctly, you want the following:
- xfile_pr_WRF_mergetime_regionA.nc.dat
- yfile_pr_WRF_mergetime_regionA.nc.dat
----> pr_WRF_mergetime_regionA_final.txt
- xfile_pr_GFDL_mergetime_regionA.nc.dat
- yfile_pr_GFDL_mergetime_regionA.nc.dat
----> pr_GFDL_mergetime_regionA_final.txt
- xfile_pr_RCA_mergetime_regionA.nc.dat
- yfile_pr_RCA_mergetime_regionA.nc.dat
----> pr_RCA_mergetime_regionA_final.txt
So here's what you want to do in the script:
Get all .nc.dat files in the directory
Extract the pr_TYPE_mergetime_region part from the filename
Append the _final.txt part to the output file
Then actually pipe the cat output onto that file
So I ended up with the following code:
find *.dat | while read -r pattern
do
output=$(echo "$pattern" | sed -e 's/^[^_]*_//' -e 's/\.nc\.dat$//') # strip the "xfile_"/"yfile_" prefix and the ".nc.dat" suffix
cat "$pattern" >> "${output}_final.txt"
done
And here are the files I ended up with:
pr_GFDL_mergetime_regionA_final.txt
pr_RCA_mergetime_regionA_final.txt
pr_WRF_mergetime_regionA_final.txt
Kindly let me know in the comments if I misunderstood anything or missed anything.
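The filename-to-output mapping in the steps above can also be done with parameter expansion alone, without calling sed. A minimal sketch, assuming every input name has the form <prefix>_pr_..._.nc.dat as in the question; the scratch directory and the sample contents are made up for illustration:

```bash
#!/usr/bin/env bash
# Sketch: derive the output name with parameter expansion instead of sed.
# The scratch directory and file contents are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "x data" > xfile_pr_WRF_mergetime_regionA.nc.dat
echo "y data" > yfile_pr_WRF_mergetime_regionA.nc.dat

shopt -s nullglob                 # skip the loop entirely if nothing matches
for f in *.nc.dat; do
    base=${f#*_}                  # drop the leading "xfile_" / "yfile_"
    base=${base%.nc.dat}          # drop the trailing ".nc.dat"
    cat -- "$f" >> "${base}_final.txt"
done
cat pr_WRF_mergetime_regionA_final.txt
```

Because the glob sorts xfile_* before yfile_*, the x content is appended first. Since the loop appends with >>, running it twice duplicates the content.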

This seems to be what you are asking for:
concatxy.sh:
#!/usr/bin/env bash
# do not return the pattern if no file matches
shopt -s nullglob
# Iterate all xfiles
for xfile in "xfile_pr_"*".nc.dat"; do
# Regex to extract the common filename part
[[ "$xfile" =~ ^xfile_(.*)\.nc\.dat$ ]]
# Compose the matching yfile name
yfile="yfile_${BASH_REMATCH[1]}.nc.dat"
# Compose the output text file name
txtfile="${BASH_REMATCH[1]}_final.txt"
# Perform the concatenation of xfile and yfile into the .txt file
cat "$xfile" "$yfile" >"$txtfile"
done
Creating populated test files:
preptest.sh:
#!/usr/bin/env bash
# Populating test files
echo "Content of xfile_pr_WRF_mergetime_regionA.nc.dat" >xfile_pr_WRF_mergetime_regionA.nc.dat
echo "Content of xfile_pr_GFDL_mergetime_regionA.nc.dat" >xfile_pr_GFDL_mergetime_regionA.nc.dat
echo "Content of xfile_pr_RCA_mergetime_regionA.nc.dat" >xfile_pr_RCA_mergetime_regionA.nc.dat
#
echo "Content of yfile_pr_WRF_mergetime_regionA.nc.dat" > yfile_pr_WRF_mergetime_regionA.nc.dat
echo "Content of yfile_pr_GFDL_mergetime_regionA.nc.dat" >yfile_pr_GFDL_mergetime_regionA.nc.dat
echo "Content of yfile_pr_RCA_mergetime_regionA.nc.dat" >yfile_pr_RCA_mergetime_regionA.nc.dat
#
#pr_WRF_mergetime_regionA_final.txt
#pr_GFDL_mergetime_regionA_final.txt
#pr_RCA_mergetime_regionA_final.txt
Running test
$ bash ./preptest.sh
$ bash ./concatxy.sh
$ ls -tr1
concatxy.sh
preptest.sh
yfile_pr_WRF_mergetime_regionA.nc.dat
yfile_pr_RCA_mergetime_regionA.nc.dat
yfile_pr_GFDL_mergetime_regionA.nc.dat
xfile_pr_WRF_mergetime_regionA.nc.dat
xfile_pr_RCA_mergetime_regionA.nc.dat
xfile_pr_GFDL_mergetime_regionA.nc.dat
pr_GFDL_mergetime_regionA_final.txt
pr_WRF_mergetime_regionA_final.txt
pr_RCA_mergetime_regionA_final.txt
$ cat pr_GFDL_mergetime_regionA_final.txt
Content of xfile_pr_GFDL_mergetime_regionA.nc.dat
Content of yfile_pr_GFDL_mergetime_regionA.nc.dat
$ cat pr_WRF_mergetime_regionA_final.txt
Content of xfile_pr_WRF_mergetime_regionA.nc.dat
Content of yfile_pr_WRF_mergetime_regionA.nc.dat
$ cat pr_RCA_mergetime_regionA_final.txt
Content of xfile_pr_RCA_mergetime_regionA.nc.dat
Content of yfile_pr_RCA_mergetime_regionA.nc.dat

Related

Is there a way to add a suffix to files where the suffix comes from a list in a text file?

So currently the searches are coming up with a single word renaming solution, where you define the (static) suffix within the code. I need to rename based on a text based filelist and so -
I have a list of files in /home/linux/test/ :
1000.ext
1001.ext
1002.ext
1003.ext
1004.ext
Then I have a txt file (labels.txt) containing the labels I want to use:
Alpha
Beta
Charlie
Delta
Echo
I want to rename the files to look like (example1):
1000 - Alpha.ext
1001 - Beta.ext
1002 - Charlie.ext
1003 - Delta.ext
1004 - Echo.ext
How would you write a script that renames all the files in /home/linux/test/ according to the list in example1?
Loop over the files while reading the labels from labels.txt in parallel. Split each filename into the prefix and extension, then combine everything to make the new filename.
dir=/home/linux/test
for file in "$dir"/*.ext
do
read -r label
prefix=${file%.*} # remove everything from last .
ext=${file##*.} # remove everything before last .
mv "$file" "$prefix - $label.$ext"
done < labels.txt
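One caveat with feeding the labels through standard input is that any command inside the loop that reads stdin would consume labels too. A sketch that reads the labels from a separate file descriptor and stops when they run out; the descriptor number 3 and the scratch directory are choices made for this example, not part of the answer above:

```bash
#!/usr/bin/env bash
# Sketch: same loop, but labels arrive on fd 3 instead of stdin.
dir=$(mktemp -d)
touch "$dir/1000.ext" "$dir/1001.ext" "$dir/1002.ext"
printf '%s\n' Alpha Beta > "$dir/labels.txt"     # deliberately one label short

for file in "$dir"/*.ext; do
    # read the next label from fd 3; stop if labels.txt is exhausted
    read -r label <&3 || break
    mv -- "$file" "${file%.*} - $label.${file##*.}"
done 3< "$dir/labels.txt"
ls -1 "$dir"
```

With two labels and three files, the third file keeps its original name.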
I originally partly got the request wrong, although this step is still useful, because it gives you the filenames you need.
#!/bin/sh
count=1000
cp labels.txt stack
cat > ed1 <<EOF
1p
q
EOF
cat > ed2 <<EOF
1d
wq
EOF
next () {
[ -s stack ] && main
}
main () {
line="$(ed -s stack < ed1)"
echo "${count} - ${line}.ext" >> newfile
ed -s stack < ed2
count=$(($count+1))
next
}
next
Now we just need to move the files:-
cp newfile stack
for i in *.ext
do
newname="$(ed -s stack < ed1)"
mv -v "${i}" "${newname}"
ed -s stack < ed2
done
rm -v ./ed1
rm -v ./ed2
rm -v ./stack
rm -v ./newfile
On the possibility that you don't have exactly the same number of files as labels, I set it up to cycle a couple of arrays in pseudo-parallel.
$: cat script
#!/usr/bin/env bash
lst=( *.ext ) # array of files to rename
mapfile -t labels < labels.txt # array of labels to attach
for ndx in ${!lst[@]} # for each filename's numeric index
do # assign the new name
new="${lst[ndx]/.ext/ - ${labels[ndx%${#labels[@]}]}.ext}"
# show the command to rename the file
echo "mv \"${lst[ndx]}\" \"$new\""
done
$: ls -1 *ext # I added an extra file
1000.ext
1001.ext
1002.ext
1003.ext
1004.ext
1005.ext
$: ./script # loops back if more files than labels
mv "1000.ext" "1000 - Alpha.ext"
mv "1001.ext" "1001 - Beta.ext"
mv "1002.ext" "1002 - Charlie.ext"
mv "1003.ext" "1003 - Delta.ext"
mv "1004.ext" "1004 - Echo.ext"
mv "1005.ext" "1005 - Alpha.ext"
$: ./script > do # use ./script to write ./do
$: ./do # use ./do to change the names
$: ls -1
'1000 - Alpha.ext'
'1001 - Beta.ext'
'1002 - Charlie.ext'
'1003 - Delta.ext'
'1004 - Echo.ext'
'1005 - Alpha.ext'
do
labels.txt
script
You can just remove the echo to have ./script rename the files there.
I renamed labels to labels.txt to match your example.
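The wrap-around comes from taking the file index modulo the number of labels. A small sketch that isolates just that step, using made-up arrays instead of real files (nothing is renamed here):

```bash
#!/usr/bin/env bash
# Sketch: cycle through a shorter label list using the modulo of the index.
labels=(Alpha Beta Charlie)
files=(1000.ext 1001.ext 1002.ext 1003.ext 1004.ext)
plan=()
for ndx in "${!files[@]}"; do
    label=${labels[ndx % ${#labels[@]}]}   # wraps to 0 once ndx reaches the label count
    plan+=("${files[ndx]%.ext} - $label.ext")
done
printf '%s\n' "${plan[@]}"
```

Here the fourth and fifth files reuse Alpha and Beta, because 3 % 3 is 0 and 4 % 3 is 1.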
If you aren't using bash, this will need a call to something like sed or awk. Here's a short awk-based script that will do the same (note that gensub is a GNU awk extension).
$: cat script2
#!/usr/bin/env sh
printf "%s\n" *.ext > files.txt
awk 'NR==FNR{label[i++]=$0}
NR>FNR{ if (! label[i] ) { i=0 } cmd="mv \""$0"\" \""gensub(/[.]ext/, " - "label[i++]".ext", 1)"\"";
print cmd;
# system(cmd);
}' labels.txt files.txt
Uncomment the system line to make it actually do the renames as well.
It does assume your filenames don't have embedded newlines. Let us know if that's a problem.

looping with grep over several files

I have multiple files /text-1.txt, /text-2.txt ... /text-20.txt
and what I want to do is to grep for two patterns and stitch them into one file.
For example:
I have
grep "Int_dogs" /text-1.txt > /text-1-dogs.txt
grep "Int_cats" /text-1.txt> /text-1-cats.txt
cat /text-1-dogs.txt /text-1-cats.txt > /text-1-output.txt
I want to repeat this for all 20 files above. Is there an efficient way in bash/awk, etc. to do this ?
#!/bin/sh
count=1
next () {
[ "${count}" -lt 21 ] && main
[ "${count}" -eq 21 ] && exit 0
}
main () {
file="text-${count}"
grep "Int_dogs" "${file}.txt" > "${file}-dogs.txt"
grep "Int_cats" "${file}.txt" > "${file}-cats.txt"
cat "${file}-dogs.txt" "${file}-cats.txt" > "${file}-output.txt"
count=$((count+1))
next
}
next
grep has some features you seem not to be aware of:
grep can be launched on lists of files, but the output will be different:
For a single file, the output will only contain the filtered line, like in this example:
cat text-1.txt
I have a cat.
I have a dog.
I have a canary.
grep "cat" text-1.txt
I have a cat.
For multiple files, also the filename will be shown in the output: let's add another textfile:
cat text-2.txt
I don't have a dog.
I don't have a cat.
I don't have a canary.
grep "cat" text-*.txt
text-1.txt:I have a cat.
text-2.txt:I don't have a cat.
grep can be extended to search for multiple patterns in files, using the -E switch. The patterns need to be separated using a pipe symbol:
grep -E "cat|dog" text-1.txt
I have a cat.
I have a dog.
Combining the previous two points (note that egrep is equivalent to grep -E):
egrep "cat|dog" text-*.txt
text-1.txt:I have a cat.
text-1.txt:I have a dog.
text-2.txt:I don't have a dog.
text-2.txt:I don't have a cat.
So, in order to redirect this to an output file, you can simply say:
egrep "cat|dog" text-*.txt >text-1-output.txt
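Note that the single command above collects the matches from all files into one output file, each line prefixed with its filename. If the prefix is unwanted, grep's -h option suppresses it (it is available in GNU and BSD grep; treat that as an assumption for other implementations). A small sketch on the two sample files, recreated in a scratch directory:

```bash
#!/usr/bin/env bash
# Sketch: multi-file grep without the "filename:" prefix.
tmp=$(mktemp -d)
printf '%s\n' "I have a cat." "I have a dog." "I have a canary." > "$tmp/text-1.txt"
printf '%s\n' "I don't have a dog." "I don't have a cat." "I don't have a canary." > "$tmp/text-2.txt"

matches=$(cd "$tmp" && grep -hE "cat|dog" text-*.txt)
printf '%s\n' "$matches"
```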
Assuming you're using bash.
Try this:
for i in $(seq 1 20) ;do rm -f text-${i}-output.txt ; grep -E "Int_dogs|Int_cats" text-${i}.txt >> text-${i}-output.txt ;done
Details
This one-line script does the following:
The original files are expected to follow this naming scheme:
text-<INTEGER_NUMBER>.txt - Example: text-1.txt, text-2.txt, ... text-100.txt.
The loop runs from 1 to <N>, where <N> is the number of files you want to process.
Warning: rm -f text-${i}-output.txt runs first and removes any existing output file, so that only a fresh output file exists at the end of the process.
grep -E "Int_dogs|Int_cats" text-${i}.txt matches either string in the original file, and >> text-${i}-output.txt redirects all matching lines to a newly created output file carrying the number of the original file. Example: for text-5.txt, the file text-5-output.txt is created and contains the matching lines (if any).
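One difference worth noting: a single grep -E pass keeps the matching lines in file order, whereas the original three commands put all Int_dogs lines before all Int_cats lines. If that ordering matters, here is a sketch that keeps it without the intermediate files. The sample data and the scratch directory are invented for the example:

```bash
#!/usr/bin/env bash
# Sketch: per-file output with all Int_dogs lines first, then all Int_cats lines.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' "Int_cats 1" "Int_dogs 2" "other 3" > text-1.txt
printf '%s\n' "Int_dogs 4" "Int_cats 5" > text-2.txt

for f in text-[0-9]*.txt; do
    [[ $f == *-output.txt ]] && continue      # skip outputs left by earlier runs
    {
        grep "Int_dogs" "$f"
        grep "Int_cats" "$f"
    } > "${f%.txt}-output.txt"
done
cat text-1-output.txt
```

Globbing over the existing files also removes the need to hard-code the number of files.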

BASH: File sorting according to file name

I need to sort 12000 files into 1000 groups according to their names, and create a new folder for each group containing the files of that group. Each filename is in a multi-column format (with _ as the separator), where the second column varies from 1 to 12 (the part number) and the last column ranges from 1 to 1000 (the system number), indicating that 1000 different systems (last column) were each split into 12 separate parts (second column).
Here is an example for a small subset of 3 systems divided into 12 parts, 36 files in total.
7000_01_lig_cne_1.dlg
7000_02_lig_cne_1.dlg
7000_03_lig_cne_1.dlg
...
7000_12_lig_cne_1.dlg
7000_01_lig_cne_2.dlg
7000_02_lig_cne_2.dlg
7000_03_lig_cne_2.dlg
...
7000_12_lig_cne_2.dlg
7000_01_lig_cne_3.dlg
7000_02_lig_cne_3.dlg
7000_03_lig_cne_3.dlg
...
7000_12_lig_cne_3.dlg
I need to group these files based on the second column of their names (01, 02, 03 .. 12), thus creating 1000 folders, each of which should contain the 12 files of one system, in the following manner:
Folder1, name: 7000_lig_cne_1, it contains 12 files: 7000_{this is from 01 to 12}_lig_cne_1.dlg
Folder2, name: 7000_lig_cne_2, it contains 12 files: 7000_{this is from 01 to 12}_lig_cne_2.dlg
...
Folder1000, name: 7000_lig_cne_1000, it contains 12 files: 7000_{this is from 01 to 12}_lig_cne_1000.dlg
Assuming that all *.dlg files are present within the same directory, I propose a bash loop workflow, which only lacks some sorting function (sed, awk?), organized in the following manner:
#set the name of folder with all DLG
home=$PWD
FILES=${home}/all_DLG/7000_CNE
# set the name of protein and ligand library to analyse
experiment="7000_CNE"
#name of the output
output=${home}/sub_folders_to_analyse
#now here all magic comes
rm -r ${output}
mkdir ${output}
# sed solution
for i in ${FILES}/*.dlg # define this better to suit your needs
do
n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
# move the file to proper dir
mkdir -p ${output}/"${experiment}_lig$n"
cp "$i" ${output}/"${experiment}_lig$n"
done
Note: here I set the beginning of each folder name to ${experiment}, to which I append the number from the final column, $n. Would it instead be possible to derive the name of each new folder automatically from the names of the copied files? Manually, it could be achieved by skipping the second column in the folder name:
cp ./all_DLG/7000_*_lig_cne_987.dlg ./output/7000_lig_cne_987
Iterate over files. Extract the destination directory name from the filename. Move the file.
for i in *.dlg; do
# extract last number with your favorite tool
n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
# move the file to proper dir
echo mkdir -p "folder$n"
echo mv "$i" "folder$n"
done
Notes:
Do not use upper case variables in your scripts. Use lower case variables.
Remember to quote variables expansions.
Check your scripts with http://shellcheck.net
Tested on repl
update: for OP's foldernaming convention:
for i in *.dlg; do
base=${i%.dlg} # strip the extension so it does not end up in the folder name
foldername="$HOME/output/${base%%_*}_${base#*_*_}"
echo mkdir -p "$foldername"
echo mv "$i" "$foldername"
done
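How the two expansions used for the folder name split a filename can be checked on a single name. A sketch with one sample filename; the .dlg suffix is stripped first here so that it does not become part of the folder name:

```bash
#!/usr/bin/env bash
# Sketch: build the folder name 7000_lig_cne_987 from one sample filename.
i=7000_01_lig_cne_987.dlg
base=${i%.dlg}          # 7000_01_lig_cne_987  (extension dropped)
first=${base%%_*}       # 7000                 (everything from the first "_" removed)
rest=${base#*_*_}       # lig_cne_987          (everything up to the second "_" removed)
foldername="${first}_${rest}"
echo "$foldername"      # 7000_lig_cne_987
```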
This might work for you (GNU parallel):
ls *.dlg |
parallel --dry-run 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d'
Pipe the output of ls command listing files ending in .dlg to parallel, which creates directories and moves the files to them.
Run the solution as is, and when satisfied the output of the dry run is ok, remove the option --dry-run.
The solution could be one instruction:
parallel 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d' ::: *.dlg
Using POSIX shell's built-in grammar only and sort:
#!/usr/bin/env sh
curdir=
# Create list of files with newline
# Safe since we know there is no special
# characters in name
printf -- %s\\n *.dlg |
# Sort the list by 5th key with _ as field delimiter
sort -t_ -k5 |
# Iterate reading the _ delimited fields of the sorted list
while IFS=_ read -r _ _ c d e; do
# Compose the new directory name
newdir="${c}_${d}_${e%.dlg}"
# If we enter a new group / directory
if [ "$curdir" != "$newdir" ]; then
# Make the new directory current
curdir="$newdir"
# Create the new directory
echo mkdir -p "$curdir"
# Move all its files into it
echo mv -- *_"$curdir.dlg" "$curdir/"
fi
done
Optionally as a sort and xargs arguments stream:
printf -- %s\\n *.dlg |
sort -u -t_ -k5 |
xargs -n1 sh -c '
d="lig_cne_${0##*_}"
d="${d%.dlg}"
echo mkdir -p "$d"
echo mv -- *"_$d.dlg" "$d/"
'
Here is a very simple awk script that does the trick in a single sweep.
script.awk
BEGIN{FS="[_.]"} # make field separator "_" or "."
{ # for each filename
dirName=$1"_"$3"_"$4"_"$5; # compute the target dir name from fields
sysCmd = "mkdir -p " dirName"; cp "$0 " "dirName; # prepare bash command
system(sysCmd); # run bash command
}
running script.awk
ls -1 *.dlg | awk -f script.awk
oneliner awk script
ls -1 *.dlg | awk 'BEGIN{FS="[_.]"}{d=$1"_"$3"_"$4"_"$5;system("mkdir -p "d"; cp "$0 " "d);}'

awk to add extracted prefix from file to filename

The awk below executes as is, but it rewrites fields inside each file that matches $p (which is extracted from each text file's name) instead of adding $x, the prefix (from $1 of rename), to each filename in the directory. Each $x is followed by an _ and then the filename. I can see from the echo $p that the correct value for the lookup in $2 is extracted, but each filename in the directory is unchanged. Not every entry in rename will have a file in the directory, but every file will always have a match for $p. Maybe there is a better way, as I am not sure what I am doing wrong. Thank you :).
rename (tab-delimited)
00-0000 File-01
00-0001 File-02
00-0002 File-03
00-0003 File-04
file1
File-01_xxxx.txt
file2
File-02_yyyy.txt
desired output
00-0000_File-01-xxxx.txt
00-0001_File-02-yyyy.txt
bash
for file1 in /path/to/folders/*.txt
do
# Grab file prefix
bname=`basename $file1` # strip of path
p="$(echo $bname|cut -d_ -f1,1)" # remove after second underscore
echo $p
# add prefix to matching file
awk -v var="$p" '$2~var{x=$1}(NR=x){print $x"_",$bname}' $file1 rename OFS="\t" > tmp && mv tmp $file1
done
This script :
touch File-01-azer.txt
touch File-02-ytrf.txt
touch File-03-fdfd.txt
touch File-04-dfrd.txt
while read p f;
do
f=$(ls $f*)
mv ${f} "${p}_${f}"
done << EEE
00-0000 File-01
00-0001 File-02
00-0002 File-03
00-0003 File-04
EEE
ls -1
outputs :
00-0000_File-01-azer.txt
00-0001_File-02-ytrf.txt
00-0002_File-03-fdfd.txt
00-0003_File-04-dfrd.txt
You can also read the map from a file, using done < rename_map.txt or cat rename_map.txt | while ...
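A sketch of the file-driven variant, assuming the map is stored in a tab-delimited file called rename_map.txt (the name used above), that each File-NN prefix matches at most one file, and that names contain no newlines. The scratch directory and sample files exist only for the demonstration:

```bash
#!/usr/bin/env bash
# Sketch: prefix each file with the ID that rename_map.txt assigns to it.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch File-01_xxxx.txt File-02_yyyy.txt
printf '%s\t%s\n' 00-0000 File-01 00-0001 File-02 00-0002 File-03 > rename_map.txt

shopt -s nullglob                              # unmatched globs expand to nothing
while IFS=$'\t' read -r prefix name; do
    matches=( "$name"_*.txt )                  # files starting with e.g. "File-01_"
    [[ ${#matches[@]} -eq 1 ]] || continue     # skip entries with no (or ambiguous) match
    mv -- "${matches[0]}" "${prefix}_${matches[0]}"
done < rename_map.txt
ls -1
```

Map entries without a corresponding file, such as File-03 here, are skipped silently.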

Shell Script to fill the target defined in XML file

Suppose there is a file, file.txt, containing the following text:
ABC
EFG
XYZ
In another XML file, an empty-bodied target named compile is defined.
<project>
<compile>
.
.
.
start //from here till EOF
shell
script
xyz
</compile>
</project>
I need a shell script that fills in the content inside the defined target. After executing the script, the file should look as shown in the output below. This should be done for every line in file.txt.
Output:-
<!-- ...preceding portions of input document... -->
<project>
<compile>
componentName="ABC"
componentName="EFG"
componentName="XYZ"
start
shell
script
xyz
</compile>
</project>
<!-- ...remaining portions of input document... -->
Use a proper XML parser. XMLStarlet is one tool fit for the job:
#!/bin/bash
# ^^^^- important, not /bin/sh
# read input file into an array
IFS=$'\n' read -r -d '' -a pieces <file.txt
# assemble target text based on expanding that array
printf -v text 'componentName=%s\n' "${pieces[@]}"
# Read input, changing all elements named "compile" in the default namespace
# ...to contain our target text.
xmlstarlet ed -u '//compile' -v "$text" <in.xml >out.xml
You can do what you are attempting (to some degree) with sed and a while read -r loop. For example, you can fill a temporary file with the contents of your xml file from line 1 to the <targettag> with
sed -n "1, /^${ttag}$/p" "$xfn" > "$ofn" ## fill output to ttag
(where xfn is your xml file name and ofn is your output file name)
You can then read all values from your text file and prepend componentName=" and append " with:
while read -r line; do ## read each line in ifn and concatenate
printf "%s%s\"\n" "$cmptag" "$line" >> "$ofn"
done <"$ifn"
(where ifn is your input file name)
And finally, you can write the closing tag to end of your xml file to your output file with:
sed -n "/^${ttag/</<[\/]}$/, \${p}" "$xfn" >> "$ofn"
(using parameter expansion with substring replacement to add the closing '/' to the beginning of <targettag>).
Putting it all together, you could do something like:
#!/bin/bash
ifn="f1"
xfn="f2.xml"
ofn="f3.xml"
ttag="${1:-<targettag>}" ## set target tag
cmptag="componentName=\"" ## set string to prepend
sed -n "1, /^${ttag}$/p" "$xfn" > "$ofn" ## fill output to ttag
while read -r line; do ## read each line in ifn and concatenate
printf "%s%s\"\n" "$cmptag" "$line" >> "$ofn"
done <"$ifn"
## fill output from closing tag to end
sed -n "/^${ttag/</<[\/]}$/, \${p}" "$xfn" >> "$ofn"
Input Files
$ cat f1
ABC
EFG
XYZ
$ cat f2.xml
<someschema>
<targettag>
</targettag>
</someschema>
Example Use/Output
$ fillxml.sh
$ cat f3.xml
<someschema>
<targettag>
componentName="ABC"
componentName="EFG"
componentName="XYZ"
</targettag>
</someschema>
(you can adjust the indentation to fit your needs)
Addition After Changes to Question
The changes needed to handle writing from start to end after adding the componentName="..." tags are simple. However, the commonality of the word start exemplifies why the answer by Charles encourages you to use an XML tool rather than a simple script. Why? If the word 'start' occurs anywhere else in your .xml file before your intended start, the script will fail by writing from the first occurrence of start to the end.
That said, if this is a simple one-off conversion and start doesn't occur otherwise, then the changes to the script to accomplish your desired output are easy:
#!/bin/bash
ifn="f1"
xfn="another.xml"
ofn="f3.xml"
ttag="${1:-<compile>}" ## set target tag
cmptag="componentName=\"" ## set string to prepend
sed -n "1, /^${ttag}$/p" "$xfn" > "$ofn" ## fill output to ttag
## read each line in ifn and concatenate
while read -r line || [ -n "$line" ]; do
printf "%s%s\"\n" "$cmptag" "$line" >> "$ofn"
done <"$ifn"
## fill output from 'start' to end
sed -n "/^start/, \${p}" "$xfn" >> "$ofn"
Input Files
$ cat f1
ABC
EFG
XYZ
$ cat another.xml
<project>
<compile>
start
shell
script
xyz
</compile>
</project>
Example Use/Output
$ cat f3.xml
<project>
<compile>
componentName="ABC"
componentName="EFG"
componentName="XYZ"
start
shell
script
xyz
</compile>
</project>
Look it over and let me know if you have questions.
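The same fill-in can also be done in a single awk pass that copies the XML through and emits the componentName lines right after the target tag, which avoids depending on the word start. This is a sketch under the same assumptions as the scripts above (the tag sits alone on its line, and f1 is not empty); the scratch directory and the shortened sample XML are illustrative only:

```bash
#!/usr/bin/env bash
# Sketch: insert componentName="..." lines directly after the <compile> line.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' ABC EFG XYZ > f1
printf '%s\n' '<project>' '<compile>' 'start' 'shell' '</compile>' '</project>' > another.xml

awk -v tag='<compile>' '
    NR == FNR { comp[++n] = $0; next }         # first file: remember each component name
    { print }                                  # second file: copy every line through
    {
        line = $0
        gsub(/^[ \t]+|[ \t]+$/, "", line)      # ignore indentation around the tag
        if (line == tag)
            for (i = 1; i <= n; i++)
                printf "componentName=\"%s\"\n", comp[i]
    }
' f1 another.xml > f3.xml
cat f3.xml
```

Like the sed version, this treats the XML as plain text, so the earlier advice to prefer an XML tool for anything beyond a one-off conversion still applies.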
