Shell Script Retokenize Property Values to Keys For All Files In a Directory - bash

Previously, I wrote a small shell script to "retokenize" a file (useful for sanity-check comparisons). I now need to do something similar for a folder instead of just one file.
I'm curious whether there is an easy way to rework the following into a method/function, and how to recursively pass all files in a folder to it, so that the end result is that every file in the folder is "retokenized". I've been doing some googling and playing around, but want to see if anyone here has a quick, easy, clean solution.
Working version for one file:
#!/bin/bash
date
outputDump="output.txt"
prodPropsFile="input.properties"
prodPropsSortedFile="sorted.properties"
tempPropsFile="temp.properties"
echo "Removing comments and empty lines from prod properties file"
sed '/^#/d' < $prodPropsFile > $tempPropsFile
sed '/^[[:space:]]*$/d' < $tempPropsFile > $prodPropsSortedFile
cp $prodPropsSortedFile $tempPropsFile
echo "Sorting prod properties by value length. So don't do double tokenization"
awk -F"=" '{ st = index($0,"="); print length(substr($0,st+1)),$0 }' $tempPropsFile | sort -rn | cut -d" " -f2- > $prodPropsSortedFile
echo "Retokenizing."
while IFS== read -r k v
do
# Sed-escape /, \, and &. Needed for URLs like JDBC connection strings, etc.
escapedV=$(echo "$v" | sed -e 's/\\/\\\\/g; s/\//\\\//g; s/&/\\\&/g')
# The /gI flags replace the value globally and case-insensitively; this is important in case someone writes "http://..." versus "HTTP://...".
sed -i -- "s/$escapedV/$k/gI" $outputDump
done < "$prodPropsSortedFile"
Example property file:
%%token1%%=value1
%%token2%%=value2
Example input file:
This is a file that has value1 and value2.
Example output file:
This is a file that has %%token1%% and %%token2%%.

Updated Script that Works for All Files in a Folder on my Mac:
#!/bin/bash
date
retokenize()
{
file="$1"  # use the path passed as the first argument
echo "Retokenizing $file"
while IFS== read -r k v
do
# Sed-escape /, \, and &. Needed for URLs like JDBC connection strings, etc.
escapedV=$(echo "$v" | sed -e 's/\\/\\\\/g; s/\//\\\//g; s/&/\\\&/g')
sed -i '' "s/$escapedV/$k/g" "$file"
done < "$prodPropsSortedFile"
}
# Folder of files we will modify. Work on a copy of the exports so the originals aren't affected.
inputDump="IIQExports"
prodPropsFile="input.properties"
prodPropsSortedFile="sorted.properties"
tempPropsFile="temp.properties"
echo "Removing comments and empty lines from prod properties file"
sed '/^#/d' < $prodPropsFile > $tempPropsFile
sed '/^[[:space:]]*$/d' < $tempPropsFile > $prodPropsSortedFile
cp $prodPropsSortedFile $tempPropsFile
echo "Sorting prod properties by length."
awk -F"=" '{ st = index($0,"="); print length(substr($0,st+1)),$0 }' $tempPropsFile | sort -rn | cut -d" " -f2- > $prodPropsSortedFile
echo "Retokenizing."
find ./$inputDump/ -type f > foo.txt
IFS=$'\n';for file in $(cat foo.txt);
do
retokenize "$file";
done
echo "Done."
date
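
For what it's worth, the foo.txt temp file and the IFS tweak can be dropped by letting find emit NUL-delimited paths. This variant of the loop (same retokenize function and setup as above) also survives filenames containing spaces or newlines:
# -print0 / read -d '' pass each path NUL-delimited, so no word splitting occurs.
find "./$inputDump/" -type f -print0 |
while IFS= read -r -d '' file
do
    retokenize "$file"
done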

Related

Why is this bash loop failing to concatenate the files?

I am at my wit's end as to why this loop is failing to concatenate the files the way I need. Basically, let's say we have the following files:
AB124661.lane3.R1.fastq.gz
AB124661.lane4.R1.fastq.gz
AB124661.lane3.R2.fastq.gz
AB124661.lane4.R2.fastq.gz
What we want is:
cat AB124661.lane3.R1.fastq.gz AB124661.lane4.R1.fastq.gz > AB124661.R1.fastq.gz
cat AB124661.lane3.R2.fastq.gz AB124661.lane4.R2.fastq.gz > AB124661.R2.fastq.gz
What I tried (and didn't work):
Create and save file names (AB124661) to a ID file:
ls -1 *R1*.gz | awk -F '.' '{print $1}' | sort | uniq > ID
This creates an ID file that stores the samples/files name.
Run the following loop:
for i in `cat ./ID`; do cat $i\.lane3.R1.fastq.gz $i\.lane4.R1.fastq.gz \> out/$i\.R1.fastq.gz; done
for i in `cat ./ID`; do cat $i\.lane3.R2.fastq.gz $i\.lane4.R2.fastq.gz \> out/$i\.R2.fastq.gz; done
The loop fails and concatenates into empty files.
Things I tried:
Yes, the ID file is definitely in the folder
When I run it with echo, it shows the correct cat command
Why are you escaping the \> ? That's going to result in cat: '>': No such file or directory instead of a redirection.
Don't read lines with for
while IFS= read -r id; do
cat "${id}.lane3.R1.fastq.gz" "${id}.lane4.R1.fastq.gz" > "out/${id}.R1.fastq.gz"
cat "${id}.lane3.R2.fastq.gz" "${id}.lane4.R2.fastq.gz" > "out/${id}.R2.fastq.gz"
done < ./ID
Let's say you have one id per line stored in the file ./ID:
while read -r line; do
cat "$line".lane3.R1.fastq.gz "$line".lane4.R1.fastq.gz > "$line".R1.fastq.gz
cat "$line".lane3.R2.fastq.gz "$line".lane4.R2.fastq.gz > "$line".R2.fastq.gz
done < ./ID
A pure shell solution could look like this:
for file in *.fastq.gz; do
id=${file%%.*}
[ -e "$id".R1.fastq.gz ] || cat "$id".*.R1.fastq.gz > "$id".R1.fastq.gz
[ -e "$id".R2.fastq.gz ] || cat "$id".*.R2.fastq.gz > "$id".R2.fastq.gz
done
Alternatively:
printf '%s\n' *.fastq.gz | cut -d. -f1 | sort -u |
while IFS= read -r id; do
cat "$id".*.R1.fastq.gz > "$id".R1.fastq.gz
cat "$id".*.R2.fastq.gz > "$id".R2.fastq.gz
done
This solution assumes filenames of interest don't contain newline characters.
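If newlines in filenames are a real concern, here is a rough glob-based variant of the same idea that never pipes a filename anywhere; it uses an associative array to dedupe the ids (assumes bash 4+ and an existing out/ directory):
declare -A seen
for file in *.fastq.gz; do
    id=${file%%.*}
    [[ ${seen[$id]} ]] && continue    # this sample id was already handled
    seen[$id]=1
    cat "$id".*.R1.fastq.gz > "out/$id.R1.fastq.gz"
    cat "$id".*.R2.fastq.gz > "out/$id.R2.fastq.gz"
done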

Extract a line from a text file using grep?

I have a text file called log.txt, and it logs the file name and the path it came from, so something like this:
2.txt
/home/test/etc/2.txt
basically the file name and its previous location. I want to use grep to grab the file's directory, save it as a variable, and move the file back to its original location.
for var in "$#"
do
if grep "$var" log.txt
then
# code if found
else
# code if not found
fi
This just prints 2.txt and its directory to the console, since the directory path contains 2.txt. Thanks.
Maybe flip the logic to make it more efficient?
f=''
while read -r prev
do case "$prev" in
*/*) [[ -e "$f" ]] && mv "$f" "$prev";; # path line: move the remembered file back
*) f="$prev";;                          # bare name line: remember the name
esac
done < log.txt
That walks through all the files in the log and, if they exist locally, moves them back. It should be functionally the same, without a grep per file.
If the name is always the same as the last component of the path, why save it in the log at all?
If it is, then:
while read prev
do f="${prev##*/}" # strip the path info
[[ -e "$f" ]] && mv "$f" "$prev"
done < <( grep / log.txt )
Having the file names on the same line would significantly simplify your script. But maybe try something like
# Convert from command-line arguments to lines
printf '%s\n' "$#" |
# Pair up with entries in file
awk 'NR==FNR { f[$0]; next }
FNR%2 { if ($0 in f) p=$0; else p=""; next }
p { print "mv \"" p "\" \"" $0 "\"" }' - log.txt |
sh
Test it by replacing sh with cat and see what you get. If it looks correct, switch back.
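With the example log above and 2.txt as the script's argument, the cat dry run would print:
mv "2.txt" "/home/test/etc/2.txt"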
Briefly, something similar could perhaps be pulled off with printf '%s\n' "$@" | grep -A 1 -Fxf - log.txt but you end up having to parse the output to pair up the output lines anyway.
Another solution:
for f in `grep -v "/" log.txt`; do
grep "/$f" log.txt | xargs -I{} cp $f {}
done
grep -q (for "quiet") stops the output

How to remove a filename from a list of paths in Shell

I would like to remove only the file name from each path in the following configuration file.
Configuration File -- test.conf
knowledgebase/arun/test.rf
knowledgebase/arunraj/tester/test.drl
knowledgebase/arunraj2/arun/test/tester.drl
The above file should be read, and the results with file names removed should go to another file called output.txt.
Following is my attempt. It is not working at all; I am getting only empty files.
#!/bin/bash
file=test.conf
while IFS= read -r line
do
# grep --exclude=*.drl line
# awk 'BEGIN {getline line ; gsub("*.drl","", line) ; print line}'
# awk '{ gsub("/",".drl",$NF); print line }' arun.conf
# awk 'NF{NF--};1' line arun.conf
echo $line | rev | cut -d'/' -f 1 | rev >> output.txt
done < "$file"
Expected Output :
knowledgebase/arun
knowledgebase/arunraj/tester
knowledgebase/arunraj2/arun/test
There's the dirname command to make it easy and reliable:
#!/bin/bash
file=test.conf
while IFS= read -r line
do
dirname "$line"
done < "$file" > output.txt
There are Bash shell parameter expansions that will work OK with the list of names given but won't work reliably for some names:
file=test.conf
while IFS= read -r line
do
echo "${line%/*}"
done < "$file" > output.txt
There's sed to do the job — easily with the given set of names:
sed 's%/[^/]*$%%' test.conf > output.txt
It's harder if you have to deal with names like /plain.file (or plain.file — the same sorts of edge cases that trip up the shell expansion).
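If those edge cases matter, a small parameter-expansion helper can mimic dirname's conventions; the function name here is mine, purely for illustration:
dir_of() {
    local d
    case $1 in
        */*) d=${1%/*}
             printf '%s\n' "${d:-/}" ;; # "/plain.file" leaves "", which dirname reports as "/"
        *)   printf '.\n' ;;            # no slash at all: dirname says "."
    esac
}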
You could add Perl, Python, Awk variants to the list of ways of doing the job.
You can get the path like this:
path=${fullpath%/*}
It cuts away everything from the last / onward.
Using an awk one-liner you can do this (with FS and OFS set to /, NF-- drops the last field and the trailing 1 prints the rebuilt record):
awk 'BEGIN{FS=OFS="/"} {NF--} 1' test.conf
Output:
knowledgebase/arun
knowledgebase/arunraj/tester
knowledgebase/arunraj2/arun/test

shell script - trying not to use tmp files

How can I do the following without tmp1 and tmp2?
(The information files themselves are fine as they are.)
cat information_file1 | sed -e 's/\,/\ /g' >> tmp1
echo Messi >> tmp2
cat tmp1 | grep Ronaldo | cut -d"=" -f2- >> tmp2
rm tmp1
cat information_file2 | fin_func tmp2
rm tmp2
Here is fin_func, for your insight. (It's not really the function, and I don't want to change it; it's just so you can see how I use tmp2 and information_file2.)
while read -a line; do
if [[ "`grep $line $1`" != "" ]]; then
echo 1
fi
done
This should work, although it's pretty incomprehensible:
cat information_file2 | fin_func <(cat <(echo Messi) <(cat information_file1 | \
sed -e 's/\,/\ /g' | grep Ronaldo | cut -d"=" -f2-))
The <( … ) syntax is Bash's process substitution, which expands to the name of a /dev/fd file descriptor that the command's output is written to.
The sample fin_func reads through the file given as a command argument multiple times, so unless we are allowed to modify that function at least one temporary file will be necessary. The sample fin_func given in the question can be easily modified so that it does not read the file multiple times, but since you indicate that this is not the real script I will assume it cannot be modified and must take a file as an argument. That said, I would write your script as:
trap 'rm -f $TMPFILE' 0 # in bash, just trapping on 0 will work for SIGINT, etc
TMPFILE=$( mktemp )
{ echo Messi
tr , ' ' < information_file1 |
awk -F= '/Ronaldo/{print $2}' ; } > $TMPFILE
< information_file2 fin_func $TMPFILE
I strongly suspect that fin_func could be rewritten so that it does not require a regular file as input. Also, there's no need for the tr, as you could gsub in awk only on matching lines and save a bit of processing, but that is probably a trivial optimization. However, using tr instead of sed is aesthetically necessary.
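For illustration only, here is one way the sample fin_func could be rewritten to read its word file just once (this rewrite is mine, assuming bash 4+ for mapfile); the data could then even arrive via process substitution instead of a regular file:
fin_func() {
    local -a patterns
    mapfile -t patterns < "$1"        # read the word file exactly once
    local first rest p
    while read -r first rest; do      # first word of each stdin line, as read -a did
        for p in "${patterns[@]}"; do
            # substring test standing in for the original grep "$first" "$1"
            if [[ $p == *"$first"* ]]; then
                echo 1
                break
            fi
        done
    done
}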

awk parse filename and add result to the end of each line

I have number of files which have similar names like
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out
etc.
I need to get the number before .csv (1 or 2) from the file name and put it at the end of every line in the file, with a TAB separator.
I have written this code; it finds the number that I need, but I do not know how to put this number into the file. There is a space in the filename, and my script breaks because of it.
Also, I am not sure how to send the script a list of files. For now I am working with only one file.
My code:
#!/bin/sh
string="DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out"
out=$(echo $string | awk 'BEGIN {FS="_"};{print substr ($7,0,1)}')
awk ' { print $0"\t$out" } ' $string
for file in *
do
sfx=$(echo "$file" | sed 's/.*_\(.*\).csv.*/\1/')
sed -i "s/$/\t$sfx/" "$file"
done
Using sed:
$ sed 's/.*_\(.*\).csv.*/&\t\1/' file
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out 1
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out 2
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out 1
To make this for many files:
sed 's/.*_\(.*\).csv.*/&\t\1/' file1 file2 file3
OR
sed 's/.*_\(.*\).csv.*/&\t\1/' file*
To make this change get saved in the same file (if you have GNU sed):
sed -i 's/.*_\(.*\).csv.*/&\t\1/' file
Untested, but this should do what you want (extract the number before .csv and append that number to the end of every line in the .out file)
awk 'FNR==1 { split(FILENAME, field, /[_.]/) }
{ print $0"\t"field[7] > FILENAME"_aaaa" }' *.out
for file in *_aaaa; do mv "$file" "${file/_aaaa}"; done
If I understood correctly, you want to append the number from the filename to every line in that file - this should do it:
#!/bin/bash
while [[ 0 < $# ]]; do
num=$(echo "$1" | sed -r 's/.*_([0-9]+).csv.*/\t\1/' )
#awk -e "{ print \$0\"\t${num}\"; }" < "$1" > "$1.new"
#sed -r "s/$/\t$num/" < "$1" > "$1.mew"
#sed -ri "s/$/\t$num/" "$1"
shift
done
Run the script and give it the names of the files you want to process. $# is the number of command-line arguments to the script; it is decremented at the end of the loop by shift, which drops the first argument and shifts the rest down. The script extracts the number from the filename; then pick one of the three commented lines to do the appending: awk gives you more flexibility, the first sed creates new files, and the second sed processes them in place (if you are running GNU sed, that is).
Instead of awk, you may want to go with sed or coreutils.
Grab number from filename, with grep for variety:
num=$(<<<filename grep -Eo '[^_]+\.csv' | cut -d. -f1)
<<<filename is equivalent to echo filename.
With sed
Append num to each line with GNU sed:
sed "s/\$/\t$num" filename
Use the -i switch to modify filename in-place.
With paste
You also need to know the length of the file for this method:
len=$(<filename wc -l)
Combine filename and num with paste:
paste filename <(seq $len | while read; do echo $num; done)
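As an aside, yes(1) can generate that repeated column without the loop:
paste filename <(yes "$num" | head -n "$len")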
Complete example
for filename in DWH_Export*; do
num=$(echo "$filename" | grep -Eo '[^_]+\.csv' | cut -d. -f1)
sed -i "s/\$/\t$num/" "$filename"
done
