shell script - trying not to use tmp files - bash

How can I do the following without tmp1 and tmp2? (The information files themselves are fine to keep.)
cat information_file1 | sed -e 's/\,/\ /g' >> tmp1
echo Messi >> tmp2
cat tmp1 | grep Ronaldo | cut -d"=" -f2- >> tmp2
rm tmp1
cat information_file2 | fin_func tmp2
rm tmp2
Here is fin_func, for your insight. (It's not really the actual function and I don't want to change it; it's just so you can see how I use tmp2 and info_file2.)
while read -a line; do
    if [[ "`grep $line $1`" != "" ]]; then
        echo 1
    fi
done

This should work, although it’s pretty incomprehensible:
cat information_file2 | fin_func <(cat <(echo Messi) <(cat information_file1 | \
sed -e 's/\,/\ /g' | grep Ronaldo | cut -d"=" -f2-))
The <( … ) syntax is Bash’s process substitution: it expands to the name of a /dev/fd file descriptor to which the command’s output is written, so the consumer can read it like a file.
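For example, a small stand-alone illustration (not part of the script above); on most systems the substitution shows up as a /dev/fd path, though the exact number varies:
$ echo <(echo hello)
/dev/fd/63
$ cat <(echo hello)
hello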

The sample fin_func reads through the file given as a command argument multiple times, so unless we are allowed to modify that function at least one temporary file will be necessary. The sample fin_func given in the question can be easily modified so that it does not read the file multiple times, but since you indicate that this is not the real script I will assume it cannot be modified and must take a file as an argument. That said, I would write your script as:
trap 'rm -f $TMPFILE' 0 # in bash, just trapping on 0 will work for SIGINT, etc
TMPFILE=$( mktemp )
{ echo Messi
  tr , ' ' < information_file1 |
    awk -F= '/Ronaldo/{print $2}' ; } > $TMPFILE
< information_file2 fin_func $TMPFILE
I strongly suspect that fin_func could be rewritten so that it does not require a regular file as input. Also, there's no need for the tr: you could gsub in awk on matching lines only and save a bit of processing, but that is probably a trivial optimization. Using tr instead of sed here is mostly a matter of aesthetics.
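For what it's worth, here is a minimal sketch of that idea, assuming the real fin_func has the same overall shape as the sample (this rewrite is hypothetical, not the original function): once the pattern list is slurped into memory in a single pass, the argument no longer has to be a seekable regular file, so a process-substitution pipe is enough and no temporary file is needed at all.
fin_func() {
    local patterns
    patterns=$(cat -- "$1")    # read the list exactly once, so "$1" may be a /dev/fd pipe
    while read -a line; do
        # "$line" is the first word of the array, as in the original loop
        if grep -q -- "$line" <<< "$patterns"; then
            echo 1
        fi
    done
}
# usage: no temporary file required any more
< information_file2 fin_func <( echo Messi
                                tr , ' ' < information_file1 | awk -F= '/Ronaldo/{print $2}' )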

Related

GNU parallel with custom script doing string comparison

The following script.sh compares part of a string (coming from stdin by cat'ing a CSV file) to a defined string and reports the differences in a certain format:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
while read line; do
    line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
    output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
    echo "$(echo ${line:0:35}, $output)"
done < "${1:-/dev/stdin}"
It is intended to be executed on a number of rows from a very large file in the format
XYZ,ABMDEFG
and it works well when I use it in a pipe:
cat large_file | ./find_something.sh
However, when I try to use it with parallel, I get this error:
$ cat large_file | parallel ./find_something.sh
./find_something.sh: line 9: XYZ, ABMDEFG : No such file or directory
What is causing this? Is parallel supposed to work for something like this, if I want to redirect the output to a single file afterwards?
Less important side note: I'm rather proud of my string comparison method, but if someone has a faster way to get from ABCDEFG and XYZ,ABMDEFG to XYZ,C3M, I'd be happy to hear that, too.
Edit:
I should have said, I also want to preserve the order of each line in the output, corresponding to the input. Is that possible using parallel?
Your script accepts its input from a file (defaulting to stdin), whereas parallel will pass input as arguments, not via stdin. In that sense, parallel is closer to xargs.
Presumably, you want each of the lines in large_file to be processed as a unit, possibly in parallel.
That means you need your script to only process one such line at a time, and let parallel call your script many times, once for each line.
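A quick toy illustration of that hand-off (assuming GNU parallel is installed; {} is parallel's placeholder for the current argument):
$ printf 'one\ntwo\n' | parallel echo "argument: {}"
argument: one
argument: two
(With jobs this small the output comes back in input order; in general you need -k for that, as noted below.) Without an explicit {}, parallel appends the argument to the end of the command, so cat large_file | parallel ./find_something.sh effectively runs ./find_something.sh 'XYZ,ABMDEFG', and the script's done < "${1:-/dev/stdin}" then tries to open that argument as a file, which is exactly the "No such file or directory" error above.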
So your script should look like this:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
line="$1"
line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
echo "$(echo ${line:0:35}, $output)"
Then you can redirect to a file as follows:
cat large_file | parallel ./find_something.sh > output_file
parallel's -k (--keep-order) option keeps the output in the same order as the input.
#!/usr/bin/env bash
doit() {
    reference="ABCDEFG"
    ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
    while read line; do
        line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
        output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
        echo "$(echo ${line:0:35}, $output)"
    done
}
export -f doit
cat large_file | parallel --pipe -k doit
#or
parallel --pipepart -a large_file --block -10 -k doit

Shell Script Retokenize Property Values to Keys For All Files In a Directory

Previously, I wrote a small shell script to "retokenize" a file (useful for comparing for sanity checks). I'm currently in the need of doing something similar for a folder instead of just one file.
I'm curious whether there is an easy way to rework the following into something like a method / function, and how to recursively pass all files in a folder to that method, so that the end result is that all files in the folder are "retokenized". I've been doing some googling and playing around, but I want to see if anyone here has a quick / easy / clean solution.
Working version for one file:
#!/bin/bash
date
outputDump="output.txt"
prodPropsFile="input.properties"
prodPropsSortedFile="sorted.properties"
tempPropsFile="temp.properties"
echo "Removing comments and empty lines from prod properties file"
sed '/^#/d' < $prodPropsFile > $tempPropsFile
sed '/^[[:space:]]*$/d' < $tempPropsFile > $prodPropsSortedFile
cp $prodPropsSortedFile $tempPropsFile
echo "Sorting prod properties by value length. So don't do double tokenization"
awk -F"=" '{ st = index($0,"="); print length(substr($0,st+1)),$0 }' $tempPropsFile | sort -rn | cut -d" " -f2- > $prodPropsSortedFile
echo "Retokenizing."
while IFS== read k v;
do
    # Sed escape /, \, and &. Needed for urls like jdbc connections, etc.
    escapedV=$(echo $v | sed -e 's/\\/\\\\/g; s/\//\\\//g; s/&/\\\&/g')
    # The /gI flags replace the tokens globally and case-insensitively; this matters in case someone writes "http://..." versus "HTTP://...".
    sed -i -- "s/$escapedV/$k/gI" $outputDump;
done < "$prodPropsSortedFile"
Example property file:
%%token1%%=value1
%%token2%%=value2
Example input file:
This is a file that has value1 and value2.
Example output file:
This is a file that has %%token1%% and %%token2%%.
Updated Script that Works for All Files in a Folder on my Mac:
#!/bin/bash
date
retokenize()
{
    file="$1"    # the file to process, passed in as the first argument
    echo "Retokenizing $file"
    while IFS== read k v;
    do
        # Sed escape /, \, and &. Needed for urls like jdbc connections, etc.
        escapedV=$(echo $v | sed -e 's/\\/\\\\/g; s/\//\\\//g; s/&/\\\&/g')
        sed -i '' "s/$escapedV/$k/g" "$file";
    done < "$prodPropsSortedFile"
}
# Copy our input to an output file that we will modify, so we don't affect the original.
inputDump="IIQExports"
prodPropsFile="input.properties"
prodPropsSortedFile="sorted.properties"
tempPropsFile="temp.properties"
echo "Removing comments and empty lines from prod properties file"
sed '/^#/d' < $prodPropsFile > $tempPropsFile
sed '/^[[:space:]]*$/d' < $tempPropsFile > $prodPropsSortedFile
cp $prodPropsSortedFile $tempPropsFile
echo "Sorting prod properties by length."
awk -F"=" '{ st = index($0,"="); print length(substr($0,st+1)),$0 }' $tempPropsFile | sort -rn | cut -d" " -f2- > $prodPropsSortedFile
echo "Retokenizing."
find ./$inputDump/ -type f > foo.txt
IFS=$'\n'; for file in $(cat foo.txt);
do
    retokenize "$file";
done
echo "Done."
date
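As a side note, the foo.txt indirection and the IFS tweak can be avoided. Here is a sketch of a more defensive final loop (not part of the original script; it assumes the retokenize function above): NUL-delimited names from find cope with spaces and even newlines in file names, and BSD/macOS find supports -print0 as well.
find "./$inputDump/" -type f -print0 |
while IFS= read -r -d '' file; do
    retokenize "$file"
done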

Unix file pattern issue: append changing value of variable pattern to copies of matching line

I have a file with contents:
abc|r=1,f=2,c=2
abc|r=1,f=2,c=2;r=3,f=4,c=8
I want a result like below:
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|3
The third column value is the r value. A new output line is produced for each occurrence of r.
I have tried with:
for i in `cat $xxxx.txt`
do
#echo $i
live=$(echo $i | awk -F " " '{print $1}')
home=$(echo $i | awk -F " " '{print $2}')
echo $live
done
but it is not working properly. I am a beginner with sed/awk and am not sure how I can use them. Can someone please help with this?
awk to the rescue!
$ awk -F'[,;|]' '{c=0;
      for(i=2;i<=NF;i++)
        if(match($i,/^r=/)) a[c++]=substr($i,RSTART+2);
      delim=substr($0,length($0))=="|"?"":"|";
      for(i=0;i<c;i++) print $0 delim a[i]}' file
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|3
Use an inner routine (made up of GNU grep, sed, and tr) to compile a second more elaborate sed command, the output of which needs further cleanup with more sed. Call the input file "foo".
sed -n $(grep -no 'r=[0-9]*' foo | \
sed 's/^[0-9]*/&s#.*#\&/;s/:r=/|/;s/.*/&#p;/' | \
tr -d '\n') foo | \
sed 's/|[0-9|]*|/|/'
Output:
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|3
Looking at the inner sed code:
grep -no 'r=[0-9]*' foo | \
sed 's/^[0-9]*/&s#.*#\&/;s/:r=/|/;s/.*/&#p;/' | \
tr -d '\n'
Its purpose is to parse foo on the fly (when foo changes, so will the output), and in this instance it comes up with:
1s#.*#&|1#p;2s#.*#&|1#p;2s#.*#&|3#p;
This is almost perfect, but it leaves old data in place on the last line:
sed -n '1s#.*#&|1#p;2s#.*#&|1#p;2s#.*#&|3#p;' foo
abc|r=1,f=2,c=2|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1
abc|r=1,f=2,c=2;r=3,f=4,c=8|1|3
That leftover |1 is the old data that the final sed 's/|[0-9|]*|/|/' removes.
Here is a pure bash solution. I wouldn't recommend actually using this, but it might help you understand better how to work with files in bash.
# Iterate over each line, splitting into three fields
# using | as the delimiter. (f3 is only there to make
# sure a trailing | is not included in the value of f2.)
while IFS="|" read -r f1 f2 f3; do
    # Create an array of variable groups from $f2, using ;
    # as the delimiter
    IFS=";" read -a groups <<< "$f2"
    for group in "${groups[@]}"; do
        # Get each variable from the group separately
        # by splitting on ,
        IFS=, read -a vars <<< "$group"
        for var in "${vars[@]}"; do
            # Split each assignment on =, create
            # the variable for real, and quit once we
            # have found r
            IFS== read name value <<< "$var"
            declare "$name=$value"
            [[ $name == r ]] && break
        done
        # Output the desired line for the current value of r
        printf '%s|%s|%s\n' "$f1" "$f2" "$r"
    done
done < "$xxxx.txt"
Changes for ksh:
read -A instead of read -a.
typeset instead of declare.
If <<< is a problem, you can use a here document instead. For example:
IFS=";" read -A groups <<EOF
$f2
EOF

awk parse filename and add result to the end of each line

I have number of files which have similar names like
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out
etc.
I need to get the number before .csv (1 or 2) from the file name and append it to the end of every line in the file, with a TAB separator.
I have written this code; it finds the number that I need, but I do not know how to put this number into the file. There is a space in the filename, and my script breaks because of it.
Also, I am not sure how to send a list of files to the script. For now I am working with only one file.
My code:
#!/bin/sh
string="DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out"
out=$(echo $string | awk 'BEGIN {FS="_"};{print substr ($7,0,1)}')
awk ' { print $0"\t$out" } ' $string
for file in *
do
    sfx=$(echo "$file" | sed 's/.*_\(.*\).csv.*/\1/')
    sed -i "s/$/\t$sfx/" "$file"
done
Using sed:
$ sed 's/.*_\(.*\).csv.*/&\t\1/' file
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out 1
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out 2
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out 1
To make this for many files:
sed 's/.*_\(.*\).csv.*/&\t\1/' file1 file2 file3
OR
sed 's/.*_\(.*\).csv.*/&\t\1/' file*
To make this changed get saved in the same file(If you have GNU sed):
sed -i 's/.*\(.\).csv.*/&\t\1/' file
Untested, but this should do what you want (extract the number before .csv and append that number to the end of every line in the .out file)
awk 'FNR==1 { split(FILENAME, field, /[_.]/) }
{ print $0"\t"field[7] > FILENAME"_aaaa" }' *.out
for file in *_aaaa; do mv "$file" "${file/_aaaa}"; done
If I understood correctly, you want to append the number from the filename to every line in that file - this should do it:
#!/bin/bash
while [[ 0 < $# ]]; do
    num=$(echo "$1" | sed -r 's/.*_([0-9]+).csv.*/\t\1/' )
    #awk -e "{ print \$0\"\t${num}\"; }" < "$1" > "$1.new"
    #sed -r "s/$/\t$num/" < "$1" > "$1.new"
    #sed -ri "s/$/\t$num/" "$1"
    shift
done
Run the script and give it the names of the files you want to process. $# is the number of command-line arguments to the script; it is decremented at the end of the loop by shift, which drops the first argument and shifts the others down. The script extracts the number from the filename; pick one of the three commented lines to do the appending: awk gives you more flexibility, the first sed creates new files, and the second sed processes them in place (provided you are running GNU sed).
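For example, a possible invocation (the script name append_num.sh is made up for the illustration, and one of the three commented lines is assumed to be uncommented):
$ chmod +x append_num.sh
$ ./append_num.sh *.out      # each matching file becomes "$1" in turn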
Instead of awk, you may want to go with sed or coreutils.
Grab number from filename, with grep for variety:
num=$(<<<filename grep -Eo '[^_]+\.csv' | cut -d. -f1)
<<<filename is a here-string; it is equivalent to piping echo filename into the command.
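For instance, with one of the sample names from the question stored in a variable for the demo:
$ f='DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out'
$ <<<"$f" grep -Eo '[^_]+\.csv' | cut -d. -f1
1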
With sed
Append num to each line with GNU sed:
sed "s/\$/\t$num" filename
Use the -i switch to modify filename in-place.
With paste
You also need to know the length of the file for this method:
len=$(<filename wc -l)
Combine filename and num with paste:
paste filename <(seq $len | while read; do echo $num; done)
Complete example
for filename in DWH_Export*; do
    num=$(echo "$filename" | grep -Eo '[^_]+\.csv' | cut -d. -f1)
    sed -i "s/\$/\t$num/" "$filename"
done

Randomizing arg order for a bash for statement

I have a bash script that processes all of the files in a directory using a loop like
for i in *.txt
do
ops.....
done
There are thousands of files and they are always processed in alphanumerical order because of '*.txt' expansion.
Is there a simple way to randomize the order and still ensure that I process each file exactly once?
Assuming the filenames do not have spaces, just substitute the output of List::Util::shuffle.
for i in `perl -MList::Util=shuffle -e'$,=$";print shuffle<*.txt>'`; do
....
done
If filenames do have spaces but don't have embedded newlines or backslashes, read a line at a time.
perl -MList::Util=shuffle -le'$,=$\;print shuffle<*.txt>' | while read i; do
....
done
To be completely safe in Bash, use NUL-terminated strings.
perl -MList::Util=shuffle -0 -le'$,=$\;print shuffle<*.txt>' |
while read -r -d '' i; do
....
done
Not very efficient, but it is possible to do this in pure Bash if desired. sort -R does something like this, internally.
declare -a a                      # create an integer-indexed array
for i in *.txt; do
    j=$RANDOM                     # find an unused slot
    while [[ -n ${a[$j]} ]]; do
        j=$RANDOM
    done
    a[$j]=$i                      # fill that slot
done
for i in "${a[@]}"; do            # iterate in index order (which is random)
    ....
done
Or use a traditional Fisher-Yates shuffle.
a=(*.txt)
for ((i=${#a[*]}; i>1; i--)); do
    j=$[RANDOM%i]
    tmp=${a[$j]}
    a[$j]=${a[$[i-1]]}
    a[$[i-1]]=$tmp
done
for i in "${a[@]}"; do
    ....
done
You could pipe your filenames through the sort command:
ls | sort --random-sort | xargs ....
Here's an answer that relies on very basic functions within awk so it should be portable between unices.
ls -1 | awk '{print rand()*100, $0}' | sort -n | awk '{print $2}'
EDIT:
ephemient makes a good point that the above is not space-safe. Here's a version that is:
ls -1 | awk '{print rand()*100, $0}' | sort -n | sed 's/[0-9\.]* //'
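One caveat that the answer does not mention: many awk implementations seed rand() identically on every run, so without srand() the "random" order can repeat from run to run. A variant of the same pipeline that seeds from the current time:
ls -1 | awk 'BEGIN{srand()} {print rand()*100, $0}' | sort -n | sed 's/[0-9.]* //'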
If you have GNU coreutils, you can use shuf:
while read -d '' f
do
# some stuff with $f
done < <(shuf -ze *)
This will work with files with spaces or newlines in their names.
Off-topic Edit:
To illustrate SiegeX's point in the comment:
$ a=42; echo "Don't Panic" | while read line; do echo $line; echo $a; a=0; echo $a; done; echo $a
Don't Panic
42
0
42
$ a=42; while read line; do echo $line; echo $a; a=0; echo $a; done < <(echo "Don't Panic"); echo $a
Don't Panic
42
0
0
The pipe causes the while to be executed in a subshell and so changes to variables in the child don't flow back to the parent.
Here's a solution with standard unix commands:
for i in $(ls); do echo $RANDOM-$i; done | sort | cut -d- -f 2-
Here's a Python solution, if it's available on your system:
import glob
import random

files = glob.glob("*.txt")
if files:
    random.shuffle(files)   # shuffle() rearranges the list in place and returns None
    for file in files:
        print(file)
