Prevent creation of empty files in bash

I wrote the following code, where:
$1 = Input .csv file
$2 = list of strings to be searched in $1
$3 = list of different strings, also searched in $1
while read str1
do
while read str2
do
grep $str1 $1 | grep $str2 | cut -d "," -f 6 > ${str1}_${str2}.txt
done < $3
done < $2
It basically does what I want it to do (search for two different strings from separate input files, extract field 6 of lines that contain both strings and write the content of field 6 into a result file).
However, of course, result files are created for all possible combinations of strings from $2 and $3, even if they are empty. Is there a way to prevent the creation of empty files in general or do I have to remove them at the end?

You can capture program output with $(...):
res=$(grep "$str1" "$1" | grep "$str2" | cut -d "," -f 6)
and test with -n whether the string is non-empty before creating the file:
if [[ -n $res ]]; then echo "$res" > "${str1}_${str2}.txt" ; fi
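Putting the two together, a minimal sketch of the whole loop with that test applied (the variables are quoted here, which also protects against spaces in the inputs):
while read str1; do
  while read str2; do
    res=$(grep "$str1" "$1" | grep "$str2" | cut -d "," -f 6)
    # only create the output file when something matched
    if [[ -n $res ]]; then
      echo "$res" > "${str1}_${str2}.txt"
    fi
  done < "$3"
done < "$2"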

Related

Simple bash script to split csv file by week number

I'm trying to separate a large pipe-delimited file based on a week number field. The file contains data for a full year thus having 53 weeks. I am hoping to create a loop that does the following:
1) check if week number is less than 10 - if it is paste a '0' in front
2) use grep to send the rows to a file (ie `grep '|01|' bigFile.txt > smallFile.txt` )
3) gzip the smaller file (ie `gzip smallFile.txt`)
4) repeat
Is there a resource that would show how to do this?
EDIT :
Data looks like this:
1|#gmail|1|0|0|0|1|01|com
1|#yahoo|0|1|0|0|0|27|com
The column I care about is the 2nd from the right.
EDIT 2:
Here's the script I'm using but it's not functioning:
for (( i = 1; i <= 12; i++ )); do
#statements
echo 'i :'$i
q=$i
# echo $q
# $q==10
if [[ q -lt 10 ]]; then
#statements
k='0'$q
echo $k
grep '|$k|' 20150226_train.txt > 'weeks_files/week'$k
gzip weeks_files/week $k
fi
if [[ q -gt 9 ]]; then
#statements
echo $q
grep \'|$q|\' 20150226_train.txt > 'weeks_files/week'$q
gzip 'weeks_files/week'$q
fi
done
Very simple in awk ...
awk -F'|' '{ print > ("smallfile-" $(NF-1) ".txt") }' bigfile.txt
Edit: brackets added for "original-awk".
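If, as in step 3 of the question, each piece should also end up gzipped, a follow-up pass over the generated files would do it (assuming the smallfile-*.txt names produced above):
gzip smallfile-*.txt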
You're almost there.
#!/bin/bash
for (( i = 1; i <= 12; i++ )); do
#statements
echo 'i :'$i
q=$i
# echo $q
# $q==10
#OLD if [[ q -lt 10 ]]; then
if [[ $q -lt 10 ]]; then
#statements
k='0'$q
echo $k
#OLD grep '|$k|' 20150226_train.txt > 'weeks_files/week'$k
grep "|$k|" 20150226_train.txt > 'weeks_files/week'$k
#OLD gzip weeks_files/week $k
gzip weeks_files/week$k
#OLD fi
#OLD if [[ q -gt 9 ]]; then
elif [[ $q -gt 9 ]] ; then
#statements
echo $q
#OLD grep \'|$q|\' 20150226_train.txt > 'weeks_files/week'$q
grep "|$q|" 20150226_train.txt > 'weeks_files/week'$q
gzip 'weeks_files/week'$q
fi
done
You didn't always use $ in front of your variable names. You can only get away with using k or q without a $ inside the shell arithmetic features, i.e. z=$(( x+k )), or when operating on a variable as in (( k++ )). There are others.
You need to learn the difference between single quoting and dbl-quoting. You need to use dbl-quoting when you want a value substituted for a variable, as in your lines
grep "|$q|" 20150226_train.txt > 'weeks_files/week'$q
and others.
I'm guessing that your use of grep \'|$q|\' 20150226_train.txt was an attempt to get the value of $q.
The way to get comfortable with debugging this sort of situation is to set the shell debugging option with set -x (turn it off with set +x). You'll see each line that is executed, with the values substituted for the variables. Advanced debugging requires echo "var of interest now = $var" (print statements). Also, you can use set -vx (and set +vx) to see each line or block of code before it is executed, and then the -x output will show which lines were actually executed. For your script, you'd see the whole if ... elif ... fi block printed, and then just the lines of -x output with values for variables. It can be confusing, even after years of looking at it. ;-)
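For instance, a tiny demo of what the trace looks like:
set -x
q=5
if [[ $q -lt 10 ]]; then k="0$q"; fi
set +x
which prints trace lines such as + q=5, + [[ 5 -lt 10 ]] and + k=05.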
So you can go thru and remove all lines with the prefix #OLD, and I'm hoping your code will work for you.
IHTH
mkdir -p weeks_files &&
awk -F'|' '
{ file=sprintf("weeks_files/week%02d",$(NF-1)); print > file }
!seen[file]++ { print file }
' 20150226_train.txt |
xargs gzip
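If the data has more distinct weeks than your awk allows open files, a variant of the same idea (a sketch, slower but safer) appends and closes after each write:
mkdir -p weeks_files &&
awk -F'|' '
{ file=sprintf("weeks_files/week%02d",$(NF-1)); print >> file; close(file) }
!seen[file]++ { print file }
' 20150226_train.txt |
xargs gzip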
If your data is ordered so that all of the rows for a given week number are contiguous you can make it simpler and more efficient:
mkdir -p weeks_files &&
awk -F'|' '
$(NF-1) != prev { file=sprintf("weeks_files/week%02d",$(NF-1)); print file }
{ print > file; prev=$(NF-1) }
' 20150226_train.txt |
xargs gzip
There are certainly a number of approaches - the 'awk' line below will reformat your data. If you take a sequential approach, then:
1) awk to reformat
awk -F '|' '{printf "%s|%s|%s|%s|%s|%s|%s|%02d|%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' SOURCE_FILE > bigFile.txt
2) loop through the weeks, create the small file and zip it
for N in {01..53}
do
grep "|${N}|" bigFile.txt > smallFile.${N}.txt
gzip smallFile.${N}.txt
done
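Note that zero-padded brace expansion like {01..53} requires bash 4 or later; on older shells a printf-based sketch does the same job:
for i in $(seq 1 53)
do
    N=$(printf "%02d" "$i")   # zero-pad the week number
    grep "|${N}|" bigFile.txt > smallFile.${N}.txt
    gzip smallFile.${N}.txt
done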
3) test script showing reformat step
#!/bin/bash
function show_data {
# Data set w/9 'fields'
# 1| 2 |3|4|5|6|7| 8|9
cat << EOM
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|2|com
1|#gmail|1|0|0|0|1|5|com
1|#yahoo|0|1|0|0|0|27|com
EOM
}
###
function stars {
echo "## $# ##"
}
###
stars "Raw data"
show_data
stars "Modified data"
# 1| 2| 3| 4| 5| 6| 7| 8|9 ##
show_data | awk -F '|' '{printf "%s|%s|%s|%s|%s|%s|%s|%02d|%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}'
Sample run:
$ bash test.sh
## Raw data ##
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|2|com
1|#gmail|1|0|0|0|1|5|com
1|#yahoo|0|1|0|0|0|27|com
## Modified data ##
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|02|com
1|#gmail|1|0|0|0|1|05|com
1|#yahoo|0|1|0|0|0|27|com

Cut column by column name in bash

I want to specify a column by name (e.g. 102), find the position of this column and then use something like cut -f-5,7- with the found position to delete the specified column.
This is my file header (delim = "\t"):
#CHROM POS 1 100 101 102 103 107 108
This awk should work:
awk -F'\t' -v c="102" 'NR==1{for (i=1; i<=NF; i++) if ($i==c){p=i; break}; next} {print $p}' file
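If the goal is to delete the found column rather than print it, here is a sketch along the same lines (assuming the same tab-delimited header):
awk -F'\t' -v OFS='\t' -v c="102" '
NR==1 { for (i=1; i<=NF; i++) if ($i==c) p=i }
{
  # rebuild the line, skipping column p
  out=""
  for (i=1; i<=NF; i++)
    if (i != p) out = out (out == "" ? "" : OFS) $i
  print out
}
' file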
Here's one possible solution without the restriction that only one column is to be removed. It is written as a bash function, where the first argument is the filename, and the remaining arguments are the columns to exclude.
rmcol() {
local file=$1
shift
cut -f$(head -n1 "$file" | tr \\t \\n | grep -vFxn "${@/#/-e}" |
cut -d: -f1 | paste -sd,) "$file"
}
If you want to select rather than exclude the named columns, then change -vFxn to -Fxn.
That almost certainly requires some sort of explanation. The first two lines of the function just remove the filename from the arguments and store it for later use. The cut command then selects the appropriate columns; the column numbers are computed with the complicated pipeline which follows:
head -n1 "$file" | # Take the first line of the file
tr \\t \\n | # Change all the tabs to newlines [ Note 1]
grep            # Select all lines (i.e. column names) which
  -v            # don't match
  -F            # the literal string
  -x            # which is the complete line
  -n            # and include the line number in the output
  "${@/#/-e}" | # Put -e at the beginning of each command line argument,
# converting the arguments into grep pattern arguments (-e)
cut -d: -f1 |      # Select only the line number from each match
paste -sd, # Paste together all the line numbers, separated with commas.
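A hypothetical invocation, assuming a tab-separated file data.tsv with the header from the question:
rmcol data.tsv 102
This would print every column except the one whose header is 102.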
Using a for loop in bash:
C=1; for i in $(head -n 1 file) ; do if [ $i == "102" ] ; then break ; else C=$(( $C + 1 )) ; fi ; done ; echo $C
And a full script
C=1
for i in $(head -n 1 in_file) ; do
echo $i
if [ $i == "102" ] ; then
break ;
else
echo $C
C=$(( $C + 1 ))
fi
done
cut -f1-$(($C-1)),$(($C+1))- in_file
Trying a solution without looping through the columns, I get:
#!/bin/bash
pick="$1"
titles="pos 1 100 102 105"
tmp=" $titles "
tmp="${tmp%% $pick* }"
tmp=($tmp)
echo "column ${#tmp[#]}"
It suffers from incorrectly reporting the last column if the column name can't be found.
Try this small awk utility to cut specific headers - https://github.com/rohitprajapati/toyeca-cutter
Example usage -
awk -f toyeca-cutter.awk -v c="col1, col2, col3, col4" my_file.csv

Bash script read specific value from files of an entire folder

I have a problem creating a script that reads specific values from all the files in a folder.
I have a number of email files in a directory and I need to extract 2 specific values from each file.
After that I have to put them into a new file that looks like that:
--------------
To: value1
value2
--------------
This is what I want to do, but I don't know how to create the script:
# I am putting the name of the files into a temp file
ls -l | awk '{print $9 }' > tmpfile
# use for the name of a file
date=`date +"%T"`
# The first specific value from file (phone number)
var1=`cat tmpfile | grep "To: 0" | awk '{print $2 }' | cut -b -10`
# The second specific value from file (subject)
var2=`cat file | grep Subject | awk '{print $2$3$4$5$6$7$8$9$10 }'`
# Put the first value in a new file on the first row
echo "To: 4"$var1"" > sms-$date
# Put the second value in the same file on the second row
echo ""$var2"" >>sms-$date
.......
and do the same for every file in the directory
I tried using while and for loops but I couldn't finalize the script.
Thank You
I've made a few changes to your script, hopefully they will be useful to you:
#!/bin/bash
for file in *; do
var1=$(awk '/To: 0/ {print substr($2,1,10)}' "$file")
var2=$(awk '/Subject/ {for (i=2; i<=10; ++i) s=s$i; print s}' "$file")
outfile="sms-"$(date +"%T")
i=0
while [ -f "$outfile" ]; do outfile="sms-$date-"$((i++)); done
echo "To: 4$var1" > "$outfile"
echo "$var2" >> "$outfile"
done
The for loop just goes through every file in the folder that you run the script from.
I have added an additional suffix $i to the end of the file name. If no file with the same date already exists, then the file will be created without the suffix. Otherwise the value of $i will keep increasing until there is no file with the same name.
I'm using $( ) rather than backticks, this is just a personal preference but it can be clearer in my opinion, especially when there are other quotes about.
There's not usually any need to pipe the output of grep to awk. You can do the search in awk using the / / syntax.
I have removed the cut -b -10 and replaced it with substr($2, 1, 10), which prints the first 10 characters from column 2 (awk strings are 1-indexed).
It's not much shorter but I used a loop rather than the $2$3..., I think it looks a bit neater.
There's no need for all the extra " in the two output lines.
I suggest trying the following:
#!/bin/sh
RESULT_FILE=sms-`date +"%T"`
DIR=.
fgrep -l 'To: 0' "$DIR"/* | while read FILE; do
var1=`fgrep 'To: 0' "$FILE" | awk '{print $2 }' | cut -b -10`
var2=`fgrep 'Subject' "$FILE" | awk '{print $2$3$4$5$6$7$8$9$10 }'`
echo "To: 4$var1" >>"$RESULT_FIL"
echo "$var2" >>"$RESULT_FIL"
done

BASH script - print sorted contents from all files in directory with no rep's

In the current directory there are files with names of the form "gradesXXX" (where XXX is a course number) which look like this:
ID GRADE (this line is not contained in the files)
123456789 56
213495873 84
098342362 77
. .
. .
. .
I want to write a BASH script that prints all the IDs that have a grade above a certain number, which is given as the first parameter to said script.
The requirements are that an ID must be printed once at most, and that no intermediate files are used.
I was guided to use two scripts - the first one line long, and the second up to six lines long (not including the "#!" line).
I'm quite lost with this one so any suggestions will be appreciated.
Cheers.
The answer I was looking for was
# internal script
#!/bin/bash
while read line; do
line_split=( $line )
if (( ${line_split[1]} > $1 )); then
echo ${line_split[0]}
fi
done
# external script
#!/bin/bash
cat grades* | sort -r -n -k 1 | internalScript $1 | cut -f1 -d" " | uniq
OK, a simple solution.
cat grades[0-9][0-9][0-9] | sort -rnk 2 | while read ID GRADE ; do if [ $GRADE -lt 60 ] ; then break ; fi ; echo $ID ; done | sort -u
I'm not sure why two scripts should be necessary. All in a script:
#!/bin/bash
threshold=$1
cat grades[0-9][0-9][0-9] | sort -rnk 2 | while read ID GRADE ; do if [ $GRADE -lt $threshold ] ; then break ; fi ; echo $ID ; done | sort -u
We first cat all the grade files, then sort them by grade in reverse order. The while loop breaks as soon as a grade falls below the threshold, so only lines with higher grades get their ID printed. The final sort -u makes sure that every ID is printed only once.
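With the sample data from the question and a threshold of 60, a run would look something like this (assuming the one-script version is saved as grades.sh):
$ bash grades.sh 60
098342362
213495873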
You can use awk:
awk '{ if ($2 > 70) print $1 }' grades777
It prints the first column of every line whose second column is greater than 70. If you need to change the threshold:
N=71
awk '{ if ($2 > '$N') print $1 }' grades777
The single quotes are required to splice shell variables into the awk program. To work with all grades??? files in the current directory and remove duplicated lines:
awk '{ if ($2 > '$N') print $1 }' grades??? | sort -u
A simple one-line solution.
Yet another solution:
cat grades[0-9][0-9][0-9] | awk -v MAX=70 '{ if ($2 > MAX) foo[$1]=1 }END{for (id in foo) print id }'
Append | sort -n after that if you want the IDs in sorted order.
In pure bash:
N=60
for file in /path/*; do
while read id grade; do ((grade > N)) && echo "$id"; done < "$file"
done
OUTPUT
213495873
098342362

How to get output of grep in single line in shell script?

Here is a script which looks up words in the file replaced.txt and displays the output with each word on its own line, but I want to display all the output on a single line.
#!/bin/sh
echo
echo "Enter the word to be translated"
read a
IFS=" " # Set the field separator
set $a # Breaks the string into $1, $2, ...
for a # a for loop by default loop through $1, $2, ...
do
{
b= grep "$a" replaced.txt | cut -f 2 -d" "
}
done
Content of "replaced.txt" file is given below:
hllo HELLO
m AM
rshbh RISHABH
jn JAIN
hw HOW
ws WAS
ur YOUR
dy DAY
That question isn't quite what I asked; I just need help getting the output of the script onto a single line.
Your entire script can be replaced by:
#!/bin/bash
echo
read -r -p "Enter the words to be translated: " a
echo $(printf "%s\n" $a | grep -Ff - replaced.txt | cut -f 2 -d ' ')
No need for a loop.
The echo with an unquoted argument removes embedded newlines and replaces each sequence of multiple spaces and/or tabs with one space.
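An equivalent sketch that joins the lines with tr instead of relying on echo's word splitting:
printf "%s\n" $a | grep -Ff - replaced.txt | cut -f 2 -d ' ' | tr '\n' ' '
Note that tr replaces every newline, including the final one, with a space.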
One hackish-but-simple way to remove trailing newlines from the output of a command is to wrap it in printf %s "$(...) ". That is, you can change this:
b= grep "$a" replaced.txt | cut -f 2 -d" "
to this:
printf %s "$(grep "$a" replaced.txt | cut -f 2 -d" ") "
and add an echo command after the loop completes.
The $(...) notation sets up a "command substitution": the command grep "$a" replaced.txt | cut -f 2 -d" " is run in a subshell, and its output, minus any trailing newlines, is substituted into the argument-list. So, for example, if the command outputs DAY, then the above is equivalent to this:
printf %s "DAY "
(The printf %s ... notation is equivalent to echo -n ... — it outputs a string without adding a trailing newline — except that its behavior is more portably consistent, and it won't misbehave if the string you want to print happens to start with -n or -e or whatnot.)
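Putting it together, a minimal sketch of the whole script with this fix applied:
#!/bin/sh
echo
echo "Enter the word to be translated"
read a
for word in $a    # default IFS splits the input into words
do
    printf %s "$(grep "$word" replaced.txt | cut -f 2 -d" ") "
done
echo              # finish with a single trailing newline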
You can also use
awk 'BEGIN { OFS=": "; ORS=" "; } NF >= 2 { print $2; }'
in a pipe after the cut.
