Simple bash script to split csv file by week number - bash

I'm trying to separate a large pipe-delimited file based on a week number field. The file contains data for a full year thus having 53 weeks. I am hoping to create a loop that does the following:
1) check if week number is less than 10 - if it is paste a '0' in front
2) use grep to send the rows to a file (ie `grep '|01|' bigFile.txt > smallFile.txt` )
3) gzip the smaller file (ie `gzip smallFile.txt`)
4) repeat
Is there a resource that would show how to do this?
EDIT :
Data looks like this:
1|#gmail|1|0|0|0|1|01|com
1|#yahoo|0|1|0|0|0|27|com
The column I care about is the 2nd from the right.
EDIT 2:
Here's the script I'm using but it's not functioning:
for (( i = 1; i <= 12; i++ )); do
#statements
echo 'i :'$i
q=$i
# echo $q
# $q==10
if [[ q -lt 10 ]]; then
#statements
k='0'$q
echo $k
grep '|$k|' 20150226_train.txt > 'weeks_files/week'$k
gzip weeks_files/week $k
fi
if [[ q -gt 9 ]]; then
#statements
echo $q
grep \'|$q|\' 20150226_train.txt > 'weeks_files/week'$q
gzip 'weeks_files/week'$q
fi
done

Very simple in awk ...
awk -F'|' '{ print > ("smallfile-" $(NF-1) ".txt") }' bigfile.txt
Edit: brackets added for "original-awk".

You're almost there.
#!/bin/bash
for (( i = 1; i <= 12; i++ )); do
#statements
echo 'i :'$i
q=$i
# echo $q
# $q==10
#OLD if [[ q -lt 10 ]]; then
if [[ $q -lt 10 ]]; then
#statements
k='0'$q
echo $k
#OLD grep '|$k|' 20150226_train.txt > 'weeks_files/week'$k
grep "|$k|" 20150226_train.txt > 'weeks_files/week'$k
#OLD gzip weeks_files/week $k
gzip weeks_files/week$k
#OLD fi
#OLD if [[ q -gt 9 ]]; then
elif [[ $q -gt 9 ]] ; then
#statements
echo $q
#OLD grep \'|$q|\' 20150226_train.txt > 'weeks_files/week'$q
grep "|$q|" 20150226_train.txt > 'weeks_files/week'$q
gzip 'weeks_files/week'$q
fi
done
You didn't always use $ in front of your variable names. You can only get away with using k or q without a $ inside the shell's arithmetic features, e.g. z=$(( x+k )), or when operating on a variable directly, as in (( k++ )). There are others.
You need to learn the difference between single quoting and double quoting. Use double quotes when you want a variable's value substituted, as in your lines
grep "|$q|" 20150226_train.txt > 'weeks_files/week'$q
and others.
I'm guessing that your use of grep \'|$q|\' 20150226_train.txt was an attempt to get the value of $q.
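A minimal sketch of the quoting difference, assuming a hypothetical variable k that already holds a zero-padded week number. Only the double-quoted form hands grep the pattern |01|:

```bash
#!/bin/bash
k=01   # hypothetical week number, already zero-padded

single='|$k|'   # single quotes: the literal characters $k, no substitution
double="|$k|"   # double quotes: the value of k is substituted

printf '%s\n' "$single"   # prints |$k|
printf '%s\n' "$double"   # prints |01|
```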
The way to get comfortable with debugging this sort of situation is to set the shell debugging option with set -x (turn it off with set +x). You'll see each line that is executed, with the values substituted for the variables. Advanced debugging requires echo "var of interest now = $var" (print statements). Also, you can use set -vx (and set +vx) to see each line or block of code before it is executed, and then the -x output will show which lines were actually executed. For your script, you'd see the whole if ... elif ... fi block printed, and then just the lines of -x output with values for variables. It can be confusing, even after years of looking at it. ;-)
So you can go through and remove all lines with the prefix #OLD, and I'm hoping your code will work for you.
IHTH
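The two branches can also be collapsed by letting printf do the zero padding. This is only a sketch: split_weeks is a hypothetical helper name, the default of 53 weeks comes from the question, and the grep assumes that no other column ever holds a two-digit value equal to a week number.

```bash
#!/bin/bash
# split_weeks INPUT OUTDIR [LAST_WEEK]
# Hypothetical helper: writes OUTDIR/weekNN.gz for every week that has rows.
split_weeks() {
    local input=$1 outdir=$2 last=${3:-53} i k
    mkdir -p "$outdir"
    for (( i = 1; i <= last; i++ )); do
        printf -v k '%02d' "$i"          # 1 -> 01, 12 -> 12
        # -F makes grep treat the pattern as a fixed string
        if grep -F "|$k|" "$input" > "$outdir/week$k"; then
            gzip -f "$outdir/week$k"
        else
            rm -f "$outdir/week$k"       # no rows for this week
        fi
    done
}
```

It would be called as split_weeks 20150226_train.txt weeks_files.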

mkdir -p weeks_files &&
awk -F'|' '
{ file=sprintf("weeks_files/week%02d",$(NF-1)); print > file }
!seen[file]++ { print file }
' 20150226_train.txt |
xargs gzip
If your data is ordered so that all of the rows for a given week number are contiguous you can make it simpler and more efficient:
mkdir -p weeks_files &&
awk -F'|' '
$(NF-1) != prev { file=sprintf("weeks_files/week%02d",$(NF-1)); print file }
{ print > file; prev=$(NF-1) }
' 20150226_train.txt |
xargs gzip

There are certainly a number of approaches - the 'awk' line below will reformat your data. If you take a sequential approach, then:
1) awk to reformat
awk -F '|' '{printf "%s|%s|%s|%s|%s|%s|%s|%02d|%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' SOURCE_FILE > bigFile.txt
2) loop through the weeks, create a small file for each and gzip it
for N in {01..53}
do
grep "|${N}|" bigFile.txt > smallFile.${N}.txt
gzip smallFile.${N}.txt
done
3) test script showing reformat step
#!/bin/bash
function show_data {
# Data set w/9 'fields'
# 1| 2 |3|4|5|6|7| 8|9
cat << EOM
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|2|com
1|#gmail|1|0|0|0|1|5|com
1|#yahoo|0|1|0|0|0|27|com
EOM
}
###
function stars {
echo "## $* ##"
}
###
stars "Raw data"
show_data
stars "Modified data"
# 1| 2| 3| 4| 5| 6| 7| 8|9 ##
show_data | awk -F '|' '{printf "%s|%s|%s|%s|%s|%s|%s|%02d|%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}'
Sample run:
$ bash test.sh
## Raw data ##
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|2|com
1|#gmail|1|0|0|0|1|5|com
1|#yahoo|0|1|0|0|0|27|com
## Modified data ##
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|02|com
1|#gmail|1|0|0|0|1|05|com
1|#yahoo|0|1|0|0|0|27|com

Related

For loop function doesn't work within a while loop

I am looking to repeat the same function for each gene in my genelist. This is what the while loop does. Then it extracts the files from the master document into a new bed file.
The number_of_lines variable is the number of rows in the document. And I want to create a document with the number of row corresponding to number_of_lines
i.e.
number_of_lines=1
output
1
number_of_lines=5
output
5
5
5
5
5
my code below
while read gene
do
grep -w $gene $masterfile | awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$4}' > $gene.bed
number_of_lines=$(grep "^.*$" -c $gene.bed)
echo $number_of_lines
cat "" > $gene.1.bed
for i in 'eval echo {1..$number_of_lines}'
do
echo $number_of_lines >> $gene.1.bed
done
done < $genelist
if I do this by itself
cat "" > $gene.1.bed
for i in 'eval echo {1..$number_of_lines}'
do
echo $number_of_lines >> $gene.1.bed
done
it works?
You need to put eval echo {1..$number_of_lines} inside $() to expand to the output.
cat "" will get an error, because cat treats "" as a file name. To start with an empty file, use : > "$gene.1.bed" (echo "" would write a blank line). But it is simpler to put the output redirection around the entire loop instead of after each echo statement.
while read gene
do
grep -w "$gene" "$masterfile" | awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$4}' > "$gene.bed"
number_of_lines=$(grep "^.*$" -c "$gene.bed")
echo $number_of_lines
for i in $(eval echo {1..$number_of_lines})
do
echo $number_of_lines
done > "$gene.1.bed"
done < "$genelist"
When you see eval, you "know" your code is wrong. @Barmar already pointed out the normal construction for ((i=0; i<$number_of_lines; i++)), which is what should be used here. With all lines having the same content, you have another possibility: yes. I made some other changes too.
while read gene
do
grep -w "${gene}" "${masterfile}" |
awk 'BEGIN {OFS="\t";} {print $1, $2, $3, $5, $6, $4}' > "${gene}.bed"
number_of_lines=$(wc -l < "${gene}.bed")
echo "${number_of_lines}"
yes "${number_of_lines}" | head -n "${number_of_lines}" > "${gene}.1.bed"
done < "${genelist}"
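For reference, the C-style loop mentioned above could look roughly like this. repeat_count is a hypothetical helper name used only for illustration; it needs neither eval nor brace expansion:

```bash
#!/bin/bash
# repeat_count N: print the number N on N separate lines (hypothetical helper)
repeat_count() {
    local n=$1 i
    for (( i = 0; i < n; i++ )); do
        echo "$n"
    done
}

repeat_count 3   # prints 3 on three lines
```

Inside the while loop it would be used as repeat_count "$number_of_lines" > "$gene.1.bed".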

UNIX average of specific employee as per designation

This is an example of a text file to be given as input
Name,Designation,Salary
Hari,Engineer,35000
Suresh,Consultant,80000
Umesh,Engineer,45500
Maya,Analyst,50000
Guru,Consultant,100000
Sushma,Engineer,30000
Mohan,Engineer,30000
My code should find the average salary for a particular designation. For example,
bash script.sh employees.txt Analyst
Then my output should be
50000
My current code to find just the average of all employees doesn't work. I am new to shell. This is my current code
count="$(tail -n 1 salary.txt | grep -o '^[^\s]\+')"
echo "$count"
salary="$(grep -o '[^ ]\+$' salary.txt | paste -sd+)"
echo "$salary"
echo "($salary)/$count" | bc
I get empty values as results.
This is better done in awk:
awk -F, -v dgn='Engineer' '$2 == dgn{s += $3; ++c} END{printf "%.2f\n", s/c}' file.csv
35125.00
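To take the file and the designation as script arguments, as the question asks, the same awk program can be wrapped in a function. This is a sketch under assumptions: avg_salary is a hypothetical name, the header row is skipped with NR > 1, and a designation with no rows is reported as an error rather than dividing by zero.

```bash
#!/bin/bash
# avg_salary FILE DESIGNATION (hypothetical helper)
# Prints the average salary for DESIGNATION with two decimals.
avg_salary() {
    awk -F, -v dgn="$2" '
        NR > 1 && $2 == dgn { s += $3; ++c }
        END {
            if (c == 0) {
                print "no rows for " dgn > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", s / c
        }' "$1"
}
```

With the sample data, avg_salary employees.txt Analyst prints 50000.00.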
Could you please try the following? Since the OP asked for a script, it takes the input file name as the first argument and the designation whose average is needed as the second.
cat script.ksh
file="$1"
name="$2"
awk -F, -v field="$name" '{a[$2]+=$3;b[$2]++} END{for(i in a){if(i == field){print a[i]/b[i]}}}' "$file"
Now run the script as follows.
./script.ksh Input_file Analyst
50000
GNU datamash is a useful tool for calculating this kind of thing:
$ datamash -sHt, groupby 2 mean 3 < employees.txt
Combine with grep to limit it to just the title you're interested in.
If you want to do this in the shell:
#!/bin/bash
file=$1
designation=$2
# code to validate user input here ...
sum=0
count=0
while IFS=, read -r n d s; do
if [[ ${designation,,} == "${d,,}" ]]; then
(( sum += s ))
(( count++ ))
fi
done < "$file"
if (( count == 0 )); then
echo "No $designation found in $file"
else
echo $((sum / count))
fi
Using Perl
perl -F, -lane ' if(/Engineer/) { $dsg+=$F[2];$c++ } END { print $dsg/$c } ' file
with your given inputs
$ cat john.txt
Name,Designation,Salary
Hari,Engineer,35000
Suresh,Consultant,80000
Umesh,Engineer,45500
Maya,Analyst,50000
Guru,Consultant,100000
Sushma,Engineer,30000
Mohan,Engineer,30000
$ perl -F, -lane ' if(/Engineer/) { $dsg+=$F[2];$c++ } END { print $dsg/$c } ' john.txt
35125
$

Bash script read specific value from files of an entire folder

I have a problem creating a script that reads specific value from all the files of an entire folder
I have a number of email files in a directory and I need to extract from each file, 2 specific values.
After that I have to put them into a new file that looks like that:
--------------
To: value1
value2
--------------
This is what I want to do, but I don't know how to create the script:
# I am putting the name of the files into a temp file
ls -l | awk '{print $9 }' >tmpfile
# use for the name of a file
date=`date +"%T"`
# The first specific value from file (phone number)
var1=`cat tmpfile | grep "To: 0" | awk '{print $2 }' | cut -b -10 `
# The second specific value from file(subject)
var2=`cat file | grep Subject | awk '{print $2$3$4$5$6$7$8$9$10 }'`
# Put the first value in a new file on the first row
echo "To: 4"$var1"" > sms-$date
# Put the second value in the same file on the second row
echo ""$var2"" >>sms-$date
.......
and do the same for every file in the directory
I tried using while and for functions but I couldn't finalize the script
Thank You
I've made a few changes to your script, hopefully they will be useful to you:
#!/bin/bash
for file in *; do
var1=$(awk '/To: 0/ {print substr($2,1,10)}' "$file")
var2=$(awk '/Subject/ {for (i=2; i<=10; ++i) s=s$i; print s}' "$file")
date=$(date +"%T")
outfile="sms-$date"
i=0
while [ -f "$outfile" ]; do outfile="sms-$date-$((i++))"; done
echo "To: 4$var1" > "$outfile"
echo "$var2" >> "$outfile"
done
The for loop just goes through every file in the folder that you run the script from.
I have added an additional suffix $i to the end of the file name. If no file with the same date already exists, then the file will be created without the suffix. Otherwise the value of $i will keep increasing until there is no file with the same name.
I'm using $( ) rather than backticks, this is just a personal preference but it can be clearer in my opinion, especially when there are other quotes about.
There's not usually any need to pipe the output of grep to awk. You can do the search in awk using the / / syntax.
I have removed the cut -b -10 and replaced it with awk's substr, which takes the first 10 characters of column 2 (awk counts character positions from 1).
It's not much shorter but I used a loop rather than the $2$3..., I think it looks a bit neater.
There's no need for all the extra " in the two output lines.
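A small sketch of the substr behaviour relied on above. awk numbers characters from 1, so the first ten characters of a string s are substr(s, 1, 10). first_ten is a hypothetical helper name and the phone number is made up:

```bash
#!/bin/bash
# first_ten: print the first 10 characters of each input line
first_ten() {
    awk '{ print substr($0, 1, 10) }'
}

printf '%s\n' '0712345678extra' | first_ten   # prints 0712345678
```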
I suggest trying the following:
#!/bin/sh
RESULT_FILE=sms-`date +"%T"`
DIR=.
fgrep -l 'To: 0' "$DIR"/* | while read FILE; do
var1=`fgrep 'To: 0' "$FILE" | awk '{print $2 }' | cut -b -10`
var2=`fgrep 'Subject' "$FILE" | awk '{print $2$3$4$5$6$7$8$9$10 }'`
echo "To: 4$var1" >>"$RESULT_FILE"
echo "$var2" >>"$RESULT_FILE"
done

Need help in shell script

I am new to shell scripting and have been learning it for the past 2 months. I need your help tuning my script, or suggesting another solution in sed or awk, for the question below.
"Write a script to input the filename and display the content of the file in such a manner that each line has only 10 characters. If a line in the file exceeds 10 characters then display the rest of the line on the next line."
I have written the script below and it worked fine, but it took me 2 hours to write, which is certainly not acceptable. The problem is that I know the shell commands very well but have not yet mastered putting them into shell scripts :-( . Thanks.
#!/bin/bash
if [ $# -ne 1 ]; then
echo "USAGE: $0 $1"
exit 99;
fi
VAR1=$(echo "$1" | wc -c)
cat "$1" | while read line
do
[ $VAR1 -gt 10 ] && echo "$line" || echo "$line"|tr " " "\n"
done
Using sed
sed 's/........../&\n/g' file.txt
Using grep
grep -oE '.{1,10}' file.txt
Using dd
cat file.txt | dd cbs=10 conv=unblock 2>/dev/null
Using awk?
awk 'BEGIN {FS=""} {for (i=1; i<=NF; i++) printf "%s%s", $i, ((i % 10 == 0 || i == NF) ? "\n" : "")}' inputs.txt
This works, but I have a feeling that this is not the most optimal way of using awk :-P
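Another option, not mentioned in the answers above, is the standard fold utility, which exists for exactly this kind of line wrapping. A minimal sketch, with wrap10 as a hypothetical wrapper name:

```bash
#!/bin/bash
# wrap10 [FILE...]: break every input line into pieces of at most 10 columns
wrap10() {
    fold -w 10 "$@"
}

printf '%s\n' 'abcdefghijklmnopqrstuvwxy' | wrap10
# prints:
# abcdefghij
# klmnopqrst
# uvwxy
```

Called as wrap10 file.txt it reads the file directly; with no arguments it reads standard input.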

Get 20% of lines in File randomly

This is my code:
nb_lignes=`wc -l $1 | cut -d " " -f1`
for i in $(seq $nb_lignes)
do
m=`head $1 -n $i | tail -1`
//command
done
How can I change it to get 20% of the lines in the file at random and apply "command" to each of those lines?
20% or 40% or 60 % (it's a parameter)
Thank you.
This selects each line with probability 20%, so on average it returns about 20% of the lines in the file:
awk -v p=20 'BEGIN {srand()} rand() <= p/100' filename
So something like this for the whole solution (assuming bash):
#!/bin/bash
filename="$1"
pct="${2:-20}" # specify percentage
while read line; do
: # some command with "$line"
done < <(awk -v p="$pct" 'BEGIN {srand()} rand() <= p/100' "$filename")
If you're using a shell without process substitution (the <(...) bit), you can pipe into the loop instead. The body of the loop then runs in a subshell, so it can't have any side effects in the outer script (e.g. any variables it sets won't be set anymore once the loop completes):
#!/bin/sh
filename="$1"
pct="${2:-20}" # specify percentage
awk -v p="$pct" 'BEGIN {srand()} rand() <= p/100' "$filename" |
while read line; do
: # some command with "$line"
done
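The difference between the two forms can be seen with a small counter experiment. This sketch assumes bash with default options (in particular, lastpipe not enabled); the variable names are made up for illustration:

```bash
#!/bin/bash
# Count lines two ways to show where the loop body runs.
count_pipe=0
printf 'a\nb\nc\n' | while read -r line; do
    count_pipe=$(( count_pipe + 1 ))   # runs in a subshell; the change is lost
done

count_proc=0
while read -r line; do
    count_proc=$(( count_proc + 1 ))   # runs in the current shell; the change is kept
done < <(printf 'a\nb\nc\n')

echo "pipe: $count_pipe, process substitution: $count_proc"
# prints: pipe: 0, process substitution: 3
```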
Try this:
file=$1
nb_lignes=$(wc -l $file | cut -d " " -f1)
num_lines_to_get=$((20*${nb_lignes}/100))
for (( i=0; i < $num_lines_to_get; i++))
do
line=$(head -n $(( RANDOM % nb_lignes + 1 )) "$file" | tail -n 1)
echo "$line"
done
Note that ${RANDOM} only generates numbers less than 32768 so this approach won't work for large files.
If you have shuf installed, you can use the following to get a random line instead of using $RANDOM.
line=$(shuf -n 1 $file)
You can do it with awk, see below:
awk -v b=20 '{a[NR]=$0}END{val=((b/100)*NR)+1;for(i=1;i<val;i++)print a[i]}' all.log
The above command prints the first 20% of the lines, starting from the beginning of the file; it does not pick them at random.
You just have to change the value of b on the command line to get the required percentage of lines.
tested below:
> cat temp
1
2
3
4
5
6
7
8
9
10
> awk -v b=10 '{a[NR]=$0}END{val=((b/100)*NR)+1;for(i=1;i<val;i++)print a[i]}' temp
1
> awk -v b=20 '{a[NR]=$0}END{val=((b/100)*NR)+1;for(i=1;i<val;i++)print a[i]}' temp
1
2
>
shuf will produce the file in a randomized order; if you know how many lines you want, you can give that to the -n parameter. No need to get them one at a time. So:
shuf -n $(( $(wc -l < "$file") * pct / 100 )) "$file" |
while read -r line; do
: # do something with "$line"
done
shuf comes standard with GNU/Linux distros afaik.
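Where shuf is not available, an exact-count random sample can also be sketched with awk alone, using selection sampling: each line is kept with probability (lines still wanted) / (lines still remaining). sample_exact is a hypothetical helper name; it reads the file twice and prints the chosen lines in their original order.

```bash
#!/bin/bash
# sample_exact FILE PCT: print exactly floor(PCT% of the line count)
# randomly chosen lines from FILE, in original order (hypothetical helper)
sample_exact() {
    local file=$1 pct=$2 total wanted
    total=$(awk 'END { print NR }' "$file")   # same line count awk will see below
    wanted=$(( total * pct / 100 ))
    awk -v k="$wanted" -v n="$total" '
        BEGIN { srand() }
        # keep this line with probability (still wanted) / (still remaining)
        rand() < (k - picked) / (n - NR + 1) { print; picked++ }
    ' "$file"
}
```

Because srand() seeds from the clock, two runs within the same second may return the same sample.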
