How to process large CSV files efficiently using a shell script, with better performance than the following script? - bash

I have a large CSV file, input_file.dat, with 5 columns. I want to do two things to the second column:
(1) Remove the last character
(2) Add leading and trailing single quotes
The following are sample rows from input_file.dat:
420374,2014-04-06T18:44:58.314Z,214537888,12462,1
420374,2014-04-06T18:44:58.325Z,214537850,10471,1
281626,2014-04-06T09:40:13.032Z,214535653,1883,1
Sample output would look like:
420374,'2014-04-06T18:44:58.314',214537888,12462,1
420374,'2014-04-06T18:44:58.325',214537850,10471,1
281626,'2014-04-06T09:40:13.032',214535653,1883,1
I have written the following code to do this:
#!/bin/bash
inputfilename=input_file.dat
outputfilename=output_file.dat
count=1
while read line
do
echo $count
count=$((count + 1))
v1=$(echo $line | cut -d ',' -f1)
v2=$(echo $line | cut -d ',' -f2)
v3=$(echo $line | cut -d ',' -f3)
v4=$(echo $line | cut -d ',' -f4)
v5=$(echo $line | cut -d ',' -f5)
v2len=${#v2}
v2len=$((v2len -1))
newv2=${v2:0:$v2len}
newv2="'$newv2'"
row=$v1,$newv2,$v3,$v4,$v5
echo $row >> $outputfilename
done < $inputfilename
But it's taking a lot of time.
Is there a more efficient way to achieve this?

You can do this with awk, which processes the whole file in a single process instead of spawning several subprocesses per input line:
awk -v q="'" 'BEGIN{FS=OFS=","} {$2=q substr($2,1,length($2)-1) q}1' input_file.dat
How it works:
BEGIN{FS=OFS=","} : set the input and output field separators (FS, OFS) to ,.
-v q="'" : assign a literal single quote to the variable q (to avoid complicated escaping inside the awk expression).
{$2=q substr($2,1,length($2)-1) q} : replace the second field ($2) with a single quote (q), followed by the value of the second field without its last character (substr(string, start, length)), followed by another single quote (q).
1 : invoke the default action, which is to print the current (edited) line.
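To write the result to a file, as the original script does, simply redirect the output (filenames taken from the question):
awk -v q="'" 'BEGIN{FS=OFS=","} {$2=q substr($2,1,length($2)-1) q}1' input_file.dat > output_file.dat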

Related

Bash - Transpose a single field, keeping the rest the same, and repeat it across rows

I have a file with pipe-separated fields, e.g.:
1,2,3|xyz|abc
I need the output in the following format:
1|xyz|abc
2|xyz|abc
3|xyz|abc
I have working code in bash:
while read i
do
f1=`echo $i | cut -d'|' -f1`
f2=`echo $i | cut -d'|' -f2-`
echo $f1 | tr ',' '\n' | sed "s:$:|$f2:" >> output.txt
done < pipe_delimited_file.txt
Can anyone suggest a way to achieve this without using a loop?
The file contains a large number of records.
This uses a loop, but it's inside awk, so it's very fast:
awk -F\| 'BEGIN{OFS="|"}{n = split($1, a, ","); $1=""; for(i=1; i<=n; i++) {print a[i] $0}}' pipe_delimited_file.txt
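Run against the sample line, this prints the requested output:
1|xyz|abc
2|xyz|abc
3|xyz|abc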
Perl may be a bit faster than awk:
perl -F'[|]' -ane 'for $n (split /,/, $F[0]) {$F[0] = $n; print join "|", @F}' file
bash is very slow, but here's a quicker way to use it. This uses plain bash without calling any external programs:
( # in a subshell:
  IFS=,   # use comma as field separator
  set -f  # turn off filename generation
  while IFS='|' read -r f1 rest; do  # temporarily using pipe as field separator,
                                     # read first field and rest of line
    for word in $f1; do              # iterate over comma-separated words
      echo "$word|$rest"
    done
  done
) < file

cut a string after a specified pattern (comma)

I want to cut a string after the first occurrence of a comma and assign the result to a variable.
my_string="a,b,c,d,e,f"
Output expected:
output="b,c,d,e,f"
When I use the command
output=`echo $my_string | cut -d ',' -f2`
I am getting only b as output.
Adding a dash '-' to the end of your -f2 will output the remainder of the string.
$ echo "a,b,c,d,e,f,g"|cut -d, -f2-
b,c,d,e,f,g
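cut accepts other field ranges as well; for example, -f2-4 selects fields 2 through 4 and -f-2 selects everything up to field 2:
$ echo "a,b,c,d,e,f,g"|cut -d, -f2-4
b,c,d
$ echo "a,b,c,d,e,f,g"|cut -d, -f-2
a,b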
With parameter expansion instead of cut:
$ my_string="a,b,c,d,e,f"
$ output="${my_string#*,}"
$ echo "$output"
b,c,d,e,f
${my_string#*,} stands for "remove everything up to and including the first comma from my_string" (see the Bash manual).
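For contrast, the doubled forms of these expansions are greedy, matching the longest pattern instead of the shortest:
$ echo "${my_string##*,}"
f
$ echo "${my_string%,*}"
a,b,c,d,e
${my_string##*,} removes everything up to and including the last comma, and ${my_string%,*} removes everything from the last comma onward.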
You must add the minus sign (-) after the position you are looking for.
a=`echo $my_string|cut -d "," -f 2-`
echo $a
b,c,d,e,f

Cut one word before delimiter - Bash

How do I use cut to get one word before the delimiter? For example, I have the line below in file.txt:
one two three four five: six seven
when I use the cut command below:
cat file.txt | cut -d ':' -f1
...then I get everything before the delimiter; i.e.:
one two three four five
...but I only want to get "five"
I do not want to use awk or the position, because the file changes all the time and the position of "five" can be anywhere. The only thing fixed is that five will have a ":" delimiter.
Thanks!
Pure bash:
s='one two three four five: six seven'
w=${s%%:*} # cut off everything from the first colon
l=${w##* } # cut off everything until last space
echo $l
# => five
(If you have one colon in your file, s=$(grep : file) should set up your initial variable)
Since you need more than one field delimiter here, awk comes to the rescue:
s='one two three four five: six seven'
awk -F '[: ]' '{print $5}' <<< "$s"
five
EDIT: If your field positions can change then try this awk:
awk -F: '{sub(/.*[[:blank:]]/, "", $1); print $1}' <<< "$s"
five
Here is a BASH one-liner to get this in a single command:
[[ $s =~ ([^: ]+): ]] && echo "${BASH_REMATCH[1]}"
five
You may want to do something like this:
cat file.txt | while read -r line
do
  for word in $line
  do
    if echo "$word" | grep -q ':$'; then
      echo "${word%:}"   # print the matched word without its trailing colon
    fi
  done
done
If the structure is consistent (even with a varying number of words per line), you can change the first line to:
cat file.txt | cut -d':' -f1 | while read -r line
do ...
and that way avoid processing anything to the right of the ':' delimiter.
Try
echo "one two three four five: six seven" | awk -F ':' '{print $1}' | awk '{print $NF}'
This will always print the last word before the first :, no matter what else is on the line.
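The two awk invocations can also be collapsed into one; a sketch of the same idea:
echo "one two three four five: six seven" | awk -F: '{n = split($1, w, " "); print w[n]}'
five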

I want to re-arrange a file in an order in shell

I have a file test.txt like the one below, with spaces in between each record:
service[1.1],parttion, service[1.2],parttion, service[1.3],parttion, service[2.1],parttion, service2[2.2],parttion,
Now I want to rearrange it as below into an output.txt:
COMPOSITES=parttion/service/1.1,parttion/service/1.2,parttion/service/1.3,parttion/service/2.1,parttion/service/2.2
I've tried:
final_str=''
COMPOSITES=''
# Re-arranging the composites and preparing the composite property file
while read line; do
partition_val="$(echo $line | cut -d ',' -f 2)"
composite_temp1_val="$(echo $line | cut -d ',' -f 1)"
composite_val="$(echo $composite_temp1_val | cut -d '[' -f 1)"
version_temp1_val="$(echo $composite_temp1_val | cut -d '[' -f 2)"
version_val="$(echo $version_temp1_val | cut -d ']' -f 1)"
final_str="$partition_val/$composite_val/$version_val,"
COMPOSITES=$COMPOSITES$final_str
done <./temp/test.txt
We start with the file:
$ cat test.txt
service[1.1],parttion, service[1.2],parttion, service[1.3],parttion, service[2.1],parttion, service2[2.2],parttion,
We can rearrange that file as follows:
$ awk -F, -v RS=" " 'BEGIN{printf "COMPOSITES=";} {gsub(/[[]/, "/"); gsub(/[]]/, ""); if (NF>1) printf "%s%s/%s",NR==1?"":",",$2,$1;}' test.txt
COMPOSITES=parttion/service/1.1,parttion/service/1.2,parttion/service/1.3,parttion/service/2.1,parttion/service2/2.2
The same command split over multiple lines is:
awk -F, -v RS=" " '
BEGIN{
  printf "COMPOSITES=";
}
{
  gsub(/[[]/, "/")
  gsub(/[]]/, "")
  if (NF>1) printf "%s%s/%s",NR==1?"":",",$2,$1
}
' test.txt
Here's what I came up with.
awk -F '[],[]' -v RS=" " 'BEGIN{printf("COMPOSITES=")}/../{printf("%s/%s/%s,",$4,$1,$2);}' test.txt
Broken out for easier reading:
awk -F '[],[]' -v RS=" " '
BEGIN {
  printf("COMPOSITES=");
}
/../ {
  printf("%s/%s/%s,",$4,$1,$2);
}' test.txt
More detailed explanation of the script:
-F '[],[]' - use commas or square brackets as field separators
-v RS=" " - use just the space as a record separator
'BEGIN{printf("COMPOSITES=")} - starts your line
/../ - run the following code on any record that has at least two characters. This skips the empty record produced at the end of a line terminating with a space.
printf("%s/%s/%s,",$4,$1,$2); - print the elements using a printf() format string that matches the output you specified.
As concise as this is, the format string does leave a trailing comma at the end of the line. If this is a problem, it can be avoided with a bit of extra code.
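For example, one simple option is to pipe through sed to strip the final comma:
awk -F '[],[]' -v RS=" " 'BEGIN{printf("COMPOSITES=")}/../{printf("%s/%s/%s,",$4,$1,$2);}' test.txt | sed 's/,$//'
COMPOSITES=parttion/service/1.1,parttion/service/1.2,parttion/service/1.3,parttion/service/2.1,parttion/service2/2.2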
You could also do this in sed, if you like writing code in line noise.
sed -e 's:\([^[]*\).\([^]]*\).,\([^,]*\), :\3/\1/\2,:g;s/^/COMPOSITES=/;s/,$//' test.txt
Finally, if you want to avoid external tools like sed and awk, you can do this in bash alone:
a=($(<test.txt))
echo -n "COMPOSITES="
for i in "${a[@]}"; do
i="${i%,}"
t="${i%]*}"
printf "%s/%s/%s," "${i#*,}" "${i%[*}" "${t#*[}"
done
echo ""
This slurps the contents of test.txt into an array, which means your input data must be separated by whitespace, per your example. It then adds the prefix, then steps through the array, using Parameter Expansion to massage the data into the fields you need. The last line (echo "") is helpful for testing; you may want to eliminate it in practice.

Writing to CSV using bash sed not working as expected

I'm having some trouble getting this code to work, and I have no idea why it's not working. Maybe one of you gurus can lend me a hand.
To begin with I have two CSV files structured as such:
Book1.csv:
Desc,asset,asset name,something,waiver,waiver name,init date,wrong date,blah,blah,target
akdhfa,2014,adskf,kadsfjh,123-4567,none,none,none,none,none,BOOP
Book2.csv:
Desc,asset,asset name,something,waiver,waiver name,init date,wrong date,blah,blah,target
akdhfa,2014,adskf,kadsfjh,123-4567,none,none,none,none,none
(Note the lack of "BOOP" in the second book.)
What I want is to scan Book1.csv for column 11. If it's there, find the matching row in Book2.csv based on asset and waiver. Then simply append the target to that line.
Here's what I've tried so far:
#!/bin/bash
oldIFS=$IFS
IFS=$'\n'
HOME=($(cat Book1.csv))
for i in "${HOME[#]}"
do
target=`echo $i | cut -d "," -f 11`
asset=`echo $i | cut -d "," -f 2`
waiv=`echo $i | cut -d "," -f 5`
if [ "$target" != "target" ]
then
sed -i '/*${asset}*${waiv}*/s/$/,${target}/' Book2.csv
fi
done
IFS=$oldIFS
Everything seems to be working except for the sed command. Any suggestions?
You are using
sed -i '/*${asset}*${waiv}*/s/$/,${target}/' Book2.csv
which means that the variables are not expanded (the ' quotes "hide" them).
Also, each * needs something in front of it to quantify - you probably meant to use .* (otherwise ${asset}* matches any number of repeats of the last character of asset, and so on).
Just change it to
sed -i "/.*${asset}.*${waiv}.*/s/$/,${target}/" Book2.csv
Now the variables will be replaced with their values before the sed command runs, and each quantifier (*) will work properly, as it has something to quantify (the .).
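As an aside, BSD/macOS sed requires an explicit (possibly empty) backup suffix after -i, so there the same fix would be written as:
sed -i '' "/.*${asset}.*${waiv}.*/s/$/,${target}/" Book2.csv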
You're using single quotes, which inhibit variable expansion. Change to double quotes.
This awk might be tidier:
awk -F, -v OFS=, '
NR == FNR {boop[$2,$5] = $11; next}
NF != 11 {$11 = boop[$2,$5]}
{print}
' Book1.csv Book2.csv > tmpfile && mv tmpfile Book2.csv
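Run against the sample books, Book2.csv should then contain the filled-in target:
Desc,asset,asset name,something,waiver,waiver name,init date,wrong date,blah,blah,target
akdhfa,2014,adskf,kadsfjh,123-4567,none,none,none,none,none,BOOP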
awk does not have a -i option like sed's (GNU awk 4.1+ offers -i inplace), hence the temporary file.
