How to replace a delimiter value present within quotes as part of data in a file - bash

I want to replace a delimiter that appears as part of the data in each record. For example:
echo '"hi","how,are,you","bye"'|sed -nE 's/"([^,]*),([^,]*),([^,]*)"/"\1;\2;\3"/gp'
output -->
"hi","how;are;you","bye"
So I am able to replace the delimiter (a comma in this case) with a semicolon when it appears inside the data.
But the challenge is that in real data we are not sure how many times the delimiter will be present, and it may appear in multiple fields as well.
For example:
"1","2,3,4,5","6","7,8"
"1","2,4,5","6","7,8,9"
"1","4,5","6","7,8,9.2"
All these are valid records.
Can anyone help me out here? How can we write generic code to handle this?

When working with anything but the most trivial CSV data, I prefer to use something that understands the format directly instead of messing with regular expressions to try to handle things like quoted fields. For example (warning: blatant self-promotion ahead!), my Tcl-based awk-like utility tawk, which I wrote in part to make it easier to manipulate CSV files:
$ tawk -csv -quoteall '
line {
    for {set n 1} {$n <= $NF} {incr n} {
        set F($n) [string map {, \;} $F($n)]
    }
    print
}' input.csv
"hi","how;are;you","bye"
"1","2;3;4;5","6","7;8"
"1","2;4;5","6","7;8;9"
"1","4;5","6","7;8;9.2"
Or a perl approach using the Text::CSV_XS module:
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({binary=>1, always_quote=>1});
while (my $row = $csv->getline(\*STDIN)) {
    tr/,/;/ foreach @$row;
    $csv->say(\*STDOUT, $row);
}' < input.csv
"hi","how;are;you","bye"
"1","2;3;4;5","6","7;8"
"1","2;4;5","6","7;8;9"
"1","4;5","6","7;8;9.2"

Assuming the data does not contain embedded double quotes ...
Sample data:
$ cat delim.dat
"hi","how,are,you","bye"
"1","2,3,4,5","6","7,8"
"1","2,4,5","6","7,8,9"
"1","4,5","6","7,8,9.2"
One awk idea: with FS set to the double quote, the quoted data always lands in the even-numbered fields, so we replace , with ; only there:
awk '
BEGIN { FS=OFS="\"" }
{ for (i=2;i<=NF;i=i+2) gsub(",",";",$i) }
1
' delim.dat
This generates:
"hi","how;are;you","bye"
"1","2;3;4;5","6","7;8"
"1","2;4;5","6","7;8;9"
"1","4;5","6","7;8;9.2"

Related

How to process file content differently for each line using shell script?

I have a file which has this data -
view:
schema1.view1:/some-path/view1.sql
schema2.view2:/some-path/view2.sql
tables:
schema1.table1:/some-path/table1.sql
schema2.table2:/some-path/table2.sql
end:
I have to read the file and store the contents in different variables.
viewData=$(sed '/view/,/tables/!d;/tables/q' $file|sed '$d')
tableData=$(sed '/tables/,/end/!d;/end/q' $file|sed '$d')
echo $viewData
view:
schema1.view1:/some-path/view1.sql
schema2.view2:/some-path/view2.sql
echo $tableData
tables:
schema1.table1:/some-path/table1.sql
schema2.table2:/some-path/table2.sql
dataArray=("$viewData" "$tableData")
I need to use a for loop over dataArray so that I get all the components in 4 different variables.
Let's say for $viewData, the loop should be able to print like this -
objType=view
schema=schema1
view=view1
fileLoc=some-path/view1.sql
objType=view
schema=schema2
view=view2
fileLoc=some-path/view2.sql
I have tried sed and cut commands but they are not working properly. And I need to do this using shell script only.
Any help will be appreciated. Thanks!
Remark: if you add a space character between the : and the / in the input, then you would be able to use YAML-aware tools to parse it robustly.
Given your sample input, you can use this awk for generating the expected blocks:
awk '
match($0,/[^[:space:]]+:/) {
    key = substr($0,RSTART,RLENGTH-1)
    val = substr($0,RSTART+RLENGTH)
    if (i = index(key,".")) {
        print "objType=" type
        print "schema=" substr(key,1,i-1)
        print "view=" substr(key,i+1)
        print "fileLoc=" val
        printf "%c", 10
    } else
        type = key
}
' data.txt
objType=view
schema=schema1
view=view1
fileLoc=/some-path/view1.sql
objType=view
schema=schema2
view=view2
fileLoc=/some-path/view2.sql
objType=tables
schema=schema1
view=table1
fileLoc=/some-path/table1.sql
objType=tables
schema=schema2
view=table2
fileLoc=/some-path/table2.sql
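If those key=value lines then need to land in shell variables (as the question asks), one hedged follow-up sketch is to pipe the awk output above through a read loop; the per-block processing shown here is only illustrative:

# extract.awk here stands for the awk program shown above, saved to a file
awk -f extract.awk data.txt |
while IFS='=' read -r key val; do
    case $key in
        objType) objType=$val ;;
        schema)  schema=$val ;;
        view)    view=$val ;;
        fileLoc) fileLoc=$val
                 # a complete block has been read; act on it here
                 printf '%s %s.%s -> %s\n' "$objType" "$schema" "$view" "$fileLoc" ;;
    esac
done

Blank lines between blocks just set key and val to empty strings, and because the loop runs in a pipeline (a subshell), the variables do not survive past done.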

Test if a value is in a csv file in bash

I have a 3-4M lines csv file (my_csv.csv) with two columns as :
col1,col2
val11,val12
val21,val22
val31,val32
...
The csv contains only two columns with one comma per line. Col1 and Col2 values are only strings (nothing else). The result shown above is the output of the command head my_csv.csv.
I would like to check if a string test_str is among the col2 values. What I mean here is, if test_str = val12 I would like the test to return True, because val12 is located in column 2 (as shown in the example).
But if test_str = val1244 I want the code to return False.
In python it would be something as :
import pandas as pd
df = pd.read_csv('my_csv.csv')
test_str = 'val42'
if test_str in df['col2'].to_list():
    # Expected to return True
    # Do the job
But I have no clues how to do it in bash.
(I know that df['col2'].to_list() is not a good idea, but I didn't want to use a built-in pandas function, so that the code is easier to understand.)
awk is the most suitable of the standard shell utilities for handling CSV data:
awk -F, -v val='val22' '$2 == val {print "found a match:", $0}' file
found a match: val21,val22
An equivalent bash loop would be like this:
while IFS=',' read -ra arr; do
    if [[ ${arr[1]} == 'val22' ]]; then
        echo "found a match: ${arr[@]}"
    fi
done < file
But do keep in mind that a bash while-read loop is extremely slow in comparison; see "Bash while read loop extremely slow compared to cat, why?".
Parsing CSV is difficult... unless your fields do not contain commas, newlines, etc. Also, you don't do this kind of job in bash itself; on a large file it would be extremely slow. You do it with utilities like awk or grep, which would equally be available with dash, zsh or another shell. So, if you have a very simple CSV format, you can use, e.g., grep:
if grep -q ',val42$' my_csv.csv; then
<do that>
fi
We can also put the string to search for in a variable, but remember that some characters have a special meaning in regular expressions and must be escaped. Example, assuming there are no special characters in the string to search for:
test_str="val42"
if grep -q ",$test_str$" my_csv.csv; then
<do that>
fi
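If the string may contain regex metacharacters, a hedged alternative is to let awk compare the second column as a plain string, which needs no escaping at all (a sketch along the lines of the awk answer above):

test_str='val.42'   # contains a character that is special in a regex
if awk -F, -v val="$test_str" '$2 == val { found=1; exit } END { exit !found }' my_csv.csv; then
    <do that>
fi

Here $2 == val is a literal string comparison, so the dot is not treated as a wildcard; the exit status of awk then drives the if.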
3-4M rows is a small file for awk; you might as well just do
{m,g}awk 'END { exit !index($_,","(__)"\n") }' RS='^$' FS='^$' __="${test_str}"
Here RS='^$' slurps the whole file into a single record, and index() then simply checks whether ",${test_str}" followed by a newline occurs anywhere in it; an exit status of 0 means it was found.

How to get paragraphs of text by index number

I am wondering if there is a way to get paragraphs of text (the source file would be a pyx file) by number, as sed does with lines:
sed -n ${i}p
At the moment I'd be interested in using awk with:
awk '/custom-pyx-tag\(/,/\)custom-pyx-tag/'
but I can't find documentation or examples about that.
I'm also trying to trim "\r\n" with gsub(/\r\n/,"; ") in the same awk command, but it doesn't work and I can't really figure out why.
Any hint would be much appreciated, thanks.
EDIT:
This is just one example and not my exact need, but I would need to know how to do it for a multipurpose project.
Let's take the case that I have exported the ID3Tags of a huge collection of audio files and these have been stored in a pyx-like format, so in the end I will have a nice big file with this pattern repeating for each file in the collection:
audio-genre(
blablabla
)audio-genre
audio-artist(
bla.blabla
)audio-artist
audio-album(
bla-bla-bla
)audio-album
audio-track-num(
0x
)audio-track-num
audio-track-title(
bla.bla-bla
)audio-track-title
audio-lyrics(
blablablablabla
bla.bla.bla.bla
blah-blah-blah
blabla-blabla
)audio-lyrics
...
Now if I want to extract the artist of the 1234th audio file I can use:
awk '/audio-artist\(/, /)audio-artist/' | sed '/audio-artist/d' | sed -n 1234p
so, being one line, it can be obtained with sed, but I don't know how to get an entire paragraph given its index; for example, if I want to get the lyrics of the 6543rd file, how could I do it?
In the end it is just a question of whether there is a command equivalent to
sed -n ${num}p
but to be used for paragraphs
awk -v indx=1234 '
BEGIN {
    RS=""
}
{
    split($0,arr,"audio-artist")
    for (i=2;i<=length(arr);i=i+2) {
        gsub("[()]","",arr[i])
        arts[cnt+=1]=arr[i]
    }
}
END {
    print arts[indx]
}' audioartist
One-liner:
awk -v indx=1234 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-artist");for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]);arts[cnt+=1]=arr[i] } } END { print arts[indx] }' audioartist
Using awk on the file called audioartist, we consume the file as one record by setting the record separator (RS) to "". We then split the whole file into an array arr, based on the separator audio-artist. We loop through the array arr from index 2 in steps of 2 to the end of the array, strip out the opening and closing brackets, and build another array called arts with an incrementing count as the index and the stripped artist as the value. At the end we print the arts entry specified by the indx variable passed in (in this case 1234).
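If what you actually need is the n-th block for a given tag (for example the lyrics of the 6543rd file), here is a hedged sketch that keys directly off the opening and closing tag lines, assuming they look exactly like the sample above:

awk -v n=6543 '
    /^audio-lyrics\(/ { grab=1; c++; next }   # opening tag: start counting a new block
    /^\)audio-lyrics/ { grab=0 }              # closing tag: stop collecting
    grab && c==n                              # print only the lines of the n-th block
' file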

How to extract text between two patterns with sed/awk

I know this has been asked 1000 times here, but I read a lot of similar questions and still did not manage to find the right way to do this. I need to extract a number from a line that looks like this:
{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}
Expected output:
2034.2
This version number will not always be the same, but the rest of the line should stay the same.
I have tried working with sed but I am new to this and failed:
sed -e 's/version":[\(.*\),"description/\1/'
output:
sed: -e expression #1, char 35: unterminated `s' command
I think the issue is that there are too many special characters involved in the line and I did not write the command very well.
Since it's JSON, you should use JSON-aware tools to process it. If you prefer, for example, awk, the way to go is GNU awk's JSON extension. This is a small how-to.
First download and compile appropriate versions of GNU awk, Gawkextlib and gawk-json. That's pretty straightforward, actually, just ./configure and make. Then, write some code:
awk '
@load "json"                               # enable json extension
{
    lines=lines $0                         # read json file records and buffer to var lines
    if(json_fromJSON(lines,data)==1) {     # once the json is complete
        for(i in data["info"]["version"])  # that seems to be an array, so all elements
            print data["info"]["version"][i]   # are output
        lines=""                           # once done with the first json object,
    }                                      # reset the var for more lines
}' file
Output this time:
2034.2
Explained a bit more:
The JSON file structure can vary from one line to multiple lines, for example:
{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}
or:
{
"version": "4.9.123M",
"info": {
"version": [
2034.2
],
"description": ""
},
"status": "OK"
}
so we need to buffer the JSON lines with lines=lines $0 until there is a whole valid object in the variable lines. We use the extension function json_fromJSON() to determine that validity with if(json_fromJSON(lines,data)==1). Once validated, the object gets disentangled and stored into the array data. For this particular object the structure of the array is:
data["version"]="4.9.123M"
data["info"]["version"][1]="2034.2"
data["info"]["description"]=""
data["status"]="OK"
We could examine the object and produce some output of it with this recursive array scanning function:
awk '
@load "json"
function scan(a,p, q) { # a is array, p path to it, q is qnd *
    if(isarray(a))
        for(i in a) {
            q=p (p==""?"":"->") i
            scan(a[i],q)
        }
    else
        print p ":" a
}
{
    lines=lines $0
    if(json_fromJSON(lines,data)==1)
        scan(data)
}' file.json
Output:
status:OK
version:4.9.123M
info->version->1:2034.2
info->description:
*) quick'n dirty
Here is a brief example of how to output JSON from an array: https://stackoverflow.com/a/58109715/4162356
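If jq happens to be installed, the same field can also be pulled out with a one-liner (a quick sketch, assuming the structure shown above):

jq '.info.version[0]' file.json
2034.2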
If the version is always enclosed in [] and no other [ or ] is present in the line, you can try this logic:
STR='{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}'
echo $STR | awk -F'[' '{print $2}' | awk -F']' '{print $1}'
Simplest Way
Try grep when you want to extract simple text:
echo '{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}' | grep -o "\[.*\]" | sed -e 's/\[\|\]//g'
This should do:
STR='{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}'
echo "$STR" | awk -F'[][]' '{print $2}'
2034.2

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that I used the while loop incorrectly, but I can't find the error. When I execute the script below, I just get the command-line prompt back, which makes me think that something has happened. When I check the text files, nothing has changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
    sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
        mv $textfile1.tmp $textfile1
    sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
        mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of many files. As a side effect, it will load the replacements and create the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) as in your solution).
#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
    ? do {
        shift;
        my $backup_extension = $1;
        my $backup_name = $backup_extension =~ /\*/
            ? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
            : sub { shift . $backup_extension };
        my $oldargv = '-';
        sub {
            if ( $ARGV ne $oldargv ) {
                rename( $ARGV, $backup_name->($ARGV) );
                open( ARGVOUT, '>', $ARGV );
                select(ARGVOUT);
                $oldargv = $ARGV;
            }
        };
    }
    : sub { };

die "$0: File with replacements required." unless @ARGV;

my ( $re, %replace );
do {
    my $filename = shift;
    open my $fh, '<', $filename;
    %replace = map { chomp; split ',', $_, 2 } <$fh>;
    close $fh;
    $re = join '|', map quotemeta, keys %replace;
    $re = qr/($re)/;
};

while (<>) {
    $in_place->();
    s/$re/$replace{$1}/g;
}
continue { print }
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with a backup name containing a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one-pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you want to do.
BTW, sorry, but I do not understand what is wrong with your script, which should work unless your CSV file has more than 2 columns.
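One caveat with generating the sed script this way: if any of the strings contain characters that sed treats specially (/, &, \ or regex metacharacters such as . and *), the generated commands will misbehave. A hedged sketch of the same idea with the most common offenders escaped, still assuming exactly two comma-separated columns per row and GNU sed:

awk -F, '{
    lhs = $1; rhs = $2
    gsub(/[][\/\\.*^$]/, "\\\\&", lhs)   # pattern side: escape regex metacharacters, slash and backslash
    gsub(/[\/\\&]/, "\\\\&", rhs)        # replacement side: escape slash, backslash and ampersand
    print "s/" lhs "/" rhs "/g"
}' replace.csv > sed.script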
