I need to split a string into an array in bash. However, not by a fixed delimiter character, but by a regular expression. Here are the details:
I have a CSV file containing millions of lines of the following format:
123,"usrlogin1","usrname2","companyname1","email#example.com","1970-01-01 00:00:00","USR",0,1
However, some fields may contain escaped quotation marks \" and commas within strings, so a line might just as well look like this:
123,"usrlogin1","usrname2","The \"awesome\" company, NA","email#example.com","1970-01-01 00:00:00","USR",0,1
I need to split this file by the correct commas, that is to say those which are outside of non-escaped quotation marks.
Desired output:
123
usrlogin1
usrname2
The "awesome" company, NA
email#example.com
1970-01-01 00:00:00
USR
0
1
This is where I am at:
#!/bin/sh
#read file line by line
while IFS='' read -r line; do
#replace escaped quotation marks (\") with (")
line=$(echo $line | sed 's/\\"/\"\;/g')
#split this line now by commas
#TODO
done < "$1"
At the point of TODO, I would like to use the following regex as delimiter for splitting my string into an array:
(?!\B"[^"]*),(?![^"]*"\B)
Demo: https://regex101.com/r/xB7rQ7/1
How can I split the $line by the regex above?
Related
How can I escape double quotes inside a double string in Bash?
For example, in my shell script
#!/bin/bash
dbload="load data local infile \"'gfpoint.csv'\" into table $dbtable FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY \"'\n'\" IGNORE 1 LINES"
I can't get the ENCLOSED BY '\"' with double quote to escape correctly. I can't use single quotes for my variable, because I want to use variable $dbtable.
Use a backslash:
echo "\"" # Prints one " character.
A simple example of escaping quotes in the shell:
$ echo 'abc'\''abc'
abc'abc
$ echo "abc"\""abc"
abc"abc
It's done by finishing an already-opened one ('), placing the escaped one (\'), and then opening another one (').
Alternatively:
$ echo 'abc'"'"'abc'
abc'abc
$ echo "abc"'"'"abc"
abc"abc
It's done by finishing already opened one ('), placing a quote in another quote ("'"), and then opening another one (').
More examples: Escaping single-quotes within single-quoted strings
Keep in mind that you can avoid escaping by using ASCII codes of the characters you need to echo.
Example:
echo -e "This is \x22\x27\x22\x27\x22text\x22\x27\x22\x27\x22"
This is "'"'"text"'"'"
\x22 is the ASCII code (in hex) for double quotes and \x27 for single quotes. Similarly you can echo any character.
I suppose if we try to echo the above string with backslashes, we will need a messy two rows backslashed echo... :)
For variable assignment this is the equivalent:
a=$'This is \x22text\x22'
echo "$a"
# Output:
This is "text"
If the variable is already set by another program, you can still apply double/single quotes with sed or similar tools.
Example:
b="Just another text here"
echo "$b"
Just another text here
sed 's/text/"'\0'"/' <<<"$b" #\0 is a special sed operator
Just another "0" here #this is not what i wanted to be
sed 's/text/\x22\x27\0\x27\x22/' <<<"$b"
Just another "'text'" here #now we are talking. You would normally need a dozen of backslashes to achieve the same result in the normal way.
Bash allows you to place strings adjacently, and they'll just end up being glued together.
So this:
echo "Hello"', world!'
produces
Hello, world!
The trick is to alternate between single and double-quoted strings as required. Unfortunately, it quickly gets very messy. For example:
echo "I like to use" '"double quotes"' "sometimes"
produces
I like to use "double quotes" sometimes
In your example, I would do it something like this:
dbtable=example
dbload='load data local infile "'"'gfpoint.csv'"'" into '"table $dbtable FIELDS TERMINATED BY ',' ENCLOSED BY '"'"'"' LINES "'TERMINATED BY "'"'\n'"'" IGNORE 1 LINES'
echo $dbload
which produces the following output:
load data local infile "'gfpoint.csv'" into table example FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY "'\n'" IGNORE 1 LINES
It's difficult to see what's going on here, but I can annotate it using Unicode quotes. The following won't work in Bash – it's just for illustration:
dbload=‘load data local infile "’“'gfpoint.csv'”‘" into ’“table $dbtable FIELDS TERMINATED BY ',' ENCLOSED BY '”‘"’“' LINES ”‘TERMINATED BY "’“'\n'”‘" IGNORE 1 LINES’
The quotes like “ ‘ ’ ” in the above will be interpreted by bash. The quotes like " ' will end up in the resulting variable.
If I give the same treatment to the earlier example, it looks like this:
echo “I like to use” ‘"double quotes"’ “sometimes”
Store the double quote character in a variable:
dqt='"'
echo "Double quotes ${dqt}X${dqt} inside a double quoted string"
Output:
Double quotes "X" inside a double quoted string
Check out printf...
#!/bin/bash
mystr="say \"hi\""
Without using printf
echo -e $mystr
Output: say "hi"
Using printf
echo -e $(printf '%q' $mystr)
Output: say \"hi\"
Make use of $"string".
In this example, it would be,
dbload=$"load data local infile \"'gfpoint.csv'\" into table $dbtable FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY \"'\n'\" IGNORE 1 LINES"
Note (from the man page):
A double-quoted string preceded by a dollar sign ($"string") will cause the string to be translated according to the current locale. If the current locale is C or POSIX, the dollar sign is ignored. If the string is translated and replaced, the replacement is double-quoted.
For use with variables that might contain spaces in you Bash script, use triple quotes inside the main quote, e.g.:
[ "$(date -r """$touchfile""" +%Y%m%d)" -eq "$(date +%Y%m%d)" ]
Add "\" before double quote to escape it, instead of \
#! /bin/csh -f
set dbtable = balabala
set dbload = "load data local infile "\""'gfpoint.csv'"\"" into table $dbtable FIELDS TERMINATED BY ',' ENCLOSED BY '"\""' LINES TERMINATED BY "\""'\n'"\"" IGNORE 1 LINES"
echo $dbload
# load data local infile "'gfpoint.csv'" into table balabala FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY "''" IGNORE 1 LINES
I have key value pairs in a string like this:
key1 = "value1"
key2 = "value2"
key3 = "value3"
In a bash script, I need to extract the value of one of the keys like for key2, I should get value2, not in quote.
My bash script needs to work in both Redhat and Ubuntu Linux hosts.
What would be the easiest and most reliable way of doing this?
I tried something like this simplified script:
pattern='key2\s*=\s*\"(.*?)\".*$'
if [[ "$content" =~ $pattern ]]
then
key2="${BASH_REMATCH[1]}"
echo "key2: $key2"
else
echo 'not found'
fi
But it does not work consistently.
Any better/easier/more reliable way of doing this?
To separate the key and value from your $content variable, you can use:
[[ $content =~ (^[^ ]+)[[:blank:]]*=[[:blank:]]*[[:punct:]](.*)[[:punct:]]$ ]]
That will properly populate the BASH_REMATCH array with both values where your key is in BASH_REMATCH[1] and the value in BASH_REMATCH[2].
Explanation
In bash the [[...]] treats what appears on the right side of =~ as an extended regular expression and matched according to man 3 regex. See man 1 bash under the section heading for [[ expression ]] (4th paragraph). Sub-expressions in parenthesis (..) are saved in the array variable BASH_REMATCH with BASH_REMATCH[0] containing the entire portion of the string (your $content) and each remaining elements containing the sub-expressions enclosed in (..) in the order the parenthesis appear in the regex.
The Regular Expression (^[^ ]+)[[:blank:]]*=[[:blank:]]*[[:punct:]](.*)[[:punct:]]$ is explained as:
(^[^ ]+) - '^' anchored at the beginning of the line, [^ ]+ match one or more characters that are not a space. Since this sub-expression is enclosed in (..) it will be saved as BASH_REMATCH[1], followed by;
[[:blank:]]* - zero or more whitespace characters, followed by;
= - an equal sign, followed by;
[[:blank:]]* - zero or more whitespace characters, followed by;
[[:punct:]] - a punctuation character (matching the '"', which avoids caveats associated with using quotes within the regex), followed by the sub-expression;
(.*) - zero or more characters (the rest of the characters), and since it is a sub-expression in (..) it the characters will be stored in BASH_REMATCH[2], followed by;
[[:punct:]] - a punctuation character (matching the '"' ... ditto), at the;
$ - end of line anchor.
So if you match what your key and value input lines separated by an = sign, it will separate the key and value into the array BASH_REMATCH as you wanted.
Bash supports BRE only and you cannot use \s and .*?.
As an alternative, please try:
while IFS= read -r content; do
# pattern='key2\s*=\s*\"(.*)\".*$'
pattern='key2[[:blank:]]*=[[:blank:]]*"([^"]*)"'
if [[ $content =~ $pattern ]]
then
key2="${BASH_REMATCH[1]}"
echo "key2: $key2"
(( found++ ))
fi
done < input-file.txt
if (( found == 0 )); then
echo "not found"
fi
What you start talking about key-value pairs, it is best to use an associative array:
declare -A map
Now looking at your lines, they look like key = "value" where we assume that:
value is always encapsulated by double quotes, but also could contain a quote
an unknown number of white spaces is before and/or after the equal sign.
So assuming we have a variable line which contains key = "value", the following operations will extract that value:
key="${line%%=*}"; key="${key// /}"
value="${line#*=}"; value="${value#*\042}"; value="${value%\042*}"
IFS=" \t=" read -r value _ <<<"$line"
This allows us now to have something like:
declare -A map
while read -r line; do
key="${line%%=*}"; key="${key// /}"
value="${line#*=}"; value="${value#*\042}"; value="${value%\042*}"
map["$key"]="$value"
done <inputfile
With awk:
awk -v key="key2" '$1 == key { gsub("\"","",$3);print $3 }' <<< "$string"
Reading the output of the variable called string, pass the required key in as a variable called key and then if the first space delimited field is equal to the key, remove the quotes from the third field with the gsub function and print.
Ok, after spending so many hours, this is how I solved the problem:
If you don't know where your script will run and what type of file (win/mac/linux) are you reading:
Try to avoid non-greedy macth in linux bash instead of tweaking diffrent switches.
don't trus end of line match $ when you might get data from windows or mac
This post solved my problem: Non greedy text matching and extrapolating in bash
This pattern works for me in may linux environments and all type of end of lines:
pattern='key2\s*=\s*"([^"]*)"'
The value is in BASH_REMATCH[1]
I have several (JSON) files in a directory. I am wanting to scan all files for the line with the audio property (first string in quotes) and replace all non-alphanumeric characters in the associated value (second string with quotes) with an underscore. For example, if the line is:
"audio": "Sub Dir/my file` nāme.mp3"
I would like to replace it with:
"audio": "Sub_Dir/my_file__n_me.mp3"
What can I do to replace these lines in multiple files?
If perl is your option, would you please try:
perl -i".bak" -pe 's/("audio":\s"*)(.+?)"/
$x = $1;
($y = $2) =~ s#[^A-Za-z0-9_\/\.]#_#g;
$x . $y . "\""/e' *.json
The e option to the s/../../ function enables the substitution
to be the result of evaluation of perl expression.
The substition s#[^A-Za-z0-9_\/\.]#_#g is performed on the
second string with quotes and replaces characters other
than ascii alphabets, digits, underscores, slashes, dots with underscores.
Original files are saved with the suffix .bak.
one_two_three_four_five.rtf
I need five in A variable
I need four in B variable
And remaining in C variable
Should read from the last character
Note after 2 underscore from the last. There could be many underscores but should take has C variable.
Is it possible?
For example using parameter expansion
#!/bin/ksh
string="one_two_three_four_five.rtf"
base=${string%.rtf}
a=${base##*_}; base=${base%_$a}
b=${base##*_}; base=${base%_$b}
c=$base
echo "$a - $b - $c"
s="one_two_three_four_five.rtf"
source <(sed -r 's/(.*)_([^_]*)_([^_]*)[.].*/C="\1"; B="\2";A="\3"/' <<< "${s}")
# Result:
echo "A=$A, B=$B, C=$C"
A=five, B=four, C=one_two_three
Explanation:
sed -r No need for escaping backslashes
(.*)_ Matches largest string until underscore with the condition that there are underscores left for matching the remaining string
([^_]*) String without underscore
[.] A dot without special meaning
"\1" First remembered string
<<< "${s}" Input for sed is like echo "${s}" | sed ...
<(..) Simulates a file, so sourcing these will execute the commands.
I wrote 2 difference scripts but I am stuck at the same problem.
The problem is am making a table from a file ($2) that I get in args and $1 is the numbers of columns. A little bit hard to explain but I am gonna show you input and output.
The problem is now that I don't know how I can save every column now in a difference var so i can build it in my HTML code later
#printf #TR##TD#$...#/TD##TD#$...#/TD##TD#$..#/TD##/TR##TD#$...
so input look like that :
Name\tSize\tType\tprobe
bla\t4711\tfile\t888888888
abcde\t4096\tdirectory\t5555
eeeee\t333333\tblock\t6666
aaaaaa\t111111\tpackage\t7777
sssss\t44444\tfile\t8888
bbbbb\t22222\tfolder\t9999
Code :
c=1
column=$1
file=$2
echo "$( < $file)"| while read Line ; do
Name=$(sed "s/\\\t/ /g" $file | cut -d' ' -f$c,-$column)
printf "$Name \n"
#let c=c+1
#printf "<TR><TD>$Name</TD><TD>$Size</TD><TD>$Type</TD></TR>\n"
exit 0
done
Output:
Name Size Type probe
bla 4711 file 888888888
abcde 4096 directory 5555
eeeee 333333 block 6666
aaaaaa 111111 package 7777
sssss 44444 file 8888
bbbbb 22222 folder 9999
This is tailor-made job for awk. See this script:
awk -F'\t' '{printf "<tr>";for(i=1;i<=NF;i++) printf "<td>%s</td>", $i;print "</tr>"}' input
<tr><td>bla</td><td>4711</td><td>file</td><td>888888888</td></tr>
<tr><td>abcde</td><td>4096</td><td>directory</td><td>5555</td></tr>
<tr><td>eeeee</td><td>333333</td><td>block</td><td>6666</td></tr>
<tr><td>aaaaaa</td><td>111111</td><td>package</td><td>7777</td></tr>
<tr><td>sssss</td><td>44444</td><td>file</td><td>8888</td></tr>
<tr><td>bbbbb</td><td>22222</td><td>folder</td><td>9999</td></tr>
In bash:
celltype=th
while IFS=$'\t' read -a columns; do
rowcontents=$( printf '<%s>%s</%s>' "$celltype" "${columns[#]}" "$celltype" )
printf '<tr>%s</tr>\n' "$rowcontents"
celltype=td
done < <( sed $'s/\\\\t/\t/g' "$2")
Some explanations:
IFS=$'\t' read -a columns reads a line from standard input, using only the tab character to separate fields, and putting each field into a separate element of the array columns. We change IFS so that other whitespace, which could occur in a field, is not treated as a field delimiter.
On the first line read from standard input, <th> elements will be output by the printf line. After resetting the value of celltype at the end of the loop body, all subsequent rows will consist of <td> elements.
When setting the value of rowcontents, take advantage of the fact that the first argument is repeated as many times as necessary to consume all the arguments.
Input is via process substitution from the sed command, which requires a crazy amount of quoting. First, the entire argument is quoted with $'...', which tells bash to replace escaped characters. bash converts this to the literal string s/\\t/^T/g, where I am using ^T to represent a literal ASCII 09 tab character. When sed sees this argument, it performs its own escape replacement, so the search text is a literal backslash followed by a literal t, to be replaced by a literal tab character.
The first argument, the column count, is unnecessary and is ignored.
Normally, you avoid making the while loop part of a pipeline because you set parameters in the loop that you want to use later. Here, all the variables are truly local to the while loop, so you could avoid the process substitution and use a pipeline if you wish:
sed $'s/\\\\t/\t/g' "$2" | while IFS=$'\t' read -a columns; do
...
done