Multiline CSV: output on a single line, with double-quoted input lines, using a different separator - bash

I'm trying to get a multiline output from a CSV into one line in Bash.
My CSV file looks like this:
hi,bye
hello,goodbye
The end goal is for it to look like this:
"hi/bye", "hello/goodbye"
This is currently where I'm at:
INPUT=mycsvfile.csv
while IFS=, read col1 col2 || [ -n "$col1" ]
do
source=$(awk '{print;}' | sed -e 's/,/\//g' )
echo "$source";
done < $INPUT
The output is on every line and I'm able to change the , to a / but I'm not sure how to put the output on one line with quotes around it.
I've tried BEGIN:
source=$(awk 'BEGIN { ORS=", " }; {print;}'| sed -e 's/,/\//g' )
But this only outputs the last line, and omits the first hi/bye:
hello/goodbye
Would anyone be able to help me?

Just do the whole thing (mostly) in awk. The final sed is just here to trim some trailing cruft and inject a newline at the end:
< mycsvfile.csv awk '{print "\""$1, $2"\""}' FS=, OFS=/ ORS=", " | sed 's/, $//'

If you're willing to install trl, a utility of mine, the command can be simplified as follows:
input=mycsvfile.csv
trl -R '| ' < "$input" | tr ',|' '/,'
trl transforms multiline input into double-quoted single-line output separated by ,<space> by default.
-R '| ' (temporarily) uses |<space> as the separator instead; this assumes that your data doesn't contain | instances, but you can choose any character that you know not to be part of your data.
tr ',|' '/,' then translates all , instances (field-internal to the input lines) into / instances, and all | instances (the temporary separator) into , instances, yielding the overall result as desired.
Installation of trl from the npm registry (Linux and macOS)
Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash
With Node.js installed, install as follows:
[sudo] npm install trl -g
Note:
Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
The -g ensures global installation and is needed to put trl in your system's $PATH.
Manual installation (any Unix platform with bash)
Download this bash script as trl.
Make it executable with chmod +x trl.
Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).

$ awk -F, -v OFS='/' -v ORS='"' '{$1=s ORS $1; s=", "; print} END{printf RS}' file
"hi/bye", "hello/goodbye"

There is no need for a bash loop, which is invariably slow.
sed and tr can do this more efficiently:
input=mycsvfile.csv
sed 's/,/\//g; s/.*/"&", /; $s/, $//' "$input" | tr -d '\n'
s/,/\//g replaces all (g) , instances with / instances (escaped as \/ here).
s/.*/"&", / encloses the resulting line in "...", followed by ,<space>:
regex .* matches the entire pattern space (the potentially modified input line)
& in the replacement string represents that match.
$s/, $// removes the undesired trailing ,<space> from the final line ($)
tr -d '\n' then simply removes the newlines (\n) from the result, because sed invariably outputs each line with a trailing newline.
Note that the above command's single-line output will not have a trailing newline; simply append ; printf '\n' if it is needed.
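As a minimal sketch, the pipeline above can be wrapped in a function that also appends the final newline. The name csv_to_quoted_line is made up for illustration, and the sketch assumes that the fields themselves contain no commas or double quotes.

```shell
# csv_to_quoted_line: hypothetical helper name, not part of the original answer.
# Reads CSV lines from stdin (or from the files given as arguments).
csv_to_quoted_line() {
  # 1) turn every , into /
  # 2) wrap each line in "..." and append ", "
  # 3) drop the trailing ", " on the last line
  sed 's/,/\//g; s/.*/"&", /; $s/, $//' "$@" | tr -d '\n'
  printf '\n'   # tr removed every newline, so add one back at the end
}

printf 'hi,bye\nhello,goodbye\n' | csv_to_quoted_line
# prints: "hi/bye", "hello/goodbye"
```

With a file, the call would be csv_to_quoted_line mycsvfile.csv.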

In awk:
$ awk '{sub(/,/,"/");gsub(/^|$/,"\"");b=b (NR==1?"":", ")$0}END{print b}' file
"hi/bye", "hello/goodbye"
Explained:
$ awk '
{
sub(/,/,"/") # replace comma
gsub(/^|$/,"\"") # add quotes
b=b (NR==1?"":", ") $0 # buffer to add delimiters
}
END { print b } # output
' file

I'm assuming you just have 2 lines in your file? If you have alternating 2 line pairs, let me know in comments and I will expand for that general case. Here is a one-line awk conversion for you:
# NOTE: I am using the octal ascii code for the
# double quote char (\42=") in my printf statement
$ awk '{gsub(/,/,"/")}NR==1{printf("\42%s\42, ",$0)}NR==2{printf("\42%s\42\n",$0)}' file
output:
"hi/bye", "hello/goodbye"

Here is my attempt in awk:
awk 'BEGIN{ ORS = " " }{ a++; gsub(/,/, "/"); gsub(/[a-z]+\/[a-z]+/, "\"&\""); print $0; if (a == 1){ print "," }}{ if (a==2){ printf "\n"; a = 0 } }'
This also works if your input has more than two lines. If you need some explanation, feel free to ask :)

Related

Unix sed command - global replacement is not working

I have a scenario where I want to replace runs of multiple double quotes inside the data with a single double quote. Because the input is comma-delimited and every column is enclosed in double quotes, I ran into the issue explained below:
The sample data looks like this:
"int","","123","abd"""sf123","top"
So, the output would be:
"int","","123","abd"sf123","top"
I tried the approach below, but only the first occurrence is handled and I'm not sure what the issue is:
sed -ie 's/,"",/,"NULL",/g;s/""/"/g;s/,"NULL",/,"",/g' inputfile.txt
Step 1: replace every ,"", with ,"NULL",
Step 2: collapse every run of multiple quotes (""", "" or """") into a single "
Step 3: revert step 1 by replacing ,"NULL", with ,"",
But, only first occurrence is getting changed and remaining looks same as below:
If input is :
"int","","","123","abd"""sf123","top"
the output is coming as:
"int","","NULL","123","abd"sf123","top"
But, the output should be:
"int","","","123","abd"sf123","top"
You may try this perl with a lookahead:
perl -pe 's/("")+(?=")//g' file
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
Where input is:
cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
Breakup:
("")+: Match 1+ pairs of double quotes
(?="): If those pairs are followed by a single "
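The same "pairs followed by one more quote" idea can be sketched with sed -E, which has no lookahead and therefore consumes the final quote and puts it back in the replacement. The function name collapse_quotes is hypothetical.

```shell
# collapse_quotes: hypothetical helper; reads stdin or the files given as arguments.
# One or more "" pairs followed by a further " are replaced by a single ".
collapse_quotes() {
  sed -E 's/("")+"/"/g' "$@"
}

printf '%s\n' '"int","","","123","abd"""sf123","top"' | collapse_quotes
# prints: "int","","","123","abd"sf123","top"
```

Empty fields ("") stay untouched because they are not followed by a third quote. Note that only odd-length runs collapse to a single quote this way; an even-length run of four or more quotes ends up as "", which may or may not be what you want.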
Using sed
$ sed -E 's/(,"",)?"+(",)?/\1"\2/g' input_file
"int","","123","abd"sf123","top"
"int","","NULL","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
In awk with your shown samples please try following awk code. Written and tested in GNU awk, should work in any version of awk.
awk '
BEGIN{ FS=OFS="," }
{
for(i=1;i<=NF;i++){
if($i!~/^""$/){
gsub(/"+/,"\"",$i)
}
}
}
1
' Input_file
Explanation: Set the field separator and output field separator to , for all lines of Input_file. Then loop over each field of the line; if a field is not empty (that is, not exactly ""), globally replace every run of one or more " with a single ". Finally, print the line.
With sed you could repeat 1 or more times sets of "" using a group followed by matching a single "
Then in the replacement use a single "
sed -E 's/("")+"/"/g' file
For this content
$ cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
The output is
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
sed s'#"""#"#' file
That works. I will demonstrate another method though, which you may also find useful in other situations.
#!/bin/sh -x
cat > ed1 <<EOF
4s/"""/"/
wq
EOF
cp file stack
cat stack | tr ',' '\n' > f2
ed -s f2 < ed1
cat f2 | tr '\n' ',' > stack
rm -v ./f2
rm -v ./ed1
The point of this is that if you have a big CSV record all on one line and you want to edit a specific field, then, if you know the field number, you can convert all the commas to newlines and use the field number as a line number to substitute, append after it, or insert before it with ed, and then convert back to CSV.
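For comparison, the "edit field N" idea can also be sketched with awk, which addresses fields by number directly. The name edit_field is hypothetical, and the sketch assumes that no field contains a comma inside its quotes.

```shell
# edit_field N: hypothetical helper; replaces the first """ in field N with a single ".
# Records with fewer than N fields are printed unchanged.
edit_field() {
  awk -F, -v OFS=, -v n="$1" 'n <= NF { sub(/"""/, "\"", $n) } 1'
}

printf '%s\n' '"int","","123","abd"""sf123","top"' | edit_field 4
# prints: "int","","123","abd"sf123","top"
```

Because FS and OFS are both a comma, the record is rebuilt with the same separators after the field is modified.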

convert a file content using shell script

Hello everyone, I'm a beginner in shell coding. On a daily basis I need to convert a file's data to another format. I usually do it manually in a text editor, but I often make mistakes, so I decided to write a simple script that can do the work for me.
The file's content like this
/release201209
a1,a2,"a3",a4,a5
b1,b2,"b3",b4,b5
c1,c2,"c3",c4,c5
to this:
a2>a3
b2>b3
c2>c3
The script should ignore the first line and print the second and third values separated by '>'
I'm half way there, and here is my code
#!/bin/bash
#while Loops
i=1
while IFS=\" read t1 t2 t3
do
test $i -eq 1 && ((i=i+1)) && continue
echo $t1|cut -d\, -f2 | { tr -d '\n'; echo \>$t2; }
done < $1
The problem with my code is that the last line isn't printed unless the file ends with a newline (\n).
I also want the output to be written to a new CSV file (I tried redirecting standard output to my new file, but only the last echo ended up there).
Can someone please help me out? Thanks in advance.
Rather than treating the double quotes as a field separator, it seems cleaner to just delete them (assuming that is valid). Eg:
$ < input tr -d '"' | awk 'NR>1{print $2,$3}' FS=, OFS=\>
a2>a3
b2>b3
c2>c3
If you cannot just strip the quotes as in your sample input but those quotes are escaping commas, you could hack together a solution but you would be better off using a proper CSV parsing tool. (eg perl's Text::CSV)
Here's a simple pipeline that will do the trick:
sed '1d' data.txt | cut -d, -f2-3 | tr -d '"' | tr ',' '>'
Here, we're just removing the first line (as desired), selecting fields 2 & 3 (based on a comma field separator), removing the double quotes and mapping the remaining , to >.
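If you would rather keep a while loop, here is a minimal sketch that addresses both problems from the question. The function name extract_pairs is hypothetical, and it assumes the quotes never protect embedded commas.

```shell
# extract_pairs: hypothetical helper; reads the data on stdin.
extract_pairs() {
  n=0
  # Strip all double quotes first, then split each line on commas.
  # `|| [ -n "$f1" ]` keeps a final line that has no trailing newline.
  tr -d '"' | while IFS=, read -r f1 f2 f3 rest || [ -n "$f1" ]; do
    n=$((n + 1))
    [ "$n" -eq 1 ] && continue      # skip the first (header) line
    printf '%s>%s\n' "$f2" "$f3"
  done
}

printf '/release201209\na1,a2,"a3",a4,a5\nb1,b2,"b3",b4,b5' | extract_pairs
# prints:
# a2>a3
# b2>b3
```

To write the result to a file, redirect once for the whole function, for example extract_pairs < input > output.csv. Redirecting with > on each echo inside the loop truncates the file every time, which would presumably explain why only the last line survived.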
Use this Perl one-liner:
perl -F',' -lane 'next if $. == 1; print join ">", map { tr/"//d; $_ } @F[1,2]' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Update version number in property file using bash

I am new in bash scripting and I need help with awk. So the thing is that I have a property file with version inside and I want to update it.
version=1.1.1.0
and I use awk to do that
file="version.properties"
awk -F'["]' -v OFS='"' '/version=/{
split($4,a,".");
$4=a[1]"."a[2]"."a[3]"."a[4]+1
}
;1' $file > newFile && mv newFile $file
but I am getting strange result version="1.1.1.0""...1
Could someone help me please with this.
You mentioned in your comment you want to update the file in place. You can do that in a one-liner with perl:
perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i version.properties
Explanation
-e is followed by a script to run. With -p and -i, the effect is to run that script on each line, and modify the file in place if the script changes anything.
The script itself, broken down for explanation, is:
/^version=/ and # Do the following on lines starting with `version=`
s/ # Make a replacement on those lines
(\d+\.\d+\.\d+\.)(\d+)/ # Match x.y.z.w, and set $1 = `x.y.z.` and $2 = `w`
$1 . ($2+1)/ # Replace x.y.z.w with a copy of $1, followed by w+1
e # This tells Perl the replacement is Perl code rather
# than a text string.
Example run
$ cat foo.txt
version=1.1.1.2
$ perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i foo.txt
$ cat foo.txt
version=1.1.1.3
This is not the best way, but here's one fix.
Test case
I am assuming the input file has at least one line that is exactly version=1.1.1.0.
$ awk -F'["]' -v OFS='"' '/version=/{
> split($4,a,".");
> $4=a[1]"."a[2]"."a[3]"."a[4]+1
> }
> ;1' <<<'version=1.1.1.0'
Output:
version=1.1.1.0"""...1
The """ is because you are assigning to field 4 ($4). When you do that, awk adds field separators (OFS) between fields 1 and 2, 2 and 3, and 3 and 4. Three OFS => """, in your example.
Minimal change
$ awk -F'["]' -v OFS='"' '/version=/{
split($1,a,".");
$1=a[1]"."a[2]"."a[3]"."a[4]+1;
print
}
' <<<'version=1.1.1.0'
version=1.1.1.1
Two changes:
Change $4 to $1
Since the input field separator (-F) is ["], $4 is whatever would be after the third " (if there were any in the input). Therefore, split($4, ...) splits an empty field. The contents of the line, before the first " (if any), are in $1.
print at the end instead of ;1
The 1 after the closing curly brace is the next condition, and there is no action specified. The default action is to print the current line, as modified, so the 1 triggers printing. Instead, just print within your action when you are done processing. That way your action is self-contained. (Of course, if you needed to do other processing, you might want to print later, after that processing.)
You can use the = as the delimiter, like this:
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v}' file.properties
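If the goal is to increment the last component rather than set a fixed value, a minimal awk sketch is to split on the dots themselves. The name bump_version is hypothetical, and the sketch assumes an unquoted value such as version=1.1.1.0.

```shell
# bump_version: hypothetical helper; reads stdin or the files given as arguments.
# With . as both input and output separator, the last field ($NF) is the
# final version component; every other line is printed unchanged.
bump_version() {
  awk -F. -v OFS=. '/^version=/ { $NF = $NF + 1 } 1' "$@"
}

printf 'version=1.1.1.0\n' | bump_version
# prints: version=1.1.1.1
```

Because OFS equals FS here, rebuilding the record after the assignment does not introduce stray separators. To update the file as in the question, something like bump_version version.properties > newFile && mv newFile version.properties should do.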

Looking for a regex pattern, passing that pattern to a script, and replacing the pattern with the output of the script

For every time the pattern shows up (In this example the case of a 2 digit number) I want to pass that pattern to a script and replace that pattern with the output of a script.
I'm using sed. An example of what it should look like:
echo 'siedi87sik65owk55dkd' | sed 's/[0-9][0-9]/.\/script.sh/g'
Right now this returns
siedi./script.shsik./script.showk./script.shdkd
But I would like it to return
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
This is what is in ./script.sh
#!/bin/bash
echo "!!!$1!!!"
It has to be replaced with the output. In this example I know I could just use a normal sed substitution but I don't want that as an answer.
sed is for simple substitutions on individual lines, that is all. Anything else, even if it can be done, requires arcane language constructs that became obsolete in the mid-1970s when awk was invented and are used today purely for the mental exercise. Your problem is not a simple substitution so you shouldn't try to use sed to solve it.
You're going to want something like:
awk '{
head = ""
tail = $0
while ( match(tail,/[0-9]{2}/) ) {
tgt = substr(tail,RSTART,RLENGTH)
cmd = "./script.sh " tgt
if ( (cmd | getline line) > 0) {
tgt = line
}
close(cmd)
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
e.g. using an echo in place of your script.sh command:
$ echo 'siedi87sik65owk55dkd' |
awk '{
head = ""
tail = $0
while ( match(tail,/[0-9]{2}/) ) {
tgt = substr(tail,RSTART,RLENGTH)
cmd = "echo !!!" tgt "!!!"
if ( (cmd | getline line) > 0) {
tgt = line
}
close(cmd)
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
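The same head/tail loop can be sketched in bash alone using [[ =~ ]] and BASH_REMATCH. This is only an illustration: replace_matches and wrap are hypothetical names, and the sketch assumes the pattern is an ERE without anchors that never matches the empty string.

```shell
# replace_matches ERE COMMAND [ARGS...]: hypothetical bash-only helper.
# Reads lines from stdin and replaces each match with the output of
# COMMAND ARGS... MATCH.
replace_matches() {
  local re=$1; shift
  local line head tail m before
  while IFS= read -r line || [ -n "$line" ]; do
    head= tail=$line
    while [[ $tail =~ $re ]]; do
      m=${BASH_REMATCH[0]}
      [ -n "$m" ] || break            # guard against empty matches
      before=${tail%%"$m"*}           # text before the first occurrence of the match
      head+=$before$("$@" "$m")       # append that text plus the command output
      tail=${tail#"$before$m"}        # continue after the match
    done
    printf '%s\n' "$head$tail"
  done
}

wrap() { echo "!!!$1!!!"; }           # stand-in for ./script.sh
echo 'siedi87sik65owk55dkd' | replace_matches '[0-9]{2}' wrap
# prints: siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
```

With the real script, the call would be replace_matches '[0-9]{2}' ./script.sh. Like the awk version, this runs the command once per match, so it is slow on large inputs.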
Ed's awk solution is obviously the way to go here.
For fun, I tried to come up with a sed solution, and here is (a convoluted GNU sed) one that takes the pattern and the script to be run as parameters; the input is either read from standard input (i.e., you can pipe to it) or from a file supplied as the third argument.
For your example, we'd have infile with contents
siedi87sik65owk55dkd
siedi11sik22owk33dkd
(two lines to demonstrate how this works for multiple lines), then script with contents
#!/bin/bash
echo "!!!${1}!!!"
and finally the solution script itself, so. Usage is
./so pattern script [input]
where pattern is an extended regular expression as understood by GNU sed (with the -r option), script is the name of the command you want to run for each match, and the optional input is the name of the input file if input is not standard input.
For your example, this would be
./so '[[:digit:]]{2}' script infile
or, as a filter,
cat infile | ./so '[[:digit:]]{2}' script
with output
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
siedi!!!11!!!sik!!!22!!!owk!!!33!!!dkd
This is what so looks like:
#!/bin/bash
pat=$1 # The pattern to match
script=$2 # The command to run for each pattern
infile=${3:-/dev/stdin} # Read from standard input if not supplied
# Use sed and have $pattern and $script expand to the supplied parameters
sed -r "
:build_loop # Label to loop back to
h # Copy pattern space to hold space
s/.*($pat).*/.\/\"$script\" \1/ # (1) Extract last match and prepare command
# Replace pattern space with output of command
e
G # (2) Append hold space to pattern space
s/(.*)$pat(.*)/\1~~~\2/ # (3) Replace last match of pattern with ~~~
/\n[^\n]*$pat[^\n]*$/b build_loop # Loop if string contains match
:fill_loop # Label for second loop
s/(.*\n)(.*)\n([^\n]*)~~~([^\n]*)$/\1\3\2\4/ # (4) Replace last ~~~
t fill_loop # Loop if there was a replacement
s/(.*)\n(.*)~~~(.*)$/\2\1\3/ # (5) Final ~~~ replacement
" < "$infile"
The sed command works with two loops. The first one copies the pattern space to the hold space, then removes everything but the last match from the pattern space and prepares the command to be run. After the substitution with (1) in its comment, the pattern space looks like this:
./script 55
The e command (a GNU extension) then replaces the pattern space with the output of this command. After this, G appends the hold space to the pattern space (2). The pattern space now looks like this:
!!!55!!!
siedi87sik65owk55dkd
The substitution at (3) replaces the last match with a string hopefully not equal to the pattern and we get
!!!55!!!
siedi87sik65owk~~~dkd
The loop repeats if the last line of the pattern space still has a match for the pattern. After three loops, the pattern space looks like this:
!!!87!!!
!!!65!!!
!!!55!!!
siedi~~~sik~~~owk~~~dkd
The second loop now replaces the last ~~~ with the second to last line of the pattern space with substitution (4). The command uses lots of "not a newline" ([^\n]) to make sure we're not pulling the wrong replacement for ~~~.
Because of the way command (4) is written, the loop ends with one last substitution to go, so before command (5), we have this pattern space:
!!!87!!!
siedi~~~sik!!!65!!!owk!!!55!!!dkd
Command (5) is a simpler version of command (4), and after it, the output is as desired.
This seems to be fairly robust and can deal with spaces in the name of the script to be run as long as it's properly quoted when calling:
./so '[[:digit:]]{2}' 'my script' infile
This would fail if
The input file contains ~~~ (solvable by replacing all occurrences at the start, putting them back at the end)
The output of script contains ~~~
The pattern contains ~~~
i.e., the solution very much depends on ~~~ being unique.
Because nobody asked: so as a one-liner.
#!/bin/bash
sed -re ":b;h;s/.*($1).*/.\/\"$2\" \1/;e" -e "G;s/(.*)$1(.*)/\1~~~\2/;/\n[^\n]*$1[^\n]*$/bb;:f;s/(.*\n)(.*)\n([^\n]*)~~~([^\n]*)$/\1\3\2\4/;tf;s/(.*)\n(.*)~~~(.*)$/\2\1\3/" < "${3:-/dev/stdin}"
Still works!
A conceptually simpler multi-utility solution:
Using GNU utilities:
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' |
xargs -d'\n' -I% sh -c 'echo '\"%\"
Using BSD utilities (also works with GNU utilities):
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' | tr '\n' '\0' |
xargs -0 -I% sh -c 'echo '\"%\"
The idea is to use sed to translate the tokens of interest lexically into a string containing shell command substitutions that invoke the target script with the token, and then pass the result to the shell for evaluation.
Note:
Any embedded " and $ characters in the input must be \-escaped.
xargs -d'\n' (GNU) and tr '\n' '\0' / xargs -0 (BSD) are only needed to correctly preserve whitespace in the input - if that is not needed, the following POSIX-compliant solution will do:
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' |
xargs -I% sh -c 'printf "%s\n" '\"%\"

Bash command to extract characters in a string

I want to write a small script to generate the location of a file in an NGINX cache directory.
The format of the path is:
/path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
Note the last 6 characters: d8 40 32, are represented in the path.
As an input I give the md5 hash (13febd65d65112badd0aa90a15d84032) and I want to generate the output: d8/40/32/13febd65d65112badd0aa90a15d84032
I'm sure sed or awk will be handy, but I don't know yet how...
This awk can make it:
awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}'
Explanation
BEGIN{FS=""; OFS="/"}. FS="" sets the input field separator to be "", so that every char will be a different field. OFS="/" sets the output field separator as /, for print matters.
print ... $(NF-1)$NF, $0 prints the penultimate field and the last one all together; then, the whole string. The comma is "filled" with the OFS, which is /.
Test
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' <<< "13febd65d65112badd0aa90a15d84032"
d8/40/32/13febd65d65112badd0aa90a15d84032
Or with a file:
$ cat a
13febd65d65112badd0aa90a15d84032
13febd65d65112badd0aa90a15f1f2f3
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' a
d8/40/32/13febd65d65112badd0aa90a15d84032
f1/f2/f3/13febd65d65112badd0aa90a15f1f2f3
With sed:
echo '13febd65d65112badd0aa90a15d84032' | \
sed -n 's/\(.*\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\)$/\2\/\3\/\4\/\1/p;'
With GNU sed you can simplify the pattern further using the -r option, so that you no longer need to escape {} and (). Using ~ as the regex delimiter lets you use the path separator / without escaping it:
sed -nr 's~(.*([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2}))$~\2/\3/\4/\1~p;'
Output:
d8/40/32/13febd65d65112badd0aa90a15d84032
Put simply, the pattern matches:
(all (n-5 - n-4) (n-3 - n-2) (n-1 - n-0))
and replaces it with
\2/\3/\4/\1
that is, the three trailing two-character groups followed by the whole hash.
You can use a regular expression to separate each of the last 3 bytes from the rest of the hash.
hash=13febd65d65112badd0aa90a15d84032
[[ $hash =~ (..)(..)(..)$ ]]
new_path="/path/to/nginx/cache/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}/${BASH_REMATCH[3]}/$hash"
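As a minimal sketch, the regex approach above can be wrapped in a function that also validates its input. The name cache_path and the error message are made up for illustration; the directory order follows the layout given in the question.

```shell
# cache_path BASE HASH: hypothetical bash helper.
# Prints BASE/xx/yy/zz/HASH, where xx yy zz are the last three
# two-character groups of a 32-digit lowercase hex hash.
cache_path() {
  local base=$1 hash=$2
  local re='^[0-9a-f]{26}([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$'
  if [[ $hash =~ $re ]]; then
    printf '%s/%s/%s/%s/%s\n' "$base" \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" "$hash"
  else
    printf 'not an MD5 hash: %s\n' "$hash" >&2
    return 1
  fi
}

cache_path /path/to/nginx/cache 13febd65d65112badd0aa90a15d84032
# prints: /path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
```

Keeping the regex in a variable avoids quoting surprises inside [[ ]].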
Base="/path/to/nginx/cache/"
echo '13febd65d65112badd0aa90a15d84032' | \
sed "s|\(.*\(..\)\(..\)\(..\)\)|${Base}\2/\3/\4/\1|"
# or
# sed "s|.*\(..\)\(..\)\(..\)$|${Base}\1/\2/\3/&|"
This assumes the input is a valid MD5 string and nothing else.
First of all - thanks to all of the responders - this was extremely quick!
I also did my own scripting meantime, and came up with this solution:
Run this script with a parameter of the URL you're looking for (www.example.com/article/76232?q=hello for example)
#!/bin/bash
path=$1
md5=$(echo -n "$path" | md5sum | cut -f1 -d' ')
p3=$(echo "${md5:0-2:2}")
p2=$(echo "${md5:0-4:2}")
p1=$(echo "${md5:0-6:2}")
echo "/path/to/nginx/cache/$p1/$p2/$p3/$md5"
This assumes the NGINX cache has a key structure of 2:2:2.
