Convert a table into comma-separated values in a text file using bash

I have a text file like this:
+------------------+------------+----------+
| col_name | data_type | comment |
+------------------+------------+----------+
| _id | bigint | |
| starttime | string | |
+------------------+------------+----------+
How can I get a result like this using bash:
(_id bigint, starttime string )
i.e. just the column names and types?
# remove first 3 lines
sed -e '1,3d' < columnnames.txt > clean.txt
# remove first character from each line
# (edit in place: 'sed ... < clean.txt > clean.txt' would truncate the file before sed reads it)
sed -i 's/^.//' clean.txt
# remove last character from each line
sed -i 's/.$//' clean.txt
# remove certain characters (the - must be first or last inside the brackets, otherwise it denotes a range)
sed -i 's/[+|-]//g' clean.txt
# remove last line
sed -i '$ d' clean.txt
So this is what I have so far; if there is a better implementation, let me know!
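For reference, those five passes can also be collapsed into a single sed invocation (a sketch that simply strings the same steps together):
sed -e '1,3d' -e '$d' -e 's/^.//' -e 's/.$//' -e 's/[+|-]//g' columnnames.txt > clean.txt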

Something similar, using only awk:
awk -F ' *[|]' 'BEGIN {printf("(")} NR>3 && NF>1 {printf("%s%s%s", NR>4 ? "," : "", $2, $3)} END {printf(" )\n")}' columnnames.txt

# Set the field separator to vertical bar surrounded by any number of spaces.
# BEGIN and END blocks print the opening and closing parens
# The line between skips the header lines and any line starting with '+'
$ awk -F"[[:space:]]*[|][[[:space:]]*" '
BEGIN { printf "%s", "( "}
NR > 3 && $0 !~ /^[+]/ { printf("%s%s %s", c, $2, $3); c = ", " }
END { print " )" }' file
( _id bigint, starttime string )

$ awk -F'[| ]+' 'NR>3 && NF>1{v=v s $2" "$3; s=", "} END{print "("v")"}' file
(_id bigint, starttime string)

I would do this:
cat input.txt \
| tail -n +4 \
| awk -F'[^a-zA-Z_]+' '{ for(i=1;i<=NF;i++) { printf $i" " }}'
It's a little bit shorter.

Another way to implement Diego Torres Milano's solution as a stand-alone awk program:
tableconvert
#!/usr/bin/env -S awk -f
BEGIN {
FS="[[:space:]]*[|][[[:space:]]*"
printf "%s", "( "
}
{
if (FNR <= 3 || match($0, /^[+]/))
next
else {
printf("%s%s %s", c, $2, $3)
c = ", "
}
}
END {
print " )"
}
Make tableconvert an executable:
chmod +x tableconvert
Run tableconvert on intablefile.txt
./tableconvert intablefile.txt
( _id bigint, starttime string )
An added bonus is that using FNR instead of NR allows the awk program to process multiple input files as arguments:
./tableconvert infile1.txt infile2.txt infile3.txt ...

A variation on the other answers: use awk with the field separator set to '|' with optional spaces on either side (as GNU awk allows), take fields 2 and 3 as the wanted fields in each record, and format the output as described in the question, with the closing " )" provided in the END rule:
$ awk -F' *\\| *' '
NR>3 && $1~/^[+]/{exit} # exit condition first line w/^+
NR==4{$1=$1; printf "(%s %s", $2,$3} # 1st data record is 4
NR>4{$1=$1; printf ", %s %s", $2,$3} # process all remaining records
END{print " )"} # output closing " )"
' table
(_id bigint, starttime string )
(note: if you don't want the space before the closing ")", just remove it from the print in the END rule)
Rather than using a BEGIN rule, the first record of interest (NR==4) is used to provide the opening "(". Look things over and let me know if you have questions.

Related

Search keywords in master csv; if a keyword exists then update the input csv's 2nd column with the value true or false

Input csv - new_param.csv - has values like:
ID
Identity
as-uid
cp_cus_id
evs
k_n
master.csv has values like:
A, xyz, id, abc
n, xyz, as-uid, abc, B, xyz, ne, abc
q, xyz, id evs, abc
3, xyz, k_n, abc, C, xyz, ad, abc
1, xyz, zd, abc
z, xyz, ID, abc
Required output (updated new_param.csv): true or false in the 2nd column
ID,true
Identity,false
as-uid,true
cp_cus_id,false
evs,true
k_n,true
I tried the code below but got no output:
#!/bin/bash
declare -a keywords=(`cat new_param.csv`)
length=${#keywords[@]}
for (( j=0; j<length; j++ ));
do
a= LC_ALL=C awk -v kw="${keywords[$j]}" -F, '{for (i=1;i<=NF;i++) if ($i ~ kw) {print i}}' master.csv
b=0
if [ $a -gt $b ]
then
echo true $2 >> new_param.csv
else
echo false $2 >> new_param.csv
fi
done
Please, someone help!
I tried the above-mentioned code but it is not helping me.
I'm getting errors like:
test.sh: line 29: [: -gt: unary operator expected test.sh: line 33: -f2: command not found
awk -v RS=', |\n' '                      # records are split on ", " or on a newline
NR == FNR { a[$0] = 1; next }            # first file (master.csv): remember every token seen
{ gsub(/,.*/, "")                        # second file (new_param.csv): keep only the keyword column
  b = "" b $0 (a[$0] ? ",true" : ",false") "\n" }
END { if (FILENAME == "new_param.csv") printf "%s", b > FILENAME }' master.csv new_param.csv
Try this Shellcheck-clean pure Bash code:
#! /bin/bash -p
outputs=()
while read -r kw; do
if grep -q -E "(^|[[:space:],])$kw([[:space:],]|\$)" master.csv; then
outputs+=( "$kw,true" )
else
outputs+=( "$kw,false" )
fi
done <new_param.csv
printf '%s\n' "${outputs[#]}" >new_param.csv
You may need to tweak the regular expression used with grep -E depending on what exactly you want to count as a match.
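For example, if a keyword should only count when it appears as a whole comma-separated field of master.csv, a stricter (hypothetical) pattern could be used in the grep test:
grep -q -E "(^|, )$kw(,|\$)" master.csv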
Using grep to find exact word matches:
$ grep -owf new_param.csv master.csv | sort -u
ID
as-uid
evs
k_n
Then feed this to awk to match against new_param.csv entries:
awk '
BEGIN { OFS="," }
FNR==NR { a[$1]; next }
{ print $1, ($1 in a) ? "true" : "false" }
' <(grep -owf new_param.csv master.csv | sort -u) new_param.csv
This generates:
ID,true
Identity,false
as-uid,true
cp_cus_id,false
evs,true
k_n,true
Once the results are confirmed as correct, OP can add > new_param.csv to the end of the awk command, e.g.:
awk 'BEGIN { OFS="," } FNR==NR ....' <(grep -owf ...) new_param.csv > new_param.csv
^^^^^^^^^^^^^^^
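Since new_param.csv is read there and is also the redirection target, a slightly safer variant (just a sketch, not part of the original answer) writes to a temporary file first:
awk '
BEGIN { OFS="," }
FNR==NR { a[$1]; next }
{ print $1, ($1 in a) ? "true" : "false" }
' <(grep -owf new_param.csv master.csv | sort -u) new_param.csv > new_param.tmp &&
mv new_param.tmp new_param.csv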
Alternative awk option:
Use ", " as the field separator and concatenate the 3rd field of each record of master.csv onto the variable m. Then read each record from the new_param.csv file and use the index function to determine whether that record exists in the string held in m.
awk -F", " '
FNR==NR{m=m$3}
FNR<NR{print $0 (index(m,$0) ? ",true" : ",false")}
' master.csv new_param.csv
Output:
ID,true
Identity,false
as-uid,true
cp_cus_id,false
evs,true
k_n,true

Extracting a string from a line, giving it as input to a command, and then outputting the entire line with the string replaced

I have a file containing lines like the one below (there are multiple rows):
test1| 1234 | test2 | test3
Extract the second column (1234) and run a command, feeding that as input.
Let's say we get X as the output of the command.
Print the output as below for each of the lines:
test1 | X | test2 | test3
I'd prefer to do it in a one-liner, but I'm open to ideas.
I am able to extract the string using awk, but I am not sure how I can still preserve the rest of the line and replace the string in the output. Below is what I tested:
cat file.txt | awk -F '|' '{newVar=system("command "$2); print newVar $4}'
Sample command output, where we extract the "name":
openstack show 36a6c06e-5e97-4a53-bb42
+----------------------------+-----------------------------------+
| Property | Value |
+----------------------------+-----------------------------------+
| id | 36a6c06e-5e97-4a53-bb42 |
| name | testVM1 |
+----------------------------+-----------------------------------+
Perl to the rescue!
perl -lF'/\|/' -ne 'chomp( $F[1] = qx{ command $F[1] }); print join "|", @F' < file.txt
-n reads the input line by line
-l removes newlines from input and adds them to prints
-F specifies how to split each input line into the @F array
$F[1] corresponds to the second column, we replace it with the output of the command
chomp removes the trailing newline from the command output
join glues the array back to one line
Using awk:
awk -F ' *| *' '{("command "$2) | getline $2}1' file.txt
e.g.
$ awk -F ' *| *' '{("date -d @"$2) | getline $2}1' file.txt
test1| Thu 01 Jan 1970 05:50:34 AM IST | test2 | test3
I changed the field separator from '|' to ' *| *' to accommodate the spaces surrounding the fields. You can remove those based on your actual input.
This finally did the trick:
awk -F' *[|] *' -v OFS=' | ' '{
cmd = "openstack show \047" $2 "\047"
while ( (cmd | getline line) > 0 ) {
if ( line ~ /name/ ) {
split(line,flds,/ *[|] */)
$2 = flds[3]
break
}
}
close(cmd)
print
}' file
If command can take the whole list of values once and generate the converted list as output (e.g. tr 'a-z' 'A-Z') then you'd want to do something like this to avoid spawning a shell once per input line (which is extremely slow):
awk -F' *[|] *' '{print $2}' file |
command |
awk -F' *[|] *' -v OFS=' | ' 'NR==FNR{a[FNR]=$0; next} {$2=a[FNR]} 1' - file
otherwise if command needs to be called with one value at a time (e.g. echo) or you just don't care about execution speed then you'd do:
awk -F' *[|] *' -v OFS=' | ' '{
cmd = "command \047" $2 "\047"
if ( (cmd | getline line) > 0 ) {
$2 = line
}
close(cmd)
print
}' file
The \047s will produce single quotes around $2 when it's passed to command and so shield it from shell interpretation (see https://mywiki.wooledge.org/Quotes) and the test on the result of getline will protect you from silently overwriting the current $2 with the output of an earlier command execution in the event of a failure (see http://awk.freeshell.org/AllAboutGetline). The close() ensures that you don't end up with a "too many open files" error or other cryptic problem if the pipe isn't being closed properly, e.g. if command is generating multiple lines and you're just reading the first one.
Given your comment below, if you're going with the 2nd approach above then you'd write something like:
awk -F' *[|] *' -v OFS=' | ' '{
cmd = "openstack show \047" $2 "\047"
while ( (cmd | getline line) > 0 ) {
split(line,flds)
if ( flds[2] == "name" ) {
$2 = flds[3]
break
}
}
close(cmd)
print
}' file

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors, using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but then the comma delimiters are missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
Trying to replace the " " with ",":
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
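Applied to the question's data, with n=3 and the header prepended by hand (a sketch; file.csv is assumed to hold the single long line):
{ echo '"id","lon","lat"'; perl -pe 's{,}{++$n % 3 ? $& : "\n"}ge' file.csv; }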
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here; the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
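For example, the BSD/OSX variant is just the first snippet with that tr preprocessing applied:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' <(tr -d '\n' < file.csv)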
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Insert a date in a column using awk

I'm trying to format a date in a column of a csv.
The input is something like: 28 April 1966
And I'd like this output: 1966-04-28
which can be obtained with this command:
date -d "28 April 1966" +%F
So now I thought of mixing awk with this command to format the entire column, but I can't figure out how.
Edit:
Example of input (the "|" separators are in fact tabs):
1 | 28 April 1966
2 | null
3 | null
4 | 30 June 1987
Expected output:
1 | 1966-04-28
2 | null
3 | null
4 | 30 June 1987
A simple way is
awk -F '\\| ' -v OFS='| ' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename
That is:
{
cmd = "date -d \"" $3 "\" +%F 2> /dev/null" # build shell command
cmd | getline $3 # run, capture output
close(cmd) # close pipe
}
1 # print
This works because date doesn't print anything to its stdout if the date is invalid, so the getline fails and $3 is not changed.
Caveats to consider:
For very large files, this will spawn a lot of shells and processes in those shells (one each per line). This can become a noticeable performance drag.
Be wary of code injection. If the CSV file comes from an untrustworthy source, this approach is difficult to defend against an attacker, and you're probably better off going the long way around, parsing the date manually with gawk's mktime and strftime.
EDIT re: comment: To use tabs as delimiters, the command can be changed to
awk -F '\t' -v OFS='\t' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename
EDIT re: comment 2: If performance is a worry, as it appears to be, spawning processes for every line is not a good approach. In that case, you'll have to do the parsing manually. For example:
BEGIN {
OFS = FS
m["January" ] = 1
m["February" ] = 2
m["March" ] = 3
m["April" ] = 4
m["May" ] = 5
m["June" ] = 6
m["July" ] = 7
m["August" ] = 8
m["September"] = 9
m["October" ] = 10
m["November" ] = 11
m["December" ] = 12
}
$3 !~ /null/ {
split($3, a, " ")
$3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
}
1
Put that in a file, say foo.awk, and run awk -F '\t' -f foo.awk filename.csv.
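As an alternative to the month lookup table, the mktime()/strftime() route mentioned earlier might look like this (just a sketch; it assumes GNU awk and a C library that accepts pre-1970 dates):
gawk -F '\t' -v OFS='\t' '
BEGIN {
n = split("January February March April May June July August September October November December", names, " ")
for (i = 1; i <= n; i++) month[names[i]] = i
}
$3 != "null" {
split($3, d, " ")                                   # "28 April 1966" -> d[1]=28, d[2]="April", d[3]=1966
ts = mktime(d[3] " " month[d[2]] " " d[1] " 0 0 0") # seconds since the epoch, or -1 on failure
if (ts != -1) $3 = strftime("%Y-%m-%d", ts)         # leave the field untouched if parsing fails
}
1' filename.csv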
This should work with your given input
awk -F'\\|' -vOFS="|" '!/null/{cmd="date -d \""$3"\" +%F";cmd | getline $3;close(cmd)}1' file
Output
| 1 |1966-04-28
| 2 | null
| 3 | null
| 4 |1987-06-30
I would suggest using a language that supports parsing dates, like perl:
$ cat file
1 28 April 1966
2 null
3 null
4 30 June 1987
$ perl -F'\t' -MTime::Piece -lane 'print "$F[0]\t",
$F[1] eq "null" ? $F[1] : Time::Piece->strptime($F[1], "%d %B %Y")->strftime("%F")' file
1 1966-04-28
2 null
3 null
4 1987-06-30
The Time::Piece core module allows you to parse and format dates, using the standard format specifiers of strftime. This solution splits the input on a tab character and reformats the second field if it is not "null".
This approach will be much faster than using system calls or invoking subprocesses, as everything is done in native perl.
Here is how you can do this in pure BASH and avoid calling system or getline from awk:
while IFS=$'\t' read -ra arr; do
[[ ${arr[1]} != "null" ]] && arr[1]=$(date -d "${arr[1]}" +%F)
printf "%s\t%s\n" "${arr[0]}" "${arr[1]}"
done < file
1 1966-04-28
2 null
3 null
4 1987-06-30
Only one date call is needed, and no code injection is possible; see the following.
This script extracts the dates (using awk) into a temporary file, processes them with one date call, and merges the results back (using awk).
Code
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' input > temp.$$
date --file=temp.$$ +%F > dates.$$
awk -F '\t' -v OFS='\t' 'BEGIN {
while ( getline < "'"dates.$$"'" > 0 )
{
f1_counter++
if ($0 == "0000-01-01") {$0 = "null"}
date[f1_counter] = $0
}
}
{$3 = date[NR]}
1' input
One-liner using bash process substitution (no temporary files):
inputfile=/path/to/input
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'<(date -f <(awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$inputfile") +%F)'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$inputfile"
Details
Here is how it can be used:
# configuration
input=/path/to/input
temp1=temp.$$
temp2=dates.$$
output=output.$$
# create the sample file (optional)
#printf "\t%s\n" $'1\t28 April 1966' $'2\tnull' $'3\tnull' $'4\t30 June 1987' > "$input"
# Extract all dates
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$input" > "$temp1"
# transform the dates
date --file="$temp1" +%F > "$temp2"
# merge csv with transformed date
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'"$temp2"'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$input" > "$output"
# print the output
cat "$output"
# cleanup
rm "$temp1" "$temp2" "$output"
#rm "$input"
Caveats
Using "0000-01-01" as a temporary placeholder for invalid (null) dates
The code should be faster than other methods that call date many times, but it reads the input file twice.

How to replace all but the last match in a file using bash?

Assuming bash, and a configuration file like:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurence <-- Replace
param-c=cccccc
# param-foo=first commented foo <-- Commented: don't replace
param-d=dddddd
param-e=eeeeee
param-foo=second occurence <-- Replace
param-foo=third occurence <-- Last active: don't replace
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Commented: don't replace
param-x=xxxxxx2
in which you can find multiple commented or uncommented param-foo lines,
how can you comment out all the uncommented param-foos except the very last active one,
resulting in:
param-a=aaaaaa
param-b=bbbbbb
# param-foo=first occurence <-- Replaced
param-c=cccccc
# param-foo=first commented foo <-- Left
param-d=dddddd
param-e=eeeeee
# param-foo=second occurence <-- Replaced
param-foo=third occurence <-- Left
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Left
param-x=xxxxxx2
Two parts of the question:
1. How to do it with only one known repeating param?
(only param-foo in the example above)
2. How to do it with all multiple active params at once?
(param-foo + param-x in the example above)
Attention: in this case I don't know the names of the repeating params in advance!
Thanks
If awk is acceptable, this will do it for param-foo and param-x:
awk -F= -v p='param-foo param-x' 'BEGIN {
ARGV[ARGC++] = ARGV[ARGC - 1]
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile
You may use a single parameter: p=param-x or add more parameters separated by spaces: p='param-1 param-2 ... param-n'.
Edit: I'm assuming the real input file looks like this:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurence
param-c=cccccc
# param-foo=commented foo
param-d=dddddd
param-e=eeeeee
param-foo=second occurence
param-foo=third occurence
param-x=xxxxxx1
param-f=ffffff
param-x=xxxxxx2
Let me know if it's different.
Second edit: providing a solution for mawk users:
awk -F= -v p='param-foo param-x' 'BEGIN {
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile infile
Adding a solution for the latest requirement:
awk -F= 'NR == FNR {
if (NF && !/^#/)
_p[$1]++ && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
FNR != nr[$1] && $0 = "# " $0
}1' infile infile
I have not fully tested the script, but it worked on the first example:
#!/bin/bash
input_file=/path/to/your/input/file
last_occurence=`nl $input_file | grep 'param-foo' | grep -v '#' | tail -1 | awk -F" " '{print $1}'`
sed -i '/#/!s/param-foo/# param-foo/g' $input_file
sed -i "${last_occurence}s/# param-foo/param-foo/" $input_file
It's very straightforward logic. First we get the line number of the last occurrence of param-foo which is not commented.
The first sed comments out every param-foo line which is not already commented.
The second sed uses the line number of the last occurrence of param-foo and removes the # character. You can easily wrap that in a function and use it inside a loop, providing a list of parameters instead of only one; a sketch of that follows below.
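A sketch of that idea, as a hypothetical helper built from the same nl/grep/sed steps (nl -ba is used so blank lines don't shift the line numbering):
comment_all_but_last() {
    local param=$1 file=$2 last
    # line number of the last uncommented occurrence
    last=$(nl -ba "$file" | grep "$param" | grep -v '#' | tail -1 | awk '{print $1}')
    [ -z "$last" ] && return                     # parameter not present uncommented, nothing to do
    # comment out every active occurrence...
    sed -i '/#/!s/'"$param"'/# &/g' "$file"
    # ...then un-comment the last one again
    sed -i "${last}s/# $param/$param/" "$file"
}

for p in param-foo param-x; do
    comment_all_but_last "$p" /path/to/your/input/file
done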
A bit slow for long files, but should work for all the parameters:
grep -v ^# $file |
cut -f1 -d= |
sort -u |
sed 's/^/grep -n . '$file' |
tac |
grep -m1 :/;s/$/= /' |
bash |
sed -r 's%([0-9]+):(.*)=(.*)%\1!s/^\2=/# \2=/%' |
sed -f- $file
This might work:
param="param-foo"
tac input_file |sed '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}'|tac >output_file
For multiple params:
cp input_file{,.backup}
params=(param-{foo,bar,baz})
tac input_file >backwards_file
for param in "${params[#]}"; do
sed -i '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}' backwards_file
done
tac backwards_file >output_file
Turn input_file backwards, prepend a comment # to all but the first occurrence of $param, then reverse the file again; a commented breakdown follows below.
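Roughly what that sed program does, written out with comments (a sketch of the same hold-space logic, with $param written out as param-foo; GNU sed accepts comments inside a script):
tac input_file | sed '
  # only look at lines that are not already commented out
  /#/! {
    # ...and that contain the parameter
    /param-foo/ {
      # swap the current line with the hold space
      x
      # hold space non-empty: a match has already been kept
      /./ {
        x
        s/param-foo/# &/
        t
      }
      # first match seen (i.e. the last one in the original file order):
      # swap back and remember it in the hold space, leaving it uncommented
      x
      h
    }
  }' | tac >output_file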
EDIT:
To extract the params from the file use this piece of code:
params=($(sed -rn '/^#/d;/^$/!s/^\s*([^=]*).*/\1/gp' input_file | sort | uniq))
