Parsing CSV records when a value is multiline

Parsing CSV records when a value is multiline - bash

Source file looks like this:
"google.com", "vuln_example1
vuln_example2
vuln_example3"
"facebook.com", "vuln_example2"
"reddit.com", "stupidly_long_vuln_name1"
"stackoverflow.com", ""
I've been trying to get the output to be something like this but the line breaks seem to cause me no end of problems. I'm using a "while read line" job to do this because I do some processing on the columns (e.g Vulnerability count and url in this example). This is output into a jenkins job (yuk).
The basic summary of the problem is getting the linebreaks in the csv to be output into the third column while retaining the table structure. I've got a sort of weird example of the desired output below.
||hostname ||Vulnerability count|| Vulnerability list || URL ||
|google.com |3 |vuln_example1 |http://cve.com/vuln_example1|
| | |vuln_example2 |http://cve.com/vuln_example2|
| | |vuln_example3 |http://cve.com/vuln_example3|
|facebook.com |1 |vuln_example2 |http://cve.com/vuln_example2|
|reddit.com |1 |stupidly_long_vuln_name1 |http://cve.com/stupidly_long_vuln_name1|
|stackoverflow.com |0 | ||
Looking at this... I've got a feeling it might be easier by showing some code and example output.

Parsing your input with the command line below makes the problem easier (I'm assuming the inputs are correct):
perl -0777 -pe 's/([^"])\s*\n/\1 /g ; s/[",]//g' < sample.txt
This line invokes Perl to perform two regex substitutions:
s/([^"])\s*\n/\1 /g: This substitution removes an end of line if it doesn't terminate by a quote " (i.e. if a host entry, with all vulnerabilities isn't yet complete).
s/[",]//g removes all quotes and commas remaining.
For each host entry like this one:
"google.com", "vuln_example1
vuln_example2
vuln_example3"
You'll get:
google.com vuln_example1 vuln_example2 vuln_example3
Then you can assume for each line, you have an host and a set of vulnerabilities.
The given example below stores vulnerabilities in an array and loop through it, formatting and printing each line:
# Replace this by your custom function
# to get an URL for a given vulnerability
function get_vuln_url () {
# This just displays a random url for an non-empty arg
[[ -z "$1" ]] || echo "http://host/$1.htm"
}
# Format your line (see printf help)
function print_row () {
printf "%-20s|%5s|%-30s|%s\n" "$#"
}
# The perl line reformat
perl -0777 -pe 's/([^"])\s*\n/\1 /g ; s/[",]//g' < sample.txt |
while read -r line ; do
arr=(${line})
print_row "${arr[0]}" "$((${#arr[#]} - 1))" "${arr[1]}" "$(get_vuln_url ${arr[1]})"
#echo -e "${arr[0]}\t|$vul_count\t|${arr[1]}\t|$(get_vuln_url ${arr[1]})"
for v in "${arr[#]:2}" ; do
print_row " " " " "$v" "$(get_vuln_url ${arr[1]})"
done
done
Output:
google.com | 3|vuln_example1 |http://host/vuln_example1.htm
| |vuln_example2 |http://host/vuln_example1.htm
| |vuln_example3 |http://host/vuln_example1.htm
facebook.com | 1|vuln_example2 |http://host/vuln_example2.htm
reddit.com | 1|stupidly_long_vuln_name1 |http://host/stupidly_long_vuln_name1.htm
stackoverflow.com | 0| |
Update.
If you don't have Perl, and if your file doesn't have tabulations, you can use this command as a workaround instead:
tr '\n' '\t' < sample.txt | sed -r -e 's/([^"])\s*\t/\1 /g' -e 's/[",]//g' -e 's/\t/\n/g'
tr '\n' '\t' replaces all ends of line by tabulations
sed part acts like Perl line, except it deals with tabulations instead of ends of line and restores tabulations back to ends of line.

Related

Bash regex: get value in conf file preceded by string with dot

I have to get my db credentials from this configuration file:
# Database settings
Aisse.LocalHost=localhost
Aisse.LocalDataBase=mydb
Aisse.LocalPort=5432
Aisse.LocalUser=myuser
Aisse.LocalPasswd=mypwd
# My other app settings
Aisse.NumDir=../../data/Num
Aisse.NumMobil=3000
# Log settings
#Aisse.Trace_AppliTpv=blabla1.tra
#Aisse.Trace_AppliCmp=blabla2.tra
#Aisse.Trace_AppliClt=blabla3.tra
#Aisse.Trace_LocalDataBase=blabla4.tra
In particular, I want to get the value mydb from line
Aisse.LocalDataBase=mydb
So far, I have developed this
mydbname=$(echo "$my_conf_file.conf" | grep "LocalDataBase=" | sed "s/LocalDataBase=//g" )
that returns
mydb #Aisse.Trace_blabla4.tra
that would be ok if it did not return also the comment string.
Then I have also tryed
mydbname=$(echo "$my_conf_file.conf" | grep "Aisse.LocalDataBase=" | sed "s/LocalDataBase=//g" )
that retruns void string.
How can I get only the value that is preceded by the string "Aisse.LocalDataBase=" ?

Using sed
$ mydbname=$(sed -n 's/Aisse\.LocalDataBase=//p' input_file)
$ echo $mydbname
mydb

I'm afraid you're being incomplete:
You mention you want the line, containing "LocalDataBase", but you don't want the line in comment, let's start with that:
A line which contains "LocalDataBase":
grep "LocalDataBase" conf.conf.txt
A line which contains "LocalDataBase" but who does not start with a hash:
grep "LocalDataBase" conf.conf.txt | grep -v "^ *#"
??? grep -v "^ *#"
That means: don't show (-v) the lines, containing:
^ : the start of the line
* : a possible list of space characters
# : a hash character
Once you have your line, you need to work with it:
You only need the part behind the equality sign, so let's use that sign as a delimiter and show the second column:
cut -d '=' -f 2
All together:
grep "LocalDataBase" conf.conf.txt | grep -v "^ *#" | cut -d '=' -f 2
Are we there yet?
No, because it's possible that somebody has put some comment behind your entry, something like:
LocalDataBase=mydb #some information
In order to prevent that, you need to cut that comment too, which you can do in a similar way as before: this time you use the hash character as a delimiter and you show the first column:
grep "LocalDataBase" conf.conf.txt | grep -v "^ *#" | cut -d '=' -f 2 | cut -d '#' -f 1
Have fun.

You may use this sed:
mydbname=$(sed -n 's/^[^#][^=]*LocalDataBase=//p' file)
echo "$mydbname"
mydb
RegEx Details:
^: Start
[^#]: Matches any character other than #
[^=]*: Matches 0 or more of any character that is not =
LocalDataBase=: Matches text LocalDataBase=

You can use
mydbname=$(sed -n 's/^Aisse\.LocalDataBase=\(.*\)/\1/p' file)
If there can be leading whitespace you can add [[:blank:]]* after ^:
mydbname=$(sed -n 's/^[[:blank:]]*Aisse\.LocalDataBase=\(.*\)/\1/p' file)
See this online demo:
#!/bin/bash
s='# Database settings
Aisse.LocalHost=localhost
Aisse.LocalDataBase=mydb
Aisse.LocalPort=5432
Aisse.LocalUser=myuser
Aisse.LocalPasswd=mypwd
# My other app settings
Aisse.NumDir=../../data/Num
Aisse.NumMobil=3000
# Log settings
#Aisse.Trace_AppliTpv=blabla1.tra
#Aisse.Trace_AppliCmp=blabla2.tra
#Aisse.Trace_AppliClt=blabla3.tra
#Aisse.Trace_LocalDataBase=blabla4.tra'
sed -n 's/^Aisse\.LocalDataBase=\(.*\)/\1/p' <<< "$s"
Output:
mydb
Details:
-n - suppresses default line output in sed
^[[:blank:]]*Aisse\.LocalDataBase=\(.*\) - a regex that matches the start of a string (^), then zero or more whiespaces ([[:blank:]]*), then a Aisse.LocalDataBase= string, then captures the rest of the line into Group 1
\1 - replaces the whole match with the value of Group 1
p - prints the result of the successful substitution.

Bash script to add double quotes in .CSV comma delimited file

I need to add double quotes to the csv file. My sample data is like this..
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29 06:10:01",200890899011202395,0899,0279395,356058102052972,89148000005117597971,67756296
I have tried some code available online with awk and sed, it is resulting as below , Error - **First digit in the number is being trimmed like for ex. in '378478' it is only displaying '78478'.
Also it is adding double quotes to already existing double quotes too!** nothing seems to be perfectly working. Please guide me!
"78478","COMPLETED","Tracfone","","",""2020/03/29 09:39:22"","","2787","","356074101197544","89148000005748235454","75176540"
"78328","COMPLETED",""Total Wireless"",""Unlimited Talk"," Text"," & Data (First 25GB High Speed"," then unlimited 2GB)"","50",""2020/03/29 06:10:01"","200890899011202395","0899","0279395","356058102052972","89148000005117597971","67756296"
"78329","COMPLETED",""Cricket Wireless"",""Unlimited Talk"," Text"," & 4G LTE Data w/ 15GB Hotspot"","60",""2020/03/29""
This is the code I am using:
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file # or
sed -E 's/([^,]*) , (.*)/"\1" , "\2"/' file
My total code is the below one. my Intention was to first convert all .xlsx to .csv and then add double quotes to same csv and save it in the same file.i know the $file.csv part is wrong, hence i need some help
find "$Src_Dir" -type f -iname "*.xlsx" -print>path/temp
cat path/temp | while IFS="" read -r -d $'\0' file;
do
echo $file
ssconvert "${file}" --export-type=Gnumeric_stf:stf_csv
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' $file > $file.csv
done

If you want to handle anything other than the simplest CSV files, you should probably move away from sed and awk. There are much better tools available.
For example, if you sudo apt install csvtool (or equivalent) on your favourite distro, you can use its call-per-line functionality to process each line in the input file. See the following script for an example:
#!/bin/bash
function quotify {
# Start empty line, process every field.
line=""
while [[ $# -ne 0 ]] ; do
# Append comma for all but first field, then quoted field.
[[ -n "${line}" ]] && line="${line},"
line="${line}\"$1\""
shift
done
# Output the fully quoted line.
echo "${line}"
}
# Needed to call functions. Also, ensure link: /bin/sh -> /bin/bash.
export -f quotify
# Pretty-print input and output.
echo "Input file:"
sed 's/^/ /' inputFile.csv
echo "Output file:"
csvtool call quotify inputFile.csv | sed 's/^/ /'
Note the quotify function which is called for each line in the CSV file, with the arguments set to each field within that line (sans quotes, whether the original fields had quotes or not).
It basically constructs a string of all the fields in the line, with quotes around them, then writes that to standard output, as shown below in the output from that script:
Input file:
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29"
Output file:
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29"
Even though using a separate tool is probably the easiest way to go, if you absolutely cannot install other packages, then you're going to have to code up something in a package you already have. The following bash script is a good place to start, as it uses no other tools to achieve its goal.
At the moment, it's tied to a very specific set of rules, as follows:
White space matters. Anything between the commas is considered part of the field. This especially matters when detecting a quoted field, it must have the quote as the first character, no abc, "d,e,f",ghi stuff since the "d,e,f" won't be handled correctly.
Quoted fields are allowed to contain commas, and "" sequences within them are turned into ".
It's probably not a good idea to supply ill-formatted CSV files :-)
But, with that in mind, here we go. I'll offer a brief textual description of each section but hopefully the comments in the code will be enough to figure out what's going on.
First, a function for finding the position if some string within another string, useful for working out the field bounds:
function findPos {
haystack="$1"
needle="$2"
# Remove everything past the needle.
prefix="${haystack%%${needle}*}"
# If nothing was removed, it wasn't found, so supply massive number.
# Otherwise, it was found at the length of the string with removed stuff.
position=999999
[[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
echo ${position}
}
Then we can use that in the function that works out the length of the next field. This basically just looks for the next comma for unquoted fields, and does special handling for quoted fields by building up the field from segments (it has to handle quotes within quotes and commas):
function getNextFieldLen {
line="$1"
# Empty line means all work done.
[[ -z "${line}" ]] && echo -1 && return
# Handle unquoted first, this is easy.
[[ "${line:0:1}" != '"' ]] && { echo $(findPos "${line}" ","); return; }
# Now handle quoted. Loop over all segments where a segment is defined as
# the text up to the next <"">, assuming it's before the next <",>.
field=""
nextQuoteComma=$(findPos "${line}" '",')
nextDoubleQuote=$(findPos "${line}" '""')
while [[ ${nextDoubleQuote} -lt ${nextQuoteComma} ]]; do
# Append segment to the field and go back for next segment.
field="${field}${line:0:${nextDoubleQuote}}\"\""
line="${line:${nextDoubleQuote}}"
line="${line:2}"
nextQuoteComma=$(findPos "${line}" '",')
nextDoubleQuote=$(findPos "${line}" '""')
done
# Add final segment (up to the comma) and output entire field.
field="${field}${line:0:${nextQuoteComma}}\""
echo "${#field}"
}
Finally, there's the top-level function which will quotify whatever comes in via standard input:
function quotifyStdIn {
# Process file line by line.
while read -r line; do
# Start with empty output line and non-comma separator.
outLine="" ; sep=""
# Place terminator to make processing easier, start field loop.
line="${line},"
fieldLen=$(getNextFieldLen "${line}")
while [[ ${fieldLen} -ge 0 ]]; do
# Get field and quotify if needed, adjust line (remove field and comma).
field="${line:0:${fieldLen}}"
[[ "${field:0:1}" = '"' ]] || field="\"${field}\""
line="${line:$((fieldLen+1))}"
#line="${line:${fieldLen}}"
#line="${line:1}"
# Append to output line and prepare for next field.
outLine="${outLine}${sep}${field}"; sep=","
fieldLen=$(getNextFieldLen "${line}")
done
# Output built line.
echo "${outLine}"
done
}
And, on the off-chance you want to read directly from a file (though providing a file name that's empty or "-" will use standard input so you can probably just use the file-based function for everything):
function quotifyFile {
file="$1"
# Empty file or "-" means standard input, otherwise take input from real file.
[[ ${#file} -eq 0 ]] && { quotifyStdIn; return; }
[[ "${file}" = "-" ]] && { quotifyStdIn; return; }
quotifyStdIn < "${file}"
}
And, finally, because every program that's not a "Hello, world" one deserves some form of test harness, this is what you can use to test the various capabilities:
(
echo 'paxdiablo,was here'
echo 'and,"then, strangely,",he,was,not'
echo '50,"My name is ""Pax"", and yours is ""Bob""",42'
echo '17,"""Love"" is grand",19'
) > harness.csv
echo "Before:"
sed "s/^/ /" harness.csv
echo "After:"
quotifyFile harness.csv | sed "s/^/ /"
rm -rf harness.csv
And, since a test harness is of little use unless you run the tests, here's the results of the first run:
Before:
paxdiablo,was here
and,"then, strangely,",he,was,not
50,"My name is ""Pax"", and yours is ""Bob""",42
17,"""Love"" is grand",19
After:
"paxdiablo","was here"
"and","then, strangely,","he","was","not"
"50","My name is ""Pax"", and yours is ""Bob""","42"
"17","""Love"" is grand","19"
Hopefully, that will be enough to get you going in the absence of being able to install packages. Of course, if one of the packages you can't install in bash itself, then you have problems that I can't help you with :-)

Your starting CSV is not a good CSV: the 2 rows have different number of columns
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 378478 | COMPLETED | Tracfone | - | - | 2020/03/29 09:39:22 | - | 2787 | - | 356074101197544 | 89148000005748235454 | 75176540 |
| 378328 | COMPLETED | Total Wireless | Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB) | 50 | 2020/03/29 | - | - | - | - | - | - |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
Using Miller (https://github.com/johnkerl/miller) you could run
mlr --csv --quote-all -N unsparsify input >output
to have
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29","","","","","",""
You can use it downloading the executable https://github.com/johnkerl/miller/releases/tag/v5.7.0

Command execution in sed while preserving unmatched part of the line

It is simple - I have a data stream with IPv4 addresses encoded into hexadecimal representation like for example 0c22384e which stands for 12.34.56.78.
I figured out sed command with substitution of captured octets into decimal numbers separated by dot.
echo "0c22384e" | sed -E 's/([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})/printf "%d.%d.%d.%d" 0x\1 0x\2 0x\3 0x\4/eg'
This works with a single number BUT as soon I add some text that is not supposed to be matched, it is also passed for the execution - via printf in this case.
How can I preserve the unmatched part of the line without being passed for the execution?

With only one address in a line you could use
echo "Something 0c22384e more" |
sed -r 's/(.*)([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})(.*)/"\1" 0x\2 0x\3 0x\4 0x\5 "\6"/' |
xargs -n6 printf '%s%d.%d.%d.%d%s\n'
EDIT:
Replaced solution for one line and more addresses
with solution for more lines (assuming no '\r' in the stream):
echo "Something 0c22384e more 0c22385e
Second line: 0c22386e and 0c223870
Third line: 0c22388e and 0c223890
4th line: 0c2238ae and 0c2238b0" |
sed 's/$/\r/' |
sed -r 's/[0-9a-f]{8}/\n&\n/g' |
sed -r 's/([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})/printf '%d.%d.%d.%d' 0x\1 0x\2 0x\3 0x\4/e' |
tr -d '\n' |
tr '\r' '\n'

How to use sed to extract a string [duplicate]

This question already has answers here:
BASH extract value after string in variable Not file [duplicate]
(2 answers)
Closed last year.
I need to extract a number from the output of a command: cmd. The output is type: 1000
So my question is how to execute the command, store its output in a variable and extract 1000 in a shell script. Also how do you store the extracted string in a variable?

This question has been answered in pieces here before, it would be something like this:
line=$(sed -n '2p' myfile)
echo "$line"
if [ `echo $line || grep 'type: 1000' ` ] then;
echo "It's there!";
fi;
Store output of sed into a variable
String contains in Bash
EDIT: sed is very limited, you would need to use bash, perl or awk for what you need.

This is a typical use case for grep:
output=$(cmd | grep -o '[0-9]\+')
You can write the output of a command or even a pipeline of commands into a shell variable using so called command substitution:
variable=$(cmd);
In comments it appeared that the output of cmd contains more lines than the type : 1000. In this case I would suggest sed:
output=$(cmd | sed -n 's/type : \([0-9]\+\)/\1/p;q')

You tagged your question as sed but your question description does not restrict other tools, so here's a solution using awk.
output = `cmd | awk -F':' '/type: [0-9]+/{print $2}'`
Alternatively, you can use the newer $( ) syntax. Some find the newer syntax preferable and it can be conveniently nested, without the need for escaping backtics.
output = $(cmd | awk -F':' '/type: [0-9]+/{print $2}')

If the output is rigidly restricted to "type: " followed by a number, you can just use cut.
var=$(echo 'type: 1000' | cut -f 2 -d ' ')
Obviously you'll have to pipe the output of your command to cut, I'm using echo as a demo.
In addition, I'd use grep and then cut if the string you are searching is more complex. If we assume there can be all kind of numbers in the text, but only one occurrence of "type: " followed by a number, you can use the command:
>> var=$(echo "hello 12 type: 1000 foo 1001" | grep -oE "type: [0-9]+" | cut -f 2 -d ' ')
>> echo $var
1000

You can use the | operator to send the output of one command to another, like so:
echo " 1\n 2\n 3\n" | grep "2"
This sends the string " 1\n 2\n 3\n" to the grep command, which will search for the line containing 2. It sound like you might want to do something like:
cmd | grep "type"

Here is a plain sed solution that uses a regualar expression to find the number in your string:
cmd | sed 's/^.*type: \([0-9]\+\)/\1/g'
^ means from the start
.* can be any character (also none)
\([0-9]\+\) are numbers (minimum one character)
\1 means it takes the first pattern it finds (and only in this case) and uses it as replacement for the whole string

to print words seperated with special charecters in shell script

shell script to print three words differently I have tried
{
a="Uname/pass#last"
echo $a | tr "/" "\n" | tr "#" "\n"
output is:
Uname
pass
last
}
I want it as
{Username- Uname
Password- pass
lastname-last}

Ok, I guess you want to add a prefix to each results:
printf 'Username\nPassword\nlastname' > /tmp/prefixes
a="Uname/pass#last"
echo "${a}" | tr '/#' '\n\n' | paste -d':' /tmp/prefixes -
ie: paste together the output of /tmp/prefixes and of the Standard Input (-), which is receiving the output of : echo ".../...#..." | tr '/#' '\n\n'
(and in the resulting output, separate the 2 with a : in this example, or whatever else you would want. Ex: - like in your question.)
and it outputs :
Username:user
Password:pass
lastname:last
(I know you wanted a - instead of a : but I give my example with : to better separate the "-" denoting the standard input, and the ":" denoting the field-separator character in the output. Just change -d':' into -d'-' to have a - instead.)

First off, I hope you're not going to manipulate important passwords in a shell script and external commands. There are some risks involved with that.
Defining the problem
I suspect you want split a string encoding a user's Username, password and surname into a three line structure, adding tags to document which is which. For that, tr is insufficient.
However, it can be done inside the shell.
Example (bash, ksh):
function split_account_string {
typeset account=${1:?account string} uname pass last t
uname=${account%%/*}
last=${account##*#}
t=${account#$uname/}
pass=${t%#*}
[[ $uname/$pass#$last == "$account" ]] || return
echo "{Username-$uname"
echo "Password-$pass"
echo "lastname-$last}"
}
split_account_string "USER_A/seKreT#John.Doe"
This function will extract all tokens between the first / and the last # as the value of the password. If either one is missing, it will print nothing, and return an error status.
When run, this gives:
{Username-USER_A
Password-seKreT
lastname-John.Doe}

Use this simple script and get the output.
#!/bin/bash
a="Uname/pass#last"
array2=(`echo $a | tr "/" "\n" | tr "#" "\n"`)
array1=(`echo -e "Username\nPassword\nlastname"`)
i=${#array1[#]}
for (( j=0 ; j<$i ; j++ ))
do
echo "${array1[$j]}=${array2[$j]}"
done

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Parsing CSV records when a value is multiline - bash

Related

Bash regex: get value in conf file preceded by string with dot

Bash script to add double quotes in .CSV comma delimited file

Command execution in sed while preserving unmatched part of the line

How to use sed to extract a string [duplicate]

to print words seperated with special charecters in shell script

Categories

Resources