I have scenario where we want to replace multiple double quotes to single quotes between the data, but as the input data is separated with "comma" delimiter and all column data is enclosed with double quotes "" got an issue and the same explained below:
The sample data looks like this:
"int","","123","abd"""sf123","top"
So, the output would be:
"int","","123","abd"sf123","top"
tried below approach to get the resolution, but only first occurrence is working, not sure what is the issue??
sed -ie 's/,"",/,"NULL",/g;s/""/"/g;s/,"NULL",/,"",/g' inputfile.txt
replacing all ---> from ,"", to ,"NULL",
replacing all multiple occurrences of ---> from """ or "" or """" to " (single occurrence)
replacing 1 step changes back to original ---> from ,"NULL", to ,"",
But, only first occurrence is getting changed and remaining looks same as below:
If input is :
"int","","","123","abd"""sf123","top"
the output is coming as:
"int","","NULL","123","abd"sf123","top"
But, the output should be:
"int","","","123","abd"sf123","top"
You may try this perl with a lookahead:
perl -pe 's/("")+(?=")//g' file
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
Where input is:
cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
Breakup:
("")+: Match 1+ pairs of double quotes
(?="): If those pairs are followed by a single "
Using sed
$ sed -E 's/(,"",)?"+(",)?/\1"\2/g' input_file
"int","","123","abd"sf123","top"
"int","","NULL","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
In awk with your shown samples please try following awk code. Written and tested in GNU awk, should work in any version of awk.
awk '
BEGIN{ FS=OFS="," }
{
for(i=1;i<=NF;i++){
if($i!~/^""$/){
gsub(/"+/,"\"",$i)
}
}
}
1
' Input_file
Explanation: Simple explanation would be, setting field separator and output field separator as , for all the lines of Input_file. Then traversing through each field of line, if a field is NOT NULL then Globally replacing all 1 or more occurrences of " with single occurrence of ". Then printing the line.
With sed you could repeat 1 or more times sets of "" using a group followed by matching a single "
Then in the replacement use a single "
sed -E 's/("")+"/"/g' file
For this content
$ cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
The output is
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
sed s'#"""#"#' file
That works. I will demonstrate another method though, which you may also find useful in other situations.
#!/bin/sh -x
cat > ed1 <<EOF
3s/"""/"/
wq
EOF
cp file stack
cat stack | tr ',' '\n' > f2
ed -s f2 < ed1
cat f2 | tr '\n' ',' > stack
rm -v ./f2
rm -v ./ed1
The point of this is that if you have a big csv record all on one line, and you want to edit a specific field, then if you know the field number, you can convert all the commas to carriage returns, and use the field number as a line number to either substitute, append after it, or insert before it with Ed; and then re-convert back to csv.
Related
Hello everyone I'm a beginner in shell coding. In daily basis I need to convert a file's data to another format, I usually do it manually with Text Editor. But I often do mistakes. So I decided to code an easy script who can do the work for me.
The file's content like this
/release201209
a1,a2,"a3",a4,a5
b1,b2,"b3",b4,b5
c1,c2,"c3",c4,c5
to this:
a2>a3
b2>b3
c2>c3
The script should ignore the first line and print the second and third values separated by '>'
I'm half way there, and here is my code
#!/bin/bash
#while Loops
i=1
while IFS=\" read t1 t2 t3
do
test $i -eq 1 && ((i=i+1)) && continue
echo $t1|cut -d\, -f2 | { tr -d '\n'; echo \>$t2; }
done < $1
The problem in my code is that the last line isnt printed unless the file finishes with an empty line \n
And I want the echo to be printed inside a new CSV file(I tried to set the standard output to my new file but only the last echo is printed there).
Can someone please help me out? Thanks in advance.
Rather than treating the double quotes as a field separator, it seems cleaner to just delete them (assuming that is valid). Eg:
$ < input tr -d '"' | awk 'NR>1{print $2,$3}' FS=, OFS=\>
a2>a3
b2>b3
c2>c3
If you cannot just strip the quotes as in your sample input but those quotes are escaping commas, you could hack together a solution but you would be better off using a proper CSV parsing tool. (eg perl's Text::CSV)
Here's a simple pipeline that will do the trick:
sed '1d' data.txt | cut -d, -f2-3 | tr -d '"' | tr ',' '>'
Here, we're just removing the first line (as desired), selecting fields 2 & 3 (based on a comma field separator), removing the double quotes and mapping the remaining , to >.
Use this Perl one-liner:
perl -F',' -lane 'next if $. == 1; print join ">", map { tr/"//d; $_ } #F[1,2]' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F',' : Split into #F on comma, rather than on whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
I have the following string:
|**barak**.version|2001.0132012031539|
in file text.txt.
I would like to replace it with the following:
|**barak**.version|2001.01.2012031541|
So I run:
sed -i "s/\|\*\*$module\*\*.version\|2001.0132012031539/|**$module**.version|$version/" text.txt
but the result is a duplicate instead of replacing:
|**barak**.version|2001.01.2012031541|**barak**.version|2001.0132012031539|
What am I doing wrong?
Here is the value for module and version:
$ echo $module
barak
$ echo $version
2001.01.2012031541
Assumptions:
lines of interest start and end with a pipe (|) and have one more pipe somewhere in the middle of the data
search is based solely on the value of ${module} existing between the 1st/2nd pipes in the data
we don't know what else may be between the 1st/2nd pipes
the version number is the only thing between the 2nd/3rd pipes
we don't know the version number that we'll be replacing
Sample data:
$ module='barak'
$ version='2001.01.2012031541'
$ cat text.txt
**barak**.version|2001.0132012031539| <<<=== leave this one alone
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.0132012031539| <<<=== replace this one
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.0132012031539| <<<=== replace this one
One sed solution with -Extended regex support enabled and making use of a capture group:
$ sed -E "s/^(\|[^|]*${module}[^|]*).*/\1|${version}|/" text.txt
Where:
\| - first occurrence (escaped pipe) tells sed we're dealing with a literal pipe; follow-on pipes will be treated as literal strings
^(\|[^|]*${module}[^|]*) - first capture group that starts at the beginning of the line, starts with a pipe, then some number of non-pipe characters, then the search pattern (${module}), then more non-pipe characters (continues up to next pipe character)
.* - matches rest of the line (which we're going to discard)
\1|${version}| - replace line with our first capture group, then a pipe, then the new replacement value (${version}), then the final pipe
The above generates:
**barak**.version|2001.0132012031539|
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.01.2012031541| <<<=== replaced
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.01.2012031541| <<<=== replaced
An awk alternative using GNU awk:
awk -v mod="$module" -v vers="$version" -F \| '{ OFS=FS;split($2,map,".");inmod=substr(map[1],3,length(map[1])-4);if (inmod==mod) { $3=vers } }1' file
Pass two variables mod and vers to awk using $module and $version. Set the field delimiter to |. Split the second field into array map using the split function and using . as the delimiter. Then strip the leading and ending "**" from the first index of the array to expose the module name as inmod using the substr function. Compare this to the mod variable and if there is a match, change the 3rd delimited field to the variable vers. Print the lines with short hand 1
Pipe is only special when you're using extended regular expressions: sed -E
There's no reason why you need extended here, stick with basic regex:
sed "
# for lines matching module.version
/|\*\*$module\*\*.version|/ {
# replace the version
s/|2001.0132012031539|/|$version|/
}
" text.txt
or as an unreadable one-liner
sed "/|\*\*$module\*\*.version|/ s/|2001.0132012031539|/|$version|/" text.txt
I have a text file: file.txt, with several thousand lines. It contains a lot of junk lines which I am not interested in, so I use the cut command to regex for the lines I am interested in first. For each entry I am interested in, it will be listed twice in the text file: Once in a "definition" section, another in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there find it's corresponding "value" section entry.
The first entry starts with ' gl_ ', while the 2nd entry would look like ' "gl_ ', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
do
if [[ $line == gl_* ]] ; then (param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
do
if [[ $glline == '"'$param* ]] ; then val=$(cut -d'\' -f 3 $glline) |
"$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This seems to throw some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed, and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
some\junk
"gl_one\1\Value1
some\junk
"gl_two\1\Value2
"gl_three\1\Value3
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', which then stores that value (ie. gl_one) as a variable 'param'.
It then starts the nested while loop that looks for the line that starts with a ' " ' in front of the gl_, and is equivalent to the 'param' value. In other words, the
script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter, to save them together in a .csv file with their corresponding "gl_ values.
Wanted regex output stored in variables would be something like this:
first while loop:
$param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop:
$val = Value1
Then it stores these variables to the file.csv, with semi-colon separators.
Currently, I have an error for the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this,
I am looking for general ideas and comments to the script. I.e, not entirely sure I am looking for the quotation mark parameters "gl_ correctly, or if the
semi-colons as .csv separators are added correctly.
Edit: Overall, the script runs now, but extremely slow due to the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file?
Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_'
then sed to remove the leading '"' from the lines that contain one [I have assumed there are no further '"' in the line]
The lines are sorted
sed removes the return from each pair of lines
awk then prints
the required columns according to your requirements
Output routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
/^gl_/{ # if definition
N; # append next line to buffer
s/\n"gl_[^\\]*//; # if value, strip first column
t; # and start next loop
}
D; # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }' \
>>/filepath/file.csv
sort lines so gl_... appears immediately before "gl_... (LANG fixes LC_TYPE) - assumes definition appears before value
sed to help ensure matching definition and value (may still fail if duplicate/missing value), and tidy for awk
awk to pull out relevant fields
I have a .csv file that contains double quoted multi-line fields. I need to convert the multi-line cell to a single line. It doesn't show in the sample data but I do not know which fields might be multi-line so any solution will need to check every field. I do know how many columns I'll have. The first line will also need to be skipped. I don't how much data so performance isn't a consideration.
I need something that I can run from a bash script on Linux. Preferably using tools such as awk or sed and not actual programming languages.
The data will be processed further with Logstash but it doesn't handle double quoted multi-line fields hence the need to do some pre-processing.
I tried something like this and it kind of works on one row but fails on multiple rows.
sed -e :0 -e '/,.*,.*,.*,.*,/b' -e N -e '1n;N;N;N;s/\n/ /g' -e b0 file.csv
CSV example
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
The output I want is
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
Jane,Doe,Country City Street,67890
etc.
etc.
First my apologies for getting here 7 months late...
I came across a problem similar to yours today, with multiple fields with multi-line types. I was glad to find your question but at least for my case I have the complexity that, as more than one field is conflicting, quotes might open, close and open again on the same line... anyway, reading a lot and combining answers from different posts I came up with something like this:
First I count the quotes in a line, to do that, I take out everything but quotes and then use wc:
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
If you think of a single multi-line field, knowing if the quotes are 1 or 2 is enough. In a more generic scenario like mine I have to know if the number of quotes is odd or even to know if the line completes the record or expects more information.
To check for even or odd you can use the mod operand (%), in general:
even % 2 = 0
odd % 2 = 1
For the first line:
Odd means that the line expects more information on the next line.
Even means the line is complete.
For the subsequent lines, I have to know the status of the previous one. for instance in your sample text:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
You can say line 1 (John,Doe,"Country) has 1 quote (odd) what means the status of the record is incomplete or open.
When you go to line 2, there is no quote (even). Nevertheless this does not mean the record is complete, you have to consider the previous status... so for the lines following the first one it will be:
Odd means that record status toggles (incomplete to complete).
Even means that record status remains as the previous line.
What I did was looping line by line while carrying the status of the last line to the next one:
incomplete=0
cat file.csv | while read line; do
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
incomplete=$((($quotes+$incomplete)%2)) # Check if Odd or Even to decide status
if [ $incomplete -eq 1 ]; then
echo -n "$line " >> new.csv # If line is incomplete join with next
else
echo "$line" >> new.csv # If line completes the record finish
fi
done
Once this was executed, a file in your format generates a new.csv like this:
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345
I like one-liners as much as everyone, I wrote that script just for the sake of clarity, you can - arguably - write it in one line like:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
I would appreciate it if you could go back to your example and see if this works for your case (which you most likely already solved). Hopefully this can still help someone else down the road...
Recovering the multi-line fields
Every need is different, in my case I wanted the records in one line to further process the csv to add some bash-extracted data, but I would like to keep the csv as it was. To accomplish that, instead of joining the lines with a space I used a code - likely unique - that I could then search and replace:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l ~newline~ " || echo "$l";done >new.csv
the code is ~newline~, this is totally arbitrary of course.
Then, after doing my processing, I took the csv text file and replaced the coded newlines with real newlines:
sed -i 's/ ~newline~ /\n/g' new.csv
References:
Ternary operator: https://stackoverflow.com/a/3953666/6316852
Count char occurrences: https://stackoverflow.com/a/41119233/6316852
Other peculiar cases: https://www.linuxquestions.org/questions/programming-9/complex-bash-string-substitution-of-csv-file-with-multiline-data-937179/
TL;DR
Run this:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
... and collect results in new.csv
I hope it helps!
If Perl is your option, please try the following:
perl -e '
while (<>) {
$str .= $_;
}
while ($str =~ /("(("")|[^"])*")|((^|(?<=,))[^,]*((?=,)|$))/g) {
if (($el = $&) =~ /^".*"$/s) {
$el =~ s/^"//s; $el =~ s/"$//s;
$el =~ s/""/"/g;
$el =~ s/\s+(?!$)/ /g;
}
push(#ary, $el);
}
foreach (#ary) {
print /\n$/ ? "$_" : "$_,";
}' sample.csv
sample.csv:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
John,Doe,"Country
City
Street",67890
Result:
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
John,Doe,Country City Street,67890
This might work for you (GNU sed):
sed ':a;s/[^,]\+/&/4;tb;N;ba;:b;s/\n\+/ /g;s/"//g' file
Test each line to see that it contains the correct number of fields (in the example that was 4). If there are not enough fields, append the next line and repeat the test. Otherwise, replace the newline(s) by spaces and finally remove the "'s.
N.B. This may be fraught with problems such as ,'s between "'s and quoted "'s.
Try cat -v file.csv. When the file was made with Excel, you might have some luck: When the newlines in a field are a simple \n and the newline at the end is a \r\n (which will look like ^M), parsing is simple.
# delete all newlines and replace the ^M with a new newline.
tr -d "\n" < file.csv| tr "\r" "\n"
# Above two steps with one command
tr "\n\r" " \n" < file.csv
When you want a space between the joined line, you need an additional step.
tr "\n\r" " \n" < file.csv | sed '2,$ s/^ //'
EDIT: #sjaak commented this didn't work is his case.
When your broken lines also have ^M you still can be a lucky (wo-)man.
When your broken field is always the first field in double quotes and you have GNU sed 4.2.2, you can join 2 lines when the first line has exactly one double quote.
sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv
Explanation:
-z don't use \n as line endings
:a label for repeating the step after successful replacement
(\n|^) Search after a newline or the very first line
([^"]*) Substring without a "
ta Go back to label a and repeat
awk pattern matching is working.
answer in one line :
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile
if you'd like to drop quotes, you could use:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile | sed 's/"//gw NewFile'
but I prefer to keep it.
to explain the code:
/Pattern/ : find pattern in current line.
ORS : indicates the output line record.
$0 : indicates the whole of the current line.
's/OldPattern/NewPattern/': substitude first OldPattern with NewPattern
/g : does the previous action for all OldPattern
/w : write the result to Newfile
I have files that need to be removed from comments and white space until keyword . Line number varies . Is it possible to limit multiple continued sed substitutions based on Keyword ?
This removes all comments and white spaces from file :
sed -i -e 's/#.*$//' -e 's/;.*$//' -e '/^$/d' file
For example something like this :
# string1
# string2
some string
; string3
; string4
####
<Keyword_Keep_this_line_and_comments_white_space_after_this>
# More comments that need to be here
; etc.
sed -i '1,/keyword/{/^[#;]/d;/^$/d;}' file
I would suggest using awk and setting a flag when you reach your keyword:
awk '/Keyword/ { stop = 1 } stop || !/^[[:blank:]]*([;#]|$)/' file
Set stop to true when the line contains Keyword. Do the default action (print the line) when stop is true or when the line doesn't match the regex. The regex matches lines whose first non-blank character is a semicolon or hash, or blank lines. It's slightly different to your condition but I think it does what you want.
The command prints to standard output so you should redirect to a new file and then overwrite the original to achieve an "in-place edit":
awk '...' input > tmp && mv tmp input
Use grep -n keyword to get the line number that contains the keyword.
Use sed -i -e '1,N s/#..., when N is the line number that contains the keyword, to only remove comments on the lines 1 to N.