Converting a "A B [C] D" formatted file to a CSV file - bash

I have a file which has contents on every line following this format (A, B, C, and D represent text):
A B [C] D
E.g.:
cat Cat [noun] This animal likes to eat mice.
The first separator is the first occurrence of a space (" ") on a line.
The second separator is the first occurrence of a space followed by a square opening bracket (" [").
The final separator is the first occurrence of a square closing bracket followed by a space ("] ").
I want to convert all of the content in this file to a CSV file, where # is used in place of commas:
A#B#C#D
The original file contains many foreign characters in UTF-8.
There are no spaces or brackets within the contents of A and B.
C sometimes contains spaces, but no brackets inside the two given.
D contains anything from spaces, square brackets, etc. and the contents should remain unchanged by the conversion.
How can I convert this file to that format?

Sounds like a task for regular expressions. The literal brackets make this a bit ugly, but here's one that matches your example text.
^([^ ]+) ([^ ]+) \[([^]]+)\] (.*)$
You'll have to check the regular expression api of whatever language you're writing your code in. For help in creating regexes, I recommend Expresso: http://www.ultrapico.com/Expresso.htm
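As a rough illustration of how that pattern could be applied, here is a minimal Python sketch. The function name convert_line is made up for this example, the closing bracket inside the character class is escaped (Python accepts either form), and returning non-matching lines unchanged is an assumption about the desired behaviour.

```python
import re

# Same shape as the regex above: A, B, [C], then everything else as D.
LINE_RE = re.compile(r"^([^ ]+) ([^ ]+) \[([^\]]+)\] (.*)$")

def convert_line(line):
    """Return 'A#B#C#D' for a matching line, or the line unchanged."""
    match = LINE_RE.match(line)
    if match is None:
        return line  # assumption: leave non-matching lines alone
    return "#".join(match.groups())

print(convert_line("cat Cat [noun] This animal likes to eat mice."))
# cat#Cat#noun#This animal likes to eat mice.
```

Because D is captured with (.*), any spaces or brackets it contains pass through untouched.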

You need to perform character substitution. I suggest using sed with a regular expression. The simplest form replaces every run of spaces and square brackets with #, which also rewrites the spaces inside C and D:
sed -r 's/( |\[|\])+/#/g' file_to_modify.txt > file_for_output.txt
To substitute each column separately and leave D untouched, capture the four fields and join them with #:
sed -r 's/^([^ ]+) ([^ ]+) \[([^]]+)\] (.*)$/\1#\2#\3#\4/' f1.txt > f2.txt

The string looks like a user-defined CSV format.
Maybe you can try the csv module in Python:
$ python3
>>> import csv, io, re
>>> '#'.join(next(csv.reader(io.StringIO(re.sub(r'[\[\]]', '\034', 'A B [c c c] D')), delimiter=' ', quotechar='\034')))
'A#B#c c c#D'
Note that this splits on every unquoted space, so it only works as shown when D contains no spaces or brackets.

Related

Swap two characters in bash with tr or similar

I'm doing a bash script and I have a problem. I would like to change the position of two characters in a string.
My input is the following:
"aaaaa_eeeee"
The desired output is:
"eeeee_aaaaa"
I don't want to reverse the string or anything like that; what I need is to replace each "a" with "e" and each "e" with "a". I have tried echo "aaaaa_eeeee" | tr "a" "e". The first replacement is simple, but I don't know how to do the second one.

You can give multiple original and replacement characters to tr. Each character in the original string is replaced with the corresponding replacement character.
echo "aaaaa_eeeee" | tr "ae" "ea"

Pass Translation Sets as Arguments
To make the substitutions work in a single logical pass, you need to pass multiple characters to the tr utility. The man page for the BSD version of tr describes the use of translation sets as follows:
[T]he characters in string1 are translated into the characters in string2 where the first character in string1 is translated into the first character in string2 and so on. If string1 is longer than string2, the last character found in string2 is duplicated until string1 is exhausted.
For example:
$ tr "ae" "ea" <<< "aaaaa_eeeee"
eeeee_aaaaa
This maps a => e and e => a in a single logical pass, avoiding the issues that would result in trying to map the replacements sequentially.
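The same single-pass idea can be sketched in Python for comparison. This is only an illustration, not part of the tr answer: str.maketrans builds a character-to-character table much like tr's two sets, and the second half shows what goes wrong when the two replacements are applied one after the other.

```python
# Build a one-pass translation table: a -> e and e -> a.
swap = str.maketrans("ae", "ea")
print("aaaaa_eeeee".translate(swap))  # eeeee_aaaaa

# Sequential replacement loses information: after a -> e,
# every letter is an "e", so e -> a turns them all into "a".
sequential = "aaaaa_eeeee".replace("a", "e").replace("e", "a")
print(sequential)  # aaaaa_aaaaa
```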

This is a job for rev:
echo "aaaaa_eeeee"|rev
eeeee_aaaaa

Sed converting underscore string to CamelCase fails on numbers

I have an assignment to convert function names that are written like this: function_name() to camelCase. There are some restrictions:
don't convert functions with an uppercase character in them
don't convert the part of a function name with two underscores (two__underscores())
I came up with a sed command that works fairly well, except that it fails on a single digit between underscores:
command:
sed -re '/[A-Z]+/!s/([0-9a-z])(_)([a-z0-9])/\1\u\3/g'
What it does:
this_is_simple() -> thisIsSimple()
this_is_2_simple() -> thisIs2_simple()
this_is_22_simple() -> thisIs22Simple()
The problem is the second example. Why does it fail on a single digit but not on a number with more digits? I tried using [[:digit:]] and replacing ([0-9a-z]) with ([a-z0-9]|[[:digit:]]). They behave the same.
Thank you in advance.

Loop through it manually and replace up until there is nothing more to replace.
sed -re '/[A-Z]+/!{ : again; /([0-9a-zA-Z])_([a-z0-9])/{ s//\1\u\2/; b again; }; }'
I have added A-Z in the first regex to handle cases like:
this_is_a_simple -> thisIsASimple
After the first match it becomes thisIsA_simple, so in the second loop we want to match A_simple.
Maybe a better version would be:
sed -re '/[A-Z]+/!{ : again; /(.*[0-9a-z])_([a-z0-9])/{ s//\1\u\2/; b again; }; }'
Because the regex is greedy, this will replace from the end, so this_is_a_simple first becomes this_is_aSimple, then this_isASimple, then thisIsASimple.
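The reason the original one-pass command fails is that matches in a global substitution cannot overlap: in this_is_2_simple the digit is consumed as the character after the first underscore, so it is no longer available as the character before the second one. Outside sed, which has no lookbehind, the loop can be avoided. The following is a minimal Python sketch of that alternative, with a made-up function name to_camel; it is not the approach used in the answer above.

```python
import re

def to_camel(name):
    """Convert snake_case to camelCase under the question's restrictions."""
    if re.search(r"[A-Z]", name):
        return name  # names with an uppercase letter are left alone
    # The lookbehind checks the character before "_" without consuming it,
    # so a single digit can end one match and still precede the next.
    return re.sub(r"(?<=[0-9a-z])_([a-z0-9])",
                  lambda m: m.group(1).upper(), name)

for name in ("this_is_simple()", "this_is_2_simple()", "two__underscores()"):
    print(name, "->", to_camel(name))
```

A double underscore is left alone because neither underscore has both a letter or digit before it and one after it.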

sed preserve wildcard value inside pattern

I have an app config file, tmp.cfg, and need to change some given values inside it.
Here are some example lines:
app-stat!error!25871a5f-9f50-40ac-923d-c80a660fe21d!1!2
app-stat!queued!25871a5f-9f50-40ac-923d-c80a660fe21d!5!10
app-stat!error!fbbf0e80-8a21-4ebf-9a78-b1017c58a19d!1!2
app-stat!error!5670b363-6a5d-4fcd-819e-85786c5957f1!120!200
For all lines that contain
!error!, then a GUID, then the values !1!2, I want to change them to
!error!, then the same GUID, then the NEW values !7!10
I do not want to touch other lines that contain !error! and a GUID but end in different values.
Here is what I've tried:
sed -i "s/error\!.*\!1\!2/error\!.*\!4\!8/g" tmp.cfg
It finds all the lines I need, but it replaces the GUID with the literal characters .* instead of the GUID itself.
How can I build the sed expression so that it preserves the wildcard part?
The expected result is:
app-stat!error!fbbf0e80-8a21-4ebf-9a78-b1017c58a19d!4!8
The actual result is:
app-stat!error!.*!4!8

sed 's/\(!error!.*\)!1!2/\1!4!8/g' file
I guess you need something like this.
Patterns enclosed within \( ... \) are saved in registers for later use and can be accessed as \1, \2, and so on up to \9.
In the sed expression above, the text from !error! through the GUID is captured in \1 and reused in the replacement as \1!4!8.
You can omit g from the sed expression if you are sure that the same pattern won't occur twice on a line.
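For readers more familiar with Python, the capture-and-reuse idea looks roughly like this. It is a sketch only: update_line is a made-up name, and the $ anchor assumes the two values are always the last thing on the line.

```python
import re

def update_line(line):
    # \g<1> re-inserts whatever the first group captured
    # ("!error!" plus the GUID); only the trailing values change.
    # Assumption: the values sit at the end of the line, hence "$".
    return re.sub(r"(!error!.*)!1!2$", r"\g<1>!4!8", line)

print(update_line("app-stat!error!fbbf0e80-8a21-4ebf-9a78-b1017c58a19d!1!2"))
# app-stat!error!fbbf0e80-8a21-4ebf-9a78-b1017c58a19d!4!8
```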

This is easy to do with awk
awk '$2=="error" && $4==1 && $5==2 {$4=7;$5=10}1' FS="!" OFS="!" file
app-stat!error!25871a5f-9f50-40ac-923d-c80a660fe21d!7!10
app-stat!queued!25871a5f-9f50-40ac-923d-c80a660fe21d!5!10
app-stat!error!fbbf0e80-8a21-4ebf-9a78-b1017c58a19d!7!10
app-stat!error!5670b363-6a5d-4fcd-819e-85786c5957f1!120!200
Separate fields by !
Then if field 2 is error, field 4 is 1 and field 5 is 2
Set fields 4 and 5 to 7 and 10
The trailing 1 prints every line
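The field-based logic above can also be sketched in Python. The function name update_fields is hypothetical, and unlike awk, which compares $4==1 numerically when the field looks like a number, this version compares the fields as exact strings.

```python
def update_fields(line, sep="!"):
    """Rewrite fields 4 and 5 when the line is an error with values 1 and 2."""
    fields = line.split(sep)
    # Python lists are 0-based, so awk's $2, $4, $5 are indexes 1, 3, 4.
    if (len(fields) >= 5 and fields[1] == "error"
            and fields[3] == "1" and fields[4] == "2"):
        fields[3], fields[4] = "7", "10"
    return sep.join(fields)

print(update_fields("app-stat!error!25871a5f-9f50-40ac-923d-c80a660fe21d!1!2"))
# app-stat!error!25871a5f-9f50-40ac-923d-c80a660fe21d!7!10
```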

This sed command should work:
sed -r 's/(.*)!error!(.*)!1!2$/\1!error!\2!4!8/g' file_name

Difference of answers while using split function in Ruby

Given the following inputs:
line1 = "Hey | Hello | Good | Morning"
line2 = "Hey , Hello , Good , Morning"
file1=length1=name1=title1=nil
Using ',' to split the string as follows:
file1, length1, name1, title1 = line2.split(/,\s*/)
I get the following output:
puts file1,length1,name1,title1
>Hey
>Hello
>Good
>Morning
However, using '|' to split the string I receive a different output:
file1, length1, name1, title1 = line1.split(/|\s*/)
puts file1,length1,name1,title1
>H
>e
>y
Both the strings are same except the separating symbol (a comma in first case and a pipe in second case). The format of the split function I am using is also the same except, of course, for the delimiting character. What causes this variation?

The problem is that | means OR (alternation) in a regex. If you want the literal character, you need to escape it as \|. So the correct regex is /\|\s*/
Currently, the regex /|\s*/ means "the empty string or a run of whitespace characters". Since the empty string is specified first in the alternation, the regex engine breaks the string up at every character (you can imagine that there is an empty string between every pair of characters). If you swap it to /\s*|/, then whitespace will be preferred over the empty string where possible, and there will be no whitespace in the list of tokens after splitting.
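The effect of escaping the pipe is not specific to Ruby. Here is a small Python sketch of the same contrast; Python's re.split and Ruby's String#split differ in how they report empty matches, so only the escaped results should be read as equivalent to the Ruby fix, and the unescaped call assumes Python 3.7 or later.

```python
import re

line1 = "Hey | Hello | Good | Morning"

# Escaped pipe: split on a literal "|" followed by optional whitespace.
parts = re.split(r"\|\s*", line1)
print(parts)  # ['Hey ', 'Hello ', 'Good ', 'Morning']

# Also consuming whitespace before the pipe removes the trailing spaces.
trimmed = re.split(r"\s*\|\s*", line1)
print(trimmed)  # ['Hey', 'Hello', 'Good', 'Morning']

# Unescaped pipe: "empty string OR whitespace" matches at every position,
# so the string falls apart into many more than four pieces.
pieces = re.split(r"|\s*", line1)
print(len(pieces) > 4)  # True
```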

sed command to edit stream on given rule

I have an input stream like this:
afs=1;bgd=1;cgd=1;djh=1;fgjhh=1;
Now the rule I have to apply to the stream is:
(1) if we have
"djh=number;"
replace it with
"djh=number,"
(2) otherwise, replace "string=number;" with
"string,"
I can handle case 2 as:
sed 's/afs=1/afs,/g;s/dbg=1/dbg,/g;..... so on for rest
How do I take care of condition 1?
The "djh" number can be any number (1, 12, 100); the other numbers are always 1.
All the double quotes I have used are for reference only; no double quotes are present in the input stream. "afs" can also be "Afs".
Thanks in advance.

sed -e 's/;/,/g; s/,djh=/,#=/; s/\([a-z][a-z]*\)=[0-9]*,/\1,/g; s/#/djh/g'
This does the following
replace all ; by ,
replace djh with #
remove =number from all lower cased strings
replace # with djh
This results in afs,bgd,cgd,djh=1,fgjhh, for your input. Of course you could substitute djh with any other character that makes it easy to match the other strings. This is just illustrating the idea.

echo 'afs=1;bgd=1;cgd=1;djh=1;fgjhh=1;' |
sed -e 's/\(djh=[0-9]\+\);/\1,/g' -e 's/\([a-zA-Z0-9]\+\)=1;/\1,/g'
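The two expressions above handle the exception first and then the general case. As a hedged illustration of the same rule in one pass, here is a Python sketch; the function name rewrite and the keep parameter are made up for this example, and it assumes keys consist only of letters and digits.

```python
import re

def rewrite(stream, keep="djh"):
    """Apply both rules: keep 'djh=number', drop '=number' elsewhere."""
    def repl(match):
        key, value = match.group(1), match.group(2)
        if key == keep:
            return f"{key}={value},"  # rule 1: only ";" becomes ","
        return f"{key},"              # rule 2: "=number;" becomes ","
    return re.sub(r"([A-Za-z0-9]+)=([0-9]+);", repl, stream)

print(rewrite("afs=1;bgd=1;cgd=1;djh=12;fgjhh=1;"))
# afs,bgd,cgd,djh=12,fgjhh,
```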

This might work for you:
echo "afs=1;bgd=1;cgd=1;djh=1;fgjhh=1;" |
sed 's/^/\n/;:a;/\n\(djh=[0-9]*\);/s//\1,\n/;ta;s/\n\([^=]*\)=1;/\1,\n/;ta;s/.$//'
afs,bgd,cgd,djh=1,fgjhh,
Explanation:
This method uses a unique marker (\n is a good choice because it cannot appear in the pattern space, as sed uses it as the line delimiter) as an anchor that moves through the input string. It is slow but scales if more than one exception is needed.
Place the marker in front of the string: s/^/\n/
Name a loop label: :a
Match the exception(s): /\n\(djh=[0-9]*\);/
If the exception occurs, substitute as necessary and bump the marker along: s//\1,\n/
If the above substitution succeeded, branch to the loop label: ta
Match the normal case and substitute, also bumping the marker along: s/\n\([^=]*\)=1;/\1,\n/
If the above substitution succeeded, branch to the loop label: ta
All done, so remove the marker: s/.$//
or:
echo "afs=1;bgd=1;cgd=1;djh=1;fgjhh=1;" |
sed 's/\<djh=/\n/g;s/=[^;]*;/,/g;s/\n\([^;]*\);/djh=\1,/g'
afs,bgd,cgd,djh=1,fgjhh,
Explanation:
This is fast but does not scale to multiple exceptions:
Globally replace the exception string with a newline: s/\<djh=/\n/g
Globally replace the normal condition: s/=[^;]*;/,/g
Globally replace the \n with the exception string: s/\n\([^;]*\);/djh=\1,/g
N.B. When replacing the exception string, make sure that it begins on a word boundary: \<
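The three placeholder steps can be mirrored in Python to make the data flow easier to follow. This is a sketch under the same assumptions as the sed version (a single exception key, and no newline inside the input); the function name rewrite_with_placeholder is hypothetical.

```python
import re

def rewrite_with_placeholder(stream, exception="djh"):
    marker = "\n"  # assumed never to occur in the input stream
    # 1. Hide the exception key behind the marker (word boundary first).
    hidden = re.sub(rf"\b{re.escape(exception)}=", marker, stream)
    # 2. Collapse every remaining "=value;" to ",".
    collapsed = re.sub(r"=[^;]*;", ",", hidden)
    # 3. Restore the exception key and turn its ";" into ",".
    return re.sub(r"\n([^;]*);", rf"{exception}=\1,", collapsed)

print(rewrite_with_placeholder("afs=1;bgd=1;cgd=1;djh=12;fgjhh=1;"))
# afs,bgd,cgd,djh=12,fgjhh,
```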
