Use bash to extract data between two regular expressions while keeping the formatting

I have a question about a small piece of code using the awk command; I have not found an answer/solution anywhere.
I am trying to parse an output file and extract all data between the first expression, ATOMIC (inclusive), and the second expression, Bond (exclusive). This data is to be sent to a new file, $1_geom. So far I have the following:
`awk '/ATOMIC/{flag=1;next}/Bond lengths in Bohr/{flag=0}flag' $1` >> $1_geom
This script extracts the correct data for me, but there are two problems:
The line containing ATOMIC is not extracted with the data.
The data is extracted and appended as a single line. I want the data to retain the formatting from the parsed file (5 columns, a variable number of lines); please see the attached visual example. Is there a different way to append data (other than >>) so that I can keep the formatting?
Any help is appreciated, thank you.

The next is causing the first match to be skipped; take it out if you don't want that.
The backticks by themselves are a shell syntax error (unless your Awk script happens to produce valid shell commands). I'm guessing you have a useless echo or something like that in your actual script which disarms the error, but instead produces the symptoms you describe.
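For reference, a sketch of the corrected line (no next, no backticks, no echo in front):
# Without next, the ATOMIC line itself satisfies the trailing flag pattern
# and gets printed; with the backticks and echo gone, each matched line is
# written out on its own line, preserving the original formatting.
awk '/ATOMIC/{flag=1}/Bond lengths in Bohr/{flag=0}flag' "$1" >> "$1_geom"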

This was part of a csh script, and I did have an "echo" in front of this line. Removing the "echo" makes it work perfectly and addresses the two questions that I had.

Related

How to delete quotation marks in a printed text file

I'm honestly a novice at Scilab.
I'm using the print function to create a .txt file with my character matrix in it.
But when I open the txt file, double quotes appear. I just want the words without "".
This is how I'm using print:
Compterendu(1,1)= "Medecin demandeur: "
fileresname= fullfile(RES_PATH, "compterendu.txt")
print(fileresname,Compterendu)
And compterendu.txt was printed out with the quotes included.
I would be so grateful for any help!
Thanks
Why do you use "print"? After looking into the doc: yes, it is used to produce the same text as when you type the expression or the variable name on the command line. Hence it does print double quotes around strings. If you need something more basic, use lower-level I/O commands, like mputl.
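As a minimal sketch in the question's own Scilab (reusing Compterendu and fileresname as defined above; mputl writes the rows of a string matrix to a file verbatim, without the quoting that print adds):
mputl(Compterendu, fileresname)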
S.

How do I ignore comments in a text file with LabVIEW?

I've created a script file reader (nothing more than a glorified text reader that changes loop cases in my program), but I need it to ignore comments, which are denoted with a semicolon: it should execute the command on a line, ignore everything after the semicolon, and then move on to process the next line. For the life of me, I can't figure out how to do this.
Currently, the commands are read in like this:
DO THIS FUNCTION
DO THAT FUNCTION
I'd like to comment it with a semicolon like this:
DO THIS FUNCTION ;this is a comment to be ignored
Below is my text file read code; you should be able to drag and drop it in to test. The command indicator just echoes the command being read. I've removed the rest of my program (sorry, I can't send that part).
Can someone shed some light?
Is a semicolon used anywhere else in your file? Or is it just used to indicate a comment?
If it is only used to indicate a comment, then as you read each line in, call the Split String primitive and split at the ";". Just use the top output, regardless of whether or not the line contains a semicolon.
You can use the "Match Regular Expression Function" to split up the string, as Moray already suggested.
Sadly, I can't give you an example VI right now.
The main idea is:
find the "Match Regular Expression Function"
give it a ; as the character to search for
there are three outputs of the function (before match, match, after match)
use the 'before match' output instead of the whole line and pass it to the rest of your program
This only works if your commands don't contain any ; except for the comments.
Note: I'm not quite sure what happens if you give the function a string that doesn't contain a ;, but you can figure that out yourself using the detailed help for this function :)

sed delete unmatched lines between two lines with bash variable

I need help understanding a weird problem with sed, bash and a while loop.
My data looks like this:
-File 1- CSV
account,hostnames,status,ipaddress,port,user,pass
-File 2- XML - This is a sample record set for two entries under one account
<accountname="account">
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
</accountname>
So far, I can add records in between the respective account holders from File1 to File2. But if I need to remove records that no longer exist, it does not work properly: it wipes out records from other accounts, i.e. it does not delete only between a matched accountname.
I import from File 1 into File 2 with a while loop in my bash program:
-Bash Program excerpts-
# Read File2 line by line into F
cat File2 | while read F
do
# extract fields from F into variables
_vmname="$(echo $F |grep 'cname'| sed 's/<cname="//g' |sed 's/.\{2\}$//g')"
_account="$(echo $F | grep 'accountname' | sed 's/accountname="//g' |sed 's/.\{2\}$//g')"
# I then compare against File1 and look for stale records that are still in File2
if grep "$_vmname" File1 ;then
continue
else
# if not matched, delete between the respective accountname
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
fi
done
If I manually declare _vmname and _account and run
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
It removes the stale records from File2. When I let my bash script run, it does not.
I think I have three problems:
Reading the variables for _vmname and _account inside a loop means the files are read numerous times. Any better way to do this is appreciated.
I do not think the sed statement that matches these two patterns and then deletes works the way I want inside a while loop.
I may have a logic problem with my thought chain.
Any pointers, and please no awk, perl, lxml or python for this one.
Thanks!
and please no awk
I appreciate that you want to keep things simple, and I suppose awk seems more complicated than what you're doing. But I'd like to point out you have so far 3 grep and 4 sed invocations per line in the file, to process another file N times, once per line. That's O(mn) using the slowest method on the planet to read the file (a while loop). And it doesn't work.
I may have a logic problem with my thought chain.
I'm afraid we must allow for that possibility!
The right advice is to tackle XML with an XML parser, because XML is not a regular language and so can't be parsed with regular expressions. And that's really what you need here, because your program processes the whole XML document. You're not just plucking out bits and depending on incidental formatting artifacts; you want to add records that aren't there and remove those that "no longer exist". Apparently there is information in the XML document you need to preserve, else you would just produce it from the CSV. A parser would spoon-feed it to you.
The second-best advice is to use awk. I suppose you might try an approach like:
Process the CSV and produce the XML to be inserted.
In awk, first read the new input XML into an array keyed by cname. Then process the target XML once: for every cname, consult your array; if you find a match, insert your pre-constructed XML replacement (or modify the "paragraph" accordingly).
I'm not sure what the delete criteria are, so I don't know if it can be done in the same pass with step #2. If not, extract the salient information somehow. Maybe print a list of keys from each of the two files, and use comm(1) to produce a list of to-be-deleted. Then, similar to step #2, read in that list, and process the XML file one more time. Write anything you delete to stderr so you can keep track of what went missing, from what lines.
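To make that concrete, here is a minimal sketch of the delete pass (stale.txt and File2.new are hypothetical names; it assumes stale.txt is non-empty and that each <cname=...> block opens and closes on its own lines, as in the sample above; /dev/stderr works in GNU awk and most other implementations):
# Pass 1 (NR==FNR) loads the to-be-deleted cname keys, one per line, into
# an array. Pass 2 walks the XML once, buffering each <cname=...>...</cname>
# block and printing it only if its key is not stale; deleted blocks go to
# stderr so you can keep track of what went missing.
awk 'NR==FNR { stale[$0]=1; next }
/<cname=/ { inblock=1; block=""; name=$0
            sub(/.*<cname="/, "", name); sub(/".*/, "", name) }
inblock   { block = block $0 ORS
            if (/<\/cname>/) {
                if (name in stale) printf "%s", block > "/dev/stderr"
                else printf "%s", block
                inblock=0
            }
            next }
{ print }' stale.txt File2 > File2.new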
Any pointers
Whenever you find yourself processing the same file N times for N inputs, you know you're headed for trouble. One of the two inputs is always smaller, and that one can be put in some kind of array. cat file | while read is another warning signal, telling you to use awk or any of a dozen obvious utilities that understand lines of text.
You posted your question on SO two weeks ago. I suspect no one answered it because you warned them away: preemptively saying, in effect, don't tell me to use good tools. I'm only here to suggest that you'll be more comfortable after you take off that straitjacket. Better tools, in this case, are the only right answer.

How to clean a csv file where fields contain the csv separator and delimiter

I'm currently struggling to clean csv files generated automatically, where fields contain the csv separator and the field delimiter, using sed or awk or via a script.
The source software has no settings to play with to improve the situation.
Format of the csv:
"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";
Fortunately, the csv is "well" formatted; the exporting software just doesn't escape or replace "forbidden" chars in the fields.
In the last few days I have tried to improve my knowledge of regular expressions and to find an expression to clean the files, but I failed.
What I managed to do so far:
RegEx to find the fields (I wanted to find the fields and perform a replace inside but I didn't find a way to do it)
(?:";"|^")(.*?)(?=";"|";\n)
RegEx that finds a semicolon; it does not work if the semicolon is the last char of the field, and it only finds one per field.
(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)
RegEx to find the double quotes; it seems to pick only the first double quote of the line in online regex testers.
(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)
I thought of adding a space between each char in the fields, then searching for lonely semicolons and double quotes and removing the single spaces after that, but I don't know if it's even possible, and it seems like a poor solution anyway.
Any standard CSV library should be able to handle this if there is no explicit error in the CSV itself. This is why we have quote characters and escape characters.
When you create a CSV yourself, you may forget to handle such cases and leave your final output file in this situation. AWK is not a CSV reader but simply a text-processing utility.
This is what your row should look like instead:
"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";
So if you can still re-fetch the data, find a way to export the CSV properly, either through the database's own functionality or through a CSV library for the language you work with.
In Python, this would look like this:
mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
But if you can't create the csv again, the only hope is that you can expect some pattern within the fields, as in this question: parse a csv file that contains commas in the fields with awk
But this is rarely true of textual data, especially comments or posts on a webpage. Another idea in such situations would be to use '\t' as the separator.
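In the sample row above there is such a pattern: the fields are delimited by the three-character sequence ";" (quote, semicolon, quote), which never occurs inside the fields themselves. A minimal awk sketch exploiting that (dirty.csv and cleaned.tsv are hypothetical names; this breaks the moment a field contains the exact sequence ";"):
# Split on the 3-char sequence ";" rather than on ; alone, then strip the
# leading quote of the first field and the trailing "; of the last field.
# Output is tab-separated, since ; and " are unsafe in this data.
awk -F'";"' '{
    sub(/^"/, "", $1)
    sub(/";?$/, "", $NF)
    for (i = 1; i <= NF; i++) printf "%s%s", $i, (i < NF ? "\t" : "\n")
}' dirty.csv > cleaned.tsv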

return line of strings between two strings in a ruby variable

I would like to extract a line of strings but am having difficulties using the correct RegEx. Any help would be appreciated.
String to extract: KSEA 122053Z 21008KT 10SM FEW020 SCT250 17/08 A3044 RMK AO2 SLP313 T01720083 50005
For some reason StackOverflow won't let me cut and paste the XML data here, since it includes "<>" characters. Basically, I am trying to extract the data between "raw_text" ... "/raw_text" from an XML file that will always be formatted like the following: http://www.aviationweather.gov/adds/dataserver_current/httpparam?dataSource=metars&requestType=retrieve&format=xml&hoursBeforeNow=3&mostRecent=true&stationString=PHNL%20KSEA
However, the Station name, in this case "KSEA" will not always be the same. It will change based on user input into a search variable.
Thanks in advance
If I can assume that every string you want starts with KSEA, then the answer would be:
.*(KSEA.*?)KSEA.*
Using ? lets .* match as little as possible (non-greedy).
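For the raw_text case specifically, here is a minimal sketch run from the shell (metar.xml is a hypothetical file holding the fetched response; the station name inside the tags may vary, so nothing here assumes KSEA):
# Grab whatever sits between <raw_text> and </raw_text>; the m flag lets
# . match across newlines, and .*? keeps the match non-greedy.
ruby -e 'xml = File.read("metar.xml")
         puts xml[%r{<raw_text>(.*?)</raw_text>}m, 1]'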
