What's an efficient way of checking the format of a file with Ruby? - ruby

I have a file like:
Fruit.Store={
#blabla
"customer-id:12345,item:store/apple" = (1,2); #blabla
"customer-id:23456,item:store/banana" = (1,3); #blabla
"customer-id:23456,item:store/watermelon" = (1,4);
#blabla
"customer-id:67890,item:store/watermelon" = (1,6);
#The following two are unique
"customer-id:0000,item:store/" = (100, 100);
#
"" = (0,0)
};
Except for the comments, each line has the same format: customer-id and item:store/ are fixed, and customer-id is a 5-digit number. The last two records are unique. How could I make sure the file is in the right format elegantly? I am thinking about using a flag for the first special line Fruit.Store={, and then, for the following lines, splitting each line by "," and "="; if a split line is not correct, matching it against the last two records. I want to use Ruby for this. Any advice? Thank you.
I am also thinking about using a regular expression for the format, and wrote:
^"customer:\d{5},item:store\/\D*"=\(\d*,\d*\);
but I want to combine these two situations (with a comment and without one):
^"customer:\d{5},item:store\/\D*"=\(\d*,\d*\);$
^"customer:\d{5},item:store\/\D*"=\(\d*,\d*\);#.*$
How could I do it? Thanks.

Using regular expressions could be a good option since each line has a fixed format, and you almost got it; your regex just needed a few tweaks:
(?:#.*|^"customer-id:\d{5},item:store\/\D*" *= *\(\d*, *\d*\); *(?:#.*)?)$
This is what was added to your current regex:
An option for the line to be a comment line (#.*) or (|) a regular line (everything after the |).
A check for possible spaces before and after =, after the comma (,) that separates the digits in parentheses, and at the end of the line.
An option to include another comment at the end of the line ((?:#.*)?).
So just compare each line against this regex to check for the right format.
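For example, a minimal sketch of such a check (the filename is hypothetical, and the patterns for the header line, the two unique records, and the closing }; are assumptions added on top of the regex above):
# Regular data lines and comment-only lines, per the regex above
line_re    = /\A(?:#.*|"customer-id:\d{5},item:store\/\D*" *= *\(\d*, *\d*\); *(?:#.*)?)\z/
# The header, the two unique records, and the closing brace (assumed to be fixed strings)
special_re = /\A(?:Fruit\.Store=\{|"customer-id:0000,item:store\/" = \(100, 100\);|"" = \(0,0\)|\};)\z/

File.foreach("fruit_store.txt").with_index(1) do |line, lineno|
  line = line.chomp
  next if line.match?(line_re) || line.match?(special_re)
  puts "line #{lineno} is not in the expected format: #{line.inspect}"
end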

Related

Split a string and remove the first element in string

Original string '4.0.0-4.0-M-672092'
How can I modify the original string to "4.0-M-672092" using one line of code?
Any help is highly appreciated.
Thanks and regards
The 'split' method works in this case:
https://apidock.com/ruby/String/split
'4.0.0-4.0-M-672092'.split('-')[1..-1].join('-')
# => "4.0-M-672092"
Just be careful: it is fine in this case, but on long strings this can become inefficient, since it splits the whole string and then joins the array back together.
If you need this to be more efficient on larger texts, you can find the index of the "-" (which is your split character) and use the next position to take a substring:
text = '4.0.0-4.0-M-672092'
text[(text.index('-') + 1)..-1]
# => "4.0-M-672092"
But you can't do it in one line, and not finding the split character will raise an error (index returns nil), so use a rescue statement or a nil check if that can happen.
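For example, a small sketch of such a guard (a nil check on index works equally well instead of rescue):
text = '4.0.0-4.0-M-672092'
begin
  text[(text.index('-') + 1)..-1]   # => "4.0-M-672092"
rescue NoMethodError                # index returned nil: no '-' in the string
  text
end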
Simplest way:
'4.0.0-4.0-M-672092'.split('-', 2).second
(Note that second comes from ActiveSupport; in plain Ruby use [1] or .last instead.)
"4.0.0-4.0-M-672092"[/(?<=-).*/]
#=> "4.0-M-672092"
The regular expression reads, "Match zero or more characters other than newlines, as many as possible (.*), provided the match is preceded by a hyphen." (?<=-) is a positive lookbehind. See String#[].

Regular Expression replacement to convert Less mixins to Scss

I'm looking to convert Less mixin calls to their equivalents in Scss:
.mixin(); should become #mixin();
.mixin(0); should become #mixin(0);
.mixin(0; 1; 2); should become #mixin(0, 1, 2);
I'm having the most difficulty with the third example, as I essentially need to match n groups separated by semicolons, and replace those with the same groups separated by commas. I suppose this relies on some sort of repeating groups functionality in regexes that I'm not familiar with.
It's not enough to simply replace semicolons within parens; I need a regex that will only match the \.[\w\-]+\(.*\) format of mixins, but obviously with some magic in the second match group to handle the third example above.
I'm doing this in Ruby, so if you're able to provide replacement syntax that's compatible with gsub, that would be awesome. I would like a single regex replacement, something that doesn't require multiple passes to clean up the semicolons.
I suggest adding two capturing groups around the subvalues you need and using an additional gsub inside the first gsub's block to replace the ; with , only in the 2nd group.
See this example:
s = ".mixin(0; 1; 2);"
puts s.gsub(/\.([\w\-]+)(\(.*\))/) { "##{$1}#{$2.gsub(/;/, ',')}" }
# => #mixin(0, 1, 2);
The pattern details:
\. - a literal dot
([\w\-]+) - Group 1 capturing 1 or more word chars ([a-zA-Z0-9_]) or -
(\(.*\)) - Group 2 capturing a (, then any 0+ chars other than linebreak symbols, as many as possible, up to the last ) and the last ). NOTE: if there can be more than one (...) group on a line, use lazy matching - (\(.*?\)) - here.
Here you go:
less_style = ".mixin(0; 1; 2);"
# convert the first period to #
less_style.gsub! /^\./, '#'
# convert the inner semicolons to commas
scss_style = less_style.gsub /(?<=[\(\d]);/, ','
scss_style
# => "#mixin(0, 1, 2);"
The second regex uses a positive lookbehind. You can read about those here: http://www.regular-expressions.info/lookaround.html
I also use this neat web app to play around with regexes: http://rubular.com/
This will get you a single pass through gsub:
".mixin(0; 1; 2);".gsub(/(?<!\));|\./, ";" => ",", "." => "#")
=> "#mixin(0, 1, 2);"
It's an OR regex with a hash for the replacement parameters.
Assuming from your example that you just want to replace semicolons not following a close paren (negative lookbehind): (?<!\));
You can modify/build on this with other expressions. Even add more OR conditions to the regex.
Also, you can use the block version of gsub if you need more options.
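For instance, a quick sketch of the block form doing the same replacement:
".mixin(0; 1; 2);".gsub(/(?<!\));|\./) { |match| match == "." ? "#" : "," }
# => "#mixin(0, 1, 2);"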

Using awk or sed to print column of CSV file enclosed in double quotes

I'm working on a CSV file like the one below: comma delimited, each cell enclosed in double quotes, but some cells contain a double quote and/or comma inside the double quote enclosure. The actual file contains around 300 columns and 200,000 rows.
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"
I'll need to remove some useless columns and merge the last few columns; instead of having "," in between them, I need </br>. I also need to move the second column to the end. Anything within the cells should stay the same, with double quotes and commas as in the original file. Below is an example of the output that I need.
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"
In this example I want to remove column 3 and merge columns 5, 6, and 7.
Below is the code that I tried to use, but it interprets the double quotes and/or commas (and therefore where each cell or row ends) differently than I expected.
awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
sed is used to remove the beginning and ending double quotes left around the merged cells.
In the output file that I'm getting right now, if the previous field contains a double quote, it is treated as the beginning of a new cell, so the following values are often shifted into the wrong column.
Other code that I have used considers every comma as the beginning of a cell, so that won't work either.
awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv
sed -i 's#"</br>"#</br>#g' inputfile.csv
Any help is greatly appreciated. Thanks!
CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]
        # Copy second field to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
Note that in Python, indexes start with 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options at Dialects and Formatting Parameters.
The above code produces the following output:
"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"
There are two issues with this:
It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The result contains incorrect quotes.
If these quotes are properly escaped, your sample input CSV file becomes
infile.csv (escaped quotes):
"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
Now consider this modified Python script that doesn't merge columns on the first row:
#!/usr/bin/python
import csv
with open('infile.csv', 'r') as infile, open('outfile.csv', 'wb') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        # Drop fields 6 and 7 (indexes 5, 6) on every row, including the header
        del row[5:7]
        # Copy second field (index 1) to the end
        row.append(row[1])
        # Remove second and third fields
        del row[1:3]
        # Write manipulated row
        outwriter.writerow(row)
The output outfile.csv is
"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"
This is your sample output, but with properly escaped "some other, ""cde"" here".
This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.
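Since this page is tagged ruby, here is a rough sketch of the same approach using Ruby's standard csv library (the filenames are assumptions; like the first Python script it also merges the header row, and it likewise expects the quotes in the input to be properly escaped):
require 'csv'

CSV.open('outfile.csv', 'w', force_quotes: true) do |out|
  CSV.foreach('infile.csv') do |row|
    # Merge fields 5, 6, 7 (indexes 4, 5, 6) into one
    row[4] = row[4..6].join('</br>')
    row.slice!(5, 2)
    # Copy the second field to the end, then drop the second and third fields
    row << row[1]
    row.slice!(1, 2)
    out << row
  end
end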
This might be an oversimplification of the problem but this has worked for me with your test data:
cat /tmp/inputfile.csv | sed 's#\"\,\"#|#g' | sed 's#"</br>"#</br>#g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'
Please note that I am on a Mac; that's probably why I had to wrap the commas in the awk script in quotation marks.

Multiple sequence alignment. Convert multi-line format to single-line format?

I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
TGFb3_human_used_for_docking LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA LRSADTTHST-
Each line begins with a sequence identifier, and then a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split into several lines, so you see that the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence has a single line, like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
(In this particular example the sequences are almost identical, but in general they aren't!)
How can I convert from multi-line multiple sequence alignment format to single-line?
Looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line the white-space up prettily like in your example (if you care about that, you'll have to mess around with formatting), but it gets the rest of the job done.
#Create a dictionary to accumulate full sequences
full_sequences = {}
#Loop through original file (replace test.txt with your file name)
#and add each line to the appropriate dictionary entry
with open("test.txt") as infile:
    for line in infile:
        line = [element.strip() for element in line.split()]
        if len(line) < 2:
            continue
        full_sequences[line[0]] = full_sequences.get(line[0], "") + line[1]
#Now loop through the dictionary and write each entry as a single line
outstr = ""
with open("test.txt", "w") as outfile:
    for seq in full_sequences:
        outstr += seq + "\t\t" + full_sequences[seq] + "\n"
    outfile.write(outstr)
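If you would rather stay in Ruby (this page is tagged ruby), a roughly equivalent sketch could look like this; the file names are assumptions, and it writes to a separate output file instead of overwriting the input:
# Accumulate the full sequence for each identifier
full_sequences = Hash.new { |hash, id| hash[id] = "" }

File.foreach("alignment.txt") do |line|
  id, seq = line.split
  next if seq.nil?              # skip blank or malformed lines
  full_sequences[id] << seq
end

# Write each accumulated sequence as a single line
File.open("alignment_single_line.txt", "w") do |out|
  full_sequences.each { |id, seq| out.puts "#{id}\t\t#{seq}" }
end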

Could someone help me parse this string with regex?

I'm not very good with regex, but here's what I've got (the string to parse and the regex are on this page): http://rubular.com/r/iIIYDHkwVF
It just needs to match that exact test string.
The regular expression is
^"AddonInfo"$(\n\s*)+^\{\s*
It's looking for
^"AddonInfo"$ — a line containing only "AddonInfo"
(\n\s*)+ — followed by at least one newline and possibly many blank or empty lines
^\{\s* — and finally a line beginning with { followed by optional whitespace
To break a regular expression down into its component pieces, have a look at an answer that explains them beginning with the basics.
To match the entire string, use
^"AddonInfo"$(\n\s*)+^\{(\s*".+?"\s+".+?"\s*\n)+^\}
So after the open curly, you're looking for one or more lines such that each contains a pair of quote-delimited simple strings (no escaping).
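As a quick sanity check in Ruby (the sample text below is hypothetical; the real test string lives on the linked rubular page):
text = <<~ADDON
  "AddonInfo"
  {
    "addontitle"   "My Addon"
    "addonauthor"  "Someone"
  }
ADDON

regex = /^"AddonInfo"$(\n\s*)+^\{(\s*".+?"\s+".+?"\s*\n)+^\}/
puts text.match?(regex)   # => true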
This one works:
^"AddonInfo"[^{]*{[^}]*}
Explanation:
^"AddonInfo" matches "AddonInfo" in the beginning of a line
[^{]* matches all the following non-{ characters
{ matches the following {
[^}]* matches all the following non-} characters
} matches the following }
^"AddonInfo"(\s*)+^\{\s*(?:"([^"]+)"\s+"([^"]*)"\s+)+\}
Note that Ruby keeps only the captures from the last repetition of a repeated group, so after a match $2 and $3 hold the final key/value pair (group 1 is the (\s*)+ part); to collect every key/value pair, scan the block instead, as sketched below.
Notice that the key is required to be non-empty ("([^"]+)"), but the value may be empty (it uses * instead of +).
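A hedged sketch of that scan approach (the sample text is hypothetical, in the format described above):
text = <<~ADDON
  "AddonInfo"
  {
    "addontitle"   "My Addon"
    "addonauthor"  "Someone"
  }
ADDON

body  = text[/\{(.*)\}/m, 1]                  # everything between the braces
pairs = body.scan(/"([^"]+)"\s+"([^"]*)"/)
p pairs   # => [["addontitle", "My Addon"], ["addonauthor", "Someone"]]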
