Store regex matches in ruby? - ruby

I'm parsing a file with ruby to change the data formatting. I created a regex which has three match groups that I want to temporally store in variables. I'm having trouble getting the matches to be stored as everything is nil.
Here is what I have so far from what I've read.
regex = '^"(\bhttps?://[-\w+&##/%?=~_|$!:,.;]*[\w+&##/%=~_|$])","(\w+|[\w._%+-]+#[\w.-]+\.[a-zA-Z]{2,4})","(\w{1,30})'
begin
file = File.new("testfile.csv", "r")
while (line = file.gets)
puts line
match_array = line.scan(/regex/)
puts $&
end
file.close
end
Here is some sample data that I'm using for testing.
"https://mail.google.com","Master","password1","","https://mail.google.com","",""
"https://login.sf.org","monster#gmail.com","password2","https://login.sf.org","","ctl00$ctl00$ctl00$body$body$wacCenterStage$standardLogin$tbxUsername","ctl00$ctl00$ctl00$body$body$wacCenterStage$standardLogin$tbxPassword"
"http://www.facebook.com","Beast","12345678","https://login.facebook.com","","email","pass"
"http://www.own3d.tv","Earth","passWOrd3","http://www.own3d.tv","","user_name","user_password"
Thank you,
LF4

This won't work:
match_array = line.scan(/regex/)
That's just using a literal "regex" string as your regular expression, not what's in your regex variable. You can either put the big ugly regex right into your scan or create a Regexp instance:
regex = Regexp.new('^"(\bhttps?://[-\w+&##/%?=~_|$!:,.;]*[\w+&##/%=~_|$])","(\w+|[\w._%+-]+#[\w.-]+\.[a-zA-Z]{2,4})","(\w{1,30})')
# ...
match_array = line.scan(regex)
And you should probably use a CSV library (one comes with Ruby: 1.8.7 or 1.9) for parsing CSV files, then apply a regular expression to each column from the CSV. You'll run into fewer quoting and escaping issues that way.

Related

Find youtube url in json file with ruby

For testing purpose my json file (test.json) consists of only the string I want to find:
"https://www.youtube.com/watch?v=hBIZF3sDFTI"
Somehow I cannot find the string in file with this ruby code:
if not File.foreach("test.json").grep(/https://www.youtube.com/watch?v=hBIZF3sDFTI/).any?
puts("string not in file")
end
Output: "string not in file"
But the string is in the file.
Searching for other strings works fine, so it must be a problem with this particular string.
Any help is much appreciated!
Problems
Your regex pattern isn't valid, because it's got too many forward slashes in it. Specifically:
/https://www.youtube.com/watch?v=hBIZF3sDFTI/
is not a valid regular expression. Your String is also not a valid a JSON object.
Solution
You need to escape special regular expression characters like / and ? before trying to use your pattern. For example, you could call Regexp#escape on the String like so:
Regexp.escape 'https://www.youtube.com/watch?v=hBIZF3sDFTI'
#=> "https://www\\.youtube\\.com/watch\\?v=hBIZF3sDFTI"
Then, assuming you have a valid JSON object, you could match the expression as follows:
require 'json'
str = 'https://www.youtube.com/watch?v=hBIZF3sDFTI'
json = str.to_json
#=> "\"https://www.youtube.com/watch?v=hBIZF3sDFTI\""
pattern = Regexp.escape str
json.match pattern
#=> #<MatchData "https://www.youtube.com/watch?v=hBIZF3sDFTI">

Reading specific line into an array - ruby

Have a txt file with the following:
Anders Hansen;87442355;11;87
Jens Hansen;22338843;23;11
Nanna Kvist;25233255;24;84
I would like to search the file after a specific name taken from the user input. Then save that line into an array, splittet via ";". Can't get it to work though. This is my code:
user1 = []
puts "Start by entering the full name of user 1: "
input = gets.chomp
File.open("userregister.txt") do |f|
f.each_line { |line|
if line =~ input then do |line|
user1 << line.split(';').map
=~ in ruby tries to match a string with a regex (or vice versa). Here, you use it with two strings, which gives an error:
'foo' =~ 'bar' # => TypeError: type mismatch: String given
There are more appropriate String methods to use instead. In your case, #start_with? does the job. If you wanted to check if the latter is contained somewhere as a substring (but not necessary the beginning), you can use #include?.
In case you actually wanted to take a regex as a user input (generally a bad idea), you can convert it from string to regex:
line =~ /#{input}/
Looking at the file format, I would actually use Ruby CSV class. By specifying the column separator to ;, you will get an array for each row.
require 'csv'
input = gets.chomp
CSV.foreach('userregister.txt', col_sep: ';') do |row|
if row[0].downcase == input.downcase
# Do stuffs with row[1..-1]
end
end

Generate string for Regex pattern in Ruby

In Python language I find rstr that can generate a string for a regex pattern.
Or in Python we have this method that can return range of string:
re.sre_parse.parse(pattern)
#..... ('range', (97, 122)) ....
But In Ruby I didn't find any thing.
So how to generate string for a regex pattern in Ruby(reverse regex)?
I wanna to some thing like this:
"/[a-z0-9]+/".example
#tvvd
"/[a-z0-9]+/".example
#yt
"/[a-z0-9]+/".example
#bgdf6
"/[a-z0-9]+/".example
#564fb
"/[a-z0-9]+/" is my input.
The outputs must be correct string that available in my regex pattern.
Here outputs were: tvvd , yt , bgdf6 , 564fb that "example" method generated them.
I need that method.
Thanks for your advice.
You can also use the Faker gem https://github.com/stympy/faker and then use this call:
Faker::Base.regexify(/[a-z0-9]{10}/)
In Ruby:
/qweqwe/.to_s
# => "(?-mix:qweqwe)"
When you declare a Regexp, you've got the Regexp class object, to convert it to String class object, you may use Regexp's method #to_s. During conversion the special fields will be expanded, as you may see in the example., using:
(using the (?opts:source) notation. This string can be fed back in to Regexp::new to a regular expression with the same semantics as the original.
Also, you can use Regexp's method #inspect, which:
produces a generally more readable version of rxp.
/ab+c/ix.inspect #=> "/ab+c/ix"
Note: that the above methods are only use for plain conversion Regexp into String, and in order to match or select set of string onto an other one, we use other methods. For example, if you have a sourse array (or string, which you wish to split with #split method), you can grep it, and get result array:
array = "test,ab,yr,OO".split( ',' )
# => ['test', 'ab', 'yr', 'OO']
array = array.grep /[a-z]/
# => ["test", "ab", "yr"]
And then convert the array into string as:
array.join(',')
# => "test,ab,yr"
Or just use #scan method, with slightly changed regexp:
"test,ab,yr,OO".scan( /[a-z]+/ )
# => ["test", "ab", "yr"]
However, if you really need a random string matched the regexp, you have to write your own method, please refer to the post, or use ruby-string-random library. The library:
generates a random string based on Regexp syntax or Patterns.
And the code will be like to the following:
pattern = '[aw-zX][123]'
result = StringRandom.random_regex(pattern)
A bit late to the party, but - originally inspired by this stackoverflow thread - I have created a powerful ruby gem which solves the original problem:
https://github.com/tom-lord/regexp-examples
/this|is|awesome/.examples #=> ['this', 'is', 'awesome']
/https?:\/\/(www\.)?github\.com/.examples #=> ['http://github.com', 'http://www.github.com', 'https://github.com', 'https://www.github.com']
UPDATE: Now regular expressions supported in string_pattern gem and it is 30 times faster than other gems
require 'string_pattern'
/[a-z0-9]+/.generate
To see a comparison of speed https://repl.it/#tcblues/Comparison-generating-random-string-from-regular-expression
I created a simple way to generate strings using a pattern without the mess of regular expressions, take a look at the string_pattern gem project: https://github.com/MarioRuiz/string_pattern
To install it: gem install string_pattern
This is an example of use:
# four characters. optional: capitals and numbers, required: lower
"4:XN/x/".gen # aaaa, FF9b, j4em, asdf, ADFt
Maybe you can find what you are looking for over here.

Ruby regex gsub a line in a text file

I need to match a line in an inputted text file string and wrap that captured line with a character for example.
For example imagine a text file as such:
test
foo
test
bar
I would like to use gsub to output:
XtestX
XfooX
XtestX
XbarX
I'm having trouble matching a line though. I've tried using regex starting with ^ and ending with $, but it doesn't seem to work. Any ideas?
I have a text file that has the following in it:
test
foo
test
bag
The text file is being read in as a command line argument.
So I got
string = IO.read(ARGV[0])
string = string.gsub(/^(test)$/,'X\1X')
puts string
It outputs the exact same thing that is in the text file.
If you're trying to match every line, then
gsub(/^.*$/, 'X\&X')
does the trick. If you only want to match certain lines, then replace .* with whatever you need.
Update:
Replacing your gsub with mine:
string = IO.read(ARGV[0])
string = string.gsub(/^.*$/, 'X\&X')
puts string
I get:
$ gsub.rb testfile
XtestX
XfooX
XtestX
XbarX
Update 2:
As per #CodeGnome, you might try adding chomp:
IO.readlines(ARGV[0]).each do |line|
puts "X#{line.chomp}X"
end
This works equally well for me. My understanding of ^ and $ in regular expressions was that chomping wouldn't be necessary, but maybe I'm wrong.
You can do it in one line like this:
IO.write(filepath, File.open(filepath) {|f| f.read.gsub(//<appId>\d+<\/appId>/, "<appId>42</appId>"/)})
IO.write truncates the given file by default, so if you read the text first, perform the regex String.gsub and return the resulting string using File.open in block mode, it will replace the file's content in one fell swoop.
I like the way this reads, but it can be written in multiple lines too of course:
IO.write(filepath, File.open(filepath) do |f|
f.read.gsub(//<appId>\d+<\/appId>/, "<appId>42</appId>"/)
end
)
If your file is input.txt, I'd do as following
File.open("input.txt") do |file|
file.lines.each do |line|
puts line.gsub(/^(.*)$/, 'X\1X')
end
end
(.*) allows to capture any characters and makes it a variable Regexp
\1 in the string replacement is that captured group
If you prefer to do it in one line on the whole content, you can do it as following
File.read("input.txt").gsub(/^(.*)$/, 'X\1X')
string.gsub(/^(matchline)$/, 'X\1X')
Uses a backreference (\1) to get the first capture group of the regex, and surround it with X
Example:
string = "test\nfoo\ntest\nbar"
string.gsub!(/^test$/, 'X\&X')
p string
=> "XtestX\nfoo\nXtestX\nbar"
Chomp Line Endings
Your lines probably have newline characters. You need to handle this one way or another. For example, this works fine for me:
$ ruby -ne 'puts "X#{$_.chomp}X"' /tmp/corpus
XtestX
XfooX
XtestX
XbarX

What's the best way to parse a tab-delimited file in Ruby?

What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
require 'csv'
line = 'boogie\ttime\tis "now"'
begin
line = CSV.parse_line(line, col_sep: "\t")
puts "parsed correctly"
rescue CSV::MalformedCSVError
puts "failed to parse line"
end
begin
line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
puts "failed to parse line with random quote char"
end
#Output:
# failed to parse line
# parsed correctly with random quote char
If you want to use the CSV library you could used a random quote character that you don't expect to see if your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.
# The main parse method is mostly borrowed from a tweet by #JEG2
class StrictTsv
attr_reader :filepath
def initialize(filepath)
#filepath = filepath
end
def parse
open(filepath) do |f|
headers = f.gets.strip.split("\t")
f.each do |line|
fields = Hash[headers.zip(line.split("\t"))]
yield fields
end
end
end
end
# Example Usage
tsv = Vendor::StrictTsv.new("your_file.tsv")
tsv.parse do |row|
puts row['named field']
end
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values
There are actually two different kinds of TSV files.
TSV files that are actually CSV files with a delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines, as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
use 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.rstrip.split("\t", -1) (note -1, which prevents split from removing empty trailing fields). If you want to use the csv gem, simply set quote_char to nil:
use 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
I like mmmries answer. HOWEVER, I hate the way that ruby strips off any empty values off of the end of a split. It isn't stripping off the newline at the end of the lines, either.
Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:
def parse
open(filepath) do |f|
headers = f.gets.strip.split("\t")
f.each do |line|
myline=line
while myline.scan(/\t/).count != headers.count-1
myline+=f.gets
end
fields = Hash[headers.zip(myline.chomp.split("\t",headers.count))]
yield fields
end
end
end
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).

Resources