I need to remove or strip whitespace from CSV rows.
test.csv
60 500, # want 60500
8 100, # want 8100
5 400, # want 5400
480, # want 480
remove_space.rb
require 'csv'
CSV.foreach('test.csv') do |row|
row = row[0]
row = row.strip
row = row.gsub(" ","")
puts row
end
I don't understand why it doesn't work. The result is the same as test.csv.
Any idea?
Your test.csv file contains narrow no-break space (U+202F) Unicode characters. This is not the regular ASCII space character (U+0020), so neither strip nor gsub(" ", "") removes it.
You can see the different possible unicode spaces here: http://jkorpela.fi/chars/spaces.html
Here is a more generic script, using a POSIX bracket expression, to remove all "space-like" characters:
require 'csv'
CSV.foreach('test.csv') do |row|
row = row[0]
row = row.gsub(/[[:space:]]/,"")
puts row
end
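To see why strip and gsub(" ", "") leave the value untouched, it can help to print the code points of a cell. The following is only a minimal sketch: the helper name remove_all_spaces and the sample string are made up for illustration and are not part of the original script.

```ruby
# Hypothetical helper (not from the original post): removes every
# character matched by [[:space:]], which for UTF-8 strings includes
# Unicode spaces such as U+202F.
def remove_all_spaces(cell)
  cell.gsub(/[[:space:]]/, "")
end

sample = "60\u202F500" # made-up cell with a narrow no-break space

# List the code points to reveal the non-ASCII space.
puts sample.codepoints.map { |c| format("U+%04X", c) }.join(" ")

# strip and a literal " " only deal with ASCII whitespace,
# so this comparison prints true (nothing was removed).
puts sample.strip.gsub(" ", "") == sample

puts remove_all_spaces(sample)
```

Printing the code points first is a cheap way to find out which "space" a file really contains before choosing a pattern.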
The thing is those are not normal spaces but rather Narrow No-Break Spaces:
require 'csv'
CSV.foreach('/tmp/test.csv') do |row|
puts row[0].delete "\u202f"
end
#⇒ 60500
# 8100
# 5400
# 480
You can strip out all whitespace, including the Unicode kinds, by using the \p{Space} matcher.
require 'csv'
CSV.foreach('/tmp/test.csv') do |row|
  # parentheses avoid Ruby's "ambiguous first argument" warning
  puts row[0].gsub(/\p{Space}/, '')
end
Related
If I need to get a word from a text file, for example, I have this text file
AWD,SUV,0km,auto
and I need to get the km number or the Drivetrain which is AWD, what do we do after reading the file?
here's how I'm reading the file
def getWord(fileName)
  file = fileName
  File.readlines(file).each do |line|
    puts line
  end
end
This is a CSV (Comma Separated Value) file. You can just split on the comma and take the fields.
# chomp: true removes the trailing newline
File.readlines(fileName, chomp: true).each do |line|
(drivetrain,type,mileage,transmission) = line.split(',')
puts drivetrain
puts mileage
end
But CSV files can get more complex. For example, a value that contains a comma has to be quoted, as in 1997,Ford,E350,"Super, luxurious truck". To handle all the possibilities, use the CSV class to parse each line.
require 'csv'
# the file has no header row, so supply the column names ourselves
headers = [:drivetrain, :type, :mileage, :transmission]
CSV.foreach(fileName, headers: headers) do |row|
  puts row[:drivetrain]
  puts row[:mileage]
end
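The difference between the two approaches shows up as soon as a value contains a comma. The sketch below compares them on the quoted sample line from the text; the helper names naive_fields and csv_fields are made up for illustration.

```ruby
require 'csv'

# Sample line taken from the text above.
LINE = '1997,Ford,E350,"Super, luxurious truck"'

# Naive approach: split on every comma, including the quoted one.
def naive_fields(line)
  line.split(',')
end

# CSV-aware approach: CSV.parse_line honours the quotes.
def csv_fields(line)
  CSV.parse_line(line)
end

p naive_fields(LINE).length # 5 pieces: the quoted value is cut in two
p csv_fields(LINE)          # 4 fields, the last one keeps its comma
```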
I have a text file with structured text which I wish to convert to a csv file.
The file looks something like:
name: Seamus
address: 123 Strand Avenue

name: Seana
address: 126 Strand Avenue
I would like it to look like:
|name | address
______________________________
|Seamus | 123 Strand Avenue
______________________________
|Seana | 126 Strand Avenue
So I understand that I need to do something like:
create a csv file
create the column names
read the text file
for each row of the text file starting with 'name', assign the following text to the 'name' column; for each row starting with 'address', assign the value to the 'address' column, etc.
But I don't know how to do so.
I would appreciate any pointers people could provide.
The solution starts by identifying how to parse the text file. In this specific case what separates the "records" in the text file is an empty line.
First step would be importing the file contents:
string_content = File.read("path/to/my_file.txt")
# => "name: Seamus\naddress: 123 Strand Avenue\n\nname: Seana\naddress: 126 Strand Avenue\n"
Then you would need to separate the records. As you can see when parsing the file the empty line is a line that only contains \n, so the \n from the line above plus the one on the empty line make \n\n. That is what you need to look for to separate the records:
string_records = string_content.split("\n\n")
# => ["name: Seamus\naddress: 123 Strand Avenue", "name: Seana\naddress: 126 Strand Avenue\n"]
And then, once you have the strings with the records, it is just a matter of splitting by \n again to separate the fields:
records_by_field = string_records.map do |string_record|
string_record.split("\n")
end
# => [["name: Seamus", "address: 123 Strand Avenue"], ["name: Seana", "address: 126 Strand Avenue"]]
Once that is separated you need to split each field on the first : to separate field_name and value:
data = records_by_field.map do |record|
  record.each_with_object({}) do |field, new_record|
    # a limit of 2 keeps any later colons inside the value
    field_name, field_value = field.split(":", 2)
    new_record[field_name] = field_value.strip # don't forget to get rid of the initial space with String#strip
  end
end
# => [{"name"=>"Seamus", "address"=>"123 Strand Avenue"}, {"name"=>"Seana", "address"=>"126 Strand Avenue"}]
And there you have it! An array of hashes with the correct key-value pairs.
Now from that you can create a CSV or just use it to give it any other format you may want.
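One edge case worth keeping in mind is a value that itself contains a colon, such as a time. The sketch below folds the parsing steps into a single method and splits each field only on its first colon. The method name parse_records and the "opens" field are invented for illustration; they are not part of the question's data.

```ruby
# Hypothetical helper: turns "key: value" blocks separated by blank
# lines into an array of hashes.
def parse_records(text)
  text.split(/\n{2,}/).map do |block|
    block.split("\n").each_with_object({}) do |field, record|
      # The limit of 2 splits only on the first colon.
      key, value = field.split(":", 2)
      record[key.strip] = value.to_s.strip
    end
  end
end

# Made-up input; the "opens" field shows a colon inside a value.
text = "name: Seamus\nopens: 09:30\n\nname: Seana\nopens: 10:00\n"
p parse_records(text)
```

Without the limit argument, "opens: 09:30" would be cut into three pieces and the minutes would be lost.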
To resolve your specific CSV question:
require 'csv'
# first you need to get your column headers, which will be the keys of any of the hashes, the first will do
column_names = data.first.keys
CSV.open("output_file.csv", "wb") do |csv|
# first we add the headers
csv << column_names
# for each data row we create an array with values ordered as the column_names
data.each do |data_hash|
    csv << data_hash.values_at(*column_names)
end
end
That will create output_file.csv in the directory from which you run your Ruby script.
And that's it!
Let's construct the file.
str =<<~END
  name: Seamus
  address: 123 Strand Avenue

  name: Seana
  address: 126 Strand Avenue


  address: 221B Baker Street
  name: Sherlock
END
Notice that I've added a third record that has the order of the "name" and "address" lines reversed, and it is preceded by an extra blank line.
in_file = 'temp.txt'
File.write(in_file, str)
#=> 124
The first step is to obtain the headers for the CSV file:
headers = []
f = File.open(in_file)
loop do
header = f.gets[/[^:]+(?=:)/]
break if header.nil?
headers << header
end
f.close
headers
#=> ["name", "address"]
Notice that the number of headers (two in the example) is arbitrary.
See IO#gets. The regular expression reads, "match one or more characters other than a colon, immediately followed by a colon" ((?=:) being a positive lookahead).
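To see how that pattern behaves on individual lines, here is a small sketch. The constant HEADER_RE and the helper header_of are names introduced only for this illustration.

```ruby
# The header-extracting pattern from the text above.
HEADER_RE = /[^:]+(?=:)/

# Hypothetical helper: returns the text before the first colon,
# or nil when the line has no colon (for example, a blank line).
def header_of(line)
  line[HEADER_RE]
end

p header_of("name: Seamus\n")
p header_of("address: 123 Strand Avenue\n")
p header_of("\n") # nil, which is what ends the loop above
```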
If in_file is not exceedingly large it's easiest to first read that file into an array of hashes. The first step is to read the file into a string and then split the string on contiguous lines that contain nothing other than newlines and spaces:
arr = File.read(in_file).chomp.split(/\n\s*\n/)
#=> ["name: Seamus\naddress: 123 Strand Avenue",
# "name: Seana\naddress: 126 Strand Avenue",
# "address: 221B Baker Street\nname: Sherlock"]
We can now convert each element of this array to a hash:
arr = File.read(in_file).split(/\n\s*\n/).
map do |s|
s.split("\n").
each_with_object({}) do |p,h|
key, value = p.split(/: +/)
h[key] = value
end
end
#=> [{"name"=>"Seamus", "address"=>"123 Strand Avenue"},
# {"name"=>"Seana", "address"=>"126 Strand Avenue"},
# {"address"=>"221B Baker Street", "name"=>"Sherlock"}]
We are now ready to construct the CSV file:
out_file = 'temp.csv'
require 'csv'
CSV.open(out_file, 'w') do |csv|
csv << headers
arr.each { |h| csv << h.values_at(*headers) }
end
Let's see what was written:
puts File.read(out_file)
name,address
Seamus,123 Strand Avenue
Seana,126 Strand Avenue
Sherlock,221B Baker Street
See CSV::open and Hash#values_at.
This is not the format specified in the question. In fact, a file with that format would not be a valid CSV file, because there is no consistent column separator. For example, the first line, '|name | address', and the second line, '|Seamus | 123 Strand Avenue', pad the pipe with different numbers of spaces, so their column separators differ. Moreover, even if they were the same, the pipe at the beginning of each line would become the first character of the name.
We could change the column separator to a pipe (rather than a comma, the default) by writing CSV.open(out_file, 'w', col_sep: '|'). A common mistake in constructing CSV files is to surround the column separator with one or more spaces. That invariably leads to boo-boos.
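A short sketch of the separator advice, using CSV.generate so that nothing is written to disk. The helper name to_pipe_separated is made up for illustration.

```ruby
require 'csv'

# Hypothetical helper: serialises rows using a bare pipe as the separator.
def to_pipe_separated(rows)
  CSV.generate(col_sep: '|') do |csv|
    rows.each { |row| csv << row }
  end
end

rows = [%w[name address], ['Seamus', '123 Strand Avenue']]
text = to_pipe_separated(rows)
puts text

# Reading it back with the same separator returns the original rows.
p CSV.parse(text, col_sep: '|') == rows

# With spaces around the pipe, the padding ends up inside the values.
p CSV.parse_line('Seamus | 123 Strand Avenue', col_sep: '|')
```

The last call shows the padding problem: the name comes back with a trailing space and the address with a leading one.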
I'm new in Ruby.
Here is the script. I would like to use the selector in line 10 instead of fields[0] etc.
How can I do that?
For the example, the data are embedded.
Don't hesitate to correct me if I'm doing wrong when I'm opening or writing a file or anything else, I like to learn.
#!/usr/bin/ruby
filename = "/tmp/log.csv"
selector = [0, 3, 5, 7]
out = File.open(filename + ".rb.txt", "w")
DATA.each_line do |line|
fields = line.split("|")
columns = fields[0], fields[3], fields[5], fields[7]
puts columns.join("|")
out.puts(columns.join("|"))
end
out.close
__END__
20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK
20180704150931|api|456456546|30130|234|VC3|VC300179201139.ts|172.30.16.138|192271838|VERB|05540000|23404|414|server01|OK
20180704150931|api|465456786|30154|443|BAD|BAD004416550.ts|172.30.16.50|280212202|VERB|04740000|44301|18|server01|OK
20180704150931|api|5437863735|30157|383|VSS|VSS0011062009.ts|172.30.16.66|312727922|VERB|05700000|38303|381|server01|OK
20180704150931|api|3453432|30215|223|VAE|VAE00TF548197.ts|172.30.16.74|114127126|VERB|05060000|22305|35|server01|OK
20180704150931|api|312121|30044|487|BOV|BOVVAE00549424.ts|172.30.16.58|69139448|VERB|05300000|48708|131|server01|OK
20180704150931|rtsp|453432123|30127|203|GZD|GZD0900032066.ts|172.30.16.58|83164150|VERB|05460000|20303|793|server01|OK
20180704150932|api|12345348|30154|465|TYH|TYH0011224259.ts|172.30.16.50|279556843|VERB|04900000|46503|241|server01|OK
20180704150932|api|4343212312|30154|326|VAE|VAE00TF548637.ts|172.30.16.3|28966797|VERB|04740000|32601|969|server01|OK
20180704150932|api|312175665|64530|305|TTT|TTT000000011852.ts|172.30.16.98|47868183|VERB|04740000|30501|275|server01|OK
You can get fields at specific indices using Ruby's splat operator and Array#values_at, like so:
columns = fields.values_at(*selector)
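As a self-contained sketch of that one-liner, the snippet below applies it to the first data line from the question. The method name pick_fields and the constant SELECTOR are names chosen for this illustration.

```ruby
SELECTOR = [0, 3, 5, 7].freeze

# Hypothetical helper: keeps only the fields whose indices are listed
# in selector and joins them with the same separator again.
def pick_fields(line, selector = SELECTOR, sep = "|")
  line.chomp.split(sep).values_at(*selector).join(sep)
end

line = "20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK\n"
puts pick_fields(line)
# prints 20180704150930|30193|KLM|172.30.16.34
```

Note that values_at returns nil for an index past the end of the array, and join renders nil as an empty string, so a short line produces empty fields rather than an error.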
A couple of coding style suggestions:
1. You may want to make selector a constant, since it's unlikely that you'll want to mutate it further down in your code base.
2. The File.open, out.puts and out.close calls can all be condensed into a CSV.open block:
require 'csv'
CSV.open(filename + ".rb.txt", 'wb') do |csv|
  DATA.each_line do |line|
    # each row appended to csv must be an array of fields
    csv << line.chomp.split("|").values_at(*selector)
  end
end
You can also specify a custom delimiter (pipe | in your case) like so:
...
CSV.open(filename, 'wb', col_sep: '|') do |csv|
...
Let's begin with a more manageable example. First note that if your string is held by the variable data, each line of the string contains the same number (14) of vertical bars ('|'). Let's reduce that to the first 4 lines of data, with each line terminated immediately before the 6th vertical bar:
str = data.each_line.map { |line| line.split("|").first(6).join("|") }.first(4).join("\n")
puts str
20180704150930|rtsp|645645643|30193|211|KLM
20180704150931|api|456456546|30130|234|VC3
20180704150931|api|465456786|30154|443|BAD
20180704150931|api|5437863735|30157|383|VSS
We need to also modify selector (arbitrarily):
selector = [0, 3, 4]
Now on to answering the question.
There is no need to divide the string into lines, split each line on the vertical bars, select the elements of interest from the resulting array, join the latter with a vertical bar and then lastly join the whole shootin' match with a newline (whew!). Instead, simply use String#gsub to remove all unwanted characters from the string.
terms_per_row = str.each_line.first.count('|') + 1
#=> 6
r = /
(?:^|\|) # match the beginning of a line or a vertical bar in a non-capture group
[^|\n]+ # match one or more characters other than a vertical bar or newline
/x # free-spacing regex definition mode
line_idx = -1
new_str = str.gsub(r) do |s|
line_idx += 1
selector.include?(line_idx % terms_per_row) ? s : ''
end
puts new_str
20180704150930|30193|211
20180704150931|30130|234
20180704150931|30154|443
20180704150931|30157|383
Lastly, we write new_str to file:
File.write(fname, new_str)
Below is the input file that I want to store into a hash table, sort it and output in the format shown below.
Input File
Name=Ashok, Email=ashok85#gmail.com, Country=India, Comments=9898984512
Email=raju#hotmail.com, Country=Sri Lanka, Name=Raju
Country=India, Comments=45535878, Email=vijay#gmail.com, Name=Vijay
Name=Ashok, Country=India, Email=ashok37#live.com, Comments=8898788987
Output File (Sorted by Name)
Name Email Country Comments
-------------------------------------------------------
Ashok ashok37#live.com India 8898788987
Ashok ashok85#gmail.com India 9898984512
Raju raju#hotmail.com Sri Lanka
Vijay vijay#gmail.com India 45535878
So far, I have read the data from the file and stored every line into an array, but I am stuck at hash[key]=>value
file_data = {}
File.open('input.txt', 'r') do |file|
file.each_line do |line|
line_data = line.split('=')
file_data[line_data[0]] = line_data[1]
end
end
puts file_data
Given that each line in your input file is a series of key=value strings separated by commas, you need to split the line on the commas first, and then split each piece on the equals sign. Here is a corrected version of the code:
# Need to collect parsed data from each line into an array
array_of_file_data = []
File.open('input.txt', 'r') do |file|
file.each_line do |line|
#create a hash to collect data from each line
file_data = {}
# First split by comma
pairs = line.chomp.split(", ")
pairs.each do |p|
#Split by = to separate out key and value
key_value = p.split('=')
file_data[key_value[0]] = key_value[1]
end
array_of_file_data << file_data
end
end
puts array_of_file_data
The code above will print:
{"Name"=>"Ashok", "Email"=>"ashok85#gmail.com", "Country"=>"India", "Comments"=>"9898984512"}
{"Email"=>"raju#hotmail.com", "Country"=>"Sri Lanka", "Name"=>"Raju"}
{"Country"=>"India", "Comments"=>"45535878", "Email"=>"vijay#gmail.com", "Name"=>"Vijay"}
{"Name"=>"Ashok", "Country"=>"India", "Email"=>"ashok37#live.com", "Comments"=>"8898788987"}
A more complete version of the program is given below.
hash_array = []
# Parse the lines and store it in hash array
File.open("sample.txt", "r") do |f|
f.each_line do |line|
# Splits are done around , and = preceded or followed
# by any number of white spaces
splits = line.chomp.split(/\s*,\s*/).map{|p| p.split(/\s*=\s*/)}
# to_h converts an array of two-element [key, value] arrays
# into a hash
hash_array << splits.to_h
end
end
# Sort the array of hashes
hash_array = hash_array.sort {|i, j| i["Name"] <=> j["Name"]}
# Print the output, more tricks needed to get it better formatted
header = ["Name", "Email", "Country", "Comments"]
puts header.join(" ")
hash_array.each do |h|
puts h.values_at(*header).join(" ")
end
The program above outputs:
Name Email Country Comments
Ashok ashok85#gmail.com India 9898984512
Ashok ashok37#live.com India 8898788987
Raju raju#hotmail.com Sri Lanka
Vijay vijay#gmail.com India 45535878
For better-formatted tabular output, you may want to refer to "Padding printed output of tabular data".
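As a rough sketch of such padding, the snippet below sizes each column to its longest cell and left-justifies the values with String#ljust. It also sorts by Name and then Email, which is an assumption made here to reproduce the order shown in the question's expected output. The method name format_table and the two-space column gap are likewise choices made for this illustration.

```ruby
HEADER = %w[Name Email Country Comments].freeze

# Hypothetical helper: returns the table as an array of padded lines.
def format_table(rows, header = HEADER)
  # Column width = longest cell in that column, header included.
  widths = header.map do |col|
    ([col] + rows.map { |r| r[col].to_s }).map(&:length).max
  end
  # Left-justify every cell to its column width; missing cells become "".
  pad = lambda do |cells|
    cells.each_with_index.map { |c, i| c.to_s.ljust(widths[i]) }.join("  ").rstrip
  end
  # Assumption: ties on Name are broken by Email.
  sorted = rows.sort_by { |r| [r["Name"].to_s, r["Email"].to_s] }
  [pad.call(header), "-" * (widths.sum + 2 * (header.size - 1))] +
    sorted.map { |r| pad.call(r.values_at(*header)) }
end

rows = [
  { "Name" => "Ashok", "Email" => "ashok85#gmail.com", "Country" => "India", "Comments" => "9898984512" },
  { "Email" => "raju#hotmail.com", "Country" => "Sri Lanka", "Name" => "Raju" },
  { "Name" => "Ashok", "Country" => "India", "Email" => "ashok37#live.com", "Comments" => "8898788987" }
]
puts format_table(rows)
```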
Here is what my CSV looks like: http://tinypic.com/r/kuwk6/5
And here is my code:
File.open("/Users/Katie/Downloads/File_Name.csv", encoding: "ISO-8859-1").each_line do |line|
line.chomp!
CSV.parse(line, col_sep: "\t") do |row|
unless row[4].nil?
puts row[4].split("&Wt.srch=1")[0]
end
end
end
I had issues with special characters, which is why I have the encoding in there. Because I'm on a Mac, when I open a CSV in Excel it does something weird to the rows, so I put in the line.chomp!. The file is technically tab-delimited, so I used col_sep for the tabs.
Basically I want the URL to be split at the "&Wt.srch=1" but I only want to have the first part of the string returned after it splits them, which is why I put the [0].
When I run the code without the "unless" line, it says block (2 levels) in <main>': undefined method `split' for nil:NilClass (NoMethodError)
This makes me think that it thinks this column is empty, when in fact, it's not. But of course when I put in the "unless" line, it runs the script just fine, but doesn't actually split the url string.
Sorry if this is a really basic / easy problem... Thanks in advance for your help!
You don't need CSV.parse to do this.
With tabs:
File:
c1 c2 c3 c4 c5
Hello Alpha Example More https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Thanks Bravo Example some https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Blah Charlie Example stuff https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Script:
#returns each_line of the csv file as a string
File.open("/Users/Katie/Downloads/File_Name.csv").each_line do |line|
#splits the line at tab character into row Array
row = line.chomp.split("\t")
unless row[4].nil?
puts row[4].split("&Wt.srch=1")[0]
end
end
Output:
c5
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
With Commas:
File:
c1,c2,c3,c4,c5
Hello,Alpha,Example,More,https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Thanks,Bravo,Example,some,https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Blah,Charlie,Example,stuff,https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Script:
#returns each_line of the csv file as a string
File.open("/Users/Katie/Downloads/File_Name.csv").each_line do |line|
#splits the line at each comma into row Array
row = line.chomp.split(",")
unless row[4].nil?
puts row[4].split("&Wt.srch=1")[0]
end
end
Output:
c5
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
Script to handle the use of encoding with "ISO-8859-1":
File.open("/Users/Katie/Downloads/File_Name.csv", encoding: "ISO-8859-1").each_line do |line|
#splits the line at tab character into row Array
row = line.chomp.split("\t").delete_if{|r| r.strip.empty?}
unless row[4].nil?
puts row[4].split("&Wt.srch=1")[0]
end
end
The way you have it set up, you loop through the lines and pass each one to CSV.parse, which yields every row as an array of cells. row[4] is therefore nil for any line with fewer than five tab-separated cells, which suggests the lines are not being split where you expect.
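For comparison, here is a sketch that hands the whole text to CSV.parse in one call, assuming the data really is tab-delimited and contains no quote characters. The method name urls_before_marker and the inline sample are made up for illustration.

```ruby
require 'csv'

# Hypothetical helper: returns the part of each URL before the marker.
# Rows that have no cell in the given column are skipped.
def urls_before_marker(tsv_text, marker = "&Wt.srch=1", column = 4)
  CSV.parse(tsv_text, col_sep: "\t").map do |row|
    cell = row[column]
    cell.nil? ? nil : cell.split(marker).first
  end.compact
end

# Made-up sample modelled on the example file above.
sample = "c1\tc2\tc3\tc4\tc5\n" \
         "Hello\tAlpha\tExample\tMore\thttps://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah\n" \
         "short\trow\n"
puts urls_before_marker(sample)
```

The header cell c5 passes through unchanged because it does not contain the marker; the short row is dropped because it has no fifth cell.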