I want to remove duplicate lines from a text file. For example:
1.aabba
2.abaab
3.aabba
4.aabba
After running:
1.aabba
2.abaab
What I've tried so far:
lines = File.readlines("input.txt")
lines = File.read('/path/to/file')
lines.split("\n").uniq.join("\n")
Let's construct a file.
fname = 't'
IO.write fname, <<~END
dog
cat
dog
pig
cat
END
#=> 20
See IO::write. First let's suppose you simply want to read the unique lines into an array.
If, as here, the file is not excessively large, you can write:
arr = IO.readlines(fname, chomp: true).uniq
#=> ["dog", "cat", "pig"]
See IO::readlines. chomp: true removes the newline character at the end of each line.
If you wish to then write that array to another file:
fname_out = 'tt'
IO.write(fname_out, arr.join("\n") << "\n")
#=> 12
or
File.open(fname_out, 'w') do |f|
  arr.each { |line| f.puts line }
end
If you wish to overwrite fname instead, write the unique lines to a new file, delete the original file and then rename the new file to fname.
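For example, a sketch of that overwrite pattern (the temporary-file name fname_tmp is my own choice, not from the original):

```ruby
fname = 't'
IO.write(fname, "dog\ncat\ndog\npig\ncat\n")

# Write the unique lines to a temporary file, then replace the
# original with File.rename (atomic when both are on the same filesystem).
fname_tmp = "#{fname}.tmp"
IO.write(fname_tmp, IO.readlines(fname, chomp: true).uniq.join("\n") << "\n")
File.rename(fname_tmp, fname)

IO.read(fname)
#=> "dog\ncat\npig\n"
```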
If the file is so large it cannot be held in memory and there are many duplicate lines, you might be able to do the following.
require 'set'
st = IO.foreach(fname, chomp: true).with_object(Set.new) do |line, st|
  st.add(line)
end
#=> #<Set: {"dog", "cat", "pig"}>
See IO::foreach.
If you wish to simply write the contents of this set to a file, you can execute:
File.open(fname_out, 'w') do |f|
  st.each { |s| f.puts(s) }
end
If instead you need to convert the set to an array:
st.to_a
#=> ["dog", "cat", "pig"]
This assumes you have enough memory to hold both st and st.to_a. If not, you could write:
st.size.times.with_object([]) do |_, a|
  s = st.first
  a << s
  st.delete(s)
end
#=> ["dog", "cat", "pig"]
If you don't have enough memory to even hold st you will need to read your file (line-by-line) into a database and then use database operations.
If you wish to write the file with the duplicates skipped, and the file is very large, you may do the following, albeit with a tiny risk of including one or more duplicates (see the caveat below).
require 'set'
line_map = IO.foreach(fname, chomp: true).with_object({}) do |line, h|
  hsh = line.hash
  h[hsh] = $. unless h.key?(hsh)
end
#=> {3393575068349183629=>1, -4358860729541388342=>2,
# -176447925574512206=>4}
$. is the 1-based number of the line just read. See String#hash. Since the number of distinct values returned by that method is finite and the number of possible strings is infinite, two distinct strings could have the same hash value, in which case a non-duplicate line would be wrongly skipped.
Then (assuming line_map is not empty):
lines_to_keep = line_map.values
File.open(fname_out, 'w') do |fout|
  IO.foreach(fname, chomp: true) do |line|
    if lines_to_keep.first == $.
      fout.puts(line)
      lines_to_keep.shift
    end
  end
end
Let's see what we've written:
puts File.read(fname_out)
dog
cat
pig
See File::open.
Incidentally, for IO class methods m (including read, write, readlines and foreach), you may see IO.m written as File.m. That's permissible because File is a subclass of IO and therefore inherits IO's class methods. That does not apply to my use of File::open, however, as File::open is a distinct method from IO::open.
Set only stores unique elements, so:
require 'set'

s = Set.new
while line = gets
  s << line.strip
end
s.each { |unique_elt| puts unique_elt }
You can run this with any input file using < input.txt on the command-line rather than hardwiring the file name into your program.
Note that Set is based on Hash, and the documentation states "Hashes enumerate their values in the order that the corresponding keys were inserted", so this will preserve the order of entry.
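A quick demonstration of that insertion-order guarantee:

```ruby
require 'set'

# Duplicates are dropped; the first occurrence of each element keeps its slot.
s = Set.new
%w[dog cat dog pig cat].each { |w| s << w }
s.to_a
#=> ["dog", "cat", "pig"]
```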
You can continue your idea with uniq. With a block, uniq compares the block's results and removes duplicates. For example, suppose input.txt has this content:
1.aabba
2.abaab
3.aabba
4.aabba
puts File.readlines('input.txt', chomp: true).
       uniq { |line| line.sub(/\A\d+\./, '') }.
       join("\n")
# will print
# 1.aabba
# 2.abaab
Here String#sub deletes the list numbers, but you can use other methods, for example line[2..-1].
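For instance, the slice variant gives the same result on the sample lines:

```ruby
lines = ["1.aabba", "2.abaab", "3.aabba", "4.aabba"]
# line[2..-1] drops the leading "N." list number before comparing
lines.uniq { |line| line[2..-1] }
#=> ["1.aabba", "2.abaab"]
```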
Below is the input file that I want to store in a hash table, sort, and output in the format shown below.
Input File
Name=Ashok, Email=ashok85#gmail.com, Country=India, Comments=9898984512
Email=raju#hotmail.com, Country=Sri Lanka, Name=Raju
Country=India, Comments=45535878, Email=vijay#gmail.com, Name=Vijay
Name=Ashok, Country=India, Email=ashok37#live.com, Comments=8898788987
Output File (Sorted by Name)
Name Email Country Comments
-------------------------------------------------------
Ashok ashok37#live.com India 8898788987
Ashok ashok85#gmail.com India 9898984512
Raju raju#hotmail.com Sri Lanka
Vijay vijay#gmail.com India 45535878
So far, I have read the data from the file and stored every line in an array, but I am stuck at building the hash[key] => value pairs.
file_data = {}
File.open('input.txt', 'r') do |file|
  file.each_line do |line|
    line_data = line.split('=')
    file_data[line_data[0]] = line_data[1]
  end
end
puts file_data
puts file_data
Given that each line in your input file has a pattern of key=value strings separated by commas, you need to split each line first around the commas, and then around the equals signs. Here is a corrected version of the code:
# Need to collect parsed data from each line into an array
array_of_file_data = []
File.open('input.txt', 'r') do |file|
  file.each_line do |line|
    # Create a hash to collect data from each line
    file_data = {}
    # First split by comma
    pairs = line.chomp.split(", ")
    pairs.each do |p|
      # Split by = to separate out key and value
      key_value = p.split('=')
      file_data[key_value[0]] = key_value[1]
    end
    array_of_file_data << file_data
  end
end
puts array_of_file_data
The above code will print:
{"Name"=>"Ashok", "Email"=>"ashok85#gmail.com", "Country"=>"India", "Comments"=>"9898984512"}
{"Email"=>"raju#hotmail.com", "Country"=>"Sri Lanka", "Name"=>"Raju"}
{"Country"=>"India", "Comments"=>"45535878", "Email"=>"vijay#gmail.com", "Name"=>"Vijay"}
{"Name"=>"Ashok", "Country"=>"India", "Email"=>"ashok37#live.com", "Comments"=>"8898788987"}
A more complete version of the program is given below.
hash_array = []
# Parse the lines and store them in an array of hashes
File.open("sample.txt", "r") do |f|
  f.each_line do |line|
    # Splits are done around , and = preceded or followed
    # by any number of white spaces
    splits = line.chomp.split(/\s*,\s*/).map { |p| p.split(/\s*=\s*/) }
    # to_h converts an array of two-element arrays into a hash,
    # treating each sub-array as a key-value pair
    hash_array << splits.to_h
  end
end
# Sort the array of hashes by name
hash_array = hash_array.sort { |i, j| i["Name"] <=> j["Name"] }
# Print the output; more tricks are needed to format it better
header = ["Name", "Email", "Country", "Comments"]
puts header.join(" ")
hash_array.each do |h|
  puts h.values_at(*header).join(" ")
end
The above program outputs:
Name Email Country Comments
Ashok ashok85#gmail.com India 9898984512
Ashok ashok37#live.com India 8898788987
Raju raju#hotmail.com Sri Lanka
Vijay vijay#gmail.com India 45535878
You may want to refer to Padding printed output of tabular data for better-formatted tabular output.
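For instance, fixed-width columns can be produced with a format string (a sketch; the column widths here are arbitrary choices):

```ruby
header = ["Name", "Email", "Country", "Comments"]
rows = [
  ["Ashok", "ashok37#live.com", "India", "8898788987"],
  ["Raju", "raju#hotmail.com", "Sri Lanka", nil]
]
# %-Ns left-justifies each field within an N-character column
fmt = "%-8s %-20s %-12s %-12s"
puts fmt % header
rows.each { |r| puts fmt % r.map(&:to_s) }  # to_s turns nil into ""
```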
I have to work with a CSV file that is not directly usable for my need of generating a simple chart. I need to manipulate it into something "cleaner", and I'm running into issues; I'm also unsure whether my overall strategy is correct, as I'm just learning to parse files with Ruby. My issues mainly relate to looking for data that is offset from where I have (or haven't) found matches. After I find a line that meets the criteria, I need to read info from the line two lines after it and manipulate some of it (move something from the last column to the second).
Here's the original csv file:
component
quantity header,design1,design2,design3,Ref,Units
quantity type,#,#,#,ref#,unit value
component
quantity header,design1,design2,design3,Ref,Units
quantity type,#,#,#,ref#,unit value
component
quantity header,design1,design2,design3,Ref,Units
quantity type,#,#,#,ref#,unit value
Desired output:
Component Header,Quantity type Header,Units Header,design1 header,design2 header,design3 header,Ref header
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
My Ruby script at the moment:
require 'csv'

f = File.new("sp.csv")
o = CSV.open('output.csv', 'w')
f.each_line do |l| # iterate through each line
  data = l.split
  if l !~ /,/ # if the line does not contain a comma, it is a component
    o << [data, f.gets] # start writing data; f.gets skips the next line, but I need to skip 2 and split the line to manipulate columns
  else
    o << ['comma'] # just me testing that I can find lines with commas
  end
end
f.gets skips the next line, and the documentation isn't clear to me on how to use it to skip 2. After that I THINK I can split that line by commas and manipulate the row data with array[column]. Aside from this offset issue, I'm also unsure whether my general approach is a good strategy.
EDIT
Here are some lines from the real file. I'll work through the answers provided and see if I can make it all work. The idea I've had is to read and write line by line, vs. converting the whole file to an array and then reading and writing. My thought is that when these files get big, and they do, it'll take less memory doing it line by line.
THANKS for the help, I'll work through answers and get back to you.
DCB
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,82.915,69.226,78.35,78.383,86.6,85.763,N/A,Celsius
RCB
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,76.557,68.779,74.705,74.739,80.22,79.397,N/A,Celsius
Antenna
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,69.988,65.045,69.203,69.238,73.567,72.777,N/A,Celsius
PCBA_fiberTray
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,66.651,65.904,66.513,66.551,72.516,70.47,N/A,Celsius
EDIT 2
Using some regexp from answers below I developed a line by line strategy to parse through this. I'll post it as an answer for completeness.
Thanks for helping out and exposing me to methods to develop a solution
How about slicing it into groups of 3 lines:
o = CSV.open('output.csv', 'w') # the output file, as opened in the question
File.read("sp.csv").split("\n").each_slice(3) do |slice|
  o << [slice[0], *slice[2].split(',')]
end
I created a CSV file based on the sample, called "test.csv".
Starting with this code:
data = File.readlines('test.csv').slice_before(/^component/)
I get an enumerator back. If I look at the data that enumerator will return I get:
pp data.to_a
[["component\n",
"quantity header,design1,design2,design3,Ref,Units\n",
"quantity type,#,#,#,ref#,unit value\n"],
["component\n",
"quantity header,design1,design2,design3,Ref,Units\n",
"quantity type,#,#,#,ref#,unit value\n"],
["component\n",
"quantity header,design1,design2,design3,Ref,Units\n",
"quantity type,#,#,#,ref#,unit value\n"]]
That's an array of arrays, broken into sub-arrays on the "component" line. I suspect the values aren't reflecting reality, but without a more accurate sample... well, GIGO.
If the "component" line isn't really a bunch of repeating "component" lines, and doesn't have any commas, you can use this instead:
data = File.readlines('test.csv').slice_before(/\A[^,]+\Z/)
or:
data = File.readlines('test.csv').slice_before(/^[^,]+$/)
The result will be the same with the current samples.
If you need a more complex regex you can substitute that, for instance:
/^(?:#{ Regexp.union(%w[component1 component2]).source })$/i
Which returns a pattern that will find any words in the %w[] array:
/^(?:component1|component2)$/i
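For example, applying the same idea to the component names from the real data (a quick check of the pattern; Regexp#match? requires Ruby 2.4+):

```ruby
# Build an anchored, case-insensitive pattern from a word list
pat = /^(?:#{ Regexp.union(%w[DCB RCB Antenna]).source })$/i
pat.match?("DCB")              #=> true
pat.match?("antenna")          #=> true (the /i flag makes it case-insensitive)
pat.match?("Avg Temperature")  #=> false
```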
From there we can walk the data array and clean out all the extraneous headers using:
data.map{ |a| a[2..-1] }.flatten
Which returns something like:
[
"quantity type,#,#,#,ref#,unit value\n",
"quantity type,#,#,#,ref#,unit value\n",
"quantity type,#,#,#,ref#,unit value\n"
]
That can be iterated and passed to CSV to be parsed into arrays if needed:
data.map{ |a| a[2..-1].map{ |r| CSV.parse(r) }.flatten }
[
["quantity type", "#", "#", "#", "ref#", "unit value"],
["quantity type", "#", "#", "#", "ref#", "unit value"],
["quantity type", "#", "#", "#", "ref#", "unit value"]
]
That's all background to get you thinking how you can tear apart the CSV data.
Using this code:
data.flat_map { |ary|
  component = ary[0].strip
  ary[2..-1].map { |a|
    data = CSV.parse(a).flatten
    [
      component,
      data.shift,
      data.pop,
      *data[0..-2]
    ]
  }
}
Returns:
[
["component", "quantity type", "unit value", "#", "#", "#"],
["component", "quantity type", "unit value", "#", "#", "#"],
["component", "quantity type", "unit value", "#", "#", "#"]
]
The only thing left to do is create the header you want to use, and pass the returned data back into CSV to let it generate the output file. You should be able to get there from here using the CSV documentation.
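A minimal sketch of that last step, assuming the rows are already in the shape shown above (the header names here are my guesses based on the desired output):

```ruby
require 'csv'

header = ["Component", "Quantity type", "Units", "design1", "design2", "design3"]
rows = [
  ["component", "quantity type", "unit value", "#", "#", "#"],
  ["component", "quantity type", "unit value", "#", "#", "#"]
]
# CSV handles quoting and joining; each << emits one CSV record
CSV.open('output.csv', 'w') do |csv|
  csv << header
  rows.each { |r| csv << r }
end
```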
Edit:
Based on the actual data, here's a version of the code with a minor tweak, along with its output:
require 'csv'
require 'pp'
data = File.readlines('test.csv').slice_before(/^[^,]+$/)
pp data.flat_map { |ary|
  component = ary[0].strip
  ary[2..-1].map { |a|
    record = CSV.parse(a).flatten
    [
      component,
      record.shift,
      record.pop,
      *record[0..-2]
    ]
  }
}
Which looks like:
[["DCB",
"Avg Temperature",
"Celsius",
"82.915",
"69.226",
"78.35",
"78.383",
"86.6",
"85.763"],
["RCB",
"Avg Temperature",
"Celsius",
"76.557",
"68.779",
"74.705",
"74.739",
"80.22",
"79.397"],
["Antenna",
"Avg Temperature",
"Celsius",
"69.988",
"65.045",
"69.203",
"69.238",
"73.567",
"72.777"],
["PCBA_fiberTray",
"Avg Temperature",
"Celsius",
"66.651",
"65.904",
"66.513",
"66.551",
"72.516",
"70.47"]]
Here's the code I'm using, which creates a CSV file with everything manipulated. Thanks to those who contributed some help.
require 'csv'

file_in = File.new('sp1.csv')
file_out = CSV.open('output.csv', 'w')
header = []
row = []
file_in.each_line do |line|
  case line
  when /^[^,]+$/ # found a component (a line with no comma)
    comp_header = file_in.gets.split(',') # the header follows the component line; split it into an array
    if header.empty? # write the output header only once
      header.push("Component", comp_header[0], comp_header[-1].strip)
      comp_header[1..-3].each do |h|
        header.push(h)
      end
      file_out << header
    end
    @comp = line.to_s.strip # remember the current component name
    next
  when /,/ # a data row (contains commas)
    vals = line.split(',') # split up into a vals array
    row.push(@comp, vals[0], vals[-1].strip) # add component, quantity and unit to the row
    vals[1..-3].each do |v| # values, excluding quantity, units and reference info
      row.push(v)
    end
  end
  file_out << row # write the current row to the csv file
  row = [] # reset the row array to move on to the next row
end