I have an input text file "input.txt" that looks like this:
Country Code ID QTY
FR B000X2D 75 130
FR B000X2E 75 150
How do I extract the first, second, and third strings from each line?
This code reads each whole line into one element of an array:
f = File.open("input.txt", "r")
line_array = []
f.each_line { |line| line_array << line }
f.close
puts line_array[1]
Which outputs:
FR B000X2D 75 130
Furthermore, how can I split one line into multiple lines based on the quantity, with a maximum quantity of 50 per line,
so that the output is:
FR B000X2D 75 50
FR B000X2D 75 50
FR B000X2D 75 30
If this is space-delimited, it should be pretty easy to split things up:
File.readlines('input.txt').map do |line|
  country, code, id, qty = line.chomp.split(/\s+/)
  [country, code, id.to_i, qty.to_i]
end
You can also easily reject any rows you don't want, or select those you do, plus this helps with stripping off headers:
File.readlines('input.txt').reject do |line|
  line.match(/\ACountry/i)
end.map do |line|
  country, code, id, qty = line.chomp.split(/\s+/)
  [country, code, id.to_i, qty.to_i]
end.select do |country, code, id, qty|
  qty <= 50
end
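For the second half of the question, splitting a row into lines of at most 50 units, here is a minimal sketch (the split_by_qty helper name is my own):

```ruby
# Hypothetical helper: repeat a row, capping the quantity at `max`
# per emitted row, until the full quantity is accounted for.
def split_by_qty(country, code, id, qty, max = 50)
  rows = []
  while qty > 0
    rows << [country, code, id, [qty, max].min]
    qty -= max
  end
  rows
end

split_by_qty("FR", "B000X2D", 75, 130).each { |row| puts row.join(" ") }
# FR B000X2D 75 50
# FR B000X2D 75 50
# FR B000X2D 75 30
```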
Use the CSV class if these are tab-separated entries. CSV stands for "comma-separated values", but you can provide your own column separator:
require 'csv'

CSV.foreach("fname", col_sep: "\t") do |row|
  # use row here...
end
See https://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
Related
I have a CSV that looks like this:
user_id,is_user_unsubscribed
131072,1
7077888,1
11010048,1
12386304,1
327936,1
2228480,1
6553856,1
9830656,1
10158336,1
10486016,1
10617088,1
11010304,1
11272448,1
393728,1
7012864,1
8782336,1
11338240,1
11928064,1
4326144,1
8127232,1
11862784,1
but I want the data to look like this:
131072
7077888
11010048
12386304
327936
...
Any ideas on what to do? I have 330,000 rows...
You can read your file as an array and ignore the first row like this:
data = CSV.read("dataset.csv")[1 .. -1]
This way you can remove the header.
Regarding the column: if you read the file with headers: true you get a CSV::Table, and then you can delete a column by name:
data = CSV.read("dataset.csv", headers: true)
data.delete("is_user_unsubscribed")
data.to_csv # => The new CSV in string format
Check this for more info: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Table.html
http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
My recommendation would be to read in a line from your file as a string, then split the String that you get by commas (there is a comma separating your columns).
Splitting a Ruby String:
https://code-maven.com/ruby-split
line_num = 0
text = File.open('myfile.csv').read
text.each_line do |line|
  text_array = line.split(',')
  text_i_want = text_array[0]
  line_num = line_num + 1
  puts text_i_want
end
In this code we open a text file and read it line by line. From each line we take the text we want, the first column (the zeroth item in the split array), then print it.
If you do not want the header, add an if statement to skip the row when line_num is 0. Even better, use unless.
Just rewrite a new file with your new data.
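Putting those pieces together, a minimal sketch of the full rewrite (the sample data and the output file name ids_only.csv are my own; the input name comes from the snippet above):

```ruby
# Sample input in the question's format, written here so the sketch runs standalone:
File.write('myfile.csv', "user_id,is_user_unsubscribed\n131072,1\n7077888,1\n11010048,1\n")

# Read line by line, skip the header, keep only the first column:
File.open('ids_only.csv', 'w') do |out|
  File.foreach('myfile.csv').with_index do |line, line_num|
    next if line_num == 0            # header row
    out.puts line.split(',').first   # user_id column only
  end
end

puts File.read('ids_only.csv')
# 131072
# 7077888
# 11010048
```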
I wound up doing this. Is this kosher?
user_ids = []
CSV.foreach("eds_users_sept15.csv", headers: true) do |row|
  user_ids << row['user_id']
end
user_ids.count # => 322101

CSV.open('some_new_file.csv', 'w') do |c|
  user_ids.each do |id|
    c << [id]
  end
end
I have 330,000 rows...
So I guess speed matters, right?
I took your method and the other two that were proposed, tested them on a 330,000-row CSV file, and made a benchmark to show you something interesting.
require 'csv'
require 'benchmark'

Benchmark.bm(10) do |bm|
  bm.report("Method 1:") {
    data = Array.new
    CSV.foreach("input.csv", headers: true) do |row|
      data << row['user_id']
    end
  }
  bm.report("Method 2:") {
    data = CSV.read("input.csv")[1 .. -1]
    data.delete("is_user_unsubscribed")
  }
  bm.report("Method 3:") {
    data = Array.new
    File.open('input.csv').read.each_line do |line|
      data << line.split(',')[0]
    end
    data.shift # => remove headers
  }
end
The output:
                 user     system      total        real
Method 1:    3.110000   0.010000   3.120000 (  3.129409)
Method 2:    1.990000   0.010000   2.000000 (  2.004016)
Method 3:    0.380000   0.010000   0.390000 (  0.383700)
As you can see, handling the CSV file as a simple text file, splitting the lines, and pushing them into the array is ~5 times faster than using the CSV class. Of course it has some disadvantages too; i.e., if you ever add columns to the input file you'll have to revise the code.
It's up to you if you prefer lightspeed code or easier scalability.
I'm guessing that you plan to convert each string that precedes a comma to an integer. If so,
File.readlines("dataset.csv").drop(1).map(&:to_i)
is all you need. (For example, "131072,1".to_i #=> 131072.)
If you want strings, you could write
File.readlines("dataset.csv").drop(1).map { |s| s[/\d+/] }
I have the following files:
ids_to_remove.txt
ID File_Name Row Name
id\a_po87y GMT_dealer auto_dev
id\ruio66 dao_wells auto_dev
id\rzd766 123_Smart_option cia_read
...
...
etc
GMT_dealer-details.txt
#This configuration file was written by: org.eclipse.equinox.internal
[groups]
auto_dev = id\a_po87y, id\rt7890, id\sdfs09, id\rzdo9k
qa_check = id\op9iu, id\guijm0, id\a_po87y
AD_read = id\a_po87y
dao_wells-details.txt
#Content may be structured and packaged into modules to facilitate delivering, extending, and upgrading the Content
[groups]
AD_read = id\a_po87y
auto_dev = id\guijm0, id\oikju8, id\ruio66
CSI = id\kiopl, id\ruio66, id\o9i8u7
In ids_to_remove.txt, there are close to 500 entries, and items in a row are tab separated. In other files, groups values are space separated.
The value for File_Name represents a folder or file at E:/my_files/*-details.txt, where * is the File_Name value like GMT_dealer, dao_wells, or 123_Smart_option.
For each row, I want to delete any occurrence of the ID value in row with the row id Row Name in the file represented by File_Name. For example, I want to delete the string id\a_po87y from row with id auto_dev in file GMT_dealer. id\a_po87y should be removed only from the group auto_dev, and the same id present in groups qa_check and AD_read should be left as is. Likewise it has to be carried out on all files under E:/my_files.
I wrote the code below:
file_dir = 'E:/my_files'
file = File.open("E:/ids_to_remove.txt", "r")
contents = file.each_line.map { |line| line.split("\t") }.transpose
id, file_name, group = contents

id.each do |ids|
  puts "For id: #{ids}"
  file_name.each do |name|
    value = File.open("#{file_dir}/#{name}-details.txt")
    text = File.read(value)
    text.each_line do |el|
      group.each do |gr|
        if el.match(/#{gr}/) then
          print "group row #{gr}\n"
          replace = text.gsub(/#{Regexp.escape(ids)}\,\s/, '').gsub(/#{Regexp.escape(ids)}/, '').gsub(/,\s*$/, '')
        end
        group.shift
      end
    end
    file_name.shift
  end
end
id.shift
It doesn't do what I need. Looking for any suggestions.
For debugging I added a few puts statements; here is the output I got:
For ID: id\a_po87y
group row auto_dev
group row auto_dev
For ID: id\ruio66
For ID: id\rzd766
I would do something like this:
file_dir = 'E:/my_files'
file = File.open("E:/ids_to_remove.txt", "r")

file.each_line do |line|
  id, file_name, group = line.split
  old_text = File.read("#{file_dir}/#{file_name}-details.txt")
  new_text = []
  old_text.each_line do |line|
    if line =~ /=/
      line_group, line_ids = line.split("=")
      if line_group.strip == group.strip
        line_ids = line_ids.split(",").reject { |l_id| l_id.strip == id }.join(",")
      end
      new_text << "#{line_group}=#{line_ids.chomp("\n")}"
    else
      new_text << line.chomp("\n")
    end
  end
  File.write("#{file_dir}/#{file_name}-details.txt", new_text.join("\n"))
end
I'm sure there is a better way to handle the extra "\n", but this will get the desired output.
My requirement is to check whether a given string (read from a text file) exists in any of the files in a particular folder, and if so, store and print the first word of the line containing the matched string.
Below is code snippet,
.......
.......
my_files.each do |file_name|
  puts "File Name: #{file_name}"
  content = File.read(file_name)
  changed = content.gsub(/#{Regexp.escape(id_value)}/, '') # 'id_value' comes from the first-level loop; it holds a string value on each iteration.
  if content.include?("#{id_value}")
    print "its there\n"
    Row = content.split.first
    puts "From: #{Row}"
  end
end
One of the files in the folders
CDA created on September 20th 1999
Owner: Edward Jenner
Access IDs,
id = class\234ha, class\poi23, class\opiuj, cap\7y6t5
dept = sub\6985de, ret\oiu87, class\234ha
say if the id_value is class\234ha
For the first iteration it should give the output 'id' and 'dept', but the output is 'CDA'. I'm also facing the warning below.
test.rb:19: warning: already initialized constant Row test.rb:19:
warning: previous definition of Row was here From: class\poi23
Any suggestions please. I have looked for other options too but none worked. I'm a beginner to Ruby, so kindly excuse my ignorance. Thanks.
Here is an example from a script I had that does what you are looking for. It's fairly straightforward if you use the each_line method of a File object.
#!/usr/bin/env ruby
regex_to_find = Regexp.new Regexp.escape(ARGV[0])
files = Dir.glob ARGV[1]
files.each do |f|
  current_file = File.new f
  current_file.each_line do |l|
    if l =~ regex_to_find
      puts "#{f} #{current_file.lineno}: first word = #{l.split.first}, full line: #{l}"
    end
  end
end
If you run this script on a directory with a file containing the data you show above, you get the following output, which is, I think, what you are looking for.
$ ./q43950329.rb 'class\234ha' "*"
q43950329_data 4: first word = id, full line: id = class\234ha, class\poi23, class\opiuj, cap\7y6t5
q43950329_data 5: first word = dept, full line: dept = sub\6985de, ret\oiu87, class\234ha
Note the above script is in a file called q43950329.rb and the following file exists in the current directory as q43950329_data:
CDA created on September 20th 1999
Owner: Edward Jenner
Access IDs,
id = class\234ha, class\poi23, class\opiuj, cap\7y6t5
dept = sub\6985de, ret\oiu87, class\234ha
There is also grep, which takes a block and returns the block's results for every matching line (note the backslash must be doubled inside the regex literal):
File.open('sample.txt').grep(/class\\234ha/) { |line| line.split.first }
=> ["id", "dept"]
Do pass a block to the grep method.
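Combining this with the folder requirement from the question, a sketch (the my_folder path is hypothetical, and Regexp.escape handles the backslash in the id):

```ruby
# Scan every file in a folder and collect the first word of each
# line that contains the id.
id_value = 'class\234ha'
pattern = Regexp.new(Regexp.escape(id_value))

hits = Dir.glob('my_folder/*').flat_map do |fname|
  File.open(fname).grep(pattern) { |line| line.split.first }
end
```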
I have a file Index.csv that contains the following data:
100
200
300
400
500
600
700
800
900
1000
I need to print, or save into a new file New.csv, the rows of a CSV file Original.csv whose line numbers are listed in Index.csv. How do I do that?
I could not do it, so I copied the contents of Index.csv into an array, and wrote the following code, but it's not working:
array = [100,200,300,400,500,600,700,800,900,1000]

CSV.open('New.csv', "wb") do |csv|
  f = File.open('Original.csv', "r")
  f.each_line { |line|
    row = line.split(",")
    for i in 0..array.size
      if array[i] == line
        csv << row
      end
    end
  }
end
There is missing detail in your question, such as how many lines are in the files, and whether the index file is sorted. Without that information and assuming the worst, huge files and an unsorted index file, I'd use something like this code:
File.open('new.csv', 'w') do |new_csv|
  File.foreach('index.csv') do |line_num|
    File.open('original.csv', 'r') do |original_csv|
      original_line = ''
      line_num.to_i.times do
        original_line = original_csv.gets
      end
      new_csv.puts original_line
    end
  end
end
Assuming an index.csv of:
1
3
5
7
9
and an original.csv of:
row1
row2
row3
row4
row5
row6
row7
row8
row9
row10
Running the code creates new.csv:
> cat new.csv
row1
row3
row5
row7
row9
CSV files are text, so it's not necessary to use the CSV class to read or write them if we're only concerned with the individual lines.
There are changes that could be made to use readlines and slurping the input files and indexes into the resulting arrays, but that will result in code that isn't scalable. The suggested code will result in rereading original.csv for each line in index.csv, but it'll also handle files of arbitrary size, something that's very important in production environments.
For instance, if index.csv will be small and unsorted:
File.open('new.csv', 'w') do |new_csv|
  indexes = File.readlines('index.csv').map(&:to_i).sort
  File.foreach('original.csv').with_index(1) do |original_line, original_lineno|
    new_csv.puts original_line if indexes.include?(original_lineno)
  end
end
That will run more quickly because it only iterates through original.csv once, but opens up a potential scalability problem if index.csv grows too big.
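If index.csv does grow, one common fix (a sketch of my own, not part of the original answer) is to swap the Array for a Set, so each membership test is O(1) instead of a linear scan:

```ruby
require 'set'

# Sample inputs matching the answer above, written here so the
# sketch runs standalone:
File.write('index.csv', "1\n3\n5\n7\n9\n")
File.write('original.csv', (1..10).map { |n| "row#{n}" }.join("\n") + "\n")

# Same one-pass approach, but indexes is a Set, so include? does
# a hash lookup rather than rescanning an Array for every line.
File.open('new.csv', 'w') do |new_csv|
  indexes = File.readlines('index.csv').map(&:to_i).to_set
  File.foreach('original.csv').with_index(1) do |line, lineno|
    new_csv.puts line if indexes.include?(lineno)
  end
end

puts File.read('new.csv')
# row1
# row3
# row5
# row7
# row9
```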
I will show you a way to print a line without reading from "Index.csv".
array = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
i = array.shift

File.new("Original.csv").each_line.with_index(1) do |l, j|
  if j == i
    puts l
    i = array.shift
  end
end
I have a CSV that has three types of info:
userID,wordID,ct
(Basically, 14k different tweeps, a different line for each word they use, including a count for that word)
I would like to be able to filter this file just for userIDs that have at least 2000 different wordIDs.
I understand how to go through the file and count up wordIDs per userID, but I don't know how to combine this with "now put 'userID,wordID,ct' just for the userIDs that are really frequent."
Any help is much appreciated.
Here's how I'm processing the file currently. I suspect there are more efficient ways to do this as the file itself is 19m lines--thoughts on efficiency are certainly appreciated.
filename = ARGV[0]
file = File.new(filename, "r")

entry = {}
file.each do |line|
  user, word, ct = line.chomp.split(",")
  entry[user] = entry[user].to_i + 1
end

file = File.new(filename, "r")
file.each do |line|
  line.strip!
  user, word, ct = line.chomp.split(",")
  if entry[user] >= 2000
    puts line
  end
end
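The same two-pass idea can be written a little more tightly with a hash that defaults to zero (a sketch; the helper name and the return-an-array design are my own, and the file name still comes from ARGV):

```ruby
# Pass 1: count lines (one per wordID) for each user.
# Pass 2: keep only the lines whose user meets the threshold.
def frequent_user_lines(filename, min_words = 2000)
  counts = Hash.new(0)
  File.foreach(filename) { |line| counts[line.split(',').first] += 1 }
  File.foreach(filename).select { |line| counts[line.split(',').first] >= min_words }
end

puts frequent_user_lines(ARGV[0]) unless ARGV.empty?
```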