How to convert a space-delimited .txt file to a ","-delimited .txt file using Ruby?

I have a text file like the one below:
Employee details.txt
Raja Palit 77489 24 84 12/12/2011
Mathew bargur 77559 25 88 01/12/2011
harin Roy 77787 24 80 12/12/2012
Soumi paul 77251 24 88 11/11/2012
I want the file as below:
Expected file:
Raja,Palit,77489,24,84,12/12/2011
Mathew,bargur,77559,25,88,01/12/2011
harin,Roy,77787,24,80,12/12/2012
Soumi,paul,77251,24,88,11/11/2012
What I tried below:
IO.foreach('D://docs//details.txt') do |line|
splits = line.split("\t")
col1, col2, col3, col4, col5, col6 = splits
splits[6..-1].join(',')
end

Though it seems like a quick way to deal with this sort of data by splitting on whitespace, that will fail if any field contains embedded whitespace. For instance, if the name of the person in the record is something like "Maria Von Trapp" or "Smokey the Bear", the resulting comma-delimited fields will be wrong.
The correct way to deal with this is to parse based on column-field widths, then squeeze and strip whitespace inside those fields, then turn the record into a CSV record.
require 'csv'
require 'scanf' if (RUBY_VERSION >= '1.9.3')
# The name occupies a fixed 15-character column, followed by three integers
# and a 10-character date.
FORMAT = '%15c %d %d %d %10c'
data = <<EOT
Raja Palit      77489 24 84 12/12/2011
Mathew bargur   77559 25 88 01/12/2011
harin Roy       77787 24 80 12/12/2012
Soumi paul      77251 24 88 11/11/2012
Maria Von Trapp 99999 99 99 12/31/2012
Smokey the Bear 99999 99 99 12/31/2012
EOT
data.split("\n").each do |li|
  fields = li.scanf(FORMAT)
  # Strip the padding from the name, keep the other fields as parsed.
  puts [fields.first.strip, *fields[1 .. -1]].to_csv
end
Which outputs:
Raja Palit,77489,24,84,12/12/2011
Mathew bargur,77559,25,88,01/12/2011
harin Roy,77787,24,80,12/12/2012
Soumi paul,77251,24,88,11/11/2012
Maria Von Trapp,99999,99,99,12/31/2012
Smokey the Bear,99999,99,99,12/31/2012
Note: Ruby 1.9.3 split scanf into its own library, which explains the conditional require. Since Ruby 2.7, scanf is no longer part of the standard library and has to be installed as a gem.
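If the column widths are not known, a rough alternative is to peel the fields off from the right instead. This is only a sketch, and it assumes that the last four whitespace-separated tokens are always the ID, the two numbers and the date, so that everything before them is the name. `record_to_csv` is a made-up helper name, not part of any library.

```ruby
require 'csv'

# Hypothetical helper: treats the last four whitespace-separated tokens as
# id, age, score and date, and joins whatever precedes them into the name.
def record_to_csv(line)
  *name_parts, id, age, score, date = line.split
  [name_parts.join(' '), id, age, score, date].to_csv
end

puts record_to_csv("Maria Von Trapp 99999 99 99 12/31/2012")
```

Unlike the fixed-width approach, this breaks as soon as any of the trailing fields can contain a space.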

Strings come with a squeeze method, which squeezes runs of the character(s) given in the argument into a single character. In this case it reduces multiple spaces to one space, which is then replaced by a comma:
File.open("test.txt") do |in_file|
  File.open("test.csv", 'w') do |out_file| # the 'w' opens the file for writing
    in_file.each { |line| out_file << line.squeeze(' ').gsub(' ', ',') }
  end # closes test.csv
end # closes test.txt

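You could use a regular expression to replace every whitespace character with a comma (Ruby has no /g flag; gsub replaces all matches, while sub replaces only the first). Note that \s also matches the trailing newline, so chomp the line first if it has one:
my_string.gsub!(/\s/, ',')
If you want to discard empty fields, you could use this:
my_string.gsub!(/\s+/, ',')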
An alternative would be to split it on spaces and join on commas. This will also discard empty fields:
my_string = my_string.split(' ').join(',')
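As a small illustration of the split-and-join variant, the sketch below converts one record. The helper name `to_comma_delimited` and the sample string are made up for the example; it assumes no field contains embedded whitespace.

```ruby
# Hypothetical helper: split on any run of whitespace, then join with commas.
# String#split without arguments also drops leading and trailing whitespace,
# including the newline at the end of the line.
def to_comma_delimited(line)
  line.split.join(',')
end

input = "Raja Palit   77489 24\t84 12/12/2011\n"
puts to_comma_delimited(input)   # prints Raja,Palit,77489,24,84,12/12/2011
```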

File.write("details.txt", File.read("details.txt").gsub(/[ \t]+/, ","))

Related

script to loop through and combine two text files

I have two .csv files which I am trying to 'multiply' out via a script. The first file is person information and looks basically like this:
First Name, Last Name, Email, Phone
Sally,Davis,sdavis#nobody.com,555-555-5555
Tom,Smith,tsmith#nobody.com,555-555-1212
The second file is account numbers and looks like this:
AccountID
1001
1002
Basically I want to get every name with every account Id. So if I had 10 names in the first file and 10 account IDs in the second file, I should end up with 100 rows in the resulting file and have it look like this:
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis#nobody.com,555-555-5555, 1001
Tom,Smith,tsmith#nobody.com,555-555-1212, 1001
Sally,Davis,sdavis#nobody.com,555-555-5555, 1002
Tom,Smith,tsmith#nobody.com,555-555-1212, 1002
Any help would be greatly appreciated
You could simply write a for loop that repeats each record once per account ID and appends the ID to it, just in the reverse order.
Has that not worked, or have you not tried it?
If python works for you, here's a script which does that:
def main():
    # Read both files up front, dropping trailing newlines and blank lines.
    with open("accounts.txt") as f:
        accounts = [line.strip() for line in f if line.strip()]
    with open("info.txt") as f:
        infos = [line.strip() for line in f if line.strip()]

    with open("result.txt", "w") as out:
        out.write("First Name, Last Name, Email, Phone, AccountID\n")
        # Outer loop over the accounts so the rows are grouped by account ID.
        for account in accounts:
            for info in infos:
                out.write(info + "," + account + "\n")


if __name__ == "__main__":
    main()
And the files I used are as follows:
info.txt:
Sally,Davis,sdavis#nobody.com,555-555-555
Tom,Smith,tsmith#nobody.com,555-555-1212
John,Doe,jdoe#nobody.com,555-555-3333
accounts.txt:
1001
1002
1003
If You Intended to Duplicate Account_ID
If you intended to add each Account_ID to every record in your information file then a short awk solution will do, e.g.
$ awk -F, '
    FNR==NR {a[i++]=$0}
    FNR!=NR {b[j++]=$0}
    END {
        print a[0] ", " b[0]
        for (k=1; k<j; k++)
            for (m=1; m<i; m++)
                print a[m] ", " b[k]
    }
' info id
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis#nobody.com,555-555-5555, 1001
Tom,Smith,tsmith#nobody.com,555-555-1212, 1001
Sally,Davis,sdavis#nobody.com,555-555-5555, 1002
Tom,Smith,tsmith#nobody.com,555-555-1212, 1002
Above, the lines of the first file (where the per-file record number equals the overall record number, i.e. FNR==NR) are stored in array a, and the lines of the second file (where FNR!=NR) are stored in array b. They are then combined and output in the END rule in the desired order.
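For comparison, the same cross join can be sketched in Ruby with Array#product. The arrays below are in-memory stand-ins for the two files, and `cross_join` is a made-up helper name; it assumes the first line of each input is a header.

```ruby
# In-memory stand-ins for the two input files (first line is the header).
info_lines = [
  "First Name, Last Name, Email, Phone",
  "Sally,Davis,sdavis#nobody.com,555-555-5555",
  "Tom,Smith,tsmith#nobody.com,555-555-1212"
]
id_lines = ["AccountID", "1001", "1002"]

# Hypothetical helper: pairs every person with every account ID.
def cross_join(info_lines, id_lines)
  info_header, *people = info_lines
  id_header, *ids = id_lines
  # ids.product(people) walks all people for the first ID, then the next ID.
  rows = ids.product(people).map { |id, person| "#{person}, #{id}" }
  ["#{info_header}, #{id_header}", *rows]
end

puts cross_join(info_lines, id_lines)
```

Putting the IDs on the outside of the product keeps the rows grouped by account ID, as in the expected output.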
Without Duplicating Account_ID
Since Account_ID is usually a unique bit of information, if you did not intend to duplicate every ID at the end of each record, then there is no need to loop. The paste command does that for you. In your case, with your information file as info and your account ID file as id, it is as simple as:
$ paste -d, info id
First Name, Last Name, Email, Phone,AccountID
Sally,Davis,sdavis#nobody.com,555-555-5555,1001
Tom,Smith,tsmith#nobody.com,555-555-1212,1002
(note: the -d, option just sets the delimiter to a comma)
Seems a lot easier than trying to reinvent the wheel.
This can also be done easily with bash arrays (this pairs line N of file1 with line N of file2, like paste):
OLD=$IFS; IFS=$'\n'
ar1=( $(cat file1) )
ar2=( $(cat file2) )
IFS=$OLD
for i in "${!ar1[@]}"; do echo "${ar1[$i]}, ${ar2[$i]}"; done

Ruby question mark in filename

I have a little piece of ruby that creates a file containing tsv content with 2 columns, a date, and a random number.
#!/usr/bin/ruby
require 'date'
require 'set'
startDate=Date.new(2014,11,1)
endDate=Date.new(2015,9,1)
dates=File.new("/PATH_TO_FILE/dates_randoms.tsv","w+")
rands=Set.new
while startDate <= endDate do
random=rand(1000)
while rands.add?(random).nil? do
random=rand(1000)
end
dates.puts("#{startDate.to_s.gsub("-","")}\t#{random}")
startDate=startDate+1
end
Then, from another program, I read this file and create a file named after the random number:
dates_file=File.new(DATES_FILE_PATH,"r")
dates_file.each_line do |line|
parts=line.split("\t")
random=parts.at(1)
table=File.new("#{TMP_DIR}#{random}.tsv","w")
end
But when I check the result, I see a file named 645?.tsv, for example.
I initially suspected the line separator in the TSV file (the one containing the date and the random number), but both programs run on the same Unix filesystem, so it is not a DOS-to-Unix line-ending issue.
Some lines from the file:
head dates_randoms.tsv
20141101 356
20141102 604
20141103 680
20141104 668
20141105 995
20141106 946
20141107 354
20141108 234
20141109 429
20141110 384
Any advice?
parts = line.split("\t")
random = parts.at(1)
line there will contain a trailing newline char. So for a line
"whatever\t1234\n"
random will contain "1234\n". That newline char then becomes a part of filename and you see it as a question mark. The simplest workaround is to do some sanitization:
random = parts.at(1).chomp
# alternatively, use .strip if you also want to remove whitespace
# from the beginning of the value
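A minimal sketch of the effect, using one line in the shape of the dates file. `filename_for` is a made-up helper, and chomping the whole line before splitting is just one way to apply the fix.

```ruby
# Hypothetical helper: builds the output file name from one TSV line.
# Chomping before the split keeps the newline out of the last field.
def filename_for(line)
  _date, random = line.chomp.split("\t")
  "#{random}.tsv"
end

line = "20141101\t356\n"
puts line.split("\t").last.inspect   # prints "356\n" (newline still attached)
puts filename_for(line)              # prints 356.tsv
```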

Recovering hex data from a large log-file using Ruby and RegEx

I'm trying to filter/append lines of hex data from a large log-file, using Ruby and RegEx.
The lines of the log-file that I need look like this:
Data: 10 55 61 (+ lots more hex data)
I want to add all of the hex data, for further processing later. The regex /^\sData:(.+)/ should do the trick.
My Ruby-program looks like this:
puts "Start"
fileIn = File.read("inputfile.txt")
fileOut = File.new("outputfile.txt", "w+")
fileOut.puts "Start of regex data\n"
fileIn.each_line do
dataLine = fileIn.match(/^\sData:(.+)/).captures
fileOut.write dataLine
end
fileOut.puts "\nEOF"
fileOut.close
puts "End"
It works - sort of - but the lines in the output file are all the same, just repeating the result of the first regex match.
What am I doing wrong?
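fileIn holds the entire file contents, so fileIn.match returns the first match in the whole file on every iteration. Match against the current line instead, and skip lines that do not match, because match returns nil for them:
fileIn.each_line do |line|
  match = line.match(/^\sData:(.+)/)
  next unless match          # not a data line
  fileOut.write match[1]     # write the captured hex string, not the captures array
end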
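Since the whole file is already in memory, String#scan can collect every capture in one call instead of looping. This is a sketch under the assumption that each data line starts with one whitespace character followed by "Data:", as in the regex from the question; `extract_hex` and the sample log text are made up for the example.

```ruby
# Hypothetical helper: returns the hex payload of every "Data:" line.
# In Ruby, ^ matches at the start of each line, so scan finds all of them.
def extract_hex(text)
  text.scan(/^\sData:(.+)/).flatten.map(&:strip)
end

log = "Header line\n Data: 10 55 61\nSome other entry\n Data: AA BB CC\n"
puts extract_hex(log)
```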

How to write some value to a text file in ruby based on position

I need help with a specific problem. I have a text file in which I have to replace some values based on their position. The file is not big: it always contains 5 lines, and every line has a fixed length. I have to replace the text at specific positions only. Alternatively, I can put placeholder text at the required position and replace that text with the required value every time. I am not sure how to implement this. An example is given below.
Line 1 - 00000 This Is Me 12345 trying
Line 2 - 23456 This is line 2 987654
Line 3 - This is 345678 line 3 67890
Consider the above to be the file in which I have to replace some values. For example, in line 1 I have to replace '00000' with '11111', and in line 2 I have to replace 'This' with 'Line' or any required four-character text. The positions always remain the same in the text file.
I have a solution that works, but it is for reading the file based on position, not for writing. Can someone please give a similar solution for writing based on position?
Solution for reading the file based on position:
def read_var file, line_nr, vbegin, vend
IO.readlines(file)[line_nr][vbegin..vend]
end
puts read_var("read_var_from_file.txt", 0, 1, 3) #line 0, beginning at 1, ending at 3
#=>308
puts read_var("read_var_from_file.txt", 1, 3, 6)
#=>8522
I have also tried this solution for writing. It works, but I need it to work based on position or based on the text present in a specific line.
Explored solution to write to a file:
open(Dir.pwd + '/Files/Try.txt', 'w') { |f|
f << "Four score\n"
f << "and seven\n"
f << "years ago\n"
}
I made you a working sample anagraj.
in_file = "in.txt"
out_file = "out.txt"
=begin
=>contents of file in.txt
00000 This Is Me 12345 trying
23456 This is line 2 987654
This is 345678 line 3 67890
=end
def replace_in_file in_file, out_file, shreds
  File.open(out_file,"wb") do |file|
    File.read(in_file).each_line.with_index do |line, index|
      shreds.each do |shred|
        if shred[:index]==index
          line[shred[:begin]..shred[:end]]=shred[:replace]
        end
      end
      file << line
    end
  end
end
shreds = [
  {index:0, begin:0, end:4, replace:"11111"},
  {index:1, begin:6, end:9, replace:"Line"}
]
replace_in_file in_file, out_file, shreds
=begin
=>contents of file out.txt
11111 This Is Me 12345 trying
23456 Line is line 2 987654
This is 345678 line 3 67890
=end
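The core of the sample above is Ruby's slice assignment on strings. The sketch below isolates it in a made-up helper, `overwrite_at`, which additionally assumes that the replacement must not change the line length, since the question describes fixed-length lines.

```ruby
# Hypothetical helper: overwrites replacement.length characters starting at
# `start`, so the line keeps its original length. Returns a new string.
def overwrite_at(line, start, replacement)
  if start + replacement.length > line.length
    raise ArgumentError, "replacement runs past the end of the line"
  end
  result = line.dup
  result[start, replacement.length] = replacement
  result
end

puts overwrite_at("00000 This Is Me 12345 trying", 0, "11111")
puts overwrite_at("23456 This is line 2 987654", 6, "Line")
```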

replace every occurrence of 'line 2' with line_2 with regex

I'm parsing some text from an XML file which has sentences like
"Subtract line 4 from line 1.", "Enter the amount from line 5"
I want to replace all occurrences of "line N" with "line_N",
eg. Subtract line 4 from line 1 --> Subtract line_4 from line_1
Also, there are sentences like "Are the amounts on lines 4 and 8 the same?" and "Skip lines 9 through 12; go to line 13."
I want to process these sentences to become
"Are the amounts on line_4 and line_8 the same?"
and
"Skip line_9 through line_12; go to line_13."
Here's a working implementation with an rspec test. You call it like this: output = LineIdentifier[input]. To test, run spec file.rb after installing the rspec gem.
require 'spec'
class LineIdentifier
  def self.[](input)
    output = input.gsub(/line (\d+)/, 'line_\1')
    output.gsub(/lines (\d+) (and|from|through) (line )?(\d+)/, 'line_\1 \2 line_\4')
  end
end
describe "LineIdentifier" do
  it "should identify line mentions" do
    examples = {
      # Input => Output
      'Subtract line 4 from line 1.' => 'Subtract line_4 from line_1.',
      'Enter the amount from line 5' => 'Enter the amount from line_5',
      'Subtract line 4 from line 1' => 'Subtract line_4 from line_1',
    }
    examples.each do |input, output|
      LineIdentifier[input].should == output
    end
  end
  it "should identify line ranges" do
    examples = {
      # Input => Output
      'Are the amounts on lines 4 and 8 the same?' => 'Are the amounts on line_4 and line_8 the same?',
      'Skip lines 9 through 12; go to line 13.' => 'Skip line_9 through line_12; go to line_13.',
    }
    examples.each do |input, output|
      LineIdentifier[input].should == output
    end
  end
end
This works for the specific examples including the ones in the OP comments. As is often the case when using regex to do parsing, it becomes a hodge-podge of additional cases and tests to handle ever-increasing known inputs. This handles the lists of line numbers using a while loop with a non-greedy match. As written, it is simply processing an input line-by-line. To get series of line numbers across line boundaries, it would need to be changed to process it as one chunk with matching across lines.
open( ARGV[0], "r" ) do |file|
  while ( line = file.gets )
    # replace both "line ddd" and "lines ddd" with line_ddd
    line.gsub!( /(lines?\s)(\d+)/, 'line_\2' )
    # Now replace the known sequences with a non-greedy match
    while line.gsub!( /(line_\d+[a-z]?,?)(\sand\s|\sthrough\s|,\s)(\d+)/, '\1\2line_\3' )
    end
    puts line
  end
end
Sample Data: For this input:
Subtract line 4 from line 1.
Enter the amount from line 5
on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
... on line 10 Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4
It produces this output:
Subtract line_4 from line_1.
Enter the amount from line_5
on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
... on line_10 Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
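A single-pass variant is also possible with the block form of gsub: match the whole list after "line" or "lines" once, then prefix every number inside the match. This is a different technique from the loop above and only a sketch. The `LINE_REFS` pattern and the `tag_line_numbers` helper are made up, and the pattern assumes list items are separated only by ", ", ", and ", " and " or " through ".

```ruby
# "line" or "lines", one number (optionally with a letter suffix), then any
# further numbers joined by ", ", ", and ", " and " or " through ".
LINE_REFS = /lines?\s+(\d+[a-z]?(?:(?:,\s+(?:and\s+)?|\s+(?:and|through)\s+)\d+[a-z]?)*)/

# Hypothetical helper: rewrites every number in a matched list as line_N.
def tag_line_numbers(text)
  text.gsub(LINE_REFS) do
    # The captured list keeps its separators; only the numbers get a prefix.
    Regexp.last_match(1).gsub(/\d+[a-z]?/) { |n| "line_#{n}" }
  end
end

puts tag_line_numbers("Skip lines 9 through 12; go to line 13.")
puts tag_line_numbers("Add lines 2, 3, and 4")
```

Because the list is matched as one unit, a number such as 1040A that does not follow "line" or "lines" is left alone.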
sed is your friend:
lines.sed:
#!/bin/sed -rf
s/lines? ([0-9]+)/line_\1/g
s/\b([0-9]+[a-z]?)\b/line_\1/g
lines.txt:
Subtract line 4 from line 1.
Enter the amount from line 5
Are the amounts on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
Enter the total of the amounts from Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4
demo:
$ cat lines.txt | ./lines.sed
Subtract line_4 from line_1.
Enter the amount from line_5
Are the amounts on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
Enter the total of the amounts from Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
You can also make this into a sed one-liner if you prefer, although the file is more maintainable.
