Comparison between two tab-separated files in Unix using awk - shell

I've written this code on Unix, but I am facing the problem described below.
My code is:
paste 1.txt 2.txt |
awk '{ FS = "\t " } ; NR == 1 { n = NF/2 }
     { for (i = 1; i <= n; i++)
           if ($i != $(i+n)) { c = c s i; s = "," }
       if (c) {
           print "Line No. " NR-1 " COLUMN NO " c
           c = "" ; s = ""
       } }'
Expected Output:
Line No. 2 COLUMN NO 2,3
Line No. 4 COLUMN NO 1,2,3,4
Line No. 6 COLUMN NO 2,3,4,5
Line No. 7 COLUMN NO 1,2,3,4,5
Output that is getting generated:
Line No. 2 COLUMN NO 2,3
Line No. 4 COLUMN NO 1,2,3,4
Line No. 6 COLUMN NO 2,3,4,5
Line No. 7 COLUMN NO 1,2,3,4
The files specified below are shown space-separated; I have formatted them this way to make them easier to read.
File1:
ID_ID First_name Last_name Address Contact_Number
ID1 John Rock 32, Park Lake, California 2222200000
ID2 Tommy Hill 5322 Otter Lane Middleberge 3333300000
ID3 Leonardo Test Half-Way Pond, Georgetown 4444400000
ID8 Rhyan Bigsh 6762,33 Ave N,St. Petersburg 5555500000
ID50 Steve Goldberg 6762,33 Ave N,St. Petersburg 6666600000
ID60 Steve Goldberg 6666600000
File2:
ID_ID First_name Last_name Address Contact_Number
ID1 John Rock 32, Park Lake, California 2222200000
ID2 Tommy1 Hill1 5322 Otter Lane Middleberge 3333300000
ID3 Leonardo Test Half-Way Pond, Georgetown 4444400000
ID80 Sylvester Stallone 5555500000
ID50 Steve Goldberg 6762,33 Ave N,St. Petersburg 6666600000
ID60 Mark Waugh St. Petersburg 7777700000
ID70 John Smith 8888800000
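One robust alternative (a sketch, not the asker's code): avoid paste entirely and index the two files' fields separately, so a row with missing fields in either file cannot shift the $(i+n) offsets. File names 1.txt and 2.txt are taken from the question; the tab delimiter is assumed from the title.

```shell
# Slurp 1.txt, then compare each line of 2.txt field by field.
# A short row in either file then can't misalign the column numbers.
awk -F'\t' '
NR == FNR { a[FNR] = $0; next }          # first pass: remember every 1.txt line
{
    m = split(a[FNR], f, FS)             # fields of the matching 1.txt line
    n = (NF > m) ? NF : m                # compare up to the longer row
    c = ""; s = ""
    for (i = 1; i <= n; i++)
        if (f[i] != $i) { c = c s i; s = "," }
    if (c != "") print "Line No. " FNR - 1 " COLUMN NO " c
}' 1.txt 2.txt
```

Setting the separator with -F also avoids the original pitfall that assigning FS inside an action does not affect the line that has already been split.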

Related

Extract bibtex entries based on the year

Okay, I got the file.bib file with multiple entries such
@Book{Anley:2007:shellcoders-handbook-2nd-ed,
author = {Chris Anley and John Heasman and Felix Lindner and Gerardo
Richarte},
title = "{The Shellcoder's Handbook}",
publisher = {Wiley},
year = 2007,
edition = 2,
month = aug,
}
There you can find the "year = 2007" line. My task is to filter out the years which are greater than 2020 ($currentyear) or lower than 1900 ($minyear); the result should also include the output of the month "may", which stands behind a "year" line in this file (which is a mistake by the admin). (BTW, the file is over 4,000 lines long.)
It is better to use awk for this. Similar to your line, it would read:
awk -v t1="1900" -v t2="$(date "+%Y")" \
    '!match($0, /year.*=.*/) { next }
     {
         t = substr($0, RSTART, RLENGTH)
         match(t, /[0-9][0-9][0-9][0-9]/)
         y = substr(t, RSTART, RLENGTH)
     }
     (y > t1) && (y <= t2) { print y }' file
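If the goal is instead to flag the out-of-range entries themselves (years above $currentyear or below $minyear), the same scan works with the comparison inverted. This is only a sketch, assuming each year appears as a bare four-digit number on the "year =" line:

```shell
# Report lines whose year falls outside [t1, t2].
awk -v t1=1900 -v t2="$(date +%Y)" '
match($0, /year[[:space:]]*=[[:space:]]*[0-9][0-9][0-9][0-9]/) {
    y = substr($0, RSTART + RLENGTH - 4, 4)   # last four chars of the match
    if (y < t1 || y > t2) print NR ": " $0    # report the offending line
}' file.bib
```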

script to loop through and combine two text files

I have two .csv files which I am trying to 'multiply' out via a script. The first file is person information and looks basically like this:
First Name, Last Name, Email, Phone
Sally,Davis,sdavis@nobody.com,555-555-5555
Tom,Smith,tsmith@nobody.com,555-555-1212
The second file is account numbers and looks like this:
AccountID
1001
1002
Basically I want to get every name with every account Id. So if I had 10 names in the first file and 10 account IDs in the second file, I should end up with 100 rows in the resulting file and have it look like this:
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555, 1001
Tom,Smith,tsmith@nobody.com,555-555-1212, 1001
Sally,Davis,sdavis@nobody.com,555-555-5555, 1002
Tom,Smith,tsmith@nobody.com,555-555-1212, 1002
Any help would be greatly appreciated
You could simply write a for loop repeating each value by its ID count and appending the description, just in the reverse order.
Has that not worked, or have you not tried that?
If Python works for you, here's a script which does that:
def main():
    f1 = open("accounts.txt", "r")
    f1_total_lines = sum(1 for line in open('accounts.txt'))
    f2_total_lines = sum(1 for line in open('info.txt'))
    f1_line_counter = 1
    f2_line_counter = 1
    f3 = open("result.txt", "w")
    f3.write('First Name, Last Name, Email, Phone, AccountID\n')
    for line_account in f1.readlines():
        f2 = open("info.txt", "r")
        for line_info in f2.readlines():
            parsed_line_account = line_account
            parsed_line_info = line_info.rstrip()  # trim the newline character from every line of the 'info' file...
            if f2_line_counter == f2_total_lines:  # ...for every but the last line (because it doesn't have a newline character)
                parsed_line_info = line_info
            f3.write(parsed_line_info + ',' + parsed_line_account)
            if f1_line_counter == f1_total_lines:
                f3.write('\n')
            f2_line_counter = f2_line_counter + 1
        f1_line_counter = f1_line_counter + 1
        f2_line_counter = 1  # reset the line counter to the first line
    f1.close()
    f2.close()
    f3.close()

if __name__ == '__main__':
    main()
And the files I used are as follows:
info.txt:
Sally,Davis,sdavis@nobody.com,555-555-555
Tom,Smith,tsmith@nobody.com,555-555-1212
John,Doe,jdoe@nobody.com,555-555-3333
accounts.txt:
1001
1002
1003
If You Intended to Duplicate Account_ID
If you intended to add each Account_ID to every record in your information file then a short awk solution will do, e.g.
$ awk -F, '
    FNR == NR { a[i++] = $0 }
    FNR != NR { b[j++] = $0 }
    END { print a[0] ", " b[0]
          for (k = 1; k < j; k++)
              for (m = 1; m < i; m++)
                  print a[m] ", " b[k] }
' info id
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555, 1001
Tom,Smith,tsmith@nobody.com,555-555-1212, 1001
Sally,Davis,sdavis@nobody.com,555-555-5555, 1002
Tom,Smith,tsmith@nobody.com,555-555-1212, 1002
Above, the lines in the first file (when the file record number equals the total record number, i.e. FNR==NR) are stored in array a, the lines from the second file (when FNR!=NR) are stored in array b, and then they are combined and output in the END rule in the desired order.
Without Duplicating Account_ID
Since Account_ID is usually a unique piece of information, if you did not intend to duplicate every ID at the end of each record, then there is no need to loop. The paste command does that for you. In your case, with your information file as info and your account ID file as id, it is as simple as:
$ paste -d, info id
First Name, Last Name, Email, Phone,AccountID
Sally,Davis,sdavis#nobody.com,555-555-5555,1001
Tom,Smith,tsmith#nobody.com,555-555-1212,1002
(note: the -d, option just sets the delimiter to a comma)
Seems a lot easier than trying to reinvent the wheel.
This can easily be done with arrays:
OLD=$IFS; IFS=$'\n'
ar1=( $(cat file1) )
ar2=( $(cat file2) )
IFS=$OLD
ind=${!ar1[@]}
for i in $ind; { echo "${ar1[$i]}, ${ar2[$i]}"; }
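Note that the array loop above pairs line i of file1 with line i of file2, which zips the files rather than producing the full cross product the question asks for. If every person row really should be combined with every account ID, nested read loops are a minimal sketch (file names taken from the question):

```shell
# Cross join: every person row combined with every account ID.
# Assumes info.txt holds the person rows and accounts.txt the IDs.
while IFS= read -r id; do
    while IFS= read -r person; do
        printf '%s, %s\n' "$person" "$id"
    done < info.txt
done < accounts.txt
```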

Want to find the max record among the records using pig

I want to find the player who scored the maximum number of runs against each team, using Pig.
Input: the records are in the following format:
Sachin 100 KXIP Hyderabad 1991
sehwag 150 KXIP Hyderabad 1991
Sehwag 100 MI Mumbai 2011
Kohli 0 CSK Chennai 2014
Dhoni 150 MI Hyderabad 1991
Sachin 32 PW Chennai 2014
Dhoni 150 MI Mumbai 2011
My Implementation:
record1 = LOAD 'ipl.txt' using PigStorage(' ') as (name:chararray,runs:int,team:chararray,loc:chararray,year:int);
record2 = GROUP record1 by team;
record3 = FOREACH record2 GENERATE group, MAX(record1.runs) as mx;
record4 = ORDER record3 by mx ASC;
DUMP record4;
Output:
(PW,32)
(KXIP,150)
(MI,150)
But I am expecting the result in the following format:
Sachin PW 32 Chennai 2014
record1= LOAD 'ipl.txt' using PigStorage(' ') as (name:chararray,runs:int,team:chararray,loc:chararray,year:int);
record2 = GROUP record1 by team;
record3 = FOREACH record2 GENERATE group,MAX(record1.runs) as mx;
record4 = JOIN record3 by (mx,group) LEFT OUTER, record1 by (runs,team);
record5 = FOREACH record4 GENERATE record1::name as name, record1::team as team, record3::mx as mx, record1::year as year;
record6= ORDER record5 by mx ASC;
DUMP record6;
produces the following result
(Kohli,CSK,0,2014)
(Sachin,PW,32,2014)
(sehwag,KXIP,150,1991)
(Dhoni,MI,150,1991)
(Dhoni,MI,150,2011)
Notice that there are two records for Dhoni; this is because he scored 150 twice. If you want to remove one of them, you need to choose the earliest or latest year, depending on what you want.
I would do it using the TOP function: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is the script to obtain the result you want:
record1= LOAD 'ipl.txt' using PigStorage(' ') as
(name:chararray,runs:int,team:chararray,loc:chararray,year:int);
record2 = GROUP record1 by team;
record3 = FOREACH record2 GENERATE FLATTEN(TOP(1,1,record1));
record4= ORDER record3 by runs ASC;
DUMP record4;
As a result, you will get:
(Kohli,0,CSK,Chennai,2014)
(Sachin,32,PW,Chennai,2014)
(sehwag,150,KXIP,Hyderabad,1991)
(Dhoni,150,MI,Hyderabad,1991)

Ruby code to extract data from irregular text with intelligence

I am trying to write a ruby code to extract data from specific location from irregular text content.
The following is the text content something I am looking at.
Address1 Address2
adress1, adress1, # 34 , adress1,
4th Floor, Plot # 14 & 15,
Drive,, HARIKA BHIMANI
Madhapur, Hyderabad - 500081 2-14-117/35-1 Nas
Andhra Pradesh AP
+(91)40-00000000
xyz@dabc.com
This is my weird text, and I want to extract Address1 and Address2 separately.
I thought I would try split, but I could not work out how to extract Address1 and Address2 separately, since both of them are on a single line. The gap between the Address1 and Address2 content will always be more than 2 spaces.
I am planning to parse each line and split the string in each line on a separator of more than one space. How do I split a string in Ruby on a separator of more than two spaces?
We can ignore the first 2 lines in the text above and start from the 3rd line. Basically, I want to separate out the left-side and right-side data; the separator is more than 2 spaces. I have edited the question with my sample code, but it fails if one of the lines in the left-side data is empty.
I have tried the following sample:
if !line.empty?
  splits = line.split(/ {2,}/)
  case splits.length
  when 2
    puts "Address1 " + splits[1]
  when 3
    puts "Address1 " + splits[1]
    puts "Address2 " + splits[2]
  else
  end
end
But it fails for the following sample
leftSideHasData rightSideHasData
OnlyRightSideHasData
How can I achieve this in Ruby? Does Ruby provide any APIs to do this with ease?
text = %{ Address1 Address2
adress1, adress1, # 34 , adress1,
4th Floor, Plot # 14 & 15,
Drive,, HARIKA BHIMANI
Madhapur, Hyderabad - 500081 2-14-117/35-1 Nas
Andhra Pradesh AP
+(91)40-00000000
xyz@dabc.com}

address1 = []
address2 = []
rows = text.split("\n").map { |row| row.split(/\s{2,}/) }
rows.each { |row| address1 << row[0]; address2 << row[1] }
address1
=> ["",
" adress1, adress1, # 34 , adress1, ",
" 4th Floor, Plot # 14 & 15, ",
" Drive,,",
" Madhapur, Hyderabad - 500081",
" Andhra Pradesh",
" +(91)40-00000000",
" xyz@dabc.com"]
address2
=> ["Address1", nil, nil, "HARIKA BHIMANI", "2-14-117/35-1 Nas", "AP", nil, nil]
You can remove nils with address2.compact

How to write some value to a text file in Ruby based on position

I need some help with a unique solution. I have a text file in which I have to replace some values based on position. This is not a big file, and it will always contain 5 lines with a fixed length at any given time. But I have to replace some text at specific positions only. Further, I can also put some text in the required position and replace that text with the required value every time. I am not sure how to implement this solution. I have given an example below.
Line 1 - 00000 This Is Me 12345 trying
Line 2 - 23456 This is line 2 987654
Line 3 - This is 345678 line 3 67890
Consider the above as the file in which I have to replace some values. For instance, in line 1 I have to replace '00000' with '11111', and in line 2 I have to replace 'This' with 'Line' or any required four-character text. The positions will always remain the same in the text file.
I have a solution which works, but it is for reading the file based on position, not for writing. Can someone please give a similar solution for writing as well, based on position?
Solution for reading the file based on position:
def read_var(file, line_nr, vbegin, vend)
  IO.readlines(file)[line_nr][vbegin..vend]
end

puts read_var("read_var_from_file.txt", 0, 1, 3) # line 0, beginning at 1, ending at 3
#=> 308
puts read_var("read_var_from_file.txt", 1, 3, 6)
#=> 8522
I have also tried this solution for writing. This works but I need it to work based on position or based on text present in the specific line.
Explored solution to write to file:
open(Dir.pwd + '/Files/Try.txt', 'w') { |f|
  f << "Four score\n"
  f << "and seven\n"
  f << "years ago\n"
}
I made you a working sample, anagraj.
in_file = "in.txt"
out_file = "out.txt"

=begin
=>contents of file in.txt
00000 This Is Me 12345 trying
23456 This is line 2 987654
This is 345678 line 3 67890
=end

def replace_in_file(in_file, out_file, shreds)
  File.open(out_file, "wb") do |file|
    File.read(in_file).each_line.with_index do |line, index|
      shreds.each do |shred|
        if shred[:index] == index
          line[shred[:begin]..shred[:end]] = shred[:replace]
        end
      end
      file << line
    end
  end
end

shreds = [
  { index: 0, begin: 0, end: 4, replace: "11111" },
  { index: 1, begin: 6, end: 9, replace: "Line" }
]

replace_in_file(in_file, out_file, shreds)

=begin
=>contents of file out.txt
11111 This Is Me 12345 trying
23456 Line is line 2 987654
This is 345678 line 3 67890
=end
