detecting "duplicate" entries in a tab separated file using bash & commands - bash

I have a tab-separated text file I need to check for duplicates. The layout looks roughly like so. (The first entries in the file are the column names.)
Sample input file:
+--------+-----------+--------+------------+-------------+----------+
| First | Last | BookID | Title | PublisherID | AuthorID |
+--------+-----------+--------+------------+-------------+----------+
| James | Joyce | 37 | Ulysses | 344 | 1022 |
| Ernest | Hemingway | 733 | Old Man... | 887 | 387 |
| James | Joyce | 872 | Dubliners | 405 | 1022 |
| Name1 | Surname1 | 1 | Title1 | 1 | 1 |
| James | Joyce | 37 | Ulysses | 345 | 1022 |
| Name1 | Surname1 | 1 | Title1 | 2 | 1 |
+--------+-----------+--------+------------+-------------+----------+
The file can hold up to 500k rows. What we're after is checking that there are no duplicates of the BookID and AuthorID values. So for instance, in the table above there can be no two rows with a BookID of 37 and AuthorID 1022.
It's likely, but not guaranteed, that the author will be grouped on consecutive lines. If it isn't, and it's too tricky to check, I can live with that. But otherwise, if the author is the same, we need to know if a duplicate BookID is there.
One complication: we can have duplicate BookIDs in the file; it's the combination of AuthorID + BookID that is not allowed.
Is there a good way of checking this in a bash script, perhaps some combo of sed and awk or another means of accomplishing this?
Raw tab-separated file contents for scripting:
First Last BookID Title PublisherID AuthorID
James Joyce 37 Ulysses 344 1022
Ernest Hemingway 733 Old Man... 887 387
James Joyce 872 Dubliners 405 1022
Name1 Surname1 1 Title1 1 1
James Joyce 37 Ulysses 345 1022
Name1 Surname1 1 Title1 2 1

If you want to find and count the duplicates, you can use
awk '{c[$3 " " $6]+=1} END { for (k in c) if (c[k] > 1) print k "->" c[k]}'
which saves the count of each combination in an associative array and then prints the counts that are greater than 1.
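Since the file is tab-separated and titles can contain spaces, a hedged way to run this is to split on tabs and skip the header row (the filename input_file.txt is an assumption):
awk -F'\t' 'NR > 1 { c[$3 " " $6]+=1 } END { for (k in c) if (c[k] > 1) print k "->" c[k] }' input_file.txt
For the sample data this prints 37 1022->2 and 1 1->2.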

tab-separated text file
is checking that there are no duplicates of the BookID and AuthorID values
And from @piotr.wittchen's answer, the columns look like this:
First Last BookID Title PublisherID AuthorID
That's simple:
extract BookID AuthorID columns
sort
check for duplicates
cut -f3,6 input_file.txt | sort | uniq -d
If you gotta have the whole lines, we have to reorder the fields a bit for uniq to eat them:
awk '{print $1,$2,$4,$5,$3,$6}' input_file.txt | sort -k5 -k6 | uniq -d -f4
If you gotta have them in the initial order, you can number the lines, get the duplicates and re-sort them with the line numbers and then remove the line numbers, like so:
nl -w1 input_file.txt |
awk '{print $1,$2,$3,$5,$6,$4,$7}' | sort -k6 -k7 | uniq -d -f5 |
sort -k1n | cut -d' ' -f2-

This is pretty easy with awk:
$ awk 'BEGIN { FS = "\t" }
($3,$6) in seen { printf("Line %d is a duplicate of line %d\n", NR, seen[$3,$6]); next }
{ seen[$3,$6] = NR }' input.tsv
It saves each bookid, authorid pair in a hash table and warns if that pair already exists.
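Against the sample data above (with the header on line 1), this reports:
Line 6 is a duplicate of line 2
Line 7 is a duplicate of line 5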

As @Cyrus already said in the comments, your question is not really clear, but it looks interesting, so I attempted to understand it and provide a solution given a few assumptions.
Assuming we have the following records.txt file:
First Last BookID Title PublisherID AuthorID
James Joyce 37 Ulysses 344 1022
Ernest Hemingway 733 Old Man... 887 387
James Joyce 872 Dubliners 405 1022
Name1 Surname1 1 Title1 1 1
James Joyce 37 Ulysses 345 1022
Name1 Surname1 1 Title1 2 1
we are going to remove lines that have duplicated BookID (column 3) and AuthorID (column 6) values at the same time. We assume that First, Last and Title are also the same, so we don't have to take them into consideration, and PublisherID may be different or the same (it doesn't matter). The location of the records in the file doesn't matter either (duplicated lines don't have to be grouped together).
Having these assumptions in mind, expected output for the input provided above will be as follows:
Ernest Hemingway 733 Old Man... 887 387
James Joyce 872 Dubliners 405 1022
James Joyce 37 Ulysses 344 1022
Name1 Surname1 1 Title1 1 1
Duplicate records of the same book by the same author were removed, keeping one row per BookID + AuthorID combination (regardless of PublisherID).
Here's my solution for this problem in Bash
#!/usr/bin/env bash
file_name="records.txt"

# BookID+AuthorID combinations (columns 3 and 6 concatenated) that occur more than once
repeated_books_and_authors_ids=($(cat $file_name | awk '{print $3$6}' | sort | uniq -d))

# build awk conditions that exclude, respectively match, those combinations
for i in "${repeated_books_and_authors_ids[@]}"
do
  awk_statment_exclude="$awk_statment_exclude\$3\$6 != $i && "
  awk_statment_include="$awk_statment_include\$3\$6 ~ $i || "
done

# print the records that are not repeated at all (sed '1d' drops the header line)
awk_statment_exclude=${awk_statment_exclude::-3}
awk_statment_exclude="awk '$awk_statment_exclude {print \$0}'"
not_repeated_records="cat $file_name | $awk_statment_exclude | sed '1d'"
eval $not_repeated_records

# print every other of the sorted repeated records, i.e. one copy of each duplicated pair
awk_statment_include=${awk_statment_include::-3}
awk_statment_include="awk '$awk_statment_include {print \$0}'"
repeated_records_without_duplicates="cat $file_name | $awk_statment_include | sort | awk 'NR % 2 != 0'"
eval $repeated_records_without_duplicates
It's probably not the best possible solution, but it works.
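For comparison, a much shorter single-pass alternative (just a sketch, assuming whitespace-separated columns as above and that keeping the first occurrence of each BookID + AuthorID pair is acceptable; the row order will differ from the output above):
awk 'NR > 1 && !seen[$3,$6]++' records.txt
The NR > 1 drops the header line, and the seen array keeps only the first row for each BookID + AuthorID combination.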
Regards,
Piotr

Related

Inconsistency in output field separator

We have to find the difference (d) between the last two numbers and display the rows with the highest values of d in ascending order.
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see, there is a space before and after the $ sign, but for the last column (avg) there is no space. Please explain why this is happening.
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not mentioned | as the output field separator but it still appears. Why is this happening, and why is the difference zero?
I am just 6 days old in Unix, so please answer even if it's easy.
Your field separator is only the pipe symbol, so the surrounding whitespace is part of each field, and that's what you see in the output. When combined with other characters, the pipe has its regex special meaning (alternation) and needs to be escaped. In your second case, FS=" | " means "space OR space", i.e. a single space, is the field separator.
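To see the difference, compare how the two separators split one of the sample records:
$ echo '3 | Sethu | First | DAV | 86 | 98' | awk -F'|' '{print "[" $5 "]", "[" $6 "]"}'
[ 86 ] [ 98]
$ echo '3 | Sethu | First | DAV | 86 | 98' | awk -F' *\\| *' '{print "[" $5 "]", "[" $6 "]"}'
[86] [98]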
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
This slight rewrite eliminates the dependency on a fixed number of fields and fixes the formatting.
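If you prefer to keep the print structure of the original script, a minimal sketch along the same lines (the filename file is an assumption):
awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d = ($NF > $(NF-1)) ? $NF - $(NF-1) : $(NF-1) - $NF   # absolute difference of the last two columns
print $1,$2,$3,$4,$5,$6,d}' file | sort -t'$' -nk7 | tail -3
It hard-codes the six columns in the print, which is exactly the dependency the rewrite above avoids.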

Join two csv files if value is between interval in file 2

I have two CSV files that I need to join. F1 has millions of lines, F2 has thousands of lines. I need to join these files if the position in file F1 (F1.pos) is between F2.start and F2.end. Is there any way to do this in bash? I currently have Python code using pandas and sqlite3, and I am looking for something quicker.
Table F1 looks like:
| name | pos |
|------ |------ |
| a | 1020 |
| b | 1200 |
| c | 1800 |
Table F2 looks like:
| interval_name | start | end |
|--------------- |------- |------ |
| int1 | 990 | 1090 |
| int2 | 1100 | 1150 |
| int3 | 500 | 2000 |
Result should look like:
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int1 | 990 | 1090 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
DISCLAIMER: Use dedicated/local tools if available, this is hacking:
There is an apparent error in your desired output: name b should not match int1.
$ tail -n+1 *.csv
==> f1.csv <==
name,pos
a,1020
b,1200
c,1800
==> f2.csv <==
interval_name,start,end
int1,990,1090
int2,1100,1150
int3,500,2000
$ awk -F, -vOFS=, '
BEGIN {
print "name,pos,interval_name,start,end"
PROCINFO["sorted_in"]="#ind_num_asc"
}
FNR==1 {next}
NR==FNR {Int[$1] = $2 "," $3; next}
{
for(i in Int) {
split(Int[i], I)
if($2 >= I[1] && $2 <= I[2]) print $0, i, Int[i]
}
}
' f2.csv f1.csv
Outputs:
name,pos,interval_name,start,end
a,1020,int1,990,1090
a,1020,int3,500,2000
b,1200,int3,500,2000
c,1800,int3,500,2000
This is not particularly efficient in any way; the only sorting used is to ensure that the Int array is traversed in the correct order, which would need to change if your sample data is not indicative of the actual schema. I would be very interested to know how this solution performs vs pandas.
Here's one in awk. It hashes the smaller file's records into arrays, and for each record of the bigger file it iterates through the hashes, so it is slow:
$ awk '
NR==FNR { # hash f2 records
start[NR]=$4
end[NR]=$6
data[NR]=substr($0,2)
next
}
FNR<=2 { # mind the front matter
print $0 data[FNR]
}
{ # check if in range and output
for(i in start)
if($4>start[i] && $4<end[i])
print $0 data[i]
}' f2 f1
Output:
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
I doubt that a bash script would be faster than a python script. Just don't import the files into a database – write a custom join function instead!
The best way to join depends on your input data. If nearly all F1.pos are inside of nearly all intervals then a naive approach would be the fastest. The naive approach in bash would look like this:
#! /bin/bash
join --header -t, -j99 F1 F2 |
sed 's/^,//' |
awk -F, 'NR>1 && $2 >= $4 && $2 <= $5'
# NR>1 is only there to skip the column headers
Joining on the non-existent field 99 gives every F1 line the same (empty) join key as every F2 line, so join produces the full cross product; the sed strips the leading empty join field and the awk keeps only the rows where pos falls between start and end.
However, this will be very slow if there are only a few intersections, for instance, when the average F1.pos is only in 5 intervals. In this case the following approach will be way faster. Implement it in a programming language of your choice – bash is not appropriate for this:
Sort F1 by pos in ascending order.
Sort F2 by start and then by end in ascending order.
For each sorted file, keep a pointer to a line, starting at the first line.
Repeat until F1's pointer reaches the end:
For the current F1.pos advance F2's pointer until F1.pos ≥ F2.start.
Lock F2's pointer, but continue to read lines until F1.pos ≤ F2.end. Print the read lines in the output format name,pos,interval_name,start,end.
Advance F1's pointer by one line.
Only sorting the files could be actually faster in bash. Here is a script to sort both files.
#! /bin/bash
sort -t, -n -k2 F1-without-headers > F1-sorted
sort -t, -n -k2,2 -k3,3 F2-without-headers > F2-sorted
Consider using LC_ALL=C, -S N% and --parallel N to speed up the sorting process.
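For example (a sketch using GNU sort; the buffer size and thread count are arbitrary):
LC_ALL=C sort -t, -n -k2,2 -S 50% --parallel=4 F1-without-headers > F1-sorted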

Re: Transpose data using Linq

I have a table in the following format:
|BallotNo | City | CandidateNo | Votes
|Box1 | AA | Cand1 | 1200
|Box1 | AA | Cand2 | 1500
|Box2 | BB | Cand1 | 2500
|Box2 | BB | Cand2 | 3600
Using LINQ, I want to get a result in the format
|Box1 |AA |Cand1 |1200 |Cand2 |1500
|Box2 |BB |Cand1 |2500 |Cand2 |3600
Thanks
You are looking for a grouping option.
As I understand it, you need to group by the City column, which is pretty easy; see the http://msdn.microsoft.com/library/bb534492.aspx link on how to use the GroupBy extension method.

shell - grep - how to get only lines that contain a certain number of a character

good morning.
I have the following lines:
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
And I wanna get only the lines with 7 "|" and the same first field.
So the output for these two lines will be nothing, but for these two lines:
1 | blah | 2 | 1993 | 86 | 0 | NA | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
The output will be "error".
I'm getting the input from a file using the following command:
grep '.*|.*|.*|.*|.*|.*|.*|.*' < $1 | sort -nbsk1 | cut -d "|" -f1 | uniq -d |
while read line2; do
echo error
done
But this implementation would still print error even if I have more than 7 "|".
Any suggestions?
P.S. I can assume that there is a \n at the end of each line.
For printing lines containing exactly 7 |, try:
awk -F'|' 'NF == 8' filename
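If you also need the question's check that such lines share the same first field (printing error once, like the original loop), here is a minimal sketch building on this, assuming the data file is passed as the script's first argument and that the padding spaces around the field should be ignored:
awk -F'|' 'NF == 8 {              # only consider lines with exactly 7 pipes
    key = $1
    gsub(/[[:space:]]+/, "", key) # strip the padding around the first field
    if (seen[key]++) { print "error"; exit 1 }
}' "$1"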
If you want to use bash to count the number of | in a given line, try:
line="1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123";
count=${line//[^|]/};
echo ${#count};
With grep
grep '^\([^|]*|[^|]*\)\{7\}$'
Assuming zz.txt is:
$ cat zz.txt
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
$ cut -d\| -f1-8 zz.txt
The cut above will give you the output you need.
I would suggest that you use awk for this job.
BEGIN { FS = "|" }
NF == 8 && $1 == '1' { print $0}
would do the job (although the == and && could be = and & ; my awk is a bit rusty)

How to remove repeated columns using ruby FasterCSV

I'm using Ruby 1.8 and FasterCSV.
The csv file I'm reading in has several repeated columns.
| acct_id | amount | acct_num | color | acct_id | acct_type | acct_num |
| 345 | 12.34 | 123 | red | 345 | 'savings' | 123 |
| 678 | 11.34 | 432 | green | 678 | 'savings' | 432 |
...etc
I'd like to condense it to:
| acct_id | amount | acct_num | color | acct_type |
| 345 | 12.34 | 123 | red | 'savings' |
| 678 | 11.34 | 432 | green | 'savings' |
Is there a general purpose way to do this?
Currently my solution is something like:
headers = CSV.read_line(file)
CSV.read_line(file) # consume the garbage line between the headers and the data
FasterCSV.filter(file, :headers => headers) do |row|
row.delete(6) #delete second acct_num field
row.delete(4) #delete second acct_id field
# additional processing on the data
row['color'] = color_to_number(row['color'])
row['acct_type'] = acct_type_to_number(row['acct_type'])
end
Assuming you want to get rid of the hardcoded deletions, the lines
row.delete(6) #delete second acct_num field
row.delete(4) #delete second acct_id field
can be replaced by
row = row.to_hash
This will clobber duplicates. The rest of the posted code will keep working.
