Code Golf 4th of July Edition: Counting Top Ten Occurring Words

Given the following list of presidents, do a top-ten word count in the smallest program possible:
INPUT FILE
Washington
Washington
Adams
Jefferson
Jefferson
Madison
Madison
Monroe
Monroe
John Quincy Adams
Jackson
Jackson
Van Buren
Harrison
DIES
Tyler
Polk
Taylor
DIES
Fillmore
Pierce
Buchanan
Lincoln
Lincoln
DIES
Johnson
Grant
Grant
Hayes
Garfield
DIES
Arthur
Cleveland
Harrison
Cleveland
McKinley
McKinley
DIES
Teddy Roosevelt
Teddy Roosevelt
Taft
Wilson
Wilson
Harding
Coolidge
Hoover
FDR
FDR
FDR
FDR
Dies
Truman
Truman
Eisenhower
Eisenhower
Kennedy
DIES
Johnson
Johnson
Nixon
Nixon
ABDICATES
Ford
Carter
Reagan
Reagan
Bush
Clinton
Clinton
Bush
Bush
Obama
To start it off, in bash, at 97 characters:
cat input.txt | tr " " "\n" | tr -d "\t " | sed 's/^$//g' | sort | uniq -c | sort -n | tail -n 10
Output:
2 Nixon
2 Reagan
2 Roosevelt
2 Truman
2 Washington
2 Wilson
3 Bush
3 Johnson
4 FDR
7 DIES
Break ties as you see fit! Happy fourth!
For those of you who care, more information on the presidents can be found here.
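For readers newer to these tools, here is the same starting pipeline with one stage per line (the annotations are mine; behaviour is unchanged):

cat input.txt |    # read the file (input redirection would be shorter)
tr " " "\n" |      # put each word on its own line
tr -d "\t " |      # delete any stray tabs and spaces
sed 's/^$//g' |    # a no-op as written (replaces empty with empty); presumably meant to drop blank lines
sort |             # group identical words together
uniq -c |          # collapse each group, prefixing the word with its count
sort -n |          # order by count, ascending
tail -n 10         # keep the ten most frequent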

C#, 153:
Reads in the file at p and prints results to the console:
File.ReadLines(p)
.SelectMany(s=>s.Split(' '))
.GroupBy(w=>w)
.OrderBy(g=>-g.Count())
.Take(10)
.ToList()
.ForEach(g=>Console.WriteLine(g.Count()+"|"+g.Key));
If merely producing the list but not printing to the console, it's 93 characters.
6|DIES
4|FDR
3|Johnson
3|Bush
2|Washington
2|Adams
2|Jefferson
2|Madison
2|Monroe
2|Jackson

A shorter shell version:
xargs -n1 < input.txt | sort | uniq -c | sort -nr | head
If you want case insensitive ranking, change uniq -c into uniq -ci.
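For example (a sketch, not from the original answer): note that uniq -i only merges adjacent lines, so the preceding sort should fold case too, via -f:

xargs -n1 < input.txt | sort -f | uniq -ci | sort -nr | head

With the sample input this would merge Dies and DIES into a single entry with a count of 7.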
Slightly shorter still, if you're happy with the rank being reversed and readability impaired by the lack of spaces; this clocks in at 46 characters:
xargs -n1<input.txt|sort|uniq -c|sort -n|tail
(You could strip this down to 38 if you were allowed to rename the input file to simply "i" first.)
Observing that, in this special case, no word occurs more than 9 times, we can shave off 3 more characters by dropping the '-n' argument from the final sort:
xargs -n1<input.txt|sort|uniq -c|sort|tail
That takes this solution down to 43 characters without renaming the input file. (Or 35, if you do.)
Using xargs -n1 to split the file into one word per line is preferable to the tr \ \\n solution, as the latter creates lots of blank lines. That makes the tr-based solution incorrect: it misses Nixon from the top ten and shows a blank string occurring 256 times, and a blank string is not a "word".
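To see the blank-line problem concretely (a quick check, assuming a line with a run of spaces, as the real file apparently had):

$ printf 'Van  Buren\n' | tr ' ' '\n'
Van

Buren
$ printf 'Van  Buren\n' | xargs -n1
Van
Buren

The empty line in the first output is exactly what sort | uniq -c ends up counting as a blank "word".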

vim 60
:1,$!tr " " "\n"|tr -d "\t "|sort|uniq -c|sort -n|tail -n 10

Vim 36
:%s/\W/\r/g|%!sort|uniq -c|sort|tail

Haskell, 102 characters (wow, so close to matching the original):
import List
(take 10.map snd.sort.map(\(x:y)->(-length y,x)).group.sort.words)`fmap`readFile"input.txt"
J, only 55 characters:
10{.\:~~.(,.~[:<"0#(+/)=/~);;:&.><;._2[1!:1<'input.txt'
(I've yet to figure out how to elegantly perform text manipulations in J... it's much better at array-structured data.)
NB. read the file
<1!:1<'input.txt'
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------...
| Washington Washington Adams Jefferson Jefferson Madison Madison Monroe Monroe John Quincy Adams Jackson Jackson Van Buren Harrison DIES Tyler Polk Taylor DIES Fillmore Pierce ...
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------...
NB. split into lines
<;._2[1!:1<'input.txt'
+--------------+--------------+---------+-------------+-------------+-----------+-----------+----------+----------+---------------------+-----------+-----------+-------------+-----------------+---------+--------+---------------+------------+----------+----...
| Washington| Washington| Adams| Jefferson| Jefferson| Madison| Madison| Monroe| Monroe| John Quincy Adams| Jackson| Jackson| Van Buren| Harrison DIES| Tyler| Polk| Taylor DIES| Fillmore| Pierce| ...
+--------------+--------------+---------+-------------+-------------+-----------+-----------+----------+----------+---------------------+-----------+-----------+-------------+-----------------+---------+--------+---------------+------------+----------+----...
NB. split into words
;;:&.><;._2[1!:1<'input.txt'
+----------+----------+-----+---------+---------+-------+-------+------+------+----+------+-----+-------+-------+---+-----+--------+----+-----+----+------+----+--------+------+--------+-------+-------+----+-------+-----+-----+-----+--------+----+------+---...
|Washington|Washington|Adams|Jefferson|Jefferson|Madison|Madison|Monroe|Monroe|John|Quincy|Adams|Jackson|Jackson|Van|Buren|Harrison|DIES|Tyler|Polk|Taylor|DIES|Fillmore|Pierce|Buchanan|Lincoln|Lincoln|DIES|Johnson|Grant|Grant|Hayes|Garfield|DIES|Arthur|Cle...
+----------+----------+-----+---------+---------+-------+-------+------+------+----+------+-----+-------+-------+---+-----+--------+----+-----+----+------+----+--------+------+--------+-------+-------+----+-------+-----+-----+-----+--------+----+------+---...
NB. count repetitions
|:~.(,.~[:<"0#(+/)=/~);;:&.><;._2[1!:1<'input.txt'
+----------+-----+---------+-------+------+----+------+-------+---+-----+--------+----+-----+----+------+--------+------+--------+-------+-------+-----+-----+--------+------+---------+--------+---------+----+------+-------+--------+------+---+------+------...
|2 |2 |2 |2 |2 |1 |1 |2 |1 |1 |2 |6 |1 |1 |1 |1 |1 |1 |2 |3 |2 |1 |1 |1 |2 |2 |2 |1 |2 |1 |1 |1 |4 |2 |2 ...
+----------+-----+---------+-------+------+----+------+-------+---+-----+--------+----+-----+----+------+--------+------+--------+-------+-------+-----+-----+--------+------+---------+--------+---------+----+------+-------+--------+------+---+------+------...
|Washington|Adams|Jefferson|Madison|Monroe|John|Quincy|Jackson|Van|Buren|Harrison|DIES|Tyler|Polk|Taylor|Fillmore|Pierce|Buchanan|Lincoln|Johnson|Grant|Hayes|Garfield|Arthur|Cleveland|McKinley|Roosevelt|Taft|Wilson|Harding|Coolidge|Hoover|FDR|Truman|Eisenh...
+----------+-----+---------+-------+------+----+------+-------+---+-----+--------+----+-----+----+------+--------+------+--------+-------+-------+-----+-----+--------+------+---------+--------+---------+----+------+-------+--------+------+---+------+------...
NB. sort
|:\:~~.(,.~[:<"0#(+/)=/~);;:&.><;._2[1!:1<'input.txt'
+----+---+-------+----+------+----------+------+---------+------+-----+------+--------+-------+-------+---------+-------+--------+-----+----------+-------+---------+-----+---+-----+------+----+------+----+------+-----+-------+----+------+-----+-------+----...
|6 |4 |3 |3 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |2 |1 |1 |1 |1 |1 |1 |1 |1 |1 |1 |1 |1 |1 |1 ...
+----+---+-------+----+------+----------+------+---------+------+-----+------+--------+-------+-------+---------+-------+--------+-----+----------+-------+---------+-----+---+-----+------+----+------+----+------+-----+-------+----+------+-----+-------+----...
|DIES|FDR|Johnson|Bush|Wilson|Washington|Truman|Roosevelt|Reagan|Nixon|Monroe|McKinley|Madison|Lincoln|Jefferson|Jackson|Harrison|Grant|Eisenhower|Clinton|Cleveland|Adams|Van|Tyler|Taylor|Taft|Quincy|Polk|Pierce|Obama|Kennedy|John|Hoover|Hayes|Harding|Garf...
+----+---+-------+----+------+----------+------+---------+------+-----+------+--------+-------+-------+---------+-------+--------+-----+----------+-------+---------+-----+---+-----+------+----+------+----+------+-----+-------+----+------+-----+-------+----...
NB. take 10
10{.\:~~.(,.~[:<"0#(+/)=/~);;:&.><;._2[1!:1<'input.txt'
+-+----------+
|6|DIES |
+-+----------+
|4|FDR |
+-+----------+
|3|Johnson |
+-+----------+
|3|Bush |
+-+----------+
|2|Wilson |
+-+----------+
|2|Washington|
+-+----------+
|2|Truman |
+-+----------+
|2|Roosevelt |
+-+----------+
|2|Reagan |
+-+----------+
|2|Nixon |
+-+----------+

Perl: 114 (including perl, command-line switches, single quotes and the filename; the script body alone is 90)
perl -nle'$h{$_}++for split/ /;END{$i++<=10?print"$h{$_} $_":0for reverse sort{$h{$a}cmp$h{$b}}keys%h}' input.txt

The lack of AWK is disturbing.
xargs -n1<input.txt|awk '{c[$1]++}END{for(p in c)print c[p],p|"sort|tail"}'
75 characters.
If you want to get a bit more AWKy, you can forget xargs:
awk -v RS='[^a-zA-Z]' /./'{c[$1]++}END{for(p in c)print c[p],p|"sort|tail"}' input.txt
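To see what the regex record separator does (a sketch; note that a regex RS is a GNU awk extension, not POSIX):

$ printf 'John Quincy Adams\nFDR\n' | awk -v RS='[^a-zA-Z]' '/./{print}'
John
Quincy
Adams
FDR

Every non-letter character ends a record, and the /./ pattern skips the empty records that consecutive separators would otherwise produce.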

My best try with ruby so far, 166 chars:
h = Hash.new
File.open('f.l').each_line{|l|l.split(/ /).each{|e|h[e]==nil ?h[e]=1:h[e]+=1}}
h.sort{|a,b|a[1]<=>b[1]}.last(10).each{|e|puts"#{e[1]} #{e[0]}"}
I am surprised that no one has posted a crazy J solution yet.

Here's a compressed version of the shell script, observing that, for a reasonable interpretation of the input data (no leading or trailing blanks), the second 'tr' and the 'sed' command in the original do not change the data (verified by inserting 'tee out.N' at suitable points and checking that the output file sizes are identical). The shell needs fewer spaces than humans do, and using cat instead of input I/O redirection wastes space.
tr \ \\n<input.txt|sort|uniq -c|sort -n|tail -10
This weighs in at 50 characters including newline at end of script.
With two more observations (pulled from other people's answers):
tail on its own is equivalent to 'tail -10', and
in this case, numeric and alpha sorting are equivalent,
this can be shrunk by a further 7 characters (to 43 including trailing newline):
tr \ \\n<input.txt|sort|uniq -c|sort|tail
Using 'xargs -n1' (with no command prefix given) instead of 'tr' is extremely clever; it deals with leading, trailing and multiple embedded spaces (which this solution does not).

vim 38 and works for all input
:%!xargs -n1|sort|uniq -c|sort -n|tail

Python 2.6, 104 chars:
l=open("input.txt").read().split()
for c,n in sorted(set((l.count(w),w) for w in l if w))[-10:]:print c,n

Python 3.1 (88 chars)
import collections
collections.Counter(open('input.txt').read().split()).most_common(10)

Perl 86 characters
94, if you count the input filename.
perl -anE'$_{$_}++for@F;END{say"$_{$_} $_"for@{[sort{$_{$b}<=>$_{$a}}keys%_]}[0..10]}' test.in
If you don't care how many results you get, then it's only 75, excluding the filename.
perl -anE'$_{$_}++for@F;END{say"$_{$_} $_"for sort{$_{$b}<=>$_{$a}}keys%_}' test.in

Ruby 66B
puts (a=$<.read.split).uniq.map{|x|"#{a.count x} "+x}.sort.last 10

Ruby
115 chars
w = File.read($*[0]).split
w.uniq.map{|x| [w.select{|y|x==y}.size,x]}.sort.last(10).each{|z| puts "#{z[1]} #{z[0]}"}

Windows Batch File
This is obviously not the smallest solution, but I decided to post it anyway, just for fun. :) NB: the batch file uses a temporary file named $ for storing temporary results.
Original uncompressed version with comments:
@echo off
setlocal enableextensions enabledelayedexpansion
set infile=%1
set cnt=%2
set tmpfile=$
set knownwords=
rem Calculate word count
for /f "tokens=*" %%i in (%infile%) do (
for %%w in (%%i) do (
rem If the word hasn't already been processed, ...
echo !knownwords! | findstr "\<%%w\>" > nul
if errorlevel 1 (
rem Count the number of the word's occurrences and save it to a temp file
for /f %%n in ('findstr "\<%%w\>" %infile% ^| find /v "" /c') do (
echo %%n^|%%w >> %tmpfile%
)
rem Then add the word to the known words list
set knownwords=!knownwords! %%w
)
)
)
rem Print top 10 word count
for /f %%i in ('sort /r %tmpfile%') do (
echo %%i
set /a cnt-=1
if !cnt!==0 goto end
)
:end
del %tmpfile%
Compressed & obfuscated version, 317 characters:
@echo off&setlocal enableextensions enabledelayedexpansion&set n=%2&set l=
for /f "tokens=*" %%i in (%1)do for %%w in (%%i)do echo !l!|findstr "\<%%w\>">nul||for /f %%n in ('findstr "\<%%w\>" %1^|find /v "" /c')do echo %%n^|%%w>>$&set l=!l! %%w
for /f %%i in ('sort /r $')do echo %%i&set /a n-=1&if !n!==0 del $&exit /b
This can be shortened to 258 characters if echo is already off and command extensions and delayed variable expansion are on:
set n=%2&set l=
for /f "tokens=*" %%i in (%1)do for %%w in (%%i)do echo !l!|findstr "\<%%w\>">nul||for /f %%n in ('findstr "\<%%w\>" %1^|find /v "" /c')do echo %%n^|%%w>>$&set l=!l! %%w
for /f %%i in ('sort /r $')do echo %%i&set /a n-=1&if !n!==0 del $&exit /b
Usage:
> filename.bat input.txt 10 & pause
Output:
6|DIES
4|FDR
3|Johnson
3|Bush
2|Wilson
2|Washington
2|Truman
2|Roosevelt
2|Reagan
2|Nixon

Related

Inconsistency in output field separator

We have to find the difference (d) between the last 2 numbers and display the rows with the highest values of d, in ascending order.
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see, there is a space before and after the $ sign, but for the last column (avg) there is no space. Please explain why it's happening.
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not mentioned | as the output field separator, but it still appears. Why is this happening? And why is the difference zero?
I am just 6 days old in Unix; please answer even if it's easy.
Your field separator is only the pipe symbol, so the surrounding whitespace is part of each field, and that's what you see in the output. When combined with other characters, the pipe has its regex special meaning (alternation) and needs to be escaped. In your second case, FS=" | " means "space, or space" is the field separator.
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
A slight rewrite eliminates the dependency on the number of fields and fixes the format.
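To see why the escaped, blank-absorbing separator fixes the padding, a quick check:

$ echo '3 | Sethu | First | DAV | 86 | 98' | awk 'BEGIN{FS=" *\\| *"; OFS="$"}{$1=$1; print}'
3$Sethu$First$DAV$86$98

The $1=$1 assignment forces awk to rebuild the line with the new output separator.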

How to Add 4 Blank Columns to a pipe delimited CSV via Command Line

I am on a Windows machine.
I have a CSV file, shown below, that uses pipe as the delimiter:
Column 1 | Column 2 | Column 3
1 | 2 | 3
1 | 2 | 3
And I need to add 4 blank columns to make it look like:
Column 1 | Column 2 | Column 3 ||||
1 | 2 | 3 ||||
1 | 2 | 3 ||||
This works fine when the delimiter is a comma, but I can't figure out what to do for the pipe.
@echo off
for /f "delims=" %%a in ('type "Test.csv"') do (
>>"fileout.csv" echo.%%a,,,,
)
My expected output is as follows
Column 1 | Column 2 | Column 3 ||||
1 | 2 | 3 ||||
1 | 2 | 3 ||||
The escape character for batch scripts is the caret - you can use your existing code, just add a caret before each pipe:
@echo off
for /f "delims=" %%a in ('type "Test.csv"') do (
>>"fileout.csv" echo.%%a^|^|^|^|
)

unix 'sort' command for inline characters

I have a .txt file of pumpkin sizes that I'm trying to sort by size of pumpkin:
name |size
==========
Joe |5
Mary |10
Bill |2
Jill |1
Adam |20
Mar |5
Roe |10
Mir |3
Foo |9
Bar |12
Baz |0
Currently I'm having great difficulty in getting sort to work properly. Can anyone help me sort my list by pumpkin size without modifying the list structure?
The table headings need special consideration, since "sorting" them would move them to some random line. So we use a two-step process:
a) output the table headings;
b) sort the rest numerically (-n), in reverse order (-r), with field separator | (-t), starting at field 2 (-k).
$ awk 'NR<=2' in; awk 'NR>2' in | sort -t '|' -nr -k 2
name |size
==========
Adam |20
Bar |12
Roe |10
Mary |10
Foo |9
Mar |5
Joe |5
Mir |3
Bill |2
Jill |1
Baz |0
The key point is the option -k of sort. You can use man sort to see how it works. The solution for your problem follows:
sed -n '3,$p' YOUR_FILENAME| sort -hrt '|' -k 2
You can simply remove the
name |size
==========
using the sed command. Then whatever is left can be sorted using the sort command.
sed '1,2d' txt | sort -t "|" -k 2 -n
Here, sed '1,2d' will remove the first 2 lines.
Then sort will tokenize the data on the character '|' using the -t option.
Since you want to sort based on size, which happens to be the second token, specify it with sort's -k 2 option.
Finally, since "size" is a number, add sort's -n option.
You can do this in the shell:
{ read; echo "$REPLY"; read; echo "$REPLY"; sort -t'|' -k2n; } < pumpkins.txt
That reads and prints the first 2 header lines, then sorts the rest.

How to insert a different delimiter in between two columns in shell

I've a file as below:
ABc def|0|0|0| 1 | 2| 9|
0 2930|0|0|0|0| 1 | 2| 9|
Now, I want to split the first column with the same delimiter.
output:
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
Please help me out with awk.
You can use sed for this:
$ sed 's/ /|/' file
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
The way it is defined, it just replaces the first space with a |, which is exactly what you need.
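The difference from the global form is easy to check:

$ echo 'a b c' | sed 's/ /|/'
a|b c
$ echo 'a b c' | sed 's/ /|/g'
a|b|c

Without the g flag, the substitution stops after the first match on each line.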
With awk it is a bit longer:
$ awk 'BEGIN{FS=OFS="|"}{split($1, a, " "); $1=a[1]"|"a[2]}1' file
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
After defining the input and output field separators as |, it splits the first field on the space, then prints the line back.
Another awk
awk '{sub(/ /,"|")}1' file
ABc|def|0|0|0| 1 | 2| 9|
0|2930|0|0|0|0| 1 | 2| 9|
As long as the line has no leading space, this works fine.
You said you want to replace the delimiter (space -> pipe) in the first column only. It could happen that the first column contains no space while later columns do; in that case the line should not be changed at all. The first column could also contain several spaces, and I guess you'd want all of them replaced. So I cannot think of a shorter way to do this:
awk -F'|' -v OFS="|" '{gsub(/ /,"|",$1)}7' file
sed 's/^[[:blank:]]\{1,\}/ /;/^\([^|]\{1,\}\)[[:blank:]]\{1,\}\([^|[[:blank:]]\)/ s//\1|\2/'
assuming an empty first column shows up as a blank; leading blanks are collapsed to one, then the first field is one or more non-| characters, one or more blanks as the separator, and then another non-blank, non-| character.
This allows for input like:
ABc def|0|0|0| 1 | 2| 9|
def|0|0|0| 1 | 2| 9|
ABc|def|0|0|0| 1 | 2| 9|

detecting "duplicate" entries in a tab separated file using bash & commands

I have a tab-separated text file I need to check for duplicates. The layout looks roughly like so. (The first entries in the file are the column names.)
Sample input file:
+--------+-----------+--------+------------+-------------+----------+
| First | Last | BookID | Title | PublisherID | AuthorID |
+--------+-----------+--------+------------+-------------+----------+
| James | Joyce | 37 | Ulysses | 344 | 1022 |
| Ernest | Hemingway | 733 | Old Man... | 887 | 387 |
| James | Joyce | 872 | Dubliners | 405 | 1022 |
| Name1 | Surname1 | 1 | Title1 | 1 | 1 |
| James | Joyce | 37 | Ulysses | 345 | 1022 |
| Name1 | Surname1 | 1 | Title1 | 2 | 1 |
+--------+-----------+--------+------------+-------------+----------+
The file can hold up to 500k rows. What we're after is checking that there are no duplicates of the BookID and AuthorID values. So for instance, in the table above there can be no two rows with a BookID of 37 and AuthorID 1022.
It's likely, but not guaranteed, that the author will be grouped on consecutive lines. If it isn't, and it's too tricky to check, I can live with that. But otherwise, if the author is the same, we need to know if a duplicate BookID is there.
One complication: we can have duplicate BookIDs in the file, but it's the combination of AuthorID + BookID that is not allowed.
Is there a good way of checking this in a bash script, perhaps some combo of sed and awk or another means of accomplishing this?
Raw tab-separated file contents for scripting:
First Last BookID Title PublisherID AuthorID
James Joyce 37 Ulysses 344 1022
Ernest Hemingway 733 Old Man... 887 387
James Joyce 872 Dubliners 405 1022
Name1 Surname1 1 Title1 1 1
James Joyce 37 Ulysses 345 1022
Name1 Surname1 1 Title1 2 1
If you want to find and count the duplicates you can use
awk '{c[$3 " " $6]+=1} END { for (k in c) if (c[k] > 1) print k "->" c[k]}'
which saves the combination counts in an associative array and then prints any count greater than 1.
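One caveat (my note, not the answerer's): with awk's default whitespace splitting, the space inside a title like "Old Man..." shifts the column numbers, so pass the tab separator explicitly when running it against the raw file:

awk -F'\t' '{c[$3 " " $6]+=1} END { for (k in c) if (c[k] > 1) print k "->" c[k]}' input_file.txt

For the sample data this should print 1 1->2 and 37 1022->2, in unspecified order.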
From the question: a tab-separated text file, where the task "is checking that there are no duplicates of the BookID and AuthorID values". And from @piotr.wittchen's answer, the columns look like this:
First Last BookID Title PublisherID AuthorID
That's simple:
extract the BookID and AuthorID columns
sort
check for duplicates
cut -f3,6 input_file.txt | sort | uniq -d
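Against the raw sample above, this should print the two duplicated pairs (tab-separated):

$ cut -f3,6 input_file.txt | sort | uniq -d
1	1
37	1022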
If you gotta have the whole lines, we have to reorder the fields a bit for uniq to eat them:
awk '{print $1,$2,$4,$5,$3,$6}' input_file.txt | sort -k5 -k6 | uniq -d -f4
If you gotta have them in the initial order, you can number the lines, get the duplicates and re-sort them with the line numbers and then remove the line numbers, like so:
nl -w1 input_file.txt |
awk '{print $1,$2,$3,$5,$6,$4,$7}' | sort -k6 -k7 | uniq -d -f5 |
sort -k1 | cut -f2-
This is pretty easy with awk:
$ awk 'BEGIN { FS = "\t" }
($3,$6) in seen { printf("Line %d is a duplicate of line %d\n", NR, seen[$3,$6]); next }
{ seen[$3,$6] = NR }' input.tsv
It saves each bookid, authorid pair in a hash table and warns if that pair already exists.
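For the raw file shown earlier (header on line 1), the expected report would be:

Line 6 is a duplicate of line 2
Line 7 is a duplicate of line 5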
As @Cyrus already said in the comments, your question is not really clear, but it looks interesting, so I attempted to understand it and provide a solution given a few assumptions.
Assuming we have the following records.txt file:
First Last BookID Title PublisherID AuthorID
James Joyce 37 Ulysses 344 1022
Ernest Hemingway 733 Old Man... 887 387
James Joyce 872 Dubliners 405 1022
Name1 Surname1 1 Title1 1 1
James Joyce 37 Ulysses 345 1022
Name1 Surname1 1 Title1 2 1
we are going to remove lines which have duplicated BookID (column 3) and AuthorID (column 6) values at the same time. We assume that First, Last and Title are then also the same, so we don't have to take them into consideration, and PublisherID may be different or the same (it doesn't matter). The location of the records in the file doesn't matter (duplicated lines don't have to be grouped together).
Having these assumptions in mind, expected output for the input provided above will be as follows:
Ernest Hemingway 733 Old Man... 887 387
James Joyce 872 Dubliners 405 1022
James Joyce 37 Ulysses 344 1022
Name1 Surname1 1 Title1 1 1
Duplicated records of the same book by the same author were removed.
Here's my solution for this problem in Bash
#!/usr/bin/env bash
file_name="records.txt"
repeated_books_and_authors_ids=($(cat $file_name | awk '{print $3$6}' | sort | uniq -d))
for i in "${repeated_books_and_authors_ids[@]}"
do
awk_statment_exclude="$awk_statment_exclude\$3\$6 != $i && "
awk_statment_include="$awk_statment_include\$3\$6 ~ $i || "
done
awk_statment_exclude=${awk_statment_exclude::-3}
awk_statment_exclude="awk '$awk_statment_exclude {print \$0}'"
not_repeated_records="cat $file_name | $awk_statment_exclude | sed '1d'"
eval $not_repeated_records
awk_statment_include=${awk_statment_include::-3}
awk_statment_include="awk '$awk_statment_include {print \$0}'"
repeated_records_without_duplicates="cat $file_name | $awk_statment_include | sort | awk 'NR % 2 != 0'"
eval $repeated_records_without_duplicates
It's probably not the best possible solution, but it works.
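For comparison (a sketch under the same assumptions, not part of the original answer), the keep-first-occurrence half of this fits in a single awk idiom:

# print a line only the first time its (BookID, AuthorID) pair is seen; NR>1 skips the header
awk -F'\t' 'NR>1 && !seen[$3,$6]++' records.txt

For this input it yields the same set of rows as the script above, modulo ordering.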
