sort | uniq | xargs grep ... where lines contain spaces - bash

I have a comma delimited file "myfile.csv" where the 5th column is a date/time stamp. (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots)
I'm using a bash shell via cygwin for WinXP
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows.
Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?

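sort -t, -k5,5 will sort on the 5th comma-separated field and avoid the cut;
uniq -f 4 will ignore the first 4 fields for the comparison;
Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one per group);
but uniq splits fields on blanks (spaces and tabs), not commas, so convert the commas with tr ',' '\t' first. This assumes the first four fields contain no blanks themselves.
Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w 16 (to include time), or -w 10 (for just dates), to the uniq. After skipping fields, the text uniq compares still starts with the separating tab, so skip that one character with -s 1.
So (the output is tab-delimited; pipe it through tr '\t' ',' to get commas back):
sort -t, -k5,5 myfile.csv | tr ',' '\t' | uniq -f 4 -s 1 -D -w 16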

The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:
tr '\n' '\000'
to get NUL-separated rows. sort, uniq and xargs all have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

You can tell xargs to use each line as an argument in its entirety using the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
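Another way to avoid the word-splitting problem is to skip xargs and let grep read the duplicate dates as a pattern list. This is only a sketch: the sample rows and the temporary file are made up to stand in for myfile.csv, and it assumes a grep that accepts -F and -f - (patterns from stdin). Note that it matches the date anywhere in the row, not only in column 5.

```shell
#!/usr/bin/env bash
# Sample data standing in for myfile.csv (hypothetical rows).
csv=$(mktemp)
cat > "$csv" <<'EOF'
a,b,c,d,01/01/2005 00:22,x
e,f,g,h,01/01/2005 00:37,y
i,j,k,l,01/01/2005 00:22,z
m,n,o,p,02/02/2005 10:00,w
EOF

# uniq -d prints each duplicated date once; grep -F -f - reads those
# dates from stdin as fixed-string patterns, one per line, spaces included.
dup_rows=$(cut -d, -f5 "$csv" | sort | uniq -d | grep -F -f - "$csv")
printf '%s\n' "$dup_rows"
rm -f "$csv"
```

Because each pattern is a whole line, the space inside the timestamp never needs quoting or escaping.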

This is a good candidate for awk:
BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
Set field separator to ',' (CSV).
Split fifth field on the space, stick result in A.
Concatenate the line number to the list of what we have already stored for that date.
Print out the line numbers for each date.
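If you want the duplicate rows themselves rather than line numbers, a two-pass awk is one option. This is a sketch under assumptions: the sample rows are invented, the file is read twice (so it must be a regular file, not a pipe), and it compares the whole fifth field including the time.

```shell
#!/usr/bin/env bash
# Sample data standing in for myfile.csv (hypothetical rows).
csv=$(mktemp)
cat > "$csv" <<'EOF'
a,b,c,d,01/01/2005 00:22,x
e,f,g,h,01/01/2005 00:37,y
i,j,k,l,01/01/2005 00:22,z
m,n,o,p,02/02/2005 10:00,w
EOF

# First pass (NR == FNR): count how often each column-5 value occurs.
# Second pass: print every row whose column-5 value occurred more than once.
dup_rows=$(awk -F, 'NR == FNR { seen[$5]++; next } seen[$5] > 1' "$csv" "$csv")
printf '%s\n' "$dup_rows"
rm -f "$csv"
```

The rows come out in their original file order, with no sort needed.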

Try escaping the spaces with sed:
echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv
(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
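The array approach mentioned in passing could look roughly like this. It is a sketch, not a tested recipe for your file: it uses mapfile (bash 4+) instead of setting IFS=$'\n', and the sample rows are invented.

```shell
#!/usr/bin/env bash
# Sample data standing in for myfile.csv (hypothetical rows).
csv=$(mktemp)
cat > "$csv" <<'EOF'
a,b,c,d,01/01/2005 00:22,x
e,f,g,h,01/01/2005 00:37,y
i,j,k,l,01/01/2005 00:22,z
m,n,o,p,01/01/2005 00:37,w
q,r,s,t,02/02/2005 10:00,v
EOF

# One array element per duplicated date; spaces stay inside the element.
mapfile -t dups < <(cut -d, -f5 "$csv" | sort | uniq -d)

# Quote the expansion so each date reaches grep as a single argument.
dup_rows=$(for d in "${dups[@]}"; do grep -F -- "$d" "$csv"; done)
printf '%s\n' "$dup_rows"
rm -f "$csv"
```

The output is grouped by date, since grep runs once per duplicated value.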

Related

How do I delete lines from my bash history matching a specific pattern?

I can get a list of the line numbers matching a specific pattern such as containing the word "function".
history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g'
If I pass that to history -d, it says bad pattern. I don't know whether that's because it's a list, or because the values are strings rather than numbers?
history -d (history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
Quick answer:
while read n; do history -d $n; done < <(history | tac | awk '/function/{print $1}')
Explanation:
The history command accepts only a single offset when using the -d flag. On top of that, when you delete an entry, it also renumbers all the commands after this entry. For this reason we reverse the output of history using tac and process the lines from last to first. The short awk line just replaces the grep and sed commands to pick up the history offset.
We do not use a full pipeline as this creates subshells and history -d $n would not work properly. This is nicely explained in: Why can't I delete multiple entries from bash history with this loop
Note: If you want to push this to your history file ($HISTFILE), you have to use history -w
Warning: When you have multiline commands in your history the story becomes very complicated and strongly depends on various options that have been set. See [U&L] When is a multiline history entry (aka lithist) in bash possible? for the nasty bits.
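To see the last-to-first deletion in isolation, here is a small sketch that builds a throwaway history list inside a script with history -s and then deletes the matching entries. The entries and the delete_matching helper are hypothetical, sort -rn is used in place of tac, and it assumes single-line entries only.

```shell
#!/usr/bin/env bash
# Keep the demo independent of any exported history settings.
unset HISTCONTROL HISTIGNORE HISTTIMEFORMAT
history -c
history -s 'ls -l'
history -s 'function foo { :; }'
history -s 'echo hello'
history -s 'declare -f function_bar'

# Hypothetical helper: delete every entry matching a pattern,
# highest offset first so renumbering cannot shift later targets.
delete_matching() {
    local pattern=$1 n
    while read -r n; do
        history -d "$n"
    done < <(history | awk -v pat="$pattern" '$0 ~ pat { print $1 }' | sort -rn)
}

delete_matching 'function'

# Strip the offsets so only the remaining commands are shown.
remaining=$(history | sed -E 's/^ *[0-9]+\*? *//')
printf '%s\n' "$remaining"
```

The loop reads from process substitution rather than a pipe, so history -d runs in the current shell and actually changes its history list.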
You can delete one history entry or a range of entries, but not a list. Your matches are likely to be spread out, so the range option is out.
The multiple sed commands to extract the history offsets can be simplified into one:
sed -E 's/^ *([0-9]*).*$/\1/'
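As a quick check of that expression, here it is run against a few made-up lines in the format history prints (the sample entries are assumptions, not real history):

```shell
#!/usr/bin/env bash
# Hypothetical history output: right-aligned offset, two spaces, command.
sample=$'    7  ls -l\n  741  function foo { :; }\n 1077  declare -f function_bar'

# Keep only the leading number of each matching line.
offsets=$(printf '%s\n' "$sample" | grep function | sed -E 's/^ *([0-9]*).*$/\1/')
printf '%s\n' "$offsets"
```

It copes with offsets of any width, unlike a fixed five-character cut.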
One problem with history is that it can have multiline entries, like:
741 source <(history | \
grep function | \
sed -E 's/^ *([0-9]*).*$/\1/' | \
sort -rn | \
xargs -n1 echo history -d)
If your grep matches on function above, your sed will not be able to extract the history offset number, so we need to make that possible. One way is to remove all newlines and put them back only in front of lines that start with a history offset. Here is one way to do it, though there is probably an easier one:
awk '/^ {0,4}[0-9]+/ {
printf("\n%s",$0);
}
!/^ {0,4}[0-9]+/{
printf(" %s",$0);
}
END{
printf("\n")
}'
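To make the effect concrete, here is a sketch that runs the same joining idea over a made-up history listing. The sample entries are hypothetical, and the offset pattern is written as / *[0-9]+ / anchored at line start, a simplification that avoids {0,4} for awks without interval support.

```shell
#!/usr/bin/env bash
# Hypothetical `history` output containing one multiline entry (741).
hist=$(mktemp)
cat > "$hist" <<'EOF'
  740  ls
  741  source <(history | \
grep function | \
xargs -n1 echo history -d)
  742  echo done
EOF

# Lines that start with an offset begin a new output line; any other
# line is a continuation and is appended to the current one.
joined=$(awk '
    /^ *[0-9]+ / { printf("\n%s", $0); next }
                 { printf(" %s", $0) }
    END          { printf("\n") }
' "$hist")

# Each entry is now a single line, so the offset is always field 1.
offset=$(printf '%s\n' "$joined" | awk '/function/ { print $1 }')
printf '%s\n' "$offset"
rm -f "$hist"
```

A continuation line that itself begins with a number would still be mistaken for a new entry, so treat this as a heuristic.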
We can then produce a number of history -d commands with xargs. xargs can't run the built-in history directly, so I've just used it to produce input for the built-in source, using process substitution:
source <(history | \
awk '/^ {0,4}[0-9]+/ {
printf("\n%s",$0);
}
!/^ {0,4}[0-9]+/{
printf(" %s",$0);
}
END{
printf("\n")
}' | \
grep function | \
sed -E 's/^ *([0-9]*).*$/\1/' | \
sort -rn | \
xargs -n1 echo history -d)
@kvantour gives nice alternatives to grep + sed + sort -rn. Using those, my above blob could be simplified into:
source <(history | \
awk '/^ {0,4}[0-9]+/ {
printf("\n%s",$0);
}
!/^ {0,4}[0-9]+/{
printf(" %s",$0);
}
END{
printf("\n")
}' | \
awk '/function/ {print "history -d",$1}' | \
tac)
You need to store the matching offset in a variable and then pass it to history.
$ history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g'
1077
$ var=$( history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
$ history -d $var
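However, as the pattern can occur many times, I would use a loop. Delete the highest offsets first (hence the sort -rn), because each deletion renumbers the entries after it:
$ var=$( history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g' | sort -rn)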
$ for i in $var
> do
> history -d $i
> history -w
> done
If the line you want to delete has already been written to your $HISTFILE (which typically happens when you end a session by default), you will need to write back to $HISTFILE, or the line will reappear when you open a new session.
After the deletion, reload your .bashrc by executing
$ cd
$ source .bashrc
However, there are cases where the lines won't be deleted: if you set PROMPT_COMMAND to history -a, each command is written to the history file as soon as it runs, rather than on exit as in the default configuration.

BASH Finding palindromes in a .txt file

I have been given a .txt file in which we have to find all the palindromes in the text (they must have at least 3 letters and can't consist of a single repeated letter, e.g. AAA).
The output should have the number of times each palindrome appears in the first column and the word in the second, e.g.
123 kayak
3 bob
1 dad
#!/bin/bash
tmp='mktemp'
awk '{for(x=1;$x;++x)print $x}' "${1}" | tr -d [[:punct:]] | tr -s [:space:] | sed -e 's/#//g' -e 's/[0-9]*//g'| sed -r '/^.{,2}$/d' | sort | uniq -c -i > tmp1
This outputs the file as it should do, ignoring case, words less than 3 letters, punctuation and digits.
However, I am now stumped on how to pull the palindromes out of this. I thought a temp file might be the way, I just don't know where to take it.
Any help or guidance is much appreciated.
# modify this to your needs; it should take your input on stdin, and return one word per
# line on stdout, in the same order if called more than once with the same input.
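preprocess() {
  # drop punctuation (including #) and digits, put one word on each line,
  # then delete words made of a single repeated letter (e.g. AAA)
  tr -d '[:punct:][:digit:]' \
    | tr -s '[:space:]' '\n' \
    | sed -E -e '/^(.)\1+$/d'
}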
paste <(preprocess <"$1") <(preprocess <"$1" | rev) \
| awk '$1 == $2 && (length($1) >= 3) { print $1 }' \
| sort | uniq -c
The critical thing here is to paste together your input file with a stream that has each line from that input file reversed. This gives you two separate columns you can compare.
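If you also want to ignore case, as in the question, one variant is to do the whole thing in awk and build the reversed word there instead of using paste and rev. This is a sketch under assumptions: the sample text is invented, only ASCII letters are handled, and the result is sorted by count and then word.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # assumption: plain ASCII input, byte-wise sorting

text=$(mktemp)
printf 'Kayak bob dad AAA kayak, level! 12 noon\n' > "$text"

palindromes=$(awk '
{
    for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)          # drop punctuation and digits
        n = length(w)
        if (n < 3) continue            # at least 3 letters
        r = ""; same = 1
        for (j = n; j >= 1; j--) {
            c = substr(w, j, 1)
            r = r c                    # build the reversed word
            if (c != substr(w, 1, 1)) same = 0
        }
        if (w == r && !same) count[w]++   # palindrome, not all one letter
    }
}
END { for (w in count) print count[w], w }
' "$text" | sort -k1,1nr -k2,2)

printf '%s\n' "$palindromes"
rm -f "$text"
```

Folding to lower case before counting means Kayak and kayak land in the same bucket.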

How to change the date format in a file in shell script?

I have a file in which are written dates in the following form YYYY-MM-DD. And I want to write a shell script to change the date format into DD/MM/YYYY.
This is my first attempt that doesn't work
#!/bin/bash
NR=$(cat $1 | grep -o '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' | cut -d'-' -f 1,1 | wc -l)
for (( i=1; i<=$NR; ++i))
do
Y=$(cat $1 | grep -o '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' | cut -d'-' -f 1,1 | head -1)
M=$(cat $1 | grep -o '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' | cut -d'-' -f 2,2 | head -1)
D=$(cat $1 | grep -o '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' | cut -d'-' -f 3,3 | head -1)
sed 's/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/"$D/$M/$Y"/' $1
done
I get the following error: "sed: -e expression #1, char 51: unknown option to `s'"
You don't need variables for parts of the strings. Sed already remembers parts of the match if you use \(...\) grouping.
You can't use / as literal and the substitution delimiter at the same time. Either backslash the literal slashes, or use a different delimiter:
sed 's/\([0-9]\{4\}\)-\([0-9][0-9]\)-\([0-9][0-9]\)/\3\/\2\/\1/g'
or
sed 's=\([0-9]\{4\}\)-\([0-9][0-9]\)-\([0-9][0-9]\)=\3/\2/\1=g'
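The same substitution reads a little easier with extended regular expressions. A small sketch, assuming a sed that supports -E (GNU and BSD sed both do) and using an invented input line:

```shell
#!/usr/bin/env bash
# Hypothetical input line containing two YYYY-MM-DD dates.
input='Released 2021-03-04, retired 1999-12-31.'

# Capture year, month and day, then emit them as DD/MM/YYYY.
# '#' is the delimiter, so the slashes need no escaping.
converted=$(printf '%s\n' "$input" \
    | sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#g')
printf '%s\n' "$converted"
```

To rewrite a file, pass the file name to sed instead of piping; GNU sed's -i edits it in place.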

pipe result of cut to next argument and concat to string

I have something like:
cut -d ' ' -f2 | xargs cat *VAR_HERE.ss
where I want to use the result of cut as a variable, inserted between the * and the . so that cat outputs the matching file. For example, if the result of cut is:
$ cut -d ' ' -f2
001
I essentially want the second command to do:
$ cat *001.txt
I have seen xargs being used, but I am struggling to find out how to refer to the output of cut explicitly, rather than having it appended as-is to the second command. Here I want to splice the output of cut into a string.
thanks
You can do:
cut -d ' ' -f2 | xargs -I {} bash -c 'cat *{}.txt'
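If you would rather not build a command string for bash -c, a while read loop does the same job and keeps the value quoted. This is only a sketch: the directory, the file names and the index file feeding cut are all made up for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical layout: two .txt files and an index whose 2nd column is the id.
workdir=$(mktemp -d)
printf 'first file\n'  > "$workdir/report001.txt"
printf 'second file\n' > "$workdir/report002.txt"
printf 'row 001 extra\nrow 002 extra\n' > "$workdir/index"

contents=$(
    cd "$workdir" || exit 1
    # Each id from cut is spliced into the glob; the quotes keep the id
    # literal while the leading * still expands.
    cut -d ' ' -f2 index | while IFS= read -r id; do
        cat -- *"$id".txt
    done
)
printf '%s\n' "$contents"
rm -rf "$workdir"
```

The shell expands the glob directly, so no second shell is started per line.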

Feed line number to sed

I have a piped unix script which finally yields a line number in the subject file.
Now I need to print out the file contents from this particular line to the end.
Is it possible to feed the line number to sed via xargs, so that sed prints out the desired lines?
.....|tail -1 | cut -f 1 | xargs sed ...?
Is this possible?
..... | tail -1 | cut -f 1 | xargs -I {} sed -n '{},$p' your_file
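You can also capture the number in a variable and skip xargs. A sketch with an invented file, where grep -n stands in for whatever pipeline produces the line number:

```shell
#!/usr/bin/env bash
file=$(mktemp)
printf 'alpha\nbeta\ngamma\ndelta\n' > "$file"

# Stand-in for the real pipeline: line number of the last match.
start=$(grep -n 'gamma' "$file" | tail -1 | cut -d: -f1)

# Two equivalent ways to print from that line to the end of the file.
from_sed=$(sed -n "${start},\$p" "$file")
from_tail=$(tail -n "+${start}" "$file")

printf '%s\n' "$from_sed"
rm -f "$file"
```

The \$ keeps the shell from expanding $p inside the double quotes, so sed still sees the end-of-file address.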
