I have been given a .txt file in which we have to find all the palindromes in the text (each must have at least 3 letters, and they can't be made of a single repeated letter, e.g. AAA).
The output should show the number of times the word appears in the first column and the word itself in the second, e.g.
123 kayak
3 bob
1 dad
#!/bin/bash
tmp=$(mktemp)
awk '{for(x=1;x<=NF;++x)print $x}' "${1}" | tr -d '[:punct:]' | tr -s '[:space:]' | sed -e 's/#//g' -e 's/[0-9]*//g' | sed -r '/^.{,2}$/d' | sort | uniq -c -i > "$tmp"
This outputs the file as it should, ignoring case, words of fewer than 3 letters, punctuation and digits.
However, I am now stumped on how to pull the palindromes out of this. I thought a temp file might be the way to go, but I don't know where to take it from here.
Any help or guidance is much appreciated.
# modify this to your needs; it should take your input on stdin, and return one word per
# line on stdout, in the same order if called more than once with the same input.
preprocess() {
  tr -d '[[:punct:][:digit:]#]' \
  | tr -s '[[:space:]]' \
  | tr '[[:space:]]' '\n' \
  | sed -E -e '/^(.)\1+$/d'   # drop words made of a single repeated letter, e.g. AAA
}
paste <(preprocess <"$1") <(preprocess <"$1" | rev) \
| awk '$1 == $2 && (length($1) >= 3) { print $1 }' \
| sort | uniq -c
The critical thing here is to paste together your input file with a stream that has each line from that input file reversed. This gives you two separate columns you can compare.
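For example, with a few made-up words, the pasted stream looks like this (palindromes are the rows where the two columns match):
$ printf '%s\n' kayak bob hello | paste - <(printf '%s\n' kayak bob hello | rev)
kayak   kayak
bob     bob
hello   olleh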
I want to pipe the output of a command into two commands and paste the results together. I found this answer and similar ones suggesting tee, but I'm not sure how to make it work the way I'd like.
My problem (simplified):
Say that I have a myfile.txt with keys and values, e.g.
key1 /path/to/file1
key2 /path/to/file2
What I am doing right now is
paste \
<( cat myfile.txt | cut -f1 ) \
<( cat myfile.txt | cut -f2 | xargs wc -l )
and it produces
key1 23
key2 42
The problem is that cat myfile.txt is repeated here (in the real problem it's a heavier operation). Instead, I'd like to do something like
cat myfile.txt | tee \
<( cut -f1 ) \
<( cut -f2 | xargs wc -l ) \
| paste
But it doesn't produce the expected output. Is it possible to do something similar to the above with pipes and standard command-line tools?
This doesn't answer your question about pipes, but you can use AWK to solve your problem:
$ printf %s\\n 1 2 3 > file1.txt
$ printf %s\\n 1 2 3 4 5 > file2.txt
$ cat > myfile.txt <<EOF
key1 file1.txt
key2 file2.txt
EOF
$ cat myfile.txt | awk '{ ("wc -l " $2) | getline size; sub(/ .+$/,"",size); print $1, size }'
key1 3
key2 5
On each line we first run wc -l $2 and save the result into a variable. Not sure about your system, but on mine wc -l includes the filename in the output, so we strip it with sub() to match your example output. Finally, we print the $1 field (the key) and the size we got from the wc -l command.
It can also be done with plain shell, now that I think about it:
cat myfile.txt | while read -r key value; do
printf '%s %s\n' "$key" "$(wc -l "$value" | cut -d' ' -f1)"
done
Or more generally, by piping to two commands and using paste, therefore answering the question:
cat myfile.txt | while read -r line; do
printf %s "$line" | cut -f1
printf %s "$line" | cut -f2 | xargs wc -l | cut -d' ' -f1
done | paste - -
P.S. The use of cat here is useless, I know. But it's just a placeholder for the real command.
I have the following bash script called countscript.sh
#!/bin/bash
echo "Running" $0
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed $1 q
But I don't understand how to pass the argument correctly ("3" should become the $1 argument of sed):
$ echo " one two two three three three" | ./countscript.sh 3
Running ./countscript.sh
sed: -e expression #1, char 1: missing command
This works fine:
$ echo "one two three four one one four" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 3q
3 one
2 four
1 two
Thanks.
PS: Has anybody else noticed the bug in this script on page 10 of https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf ?
In the quoted paper, I think you are misreading
sed ${1}q
as
sed ${1} q
and sed does not consider 3 by itself a valid command. The separate argument q is treated as an input file name. If the value of $1 did result in a single valid sed script, you would have likely gotten an error for the missing input file q.
Proper shell programming would dictate this be written as
sed "${1}q"
or
sed "${1} q"
instead; with the space as part of the script, sed correctly outputs the first $1 lines of input and exits.
It's somewhat curious that the authors used sed instead of head -n "$1" to output the first few lines, as one of them (McIlroy) essentially invented the idea of the Unix pipeline as a series of special-purpose, narrowly focused tools. Not having read the full paper, I don't know what Knuth's and McIlroy's contributions to it were; perhaps Bentley just likes sed. :)
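For comparison, the head-based version of that pipeline would be (a sketch, not taken from the paper):
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head -n "$1"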
When running the following command:
$ echo " one two two three three three" | ./countscript.sh 3
the special variable $1 will be replaced by 3, your first argument. Hence, the script runs:
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 3 q
Notice the space between the 3 and the q. sed does not know what to do, because you give it no command (3 is not a command).
Remove the space, and you should be fine.
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${1}q"
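Put back into countscript.sh, the whole script then reads (only the sed argument changes from your original):
#!/bin/bash
echo "Running" $0
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${1}q"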
I can get a list of the line numbers matching a specific pattern such as containing the word "function".
history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g'
If I do history -d on that, it says bad pattern; I don't know if that's because it's a list, or because they are strings rather than numbers.
history -d (history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
Quick answer:
while read n; do history -d $n; done < <(history | tac | awk '/function/{print $1}')
Explanation:
The history command accepts only a single offset when using the -d flag. On top of that, when you delete an entry, it also renumbers all the commands after that entry. For this reason we reverse the output of history using tac and process the lines from last to first. The short awk line just replaces the grep and sed commands to pick out the history offset.
We do not use a full pipeline as this creates subshells and history -d $n would not work properly. This is nicely explained in: Why can't I delete multiple entries from bash history with this loop
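To illustrate why the order matters (hypothetical offsets, and ignoring the fact that the history -d commands themselves get added to the history):
#   501  myfunction foo
#   502  ls
#   503  myfunction bar
history -d 501   # removes "myfunction foo"; the later entries shift down to 501 and 502
history -d 503   # now misses: "myfunction bar" is no longer at offset 503
Deleting from the highest offset first (hence the tac) avoids this shift.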
Note: If you want to push this to your history file ($HISTFILE), you have to use history -w
Warning: When you have multiline commands in your history the story becomes very complicated and strongly depends on various options that have been set. See [U&L] When is a multiline history entry (aka lithist) in bash possible? for the nasty bits.
You can delete one history entry or a range of entries, but not a list. Your matches are likely to be spread out, so the range option is out.
The multiple sed commands to extract the history offsets can be simplified into one:
sed -E 's/^ *([0-9]*).*$/\1/'
One problem with history is that it can have multiline entries, like:
741 source <(history | \
grep function | \
sed -E 's/^ *([0-9]*).*$/\1/' | \
sort -rn | \
xargs -n1 echo history -d)
If your grep matches on function above, your sed will not be able to extract the history offset number from the continuation lines, so we need to make that possible. One way is to remove all newlines and re-add them only in front of lines that start with a history offset (this can probably be done more simply):
awk '/^ {0,4}[0-9]+/ {
printf("\n%s",$0);
}
!/^ {0,4}[0-9]+/{
printf(" %s",$0);
}
END{
printf("\n")
}'
We can then produce a number of history -d commands with xargs. xargs can't run the built-in history directly, so I've just used it to generate the commands and fed them to the built-in source using process substitution:
source <(history | \
awk '/^ {0,4}[0-9]+/ {
printf("\n%s",$0);
}
!/^ {0,4}[0-9]+/{
printf(" %s",$0);
}
END{
printf("\n")
}' | \
grep function | \
sed -E 's/^ *([0-9]*).*$/\1/' | \
sort -rn | \
xargs -n1 echo history -d)
@kvantour gives nice alternatives to grep + sed + sort -rn. Using those, the blob above can be simplified to:
source <(history | \
awk '/^ {0,4}[0-9]+/ {
printf("\n%s",$0);
}
!/^ {0,4}[0-9]+/{
printf(" %s",$0);
}
END{
printf("\n")
}' | \
awk '/function/ {print "history -d",$1}' | \
tac)
You need to store the matching offset in a variable and then pass it to history.
$ history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g'
1077
$ var=$( history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
$ history -d $var
However, as there can be many occurrences of the pattern, I would use a loop:
$ var=$( history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
$ for i in $var
> do
> history -d $i
> history -w
> done
If the line you want to delete has already been written to your $HISTFILE (which typically happens when you end a session by default), you will need to write back to $HISTFILE, or the line will reappear when you open a new session.
After the deletion you need to reload .bashrc by executing
$ cd
$ source .bashrc
However, there are cases where the lines won't be deleted this way: if you set PROMPT_COMMAND to history -a, each command is written to the history file immediately, rather than on exit as under the normal configuration.
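For reference, that immediate-write setup usually looks something like this in ~/.bashrc (shown only as an illustration):
PROMPT_COMMAND='history -a'   # append each command to $HISTFILE as soon as it runs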
I'm new to the Linux shell. I know there are tools that can do this, such as awk, but I'm wondering if I could do it using grep or wc or other commands? awk seems intimidating to me. Thanks.
I tried grep and wc, like this:
grep tol test.txt | wc -w
But grep will give me the whole line.
If I tried the following:
grep '^tol$*' test.txt | wc -w
It only counts lines that begin with tol.
How can I grep the words starting with tol?
Something like this:
grep -o '\<tol[[:alpha:]]*\>' test.txt | wc -w
\< - matches the beginning of the word,
\> - the end of the word.
[[:alpha:]] - to avoid matching combinations like tol123 (you said you need only words).
-o - to show only matches, not the entire line.
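For example, with some made-up input (note that tol123 is not counted):
$ printf 'tolerance topaz tolstoy\ntol123 toluene toledo\n' | grep -o '\<tol[[:alpha:]]*\>' | wc -w
4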
You can do the same fairly simply with awk, e.g.
awk '{for(i=1;i<=NF;i++) $i~/^tol/ && n++} END {print n}'
Example
$ echo -e "tolerance topaz tolstoy\nbats toluene toledo" |
> awk '{for(i=1;i<=NF;i++) $i~/^tol/ && n++} END {print n}'
4
Another option is to translate all whitespace characters into linefeeds so that each word starts on a new line, then grep can count them itself:
echo -e "tolerance topaz\ttolstoy\nbats toluene toledo" | tr '[:space:]' '\n' | grep -c "^tol"
4
Or, if using a file called words.txt:
tr '[:space:]' '\n' < words.txt | grep -c "^tol"
I have a comma delimited file "myfile.csv" where the 5th column is a date/time stamp. (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots)
I'm using a bash shell via cygwin for WinXP
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows.
Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?
sort -k5,5 will do the sort on fields and avoid the cut;
uniq -f 4 will ignore the first 4 fields for the uniq;
Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one);
but uniq (and sort -k) expect blank-delimited fields instead of csv, so tr ',' '\t' to fix that.
Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w 16 (to include time), or -w 10 (for just dates), to the uniq.
So:
tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:
tr '\n' '\000'
To get zero separated rows. Then sort, uniq and xargs have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
Edit: the position of tr in the pipe was wrong.
You can tell xargs to use each line as an argument in its entirety using the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
This is a good candidate for awk:
BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
Set the field separator to ',' (CSV).
Split fifth field on the space, stick result in A.
Concatenate the line number to the list of what we have already stored for that date.
Print out the line numbers for each date.
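Assuming the script above is saved as, say, dupdates.awk (the file name is just an example), you would run it with:
$ awk -f dupdates.awk myfile.csv
and each output line would be a date followed by the line numbers on which it appears.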
Try escaping the spaces with sed:
echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv
(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
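A minimal sketch of that last idea, using mapfile as the modern equivalent of the IFS=$'\n' trick (same cut field as above):
mapfile -t dups < <(cut -d, -f 5 myfile.csv | sort | uniq -d)   # one duplicate date/time per element
for d in "${dups[@]}"; do
    grep -F -- "$d" myfile.csv    # -F: treat the date as a fixed string, not a regex
done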