How to make git-log scroll up instead of down - terminal

Whenever I view the output of git log --all --graph --oneline --decorate in my terminal emulator, the first commit is shown at the top of the terminal screen. When I quit the git log view with q, a few lines from the top are no longer visible, because new lines are appended to the bottom of the screen for the next command.
Usually, though, those top lines are the most interesting, as they represent the most recent git history, so I want them to still be visible when I type the next git command.
How can I make the git log output appear starting at the bottom of the screen, i.e. such that the first commit is viewed at the bottom? You would have to scroll up to view older commits.
NOTE: The --reverse flag is not an option, for two reasons:
1. Each time, you have to scroll all the way to the bottom to view the first commits. That should not be necessary. I want to start at the bottom.
2. It doesn't combine with the --graph flag: fatal: cannot combine --reverse with --graph.

Original answer, which doesn't work, so go to EDIT for the working version
The following sed solution works for me if used directly from the command line. It does not use any temporary string to switch \ and /, relying instead on sed's y command.
$ git --no-pager log --all --graph --decorate --oneline --color=always | tac | sed 'h
s!\( *[0-9a-z]\{7\} .*\)\{0,1\}$!\1!
y/\\\/_¯/\/\\¯_/
x
s!\(.*\)\( *[0-9a-z]\{7\} .*\)\{0,1\}$!\2!
x
G
s/\n//' | less -X +G -r
It assumes the SHA code to be 7 characters long and uses it to "recognize" what is not the leading sequence of \, /, |, _, *, and <space>, which I was not able to put at the beginning of the first s command's search pattern and in place of the first . in the second s command.
I don't know why I cannot get it to work when all the sed commands are put in a script called with sed's -f option.
EDIT (The above code is actually faulty)
As @user1902689 pointed out, sed scripts can make anyone's eyes bleed, myself included, since they are extremely cryptic.
In my opinion, the task would be easily accomplished if --color=always were not used, in which case the text piped out of git log would be the same as what we see on screen; using --color=always, on the contrary, intersperses control sequences like ^[[33m in the text to control the coloring (different colors for different branches, ...).
But colored output is nice, so I directed the output of git log --color=always ... to a file and looked into it, discovering that the hash always appears between ^[[33m and ^[[m, where ^[ is a single character obtainable by hitting Ctrl+V, then Esc. These are essentially the escape sequences that the terminal interprets as setting the color to yellow and resetting it to the default, respectively.
The hash is not necessarily the only string of 7 alphanumeric characters on the line (e.g. thiswrd could be in the commit message), but it is almost certainly the first one, so greedy expressions (sed has no non-greedy expressions) can safely be used after it, but not before it: a .* before the hash-matching regexp would make that regexp match the last string of 7 alphanumeric characters on the line, which could be anytext, for instance, and the hash would be lost somewhere in .*. To use the greedy .* in a way that doesn't devour the hash, we can first enclose the hash between newlines \n with an s command (they are not matched by the . in .* and, as such, must be typed explicitly), so that a later s command can "limit" the greediness of .* by using some \n explicitly in its search pattern.
I think the following code (explained afterward) is not definitive, since it hardcodes the coloring escape sequences used to get a colored hash string, but it has worked in every case I've tried.
$ git --no-pager log --all --graph --decorate --oneline --color=always | tac | sed '
s/\(\(^[\[33m\)\([0-9a-z]\{7\}\)\(^[\[m\)\(.*\)\)/\2\n\3\n\4\5/
h
s/^.*\(\n[a-z0-9]\{7\}\n.*\|$\)/\1/
x
s/\n[a-z0-9]\{7\}\n.*$//
y/\\\/_¯/\/\\¯_/
G
s/\n\([a-z0-9]\{7\}\)\n/\1/
s/\n//' | less -X +G -r
Each line contains the three parts Graph OpeningColorTagHashClosingColorTag Message, or only the first one, Graph.
The sed script consists of 9 commands that do what I intended to do with the original answer, but in a slightly different way (in particular, the order of some commands is inverted to save an x command).
The first s command puts a newline \n on each side of the Hash string [0-9a-z]\{7\} (or does nothing if there is no hash in this line; note that the lines before/after a merge/diverge have no hash or message following them). This has the purpose of "isolating" Hash. Note that the capturing groups \(...\) are numbered based on the order of occurrence of the opening token \(, so in the replacement string \2\n\3\n\4\5:†
\2 refers to ^[\[33m, which is the OpeningColorTag (NOTE: ^[ is a single character obtained by hitting Ctrl+V, then Esc, whereas the "true" [ must be escaped with a backslash \);
\3 refers to Hash, [0-9a-z]\{7\};
\4 refers to the ClosingColorTag, ^[\[m (what is said for \2 holds here too);
\5 is whatever follows, .* (implicitly up to end-of-line).
Now the pattern space (the current line, as edited so far) contains the original line with two embedded newlines, one on each side of the hash (Graph OpeningColorTag\nHash\nClosingColorTag Message), or the unmodified original line if it contained no hash (Graph).
The h command "saves" the pattern space into the hold space (think of it as a drawer).
Now the pattern and hold spaces have the same content (Graph OpeningColorTag\nHash\nClosingColorTag Message or Graph).
The second s command replaces the whole line with the captured group alone (/\1/), discarding everything before it (^.*, i.e. Graph OpeningColorTag). The group matches either (\| separates alternatives in a capturing group \(...\))
the newline-enclosed hash and everything following it (i.e. the commit message), \n[a-z0-9]\{7\}\n.*,
or the end-of-line $.
Now the pattern space contains \nHash\nClosingColorTag Message, or the empty string if there was no hash.
The x command exchanges the contents of pattern and hold spaces, so that the multiline \nHash\nClosingColorTag Message (or the empty string) is saved in the hold space, and the multiline Graph OpeningColorTag\nHash\nClosingColorTag Message is in the pattern space, ready to be re-edited.
The third s command strips off the \nHash, and everything following it, from the pattern space.
Now the pattern space contains Graph OpeningColorTag.
The y command substitutes each character between the first two non-escaped / with the corresponding character between the second and third non-escaped /. Here both back and forward slashes must be escaped with a backslash. (This should be safe, since OpeningColorTag should not contain any of the translated characters.)
Now the pattern space contains Hparg OpeningColorTag, where Hparg is the "inverted" version of Graph (or Hparg only).
The G command gets the content of the hold space and appends it to (hence the capital G; the lower case g would copy into instead of append to) the pattern space, with a newline \n in between.
Now the pattern space contains Hparg OpeningColorTag\n\nHash\nClosingColorTag Message (or Hparg\n only), and we don't care about the hold space from now on.
The fourth s command captures the \nHash\n part and substitutes it with Hash.
Now the pattern space contains Hparg OpeningColorTag\nHashClosingColorTag Message or Hparg\n.
The last s command removes the remaining newline \n.
Finally the pattern space contains Hparg OpeningColorTagHashClosingColorTag Message or Hparg.
Steps 8. and 9. cannot be fused together (e.g. s/\n\n\([a-z0-9]\{7\}\)\n/\1/), since the two \n enclosing the hash are there only if the line contains the hash (the first s of point 1. does nothing if there's no hash), whereas the first \n is always present, since it comes with the G command.
† Actually, the outermost group \(...\) is not needed (indeed, it's not used), so it can be removed, and all numeric references to the other capturing groups can be decreased by 1, e.g. s/\(\(^[\[33m\)\([0-9a-z]\{7\}\)\(^[\[m\)\(.*\)\)/\2\n\3\n\4\5/ can be changed to s/\(^[\[33m\)\([0-9a-z]\{7\}\)\(^[\[m\)\(.*\)/\1\n\2\n\3\4/; but I'll keep the unnecessary group in the answer, since it gives me the chance to mention the not-so-widely-known numbering of capturing groups.

This is an answer that seems to catch most edge cases. Not tested thoroughly.
[alias]
rlog = !"git --no-pager log --all --graph --decorate --oneline --color=always | tac | awk -f ~/switchslashes.awk | less -X +G -r"
where the file ~/switchslashes.awk contains
{
  match($0,/([[:space:][:cntrl:]|*\/\\]+)(.*)/,a) # find the segment of the graph
  tgt = substr($0,RSTART,RLENGTH) # save that segment in a variable tgt
  gsub(/\//,RS,tgt) # change all /s to newlines in tgt
  gsub(/\\/,"/",tgt) # change all \s to /s in tgt
  gsub(RS,"\\",tgt) # change all newlines to \s in tgt
  gsub(/_/,"¯",tgt) # change all _ to ¯ in tgt
  print tgt substr($0,RSTART+RLENGTH) # print tgt plus rest of the line
}
which is a modified version of this script. It replaces underscores with overlines and swaps slashes with backslashes and vice versa. This fixes the graph after the text has been reversed by tac.
Disclaimer
I never started using this, since it is slow with larger repositories. It needs to load everything and then apply the substitutions, which take too much time for my taste.
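For comparison, the character swap done by the awk script above can be sketched in Python. This is only an illustration under assumptions: the names flip_graph_chars, GRAPH_PREFIX and SWAP are made up here, the graph is assumed to be the leading run of spaces, |, *, /, \, _ and ANSI colour sequences, and the sketch has not been tried against real git log output or checked for speed on large repositories.

```python
import re

# Hypothetical helper, not part of git: flips the graph characters of one
# line of `git log --graph` output after the line order was reversed by tac.
# Assumption: the graph is the leading run of spaces, |, *, /, \, _ and
# ANSI colour sequences (ESC [ ... m).
GRAPH_PREFIX = re.compile(r"^(?:\x1b\[[0-9;]*m|[ |*/\\_])*")

# / and \ trade places; _ becomes an overline, as in the awk script.
SWAP = str.maketrans({"/": "\\", "\\": "/", "_": "¯"})


def flip_graph_chars(line: str) -> str:
    """Return line with the characters of its graph prefix flipped."""
    end = GRAPH_PREFIX.match(line).end()
    # Only the prefix is translated, so slashes in the commit message survive.
    return line[:end].translate(SWAP) + line[end:]
```

Applied to each line coming out of tac, a function like this would take the place of the awk stage in the pipeline. Because str.translate maps every character in a single pass, no temporary placeholder is needed for the swap.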

First of all, you can always pass -n to git log to print out only the number of commits you are interested in.
How can I make the git log output appear reversed
Use the --reverse flag:
--reverse
Output the commits in reverse order.
git log --reverse
You can read here for more tips and flags regarding git log:
http://www.alexkras.com/19-git-tips-for-everyday-use/

A command that comes close to the intended result is
git --no-pager log --all --graph --decorate --oneline --color=always | tac | less -r +G -X
However, this still messes up the graph a little bit, as the slashes are not reversed properly.
Update
This command also takes care of swapping the slashes with backslashes and vice versa.
git --no-pager log --all --graph --decorate --oneline --color=always | tac | sed -e 's/[\]/aaaaaaaaaa/g' -e 's/[/]/\\/g' -e 's/aaaaaaaaaa/\//g' | less -r +G -X
The corresponding git alias is
[alias]
rlog = !"git --no-pager log --all --graph --decorate --oneline --color=always | tac | sed -e 's/[\\]/aaaaaaaaaa/g' -e 's/[/]/\\\\\\\\/g' -e 's/aaaaaaaaaa/\\\\//g' | less -r +G -X"

Related

Cut Typed Line Numbers From Multiple Lines in a Text File

I have a long text file that has numbers in front of each of the paths listed in the file like the example below:
1) /some/path/here/file.txt
1) /some/path/here/1file.edf
2) /some/another_path/here/2file.txt
3) /some/other_path/here/3file.txt
3) /some/other_path/here/4file.edf
3) /some/other_path/here/5file.edf
...
This file continues for thousands of lines. What I have to do is cut the number off the front of each of these lines so that I can tar the list of files without the numbers interfering. Is there a way to do this using either shell commands or Emacs?
Regular expressions are your friend here. You can do this in Emacs or on the command line. In Emacs, if you press C-M-% (control+alt+shift+5) or run query-replace-regexp, it will prompt you for a regex to find and a replacement. If you use ^[0-9]*) (note that there is a space at the end of this) and then leave the replacement empty, you will replace the number, the paren, and the space with nothing. If you start at the top of the file, you can type ! to replace all, or go one by one for a bit first to check that it's working.
In the regexp, ^ matches the beginning of the line, [0-9] matches a single character in the range 0 through 9, * means "match any number of the previous thing", and ) matches a literal paren (in Emacs this is the default; in the shell you'd likely need to escape it with \).
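The same pattern can be tried outside Emacs. Here is a minimal sketch in Python, assuming every numbered line looks like the sample in the question (digits, a closing paren, one space, then the path); the names NUMBER_PREFIX and strip_number_prefix are made up for illustration. Note that in Python's regex syntax the paren has to be escaped.

```python
import re

# Same idea as the Emacs pattern: digits at the start of the line,
# a literal ")" and one space. In Python's re syntax the paren is escaped.
NUMBER_PREFIX = re.compile(r"^[0-9]*\) ")


def strip_number_prefix(line: str) -> str:
    """Remove a leading "N) " from line; other lines are returned unchanged."""
    return NUMBER_PREFIX.sub("", line, count=1)
```

Because the pattern is anchored with ^, a later "3) " inside a path is left alone.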
Something like:
cat list.txt | cut -f 2 -d' ' | xargs tar czf archive.tar.gz

Using both GNU Utils with Mac Utils in bash

I am working with plotting extremely large files with N number of relevant data entries. (N varies between files).
In each of these files, comments are automatically generated at the start and end of the file, and I would like to filter these out before recombining the files into one grand data set.
Unfortunately, I am using Mac OS X, where I run into issues when trying to remove the last line of the file. I have read that the most efficient way is to use the head and tail commands to cut off sections of data. Since head -n -1 does not work on Mac OS X, I had to install coreutils through Homebrew, where the ghead command works wonderfully. However, the command
tail -n+9 $COUNTER/test.csv | ghead -n -1 $COUNTER/test.csv >> gfinal.csv
does not work. A less than pleasing workaround was to separate the commands: use ghead > newfile, then use tail on newfile > gfinal. Unfortunately, this will take a while, as I have to write a new file with the first ghead.
Is there a workaround to incorporating both GNU Utils with the standard Mac Utils?
Thanks,
Keven
The problem with your command is that you specify the file operand again for the ghead command, instead of letting it take its input from stdin, via the pipe; this causes ghead to ignore stdin input, so the first pipe segment is effectively ignored; simply omit the file operand for the ghead command:
tail -n+9 "$COUNTER/test.csv" | ghead -n -1 >> gfinal.csv
That said, if you only want to drop the last line, there's no need for GNU head - OS X's own BSD sed will do:
tail -n +9 "$COUNTER/test.csv" | sed '$d' >> gfinal.csv
$ matches the last line, and d deletes it (meaning it won't be output).
Finally, as @ghoti points out in a comment, you could do it all using sed:
sed -n '9,$ {$!p;}' file
Option -n tells sed to only produce output when explicitly requested; 9,$ matches everything from line 9 through (,) the end of the file (the last line, $), and {$!p;} prints (p) every line in that range, except (!) the last ($).
I realize that your question is about using head and tail, but I'll answer as if you're interested in solving the original problem rather than figuring out how to use those particular tools to solve the problem. :)
One method using sed:
sed -e '1,8d;$d' inputfile
At this level of simplicity, GNU sed and BSD sed both work the same way. Our sed script says:
1,8d - delete lines 1 through 8,
$d - delete the last line.
If you decide to generate a sed script like this on-the-fly, beware of your quoting; you will have to escape the dollar sign if you put it in double quotes.
Another method using awk:
awk 'NR>9{print last} NR>1{last=$0}' inputfile
This works a bit differently in order to "recognize" the last line, capturing the previous line and printing after line 8, and then NOT printing the final line.
This awk solution is a bit of a hack, and like the sed solution, relies on the fact that you only want to strip ONE final line of the file.
If you want to strip more lines than one off the bottom of the file, you'd probably want to maintain an array that would function sort of as a buffered FIFO or sliding window.
awk -v striptop=8 -v stripbottom=3 '
{ last[NR]=$0; }
NR > striptop*2 { print last[NR-striptop]; }
{ delete last[NR-striptop]; }
END { for(r in last){if(r<NR-stripbottom+1) print last[r];} }
' inputfile
You specify how much to strip in variables. The last array keeps a number of lines in memory, prints from the far end of the stack, and deletes them as they are printed. The END section steps through whatever remains in the array, and prints everything not prohibited by stripbottom.
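The sliding-window idea can also be sketched in Python. This is an illustrative variant rather than a translation of the awk script above: the function name strip_lines and its parameters top and bottom are made up here, and the buffer only ever holds the last few lines, so the input is streamed rather than loaded whole.

```python
from collections import deque
from typing import Iterable, Iterator


def strip_lines(lines: Iterable[str], top: int, bottom: int) -> Iterator[str]:
    """Yield lines, skipping the first `top` and the last `bottom` of them."""
    window = deque()  # lines that might still turn out to be in the footer
    for index, line in enumerate(lines):
        if index < top:
            continue  # still inside the header
        window.append(line)
        if len(window) > bottom:
            # The oldest buffered line now has `bottom` lines after it,
            # so it cannot be part of the footer.
            yield window.popleft()
```

For the case in the question this would be called with top=8 and bottom=1 on an open file object; whatever is left in the window when the input ends is the footer and is simply never yielded.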

Bash sort command not sorting colored output correctly

When using /bin/sort in bash I find that output is often inappropriately sorted when it comes from colorized input.
For example in a directory with the following contents:
$ ls
dir1 (directory, printed in blue)
dir2 (directory, printed in blue)
dir3 (directory, printed in blue)
afile (file, printed in white)
file1 (file, printed in white)
file2 (file, printed in white)
file3 (file, printed in white)
I would expect ls | sort to sort afile first, then dir1, etc. Instead I get:
$ ls | sort
dir2
dir3
dir1
afile
file1
file2
file3
I have tried quite a few options on sort (-d, -g, -h, -n) to no avail.
The only way I have been able to fix the issue is by explicitly turning the color output of ls off:
$ ls --color=never | sort
afile
dir1
dir2
dir3
file1
file2
file3
But this feels like a work-around, not a solution to the problem. I keep thinking there must be a way to keep the color in the final output, if only for cases where turning off the color is not an option (e.g. for commands other than ls that may not easily support disabling color).
How would one force sort to act only on the printed characters (i.e. just the file and directory names)? I would be interested to see how to cleanly remove the color output after the fact (I've tried strings for this but get the color specifiers [01;34m for the blue text) and especially interested in whether one can preserve the color output after sorting.
Text is colored by adding ANSI colour sequences of the form \x1b[...m, where the ... is replaced by one or more numbers separated by semicolons, which describe the style. In order to sort the text, you would need to ignore the entire color sequence, which is well beyond the capabilities of the standard locale collation definition.
If the coloured output is produced by a program like ls, which colours each line independently, you could use sed to create a sort key -- the line with colour sequences deleted -- followed by the fully coloured line.
Here's a simple solution which requires that the TAB is not present in any line. (It also requires that there are no newlines in any entry, but that was required by the fact that entries are individual lines, which is a basic premise for using sort.)
ls -U --color=always | # Sample data input
sed 'h;s/\x1b[[0-9;]*m//g;G;s/\n/\t/' | # Insert the sort key
sort | # Sort the result
cut -f2- # Remove the sort key
Explanation of the sed command:
h Copy the line to the hold space
s/\x1b[[0-9;]*m//g Remove all colour sequences
G Append a newline and the contents of the hold space
s/\n/\t/ Change the newline to a tab
Note: Using backslash escapes other than \n in the sed pattern and replacement is a GNU extension, probably also available in other sed implementations but not required by the POSIX standard. For a POSIX-standard sed, you'll need to replace \x1b and \t with a literal ESC and TAB character respectively.
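The same sort-key idea can be sketched in Python, where the key never has to be glued onto the line and cut off again. The names ANSI_COLOUR, visible_text and sort_coloured are made up for this illustration, and the sketch compares by code point rather than by locale collation, so the order may differ from what sort produces in some locales.

```python
import re

# ESC, "[", any digits and semicolons, then "m": a colour sequence of the
# form described above.
ANSI_COLOUR = re.compile(r"\x1b\[[0-9;]*m")


def visible_text(line: str) -> str:
    """Return line with all colour sequences removed."""
    return ANSI_COLOUR.sub("", line)


def sort_coloured(lines):
    """Sort lines by their visible text; the colour sequences stay in place."""
    return sorted(lines, key=visible_text)
```

The key function plays the role of the sed stage, and because sorted only uses it for comparison, there is no restriction on TAB characters in the lines.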
First of all, why do you need sort when ls has a ton of ordering options built in? Maybe you can solve your problem by reading the manual for ls.
But supposing you do need color-insensitive sorting: You can tell sort to start sorting after a specified number of characters, which is not quite what you need but it's better than nothing. I don't have your version of ls so I can't check for sure, but simple colored text is normally activated with a five-character sequence. E.g., blue is ^[[34m. So, tell sort to start sorting with the sixth character:
ls | sort -k 1.6
But what about ordinary files? They probably don't get a color prefix, so the above will fail badly unless you filter it to add an equal number of characters in front of black lines. For symmetry you could simply add black color (^[[00m, to make it five characters).
ls | perl -pe '{ s/^/\e\[00m/ unless /^\e\[/ }' | sort -k 1.6

Making a script that uses 'sed' to patch hex strings inside binaries in OSX

Patching hex strings inside binaries with sed.
How do I use sed to open a binary file inside a .app, search for a unique string of hex values, replace it with a new string, and then save the binary and exit?
I have done a lot of research and I'm stuck.
Ultimately I would like to write this as a script. Below I have written some code as terminal commands that basically doesn't work, but represents what I want to happen to the best of my ability.
//binary patcher script attempt
hexdump -ve '1/1 "%.2X"' /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
sed "s/\x48\x85\xc0\x75\x33/\x48\x85\xc0\x74\x33/g" | \
xxd -r -p > /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched | \
cd /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/ | \
mv /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
sudo chmod u+x /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp
//returns 1 if the string is in the file
xxd -p /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | tr -d '\n' | grep -c '4885c07533'
This is not in use in the script at the moment, but I tested it and it does return 1 if the sequence is there, so I thought it would be handy when it comes to possibly making these patches into small applications of their own, implemented by means of something along the lines of:
'if(xxd -p /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | tr -d '\n' | grep -c '4885c07533' == 1){runTheRestOfTheScript;
else if (xxd -p /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | tr -d '\n' | grep -c '4885c07533' == 1){ThrowERROR;'
ok so back to the stuff in the script
//First it dumps the binaries hex information into memory
hexdump -ve '1/1 "%.2X"' /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
//calls sed to find the string of values and replace it with the new one.
sed "s/\x48\x85\xc0\x75\x33/\x48\x85\xc0\x74\x33/g" | \
//saves the new patched file as MyApp.Patched
xxd -r -p > /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched | \
//cds to the directory of the patched file
cd /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/ | \
// renames the file to its original executable name
mv /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp.Patched /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp | \
//sets the new file as executable after a password.
sudo chmod u+x /Users/MiRAGE/Downloads/MyApp.app/Contents/MacOS/MyApp
Now, this is my first attempt and I am aware some of the commands are probably completely wrong. Apart from the fact that it does not do the patching and that it deletes the contents of the binary, it works as far as the renaming goes, and hopefully it gives you an overview of how I need the script to behave at run time.
Now, I am a real newbie, but I really need to get this done and I really have no idea what to do.
I need this script to basically work by waiting for the user to point the program to the file that needs patching (as I'm patching the apps I've made, preferably it would accept a .app file dragged into the window and find the binary in the MacOS folder by itself; this will come later, though, and could also be done in various ways).
I then need it to search for the string in the binary and replace it with the edited string, in this case:
original: 4885c07533
patched: 4885c07433 (it's worth mentioning again that this string will always be unique but may vary in length depending on the function that needs patching)
I then need to save it with the same name as the original, which this script handles by saving the patched file with .Patched appended and subsequently renaming it.
It then makes the file executable and exits, leaving the patched .app ready to run.
This method of creating patches would be particularly helpful if, for instance, I notice I have made the same mistake in many of my programs. If the function is unique, I could make a single patch that edits the binaries at the touch of a button while holding only the section of code that is relevant to the patch.
So, to sum up:
What I am looking for is some way of getting this script working and maybe, if any of you can help, a little advice on turning this into a little application to make my life easier.
Many thanks in advance for any and all help you can offer.
I will be checking daily, so if I need to clarify something let me know and I'll be on it in a flash.
MiRAGE
With regards to the sed line
sed "s/\x48\x85\xc0\x75\x33/\x48\x85\xc0\x74\x33/g"
Firstly, you can use sed to change arbitrary binary data, but beware of newlines. sed always processes its input line by line, splitting at newlines, so if the value \x0a appears in your string you will have problems.
The following will allow you to consider the entire file as pure binary. (call sed with the -n option so that it won't print out lines after processing them by default).
# Append the current line to the hold space
H
# On the last line the hold space contains all of the file - now swap pattern and hold space, operate on the pattern space and print the line
${
# exchange hold and pattern space
x
# do substitution
s/.../.../g
# print out result, required due to -n option
p
}
or, more succinctly
sed -n 'H;${x;s/.../.../g;p}'
When you append the pattern space to hold space the new line will be inserted - so this circumvents issues with new lines.
Also, in your example you used double quotes for your sed expression. Due to shell escaping rules for backslashes and the nature of sed, I would recommend the use of single quotes to avoid complication. Apologies if it is the case that this is not true for your shell.
Lastly about sed, you should beware of special values contained in the hex.
If you escape a byte literal in sed with \x.., it is interpreted by first replacing the escaped byte literal with its value and then executing the command. Importantly, regex special characters produced this way still act as special characters, just as if they had been typed unescaped.
Example:
sed 's/\x5e\x2f/foo/'
# becomes
substitute pattern '\x5e\x2f' for 'foo'
# becomes
substitute pattern '^/' for 'foo'
# which matches a / at the beginning of a line as opposed to ^/
So the characters to look out for on the left of a substitution are the usual suspects, and beware \x26 (&) on the right hand side of a substitution.
Hopefully that at least clarifies sed's potential role in your script :-).
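For comparison only, and outside of sed: a plain byte-for-byte replacement involves no regular expressions and no line splitting, so neither \x0a nor bytes such as \x5e or \x26 need special care. Below is a minimal sketch in Python; the function name patch_bytes is made up here, and the two checks (equal length, exactly one occurrence) are assumptions drawn from the question's statement that the string is unique.

```python
def patch_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    """Return data with the single occurrence of old replaced by new."""
    if len(old) != len(new):
        # A different length would shift everything after the patch.
        raise ValueError("replacement must have the same length as the original")
    if data.count(old) != 1:
        raise ValueError("pattern must occur exactly once")
    # bytes.replace compares raw bytes; nothing is treated as a metacharacter.
    return data.replace(old, new)
```

The hex strings from the question could be turned into byte values with bytes.fromhex("4885c07533") and bytes.fromhex("4885c07433"); reading the binary and writing the result back out are left aside here.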

How to remove duplicate phrases from a document?

Is there a simple way to remove duplicate contents from a large text file? It would be great to be able to detect duplicate sentences (as separated by "."), or even better, to find duplicates of sentence fragments (such as 4-word pieces of text).
Removing duplicate words is easy enough, as other people have pointed out. Anything more complicated than that, and you're into Natural Language Processing territory. Bash isn't the best tool for that -- you need a slightly more elegant weapon for a civilized age.
Personally, I recommend Python and its NLTK (natural language toolkit). Before you dive into that, it's probably worth reading up a little bit on NLP so that you know what you actually need to do. For example, the "4-word pieces of text" are known as 4-grams (n-grams in the generic case) in the literature. The toolkit will help you find those, and more.
Of course, there are probably alternatives to Python/NLTK, but I'm not familiar with any.
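To make the n-gram idea concrete, here is a minimal sketch that uses only Python's standard library rather than NLTK. The function name repeated_ngrams is made up for illustration, and the sketch simply splits on whitespace: it does no sentence detection and no normalisation of case or punctuation.

```python
from collections import Counter


def repeated_ngrams(text: str, n: int = 4) -> dict:
    """Map each n-word sequence that occurs more than once to its count."""
    words = text.split()
    # Slide a window of n words over the text, one word at a time.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in Counter(grams).items() if count > 1}
```

Overlapping windows are counted independently, so a long repeated passage shows up as several repeated 4-grams.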
Remove duplicate phrases and keep the original order:
nl -w 8 "$infile" | sort -k2 -u | sort -n | cut -f2
The first stage of the pipeline prefixes every line with its line number to record the original order. The second stage sorts on the original data with the unique switch set.
The third stage restores the original order (by sorting numerically on the first column). The final cut removes the line-number column.
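The same result (exact duplicate lines removed, original order kept) can be sketched in Python without numbering and re-sorting. The function name unique_in_order is made up here; the sketch relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on, and it always keeps the first occurrence of each line.

```python
def unique_in_order(lines):
    """Return lines without repeats, keeping each line's first occurrence."""
    # dict keys are unique and remember the order in which they were added.
    return list(dict.fromkeys(lines))
```

Unlike the pipeline, this needs the set of distinct lines to fit in memory.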
You can remove duplicate lines (which have to be exactly equal) with uniq if you sort your textfile first.
$ cat foo.txt
foo
bar
quux
foo
baz
bar
$ sort foo.txt
bar
bar
baz
foo
foo
quux
$ sort foo.txt | uniq
bar
baz
foo
quux
Apart from that, there's no simple way of doing what you want. (How will you even split sentences?)
You can use grep with backreferences.
If you write grep "\([[:alpha:]]*\)[[:space:]]*\1" -o <filename> it will match any two identical words following one another. For example, if the file content is this is the the test file, it will output the the.
(Explanation: [[:alpha:]] matches any character a-z and A-Z; the asterisk * after it means that it may appear any number of times; \(\) is used for grouping, so that the match can be backreferenced later; [[:space:]]* matches any number of spaces and tabs; and finally \1 matches the exact sequence that was matched inside the \(\) brackets.)
Likewise, if you want to match a group of 4 words that is repeated twice in a row, the expression will look like grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\)[[:space:]]*\1" -o <filename>; it will match e.g. a b c d a b c d.
Now we need to allow an arbitrary character sequence in between matches. In theory this should be done by inserting .* just before the backreference, i.e. grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\).*\1" -o <filename>, but this doesn't seem to work for me: it matches just any string and ignores the backreference.
The short answer is that there's no easy method. In general any solution needs to first decide how to split the input document into chunks (sentences, sets of 4 words each, etc.) and then compare them to find duplicates. If it's important that the ordering of the non-duplicate elements be the same in the output as it was in the input, then this only complicates matters further.
The simplest bash-friendly solution would be to split the input into lines based on whatever criteria you choose (e.g. split on each ., although doing this quote-safely is a bit tricky), then use standard duplicate detection mechanisms (e.g. | uniq -c | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}'), and then, for each resulting line, remove the text from the input.
Presuming that you had a file that was properly split into lines per "sentence" then
uniq -c lines_of_input_file | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}' | while IFS= read -r match ; do sed -i '' -e 's/'"$match"'//g' input_file ; done
Might be sufficient. Of course it will break horribly if the $match contains any data which sed interprets as a pattern. Another mechanism should be employed to perform the actual replacement if this is an issue for you.
Note: If you're using GNU sed the -E switch above should be changed to -r
I just created a script in python, that does pretty much what I wanted originally:
# Python 2 syntax (string.maketrans, two-argument str.translate, print statement)
import string
import sys

def find_all(a_str, sub):
    # yield the start index of every non-overlapping occurrence of sub in a_str
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub)

if len(sys.argv) != 2:
    sys.exit("Usage: find_duplicate_fragments.py some_textfile.txt")
file=sys.argv[1]
infile=open(file,"r")
text=infile.read()
text=text.replace('\n','') # remove newlines
table = string.maketrans("","")
text=text.translate(table, string.punctuation) # remove punctuation characters
text=text.translate(table, string.digits) # remove numbers
text=text.upper() # to uppercase
while text.find("  ")>-1:
    text=text.replace("  "," ") # collapse double spaces into single spaces
spaces=list(find_all(text," ")) # find all spaces
# scan through the whole text in packets of four words
# and check for multiple appearances.
for i in range(0,len(spaces)-4):
    searchfor=text[spaces[i]+1:spaces[i+4]]
    duplist=list(find_all(text[spaces[i+4]:len(text)],searchfor))
    if len(duplist)>0:
        print len(duplist),': ',searchfor
BTW: I'm a python newbie, so any hints on better python practise are welcome!
