How to clean a codebase: trailing whitespace, newlines, etc.

I have a code base that is driving me nuts with conflicts due to trailing whitespace. I'd like to clean it up.
I'd want to:
Remove all trailing whitespace
Remove any newline characters at the end of files
Convert all line endings to unix (dos2unix)
Convert all leading spaces to tabs, ie 4 spaces to tabs.
While ignoring the .git directory.
I'm on OS X Snow Leopard, using zsh.
So far, I have:
sed -i "" 's/[ \t]*$//' **/*(.)
which works great, but sed adds a newline to the end of every file it touches, which is no good. I don't think sed can be stopped from doing this, so how can I remove these newlines? There's probably some awk magic to be applied here.
(Complete answers also welcome)

[EDIT: Fixed whitespace trimming]
[EDIT #2: Strip trailing blank lines from end of file]
perl -i.bak -pe 'if (defined $x && /\S/) { print $x; $x = ""; } $x .= "\n" x chomp; s/\s*?$//; 1 while s/^(\t*)    /$1\t/; if (eof) { print "\n"; $x = ""; }' **/*(.)
This strips trailing blank lines from the file, but leaves exactly one \n at the end of the file. Most tools expect this, and it will not show up as a blank line in most editors. However if you do want to strip that very last \n, just delete the print "\n"; part from the command.
The command works by "saving up" \n characters until a line containing a non-blank character is seen -- then it prints them all before processing that line.
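The "saving up" idea can also be sketched outside Perl. The following is a minimal awk version, not the command above: the function name strip_trailing_blank_lines is made up for illustration, and it counts pending blank lines instead of storing \n characters.

```shell
# Hypothetical helper: drop blank lines at the end of the input,
# keeping blank lines that are followed by more content.
strip_trailing_blank_lines() {
  awk '
    /[^ \t\r]/ {
      # A non-blank line: release the blank lines held back so far.
      for (i = 0; i < pending; i++) print ""
      pending = 0
      print
      next
    }
    { pending++ }   # blank or whitespace-only line: hold it back
  '
}

printf 'a\n\nb\n\n\n' | strip_trailing_blank_lines
```

Like the Perl command, this leaves exactly one newline after the last non-blank line, and whitespace-only lines are treated as blank.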
Remove .bak to avoid creating backups of the original files (use at your own risk!)
\s*? matches zero or more whitespace characters non-greedily, including \r, which is the first character of the \r\n DOS line-break syntax. In Perl, $ matches either at the end of the line, or immediately before a final \n, so combined with the fact that *? matches non-greedily (trying a 0-width match first, then a 1-width match and so on) it does the right thing.
1 while s/^(\t*)    /$1\t/ is just a loop: on any line that begins with any number of tabs followed by 4 spaces, it replaces those 4 spaces with one more tab, and it repeats until this is no longer possible. So it will work even if some lines have been partially converted to tabs already, provided all \t characters start at a column divisible by 4.
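For comparison, the same loop can be written as a sed branch. This is only a sketch that swaps in sed's :a / ta loop for Perl's "1 while"; the function name leading_spaces_to_tabs is hypothetical, and it reads stdin rather than editing files in place.

```shell
# Hypothetical helper: turn each run of 4 leading spaces into a tab.
leading_spaces_to_tabs() {
  tab=$(printf '\t')
  # ":a" sets a label and "ta" jumps back to it while the substitution
  # keeps succeeding, much like Perl's "1 while s/.../.../".
  sed -e ':a' -e "s/^\(${tab}*\) \{4\}/\1${tab}/" -e 'ta'
}

printf '        x\n\t    y\n  z\n' | leading_spaces_to_tabs
```

The first line (8 spaces) should end up with two tabs, the second (tab plus 4 spaces) with two tabs, and the third (2 spaces) unchanged.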
I haven't seen the **/*(.) syntax before, presumably that's a zsh extension? If it worked with sed, it will work with perl.

From Mac:
find . -iname '*.swift' -type f -exec sed -i '' 's/[[:space:]]\{1,\}$//' {} \+
This removes all trailing whitespace from every Swift file under the current directory, recursively. Change the file pattern as needed.
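The question also asked to ignore the .git directory, which the find command above does not do. A minimal sketch using find's -prune, assuming a hypothetical helper name list_files_outside_git; it only lists candidate files and edits nothing:

```shell
# Hypothetical helper: list regular files under $1 whose name matches the
# glob $2, without descending into any .git directory.
list_files_outside_git() {
  find "$1" -name .git -prune -o -type f -name "$2" -print
}

# Usage sketch (a dry run that only prints the candidates):
# list_files_outside_git . '*.swift'
```

The same "-name .git -prune -o" clause can be placed in front of the "-type f ... -exec" part of the command above once the listing looks right.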

Related

Remove empty lines and trim/squeeze blanks from files recursively

I have a folder structure thus:
----project\
    ----datafolder1\
        ----file1.txt
        ----file2.txt
    ----datafolder2\
        ----file1.txt
        ----file2.txt
    ----file1.txt
    ----file2.txt
Each of the text files has lines that contain purely numerical data (integer and decimals) as well as other information that is unnecessary. These include:
blank spaces to start a line, between two numerical values of interest, before end of line, e.g.:
<blank><blank>43<blank><tab>73.5<blank><end of line>
I'd like the above to just be:
43<blank>73.5<end of line>
empty lines.
I'd like these empty lines to be removed so that all interesting data is on adjacent and contiguous lines.
lines with letters, e.g.:
---next line contains 50 customer data----
I want these to be removed as well.
Instead of doing these modifications manually, I'd like to automate this by a script that runs from project\ folder and recursively visits datafolder1 and then datafolder2, operates on the text files and then creates a modified text file with the above properties labelled modfile1.txt, modfile2.txt and so on.
Recursively visiting subfolders seems possible using the answer specified here. Using grep to find only lines that contain numbers seems possible according to answer here. However, that only works in case where each line of interest contains only a single number. In my case, a line of interest can contain multiple integers (positive or negative) and decimals separated by spaces. Finally, putting all of this together into a script is beyond my reach given my current knowledge of these tools. I am okay if all of this can be done in awk or .sh itself.
You can use awk to remove blank lines and lines that contain letters, trim leading and trailing spaces, and squeeze spaces between words as well.
# selects *.txt minus mod*.txt
find . -name '*.txt' ! -name 'mod*' -exec awk '
FNR == 1 {
close(fn)
fn = FILENAME
sub(/.*\//, "&mod", fn)
}
/[[:alpha:]]/ { next }
NF { $1 = $1; print > fn }' {} +
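Assigning $1 = $1 makes awk rebuild the record from its fields, joined by single spaces, which is what trims and squeezes the blanks; for details, see Ed's answer here.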
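To see the effect of $1 = $1 in isolation, here is a small sketch that reads stdin and writes stdout instead of creating mod*.txt files. The function name clean_numeric_lines is made up, and [A-Za-z] is used in place of [[:alpha:]] purely as a conservative choice for older awk builds.

```shell
# Hypothetical helper: keep only letter-free, non-empty lines and
# normalise their spacing.
clean_numeric_lines() {
  awk '
    /[A-Za-z]/ { next }      # skip lines containing letters
    NF { $1 = $1; print }    # rebuild the record from its fields, then print
  '
}

printf '  43 \t73.5 \n\n--- next 50 customers ---\n-7 0.25\n' | clean_numeric_lines
```

Only the first and last input lines survive, with leading, trailing and repeated blanks reduced to single spaces.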
Here is a similar version using sed.
find -name \*.txt -print -exec sh -c "sed -r '/(^\s*$|[[:alpha:]])/d ; s/\s+/ /g ; s/(^\s|\s$)//g' '{}' > '{}.mod'" \;
There is a small issue with naming the new files as modfile.txt .
The next time you run it, it will process modfile.txt and create modmodfile.txt .
Adding a .mod suffix will prevent the modified files from being processed.
/(^\s*$|[[:alpha:]])/d # delete blank lines or lines with alpha
s/\s+/ /g # replace multiple spaces with one space
s/(^\s|\s$)//g # replace space at the beginning or end of the line with nothing
While it can be done with awk as shown in the accepted answer, it can also be done with Perl:
find . -name '*.txt' -exec perl -i.bak -nle '
next unless ( /^[\s\d\.\-]+$/ && /\d/ ); # skip unwanted lines
s/\s+/ /g; # keep only single spaces
s/^\s+|\s+$//g; # trim whitespace at start and end
print' {} +
This uses -i.bak to do inplace replacement, saving your original files with a .bak extension.
The -l option adds a newline, because we trimmed any whitespace characters from the end (also removing \r (CR) characters in case the files came from Windows)
If it's important to keep the original file names, you could do something like this afterwards
find . -name "*.txt.bak" -print0 \
| while IFS= read -r -d '' f; do
mv "${f%%.bak}" "${f%%.txt.bak}-new.txt";
mv "$f" "${f%%.bak}"
done
The regex I use to throw away empty lines (here a whole row filled with spaces and tabs, even Unicode variants, counts as an empty line) is simply:
mawk1.3.4 'BEGIN { FS = "^$" } /[[:graph:]]/'
FS = "^$" prevents awk from wasting CPU on splitting fields you don't need.
A word of caution: stick with mawk 1.3.4 rather than gawk or mawk2.
PS:
The reason for ruling out GNU awk here is that, although gawk and mawk2 agree with each other on /[[:graph:]]/, my own testing found that both drop a number of Korean Hangul characters and some emoji in the 4-byte Unicode range.
Only mawk 1.3.4 seems to account for them correctly.
PS 2:
FS = "^$" is faster than FS = RS.

SED's Substituted string is considered as one-line string, whereas it contains newline character

I am testing the sed command to substitute one line with 3 lines and, then, to delete the last line. (I could have substituted it with only the 2 first lines, but this is deliberately stated like this to showcase the main issue).
Let's say that I have the following text :
// ##OPTION_NAME: xxxx
I want to replace the token ##OPTION_NAME by ##OP-NAME and surround it by 2 new lines; Like so :
// ##OP-START
// ##OP-NAME: xxxx
// ##OP-END
To illustrate this, I put this text in a code.c file, and the sed commands in a sed script named script.sed.
Then, I call the following shell command :
Shell command
sed -f script.sed code.c
script.sed
# Begin by replacing patterns by their equivalents, surrounding them with ##OP-START and ##OP-END lines
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g
The problem
Now I add another sed command to script.sed to delete the line containing ##OP-END. Surprise: all 3 lines are removed!
# Begin by replacing patterns by their equivalents, surrounding them with ##OP-START and ##OP-END lines
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g
# Last parse; delete ##OP-END
/##OP-END/d
I tried \r\n instead of \n in the substitution command
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g, but it does not work.
I also tested with ##OP-START to see if it makes a difference, but alas, all 3 lines were removed too.
It seems that sed treats them as a single line!
This is not a surprise: d operates on the pattern space, not on a per-line basis. After the modification by the s command, your pattern space contains 3 lines. Its content matches the address /##OP-END/, so the whole pattern space is deleted.
To delete only this line from the pattern space, you need to use the s command again:
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g
s/\n\/\/ ##OP-END//
About pattern and hold space: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html#tag_20_116_13
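The difference between d and a second s can be checked with a short experiment. This is a sketch under a few assumptions: the helper names expand_then_d and expand_then_s are made up, and the replacement uses the POSIX backslash-newline form instead of \n so that it does not depend on GNU sed.

```shell
# The substitution from the question, with each newline in the
# replacement written as backslash-newline (the POSIX spelling).
expand='s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\
\1##OP-NAME:\2\
\1##OP-END/'

# d discards the whole three-line pattern space, so nothing is printed.
expand_then_d() { sed -e "$expand" -e '/##OP-END/d'; }

# A second s removes only the embedded last line, so two lines remain.
expand_then_s() { sed -e "$expand" -e 's/\n\/\/ ##OP-END//'; }

printf '// ##OPTION_NAME: xxxx\n' | expand_then_d
printf '// ##OPTION_NAME: xxxx\n' | expand_then_s
```

The first pipeline should print nothing, and the second should print the ##OP-START and ##OP-NAME lines only.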

sed error unterminated substitute pattern for new line text

I am writing a script to add new dependencies to the watch list. I put a placeholder in the file to mark where the text should be added, for example:
assets = [
"../../new_app/assets"
# [[NEW_APP_ADD_ASSETS]]
]
It is simple to replace just the placeholder, but my problem is adding a comma to the previous line.
That could be done if I search for and replace
"
# [[NEW_APP_ADD_ASSETS]]
ie "\n # [[NEW_APP_ADD_ASSETS]]
I am not able to search for the new line.
One of the solutions I found for adding a new line was
sed -i '' 's/newline/line one\
line two/' filename.txt
But when I do the same in the search string, sed returns: unterminated substitute pattern
sed -i '' s/'assets\"\
#'/'some new text'/ filename.txt
PS: I am working on macOS.
Sed works on a line-by-line basis, so it becomes tricky to add the comma to the previous line, as that line has already been processed. It is possible, but the sed syntax quickly becomes messy.
To be a bit more specific:
In default operation, sed cyclically shall append a line of input, less its terminating <newline> character, into the pattern space. Reading from input shall be skipped if a <newline> was in the pattern space prior to a D command ending the previous cycle. The sed utility shall then apply in sequence all commands whose addresses select that pattern space, until a command starts the next cycle or quits. If no commands explicitly started a new cycle, then at the end of the script the pattern space shall be copied to standard output (except when -n is specified) and the pattern space shall be deleted. Whenever the pattern space is written to standard output or a named file, sed shall immediately follow it with a <newline>.
In short, if you do not manipulate the pattern space, you cannot process <newline> characters as they just do not appear!
And even shorter, if you only use the substitute command, sed only processes one line at a time!
This is also why you get the unterminated substitute pattern error. You are searching for a literal newline, but since sed reads one line at a time, it never finds one and does not expect one inside the s command either. The error vanishes if you replace your newline with the escape \n, although with only one line in the pattern space the pattern still cannot match:
sed -i '' s/'assets\"\n #'/'some new text'/ filename.txt
A better way to achieve your goals would be to make use of awk. It is a bit more readable:
awk '/# \[\[NEW_APP_ADD_ASSETS\]\]/{ print t","; t="line1\nline2"; next }
NR > 1 { print t }
{ t = $0 }
END{ print t }' <file>
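For completeness, the "messy" sed route mentioned above can be sketched with a two-line sliding window (N, P, D), so that the newline actually appears in the pattern space. This is an illustration, not a drop-in replacement: the function name add_comma_before_placeholder is made up, and it only appends the comma, leaving the placeholder line in place.

```shell
# Hypothetical helper: append a comma to the quoted line that directly
# precedes the placeholder line.
add_comma_before_placeholder() {
  # $!N joins the next line onto the current one, s edits across the
  # embedded newline, P prints the first line, D drops it and restarts.
  sed -e '$!N' \
      -e 's/"\n\([[:space:]]*# \[\[NEW_APP_ADD_ASSETS]]\)/",\
\1/' \
      -e 'P' -e 'D'
}

printf 'assets = [\n  "../../new_app/assets"\n  # [[NEW_APP_ADD_ASSETS]]\n]\n' |
  add_comma_before_placeholder
```

It assumes the previous line ends with a double quote immediately before the newline; trailing blanks after the quote would need an extra [[:space:]]* in the pattern.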

Replace All first 4 spaces with a tab

I am doing some documentation work, and I have a tree structure like this:
A
    BB
    C C
        DD
How can I replace just all the occurrences of 2 spaces in the head of the line with '-', like:
A
--BB
--C C
----DD
I have tried sed 's/  /-/g', but this replaces all occurrences of 2 spaces, including those inside a line; I also tried sed 's/^  /-/g', but this replaces only the first occurrence of 2 spaces. How can I do this?
The regular expression for four spaces at beginning of line is /^    / where I put the slashes just to demarcate the expression (they are not part of the actual regular expression, but they are used as delimiters by sed).
sed 's/^    /\t/' file
In recent sed versions, you can add an -i option to modify file in-place (that is, sed will replace the file with the modified file); on *BSD (including OSX), you need -i '' with an empty option argument.
The \t escape code for tab is also not universally supported; if that is a problem, your shell probably allows you to type a literal tab by prefixing it with ctrl-V.
(Your question title says "tab" but your question asks about dashes. To replace with two dashes, replace \t in the replacement part of the script with --, obviously.)
If you are trying to generalize to "any groups of two spaces at beginning of line should be replaced by a dash", this is not impossible to do in sed, but I would recommend Perl instead:
perl -pe 's%^((?:  )+)% "-" x (length($1) / 2)%e' file
This captures the match into $1; the inner parenthesized expression matches two spaces and the + quantifier says to match that as many times as possible. The /e flag allows us to use Perl code in the replacement; this piece of code repeats the character "-" as many times as the captured expression was repeated, which is conveniently equal to half its length.
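If Perl is not available, the same "one dash per leading pair of spaces" rule can be approximated in awk. This is a sketch with a made-up function name, indent_to_dashes; it uses match() and RLENGTH instead of Perl's /e flag, and the sample input assumes four spaces per indentation level.

```shell
# Hypothetical helper: replace every complete pair of leading spaces
# with a single dash.
indent_to_dashes() {
  awk '{
    # n = number of complete two-space pairs at the start of the line
    n = match($0, /^([ ][ ])+/) ? RLENGTH / 2 : 0
    dashes = ""
    for (i = 0; i < n; i++) dashes = dashes "-"
    print dashes substr($0, 2 * n + 1)
  }'
}

printf 'A\n    BB\n    C C\n        DD\n' | indent_to_dashes
```

Spaces inside a line, such as the one in "C C", are left alone because the pattern is anchored at the start of the line.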

How to fix inconsistent line endings for whole VS solution?

Visual Studio will detect inconsistent line endings when opening a file and there is an option to fix it for that specific file. However, if I want to fix line endings for all files in a solution, how do I do that?
Just for a more complete answer, this worked best for me:
Replace
(?<!\r)\n
with
\r\n
in entire solution with "regEx" option.
This will set the correct line ending in all files that did not have it so far. It uses a negative lookbehind to check that there is no \r in front of the \n.
Be careful with the other solutions: They will either modify all lines in all files (ignoring the original line ending) or delete the last character of each line.
You can use the Replace in Files command and enable regular expressions. For example, to replace end-of-lines that consist of a single linefeed "\n" (as in files from GitHub, for example) with the standard Windows carriage-return linefeed "\r\n", search for:
([^\r]|^)\n
This says to create a group (that's why the parentheses are required), where the first character is either not a carriage-return or is the beginning of a line. The beginning of the line test is really only for the very beginning of the file, if it happens to start with a "\n". Following the group is a newline. So, you will match ";\n" which has the wrong end-of-line, but not "\r\n" which is the correct end-of-line.
And replace it with:
$1\r\n
This says to keep the group ($1) and then replace the "\n" with "\r\n".
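Outside Visual Studio, the same normalisation can be approximated from a Unix-style shell such as Cygwin. This is a sketch only, with a made-up function name, to_crlf: rather than using a lookbehind, it strips any existing \r at the end of each line and then writes every line back with \r\n, which gives the same result for mixed files.

```shell
# Hypothetical helper: make every line end in CRLF, whether it
# originally ended in LF or CRLF.
to_crlf() {
  awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}

# Show carriage returns as "@" so the result is visible in a terminal.
printf 'a\nb\r\n' | to_crlf | tr '\r' '@'
```

It reads stdin and writes stdout; applying it to real files would mean writing to a temporary file and moving it over the original.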
Try doing
Edit > Advanced > Format Document
Then save the document. As long as the file is not modified by another external editor, it should stay consistent. This fixes it for me.
If you have Cygwin with the cygutils package installed, you can use this chain of commands from the Cygwin shell:
unix2dos -idu *.cpp | sed -e 's/ 0 [1-9][0-9]//' -e 's/ [1-9][0-9]* 0 //' | sed '/ [1-9][0-9] / !d' | sed -e 's/ [1-9][0-9 ] //' | xargs unix2dos
(Replace the *.cpp with whatever wildcard you need)
To understand how this works, the unix2dos command is used to convert the files, but only files that have inconsistent line endings (i.e., a mixture of UNIX and DOS) need to be converted. The -idu option displays the number of dos and unix line endings in the file. For example:
0 491 Acad2S5kDim.cpp
689 0 Acad2S5kPolyline.cpp
0 120 Acad2S5kRaster.cpp
433 12 Acad2S5kXhat.cpp
0 115 AppAuditInfo.cpp
Here, only the Acad2S5kXhat.cpp file needs to be converted. The sed commands filter the output to produce a list of just the files that need to be converted, and these are then processed via xargs.
