Remove empty lines and trim/squeeze blanks from files recursively - bash

I have a folder structure thus:
----project\
    ----datafolder1\
        ----file1.txt
        ----file2.txt
    ----datafolder2\
        ----file1.txt
        ----file2.txt
    ----file1.txt
    ----file2.txt
Each of the text files has lines that contain purely numerical data (integers and decimals) as well as other information that is unnecessary. These include:
blank spaces to start a line, between two numerical values of interest, before end of line, e.g.:
<blank><blank>43<blank><tab>73.5<blank><end of line>
I'd like the above to just be:
43<blank>73.5<end of line>
empty lines.
I'd like these empty lines to be removed so that all interesting data is on adjacent and contiguous lines.
lines with letters, e.g.:
---next line contains 50 customer data----
I want these to be removed as well.
Instead of doing these modifications manually, I'd like to automate this by a script that runs from project\ folder and recursively visits datafolder1 and then datafolder2, operates on the text files and then creates a modified text file with the above properties labelled modfile1.txt, modfile2.txt and so on.
Recursively visiting subfolders seems possible using the answer specified here. Using grep to find only lines that contain numbers seems possible according to answer here. However, that only works in case where each line of interest contains only a single number. In my case, a line of interest can contain multiple integers (positive or negative) and decimals separated by spaces. Finally, putting all of this together into a script is beyond my reach given my current knowledge of these tools. I am okay if all of this can be done in awk or .sh itself.

You can use awk to remove blank lines and lines that contain letters, trim leading and trailing spaces, and squeeze spaces between words as well.
# selects *.txt minus mod*.txt
find . -name '*.txt' ! -name 'mod*' -exec awk '
FNR == 1 {
close(fn)
fn = FILENAME
sub(/.*\//, "&mod", fn)
}
/[[:alpha:]]/ { next }
NF { $1 = $1; print > fn }' {} +
Wrt how $1 = $1 works, see Ed's answer here.
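To see what the three awk rules do to a single stream, here is a minimal sketch that applies the same filter to standard input instead of writing mod*.txt files. The function name clean_numeric_lines and the sample input are made up for illustration. Assigning $1 = $1 makes awk rebuild the record from its fields, joined by the default output separator (one space), which is what trims and squeezes the blanks.

```shell
# Hypothetical helper: same rules as the answer above, but reading stdin
# and printing to stdout rather than to a "mod"-prefixed file.
clean_numeric_lines() {
    awk '
        /[[:alpha:]]/ { next }          # drop lines containing letters
        NF { $1 = $1; print }           # non-empty: rebuild the record, then print
    '
}

# Sample input: padded data line, empty line, text line, second data line.
printf '  43 \t73.5 \n\n--- next line contains 50 customer data ---\n-1  2.5\n' |
    clean_numeric_lines
```

Lines whose NF is zero (empty or blank-only) never reach the print, so they disappear without a separate rule.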

Here is a similar version using sed.
find -name \*.txt -print -exec sh -c "sed -r '/(^\s*$|[[:alpha:]])/d ; s/\s+/ /g ; s/(^\s|\s$)//g' '{}' > '{}.mod'" \;
There is a small issue with naming the new files modfile.txt.
The next time you run it, it will process modfile.txt and create modmodfile.txt.
Adding a .mod suffix will prevent the modified files from being processed.
/(^\s*$|[[:alpha:]])/d # delete blank lines or lines with alpha
s/\s+/ /g # replace multiple spaces with one space
s/(^\s|\s$)//g # replace space at the beginning or end of the line with nothing
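The three sed steps can be tried on their own with a small sketch like the one below. It is not the command from the answer: it swaps \s for the POSIX class [[:space:]] and uses -E, on the assumption that this form is accepted by both GNU and BSD sed. The function name squeeze_blanks is hypothetical.

```shell
# Hypothetical helper: delete blank/alpha lines, squeeze runs of
# whitespace to one space, then trim one leading and one trailing space.
squeeze_blanks() {
    sed -E '
        /^[[:space:]]*$/d
        /[[:alpha:]]/d
        s/[[:space:]]+/ /g
        s/^ //
        s/ $//
    ' "$@"
}

printf '  43 \t73.5 \n\n   \nsome text 50\n' | squeeze_blanks
```

Because the squeeze step runs first, at most one space can remain at either end of the line, which is why the trim steps only need to remove a single character.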

While it can be done with awk as shown in the accepted answer, it can also be done with Perl:
find . -name '*.txt' -exec perl -i.bak -nle '
next unless ( /^[\s\d\.\-]+$/ && /\d/ ); # skip unwanted lines
s/\s+/ /g; # keep only single spaces
s/^\s+|\s+$//g; # trim whitespace at start and end
print' {} +
This uses -i.bak to do inplace replacement, saving your original files with a .bak extension.
The -l option adds a newline, because we trimmed any whitespace characters from the end (also removing \r (CR) characters in case the files came from Windows)
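The two-part test in the Perl filter (only whitespace, digits, dots and minus signs, and at least one digit) can be sketched with two grep calls. This is an illustration of the selection logic only, not a replacement for the in-place Perl command; the function name numeric_lines is made up.

```shell
# Hypothetical helper: keep lines made up solely of whitespace, digits,
# '.' and '-', and then require at least one digit on the line.
numeric_lines() {
    grep -hE '^[[:space:][:digit:].-]+$' "$@" | grep '[[:digit:]]'
}

# " - . " passes the first test but has no digit, so the second grep drops it.
printf '43 73.5\n - . \nabc 1\n-7\n' | numeric_lines
```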
If it's important to keep the original file names, you could do something like this afterwards
find . -name "*.txt.bak" -print0 \
| while IFS= read -r -d '' f; do
mv "${f%%.bak}" "${f%%.txt.bak}-new.txt";
mv "$f" "${f%%.bak}"
done

To throw away empty lines (here a whole row filled only with spaces and tabs, or even their Unicode variants, also counts as empty), I just use:
mawk1.3.4 'BEGIN { FS = "^$" } /[[:graph:]]/'
Setting FS = "^$" prevents awk from wasting CPU splitting fields you don't need.
A word of caution: stick with mawk 1.3.4 for this rather than gawk or mawk2.
PS:
The reason for ruling out gawk here is that, although gawk and mawk2 agree with each other on /[[:graph:]]/, my own testing found that both drop some Korean Hangul and some emoji in the 4-byte Unicode range.
Only mawk 1.3.4 seems to account for them correctly.
PS2:
FS = "^$" is faster than FS = RS.

Related

How to extract data from content of files, including a string and ignoring new line

I want to retrieve service names from a project whole directory.
All service calls start with specific pattern:
getService().serviceName1()
getService().service2()
getService().
thirdSName()
Notice how the last match above is folded over two lines; the first line matches the pattern but the service name is wrapped onto the following line.
My solution:
grep -r "getService" *
Expected report:
serviceName1
service2
thirdSName
But my grep results are incomplete because they don't include the last service name.
grep can't easily choose how much to show of a match if it stretches across multiple lines; there is only a facility for specifying a fixed number of context lines before or after the match.
If your needs are pedestrian, maybe try something like this simple Awk script.
find . -type f -exec awk '/getService/ || more {
print FILENAME ":" NR ":" $0; more = ($0 ~ /\.[ \t]*$/) }' {} +
This simply checks whether the last non-whitespace character is a dot, and if so, selects the following line(s) for printing as well.
If your requirements are less modest, a parser for the programming language used in these files is probably the way to go. If your requirements are only marginally less modest, maybe the Awk approach can be stretched for a tiny little bit more.
(The find wrapper is because Awk doesn't have a -r option for traversing a directory tree.)
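If only the service names are wanted, as in the expected report, one way to stretch the Awk approach is to read the whole input into one string and match across line breaks. The following is a minimal sketch under that assumption; the function name extract_services is hypothetical, and it treats a service name as a plain identifier (letters, digits, underscore).

```shell
# Hypothetical helper: print the identifier that follows "getService()."
# even when whitespace or a line break sits between the dot and the name.
extract_services() {
    awk '
        { text = text $0 "\n" }     # accumulate all input into one string
        END {
            while (match(text, /getService\(\)\.[ \t\r\n]*[A-Za-z_][A-Za-z_0-9]*/)) {
                hit = substr(text, RSTART, RLENGTH)
                sub(/^getService\(\)\.[ \t\r\n]*/, "", hit)   # keep only the name
                print hit
                text = substr(text, RSTART + RLENGTH)          # continue after the match
            }
        }' "$@"
}

printf 'getService().serviceName1()\ngetService().service2()\ngetService().\n  thirdSName()\n' |
    extract_services
```

When several files are passed, their contents are simply concatenated, so this reports names without file or line information.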
Here is a find command combined with a Python script:
#!/usr/bin/env bash
pyscript="$(cat <<'EOF'
import re,sys
print(
*re.findall( # Find all the occurrences
r'(?:getService\(\)\.\s*)(\w+)', # regex with non-capturing group
open(sys.argv[1]).read(), # Open and read whole file
re.DOTALL), # Make the '.' special character match any character at all
sep="\n" # print with element on its own line.
)
EOF
)"
find . -type f -exec python -c "$pyscript" {} \;

replace any character between certain pattern in multiple blocks

I have a problem similar to this one, but I can't adapt to my case.
Say I have a file with many of these lines:
f 1/1/519 2/2/2 3/3/520
f 287/4/521 1/5/519 3/6/520
f 5/7/522 1/8/523 287/9/524
I want to replace the content between the two slashes (number/anyNumber/number) of each block.
I would like to have the following result:
f 1//519 2//2 3//520
f 287//521 1//519 3//520
f 5//522 1//523 287//524
What is the correct sed (or anything else) command?
Using MacOS.
$ sed 's:/[^/]*/://:g' file
f 1//519 2//2 3//520
f 287//521 1//519 3//520
f 5//522 1//523 287//524
Easy enough making this pattern-based in sed:
sed 's#/[0-9]*/#//#g' input.txt
This matches any stretch of zero or more digits between two slashes, and replaces the whole bundle with two slashes.
In awk, you might do the same thing this way:
awk '{gsub(/\/[0-9]*\//,"//")} 1' input.txt
The gsub() command is documented on the awk man page. The 1 at the end is a shortcut for "print this line". But you might alternately treat the fields as actual fields:
awk '{for (i=2;i<=NF;i++) {split($i,a,"/"); $i=sprintf("%s//%s",a[1],a[3])} } 1' input.txt
This is more technically "right" in that it treats fields as fields, then treats subfields as subfields. But it'll undoubtedly be slower than the other options, and will also rewrite lines with OFS as field separators.
Lastly, you could use bash alone, without awk or sed:
shopt -s extglob
while read; do echo "${REPLY//\/+([0-9])\////}"; done < input.txt
This works in bash version 3 (since you're using macOS). It reads each line of input then uses Parameter Expansion to make the same translation that was done in the first two options. This solution is likely slower than the others. The extglob shell option is used to make more advanced patterns possible.
Could you please try the following.
awk '{for(i=1;i<=NF;i++){gsub(/\/.*\//,"//",$i)}} 1' Input_file
Output will be as follows.
f 1//519 2//2 3//520
f 287//521 1//519 3//520
f 5//522 1//523 287//524
The answer is rather simple:
sed -e 's/\([0-9]\+\/\)[0-9]\+\(\/[0-9]\+\)/\1\2/g' file.txt > mod.txt
Putting something in escaped parentheses, \( and \), lets you refer to it later. Here the first capture group remembers the digits before the first slash plus the slash itself, the middle of the pattern matches the number between the slashes, and the second capture group remembers the second slash and the digits after it. The whole matched string is then replaced with the first and second capture groups, discarding everything else. Note that \+ is a GNU extension to basic regular expressions, so this may not work with the stock sed on macOS.
Switch g makes sed operate on every matching occurrence.
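The capture-group idea can also be written with extended regular expressions, which avoids the backslash-heavy syntax. This is a sketch rather than the answer's own command, and it assumes a sed that accepts -E (GNU sed and the BSD sed on macOS both do, as far as I know). The function name drop_middle_index is made up.

```shell
# Hypothetical helper: keep "digits/" (group 1) and "/digits" (group 2),
# dropping the number between the two slashes.
drop_middle_index() {
    sed -E 's#([0-9]+/)[0-9]+(/[0-9]+)#\1\2#g' "$@"
}

printf 'f 1/1/519 2/2/2 3/3/520\nf 287/4/521 1/5/519 3/6/520\n' | drop_middle_index
```

Using # as the delimiter means the slashes in the pattern need no escaping.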

How to remove first and last folder in 'find' result output?

I want to search for folders by a part of their name that I know and that is common among these kinds of folders. I used the 'find' command in a bash script like this:
find . -type d -name "*.hg"
This just prints the whole path from the current directory to the found folder itself. The folder name contains '.hg'. Then I tried the 'sed' command, but I couldn't address the last part of the path. I decided to get the folder name ending in .hg, save it in a variable, and then use 'sed' to remove the last directory from the output. I used these to get the last part and tried to save the result to a variable, with no luck.
find . -type d -name "*.hg"|sed 's/*.hg$/ /'
find . -type d -name "*.hg"|awk -F/ '{print $NF}'
This just prints the file names, here the folders with .hg at the end.
Then I tried a different approach:
for i in $(find . -type d -name '*.hg' );
do
$DIR = $(dirname ${i})
echo $DIR
done
This didn't work either. Can anyone give me a hint to make this work?
and yes it's homework.
You could use parameter expansion:
d=path/to/my/dir
d="${d#*/}" # remove the first dir
d="${d%/*}" # remove the last dir
echo $d # "to/my"
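Applied to paths in the form find prints them, the two expansions can be wrapped in a small function. This is a minimal sketch; the function name strip_first_last and the sample path are made up for illustration.

```shell
# Hypothetical helper: remove the first and the last path component.
strip_first_last() {
    local d=$1
    d=${d#*/}    # shortest prefix ending in "/" is removed (here "./")
    d=${d%/*}    # shortest suffix starting with "/" is removed (the .hg folder)
    printf '%s\n' "$d"
}

strip_first_last "./projects/web/repo.hg"   # prints: projects/web
```

For a path with a single slash, such as ./repo.hg, the second expansion finds no slash left to match and the name is printed unchanged, so that case would need separate handling.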
One problem you have is with the pattern in your sed script: bash and the find command use a different pattern language from sed.
They use simple glob patterns, where * means any number of any characters and ? means any single character. The sed command uses a much richer regular expression language, where * means any number of repetitions of the previous character and . means any character (there's a lot more to it than that).
So to remove the last component of the path delivered by find you will need to use the following sed command: sed -e 's,/[^/]*.hg,,'
Alternatively, you could use the dirname command. Pipe the output of the find command to xargs (which runs a command, passing its standard input as arguments to that command):
xargs -I{} dirname {}
@Pamador - that's strange. It works for me. Just to explain: the sed command needs to be quoted in single quotes to protect against any unwanted shell expansions. The character following the 's' is a comma; what we're doing here is changing the character that sed uses to separate the parts of the substitute command, which means that we can use the slash character without having to escape it with a preceding backslash. The next part matches any sequence of characters apart from a slash, followed by any character and then hg. Honestly, I should have anchored the pattern to the end of the line with a $, but apart from that it's fine.
I tested it with
echo "./abc/xxx.hg" | sed -e 's,/[^/]*\.hg$,,'
And it printed ./abc
Did I misunderstand what you wanted to do?
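The difference between the two pattern languages can be seen side by side in a short sketch. The function name strip_hg_component is hypothetical; the sed expression is the anchored form with an escaped dot.

```shell
# Shell/find glob: * alone stands for any run of characters.
case "xxx.hg" in
    *.hg) echo "glob *.hg matches xxx.hg" ;;
esac

# sed regex: [^/]* is any run of non-slash characters, \. is a literal
# dot, and $ anchors the match to the end of the line.
strip_hg_component() {
    sed -e 's,/[^/]*\.hg$,,'
}

echo "./abc/xxx.hg" | strip_hg_component    # prints: ./abc
```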
find . -type d -name "*.hg" | awk -v m=1 -v n=1 'NR<=m{};NR>n+m{print line[NR%n]};{line[NR%n]=$0}'
awk parameters:
m = number of lines to remove from beginning of output
n = number of lines to remove from end of output
Bonus: If you wanted to remove 1 line from the end and you have coreutils installed, you could do this: find . -type d -name "*.hg" | ghead -n -1

How to find/replace and increment a matched number with sed/awk?

Straight to the point, I'm wondering how to use grep/find/sed/awk to match a certain string (that ends with a number) and increment that number by 1. The closest I've come is to concatenate a 1 to the end (which works well enough) because the main point is to simply change the value. Here's what I'm currently doing:
find . -type f | xargs sed -i 's/\(\?cache_version\=[0-9]\+\)/\11/g'
Since I couldn't figure out how to increment the number, I captured the whole thing and just appended a "1". Before, I had something like this:
find . -type f | xargs sed -i 's/\?cache_version\=\([0-9]\+\)/?cache_version=\11/g'
So at least I understand how to capture what I need.
Instead of explaining what this is for, I'll just explain what I want it to do. It should find text in any file, recursively, based on the current directory (isn't important, it could be any directory, so I'd configure that later), that matches "?cache_version=" with a number. It will then increment that number and replace it in the file.
Currently the stuff I have above works, it's just that I can't increment that found number at the end. It would be nicer to be able to increment instead of appending a "1" so that the future values wouldn't be "11", "111", "1111", "11111", and so on.
I've gone through dozens of articles/explanations, and often enough, the suggestion is to use awk, but I cannot for the life of me mix them. The closest I came to using awk, which doesn't actually replace anything, is:
grep -Pro '(?<=\?cache_version=)[0-9]+' . | awk -F: '{ print "match is", $2+1 }'
I'm wondering if there's some way to pipe a sed at the end and pass the original file name so that sed can have the file name and incremented number (from the awk), or whatever it needs that xargs has.
Technically, this number has no importance; this replacement is mainly to make sure there is a new number there, 100% for sure different than the last. So as I was writing this question, I realized I might as well use the system time - seconds since epoch (the technique often used by AJAX to eliminate caching for subsequent "identical" requests). I ended up with this, and it seems perfect:
CXREPLACETIME=`date +%s`; find . -type f | xargs sed -i "s/\(\?cache_version\=\)[0-9]\+/\1$CXREPLACETIME/g"
(I store the value first so all files get the same value, in case it spans multiple seconds for whatever reason)
But I would still love to know the original question, on incrementing a matched number. I'm guessing an easy solution would be to make it a bash script, but still, I thought there would be an easier way than looping through every file recursively and checking its contents for a match then replacing, since it's simply incrementing a matched number...not much else logic. I just don't want to write to any other files or something like that - it should do it in place, like sed does with the "i" option.
I think finding the files isn't the difficult part for you, so I'll go straight to the point: the +1 calculation. If you have GNU sed, it can be done this way:
sed -r 's/(.*)(\?cache_version=)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' file
let's take an example:
kent$ cat test
ello
barbaz?cache_version=3fooooo
bye
kent$ sed -r 's/(.*)(\?cache_version=)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' test
ello
barbaz?cache_version=4fooooo
bye
You can add the -i option if you like.
Edit:
The e flag lets you pass the matched part to an external command and substitute the result of its execution. GNU sed only.
See this example, in which the external tools echo and bc are used:
kent$ echo "result:3*3"|sed -r 's/(result:)(.*)/echo \1$(echo "\2"\|bc)/ge'
gives output:
result:9
You could use other powerful external commands, like cut, sed (again), awk...
Pure sed version:
This version has no dependencies on other commands or environment variables.
It uses explicit carrying. For carry I use the # symbol, but another name can be used if you like. Use something that is not present in your input file.
First it finds SEARCHSTRING<number> and appends a # to it.
It repeats incrementing digits that have a pending carry (that is, have a carry symbol after it: [0-9]#)
If 9 was incremented, this increment yields a carry itself, and the process will repeat until there are no more pending carries.
Finally, carries that were yielded but not added to a digit yet are replaced by 1.
sed "s/SEARCHSTRING[0-9]*[0-9]/&#/g;:a {s/0#/1/g;s/1#/2/g;s/2#/3/g;s/3#/4/g;s/4#/5/g;s/5#/6/g;s/6#/7/g;s/7#/8/g;s/8#/9/g;s/9#/#0/g;t a};s/#/1/g" numbers.txt
This perl command processes all files in the current directory (without traversing subdirectories; you would need the File::Find module or similar for that more complex task) and increments the number on every line that matches cache_version=. It uses the /e flag of the substitution, which evaluates the replacement part as code.
perl -i.bak -lpe 'BEGIN { sub inc { my ($num) = @_; ++$num } } s/(cache_version=)(\d+)/$1 . (inc($2))/eg' *
I tested it with file in current directory with following data:
hello
cache_version=3
bye
It backs up the original file (ls -1):
file
file.bak
And file now contains:
hello
cache_version=4
bye
I hope it can be useful for what you are looking for.
UPDATE: use File::Find to traverse directories. The command still accepts * as an argument, but those arguments are discarded and replaced with the files found by File::Find. The search starts in the directory where the script is executed; this is hardcoded in the line find( \&wanted, "." ).
perl -MFile::Find -i.bak -lpe '
BEGIN {
sub inc {
my ($num) = @_;
++$num
}
sub wanted {
if ( -f && ! -l ) {
push @ARGV, $File::Find::name;
}
}
@ARGV = ();
find( \&wanted, "." );
}
s/(cache_version=)(\d+)/$1 . (inc($2))/eg
' *
This is ugly (I'm a little rusty), but here's a start using sed:
orig="something1" ;
text=`echo $orig | sed "s/\([^0-9]*\)\([0-9]*\)/\1/"` ;
num=`echo $orig | sed "s/\([^0-9]*\)\([0-9]*\)/\2/"` ;
echo $text$(($num + 1))
With an original filename ($orig) of "something1", sed splits off the text and numeric portions into $text and $num, then these are combined in the final section with an incremented number, resulting in something2.
Just a start since it doesn't consider cases with numbers within the file name or names with no number at the end, but hopefully helps with your original goal of using sed.
This can actually be simplified within sed by using buffers, I believe (sed can operate recursively), but I'm really rusty with that aspect of it.
perl -pi -e 's/(\?cache_version=)(\d+)/$1.($2+1)/ge' FILE [FILE...]
or for a complete solution:
find . -type f | xargs perl -pi -e 's/(\?cache_version=)(\d+)/$1.($2+1)/ge'
perl substitution operator
/e modifier evaluates the replacement as if it were a Perl statement, using its return value as the replacement text.
. operator concatenates strings in Perl. The parentheses ensures that the arithmetic operation $2+1 takes precedence over concatenation.
/g modifier applies substitution to all matched strings within line
perl options
-p ensures that perl will execute the command on every line of each file
-i ensures that each file will be edited inplace
-e specifies the perl command(s) that are executed (in this case, the substitution operation)
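Since the question mentions awk, the same compute-the-replacement idea can be sketched there as well. This is an illustration under stated assumptions, not an in-place editor: it filters a stream, so editing a file would mean writing to a temporary file and moving it over the original. The function name bump_cache_version is made up.

```shell
# Hypothetical helper: add 1 to every number that follows "?cache_version=".
bump_cache_version() {
    awk '
        BEGIN { key = "?cache_version=" }
        {
            out = ""
            # Consume the line match by match, rebuilding it in "out".
            while (match($0, /[?]cache_version=[0-9]+/)) {
                num = substr($0, RSTART + length(key), RLENGTH - length(key))
                out = out substr($0, 1, RSTART - 1) key (num + 1)
                $0 = substr($0, RSTART + RLENGTH)
            }
            print out $0
        }' "$@"
}

printf 'ello\nbarbaz?cache_version=3fooooo\nbye\n' | bump_cache_version
```

Lines without a match fall straight through the loop and are printed unchanged.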

How to clean a codebase, trailing whitespace, new lines etc

I have a code base that is driving me nuts with conflicts due to trailing whitespace. I'd like to clean it up.
I'd want to:
Remove all trailing whitespace
Remove any newline characters at the end of files
Convert all line endings to unix (dos2unix)
Convert all leading spaces to tabs, ie 4 spaces to tabs.
While ignoring the .git directory.
I'm on OSX Snow Leopard, and in zsh.
So far, I have:
sed -i "" 's/[ \t]*$//' **/*(.)
which works great, but sed adds a newline to the end of every file it touches, which is no good. I don't think sed can be stopped from doing this, so how can I remove these newlines? There's probably some awk magic to be applied here.
(Complete answers also welcome)
[EDIT: Fixed whitespace trimming]
[EDIT #2: Strip trailing blank lines from end of file]
perl -i.bak -pe 'if (defined $x && /\S/) { print $x; $x = ""; } $x .= "\n" x chomp; s/\s*?$//; 1 while s/^(\t*) /$1\t/; if (eof) { print "\n"; $x = ""; }' **/*(.)
This strips trailing blank lines from the file, but leaves exactly one \n at the end of the file. Most tools expect this, and it will not show up as a blank line in most editors. However if you do want to strip that very last \n, just delete the print "\n"; part from the command.
The command works by "saving up" \n characters until a line containing a non-blank character is seen -- then it prints them all before processing that line.
Remove .bak to avoid creating backups of the original files (use at your own risk!)
\s*? matches zero or more whitespace characters non-greedily, including \r, which is the first character of the \r\n DOS line-break syntax. In Perl, $ matches either at the end of the line, or immediately before a final \n, so combined with the fact that *? matches non-greedily (trying a 0-width match first, then a 1-width match and so on) it does the right thing.
1 while s/^(\t*) /$1\t/ is just a loop that repeatedly replaces any lines beginning with any number of tabs followed by 4 spaces with one more tab than there was, until this is no longer possible. So it will work even if some lines have been partially converted to tabs already, provided all \t characters start at a column divisible by 4.
I haven't seen the **/*(.) syntax before, presumably that's a zsh extension? If it worked with sed, it will work with perl.
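The "saving up" of blank lines described above can also be sketched as a stream filter built from sed and awk. This is a rough equivalent for the whitespace and trailing-blank-line parts only (no space-to-tab conversion, no in-place editing); the function name clean_stream is hypothetical.

```shell
# Hypothetical helper: strip trailing whitespace (including the \r of
# DOS line endings) and drop blank lines at the end of the input.
clean_stream() {
    sed -E 's/[[:space:]]+$//' |
    awk '
        NF == 0 { pending++; next }     # hold back blank lines for now
        {
            # A non-blank line arrived, so the held blank lines were not trailing.
            while (pending > 0) { print ""; pending-- }
            print
        }'
}

printf 'x  \n\n y\t\r\n\n\n' | clean_stream
```

Blank lines that are never followed by a non-blank line are simply never printed, which leaves exactly one newline after the last line of content.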
On a Mac:
find . -iname '*.swift' -type f -exec sed -i '' 's/[[:space:]]\{1,\}$//' {} \+
This removes all trailing whitespace from all Swift files under the current directory, recursively. You can change the file type as needed.

Resources