How to find mis-spelled words in a bunch of files - shell

I have around 10k Java files, and I need to find mis-spelled words in the strings that appear in double quotes in those files.
The following gives me the strings in double quotes:
find . -name "*.java" -exec grep -Po '".*?"' {} \;
But I do not know how to run a spell checker on top of this.

I only have Linux and ispell available, so if you are not on Linux, the following might not work for you as-is. If you just want to find mis-spelled words and get replacement proposals listed, you could use
find . -name "*.java" -exec grep -Po '"([^"\\]|\\.)*"' {} \; \
| ispell -a -S
The -a option selects pipe mode; -S disables sorting of the guesses, which tends to list better replacements first.
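If all you want is a deduplicated list of the misspelled words themselves, a minimal sketch along these lines may also work; it assumes GNU grep plus the classic spell(1) front end (if installed), and simply splits the quoted strings into words:
# collect the quoted strings, split them into words, deduplicate, spell-check
find . -name "*.java" -exec grep -Poh '"([^"\\]|\\.)*"' {} + \
| tr -cs '[:alpha:]' '\n' \
| sort -u \
| spell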
If you want to fix the strings in-place, then you may want to use something like
TEMP=$(mktemp)
find . -name "*.java" | xargs grep -l '"...*"' \
| xargs echo /usr/bin/ispell -F ./so20836228-java-deformatter.sh > "$TEMP"
source "$TEMP"
This generates spell-checking commands which use the following ispell Java "deformatter":
#!/bin/sh
# Experimental Java ispell deformatter: use at your own risk!
/bin/sed -e '1,$ {
# introduce per-character state
s/\(.\)/\1_/g
# mark string literals
s/"_\(\(\([^"\\]_\|\\_._\)\)*\)"_/"B\1"E/g
# wipe out chars before string literals
:b s/._\(.\)B/ B\1B/g ; t b
# wipe out chars after string literals
:e s/\(.\)E._/\1E E/g ; t e
# remove per-character state
s/\(.\)./\1/g
# get rid of escape sequences
s/\\./ /g
}'
Use this experimental deformatter at your own risk.
Back up files before you work on them.
(Errors in the deformatter may damage spell-checked files. See the ispell manual page:
The program must produce exactly one character of output for each character of input, or ispell will lose synchronization and corrupt the output file.)
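Given that requirement, it may be worth sanity-checking the deformatter on a sample file before letting ispell rewrite anything; a minimal check (SomeFile.java is just a placeholder name):
# both byte counts must match, or ispell will corrupt the file
wc -c < SomeFile.java
./so20836228-java-deformatter.sh < SomeFile.java | wc -c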

bash script remove squares prefix when reading a file content [duplicate]

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:
find -type f |
while read file
do
if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ]
then
echo "found BOM in: $file"
fi
done
Or, if you prefer short, unreadable one-liners:
find -type f|while read file;do [ "`head -c3 -- "$file"`" == $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done
It doesn't work with filenames that contain a line break,
but such files are not to be expected anyway.
Is there any shorter or more elegant solution?
Are there any text editors or editor macros that could help with this?
What about this simple command, which not just finds but also clears the nasty BOM? :)
find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \;
I love "find" :)
Warning: the above will modify binary files which contain those three bytes.
If you want just to show BOM files, use this one:
grep -rl $'\xEF\xBB\xBF' .
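To combine the two, stripping the BOM only from files that grep itself classifies as text, a hedged sketch (GNU grep and GNU sed assumed):
# -I makes grep skip binary files, so sed only ever touches text files
grep -rlI $'\xEF\xBB\xBF' . | while IFS= read -r f; do
    sed -i '1s/^\xEF\xBB\xBF//' "$f"
done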
The best and easiest way to do this on Windows:
Total Commander → go to project's root dir → find files (Alt + F7) → file types *.* → Find text "EF BB BF" → check 'Hex' checkbox → search
And you get the list :)
find . -type f -print0 | xargs -0r awk '
/^\xEF\xBB\xBF/ {print FILENAME}
{nextfile}'
Most of the solutions given above test more than the first line of the file, even if some (such as Marcus's solution) then filter the results. This solution tests only the first line of each file, so it should be a bit quicker.
If you accept some false positives (in case there are non-text files, or in the unlikely case there is a ZWNBSP in the middle of a file), you can use grep:
fgrep -rl `echo -ne '\xef\xbb\xbf'` .
You can use grep to find them and Perl to strip them out like so:
grep -rl $'\xEF\xBB\xBF' . | xargs perl -i -pe 's{\xEF\xBB\xBF}{}'
I would use something like:
grep -orHbm1 "^`echo -ne '\xef\xbb\xbf'`" . | sed '/:0:/!d;s/:0:.*//'
This ensures that the BOM occurs starting at the first byte of the file.
For a Windows user, see this (good PHP script for finding the BOM in your project).
An overkill solution to this is phptags (not the vi tool with the same name), which specifically looks for PHP scripts:
phptags --warn ./
Will output something like:
./invalid.php: TRAILING whitespace ("?>\n")
./invalid.php: UTF-8 BOM alone ("\xEF\xBB\xBF")
And the --whitespace mode will automatically fix such issues (recursively, but it asserts that it only rewrites .php scripts).
I used this to correct only JavaScript files:
find . -iname '*.js' -type f -exec sed 's/^\xEF\xBB\xBF//' -i.bak {} \; -exec rm {}.bak \;
find -type f -print0 | xargs -0 grep -l `printf '^\xef\xbb\xbf'` | sed 's/^/found BOM in: /'
find -print0 puts a null \0 between each file name instead of using new lines
xargs -0 expects null separated arguments instead of line separated
grep -l lists the files which match the regex
The regex ^\xef\xbb\xbf isn't entirely correct, as it will match BOM-less UTF-8 files if they have zero-width no-break spaces at the start of a line
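If that bothers you, a stricter sketch compares exactly the first three bytes of each file instead of using a regex (assumes head -c):
# test only the first three bytes of every file
find . -type f -exec sh -c '
    bom=$(printf "\357\273\277")
    for f; do
        [ "$(head -c 3 -- "$f")" = "$bom" ] && printf "found BOM in: %s\n" "$f"
    done' sh {} +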
If you are looking for UTF files, the file command works. It will tell you what the encoding of the file is. If there are any non-ASCII characters in there, it will report UTF.
file *.php | grep UTF
That won't work recursively though. You can probably rig up some fancy command to make it recursive, but I just searched each level individually like the following, until I ran out of levels.
file */*.php | grep UTF
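A recursive variant of the same idea, as a sketch (letting find do the directory walking instead of going level by level):
find . -name '*.php' -exec file {} + | grep UTF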

How to look for files that have an extra character at the end?

I have a strange situation. A group of folks asked me to look at their hacked WordPress site. When I got in, I noticed there were extra files here and there that had an extra non-printable character at the end; Bash shows it as \r.
Just next to each of these files with the weird character is the original file. I'm trying to locate all these suspicious files and delete them, but the correct Bash incantation is eluding me.
find . | grep -i \?
and
find . | grep -i '\r'
aren't working.
How do I use Bash to find them?
Remove all files with filename ending in \r (carriage return), recursively, in current directory:
find . -type f -name $'*\r' -exec rm -fv {} +
Use ls -lh instead of rm to view the file list without removing.
Use rm -fvi to prompt before each removal.
-name GLOB specifies a matching glob pattern for find.
$'\r' is bash syntax for C style escapes.
You said "non-printable character", but ls indicates it's specifically a carriage return. The pattern '*[^[:graph:]]' matches filenames ending in any non-printable character, which may be relevant.
To remove all files and directories matching $'*\r' and all contents recursively: find . -name $'*\r' -exec rm -rfv {} +.
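To preview what that broader pattern would catch before deleting anything, a possible sketch (GNU find and GNU ls assumed; ls -b prints non-printable characters as backslash escapes):
# list every file whose name ends in a non-printable character (or space)
find . -type f -name '*[^[:graph:]]' -exec ls -lb -- {} +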
You have to pass the carriage-return character literally to grep. Use ANSI-C quoting in Bash:
find . -name $'*\r'
find . | grep $'\r'
find . | sed '/\x0d/!d'
If it is an actual special character (a real carriage return):
Recursive lookup:
grep -ir $'\r'
# sample output
# (appears as an empty line)
Recursive lookup, printing just the file names:
grep -lir $'\r'
# sample output
file.txt
If it is not a special character (a literal backslash followed by r):
You need to escape the backslash \ with another backslash so it becomes \\
Recursive lookup:
grep -ir '\\r$'
# sample output
file.txt:file.php\r
Recursive lookup, printing just the file names:
grep -lir '\\r$'
# sample output
file.txt
Help:
-i case insensitive
-r recursive mode
-l print only the file name
\\ escapes another backslash
$ matches the end of the line
$'...' ANSI-C quoting; the value is a special character, e.g. \r or \t
shopt -s globstar # Enable **
shopt -s dotglob # Also cover hidden files
offending_files=(**/*$'\r')
should store in the array offending_files a list of all files which are compromised in that way. Of course you could also glob for **/*$'\r'*, which matches all files having a carriage return anywhere in the name (not necessarily at the end).
You can then log the name of those broken files (which might make sense for auditing) and remove them.
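For instance, continuing that sketch (removed-files.log is just an example name):
# keep an audit trail, then delete the compromised files
printf '%s\n' "${offending_files[@]}" >> removed-files.log
rm -fv -- "${offending_files[@]}"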

Why is xargs not replacing the second {}

I'm using xargs to try to echo the name of a file, followed by its contents. Here is the command
find output/ -type f | xargs -I {} sh -c "echo {} ; cat {}"
However, for some reason, the second {} after cat is not being replaced; this happens only for some files, while others work correctly.
To be clear, I'm not looking for a command that lets me echo the name of a file followed by its contents; I'm trying to understand why this specific command does not work.
It turns out that the command was too long, so it was working with shorter file names and failing for longer ones. From man xargs:
-I replstr
Execute utility for each input line, replacing one or more occurrences of replstr in up to replacements (or 5 if no -R flag is specified) arguments to utility with the entire line of input. The resulting arguments, after replacement is done, will not be allowed to grow beyond 255 bytes; this is implemented by concatenating as much of the argument containing replstr as possible, to the constructed arguments to utility, up to 255 bytes. The 255 byte limit does not apply to arguments to utility which do not contain replstr, and furthermore, no replacement will be done on utility itself. Implies -x.
The root cause of the problem is pointed out in Carlos's answer, but without a solution.
After some googling, I couldn't find a way to lift the 255-byte limit.
So a probable workaround is to use a shell variable for the substitution.
Example:
find . | xargs -I% sh -c 'F="%";iconv -f gb2312 -t utf-8 "$F">"$F.out";mv "$F.out" "$F"'
Remember to use single quotes for the outermost sh -c parameter string; we don't want the $F inside to be expanded by our parent shell.
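A related sketch sidesteps -I (and with it the byte limit) entirely by letting find hand the file names to an inner shell as positional arguments:
# no {} substitution at all, so no 255-byte limit and no quoting pitfalls
find output/ -type f -exec sh -c '
    for f; do
        echo "$f"
        cat -- "$f"
    done' sh {} +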
Is it files with whitespace in the names that create problems? Try adding escaped quotes (\"), like this:
find output/ -type f | xargs -I {} sh -c "echo \"{}\" ; cat \"{}\""
This worked for me using Bash.

Using find with variables in bash

I am new to bash scripting and need help:
I need to remove specific files from a directory. My goal is to find in each subdirectory a file called "filename.A" and remove all files that start with "filename" and have extension B,
that is: "filename01.B", "filename02.B", etc.
I tried:
B_folders="$(find /someparentdirectory -type d -name "*.B" | sed 's#\(.*\)/.*#\1#'|uniq)"
A_folders="$(find "$B_folders" -type f -name "*.A")"
for FILE in "$A_folders" ; do
A="${file%.A}"
find "$FILE" -name "$A*.B" -exec rm -f {}\;
done
I started to get problems when the directory names contained spaces.
Any suggestions for the right way to do it?
EDIT:
My goal is to find, in each subdirectory (which may have spaces in its name), files of the form "filename.A".
If such a file exists,
check whether "filename*.B" exists and remove it,
that is, remove "filename01.B", "filename02.B", etc.
In bash 4, it's simply
shopt -s globstar nullglob
for f in some_parent_directory/**/filename.A; do
rm -f "${f%.A}"*.B
done
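If globstar is not available (bash older than 4), a sketch of the same idea driven by find, generalized to any *.A file:
# for every *.A file, remove the sibling files sharing its stem with extension .B
find some_parent_directory -type f -name '*.A' -exec sh -c '
    for a; do
        stem=${a%.A}
        rm -fv -- "$stem"*.B
    done' sh {} +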
If the space is the only issue, you can modify the find inside the for loop as follows:
find "$FILE" -name "$A*.B" -print0 | xargs -0 rm
man find shows:
-print0
True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.
and xargs's manual:
-0
Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are not special (every character is taken literally). Disables the end-of-file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.

How can I process a list of files that includes spaces in its names in Unix?

I'm trying to list the files in a directory and do something to them in the Mac OS X prompt.
It should go like this: for f in $(ls -1); do echo $f; done
If I have files without spaces in their names (fileA.txt, fileB.txt), the echo works fine.
If the files include spaces in their names ("file A.txt", "file B.txt"), I get 4 strings (file, A.txt, file, B.txt).
I've tried quoting the listing command, but it only changed the problem.
If I do this: for f in "$(ls -1)"; do echo $f; done
I get: file A.txt\nfile B.txt
(It displays correctly, but it is a single string and I need the 2 lines separated.)
Step away from ls if at all possible. Use find from the findutils package.
find /target/path -type f -print0 | xargs -0 your_command_here
-print0 will cause find to output the names separated by NUL characters (ASCII zero). The -0 argument to xargs tells it to expect the arguments separated by NUL characters too, so everything will work just fine.
Replace /target/path with the path under which your files are located.
-type f will only locate files. Use -type d for directories, or omit altogether to get both.
Replace your_command_here with the command you'll use to process the file names. (Note: If you run this from a shell using echo for your_command_here you'll get everything on one line - don't get confused by that shell artifact, xargs will do the expected right thing anyway.)
Edit: Alternatively (or if you don't have xargs), you can use the much less efficient
find /target/path -type f -exec your_command_here \{\} \;
\{\} \; is the escaped form of {} ;, where {} is the placeholder for the currently processed file. find will invoke your_command_here with {} replaced by the file name, and since your_command_here is launched by find and not by the shell, the spaces won't matter.
The second version will be less efficient, since find will launch a new process for each and every file found. xargs is smart enough to batch arguments into as few invocations as possible when it can figure out that it's safe to do so. Prefer the xargs version if you have the choice.
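For example, to count the lines of every file under the directory (wc -l standing in for your_command_here):
# NUL-delimited from end to end, so spaces and newlines in names are safe
find /target/path -type f -print0 | xargs -0 wc -l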
for f in *; do echo "$f"; done
should do what you want. Why are you using ls instead of *?
In general, dealing with spaces in shell is a PITA. Take a look at the $IFS variable, or better yet at Perl, Ruby, Python, etc.
Here's an answer using $IFS as discussed by derobert
http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html
You can pipe the arguments into read. For example, to cat all files in the directory:
ls -1 | while read FILENAME; do cat "$FILENAME"; done
This means you can still use ls, as you have in your question, or any other command that produces $IFS delimited output.
The while loop makes it much easier to do several things to the argument, and makes complex processing more readable in my opinion. A contrived example:
ls -1 | while read FILE
do
echo 1: "$FILE"
echo 2: "$FILE"
done
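If the file names may contain leading whitespace, backslashes, or even newlines, a hardened sketch of the same loop (bash-specific, NUL-delimited, so it drops ls in favour of find):
find . -type f -print0 | while IFS= read -r -d '' FILE
do
    # IFS= and -r keep the name byte-for-byte intact; -d '' splits on NUL
    echo 1: "$FILE"
    echo 2: "$FILE"
done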
Look at the --quoting-style option.
For instance, --quoting-style=c would produce:
$ ls --quoting-style=c
"file1" "file2" "dir one"
Check out the manpage for xargs; it works like this:
ls -1 /tmp/*.jpeg | xargs rm
