Concatenate all files in folder and subfolder with newline to one file - bash

I have some folders and subfolders with .txt files and other extensions (like .py, .html), and I want to concatenate them all into one .txt file.
I tried this:
find . -type f -exec cat {} + > test.txt
Input:
txt1.txt:
aaaaa
test.py:
print("a")
htmltest1.html:
<head></head>
Output:
aaaaaprint("a")<head></head>
Desired output:
aaaaa
print("a")
<head></head>
So, how do I modify this bash command to get my desired output? I want a newline printed after each file.

The problem is that the last lines of your files are not terminated with a newline character, which means they don't meet the POSIX definition of a text file, and that can yield weird results like this.
Most graphical text editors allow you to omit the terminating newline, and a lot of people do omit it, presumably because the editor makes it look like there's a redundant empty line at the end.
This may be why some people couldn't reproduce your issue: presumably they created the sample files with well-behaved tools such as cat, vim, or nano, or they did put the newline characters at the end.
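For instance, files without terminating newlines can be reproduced with printf (a sketch; the filenames just mirror the question):
printf 'aaaaa' > txt1.txt
printf 'print("a")' > test.py
printf '<head></head>' > htmltest1.html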
So here's the issue:
user@host:~$ find . -type f -exec cat {} \;
aaaaaprint("a")<head></head>user@host:~$
To avoid these sorts of problems in the future, you should always hit <enter> after the last line of text in your file when using a graphical text editor. However, sometimes you have to work with files produced by other users, who might not know this sort of stuff, so:
here is a quick and dirty workaround (concatenating with an additional file which only contains the newline character):
user@host:~$ echo '' > /tmp/newline.txt
user@host:~$ find . -type f -exec cat {} /tmp/newline.txt \;
aaaaa
print("a")
<head></head>
user@host:~$
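A more careful alternative (a sketch, not part of the original answer) appends a newline only when a file does not already end with one, so well-formed files don't pick up extra blank lines:
find . -type f -exec sh -c '
    for f; do
        cat "$f"
        # tail -c 1 yields the last byte; the command substitution strips
        # a trailing newline, so the test succeeds only when one is missing
        [ -n "$(tail -c 1 "$f")" ] && echo
    done
' sh {} + > test.txt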

Use the -E option of cat so it prints a $ at the end of each line.
Then use sed to strip those markers out, matching a literal \$ anchored to the end of the line with $:
find . -type f -exec cat -E {} + | sed 's/\$$//' > test.txt

Related

bash script remove squares prefix when reading a file content [duplicate]

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:
find -type f |
while read file
do
    if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ]
    then
        echo "found BOM in: $file"
    fi
done
Or, if you prefer short, unreadable one-liners:
find -type f|while read file;do [ "`head -c3 -- "$file"`" == $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done
It doesn't work with filenames that contain a line break,
but such files are not to be expected anyway.
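A line-break-safe variant of the same idea (a sketch; it keeps the names out of the pipe entirely by letting find pass them as arguments, with the BOM spelled out in octal) would be:
find . -type f -exec sh -c '
    for f; do
        # compare the first three bytes against the UTF-8 BOM (EF BB BF)
        [ "$(head -c 3 -- "$f")" = "$(printf "\357\273\277")" ] &&
            printf "found BOM in: %s\n" "$f"
    done
' sh {} +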
Is there any shorter or more elegant solution?
Are there any interesting text editors or macros for text editors?
What about this one simple command, which not only finds but also clears the nasty BOM? :)
find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i {} \;
I love "find" :)
Warning: the above will also modify binary files that contain those three bytes at the start.
If you want just to show BOM files, use this one:
grep -rl $'\xEF\xBB\xBF' .
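Combining the two ideas while skipping binaries might look like this (a sketch; grep -I treats binary files as non-matching, and -l prints only the names):
grep -rIl $'\xEF\xBB\xBF' . | while IFS= read -r f; do
    sed -i '1s/^\xEF\xBB\xBF//' "$f"
done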
The best and easiest way to do this on Windows:
Total Commander → go to project's root dir → find files (Alt + F7) → file types *.* → Find text "EF BB BF" → check 'Hex' checkbox → search
And you get the list :)
find . -type f -print0 | xargs -0r awk '
/^\xEF\xBB\xBF/ {print FILENAME}
{nextfile}'
Most of the solutions given above test more than the first line of the file, even if some (such as Marcus's solution) then filter the results. This solution only tests the first line of each file so it should be a bit quicker.
If you accept some false positives (in case there are non-text files, or in the unlikely case there is a ZWNBSP in the middle of a file), you can use grep:
fgrep -rl `echo -ne '\xef\xbb\xbf'` .
You can use grep to find them and Perl to strip them out like so:
grep -rl $'\xEF\xBB\xBF' . | xargs perl -i -pe 's{\xEF\xBB\xBF}{}'
I would use something like:
grep -orHbm1 "^`echo -ne '\xef\xbb\xbf'`" . | sed '/:0:/!d;s/:0:.*//'
Which will ensure that the BOM occurs starting at the first byte of the file.
For a Windows user, see this (good PHP script for finding the BOM in your project).
An overkill solution to this is phptags (not the vi tool with the same name), which specifically looks for PHP scripts:
phptags --warn ./
Will output something like:
./invalid.php: TRAILING whitespace ("?>\n")
./invalid.php: UTF-8 BOM alone ("\xEF\xBB\xBF")
And the --whitespace mode will automatically fix such issues (recursively, but asserts that it only rewrites .php scripts.)
I used this to correct only JavaScript files:
find . -iname '*.js' -type f -exec sed 's/^\xEF\xBB\xBF//' -i.bak {} \; -exec rm {}.bak \;
find -type f -print0 | xargs -0 grep -l `printf '^\xef\xbb\xbf'` | sed 's/^/found BOM in: /'
find -print0 puts a null \0 between the file names instead of newlines
xargs -0 expects null-separated arguments instead of line-separated ones
grep -l lists the files which match the regex
The regex ^\xef\xbb\xbf isn't entirely correct, as it will match non-BOMed UTF-8 files if they have zero-width no-break spaces at the start of a line
If you are looking for UTF files, the file command works. It will tell you what the encoding of each file is. If there are any non-ASCII characters in there, it will report UTF.
file *.php | grep UTF
That won't work recursively, though. You can probably rig up some fancy command to make it recursive, but I just searched each level individually like the following, until I ran out of levels.
file */*.php | grep UTF
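One way to make it recursive (a sketch) is to let find do the walking and hand the names to file in batches:
find . -name '*.php' -exec file {} + | grep UTF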

How to look for files that have an extra character at the end?

I have a strange situation. A group of folks asked me to look at their hacked Wordpress site. When I got in, I noticed there were extra files here and there that had an extra non-printable character at the end. In Bash, it shows as a \r.
Just next to these files with the weird character is the original file. I'm trying to locate all these suspicious files and delete them. But the correct Bash incantation is eluding me.
find . | grep -i \?
and
find . | grep -i '\r'
aren't working
How do I use bash to find them?
Remove all files with filename ending in \r (carriage return), recursively, in current directory:
find . -type f -name $'*\r' -exec rm -fv {} +
Use ls -lh instead of rm to view the file list without removing.
Use rm -fvi to prompt before each removal.
-name GLOB specifies a matching glob pattern for find.
$'\r' is bash syntax for C style escapes.
You said "non-printable character", but ls indicates it's specifically a carriage return. The pattern '*[^[:graph:]]' matches filenames ending in any non-printable character, which may be relevant.
To remove all files and directories matching $'*\r' and all contents recursively: find . -name $'*\r' -exec rm -rfv {} +.
You have to pass the carriage return character literally to grep. Use ANSI-C quoting in Bash.
find . -name $'*\r'
find . | grep $'\r'
find . | sed '/\x0d/!d'
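A quick way to sanity-check the pattern (with a made-up throwaway filename) is to plant a decoy and confirm it is matched:
touch $'decoy.php\r'    # create a name ending in a carriage return
find . -name $'*\r'     # should print ./decoy.php (the CR itself is invisible)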
If it is a special character:
Recursive look-up:
grep -ir $'\r'
# sample output
# (an empty-looking line)
Recursive look-up, printing just the file name:
grep -lir $'\r'
# sample output
file.txt
If it is not a special character:
You need to escape the backslash \ with another backslash so it becomes \\
Recursive look-up:
grep -ir '\\r$'
# sample output
file.txt:file.php\r
Recursive look-up, printing just the file name:
grep -lir '\\r$'
# sample output
file.txt
help:
-i case-insensitive matching
-r recursive mode
-l print only the file name
\\ escapes another backslash
$ matches the end of the line
$'' ANSI-C quoting; the value is a special character, e.g. \r, \t
shopt -s globstar # Enable **
shopt -s dotglob # Also cover hidden files
offending_files=(**/*$'\r')
should store into the array offending_files a list of all files which are compromised in that way. Of course you could also glob for **/*$'\r'*, which searches for all files having a carriage return anywhere in the name (not necessarily at the end).
You can then log the name of those broken files (which might make sense for auditing) and remove them.
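From there, logging and removal could look like this (a sketch; the log filename is made up, and printf %q renders the carriage return visibly as $'\r'):
for f in "${offending_files[@]}"; do
    printf 'removing: %q\n' "$f" >> removed-files.log
    rm -f -- "$f"
done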

Delete lines X to Y using Mac Unix Sed

Command line on a Mac. Have some text files. Want to remove certain lines from a group of files, then cat the remaining text of the file to a new merged file. Currently have the following attempt:
for file in *.txt; do
    echo $file >> tempfile.html
    echo '' >> tempfile.html
    cat $file >> tempfile.html
    find . -type f -name 'tempfile.html' -exec sed -i '' '3,10d' {} +
    find . -type f -name 'tempfile.html' -exec sed -i '' '/<ACROSS>/,$d' {} +
    # ----------------
    # some other stuff
    # ----------------
done
I am extracting a section of text from a bunch of files and concatenating them all together, but I still need to know from which file each selection originated. First I concatenate the name of the file, then (supposedly) the selection of text from each file, then repeat the process.
Plus, I need to leave the original text files in place for other purposes.
So the concatenated file would be:
filename1.txt
text-selection
more_text
filename2.txt
even-more-text
text-text-test-test
The first sed is supposed to delete from line 3 to line 10. The second is supposed to delete from the line containing <ACROSS> to the end of the file.
However, what happens is that the first deletes everything in the tempfile, and the second one was doing nothing. (Each was tested separately.)
What am I doing wrong?
I must be missing something. Even what appears to be a very simple example does not work either. My hope was that the following example would delete lines 3-10 but save the rest of the file to test.txt:
sed '3,10d' nxd2019-01-06.txt > test.txt
Your invocation of find will attempt to run sed with as many files as possible per call. But note: addresses in sed do not refer to lines in each input file; they refer to the whole input of sed (which can consist of many input files).
Try this:
> a.txt cat <<EOF
1
2
EOF
> b.txt cat <<EOF
3
4
EOF
Now try this:
sed 1d a.txt b.txt
2
3
4
As you can see, sed removed the first line from a.txt, not from b.txt.
The problem in your case is the second invocation of find. It will remove everything from the first occurrence of <ACROSS> until the last line in the last file found by find. This will effectively remove the content from all but the first tempfile.html.
Having that the remaining logic in your script is working, you should just change the find invocations to:
find . -type f -name 'tempfile.html' -exec sed -i '' '3,10d' {} \;
find . -type f -name 'tempfile.html' -exec sed -i '' '/<ACROSS>/,$d' {} \;
This would call sed once per input file.
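As a side note, since sed then runs once per file anyway, the two expressions can also be folded into a single invocation (a sketch, keeping the BSD/macOS-style -i '' used above):
find . -type f -name 'tempfile.html' -exec sed -i '' -e '3,10d' -e '/<ACROSS>/,$d' {} \;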

Recursive cat with file names

I'd like to recursively cat several files with the same name to another file. There's an earlier question, "Recursive cat all the files into single file", which helped me get started. However, I'd like to achieve the same so that each file's content is preceded by its filename and path, with different files preferably separated by a blank line or ----- or something like that. So the resulting file would read:
files/pipo1/foo.txt
flim
flam
floo
files/pipo2/foo.txt
plim
plam
ploo
Any way to achieve this in bash?
Of course! Instead of just cat-ing the file, you chain actions to print the filename, cat the file, then add a line feed:
find . -name 'foo.txt' \
     -print \
     -exec cat {} \; \
     -printf "\n"
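Note that -printf is specific to GNU find; on macOS/BSD find the same effect can be had with a second -exec (a sketch):
find . -name 'foo.txt' -print -exec cat {} \; -exec echo \;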

How can I process a list of files that includes spaces in its names in Unix?

I'm trying to list the files in a directory and do something to them in the Mac OS X prompt.
It should go like this: for f in $(ls -1); do echo $f; done
If I have files without spaces in their names (fileA.txt, fileB.txt), the echo works fine.
If the files include spaces in their names ("file A.txt", "file B.txt"), I get 4 strings (file, A.txt, file, B.txt).
I've tried quoting the listing command, but it only changed the problem.
If I do this: for f in "$(ls -1)"; do echo "$f"; done
I get: file A.txt\nfile B.txt
(It displays correctly, but it is a single string, and I need the 2 lines separated.)
Step away from ls if at all possible. Use find from the findutils package.
find /target/path -type f -print0 | xargs -0 your_command_here
-print0 will cause find to output the names separated by NUL characters (ASCII zero). The -0 argument to xargs tells it to expect the arguments separated by NUL characters too, so everything will work just fine.
Replace /target/path with the path under which your files are located.
-type f will only locate files. Use -type d for directories, or omit altogether to get both.
Replace your_command_here with the command you'll use to process the file names. (Note: If you run this from a shell using echo for your_command_here you'll get everything on one line - don't get confused by that shell artifact, xargs will do the expected right thing anyway.)
Edit: Alternatively (or if you don't have xargs), you can use the much less efficient
find /target/path -type f -exec your_command_here \{\} \;
\{\} \; is the escape for {} ;, where {} is the placeholder for the currently processed file. find will then invoke your_command_here with {} replaced by the file name, and since your_command_here is launched by find and not by the shell, the spaces won't matter.
The second version will be less efficient, since find will launch a new process for each and every file found. xargs is smart enough to batch arguments to a newly launched process when it can figure out it's safe to do so. Prefer the xargs version if you have the choice.
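As a middle ground (a sketch), find itself can batch arguments with the + terminator, which behaves much like the xargs pipeline without the pipe:
find /target/path -type f -exec your_command_here {} +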
for f in *; do echo "$f"; done
should do what you want. Why are you using ls instead of * ?
In general, dealing with spaces in shell is a PITA. Take a look at the $IFS variable, or better yet at Perl, Ruby, Python, etc.
Here's an answer using $IFS as discussed by derobert
http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html
You can pipe the arguments into read. For example, to cat all files in the directory:
ls -1 | while read FILENAME; do cat "$FILENAME"; done
This means you can still use ls, as you have in your question, or any other command that produces $IFS delimited output.
The while loop makes it much easier to do several things to the argument, and makes complex processing more readable in my opinion. A contrived example:
ls -1 | while read FILE
do
echo 1: "$FILE"
echo 2: "$FILE"
done
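To also survive leading whitespace and backslashes in names, a slightly hardened variant (a sketch) is:
ls -1 | while IFS= read -r FILE
do
    cat -- "$FILE"    # -- guards against names that start with a dash
done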
Look at the --quoting-style option.
For instance, --quoting-style=c would produce:
$ ls --quoting-style=c
"file1" "file2" "dir one"
Check out the manpage for xargs.
It works like this:
ls -1 /tmp/*.jpeg | xargs rm
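Beware that this plain pipe mangles names containing spaces (the very problem in the question); a NUL-delimited sketch (GNU xargs assumed) avoids that:
printf '%s\0' /tmp/*.jpeg | xargs -0 rm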
