Quick ls command - bash

I've got to get a directory listing that contains about 2 million files, but when I do an ls command on it nothing comes back. I've waited 3 hours. I've tried ls | tee directory.txt, but that seems to hang forever.
I assume the server is doing a lot of inode sorting. Is there any way to speed up the ls command to just get a directory listing of filenames? I don't care about sizes, dates, permissions, or the like at this time.

ls -U
will do the ls without sorting.
Another source of slowness is --color. On some Linux machines, there is a convenience alias which adds --color=auto to the ls call, making it look up file attributes for each file found (slow) in order to color the display. This can be avoided with ls -U --color=never or \ls -U.

I have a directory with 4 million files in it and the only way I got ls to spit out files immediately without a lot of churning first was
ls -1U

Try using:
find . -maxdepth 1 -type f
This will list only the files in the directory; leave out the -type f argument if you want to list both files and directories. (The -maxdepth option is placed first here because GNU find warns if it comes after -type.)

This question seemed interesting, and I went through the multiple answers that were posted. To compare the efficiency of the answers, I ran them against 2 million files and got the results below.
$ time tar cvf /dev/null . &> /tmp/file-count
real 37m16.553s
user 0m11.525s
sys 0m41.291s
------------------------------------------------------
$ time echo ./* &> /tmp/file-count
real 0m50.808s
user 0m49.291s
sys 0m1.404s
------------------------------------------------------
$ time ls &> /tmp/file-count
real 0m42.167s
user 0m40.323s
sys 0m1.648s
------------------------------------------------------
$ time find . &> /tmp/file-count
real 0m2.738s
user 0m1.044s
sys 0m1.684s
------------------------------------------------------
$ time ls -U &> /tmp/file-count
real 0m2.494s
user 0m0.848s
sys 0m1.452s
------------------------------------------------------
$ time ls -f &> /tmp/file-count
real 0m2.313s
user 0m0.856s
sys 0m1.448s
------------------------------------------------------
To summarize the results:
ls -f ran a bit faster than ls -U; disabling color might account for the difference.
find ran third, taking about 2.7 seconds.
Running plain ls took 42.16 seconds; on my system ls is an alias for ls --color=auto.
Using shell expansion with echo ./* took 50.80 seconds.
And the tar-based solution took about 37 minutes.
All tests were run separately while the system was idle.
One important thing to note is that the file lists were not printed to the terminal; they were redirected to a file, and the file count was calculated afterwards with the wc command. The commands ran far more slowly when their output was printed to the screen.
Any ideas why this happens?
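For reference, the redirect-and-count pattern used in these tests boils down to a minimal sketch like this (assuming GNU ls; /tmp/file-count is just the scratch file from the runs above):
# Redirect the listing to a file, then count the entries afterwards with wc;
# printing two million names to the terminal is what makes things crawl.
time ls -U &> /tmp/file-count
wc -l < /tmp/file-count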

This would be the fastest option AFAIK: ls -1 -f.
-1 (No columns)
-f (No sorting)

Using
ls -1 -f
is about 10 times faster and easy to do (I tested with 1 million files, but my original problem had 6 800 000 000 files).
In my case, though, I needed to check whether a specific directory contains more than 10 000 files. If there were more than 10 000, I was no longer interested in how many there were; I just quit the program so that it runs faster and won't try to read the rest one by one. If there are fewer than 10 000, I print the exact count. The speed of my program is quite similar to ls -1 -f if you specify a parameter value larger than the number of files.
You can use my program find_if_more.pl in the current directory by typing:
find_if_more.pl 999999999
If you are only interested in whether there are more than n files, the script will finish faster than ls -1 -f when the directory contains a very large number of files.
#!/usr/bin/perl
use strict;
use warnings;

# Count directory entries, but stop reading as soon as the count exceeds maxcount.
my ($maxcount) = @ARGV;
my $dir = '.';
my $filecount = 0;
if (not defined $maxcount) {
    die "Need maxcount\n";
}
opendir(DIR, $dir) or die $!;
while (my $file = readdir(DIR)) {
    $filecount = $filecount + 1;
    last if $filecount > $maxcount;
}
print "$filecount\n";
closedir(DIR);
exit 0;
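A rough shell-only approximation of the same early-exit idea, as a sketch (assuming ls -f for an unsorted listing and that ls stops once head closes the pipe; the extra 2 allows for the . and .. entries):
# Prints at most 10002; reaching 10002 means there are more than 10 000 files.
ls -f | head -n 10002 | wc -l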

You can redirect output and run the ls process in the background.
ls > myls.txt &
This would allow you to go on about your business while it's running; it wouldn't lock up your shell.
I'm not sure what options there are for running ls and getting less data back; you could always run man ls to check.
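A minimal sketch of that approach, combined with the unsorted listing suggested above (myls.txt is just the example filename from this answer):
# Run the listing in the background, then check progress by counting lines so far.
ls -U > myls.txt 2>/dev/null &
wc -l myls.txt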

This is probably not a helpful answer, but if you don't have find, you may be able to make do with tar
$ tar cvf /dev/null .
I am told by people older than me that, "back in the day", single-user and recovery environments were a lot more limited than they are nowadays. That's where this trick comes from.

I'm assuming you are using GNU ls?
try
\ls
It bypasses the usual alias (typically ls --color=auto).

If a process "doesn't come back", I recommend strace to analyze how a process is interacting with the operating system.
In case of ls:
$ strace ls
You would see that it reads all directory entries (getdents(2)) before it actually outputs anything (it is sorting them, as already mentioned here).
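A hedged sketch for getting a quick overview without wading through the full trace (strace -c summarizes syscall counts; the directory reads show up as getdents or getdents64 depending on the platform):
# Discard the listing itself and just count the syscalls ls makes.
strace -c ls > /dev/null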

Things to try:
Check ls isn't aliased?
alias ls
Perhaps try find instead?
find . \( -type d -name . -prune \) -o \( -type f -print \)
Hope this helps.

Some followup:
You don't mention what OS you're running on, which would help indicate which version of ls you're using. This probably isn't a 'bash' question as much as an ls question. My guess is that you're using GNU ls, which has some features that are useful in some contexts, but kill you on big directories.
GNU ls tries to do a smart arrangement of all the filenames into neat columns. In a huge directory, this takes time and memory.
To 'fix' this, you can try:
ls -1 # no columns at all
Find BSD ls someplace (http://www.freebsd.org/cgi/cvsweb.cgi/src/bin/ls/) and use that on your big directories.
Use other tools, such as find

There are several ways to get a list of files:
Use this command to get a list without sorting:
ls -U
or send the list of files to a file by using:
ls /Folder/path > ~/Desktop/List.txt

What filesystem are you using?
With millions of small files in one directory, it might be a good idea to use JFS or ReiserFS, which have better performance with many small files.
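As a quick check of what you currently have, a sketch assuming GNU coreutils for the -T and %T options (the path is a placeholder):
df -T /path/to/big/directory          # prints a filesystem-type column
stat -f -c %T /path/to/big/directory  # prints just the filesystem type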

How about find ./ -type f (which will find all files in the current directory)? Take off the -type f to find everything.

You should provide information about what operating system and the type of filesystem you are using. On certain flavours of UNIX and certain filesystems you might be able to use the commands ff and ncheck as alternatives.

I had a directory with timestamps in the file names. I wanted to check the date of the latest file and found find . -type f -maxdepth 1 | sort | tail -n 1 to be about twice as fast as ls -alh.

Lots of other good solutions here, but in the interest of completeness:
echo *

You can also make use of xargs. Just pipe the output of ls through xargs.
ls | xargs
If that doesn't work, and the find examples above aren't working either, try piping them through xargs, since it can help with the memory usage that might be causing your problems.

Related

Grep - showing current directory/file in a recursive search

The problem
Sometimes, when I run the grep tool recursively it gets stuck in some big directories or in some big files, and I would like to see the directory or file name because perhaps I may realise I don't need to scan that specific directory/file the next time I use grep for a similar purpose, therefore excluding it with the corresponding grep options.
Is there a way to tell grep the current path directory/file which is being scanned in such searches?
My attempts
I tried to search here, but it's hard to find anything, since the keywords current directory are usually used for other purposes, so the terminology conflicts.
I have also tried things like:
man grep | grep -i current
man grep | grep -i status
(and many others) without success so far.
EDIT: I have just found a useful answer here which is for a different problem, but I guess that it may work if I modify the following code by adding an echo command somewhere in the for loop, although I have also just realised it requires bash 4 and sadly I have bash 3.
# Requires bash 4 and Gnu grep
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
echo $i/$total >>/dev/stderr
grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done
find . -type f -exec echo grepping {} \; -exec time grep pattern {} \; 2>&1
find . -type f to find all the files recursively.
-exec echo grepping {} to call out each file
-exec time grep ... {} to report the time each grep takes
2>&1 to get time's stderr onto stdout.
This doesn't report a total time per directory. Doing that this way either requires more advanced find, to find leaf dirs for grep -d, or to add some cumulative time per path, which I'd do with perl -p... but that's nontrivial as well.

Using ? wildcard with ls

I'm trying to use the ? wildcard to display only 1 character files, and ?.* to display 1 character files with extensions.
what works:
cd /mydir
ls ? ?.*
I'm trying to use this in a shell script, so therefore I can't use cd.
What I'm trying to get to work:
ls ? ?.* /mydir
and it gives me the output:
ls: cannot access ?.*: No such file or directory
I've also tried:
ls /mydir ? ?.*
which gives me the exact same output as before.
From a comment you wrote:
I'm in college for Linux administration and one of my current classes is shell scripting. My teacher is just going over basic stuff, and my current assignment is to get the number of files in the tmp directory of our class server, the number of files that end in .log, and the number of files that have only 1-character names, store the data in a file, and then display the stored data to the user. I know it's stupid, but it's my assignment.
I only hope that they don't teach you to parse the output of ls in college... it's one of the most terrible things to do. Please refer to these links:
Why you shouldn't parse the output of ls(1)
Don't ever do these
The solution you chose
ls /mydir/? /mydir/?.* | wc -l
is broken in two cases:
If there are no matching files, you'll get an error. You can fix that in two ways: use shopt -s nullglob or just redirect stderr to /dev/null.
If there's a newline in a file name. Try it: touch $'a.lol\nlol\n\lol\nlol\nlol'. LOL.
The proper bash way is the following:
shopt -s nullglob
shopt -u failglob
files=( /mydir/? /mydir/?.* )
echo "There are ${#files[#]} files found."
When you write ls ? ?.* /mydir, you're trying to display the files matching three distinct patterns: ?, ?.*, and /mydir. You want to match only /mydir/? and /mydir/?.*, hence this command: ls /mydir/? /mydir/?.*.
Edit: while this is a correct answer to the initial question (listing /mydir/? and /mydir/?.*), OP wanted to do this to parse the output and get the file count. See @gniourf_gniourf's answer, which is a much better way to do this.
cd works perfectly within a shell script, use it. For minimal impact on the script, I would use a subshell:
( cd /mydir && ls ? ?.* )
That way, you don't change the current working directory of the script (and neither $OLDPWD, which would be clobbered with cd /mydir; ...; cd -;).
While ls seems like an obvious choice, find is probably more suitable:
find /mydir \! -name "." -a \( -name "?" -o -name "?.*" \)

Copying files with specific size to other directory

It's an interview question. The interviewer asked this "basic" shell script question when he understood I don't have experience in shell scripting. Here is the question.
Copy files with size greater than 500 K from one directory to another directory.
I can do it immediately in C, but it seems difficult in a shell script, as I've never tried one. I am familiar with basic Unix commands, so I tried, but I was only able to extract those file names using the command below.
du -sk * | awk '{ if ($1>500) print $2 }'
Also, let me know a good book of shell script examples.
It can be done in several ways. I'd try and use find:
find "$FIRSTDIRECTORY" -size +500k -exec cp {} "$SECONDDIRECTORY" \;
To limit the search to the current directory, use the -maxdepth option.
du recurses into subdirectories, which is probably not desired (you could have asked for clarification if that point was ambiguous). More likely you were expected to use ls -l or ls -s to get the sizes.
But what you did works to select some files and print their names, so let's build on it. You have a command that outputs a list of names. You need to put the output of that command into the command line of a cp. If your du|awk outputs this:
Makefile
foo.c
bar.h
you want to run this:
cp Makefile foo.c bar.h otherdirectory
So how you do that is with COMMAND SUBSTITUTION which is written as $(...) like this:
cd firstdirectory
cp $(du -sk * | awk '{ if ($1>500) print $2 }') otherdirectory
And that's a functioning script. The du|awk command runs first, and its output is used to build the cp command. There are a lot of subtle drawbacks that would make it unsuitable for general use, but that's how beginner-level shell scripts usually are.
find . -mindepth 1 -maxdepth 1 -type f -size +BYTESc -exec cp -t DESTDIR {} +
The c suffix on the size is essential; the size is in bytes. Otherwise, you get probably-unexpected rounding behaviour in determining the result of the -size check. If the copying is meant to be recursive, you will need to take care of creating any destination directory also.
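If the copying does need to be recursive, one hedged, GNU-specific sketch lets cp recreate the missing directories itself (SRCDIR and DESTDIR are placeholders; 512000c is 500 K expressed in bytes):
# --parents rebuilds each file's relative path under the destination directory.
cd "$SRCDIR" &&
find . -type f -size +512000c -exec cp --parents -t "$DESTDIR" {} +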

bash "map" equivalent: run command on each file [duplicate]

This question already has answers here:
Execute command on all files in a directory
(10 answers)
Closed 1 year ago.
I often have a command that processes one file, and I want to run it on every file in a directory. Is there any built-in way to do this?
For example, say I have a program data which outputs an important number about a file:
./data foo
137
./data bar
42
I want to run it on every file in the directory in some manner like this:
map data `ls *`
ls * | map data
to yield output like this:
foo: 137
bar: 42
If you are just trying to execute your data program on a bunch of files, the easiest/least complicated way is to use -exec in find.
Say you wanted to execute data on all txt files in the current directory (and subdirectories). This is all you'd need:
find . -name "*.txt" -exec data {} \;
If you wanted to restrict it to the current directory, you could do this:
find . -maxdepth 1 -name "*.txt" -exec data {} \;
There are lots of options with find.
If you just want to run a command on every file you can do this:
for i in *; do data "$i"; done
If you also wish to display the filename that it is currently working on then you could use this:
for i in *; do echo -n "$i: "; data "$i"; done
It looks like you want xargs:
find . -maxdepth 1 | xargs -d'\n' data
To print each command first, it gets a little more complex:
find . -maxdepth 1 | xargs -d'\n' -I {} bash -c "echo {}; data {}"
You should avoid parsing ls:
find . -maxdepth 1 | while read -r file; do do_something_with "$file"; done
or
while read -r file; do do_something_with "$file"; done < <(find . -maxdepth 1)
The latter doesn't create a subshell out of the while loop.
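If filenames might also contain newlines, a hedged variant of the same loop reads NUL-delimited names instead (assuming GNU find and bash):
while IFS= read -r -d '' file; do do_something_with "$file"; done < <(find . -maxdepth 1 -print0)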
The common methods are:
ls * | while read file; do data "$file"; done
for file in *; do data "$file"; done
The second can run into problems if you have whitespace in filenames; in that case you'd probably want to make sure it runs in a subshell, and set IFS:
( IFS=$'\n'; for file in *; do data "$file"; done )
You can easily wrap the first one up in a script:
#!/bin/bash
# map.bash
while read file; do
"$1" "$file"
done
which can be executed as you requested - just be careful never to accidentally execute anything dumb with it. The benefit of using a looping construct is that you can easily place multiple commands inside it as part of a one-liner, unlike xargs where you'll have to place them in an executable script for it to run.
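For example, a hypothetical invocation of the script above, using printf rather than ls to feed it filenames (./data is the program from the question):
printf '%s\n' * | ./map.bash ./data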
Of course, you can also just use the utility xargs:
find -maxdepth 0 * | xargs -n 1 data
Note that you should make sure indicators are turned off (ls --indicator-style=none) if you normally use them, or the @ appended to symlinks will turn them into nonexistent filenames.
GNU Parallel specializes in making these kind of mappings:
parallel data ::: *
It will run one job on each CPU core in parallel.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
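A hedged one-liner tying this back to the question's desired "foo: 137" output (--tag prefixes each line of output with its argument; -j4 caps it at four simultaneous jobs):
parallel --tag -j4 ./data ::: *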
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Since you specifically asked about this in terms of "map", I thought I'd share this function I have in my personal shell library:
# map_lines: evaluate a command for each line of input
map_lines()
{
    while read line ; do
        $1 $line
    done
}
I use it in the manner you asked about:
$ ls | map_lines ./data
I named it map_lines instead of map as I assumed some day I may implement a map_args where you would use it like this:
$ map_args ./data *
That function would look like this:
map_args()
{
cmd="$1" ; shift
for arg ; do
$cmd "$arg"
done
}
Try this:
for i in *; do echo ${i}: `data $i`; done
You can create a shell script like so:
#!/bin/bash
cd /path/to/your/dir
for file in `dir -d *` ; do
./data "$file"
done
That loops through every file in /path/to/your/dir and runs your "data" script on it. Be sure to chmod the above script so that it is executable.
You could also use PRLL.
ls doesn't handle blanks, linefeeds and other funky stuff in filenames and should be avoided where possible.
find is only useful if you like to dive into subdirs, or if you want to make usage from the other options (mtime, size, you name it).
But many commands handle multiple files themselves, so you don't need a for loop. Instead of
for d in * ; do du -s $d; done
you can simply write
du -s *
md5sum e*
identify *jpg
grep bash ../*.sh
I have just written this script specifically to address the same need.
http://gist.github.com/kindaro/4ba601d19f09331750bd
It uses find to build the set of files to map over, which allows finer selection of the files but also leaves more room for mistakes.
I designed two modes of operation: the first mode runs a command with "source file" and "target file" arguments, while the second mode supplies source file contents to a command as stdin and writes its stdout into a target file.
We may further consider adding support for parallel execution, and maybe limiting the set of custom find arguments to a few of the most necessary ones. I am not really sure if that's the right thing to do.

Unix Shell scripting for copying files and creating directory

I have a source directory eg /my/source/directory/ and a destination directory eg /my/dest/directory/, which I want to mirror with some constraints.
I want to copy files which meet certain criteria of the find command, eg -ctime -2 (less than 2 days old) to the dest directory to mirror it
I want to include some of the prefix so I know where it came from, eg /source/directory
I'd like to do all this with absolute paths so it doesn't depend on which directory I run it from
I'd guess not having cd commands is good practice too.
I want the subdirectories created if they don't exist
So
/my/source/directory/1/foo.txt -> /my/dest/directory/source/directory/1/foo.txt
/my/source/directory/2/3/bar.txt -> /my/dest/directory/source/directory/2/3/bar.txt
I've hacked together the following command line, but it seems a bit ugly; can anyone do better?
find /my/source/directory -ctime -2 -type f -printf "%P\n" | xargs -IFILE rsync -avR /my/./source/directory/FILE /my/dest/directory/
Please comment if you think I should add this command line as an answer myself; I didn't want to be greedy for reputation.
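One hedged refinement of the command line above, assuming GNU find and an rsync that supports --files-from (which implies --relative), avoids launching a separate rsync per file:
find /my/source/directory -ctime -2 -type f -printf "source/directory/%P\n" |
    rsync -av --files-from=- /my/ /my/dest/directory/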
This is remarkably similar to a (closed) question: Bash scripting copying files without overwriting. The answer I gave cites the 'find | cpio' solution mentioned in other answers (minus the time criteria, but that's the difference between 'similar' and 'same'), and also outlines a solution using GNU 'tar'.
ctime
When I tested on Solaris, neither GNU tar nor (Solaris) cpio was able to preserve the ctime setting; indeed, I'm not sure that there is any way to do that. For example, the touch command can set the atime or the mtime or both - but not the ctime. The utime() system call also only takes the mtime or atime values; it does not handle ctime. So, I believe that if you find a solution that preserves ctime, that solution is likely to be platform-specific. (Weird example: hack the disk device and edit the data in the inode - not portable, requires elevated privileges.) Rereading the question, though, I see that 'preserving ctime' is not part of the requirements (phew); it is simply the criterion for whether the file is copied or not.
chdir
I think that the 'cd' operations are necessary, but they can be wholly localized to the script or command line, as illustrated in the question cited and in the command lines below, the second of which assumes GNU tar.
(cd /my; find source/directory -ctime -2 | cpio -pvdm /my/dest/directory)
(cd /my; find source/directory -ctime -2 | tar -cf - -T - ) |
(cd /my/dest/directory; tar -xf -)
Without using chdir() (aka cd), you need specialized tools or options to handle the manipulation of the pathnames on the fly.
Names with blanks, newlines, etc
The GNU-specific 'find -print0' and 'xargs -0' are very powerful and effective, as noted by Adam Hawes. Funnily enough, GNU cpio has an option to handle the output from 'find -print0', and that is '--null' or its short form '-0'. So, using GNU find and GNU cpio, the safe command is:
(cd /my; find source/directory -ctime -2 -print0 |
cpio -pvdm0 /my/dest/directory)
Note: This does not overwrite pre-existing files under the backup directory. Add -u to the cpio command for that.
Similarly, GNU tar supports --null (apparently with no -0 short-form), and could also be used:
(cd /my; find source/directory -ctime -2 -print0 | tar -cf - -T - --null ) |
(cd /my/dest/directory; tar -xf -)
The GNU handling of file names with the null terminator is extremely clever and a valuable innovation (though I only became aware of it fairly recently, courtesy of SO; it has been in GNU tar for at least a decade).
You could try cpio using the copy-pass mode, -p. I usually use it with overwrite all (-u), create directories (-d), and maintain modification time (-m).
find myfiles | cpio -pmud target-dir
Keep in mind that find should produce relative path names, which doesn't fit your absolute-path criterion. This could of course be 'solved' using cd, which you also don't like (why not?)
(cd mypath; find myfiles | cpio ... )
The brackets will spawn a subshell, and will keep the state-change (i.e. the directory switch) local. You could also define a small procedure to abstract away the 'uglyness'.
If you're using find, always use -print0 and pipe the output through xargs -0; well, almost always. The first file with a space in its name will bork the script if you use find's default newline-terminated output.
I agree with all the other posters - use cpio or tar if you can. It'll do what you want and save the hassle.
An alternative is to use tar,
(cd $SOURCE; tar cf - .) | (cd $DESTINATION; tar xf -)
EDIT:
Ah, I missed the bit about preserving CTIME. I believe most implementations of tar will preserve mtime, but if preserving ctime is critical, then cpio is indeed the only way.
Also, some tar implementations (GNU tar being one) can select the files to include based on atime and mtime, though seemingly not ctime.
#!/bin/sh
SRC=/my/source/directory
DST=/my/dest/directory
for i in $(find $SRC -ctime -2 -type f) ; do
    SUBDST=$DST$(dirname $i)
    mkdir -p $SUBDST
    cp -p $i $SUBDST
done
And I suppose, since you want to include "where it came from", that you are going to use different source directories. This script can be modified to take the source dir as an argument simply by replacing SRC=/my/source/directory with SRC=$1.
EDIT: Removed redundant if statement.
Does not work when filenames contain whitespace.
#!/usr/bin/sh
# Script to copy files with the same directory structure
echo "Please enter Full Path of Source DIR (Starting with / and ending with /):"
read spath
echo "Please enter Full Path of Destination location (Starting with / and ending with /):"
read dpath
si=`echo "$spath" | awk -F/ '{print NF-1}'`
for fname in `find $spath -type f -print`
do
    cdir=`echo $fname | awk -F/ '{ for (i='$si'; i<NF; i++) printf "%s/", $i; printf "\n"; }'`
    if [ $cdir ]; then
        if [ ! -d "$dpath$cdir" ]; then
            mkdir -p $dpath$cdir
        fi
    fi
    cp $fname $dpath$cdir
done
