WC on OSX - Return includes spaces - macos

When I run the word count command in OSX terminal like wc -c file.txt I get the below answer that includes spaces padded before the answer. Does anyone know why this happens, or how I can prevent it?
18000 file.txt
I would expect to get:
18000 file.txt
This occurs using bash or bourne shell.

The POSIX standard for wc may be read to imply that there are no leading blanks, but does not say that explicitly. Standards are like that.
This is what it says:
By default, the standard output shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
and does not mention the formats for the single-column options such as -c.
A quick check shows me that AIX, OSX, Solaris use a format which specifies the number of digits for the value — to align columns (and differ in the number of digits). HPUX and Linux do not.
So it is just an implementation detail.

I suppose it is a way of getting outputs to line up nicely, and as far as I know there is no option to wc which fine tunes the output format.
You could get rid of them pretty easily by piping through sed 's/^ *//', for example.
There may be an even simpler solution, depending on why you want to get rid of them.

At least under macOS/bash wc exhibits the behavior of outputting trailing positional TABs.
It can be avoided using expr:
echo -n "some words" | expr $(wc -c)
>> 10
echo -n "some words" | expr $(wc -w)
>> 2
Note: The -n prevents echoing a newline character which would count as 1 in wc -c

This bugs me every time I write a script that counts lines or characters. I wish that wc were defined not to emit the extra spaces, but it's not, so we're stuck with them.
When I write a script, instead of
nlines=`wc -l $file`
I always say
nlines=`wc -l < $file`
so that wc's output doesn't include the filename, but that doesn't help with the extra spaces. The trick I use next is to add 0 to the number, like this:
nlines=`expr $nlines + 0` # get rid of trailing spaces

Related

Shell: How can I posixly determine if a file contains one or more null characters?

There are some great answers that address removing null characters from a file- it seems like sed is probably the most effective way. However, all the other questions I have been able to find are concerned not with finding null characters but removing them.
There are certain questions that do provide valid solutions- however, I am having difficulty finding a POSIX-compliant solution that does not rely on GNUisms. The two solutions I've seen that work use cat with the -v option and grep with the -P option (neither of which shall be supported).
I make it a habit to delegate as much as possible to the shell, but the shell can't help me here because it is not possible to store a null character in a variable. External tools are the only option, but I can't even find a way with them when I adhere to POSIX-compliant options.
One possible way would be tr and wc:
[ "$(tr -cd '\0' < file | wc -c)" -ge 0 ]
Alternatively, od and grep will allow stopping on the first one without reading the rest of the file:
od -A n -t x1 file | grep -q 00
Use tr -d '\000' to trim nulls into a temporary file and use wc -c to get the number of characters in the file. If the temporary-file doesn't match (cmp -s) the original, that contained nulls, and the output from wc can be used to compute the number of nulls -- the point of the question.
p.s.: grep -P isn't POSIX either. Nor is the -C option found in POSIX.

Unix Epoch to date with sed

I wanna change unix epoch to normal date
i'm trying:
sed < file.json -e 's/\([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/`date -r \1`/g'
any hint?
With the lack of information from your post, I can not give you a better answer than this but it is possible to execute commands using sed!
You have different ways to do it you can use
directly sed e instruction followed by the command to be
executed, if you do not pass a command to e then it will treat the content of the pattern buffer as external command.
use a simple substitute command with sed and pipe the output to sh
Example 1:
echo 12687278 | sed "s/\([0-9]\{8,\}\)/date -d #\1/;e"
Example 2:
echo 12687278 | sed "s/\([0-9]\{8,\}\)/date -d #\1/" |sh
Test 1 (with Japanese locale LC_TIME=ja_JP.UTF-8):
Test 2 (with Japanese locale LC_TIME=ja_JP.UTF-8):
Remarks:
I will let you adapt the date command accordingly to your system specifications
Since modern dates are longer than 8 characters, the sed command uses an
open ended length specifier of at least 8, rather than exactly 8.
Allan has a nice way to tackle dynamic arguments: write a script dynamically and pipe it to a shell! It works. It tends to be a bit more insecure because you could potentially pipe unintentional shell components to sh - for example if rm -f some-important-file was in the file along with the numbers , the sed pipeline wouldn't change that line, and it would also be passed to sh along with the date commands. Obviously, this is only a concern if you don't control the input. But mistakes can happen.
A similar method I much prefer is with xargs. It's a bit of a head trip for new users, but very powerful. The idea behind xargs is that it takes its input from its standard in, then adds it to the command comprised of its own non-option arguments and runs the command(s). For instance,
$ echo -e "/tmp\n/usr/lib" | xargs ls -d
/tmp /usr/lib
Its a trivial example of course, but you can see more exactly how this works by adding an echo:
echo -e "/tmp\n/usr/lib" | xargs echo ls -d
ls -d /tmp /usr/lib
The input to xargs becomes the additional arguments to the command specified in xargs's own arguments. Read that twice if necessary, or better yet, fiddle with this powerful tool, and the light bulb should come on.
Here's how I would approach what you're doing. Of course I'm not sure if this is actually a logical thing to do in your case, but given the detail you went into in your question, it's the best I can do.
$ cat dates.txt
Dates:
1517363346
I can run a command like this:
$ sed -ne '/^[0-9]\{8,\}$/ p' < dates.txt | xargs -I % -n 1 date -d #%
Tue Jan 30 19:49:06 CST 2018
Makes sense, because I used the commnad echo -e "Dates:\ndate +%s" > dates.txt to make the file a few minutes before I wrote this post! Let's go through it together and I'll break down what I'm doing here.
For one thing, I'm running sed with -n. This tells it not to print the lines by default. That makes this script work if not every line has an 8+ digit "date" in it. I also added anchors to the start (^) and end ($) of the regex so the line had only the approprate digits ( I realize this may not be perfect for you, but without understanding your its input, I can't do better ). These are important changes if your file is not entirely comprised of date strings. Additionally, I am matching at least 8 characters, as modern date strings are going to be more like 10 characters long. Finally, I added a command p to sed. This tells it to print the matching lines, which is necessary because I specifically said not to print the nonmatching lines.
The next bit is the xargs iteslf. The sed will write a date string out to xargs's standard input. I set only a few settings for xargs. By default it will add the standard input to the end of the command, separated by a space. I didn't want a space, so I used -I to specify a replacement string. % doesn't have a special meaning; its just a placeholder that gets replaced with the input. I used % because its not a special character but rarely is used in commands. Finally, I added -n 1 to make sure only 1 input was used per execution of date. ( xargs can also add many inputs together, as in my ls example above).
The end result? Sed matches lines that consist, exclusively, of 8 or more numeric values, outputting the matching lines. The pipe then sends this output to xargs, which takes each line separately (-n 1) and, replacing the placeholder (-I %) with each match, then executes the date command.
This is a shell pattern I really like, and use every day, and with some clever tweaks, can be very powerful. I encourage anyone who uses linux shell to get to know xargs right away.
There is another option for GNU sed users. While the BSD land folks were pretty true to their old BSD unix roots, the GNU folks, who wrote their userspace from scratch, added many wonderful enhancements to the standards. GNU Sed can apparently run a subshell command for you and then do the replacement for you, which would be dramatically easier. Since you are using the bsd style date invocation, I'm going to assume you don't have gnu sed at your disposal.
Using sed: tested with macOs only
There is a slight difference with the command date that should use the flag (-r) instead of (-d) exclusive to macOS
echo 12687278 | sed "s/\([0-9]\{8,\}\)/$(date -r \1)/g"
Results:
Thu Jan 1 09:00:01 JST 1970

BASH Palindrome Checker

This is my first time posting on here so bear with me please.
I received a bash assignment but my professor is completely unhelpful and so are his notes.
Our assignment is to filter and print out palindromes from a file. In this case, the directory is:
/usr/share/dict/words
The word lengths range from 3 to 45 and are supposed to only filter lowercase letters (the dictionary given has characters and uppercases, as well as lowercase letters). i.e. "-dkas-das" so something like "q-evvavve-q" may count as a palindrome but i shouldn't be getting that as a proper result.
Anyways, I can get it to filter out x amount of words and return (not filtering only lowercase though).
grep "^...$" /usr/share/dict/words |
grep "\(.\).\1"
And I can use subsequent lines for 5 letter words and 7 and so on:
grep "^.....$" /usr/share/dict/words |
grep "\(.\)\(.\).\2\1"
But the prof does not want that. We are supposed to use a loop. I get the concept but I don't know the syntax, and like I said, the notes are very unhelpful.
What I tried was setting variables x=... and y=.. and in a while loop, having x=$x$y but that didn't work (syntax error) and neither did x+=..
Any help is appreciated. Even getting my non-lowercase letters filtered out.
Thanks!
EDIT:
If you're providing a solution or a hint to a solution, the simplest method is prefered.
Preferably one that uses 2 grep statements and a loop.
Thanks again.
Like this:
for word in `grep -E '^[a-z]{3,45}$' /usr/share/dict/words`;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Output using my dictionary:
aha
bib
bob
boob
...
wow
Update
As pointed out in the comments, reading in most of the dictionary into a variable in the for loop might not be the most efficient, and risks triggering errors in some shells. Here's an updated version:
grep -E '^[a-z]{3,45}$' /usr/share/dict/words | while read -r word;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Why use grep? Bash will happily do that for you:
#!/bin/bash
is_pal() {
local w=$1
while (( ${#w} > 1 )); do
[[ ${w:0:1} = ${w: -1} ]] || return 1
w=${w:1:-1}
done
}
while read word; do
is_pal "$word" && echo "$word"
done
Save this as banana, chmod +x banana and enjoy:
./banana < /usr/share/dict/words
If you only want to keep the words with at least three characters:
grep ... /usr/share/dict/words | ./banana
If you only want to keep the words that only contain lowercase and have at least three letters:
grep '^[[:lower:]]\{3,\}$' /usr/share/dict/words | ./banana
The multiple greps are wasteful. You can simply do
grep -E '^([a-z])[a-z]\1$' /usr/share/dict/words
in one fell swoop, and similarly, put the expressions on grep's standard input like this:
echo '^([a-z])[a-z]\1$
^([a-z])([a-z])\2\1$
^([a-z])([a-z])[a-z]\2\1$' | grep -E -f - /usr/share/dict/words
However, regular grep does not permit backreferences beyond \9. With grep -P you can use double-digit backreferences, too.
The following script constructs the entire expression in a loop. Unfortunately, grep -P does not allow for the -f option, so we build a big thumpin' variable to hold the pattern. Then we can actually also simplify to a single pattern of the form ^(.)(?:.|(.)(?:.|(.)....\3)?\2?\1$, except we use [a-z] instead of . to restrict to just lowercase.
head=''
tail=''
for i in $(seq 1 22); do
head="$head([a-z])(?:[a-z]|"
tail="\\$i${tail:+)?}$tail"
done
grep -P "^${head%|})?$tail$" /usr/share/dict/words
The single grep should be a lot faster than individually invoking grep 22 or 43 times on the large input file. If you want to sort by length, just add that as a filter at the end of the pipeline; it should still be way faster than multiple passes over the entire dictionary.
The expression ${tail+:)?} evaluates to a closing parenthesis and question mark only when tail is non-empty, which is a convenient way to force the \1 back-reference to be non-optional. Somewhat similarly, ${head%|} trims the final alternation operator from the ultimate value of $head.
Ok here is something to get you started:
I suggest to use the plan you have above, just generate the number of "." using a for loop.
This question will explain how to make a for loop from 3 to 45:
How do I iterate over a range of numbers defined by variables in Bash?
for i in {3..45};
do
* put your code above here *
done
Now you just need to figure out how to make "i" number of dots "." in your first grep and you are done.
Also, look into sed, it can nuke the non-lowercase answers for you..
Another solution that uses a Perl-compatible regular expressions (PCRE) with recursion, heavily inspired by this answer:
grep -P '^(?:([a-z])(?=[a-z]*(\1(?(2)\2))$))++[a-z]?\2?$' /usr/share/dict/words

Shell script: Count number of files in a particular type extension in single folder

I am new with shell script.
I need to save the number of files with particular extension(.properties) in a variable using shell script.
I have used
ls |grep .properties$ |wc -l
but this command prints the number of properties files in the folder. How can I assign this value in a variable.
I have tried
count=${ls |grep .properties$ |wc -l}
But it is showing error like:
./replicate.sh: line 57: ${ls |grep .properties$ |wc -l}: bad substitution
What is this type of errors?
Please anyone help me to save the number of particular files in a variable for future use.
You're using the wrong brackets, it should be $() (command output substitution) rather than ${} (variable substitution).
count=$(ls -1 | grep '\.properties$' | wc -l)
You'll also notice I've use ls -1 to force one file per line in case your ls doesn't do this automatically for pipelines, and changed the pattern to match the . correctly.
You can also bypass the grep totally if you use something like:
count=$(ls -1 *.properties 2>/dev/null | wc -l)
Just watch out for "evil" filenames like those with embedded newlines for example, though my ls seems to handle these fine by replacing the newline with a ? character - that's not necessarily a good idea for doing things with files but it works okay for counting them.
There are better tools to use if you have such beasts and you need the actual file name, but they're rare enough that you generally don't have to worry about it.
You could use a loop with globbing:
count=0
for i in *.properties; do
count=$((count+1))
done
If you are using a shell that supports arrays, you can simply capture all such file names
files=( *.properties )
and then determine the number of array elements
count=${#files[#]}
(The above assumes bash; other shells may require slightly different syntax.)
You'd better use find instead of parsing ls. Then, use the var=$(command) syntax to store the value.
var=$(find . -maxdepth 1 -name "*\.properties" | wc -l)
Reference: Why you shouldn't parse the output of ls.
To solve the problem appearing if any file name contains new lines, you can use what chepner suggests in the comments:
var=$(find . -maxdepth 1 -name "*\.properties" -exec 'echo 1' | wc -l)
so that for every match it will print not the name, but any random character (in this case, 1) and then the amount of them will be counted to produce the correct output.
Use:
count=`ls|grep .properties$ | wc -l`
echo $count
You could write your assignment like this:
count=$(ls -q | grep -c '\.properties$')
or
count=$(ls -qA | grep -c '\.properties$')
if you want to include hidden files.
This works with all kind of filenames because we're using ls with q.
Sure it's easier to link to some webpage that tells you to "never parse ls" than to read the ls manual and see there's a q option (and that most implementations default to q if the output is to a terminal device which explains why some people here state their ls seems to handle filenames with newlines just fine by replacing the newline with a ? character).

How to print the number of characters in each line of a text file

I would like to print the number of characters in each line of a text file using a unix command. I know it is simple with powershell
gc abc.txt | % {$_.length}
but I need unix command.
Use Awk.
awk '{ print length }' abc.txt
while IFS= read -r line; do echo ${#line}; done < abc.txt
It is POSIX, so it should work everywhere.
Edit: Added -r as suggested by William.
Edit: Beware of Unicode handling. Bash and zsh, with correctly set locale, will show number of codepoints, but dash will show bytes—so you have to check what your shell does. And then there many other possible definitions of length in Unicode anyway, so it depends on what you actually want.
Edit: Prefix with IFS= to avoid losing leading and trailing spaces.
Here is example using xargs:
$ xargs -d '\n' -I% sh -c 'echo % | wc -c' < file
I've tried the other answers listed above, but they are very far from decent solutions when dealing with large files -- especially once a single line's size occupies more than ~1/4 of available RAM.
Both bash and awk slurp the entire line, even though for this problem it's not needed. Bash will error out once a line is too long, even if you have enough memory.
I've implemented an extremely simple, fairly unoptimized python script that when tested with large files (~4 GB per line) doesn't slurp, and is by far a better solution than those given.
If this is time critical code for production, you can rewrite the ideas in C or perform better optimizations on the read call (instead of only reading a single byte at a time), after testing that this is indeed a bottleneck.
Code assumes newline is a linefeed character, which is a good assumption for Unix, but YMMV on Mac OS/Windows. Be sure the file ends with a linefeed to ensure the last line character count isn't overlooked.
from sys import stdin, exit
counter = 0
while True:
byte = stdin.buffer.read(1)
counter += 1
if not byte:
exit()
if byte == b'\x0a':
print(counter-1)
counter = 0
Try this:
while read line
do
echo -e |wc -m
done <abc.txt

Resources