How to capture digits in front of specific keyword in bash - bash

Imagine the following string:
<tr><td>12,3</td><td>deg</td><td>23,4</td><td>humi</td><td>34,5</td><td>press</td></tr>
In bash, how do I extract 23.4, based on the condition that it is followed by humi?

grep -o works well for this sort of thing. I'm sure performance would be better with a single sed command than two greps but that's rarely a serious concern.
X='<tr><td>12,3</td><td>deg</td><td>23,4</td><td>humi</td><td>34,5</td><td>press</td></tr>'
echo "$X" | grep -o '[0-9,.]*</td><td>humi' | grep -o '[0-9,.]*'
# Result: 23,4
You can additionally pipe through tr , . to get English number format.
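The single sed command mentioned above could look something like the following sketch. The function name value_before is made up for illustration, and it assumes the keyword contains no characters that are special in a sed pattern.

```shell
# Hypothetical helper: print the number in the <td> cell directly before the
# cell holding the given keyword, converting the decimal comma to a point.
# Assumes the keyword has no regex metacharacters and no slashes.
value_before() {
    sed -n "s/.*<td>\([0-9,.]*\)<\/td><td>$1<\/td>.*/\1/p" | tr ',' '.'
}

X='<tr><td>12,3</td><td>deg</td><td>23,4</td><td>humi</td><td>34,5</td><td>press</td></tr>'
printf '%s\n' "$X" | value_before humi   # prints 23.4
```

Because the keyword is a parameter, the same function also works for deg or press.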

Related

Shell: How can I posixly determine if a file contains one or more null characters?

There are some great answers that address removing null characters from a file- it seems like sed is probably the most effective way. However, all the other questions I have been able to find are concerned not with finding null characters but removing them.
There are certain questions that do provide valid solutions; however, I am having difficulty finding a POSIX-compliant solution that does not rely on GNUisms. The two solutions I've seen that work use cat with the -v option and grep with the -P option (neither of which POSIX requires to be supported).
I make it a habit to delegate as much as possible to the shell, but the shell can't help me here because it is not possible to store a null character in a variable. External tools are the only option, but I can't even find a way with them when I adhere to POSIX-compliant options.
One possible way would be tr and wc:
[ "$(tr -cd '\0' < file | wc -c)" -gt 0 ]
Alternatively, od and grep will allow stopping on the first one without reading the rest of the file:
od -A n -t x1 file | grep -q 00
Use tr -d '\000' to strip the nulls into a temporary file, and use wc -c to get the number of characters in each file. If the temporary file doesn't match the original (cmp -s), the original contained nulls, and the difference between the two wc -c counts gives the number of nulls.
p.s.: grep -P isn't POSIX either. Nor is the -C option found in POSIX.
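Counting does not strictly need a temporary file: the byte count of the original minus the byte count with nulls stripped is the number of nulls. A minimal sketch using only tr and wc; the function names are hypothetical.

```shell
# Hypothetical helpers built from POSIX tr and wc only.

# Print the number of NUL bytes in the file named by $1.
count_nuls() {
    total=$(wc -c < "$1")
    stripped=$(tr -d '\000' < "$1" | wc -c)
    # Arithmetic expansion also swallows any padding wc puts around the number.
    echo $((total - stripped))
}

# Succeed (exit status 0) if the file contains at least one NUL byte.
has_nul() {
    [ "$(count_nuls "$1")" -gt 0 ]
}
```

Unlike the od and grep pipeline, this always reads the whole file, so it is the better fit only when the count itself is wanted.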

Printing multiple parts of the same line matching a pattern using bash

I am writing a Unix command to get the lines that have abcd at positions 87-90 and, for the lines matching this criterion, print positions 10-15, 124-128, and 250-265. I tried something like this:
grep -h abcd sample.txt |cut -c 10-15,cut -c 124-128,cut -c 250-260
Though this is syntactically wrong, I hope it conveys what I am trying to achieve. Could you help me concatenate all the results from the multiple cuts?
cut -c accepts a list of characters. As described in the man page, "each list is made up of one range, or many ranges separated by commas."
grep -h abcd sample.txt | cut -c 10-15,124-128,250-260
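Note that grep abcd matches abcd anywhere on the line, not only at positions 87-90. If the position matters, one option is to do both the test and the extraction in awk. This is a sketch that assumes 1-based positions and single-byte characters; the function name is hypothetical, and the ranges follow the cut command above.

```shell
# Hypothetical helper: for lines with "abcd" at columns 87-90, print
# columns 10-15, 124-128 and 250-260 joined together.
# Usage: extract_fields sample.txt   (reads stdin when no file is given)
extract_fields() {
    awk 'substr($0, 87, 4) == "abcd" {
        print substr($0, 10, 6) substr($0, 124, 5) substr($0, 250, 11)
    }' "$@"
}
```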

WC on OSX - Return includes spaces

When I run the word count command in OSX terminal like wc -c file.txt I get the below answer that includes spaces padded before the answer. Does anyone know why this happens, or how I can prevent it?
   18000 file.txt
I would expect to get:
18000 file.txt
This occurs using bash or the Bourne shell.
The POSIX standard for wc may be read to imply that there are no leading blanks, but does not say that explicitly. Standards are like that.
This is what it says:
By default, the standard output shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
and does not mention the formats for the single-column options such as -c.
A quick check shows me that AIX, OSX, Solaris use a format which specifies the number of digits for the value — to align columns (and differ in the number of digits). HPUX and Linux do not.
So it is just an implementation detail.
I suppose it is a way of getting outputs to line up nicely, and as far as I know there is no option to wc which fine tunes the output format.
You could get rid of them pretty easily by piping through sed 's/^ *//', for example.
There may be an even simpler solution, depending on why you want to get rid of them.
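As a sketch of the sed approach, wrapped in a hypothetical function: [[:space:]] is used rather than a literal space so that tabs would be stripped as well.

```shell
# Hypothetical wrapper: byte count and file name, without the leading padding.
# Usage: wc_bytes file.txt
wc_bytes() {
    wc -c "$1" | sed 's/^[[:space:]]*//'
}
```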
At least under macOS/bash, wc pads its output with leading whitespace.
It can be avoided using expr:
echo -n "some words" | expr $(wc -c)
>> 10
echo -n "some words" | expr $(wc -w)
>> 2
Note: The -n prevents echoing a newline character which would count as 1 in wc -c
This bugs me every time I write a script that counts lines or characters. I wish that wc were defined not to emit the extra spaces, but it's not, so we're stuck with them.
When I write a script, instead of
nlines=`wc -l "$file"`
I always say
nlines=`wc -l < "$file"`
so that wc's output doesn't include the filename, but that doesn't help with the extra spaces. The trick I use next is to add 0 to the number, like this:
nlines=`expr $nlines + 0` # get rid of leading spaces
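In a POSIX shell the same effect is available without expr: arithmetic expansion evaluates the padded number and yields it without the spaces. A minimal sketch; count_lines is a hypothetical name.

```shell
# Hypothetical helper: print the number of lines in file $1 with no padding.
count_lines() {
    # Redirecting stdin keeps the file name out of the output of wc;
    # $(( ... )) then drops any surrounding whitespace.
    echo $(( $(wc -l < "$1") ))
}
```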

Using ssh remote plus grep

I'm running a shell script like the one below:
vQtde=`ssh user@server 'ls -lrt /mnta2/gvt/Interfaces/output/BI/sent/*.?${vDiaAnterior}* | grep "${vMDAtual}0[345678]:" |wc -l'`
It returns an error: ksh: /usr/bin/sh: arg list too long
I know that the same command run locally on the server returns 9. How can I escape the double quotes in the remote grep?
The variables are:
vDiaAtual=`date +%d`
vMesAtual=`date +%b`
vMDAtual=" $vMesAtual $vDiaAtual ";
vDiaAnterior=120614
The problem here is not with grep. The problem is that the argument /mnta2/gvt/Interfaces/output/BI/sent/*.?${vDiaAnterior}* is expanded by the shell (ksh in this case) and the resulting list is too long.
It would be better to simply run ls -lrt /mnta2/gvt/Interfaces/output/BI/sent/ and filter its output with an additional grep.
Something like:
ls -lrt /mnta2/gvt/Interfaces/output/BI/sent/ | grep "\..${vDiaAnterior}" | grep ...
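One likely reason the list is so much longer over ssh than locally: the whole remote command is in single quotes, so ${vDiaAnterior} and ${vMDAtual} are not expanded by the local shell. They reach the remote shell literally, where they are presumably unset, and the glob degenerates to *.?*. The sketch below reproduces the effect locally, using sh -c as a stand-in for ssh (an assumption made so that it runs without a server); the function names are hypothetical.

```shell
vDiaAnterior=120614   # set in the calling shell only, not exported

# Single quotes: the child shell does the expansion and sees no such variable.
pattern_single_quoted() {
    sh -c 'printf "%s\n" "*.?${vDiaAnterior}*"'
}

# Double quotes: the calling shell expands the variable before sh runs.
pattern_double_quoted() {
    sh -c "printf '%s\n' '*.?${vDiaAnterior}*'"
}

pattern_single_quoted   # prints *.?*
pattern_double_quoted   # prints *.?120614*
```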
Based on information regarding that error message, I'm not sure if escaping the quotes is the real issue here.
What is it that you're ultimately trying to do? There's probably a slightly different way to approach it that avoids this problem. It appears that you're trying to count the number of files with a certain "last modified" date. Is this accurate? If so, I highly recommend against using the output of ls to do that. The output is inconsistent between platforms and can even change between versions. The find utility is much better suited for this sort of thing.
Try something like this instead:
dir=/mnta2/gvt/Interfaces/output/BI/sent/
pattern="*.?${vDiaAnterior}*"
time= # Fill this in based on the "last modified" time that you're looking for
find "$dir" -iname "$pattern" -mtime "$time" -exec printf '.' \; | wc -c
You can omit using the extra variables, they're only there to make the code more readable on the webpage.
This will search the given directory for all files with names that match the specified wildcard pattern and with "last modified" times that match whatever you specify. For each match found, the code printf '.' (which prints one dot to stdout) will be run. wc then counts the number of dot characters, which will be equal to the number of matching files found. The benefit of this method is that it minimizes the amount of data that needs to be piped between programs (including between the shell and ls). find handles the wildcard matching internally instead of requiring the shell to expand the wildcard and pass the result to ls. You're also only sending one character per matching file to wc instead of one long line of ls output per match. That should reduce the chances that you encounter the "arg list too long" error.
I resolved the problem this way:
- Create a .sh script on the server that takes the values as parameters:
#!/usr/local/bin/bash
vDiaAnterior="${1}";
vMDAtual="${2}";
ls -l /mnta2/gvt/Interfaces/output/BI/sent/*.?${vDiaAnterior}AMA | grep "${vMDAtual}[345678]:" | wc -l;
- Call it remotely:
ssh user@server ". /mnta1/prod_med1/scriptsf/ver_jobs_3_horas.sh $vDiaAnterior '$vMDAtual'"
Result: 9 Files.
Best Regards,
Cauca

Grep characters before and after match?

Using this:
grep -A1 -B1 "test_pattern" file
will produce one line before and after the matched pattern in the file. Is there a way to display not lines but a specified number of characters?
The lines in my file are pretty big so I am not interested in printing the entire line but rather only observe the match in context. Any suggestions on how to do this?
3 characters before and 4 characters after
$> echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}'
23_string_and
grep -E -o ".{0,5}test_pattern.{0,5}" test.txt
This will match up to 5 characters before and after your pattern. The -o switch tells grep to only show the match and -E to use an extended regular expression. Make sure to put the quotes around your expression, else it might be interpreted by the shell.
You could use
awk 'match($0, /test_pattern/) {
start = RSTART - 10; if (start < 1) start = 1  # clamp at the start of the line
print substr($0, start, RSTART + RLENGTH + 10 - start)
}' file
You mean, like this:
grep -o '.\{0,20\}test_pattern.\{0,20\}' file
?
That will print up to twenty characters on either side of test_pattern. The \{0,20\} notation is like *, but specifies zero to twenty repetitions instead of zero or more. The -o says to show only the match itself, rather than the entire line.
I'll never easily remember these cryptic command modifiers so I took the top answer and turned it into a function in my ~/.bashrc file:
cgrep() {
    # For files whose lines are tens of thousands of characters long:
    # print 30 characters before and after the search pattern.
    if [ $# -eq 2 ] ; then
        # Format was 'cgrep "search string" /path/to/filename'
        grep -o -P ".{0,30}$1.{0,30}" "$2"
    else
        # Format was 'cat /path/to/filename | cgrep "search string"'
        grep -o -P ".{0,30}$1.{0,30}"
    fi
} # cgrep()
Here's what it looks like in action:
$ ll /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
-rw-r--r-- 1 rick rick 25780 Jul 3 19:05 /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
$ cat /tmp/rick/scp.Mf7UdS/Mf7UdS.Source | cgrep "Link to iconic"
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
$ cgrep "Link to iconic" /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
The file in question is one continuous 25K line and it is hopeless to find what you are looking for using regular grep.
Notice the two different ways you can call cgrep, which parallel the ways grep itself can be called.
There is a "niftier" way of creating the function where "$2" is only passed when set which would save 4 lines of code. I don't have it handy though. Something like ${parm2} $parm2. If I find it I'll revise the function and this answer.
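One way to get that effect (not necessarily the one the author had in mind) is to shift the pattern off the argument list and pass whatever remains with "$@", which expands to nothing when no file was given. This sketch uses -E instead of -P so that it does not depend on PCRE support; cgrep2 is a hypothetical name.

```shell
# Hypothetical shorter variant of cgrep.
# Usage: cgrep2 "search string" [file...]   (reads stdin when no file is given)
cgrep2() {
    pattern=$1
    shift
    # "$@" is now the list of files, or nothing at all.
    grep -o -E ".{0,30}${pattern}.{0,30}" "$@"
}
```

A side effect of this form is that several files can be passed at once, in which case grep prefixes each match with the file name.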
With gawk, you can use the match function:
x="hey there how are you"
echo "$x" |awk --re-interval '{match($0,/(.{4})how(.{4})/,a);print a[1],a[2]}'
ere are
If you are OK with perl, here is a more flexible solution. The following prints three characters before the pattern, then the pattern itself, and then five characters after the pattern:
echo hey there how are you |perl -lne 'print "$1$2$3" if /(.{3})(there)(.{5})/'
ey there how
This can also be applied to words instead of just characters. The following prints one word before the matching string:
echo hey there how are you |perl -lne 'print $1 if /(\w+) there/'
hey
The following prints one word after the pattern:
echo hey there how are you |perl -lne 'print $2 if /(\w+) there (\w+)/'
how
The following prints one word before the pattern, then the pattern itself, and then one word after it:
echo hey there how are you |perl -lne 'print "$1$2$3" if /(\w+)( there )(\w+)/'
hey there how
If using ripgrep, this is how you would do it:
rg -o ".{0,5}test_pattern.{0,5}" test.txt
You can use one grep with a regular expression to find the match and a second grep to highlight it:
echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}' | grep string
23_string_and
With ugrep you can specify -ABC context with option -o (--only-matching) to show the match with extra characters of context before and/or after the match, fitting the match plus the context within the specified -ABC width. For example:
ugrep -o -C30 pattern testfile.txt
gives:
1: ... long line with an example pattern to match. The line could...
2: ...nother example line with a pattern.
On a terminal, the same output is shown with color highlighting.
Multiple matches on a line are either shown with [+nnn more] or, with option -k (--column-number), each shown individually with context and the column number.
The context width is the number of Unicode characters displayed (UTF-8/16/32), not just ASCII.
I personally do something similar to the posted answers. I often don't need a lot of context (if I needed more, I might use lines, as with grep -C, but often, like you, I don't want the lines before and after). So I find it much quicker to type one dot per character of context: tap the dot key a few times for a few characters, or hold it down for more.
e.g. echo zzzabczzzz | grep -o '.abc..'
This matches the abc pattern with one character before it and two after (in a regex, a dot matches any character). Others used the dot too, but with curly braces to specify the repetition.
If I wanted to be strict about matching between 0 and x characters, or exactly y characters, then I'd use the curly braces and -P, as others have done.
There is a setting for whether the dot matches a newline, which you can look into if that is a concern.
