Searching a file (grep/awk) for 2 carriage return/line-feed characters - bash

I'm trying to write a script that'll simply count the occurrences of \r\n\r\n in a file. (Opening the sample file in vim binary mode shows me the ^M character in the proper places, and the newline is still read as a newline).
Anyway, I know there are tons of solutions, but they don't seem to get me what I want.
e.g. awk -e '/\r/,/\r/!d' or using $'\n' as part of the grep statement.
However, none of these seem to produce what I need. I can't find the \r\n\r\n pattern with grep's "trick", since that just expands one variable. The awk solution is greedy, and so gets me way more lines than I want/need.
Switching grep to binary/Perl/no-newline mode seems to be closer to what I want,
e.g. grep -UPzo '\x0D', but really what I want then is grep -UPzo '\x0D\x0A\x0D\x0A', which doesn't produce the output I want.
It seems like such a simple task.

By default, awk treats \n as the record separator. That makes it very hard to count \r\n\r\n. If we choose some other record separator, say a letter, then we can easily count the occurrences of this combination. Thus:
awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a' file
Here, gsub returns the number of substitutions made. These are summed and, after the end of the file has been reached, we print the total number.
Example
Here, we use bash's $'...' construct to explicitly add carriage returns and linefeeds:
$ echo -n $'\r\n\r\n\r\n\r\na' | awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a'
2
Alternate solution (GNU awk)
We can tell it to treat \r\n\r\n as the record separator and then return the count (minus 1) of the number of records:
cat file <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
In awk, RS is the record separator and NR is the count of the number of records. Since we are using a multiple-character record separator, this requires GNU awk.
If the file ends with \r\n\r\n, the above would be off by one. To avoid that, the appended echo 1 is used to ensure that there is always at least one character after the last \r\n\r\n in the file.
Examples
Here, we use bash's $'...' construct to explicitly add carriage returns and linefeeds:
$ echo -n $'abc\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'abc\r\n\r\ndef' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'\r\n\r\n\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2
$ echo -n $'1\r\n\r\n2\r\n\r\n3' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2
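For completeness, the same count can be obtained by slurping the whole file in Perl (a sketch, not part of the answers above; it assumes perl is available):
$ echo -n $'\r\n\r\n\r\n\r\na' | perl -0777 -ne 'print scalar(() = /\r\n\r\n/g), "\n"'
2
Here -0777 makes perl read the entire input as a single record, and the list-assignment idiom counts the /\r\n\r\n/ matches.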

read values of txt file from bash [duplicate]

I'm trying to read values from a text file.
I have test1.txt which looks like
sub1 1 2 3
sub8 4 5 6
I want to obtain values '1 2 3' when I specify 'sub1'.
The closest I get is:
subj="sub1"
grep "$subj" test1.txt
But the answer is:
sub8 4 5 6
I've read that grep can give you the line following a match, so I've tried changing the text file to the following:
test2.txt looks like:
sub1
1 2 3
sub8
4 5 6
However, when I type
grep "$subj" test2.txt
The answer is:
sub1
It should be something super simple, but I've tried awk, sed, grep, egrep, cat and none of them is working... I've also read some somewhat related posts, but none was really helpful.
Awk works: awk '$1 == "'"$subj"'" { print $2, $3, $4 }' test1.txt
The command outputs fields two, three, and four for all lines in test1.txt where the first field is $subj (i.e.: the contents of the variable named subj).
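With the sample test1.txt from the question, this would print:
$ subj="sub1"
$ awk '$1 == "'"$subj"'" { print $2, $3, $4 }' test1.txt
1 2 3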
With your original text file format:
target=sub1
while IFS=$' \t\n' read -r key values; do
  if [[ $key = "$target" ]]; then
    echo "Found values: $values"
  fi
done <test1.txt
This requires no external tools, using only functionality built into bash itself. See BashFAQ #1.
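A quick illustration of how read splits each line into the key and the remaining values (a sketch):
$ read -r key values <<< "sub1 1 2 3"
$ echo "$key / $values"
sub1 / 1 2 3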
As has come up during debugging in comments, if you have a traditional Apple-format text file (CR newlines only), then you might want something more like:
target=sub1
while IFS=$' \t\n' read -r -d $'\r' key values || [[ $key ]]; do
  if [[ $key = "$target" ]]; then
    echo "Found values: $values"
  fi
done <test1.txt
Alternately, using awk (for a standard UNIX text file):
target="sub1"
awk -v target="$target" '$1 == target { $1 = ""; print; }' <test1.txt
...or, for a file with CR-only newlines:
target="sub1"
tr '\r' '\n' <test1.txt | awk -v target="$target" '$1 == target { $1 = ""; print; }'
This version will be slower if the text file being read is small (since awk, like any other external tool, takes time to start up); but faster if it's large (since awk's operation is much faster than that of bash's built-ins once it's done starting up).
grep "sub1" test1.txt | cut -c6-
or
grep -A 1 "sub1" test2.txt | tail -n 1
You're doing it right, but it seems like test1.txt has a wrong value in it.
With grep foo you get all lines with foo in them. Use grep -m1 foo to find only the first line with foo in it.
Then you can use cut -d" " -f2- to get all the values after foo, separated by spaces.
In the end the command would look like this ...
$ subj="sub1"
$ grep -m1 "$subj" test1.txt | cut -d" " -f2-
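With the sample test1.txt this would print:
1 2 3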
But this doesn't explain why you could not find sub1 in the first place.
Did you read the proper file?
There's a bunch of ways to do this (and shorter/more efficient answers than what I'm giving you), but I'm assuming you're a beginner at bash, and therefore I'll give you something that's easy to understand:
egrep "^$subj\>" file.txt | sed "s/^\S*\>\s*//"
or
egrep "^$subj\>" file.txt | sed "s/^[^[:blank:]]*\>[[:blank:]]*//"
The first part, egrep, will search for your subject at the beginning of the line in file.txt (that's what the ^ symbol does in the grep string). It also looks for a whole word (the \> matches an end-of-word boundary; that way sub1 doesn't match sub12 in the file). Notice you have to use egrep to get the \>, as grep by default doesn't recognize that escape sequence. Once done finding the lines, egrep then passes its output to sed, which will strip the first word and trailing whitespace off of each line. Again, the ^ symbol in the sed command specifies it should only match at the beginning of the line. The \S* tells it to read as many non-whitespace characters as it can. Then the \s* tells sed to gobble up as much whitespace as it can. sed then replaces everything it matched with nothing, leaving the other stuff behind.
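Assuming GNU grep and GNU sed (which support \>, \S and \s), running this against the test1.txt from the question would give:
$ subj="sub1"
$ egrep "^$subj\>" test1.txt | sed "s/^\S*\>\s*//"
1 2 3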
BTW, there's a help page on Stack Overflow that tells you how to format your questions (I'm guessing that was the reason you got a downvote).
-------------- EDIT ---------
As pointed out, if you are on a Mac or something similar, you have to use [^[:blank:]] instead of \S, and [[:blank:]] instead of \s in your sed expression (as these are portable to all platforms).
awk '/sub1/{ print $2,$3,$4 }' file
1 2 3
What happens? After regexp /sub1/ the three following fields are printed.
Any drawbacks? It affects the spacing: the fields are re-joined with single spaces (awk's output field separator), so the original whitespace is not preserved.
Sed also works: sed -n -e 's/^'"$subj"' *//p' file1.txt
It outputs all lines matching $subj at the beginning of a line after having removed the matching word and the spaces following. If TABs are used the spaces should be replaced by something like [[:space:]].
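With the test1.txt from the question:
$ subj="sub1"
$ sed -n -e 's/^'"$subj"' *//p' test1.txt
1 2 3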

How to add 100 spaces at end of each line of a file in Unix

I have a file which is supposed to contain 200 characters in each line. I received a source file with only 100 characters in each line. I need to add 100 extra white spaces to each line now. If it were few blank spaces, we could have used sed like:
sed 's/$/ /' filename > newfilename
Since it's 100 spaces, can anyone tell me whether it's possible to add them in Unix?
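One simple possibility (a sketch, not taken from the answers below): build the 100-space string once with printf and splice it into the sed substitution. Since the padding contains only spaces, it is safe to interpolate into the sed expression.
pad=$(printf '%100s' '')
sed "s/\$/$pad/" filename > newfilename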
If you want a fixed n chars per line (i.e. you don't trust that the input file has exactly m chars per line), follow this. For an input file with a varying number of chars per line:
$ cat file
1
12
123
1234
12345
extend to 10 chars per line.
$ awk '{printf "%-10s\n", $0}' file | cat -e
1 $
12 $
123 $
1234 $
12345 $
Obviously, change 10 to 200 in your script. Here $ marks the end of line; it is not there as a character. You don't need cat -e; it is used here just to show that the lines have been extended.
With awk
awk '{printf "%s%100s\n", $0, ""}' file.dat
$0 refers to the entire line.
Updated after Glenn's suggestion
As Glenn suggests in the comments, the substitution is unnecessary; you can just add the spaces. Taking that logic further, you don't even need the addition: you can simply say (print) them after the original line.
perl -nlE 'say $_," "x100' file
Original Answer
With Perl:
perl -pe 's/$/" " x 100/e' file
That says... "Substitute (s) the end of each line ($) with the calculated expression (e) of 100 repetitions of a space".
If you wanted to pad all lines to, say, 200 characters even if the input file was ragged (all lines of differing length), you could use something like this:
perl -lpe '$pad=200-length;s/$/" " x $pad/e'
which would make up lines of 83, 102 and 197 characters to 200 each.
If you use Bash, you can still use sed, but use some readline functionality to keep you from manually typing 100 spaces (see manual for "Readline arguments").
You start typing normally:
sed 's/$/
Now, you want to insert 100 spaces. You can do this by prefixing the space-bar keystroke with a readline numeric argument indicating that you want it repeated 100 times, i.e., you enter what would look like this written as a readline key sequence:
M-1 0 0 \040
Or, if your meta key is the Alt key: Alt+1 0 0 Space
This inserts 100 spaces, and you get
sed 's/$/ /' filename
after typing the rest of the command.
This is useful for working in an interactive shell, but not very pretty for scripts – use any of the other solutions for that.
Just in case you are looking for a bash solution: printf treats missing arguments as empty strings, so the second conversion %100s expands to 100 spaces.
while IFS= read -r line
do
  printf "%s%100s\n" "$line"
done < file > newfile
Test
Say I have a file with 3 lines in it:
$ wc -c file
16 file
$ wc -c newfile
316 newfile
Original Answer
spaces=$(echo {1..101} | tr -d 0-9)
while IFS= read -r line
do
  echo "${line}${spaces}" >> newfile
done < file
You can use printf in awk:
awk '{printf "%s%*.s\n", $0, 100, " "}' filename > newfile
This printf appends 100 spaces at the end of each line: the * takes the field width (100) from the argument list, and the empty precision truncates the string argument to zero characters, so only the space padding is printed.
Another way in GNU awk using string-manipulation function sprintf.
awk 'BEGIN{s=sprintf("%-100s", "");}{print $0 s}' input-file > file-with-spaces
A demonstration with an example:
$ cat input-file
1234jjj hdhyvb 1234jjj
6789mmm mddyss skjhude
khora77 koemm sado666
nn1004 nn1004 457fffy
$ wc -c input-file
92 input-file
$ awk 'BEGIN{s=sprintf("%-100s", "");}{print $0 s}' input-file > file-with-spaces
$ wc -c file-with-spaces
492 file-with-spaces

Using sed to extract strings from a text file

I have text data in this form:
^Well/Well[ADV]+ADV ^John/John[N]+N ^has/have[V]+V+3sg+PRES ^a/a[ART]
^quite/quite[ADV]+ADV ^different/different[ADJ]+ADJ ^not/not[PART]
^necessarily/necessarily[ADV]+ADV ^more/more[ADV]+ADV
^elaborated/elaborate[V]+V+PPART ^theology/theology[N]+N *edu$
And I want it to be processed to this form:
Well John have a quite different not necessarily more elaborate theology
Basically, I need every string between the starting character / and the ending character [.
Here is what I tried, but I just get empty files...
#!/bin/bash
for file in probe/*.txt
do sed '///,/[/d' $file > $file.aa
mv $file.aa $file
done
awk to the rescue!
$ awk -F/ -v RS=^ -v ORS=' ' '{print $1}' file
Well John has a quite different not necessarily more elaborated theology
Explanation: set the record separator (RS) to ^ to separate your logical groups, and set the field separator (FS) to /, then print the first field, as required. Finally, setting the output record separator (ORS) to a space (instead of the default newline) keeps the extracted fields on the same line.
With GNU grep and Perl compatible regular expressions (-P):
$ echo $(grep -Po '(?<=/)[^[]*' infile)
Well John have a quite different not necessarily more elaborate theology
-o retains just the matches, (?<=/) is a positive look-behind ("make sure there is a /, but don't include it in the match"), and [^[]* is "a sequence of characters other than [".
grep -Po prints one match per line; by using the output of grep as arguments to echo, we convert the newlines into spaces (could also be done by piping to tr '\n' ' ').
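For reference, the tr variant mentioned above would look like this:
$ grep -Po '(?<=/)[^[]*' infile | tr '\n' ' '
Well John have a quite different not necessarily more elaborate theology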
grep -oE '/[^[]*\[' file | sed -e 's#^/##' -e 's/\[$//' | tr -s '\n' ' '

how to grep multiple variables in bash

I need to grep multiple strings, but I don't know the exact number of strings.
My code is :
s2=( $(echo $1 | awk -F"," '{ for (i=1; i<=NF ; i++) {print $i} }') )
for pattern in "${s2[@]}"; do
  ssh -q host tail -f /some/path |
    grep -w -i --line-buffered "$pattern" > some_file 2>/dev/null &
done
Now, the code is not doing what it's supposed to do. For example, if I run ./script s1,s2,s3,s4,.....
it prints all lines that contain s1,s2,s3....
The script is supposed to do something like grep "$s1" | grep "$s2" | grep "$s3" ....
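For reference, a literal version of that chained-grep idea could be built dynamically (a sketch only; the answers below take different approaches):
IFS=, read -r -a patterns <<< "$1"
cmd="cat"
for p in "${patterns[@]}"; do
  cmd+=" | grep -w -i --line-buffered -e $(printf '%q' "$p")"
done
ssh -q host tail -f /some/path | eval "$cmd" > some_file
Here printf '%q' quotes each pattern so it survives the eval.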
grep doesn't have an option to match all of a set of patterns. So the best solution is to use another tool, such as awk (or your choice of scripting languages, but awk will work fine).
Note, however, that awk and grep have subtly different regular expression implementations. It's not clear from the question whether the target strings are literal strings or regular expression patterns, and if the latter, what the expectations are. However, since the argument comes delimited with commas, I'm assuming that the pieces are simple strings and should not be interpreted as patterns.
If you want the strings to be interpreted as patterns, you can change index to match in the following little program:
ssh -q host tail -f /some/path |
awk -v STRINGS="$1" -v IGNORECASE=1 \
'BEGIN{split(STRINGS,strings,/,/)}
{for(i in strings)if(!index($0,strings[i]))next}
{print;fflush()}'
Note:
IGNORECASE is only available in GNU awk; in (most) other implementations, it will do nothing. Case-insensitive matching seems to be what you want, given that you used -i in your grep invocation.
fflush() is also an extension, although it works with both gawk and mawk. In POSIX awk, fflush requires an argument; if you were using POSIX awk, you'd be better off printing to stderr.
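A quick way to see the all-patterns filter at work without the ssh part (IGNORECASE and fflush() omitted for brevity):
$ printf '%s\n' 'foo bar baz' 'foo only' 'bar baz foo' | awk -v STRINGS="foo,bar" 'BEGIN{split(STRINGS,strings,/,/)} {for(i in strings)if(!index($0,strings[i]))next} {print}'
foo bar baz
bar baz foo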
You can use extended grep
egrep "$s1|$s2|$s3" fileName
If you don't know how many patterns you need to grep, but you have all of them in an array called s, you can use
egrep $(sed 's/ /|/g' <<< "${s[@]}") fileName
This creates a here-string with all elements of the array; sed replaces bash's field separator (a space) with |, and feeding that to egrep matches every string that is in the array s.
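For example, with three patterns (a quick sketch):
$ s=(foo bar baz)
$ sed 's/ /|/g' <<< "${s[@]}"
foo|bar|baz
$ egrep $(sed 's/ /|/g' <<< "${s[@]}") fileName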
test.sh:
#!/bin/bash -x
a=" $@"
grep ${a// / -e } .bashrc
It works this way:
$ ./test.sh 1 2 3
+ a=' 1 2 3'
+ grep -e 1 -e 2 -e 3 .bashrc
(here is lots of text that fits all the arguments)

Awk replace a column with its hash value

How can I replace a column with its hash value (like MD5) in awk or sed?
The original file is super huge, so I need this to be really efficient.
So, you don't really want to be doing this with awk. Any of the popular high-level scripting languages -- Perl, Python, Ruby, etc. -- would do this in a way that was simpler and more robust. Having said that, something like this will work.
Given input like this:
this is a test
(E.g., a row with four columns), we can replace a given column with its md5 checksum like this:
awk '{
  tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
  tmp | getline cksum
  $2=cksum
  print
}' < sample
This relies on GNU awk (you'll probably have this by default on a Linux system), and it uses openssl to generate the md5 checksum. We first build a shell command line in tmp to pass the selected column to the md5 command. Then we pipe the output into the cksum variable, and replace column 2 with the checksum. Given the sample input above, the output of this awk script would be:
this 7e1b6dbfa824d5d114e96981cededd00 a test
I copy-pasted larsks's response, but I have added the close() line, to avoid the problem indicated in this post: gawk / awk: piping date to getline *sometimes* won't work
awk '{
  tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
  tmp | getline cksum
  close(tmp)
  $2=cksum
  print
}' < sample
This might work using Bash/GNU sed:
<<<"this is a test" sed -r 's/(\S+\s)(\S+)(.*)/echo "\1 $(md5sum <<<"\2") \3"/e;s/ - //'
this 7e1b6dbfa824d5d114e96981cededd00 a test
or a mostly sed solution:
<<<"this is a test" sed -r 'h;s/^\S+\s(\S+).*/md5sum <<<"\1"/e;G;s/^(\S+).*\n(\S+)\s\S+\s(.*)/\2 \1 \3/'
this 7e1b6dbfa824d5d114e96981cededd00 a test
This replaces is from this is a test with its md5sum.
Explanation:
In the first: identify the columns and use back references as parameters in the Bash command, which is substituted and evaluated; then make cosmetic changes to remove the file indicator (here -, i.e. standard input) that md5sum appends.
In the second: similar to the first, but copy the input string into the hold space; then, after evaluating the md5sum command, append the hold space to the pattern space (with G) and use substitution to rearrange the result to suit.
You can also do that with Perl:
echo "aze qsd wxc" | perl -MDigest::MD5 -ne 'print "$1 ".Digest::MD5::md5_hex($2)." $3" if /([^ ]+) ([^ ]+) ([^ ]+)/'
aze 511e33b4b0fe4bf75aa3bbac63311e5a wxc
If you want to obfuscate a large amount of data, this might be faster than the sed and awk solutions, which need to fork an md5sum process for each line.
You might have a better time with read than awk, though I haven't done any benchmarking.
the input (scratch001.txt):
foo|bar|foobar|baz|bang|bazbang
baz|bang|bazbang|foo|bar|foobar
transformed using read:
while IFS="|" read -r one fish twofish red fishy bluefishy; do
  twofish=$(echo -n $twofish | md5sum | tr -d " -")
  echo "$one|$fish|$twofish|$red|$fishy|$bluefishy"
done < scratch001.txt
produces the output:
foo|bar|3858f62230ac3c915f300c664312c63f|baz|bang|bazbang
baz|bang|19e737ea1f14d36fc0a85fbe0c3e76f9|foo|bar|foobar
