Awk/Sed - how to print selection between two patterns? - bash

Based on a reference at catonmat.net, I thought I could extract the text between two patterns using the following:
Source Text (one line): 6 June 2013 08.32.435 UTF+8 Report /content/folder[#name='....' Failure ....
Here the important part is the path to the report, so I am using:
awk '/content\/folder\[#name=/,/Failure/' source.csv
I got the entire matched line instead of only the content path between the two matches.
I also tried:
sed -n '/content\/folder\[#name/,/Failure/ {/content\/folder\[#name\|Failure/!p}' source.csv
Still returning the entire line...
What was wrong?

Try this:
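sed -n '\|content/folder\[#name.*Failure|s|.*content/folder\[#name\(.*\)Failure.*|\1|p' source.csv
/re1/,/re2/ selects a range of lines, not a range of text within a line. Since content/folder and Failure are on the same line, you don't need a range, just a regex that matches a line containing both. Then use s/// to extract the part between them. Two details matter here: an address that uses a delimiter other than / must start with a backslash (\|...|), and with -n the substitution needs the p flag, otherwise nothing is printed.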
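To make the difference concrete, here is a minimal sketch. The sample line is invented for illustration (the question elides the real one), and the helper name between_markers is hypothetical.

```shell
# Hypothetical sample line; the real log line in the question is elided.
line="Report /content/folder[#name='demo'] Failure: timeout"

# A range address selects whole lines, so the entire line comes back.
whole=$(printf '%s\n' "$line" | sed -n '/content\/folder\[#name/,/Failure/p')

# A substitution with a capture group keeps only the text between the markers.
between_markers() {
  sed -n 's|.*content/folder\[#name\(.*\)Failure.*|\1|p'
}
part=$(printf '%s\n' "$line" | between_markers)

printf 'range: %s\n' "$whole"
printf 'group: %s\n' "$part"
```

The first command returns the line unchanged; the second returns only what sits between #name and Failure, including the surrounding punctuation and spaces.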

sed 's,.*/content/folder\[#name=\(.*\)Failure.*,\1,' source.csv

grep -Po '(?<=#name=).*(?=Failure)' source.csv

Related

Adding a new line to a text file after 5 occurrences of a comma in Bash

I have a text file that is basically one giant excel file on one line in a text file. An example would be like this:
Name,Age,Year,Michael,27,2018,Carl,19,2018
I need to change the third occurance of a comma into a new line so that I get
Name,Age,Year
Michael,27,2018
Carl,19,2018
Please let me know if that is too ambiguous and as always thank you in advance for all the help!
With GNU sed:
sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
To change the number of fields per line, change {2} to one less than the number of fields. For example, to change every fifth comma (as in the title of your question), you would use:
sed -E 's/(([^,]*,){4}[^,]*),/\1\n/g'
In the regular expression, [^,]*, is "zero or more characters other than , followed by a ,"; in other words, it is a single comma-delimited field. This won't work if the fields are quoted strings with internal commas or newlines.
Regardless of what Linux's man sed says, the -E flag is an extension to POSIX sed, which causes sed to use extended regular expressions (EREs) rather than basic regular expressions (see man 7 regex). -E also works on BSD sed, used by default on Mac OS X. (Thanks to @EdMorton for the note.)
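Since the repeat count is always one less than the number of fields, it can be computed instead of typed. A minimal sketch, assuming GNU sed (for \n in the replacement); the function name split_every is made up for this example:

```shell
# split_every N: break a comma-separated stream into rows of N fields.
# Assumes GNU sed (\n in the replacement) and no quoted fields.
split_every() {
  n=$1
  # {N-1} complete "field," groups plus one more field, then the comma to replace.
  sed -E "s/(([^,]*,){$((n - 1))}[^,]*),/\1\n/g"
}

printf '%s\n' 'Name,Age,Year,Michael,27,2018,Carl,19,2018' | split_every 3
```

Calling split_every 5 would replace every fifth comma instead.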
With GNU awk for multi-char RS:
$ awk -v RS='[,\n]' '{ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
With any awk:
$ awk -v RS=',' '{sub(/\n$/,""); ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
Try this:
$ cat /tmp/22.txt
Name,Age,Year,Michael,27,2018,Carl,19,2018,Nooka,35,1945,Name1,11,19811
$ echo "Name,Age,Year"; grep -o "[a-zA-Z][a-zA-Z0-9]*,[1-9][0-9]*,[1-9][0-9]\{3\}" /tmp/22.txt
Michael,27,2018
Carl,19,2018
Nooka,35,1945
Name1,11,1981
The \{3\} interval saves writing [0-9] three more times for the YYYY part.
PS: This solution keeps only four digits for the year: even if the data contains a typo such as 19811, you will still get 1981.
You are looking for three fragments, each without a comma, separated by commas.
The last fields can cause problems (no trailing comma, and maybe only two fields).
The following command looks fine:
grep -Eo "([^,]*[,]{0,1}){0,3}" inputfile
This might work for you (GNU sed):
sed 's/,/\n/3;P;D' file
Replace the third , with a newline, print the first line, delete it, and repeat on what remains.
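The P;D pair is what turns a single substitution into a loop. A small sketch of the cycle, assuming GNU sed; the wrapper name every_third_comma is only for illustration:

```shell
# GNU sed assumed (\n in the replacement).
# s/,/\n/3  turns the 3rd comma in the pattern space into a newline
# P         prints the pattern space up to the first newline
# D         deletes up to the first newline and restarts the cycle
#           on what is left, without reading a new input line
every_third_comma() {
  sed 's/,/\n/3;P;D'
}

printf '%s\n' 'Name,Age,Year,Michael,27,2018,Carl,19,2018' | every_third_comma
```

On the last pass there is no third comma left, so P prints the remainder and D empties the pattern space, which ends the loop.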

How to use sed to delete last several character of a pattern

I've gone through all of the threads but still cannot find the answer.
For example.
I have a timestamp of format: yyyy-mm-dd hh:mm:ss.xxx
where xxx indicates the milliseconds.
I want to get rid of the xxx part. Note that the timestamp is not at a fixed position, so it cannot be treated as being at the start or end of a line. (A Unix command or a bash script is fine.)
The method I can think of is sed, but all I can do is match the pattern; I don't know how to process the matched text, because the pattern seems only to select lines rather than the matched text itself. So, in general, the question is: how do I use sed to delete the last several characters of a matched pattern?
Thanks for reading.
Note that xxx can be 0-999, so it can have 1, 2, or 3 digits. Sample input:
asfd,asasfsf,afas,2017-10-20 13:22:22.0,333,222,0.002
nyh,nyhny,nhy,2 23 4 23 32:23:14.czxv,2017-10-20 13:22:22.234,12.0,234.22
nyh,nyhny,nhy,2017-10-20 13:22:22.234,12.0
wn,rrwn,daff,2017-10-20 13:22:32.543,12,32
What I expect is:
asfd,asasfsf,afas,2017-10-20 13:22:22,333,222,0.002
nyh,nyhny,nhy,2 23 4 23 32:23:14.czxv,2017-10-20 13:22:22,12.0,234.22
nyh,nyhny,nhy,2017-10-20 13:22:22,12.0
wn,rrwn,daff,2017-10-20 13:22:32,12,32
Based on the input shown in the question, here is a solution:
awk '{sub(/\.[^,]*/,"",$2)} 1' Input_file
Explanation of the awk code:
awk '{
sub(/\.[^,]*/,"",$2) ## sub(regex, replacement, target) replaces the first match of regex in target. The regex matches from a dot up to (not including) the next comma, and the match is replaced with an empty string in the 2nd field. With the default whitespace delimiter, the 2nd field holds the time only when nothing earlier on the line contains a space, so the second sample line (whose fourth comma-separated field contains spaces) is left unchanged.
}
1 ## awk works as condition then action. The condition 1 is always true and no action is given, so the default action (print the line) runs.
' Input_file
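Because the sample data is comma-separated, a variant that splits on commas avoids depending on where spaces happen to fall. This is a sketch rather than the answer's method; the function name strip_ms_awk is hypothetical, and the loose timestamp test (digits-digits-digits, a space, digits:digits:digits.digits) is an assumption about what counts as a timestamp.

```shell
# Strip fractional seconds from any comma-separated field that looks
# like "date time.fraction". Works with any POSIX awk.
strip_ms_awk() {
  awk -F, -v OFS=, '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+:[0-9]+\.[0-9]+$/)
        sub(/\.[0-9]+$/, "", $i)   # drop the dot and the trailing digits
  } 1'
}

printf '%s\n' \
  'asfd,asasfsf,afas,2017-10-20 13:22:22.0,333,222,0.002' \
  'nyh,nyhny,nhy,2 23 4 23 32:23:14.czxv,2017-10-20 13:22:22.234,12.0,234.22' |
  strip_ms_awk
```

Fields such as 12.0 or 32:23:14.czxv do not match the timestamp test, so they pass through untouched.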
This might work for you (GNU sed):
sed 's/\(....-..-.. ..:..:..\)\.[0-9]*/\1/g' file
This is very lazy but will most likely work 99% of the time. It matches on the timestamp separators and then removes the dot and the digits that follow it. If you want, you can be more specific and allow exactly one to three millisecond digits:
sed 's/\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\} [0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\)\.[0-9]\{1,3\}/\1/g' file
Using the -r option (-E also works) removes the toothpick mess:
sed -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.[0-9]{1,3}/\1/g' file
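As a quick check against the question's sample lines, here is a sketch of the strict pattern with the millisecond part allowed to be one to three digits, as the question's note requires. The wrapper name strip_ms is made up; -E is used because it is accepted by both GNU and BSD sed.

```shell
# Remove .x, .xx or .xxx after a yyyy-mm-dd hh:mm:ss timestamp,
# wherever the timestamp appears on the line.
strip_ms() {
  sed -E 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.[0-9]{1,3}/\1/g'
}

printf '%s\n' \
  'asfd,asasfsf,afas,2017-10-20 13:22:22.0,333,222,0.002' \
  'wn,rrwn,daff,2017-10-20 13:22:32.543,12,32' |
  strip_ms
```

A fixed three-digit pattern would leave the first line's single-digit .0 in place, which is why the interval is written as {1,3} here.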

How to detect some pattern with grep -f on a file in terminal, and extract those lines without the pattern

I'm on mac terminal.
I have a txt file with one column with 9 IDs, allofthem.txt, where every ID starts with "rs":
rs382216
rs11168036
rs9296559
rs9349407
rs10948363
rs9271192
rs11771145
rs11767557
rs11
Also, I have another txt file, useful.txt, with the IDs that were useful in an analysis I did. It looks the same, one column with several rows of IDs, but with fewer IDs, only 5.
rs9349407
rs10948363
rs9271192
rs11
Problem: I want to generate a new txt file with the non-useful ones (the ones that appear in allofthem.txt but not in useful.txt).
I want to do the inverse of:
grep -f useful.txt allofthem.txt
I want a systematic way of deleting all the IDs in useful.txt to obtain a file with the remaining ones. Maybe with awk or sed, but I can't see how. Can you help me, please? Thanks in advance!
Desired output:
rs382216
rs11168036
rs9296559
rs11771145
rs11767557
-v option does the inverse for you:
grep -vxf useful.txt allofthem.txt > remaining.txt
-x option matches the whole line in allofthem.txt, not parts.
As @hek2mgl rightly pointed out, you need -F if you want to treat the content of useful.txt as strings and not patterns:
grep -vxFf useful.txt allofthem.txt > remaining.txt
Make sure your files have no leading or trailing white spaces - they could affect the results.
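The short ID rs11 in the sample data shows why -x matters. A small sketch that rebuilds the two files from the question in a temporary directory (the variable names are only for illustration):

```shell
tmp=$(mktemp -d)
printf '%s\n' rs382216 rs11168036 rs9296559 rs9349407 rs10948363 \
  rs9271192 rs11771145 rs11767557 rs11 > "$tmp/allofthem.txt"
printf '%s\n' rs9349407 rs10948363 rs9271192 rs11 > "$tmp/useful.txt"

# Without -x, rs11 also matches as a substring of rs11168036,
# rs11771145 and rs11767557, so those lines are wrongly dropped.
without_x=$(grep -vFf "$tmp/useful.txt" "$tmp/allofthem.txt")

# With -x, only lines identical to an entry in useful.txt are dropped.
with_x=$(grep -vxFf "$tmp/useful.txt" "$tmp/allofthem.txt")

printf 'without -x:\n%s\n' "$without_x"
printf 'with -x:\n%s\n' "$with_x"
rm -rf "$tmp"
```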
I recommend using awk:
awk 'FNR==NR{patterns[$0];next} !($0 in patterns)' useful.txt allofthem.txt
Explanation:
FNR==NR is true as long as we are reading useful.txt. We create an index in patterns for every line of useful.txt. next stops further processing.
!($0 in patterns) runs, because of the previous next statement, only on the lines of allofthem.txt. It checks for every line of that file whether it is a key in patterns. If the line is not a key, the condition evaluates to true and awk prints that line. (Without the negation, the command would print the useful IDs instead of the remaining ones.)
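A sketch of the same two-file idiom with the membership test negated, which yields the IDs that are absent from useful.txt. The function name not_in and the temporary files are assumptions made for the example.

```shell
# not_in EXCLUDE_FILE ALL_FILE: print lines of ALL_FILE that are not
# whole-line entries of EXCLUDE_FILE. EXCLUDE_FILE must be non-empty,
# otherwise FNR==NR stays true while reading ALL_FILE.
not_in() {
  awk 'FNR==NR { seen[$0]; next } !($0 in seen)' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' rs382216 rs11168036 rs9296559 rs9349407 rs10948363 \
  rs9271192 rs11771145 rs11767557 rs11 > "$tmp/allofthem.txt"
printf '%s\n' rs9349407 rs10948363 rs9271192 rs11 > "$tmp/useful.txt"

remaining=$(not_in "$tmp/useful.txt" "$tmp/allofthem.txt")
printf '%s\n' "$remaining"
rm -rf "$tmp"
```

Because array keys are compared as whole strings, rs11 does not knock out rs11168036 here, which mirrors what grep -x does.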

How to pull a value from between 2 strings which occur several times in a file

I am trying to pull the value from between two strings and put each result on its own line. I am then hoping to combine this with another value from the same document, pulled the same way. The problem is that there are NO line breaks in this file and it is quite large. Here is an example of the file.
<ID>47</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>myhost.domain.local</DNS_NAME> <IP_ADDRESS>10.0.0.1</IP_ADDRESS><ID>60</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>yourhost.domain.local</DNS_NAME><IP_ADDRESS>10.0.0.2</IP_ADDRESS>
My end result would ideally look something like this.
ID-----DNS_NAME
47-----myhost.domain.local
60-----yourhost.domain.local
My closest attempts so far have been creating variables with grep, but I can't seem to format them into a table. I'm also very new to scripting, so forgive my ignorance.
If your grep supports -P (--perl-regexp), then you're free to use the regex below.
$ grep -oP '<ID>\K[^<>]*(?=</ID>)|<DNS_NAME>\K[^<>]*(?=</DNS_NAME>)' file | sed 'N;s/\n/-----/g'
47-----myhost.domain.local
60-----yourhost.domain.local
\K discards the previously matched characters from the printed match.
(?=...) is a positive lookahead assertion: it requires the given text to follow the match, but does not consume any characters.
Here is a GNU awk solution (GNU awk is needed because RS has multiple characters):
awk -v RS="<ID>" -F"<|>" 'NR>1 {print $1"-----"$9}' file
47-----myhost.domain.local
60-----yourhost.domain.local
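If neither grep -P nor GNU awk is available, the same table can be built with plain match() and substr(). This is a sketch under the assumption that every <ID> element is followed by its <DNS_NAME> element, as in the sample; the function name id_dns_table is hypothetical.

```shell
# Walk along the single long line, pulling out each <ID> value and the
# <DNS_NAME> value that follows it. Uses only POSIX awk features.
id_dns_table() {
  awk 'BEGIN { print "ID-----DNS_NAME" }
  {
    while (match($0, /<ID>[^<]*<\/ID>/)) {
      id = substr($0, RSTART + 4, RLENGTH - 9)       # strip <ID> and </ID>
      $0 = substr($0, RSTART + RLENGTH)              # continue after this element
      if (match($0, /<DNS_NAME>[^<]*<\/DNS_NAME>/)) {
        name = substr($0, RSTART + 10, RLENGTH - 21) # strip <DNS_NAME> and </DNS_NAME>
        print id "-----" name
        $0 = substr($0, RSTART + RLENGTH)
      }
    }
  }'
}

printf '%s\n' '<ID>47</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>myhost.domain.local</DNS_NAME> <IP_ADDRESS>10.0.0.1</IP_ADDRESS><ID>60</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>yourhost.domain.local</DNS_NAME><IP_ADDRESS>10.0.0.2</IP_ADDRESS>' | id_dns_table
```

The pattern <ID> does not match inside <DATACENTER_ID>, because it requires the < to sit directly before ID.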

Use awk to extract value from a line

I have these two lines within a file:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
where I'd like to get the following as output using awk or sed:
3
50000
Using this sed command does not work as I had hoped, and I suspect this is due to the presence of the quotes and delimiters in my line entry.
sed -n '/WORD1/,/WORD2/p' /path/to/file
How can I extract the values I want from the file?
awk -F'[<>]' '{print $3}' input.txt
input.txt:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
Output:
3
50000
sed -e 's/[a-zA-Z.<\/>= "\-]//g' file
Using sed:
sed -E 's/.*limit"*>([0-9]+)<.*/\1/' file
Explanation:
.* takes care of everything that comes before the string limit
limit"* takes care of both the lines, one with limit" and the other one with just limit
([0-9]+) takes care of matching numbers and only numbers as stated in your requirement.
\1 is a back-reference to the first capture group. When a pattern encloses all or part of its content in a pair of parentheses, it captures that content and stores it temporarily in memory. For more details, please refer to https://www.inkling.com/read/introducing-regular-expressions-michael-fitzgerald-1st/chapter-4/capturing-groups-and
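The pieces described above can be tried directly on the two lines from the question. A minimal sketch; the wrapper name limit_value is only for illustration:

```shell
# Keep only the digits that follow limit> or limit"> on each line.
limit_value() {
  sed -E 's/.*limit"*>([0-9]+)<.*/\1/'
}

printf '%s\n' \
  '<first-value system-property="unique.setting.limit">3</first-value>' \
  '<second-value-limit>50000</second-value-limit>' |
  limit_value
```

On the second line the greedy .* first tries the limit in the closing tag, finds no digits after it, and backtracks to the one in the opening tag.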
The script solution with parameter expansion:
#!/bin/bash
while IFS= read -r line || test -n "$line" ; do
value="${line%<*}"
printf "%s\n" "${value##*\>}"
done <"$1"
output:
$ ./ltags.sh dat/ltags.txt
3
50000
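The two expansions in that script each trim one side of the value. A sketch of the same idea as a function, with the steps spelled out; the name tag_value is made up for the example:

```shell
# tag_value LINE: print the text between the first tag's > and the last <.
tag_value() {
  line=$1
  # ${line%<*}  removes the shortest suffix starting with <,
  #             i.e. the closing tag.
  value=${line%<*}
  # ${value##*>} removes the longest prefix ending with >,
  #             i.e. the opening tag and its attributes.
  printf '%s\n' "${value##*>}"
}

tag_value '<first-value system-property="unique.setting.limit">3</first-value>'
tag_value '<second-value-limit>50000</second-value-limit>'
```

No external process is started, which can matter when the loop runs over many lines.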
Looks like XML to me, so assuming it forms part of some valid XML, e.g.
<root>
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
</root>
You can use Perl's XML::Simple and do something like this:
perl -MXML::Simple -E '$xml = XMLin("file"); say $xml->{"first-value"}->{"content"}; say $xml->{"second-value-limit"}'
Output:
3
50000
If the XML structure is more complicated, then you may have to drill down a bit deeper to get to the values you want. If that's the case, you should edit the question to show the bigger picture.
Ashkan's awk solution is straightforward, but let me suggest a sed solution that accepts non-integer numbers:
sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p' input.txt
This extracts the number between the first > character of the line and the following <. In my RE this "number" can be the empty string; if you don't want to accept an empty string, add the -r option to sed and replace \([.[:digit:]]*\) with ([.[:digit:]]+).
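A sketch of the stricter variant, which requires at least one digit or dot between the tags. The tag names in the test lines other than second-value-limit are invented for illustration, and -E is used in place of -r since both GNU and BSD sed accept it.

```shell
# Print the number between the first > and the next <, skipping lines
# where nothing numeric sits between them.
num_between_tags() {
  sed -n -E 's/[^>]*>([.[:digit:]]+)<.*/\1/p'
}

printf '%s\n' \
  '<ratio>0.75</ratio>' \
  '<empty></empty>' \
  '<second-value-limit>50000</second-value-limit>' |
  num_between_tags
```

The line with the empty element produces no output, whereas the * version would print an empty line for it.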
