Use awk to extract value from a line - bash

I have these two lines within a file:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
where I'd like to get the following as output using awk or sed:
3
50000
Using this sed command does not work as I had hoped, and I suspect this is due to the presence of the quotes and delimiters in my line entry.
sed -n '/WORD1/,/WORD2/p' /path/to/file
How can I extract the values I want from the file?

awk -F'[<>]' '{print $3}' input.txt
input.txt:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
Output:
3
50000

sed -e 's/[a-zA-Z.<\/>= \-]//g' file

Using sed:
sed -E 's/.*limit"*>([0-9]+)<.*/\1/' file
Explanation:
.* takes care of everything that comes before the string limit
limit"* takes care of both the lines, one with limit" and the other one with just limit
([0-9]+) takes care of matching numbers and only numbers as stated in your requirement.
\1 is actually a shortcut for capturing pattern. When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. For more details, please refer https://www.inkling.com/read/introducing-regular-expressions-michael-fitzgerald-1st/chapter-4/capturing-groups-and

The script solution with parameter expansion:
#!/bin/bash
while read line || test -n "$line" ; do
value="${line%<*}"
printf "%s\n" "${value##*\>}"
done <"$1"
output:
$ ./ltags.sh dat/ltags.txt
3
50000

Looks like XML to me, so assuming it forms part of some valid XML, e.g.
<root>
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
</root>
You can use Perl's XML::Simple and do something like this:
perl -MXML::Simple -E '$xml = XMLin("file"); say $xml->{"first-value"}->{"content"}; say $xml->{"second-value-limit"}'
Output:
3
50000
If the XML structure is more complicated, then you may have to drill down a bit deeper to get to the values you want. If that's the case, you should edit the question to show the bigger picture.

Ashkan's awk solution is straightforward, but let me suggest a sed solution that accepts non-integer numbers:
sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p' input.txt
This extracts the number between the first > character of the line and the following <. In my RE this "number" can be the empty string, if you don't want to accept an empty string please add the -r option to sed and replace \([.[:digit:]]*\) by ([.[:digit:]]+).

Related

Delete everything before a pattern

I'm trying to clean a text file.
I want to delete everything start before the first 12 numbers.
1:0:135103079189:0:0:2:0::135103079189:000011:00
A:908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Output desired:
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Here's my command but seems not working.
sed '/:\([0-9]\{12\}\)/d' t.txt
the d command in sed will delete entire line on matching the given regex, you need to use s command to search and replace only part of line... however, for given problem, sed is not suitable as it doesn't support non-greedy regex
you can use perl instead
$ perl -pe's/^.*?(?=\d{12}:)//' ip.txt
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
.*? match zero or more characters as minimally as possible
(?=\d{12}:) only if it is followed by 12-digits ending with :
use perl -i -pe for in-place editing
some possible corner cases
$ # this is matching part of field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe's/^.*?(?=\d{12}:)//'
135103079189:23:603307102606:1
$ # this is not matching 12-digit field at end of line
$ echo 'foo:123:135103079189' | perl -pe's/^.*?(?=\d{12}:)//'
foo:123:135103079189
$ # so, add start/end of line matching cases and restrict 12-digits to whole field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe 's/^(?:.*?:)?(?=\d{12}(:|$))//'
603307102606:1
$ echo 'foo:123:135103079189' | perl -pe's/^(?:.*?:)?(?=\d{12}(:|$))//'
135103079189
Could you please try following.
awk --re-interval 'match($0,/[0-9]{12}/){print substr($0,RSTART)}' Input_file
Since I have OLD version of awk so I am using --re-interval you could remove it in case you have new version of it.
This might work for you (GNU sed):
sed -n 's/[0-9]\{12\}/\n&/;s/.*\n//p' file
We only want to print specific lines so use the -n option to turn off automatic printing. If a line contains a 12 digit number, insert a newline before it. Remove any characters before and including a newline and print the result.
If you want to print lines that do not contain a 12 digit number as is, use:
sed 's/[0-9]\{12\}/\n&/;s/.*\n//' file
The crux of the problem is to identify the start of a multi-character string, insert a unique marker and delete all characters before and including the unique marker. As sed uses the newline to delimit lines, only the user can introduce newlines into the pattern space and as a result, newlines will always be unique.
Taking the nice answer from #Sundeep, in case you would like to use grep or pcregrep (macOS/BSD) you could give a try to:
$ grep -oP '^(?:.*?:)?(?=\d{12})\K.*' file
or
$ pcregrep -o '^(?:.*?:)?(?=\d{12})\K.*' file
The \K will ignore everything after the pattern
Alternative thoughts - I almost think your data is too dirty for a quick sed fix but if generally it's all similar to your sample set of data then certainly pick one of the answers with sed etc. However if you wanted to be more particular about it you could build up a set of commands to ensure the values. I like doing this for debugging and when speed isn't urgent.
Take this tiny sample of code, you could do this other ways but I'm getting the value for each part of the string and I know the order because it contiguous. You could then set up controls on which parts to keep and such as it builds out say a new string per line. Overwrought for sure, but sometimes that is a better long term approach.
#!/bin/bash
while IFS= read -r line ;do
IFS=':' read -r -a array <<< "$line"
for ((i=0; i<${#array[#]}; i++)) ;do
echo "part : ${array[$i]}"
done
done < "test_data.txt"
You could then build the data back up how you wanted and more easily understand what's happening every step of the way ..
part : 1
part : 0
part : 135103079189
part : 0
part : 0
part : 2
part : 0
part :
part : 135103079189
part : 000011
part : 00
part : A
part : 908529896240
part : 0

Adding a new line to a text file after 5 occurrences of a comma in Bash

I have a text file that is basically one giant excel file on one line in a text file. An example would be like this:
Name,Age,Year,Michael,27,2018,Carl,19,2018
I need to change the third occurance of a comma into a new line so that I get
Name,Age,Year
Michael,27,2018
Carl,19,2018
Please let me know if that is too ambiguous and as always thank you in advance for all the help!
With Gnu sed:
sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
To change the number of fields per line, change {2} to one less than the number of fields. For example, to change every fifth comma (as in the title of your question), you would use:
sed -E 's/(([^,]*,){4}[^,]*),/\1\n/g'
In the regular expression, [^,]*, is "zero or more characters other than , followed by a ,; in other words, it is a single comma-delimited field. This won't work if the fields are quoted strings with internal commas or newlines.
Regardless of what Linux's man sed says, the -E flag is an extension to Posix sed, which causes sed to use extended regular expressions (EREs) rather than basic regular expressions (see man 7 regex). -E also works on BSD sed, used by default on Mac OS X. (Thanks to #EdMorton for the note.)
With GNU awk for multi-char RS:
$ awk -v RS='[,\n]' '{ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
With any awk:
$ awk -v RS=',' '{sub(/\n$/,""); ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
Try this:
$ cat /tmp/22.txt
Name,Age,Year,Michael,27,2018,Carl,19,2018,Nooka,35,1945,Name1,11,19811
$ echo "Name,Age,Year"; grep -o "[a-zA-Z][a-zA-Z0-9]*,[1-9][0-9]*,[1-9][0-9]\{3\}" /tmp/22.txt
Michael,27,2018
Carl,19,2018
Nooka,35,1945
Name1,11,1981
Or, ,[1-9][0-9]\{3\} if you don't want to put [0-9] 3 more times for the YYYY part.
PS: This solution will give you only YYYY for the year (even if the data for YYYY is 19811 (typo mistakes if any), you'll still get 1981
You are looking for 3 fragments, each without a comma and separated by a comma.
The last fields can give problems (not ending with a comma and mayby only two fields.
The next command looks fine.
grep -Eo "([^,]*[,]{0,1}){0,3}" inputfile
This might work for you (GNU sed):
sed 's/,/\n/3;P;D' file
Replace every third , with a newline, print ,delete the first line and repeat.

Unix Shell scripting in AIX(Sed command)

I have a text file which consists of jobname,business name and time in min seperated with '-'(SfdcDataGovSeq-IntegraterJob-43).There are many jobs in this text file. I want to search with the jobname and change the time from 43 to 0 only for that particular row and update the same text file. Kindly advise what needs to be done.
Query that i am using : (cat test.txt | grep "SfdcDataGovSeq" | sed -e 's/43/0/' > test.txt) but the whole file is getting replaced with only one line.
sed -e '/SfdcDataGovSeq/ s/43/0/' test.txt
This will only replace if the search is positive.
Agreed with Ed, Here is a workaround to put word boundaries Although Equality with awk is robust.
sed -e '/SfdcDataGovSeq/ s/\<43\>/0/g' test.txt
You should be using awk instead of sed:
awk 'BEGIN{FS=OFS="-"} $1=="SfdcDataGovSeq" && $3==43{$3=0} 1' file
Since it does full string or numeric (not regexp) matches on specific fields, the above is far more robust than the currently accepted sed answer which would wreak havoc on your input file given various possible input values.

Delete unknown amount of regexps using sed

I'm trying to get a bunch of regular expressions for a file (one per line) and then fit those regexps into something like this /$regexp/d . I'm trying it this way:
while read line;do sed "/$line/d" to_delete.file >> output;done < to_delete.txt
But it says me 'unknown command', even if I change the delimiter.
--- EDIT
The to_delete.txt file has slashes but i'm already scraping them and that's where i find the error.
To avoid problem with / in regex sed is allow to use another separator, so you can use e.g. sed "\|$line|d".
Secondary if you put script into double-quotes you shoud add space between address range and action e.g. "\|$line| d"
But I see a general mistake in the script. The loop will print into output all to_delete.file (exept 1 line with regexp) by each loop. I suppose it is not the thing what OP wants.
If you'd like to exclude content of to_delete.txt from to_delete.file it can be easy done by grep
grep -vFf "to_delete.txt" "to_delete.file" > output

How to parse a config file using sed

I've never used sed apart from the few hours trying to solve this. I have a config file with parameters like:
test.us.param=value
test.eu.param=value
prod.us.param=value
prod.eu.param=value
I need to parse these and output this if REGIONID is US:
test.param=value
prod.param=value
Any help on how to do this (with sed or otherwise) would be great.
This works for me:
sed -n 's/\.us\././p'
i.e. if the ".us." can be replaced by a dot, print the result.
If there are hundreds and hundreds of lines it might be more efficient to first search for lines containing .us. and then do the string replacement... AWK is another good choice or pipe grep into sed
cat INPUT_FILE | grep "\.us\." | sed 's/\.us\./\./g'
Of course if '.us.' can be in the value this isn't sufficient.
You could also do with with the address syntax (technically you can embed the second sed into the first statement as well just can't remember syntax)
sed -n '/\(prod\|test\).us.[^=]*=/p' FILE | sed 's/\.us\./\./g'
We should probably do something cleaner. If the format is always environment.region.param we could look at forcing this only to occur on the text PRIOR to the equal sign.
sed -n 's/^\([^,]*\)\.us\.\([^=]\)=/\1.\2=/g'
This will only work on lines starting with any number of chars followed by '.' then 'us', then '.' and then anynumber prior to '=' sign. This way we won't potentially modify '.us.' if found within a "value"

Resources