How to extract from a file text between tokens using bash scripts

I was reading this question: Extract lines between 2 tokens in a text file using bash
because I have a very similar problem...
I have to extract (and save it to $variable before printing) text in this xml file:
<!--more labels up this line-->
<ExtraDataItem name="GUI/LastVMSelected" value="14cd3204-4774-46b8-be89-cc834efcba89"/>
<!--more labels and text down this line-->
I only need the content of value= (without the quotes and without the 'value=' prefix). I think the script first has to search for "GUI/LastVMSelected" to reach this line, because other lines may contain a similar value field, and the value of that particular label is the one I want.

If they are on the same line (as they seem to be from your example), it's even easier. Just:
sed -ne '/name="GUI\/LastVMSelected"/s/.*value="\([^"]*\)".*/\1/p'
Explanation:
-n: Suppress default print
/name="GUI\/LastVMSelected"/: only lines matching this pattern
s/.*value="\([^"]*\)".*/\1/p
substitute everything, capturing the parenthesized part (the value of value)
and print the result
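Since the question asks for the value to be saved in a variable before printing, here is a minimal sketch of how the sed command above could be wrapped. The function name get_last_vm, the variable vm_uuid and the sample file contents are assumptions made for illustration.

```shell
# Hypothetical helper: print the value attribute of the GUI/LastVMSelected line.
get_last_vm() {
    sed -n '/name="GUI\/LastVMSelected"/s/.*value="\([^"]*\)".*/\1/p' "$1"
}

# Sample input (assumed layout, modelled on the line shown in the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ExtraDataItem name="GUI/Other" value="not-this-one"/>
<ExtraDataItem name="GUI/LastVMSelected" value="14cd3204-4774-46b8-be89-cc834efcba89"/>
EOF

vm_uuid=$(get_last_vm "$sample")   # save to a variable before printing
echo "$vm_uuid"
rm -f "$sample"
```

Command substitution strips the trailing newline, so the variable holds just the value.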

I'm assuming that you're extracting from an XML document. If that is the case, have a look at the XMLStarlet command-line tools for processing XML. There's some documentation for querying XML docs here.

Use this:
for f in $(grep "GUI/LastVMSelected" filename.txt | cut -d " " -f3); do echo "${f:7:36}"; done
grep gets you only the lines you need
cut splits the lines using some separator, and returns the Nth result of the split
-d " " sets the separator to space
-f3 returns the third result (1-based indexing)
${f:7:36} extracts the substring starting at index 7 that is 36 characters long. This gets rid of the leading value=" and trailing slash, etc.
Obviously if the order of the fields changes, this will break, but if you're just after something quick and dirty that works, this should be it.
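If the field order is a concern, a variant that does not count fields or characters is possible with plain parameter expansion. This is only a sketch: the function name value_of_last_vm is made up, and it assumes the matching line really contains a value="..." attribute.

```shell
# Hypothetical helper: print the value attribute of the GUI/LastVMSelected line
# without relying on where the attribute sits within the line.
value_of_last_vm() {
    q='"'
    while IFS= read -r line; do
        case $line in
            *'name="GUI/LastVMSelected"'*)
                rest=${line#*value=$q}         # drop everything through value="
                printf '%s\n' "${rest%%$q*}"   # keep the text up to the next quote
                ;;
        esac
    done < "$1"
}

# Sample input (assumed): the attributes appear in the opposite order here.
sample=$(mktemp)
printf '%s\n' \
    '<ExtraDataItem name="GUI/Other" value="ignored"/>' \
    '<ExtraDataItem value="abc-123" name="GUI/LastVMSelected"/>' > "$sample"
reordered=$(value_of_last_vm "$sample")
echo "$reordered"
rm -f "$sample"
```

This trades the fixed offsets of ${f:7:36} for two pattern removals, so it also works when the value is not exactly 36 characters long.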

Using my answer from the question you linked:
sed -n '/<!--more labels up this line-->/{:a;n;/<!--more labels and text down this line-->/b;\|GUI/LastVMSelected|s/.*value="\([^"]*\)".*/\1/p;ba}' inputfile
Explanation:
-n - don't do an implicit print
/<!--more labels up this line-->/{ - if the starting marker is found, then
:a - label "a"
n - read the next line
/<!--more labels and text down this line-->/b - if it's the ending marker, branch to the end of the script (leave the loop)
\|GUI/LastVMSelected| - if the line matches the string
s/.*value="\([^"]*\)".*/\1/p - replace the line with the text after 'value="' and before the next quote, and print it
ba - branch to label "a"
} end if
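The same "only between the two markers" logic can also be sketched with awk and a flag variable, which some people find easier to read than sed branching. The function name value_between and the variable names are assumptions. The markers are compared as literal strings with index(), so they need no regex escaping.

```shell
# Hypothetical helper: value_between FILE START_MARKER END_MARKER
value_between() {
    awk -v start="$2" -v stop="$3" '
        index($0, stop)  { inside = 0 }      # end marker closes the region
        inside && index($0, "name=\"GUI/LastVMSelected\"") &&
            match($0, /value="[^"]*"/) {
            # skip the 7 characters of value=" and drop the closing quote
            print substr($0, RSTART + 7, RLENGTH - 8)
        }
        index($0, start) { inside = 1 }      # start marker opens the region
    ' "$1"
}

# Sample input (assumed): only the middle occurrence lies between the markers.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ExtraDataItem name="GUI/LastVMSelected" value="outside-before"/>
<!--more labels up this line-->
<ExtraDataItem name="GUI/Other" value="wrong-label"/>
<ExtraDataItem name="GUI/LastVMSelected" value="inside"/>
<!--more labels and text down this line-->
<ExtraDataItem name="GUI/LastVMSelected" value="outside-after"/>
EOF
found=$(value_between "$sample" '<!--more labels up this line-->' '<!--more labels and text down this line-->')
echo "$found"
rm -f "$sample"
```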

Related

How can I replace all occurrences of a value in a text file just in one column using the Sed command in shell script (columns are separated by ;)? [duplicate]

I have a file whose columns are separated by a semicolon (;), and I want to change all occurrences of a word to another word, but only in one particular column. The column number varies and is held in a variable. The word I want to change and the word I want to change it to are each stored in a variable too.
I tried
sed -i "s/\<$word\>/$wordUpdate/g" $anyFile
but it changed every occurrence of the word in the whole file. I only want to change it in one particular column.
The column number is stored in a variable called numColumn,
and the columns are separated by a semicolon (;).

It is much simpler to use awk for column edits, e.g. if your input looks like this:
68;61;83;27;60;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
You can replace a value matching m (60 here) in column c (5 here) with s (XX here) with something like this:
<infile awk '$c ~ m { $c = s } 1' FS=';' OFS=';' c=5 m=60 s=XX
Output:
68;61;83;27;XX;70;84;11;46;62;93;97;40;23;19
33;70;17;49;81;21;68;83;16;6;42;38;68;81;89
73;40;95;64;32;33;77;56;23;11;70;28;33;80;24
8;9;74;6;86;78;87;41;11;79;23;28;71;99;15
29;87;77;9;98;12;7;66;60;85;20;14;55;97;17
39;24;21;58;23;61;39;26;57;70;76;16;70;53;8
37;46;18;64;56;28;86;7;80;71;94;46;19;53;43
71;2;47;62;9;21;68;9;9;80;32;59;73;74;72
20;34;89;58;74;92;86;35;48;81;50;6;63;67;90
78;17;6;63;61;65;75;31;33;82;24;5;90;46;12
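Because $c ~ m is a regular-expression match, m=60 would also match a field such as 160. If the whole field has to be equal to the word, a string comparison can be used instead. The sketch below assumes that requirement. The function name replace_in_column is made up, and its three leading arguments stand in for the question's numColumn, word and wordUpdate variables.

```shell
# Hypothetical helper: replace_in_column COLUMN OLD NEW [FILE...]
# Replaces OLD with NEW only where the whole field in COLUMN equals OLD.
replace_in_column() {
    col=$1 old=$2 new=$3
    shift 3
    # (m "") forces a string comparison, so 60 does not match 160 or 60.0
    awk -F';' -v OFS=';' -v c="$col" -v m="$old" -v s="$new" \
        '$c == (m "") { $c = s } 1' "$@"
}

# Sample input (assumed): the second row has 160 in column 5 and must stay untouched.
result=$(printf '%s\n' '68;61;83;27;60;70' '1;2;3;4;160;6' | replace_in_column 5 60 XX)
echo "$result"
```

Note that awk -v interprets backslash escapes in the assigned values, so words containing backslashes would need extra care.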

This might work for you (GNU sed):
word=foo wordUpdate=bar numColumn=3
sed -i 'y/;/\n/
s#.*#echo "&" | sed "'${numColumn}'s/\<'${word}'\>/'${wordUpdate}'/"#e
y/\n/;/' file
Convert each line into a separate file where the columns are lines.
Substitute the matching line (column number) with the word for the updated word.
Reverse the conversion.
N.B. The solution relies on the GNU-only e evaluation flag. Also, the values of word and wordUpdate may need to be quoted.

This can be done with a little creativity...
Note that I'm using double-quotes to embed the logic. This takes a little extra care to double your \'s on backreferences.
$: word=baz; c=3; new=XX; lead="^([^;]*;){$((c-1))}"; sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file
1;2;3;4;5;6;7;8;9;0;
foo;bar;XX;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
Explained:
lead="^([^;]*;){$((c-1))}"
^ means at the start of a record
(...) is grouping for the following {...}, which specifies repetition
[^;]* means zero or more non-semicolons
$((c-1)) does the math and returns one less than the desired column; if you want to look at column 3, it returns two.
So ^([^;]*;){$((c-1))} means: at the start of the record, one-less-than-column occurrences of non-semicolons followed by a semicolon.
Thus, sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" file means: read file and, on records where $word occurs in the requested column, save everything before it and put that back, but replace $word with $new.
Even if you MUST use sed, I recommend a function.
fix(){
local word="$1" col="$2" new="$3" file="$4"
local lead="^([^;]*;){$((col-1))}"
sed -E "/$lead$word;/{s/($lead)$word/\\1$new/}" "$file"
}
In use -
$: fix bar 2 HI file
1;2;3;4;5;6;7;8;9;0;
foo;HI;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix 1 1 XX file
XX;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
$: fix bar 2 '(^_^)' file
1;2;3;4;5;6;7;8;9;0;
foo;(^_^);baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
No changes if no matches -
$: fix bar 5 HI file
1;2;3;4;5;6;7;8;9;0;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
NOTE -
This logic requires trailing delimiters if you ever want to match the last field -
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;HI;
foo;bar;baz;qux;foo;bar;baz;qux;
a;b;c;d;e;f;g;
With the trailing delimiters removed, nothing matches:
$: fix 0 10 HI file
1;2;3;4;5;6;7;8;9;0
foo;bar;baz;qux;foo;bar;baz;qux
a;b;c;d;e;f;g
Otherwise you have to complicate the logic a bit.
But honestly, for field parsing, you'd be so much better served to use awk, or even perl or python, or for that matter a bash loop, though that's going to be relatively slow.
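For what it's worth, here is one way the "complicate the logic a bit" step might look: accept either a semicolon or the end of the line after the word, and put whichever one matched back. This is a sketch assuming GNU sed. The name fix_any is made up, and, as with fix above, the word and the replacement are pasted into the sed script unescaped, so regex or replacement metacharacters in them (such as / or &) would break it.

```shell
# Hypothetical variant of fix(): also matches the last field when the
# record has no trailing delimiter.
fix_any() {
    word=$1 col=$2 new=$3 file=$4
    lead="^([^;]*;){$((col-1))}"
    # group 1 = leading columns, group 2 = the repeated column inside lead,
    # group 3 = the ";" or end of line that follows the word
    sed -E "s/($lead)$word(;|\$)/\\1$new\\3/" "$file"
}

# Sample input (assumed): records without trailing semicolons.
sample=$(mktemp)
printf '%s\n' '1;2;3;4;5;6;7;8;9;0' 'foo;bar;baz;qux' > "$sample"
last_field=$(fix_any 0 10 HI "$sample")
echo "$last_field"
```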

Add multiple elements to the text file in a specific way using Bash

I have a text file that contains a list of "word sequences", and I need to wrap each word sequence in double quotes and add a trailing comma. I'm thinking of using a bash command.
Here is the data:
NTSS
NGTG
NVSQ
NITL
NFTS
...
I need to put "" around each word sequence and separate them with ","
Here is the expected output:
"NTSS",
"NGTG",
"NVSQ",
"NITL",
...
Any recommendation with BASH to do that?

This can be done in many ways, but sed is perfect for the job.
sed 's/^.*$/"&",/' < file.txt
This replacement simply matches the whole line and replaces it according to what you need.
The one above is a regular expression replacement, which has the structure:
s/<pattern to match>/<replacement>/
^ matches the beginning of the line
.* matches any character any number of times
$ matches the end of the line
In the replacement part, & represents the whole string that matched the pattern (the entire line in this case). GNU sed also accepts \0 as a synonym for &, but & is the portable spelling.
Check out some regular expression tutorial for more.
If you prefer a purely bash alternative, you can use:
while IFS= read -r line; do printf '"%s",\n' "$line"; done < file.txt

Shell scripting cut -d " " -f4 file.txt command

I have a file with words separated by a single space.
I want to read the 4th word from each line of the file using this command:
cut -d " " -f4 file.txt
It works fine, but I don't understand its behavior.
If a line contains 4 or more words then it prints the 4th word.
If a line contains only 1 word then it prints that word.
If a line contains 2 or 3 words then it prints nothing.
I want to know how this works.

From man cut:
-f, --fields=LIST
select only these fields; also print any line that contains no delimiter character, unless the -s option is specified
If a line contains 1 word, then it does not contain the delimiter and therefore cut prints the whole line (which is exactly that one word).
The other cases follow from that: the line contains at least one delimiter, so cut prints the fourth field if there is one and an empty line otherwise.
If you add the -s parameter, it will print the fourth word only if available (and thus ignore lines with one word without delimiter).

By default, cut expects each input line to contain the delimiter (space in the OP example). Lines that do not contain the delimiter are printed as-is.
The default behavior can be changed with -s, which suppresses lines that do not contain the delimiter (the single-word case). Use
cut -s -d " " -f4 file.txt
As to why this is the default behavior, there is no clear answer. Probably it was meant to let some lines be excluded from the filtering. Early Unix systems had a lot of semi-structured files, and this functionality could have been used to process man pages, nroff pages and similar.
From the man page:
-f list
Cut based on a list of fields, assumed to be separated in the file by
a delimiter character (see -d). Each selected field shall be output.
Output fields shall be separated by a single occurrence of the field
delimiter character. Lines with no field delimiters shall be passed
through intact, unless -s is specified. It shall not be an error to
select fields not present in the input line.
-s, --only-delimited do not print lines not containing delimiters
See also: https://unix.stackexchange.com/questions/157677/does-cut-return-any-fields-if-separator-does-not-exist
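The three cases from the question can be reproduced with a small experiment. This is a sketch with made-up sample lines. The awk line at the end is an alternative not mentioned in the answers, shown only for the case where lines with fewer than four words should produce no output at all.

```shell
demo=$(mktemp)
printf '%s\n' 'one' 'one two three' 'one two three four five' > "$demo"

# Default: the delimiter-free line "one" passes through unchanged,
# the 3-word line yields an empty line, the 5-word line yields "four".
without_s=$(cut -d ' ' -f4 "$demo")

# With -s: the delimiter-free line is dropped; the other two behave as before.
with_s=$(cut -s -d ' ' -f4 "$demo")

# awk alternative (an assumption, not from the answers): print the 4th word
# only when the line actually has one.
only_existing=$(awk 'NF >= 4 { print $4 }' "$demo")

printf 'without -s:\n%s\n' "$without_s"
printf 'with -s:\n%s\n' "$with_s"
printf 'awk:\n%s\n' "$only_existing"
rm -f "$demo"
```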

Make changes to a file (sed, awk)

I am trying to clean up the next file:
1. 10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
2. ¼?Amµxðïej£„7‹ìËÏð‡.4 --
3. 10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~£ˆD]†CÀ,£ÑÉ»In&Ry+/jÑ%A¡ã ÷d_#C÷—NÏÕÞ
3. Ü‚úè"åD\’c\ûñ7x°yFÃ¦ï --
Note that the numbers are not an actual part of the file. They are just a reference for the line number. The size of the line depends on the encoded message (that is why the 3 is repeated: it is basically one line). There are thousands of records, but they follow the same pattern. Each record ends with a (--).
Basically, what I am trying to achieve is to get the IP and the ID side by side.
For example:
10.160.120.10 000BBBBBA4FB
My first step would be to delete everything between the first (;) and the fourth (~) since that pattern is the same for each record.
Which leads me to this.
sed 's/;.*~//'
However, this particular command deletes everything up to the last (~), not the fourth.
If it successfully removed everything between the first (;) and the fourth (~), I would get something like this:
0.165.65.113 0008B9A4F3~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
¼?Amµxðïej£„7‹ìËÏð‡.4 --
And then I guess I could delete everything after the first (~) to get the desired output.
Am I following the right procedure? Should I achieve this with sed or awk? Any suggestions are appreciated!

Instead of trying to remove stuff, why don't you just keep the stuff you want?
sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
#                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^
#                    IP Address                     12 Hex digits
Explanation:
\1 \3 means insert everything that matched the first and the third set of parentheses of the search term.
^[^0-9]* matches all non-digits from the beginning of the line
([0-9]{1,3}\.){3}[0-9]{1,3} matches an IP address. The whole term is in parentheses because we want to keep it. The inner (...) could be referenced as \2 in the replacement term, but we don't need that.
[0-9A-F]{12} is simply 12 hexadecimal digits (upper case; use [0-9a-fA-F] if you expect lower case as well)

Assuming your data structure is always the same,
use several field separators at once, with a bracket expression that includes ";" and "~". Be careful not to leave a lone space as the separator, as with the default, because that gives a different field 3 (and 6).
awk -F '[[:blank:]]*[;~][[:blank:]]*' '/--$/ {print $1 " " $7}' YourFile
Assuming only space characters (no tabs) surround the separators and the data lines contain Data:
awk -F ' *[;~] *' '/--$/ {print $1 " " $7}' YourFile
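In the sample from the question a record can wrap, so that the -- ends up on a continuation line rather than on the line that holds the IP. A sketch for that case, assuming the first line of every record can be recognised by the literal text Data--, selects on that text instead. The function name ip_and_id and the ASCII stand-in payload are made up.

```shell
# Hypothetical helper: print the first IP and the 12-digit ID of each record.
ip_and_id() {
    # fields are split on ";" or "~", each with optional surrounding blanks
    awk -F '[[:blank:]]*[;~][[:blank:]]*' '/Data--/ { print $1, $7 }' "$1"
}

# Sample input (assumed): plain ASCII stands in for the encoded payload,
# and each record wraps onto a second line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~payload part one~
payload part two --
10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~payload
continued --
EOF
pairs=$(ip_and_id "$sample")
echo "$pairs"
rm -f "$sample"
```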

Bash: Find and replace all variable characters up to a constant character with a constant string

I've seen many search and replace threads based on the assumption that 1. you either know what string or substring you are explicitly looking for or 2. you know the exact position it is at within the string or 3. both combined.
In my situation I have one csv file containing one column and 1M rows. e.g.
1,google.com
2,yahoo.com
3,twitter.com
4,xyz.com
For every row, I want to replace every character up to and including the comma (the incrementing integer) with the string http://www.
So far I have the following
HTTPSTRING="http://www."
cat X.csv << Will this ensure that the while block is executed on this file?
while IFS=, read line
do {$line/(.*?),/HTTPSTRING} << This is where I am having trouble
done
exit 0
and I would like a text file containing one URL per line, e.g.
http://www.google.com
...
http://www.${999,999_more_urls}
Thank you so much in advance
Lewis

This does a greedy match, which would be problematic if you ever have any commas other than the one that separates the initial integer from the characters you want to retain. But it works on your sample X.csv file, producing a Y.csv file that meets your output specification.
HTTPSTRING="http://www."
while IFS= read -r line
do
echo "${line/*,/$HTTPSTRING}"
done < X.csv > Y.csv
exit 0
For what it's worth, if you put this in a script, you can take the file input/output redirection parts out of the code itself, and instead apply them when calling the script.
If you're not strictly limited to bash itself, you might want to consider using sed. Either of these should do what you want, differing only in whether you prefer to escape the slashes in your string or use a non-standard delimiter:
sed 's/[0-9]*,/http:\/\/www./' X.csv > Y.csv
sed 's~[0-9]*,~http://www.~' X.csv > Y.csv
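If extra commas later in a line are a possibility, the greedy match can be avoided with the ${line#*,} expansion, which removes only the shortest prefix ending in a comma. A minimal sketch, where the function name to_urls and the second sample line are made up for illustration:

```shell
# Hypothetical helper: strip everything through the FIRST comma and
# prepend the URL prefix.
to_urls() {
    prefix="http://www."
    while IFS= read -r line; do
        printf '%s%s\n' "$prefix" "${line#*,}"
    done < "$1"
}

# Sample input (assumed): the second line contains an extra comma.
sample=$(mktemp)
printf '%s\n' '1,google.com' '2,example.com/a,b' > "$sample"
urls=$(to_urls "$sample")
echo "$urls"
rm -f "$sample"
```

With the greedy ${line/*,/...} form, the second line would lose everything up to the last comma instead.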

Your script is close. You can pipe the output of cat directly to the while loop, but it's better to use input redirection ( < X.csv). Using IFS=, before read will split the line into fields separated by a comma, but you are just missing a variable to hold the second field.
HTTPSTRING="http://www."
while IFS=, read -r number domain
do
echo "$HTTPSTRING$domain"
done < X.csv

You could use commands only; there is no need for an explicit Bash loop:
cut -d',' -f2 < X.csv | sed 's_^_http://www._' > Y.txt
Notice that the usual / delimiter after the s in sed is replaced by _ because / appears in the replacement string. ^ matches the start of the line.
