Shell script - remove all before and after - shell

Find the next link if the Link header contains rel=next.
Fetching the Link header can return differently formatted strings, and I need to extract the next link from each of them.
e.g.
Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;
would be http://mygithub.com/api/v3/organizations/20/repos?page=3
Link: <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="next", <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="last"
would be http://mygithub.com/api/v3/organizations/4/repos?page=2
Played with sed and parameter expansion - not that experienced so got stuck :)

Please be aware that parsing HTML with non-html tools is fraught with peril; you will see that this works, and assume you can get away with it always. You'll spend hours trying to get the next level of complexity to work, when you should be studying how to use html-aware tools. Don't say we didn't warn you (-;, but
printf "<http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;\n" \
| awk -F" " '{
  for (i = 1; i <= NF; i++) {
    if ($i == "rel=next,") {
      # strip the angle brackets and the trailing semicolon from the preceding field
      gsub(/[<>]/, "", $(i-1)); sub(/;$/, "", $(i-1))
      print $(i-1)
    }
  }
}'
produces required output:
http://mygithub.com/api/v3/organizations/20/repos?page=3
To save the output of a script section into a variable, you wrap the code for command-substitution, in this case
nextReposLink=$( printf .... | awk '....' )
#-------------^^--------------------------^
The ^ pointed items are modern syntax for command-substitution. The code inside of $( ... ) is executed and its standard output is passed as an argument to the invoking command line. (The original syntax for command substitution is/was `cmds` and works the same in the simple case var=`cmds`.) You can nest modern cmd-substitution easily, whereas the old version requires a lot of escape character fiddling. Avoid it if you can.
Note that about any s/str/rep/ that sed can do, awk can do the same, but requires the use of the sub(/regx/, "repl", target) or gsub(sameArgs) functions, where target is a variable or field such as $0 (not a string literal). In this particular case the < and > sit inside a bracket expression, [<>], so they need no escaping; in gawk, \< and \> outside brackets are word-boundary operators rather than literal characters.
Be sure to always dbl-quote the use of variables, i.e. echo "$nextReposLink".
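The awk version above only matches the unquoted form rel=next, while the second sample header in the question uses rel="next". Here is a minimal sketch that accepts both forms. next_link is a hypothetical helper name, and the approach assumes the URLs themselves contain no commas:

```shell
# Hypothetical helper: print the URL whose rel is next from a Link header value.
# Assumption: no URL in the header contains a comma.
next_link() {
    # Put each "<url>; rel=..." pair on its own line, then print only the
    # URL of the pair whose rel is next (with or without double quotes).
    printf '%s\n' "$1" |
        tr ',' '\n' |
        sed -n 's/^[^<]*<\([^>]*\)>;[[:space:]]*rel="\{0,1\}next"\{0,1\}[[:space:]]*$/\1/p'
}

header='Link: <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="next", <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="last"'
nextReposLink=$(next_link "$header")
echo "$nextReposLink"
```

If the header has no next entry, the function prints nothing, so an empty "$nextReposLink" can serve as the signal that the last page has been reached.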
IHTH

Well - I put one of your URL strings in a text file and was able to pull out the first URL with two cuts.
[root@oelinux2 ~]# cat test
Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;
Then, using cut:
cat test | cut -d "<" -f2 | cut -d ">" -f1
[root@oelinux2 ~]# cat test | cut -d "<" -f2 | cut -d ">" -f1
http://mygithub.com/api/v3/organizations/20/repos?page=1
That's one option - if you are just looking to get the first URL in the string. Basically - that's just grabbing what's between the two delimiters "<" and ">"
With Cut:
-d is the 'delimiter'
-f is the field you want to get.
If you wanted to get a later URL in that string, you could change the fields (-f #) and see what you get :)
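To make the "change the fields" hint concrete, here is a small sketch using the header from the question, held in a variable instead of a file (the variable names are only for illustration). Field 1 is whatever precedes the first "<", so the Nth URL begins field N+1:

```shell
header='Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last'

# -f2 after splitting on "<" is the first URL (plus trailing text); -f3 is the second.
first_url=$(printf '%s\n' "$header" | cut -d '<' -f2 | cut -d '>' -f1)
second_url=$(printf '%s\n' "$header" | cut -d '<' -f3 | cut -d '>' -f1)

printf '%s\n' "$first_url" "$second_url"
```

Keep in mind that this picks a URL by position only. It does not look at the rel value, so it returns the next link only when that link happens to sit in the position you asked for.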

Related

sed/Awk/cut... How to decide which to use to parse Docker output?

My output:
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
jenkins/jenkins lts 806f56c84444 8 days ago 703MB
mongo latest 0da05d84b1fe 2 weeks ago 394MB
I would like to just cut the image ID alone from the output.
I tried using cut:
docker images | cut -d " " -f1
REPOSITORY
jenkins/jenkins
The -f1 just gives me the repository names, if I use -f3 it tends to be empty. Since the delimiter is not a single space I don't see how to get the desired output.
Can we cut based on field names?
I read the documentation and did not see anything relevant. I also saw that there is a way to achieve this using sed/AWK, which I'm still figuring out.
In the meantime, is there an easier way to achieve this using the cut command?
I'm new to Unix/Linux, how can I determine which of Sed/AWK/Cut to prefer?
Your input seems to have a fixed width of 20 chars for each field, so you can make use of gawk's FIELDWIDTHS feature.
$ awk -v FIELDWIDTHS="20 20 20 20 20" '{ print $3 }' file
IMAGE ID
806f56c84444
0da05d84b1fe
$
$ awk -v FIELDWIDTHS="20 20 20 20 20" '{ printf "%20s%20s\n", $1, $3 }' file
REPOSITORY IMAGE ID
jenkins/jenkins 806f56c84444
mongo 0da05d84b1fe
From man gawk:
If the FIELDWIDTHS variable is set to a space-separated list of numbers, each field is expected to have fixed width, and gawk splits up the record using the specified widths. Each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. The value of FS is ignored. Assigning a new value to FS or FPAT overrides the use of FIELDWIDTHS.
You have to "squeeze" the space padding in the default output to single space.
1  2 == 1-space-space-2 == Field 1 before 1st space, Field 2 between 1st and 2nd space, Field 3 after 2nd space.
cut -d' ' -f1 ==> '1'
cut -d' ' -f2 ==> '' empty field between 1st and 2nd delimiter
cut -d' ' -f3 ==> '2'
So, in your case use sed to replace each run of consecutive spaces with a single space:
docker images | sed 's/  */ /g' | cut -d " " -f1,3
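The empty-field effect is easy to reproduce without Docker. The sketch below builds one padded row with printf (20 characters per column, an assumption that mirrors the sample output) and compares cut before and after squeezing the spaces:

```shell
# One fixed-width row; printf makes the padding explicit.
row=$(printf '%-20s%-20s%-20s' mongo latest 0da05d84b1fe)

# Without squeezing, field 2 is the empty string between the 1st and 2nd space.
raw_f2=$(printf '%s\n' "$row" | cut -d ' ' -f2)

# After squeezing runs of spaces, field 3 is the image ID.
squeezed_f3=$(printf '%s\n' "$row" | tr -s ' ' | cut -d ' ' -f3)

printf 'raw field 2: [%s]\nsqueezed field 3: [%s]\n' "$raw_f2" "$squeezed_f3"
```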
If the output is fixed columns widths, then you can use this variant of cut:
docker images | cut -c1-20,41-60
This keeps characters 1 to 20 (the repository name) and 41 to 60, where we find the Image ID.
If ever the output uses TAB for padding, you should use expand -t n to make the output consistently space padded then apply the appropriate cut -cx,y, e.g. (numbers may need adjusting):
docker images | expand -t 4 | cut -c1-20,41-60
Try this:
docker images | tr -s ' ' | cut -f3 -d' '
The command tr -s ' ' converts multiple spaces into a single one, and then with cut you can grab your field. This works fine as long as the values in your fields don't contain spaces.
With Procedural Text Edit it's :
forEach line {
if (contains ci "REPOSITORY") { remove }
keepRange word 2 1
}
removeEmptyLines // <- optional
In the general case, avoid parsing output meant for human consumption. Many modern utilities offer an option to produce output in some standard format like JSON or XML, or even CSV (though that is less strictly specified, and exists in multiple "dialects").
docker in particular has a generalized --format option which allows you to specify your own output format:
docker images --format "{{.ID}}"
If you cannot avoid writing your own parser (are you really sure!? Look again!), cut is suitable for output with a specific single-character delimiter, or otherwise fairly regular output. For everything else, I would go with Awk. Out of the box, it parses columns from sequences of whitespace, so it does precisely what you specifically ask for:
docker images | awk 'NR>1 { print $3 }'
(NR>1 skips the first line, which contains the column headers.)
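If you want to try this without a Docker daemon at hand, the following sketch feeds awk the sample listing from the question. sample_images is a made-up stand-in for docker images, not a real command:

```shell
# Hypothetical stand-in for `docker images`, reproducing the sample output
# with 20-character columns.
sample_images() {
    printf '%-20s%-20s%-20s%-20s%s\n' REPOSITORY TAG 'IMAGE ID' CREATED SIZE
    printf '%-20s%-20s%-20s%-20s%s\n' jenkins/jenkins lts 806f56c84444 '8 days ago' 703MB
    printf '%-20s%-20s%-20s%-20s%s\n' mongo latest 0da05d84b1fe '2 weeks ago' 394MB
}

# awk splits on runs of whitespace by default, so the ID is always field 3,
# even though CREATED contains spaces further to the right.
ids=$(sample_images | awk 'NR>1 { print $3 }')
printf '%s\n' "$ids"
```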
In the case of fixed-width columns, it allows you to pull out a string by index:
docker images | awk 'NR>1 { print substr($0, 41, 12) }'
... though you could do that with cut, too:
docker images | cut -c41-52
... but notice that Docker might adjust column widths depending on your screen size!
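One way to cope with changing column widths is to read the start position from the header line instead of hard-coding 41. This is only a sketch: sample_images is a hypothetical stand-in for docker images that takes the column width as an argument, and it assumes the ID column is left-aligned under its header and contains no spaces:

```shell
# Hypothetical stand-in for `docker images`; $1 is the column width.
sample_images() {
    fmt="%-${1}s%-${1}s%-${1}s%-${1}s%s\n"
    printf "$fmt" REPOSITORY TAG 'IMAGE ID' CREATED SIZE
    printf "$fmt" jenkins/jenkins lts 806f56c84444 '8 days ago' 703MB
    printf "$fmt" mongo latest 0da05d84b1fe '2 weeks ago' 394MB
}

# Locate the column from the header, then cut each row at that offset.
extract_ids() {
    awk '
        NR == 1 { start = index($0, "IMAGE ID"); next }  # offset of the header text
        start   { id = substr($0, start)                 # row text from that offset on
                  sub(/[[:space:]].*/, "", id)           # drop everything after the ID
                  print id }
    '
}

ids_20=$(sample_images 20 | extract_ids)
ids_30=$(sample_images 30 | extract_ids)
printf '%s\n' "$ids_20"
```

Both widths give the same two IDs, which is the point of looking up the offset at run time.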
Awk lets you write regular expression extractions, too:
awk 'NR>1 { sub(/^([^[:space:]]*[[:space:]]+){2}/, ""); sub(/[[:space:]].*/, ""); print }'
This is where it overlaps with sed:
sed -n '2,$s/^[^ ]\+[ ]\+[^ ]\+[ ]\+\([^ ]\+\)[ ].*/\1/p'
though sed is significantly less human-readable, especially for nontrivial scripts. (This is still pretty trivial.)
If you haven't used regex before, the above will seem cryptic, but it really isn't very hard to pick apart. We are looking for sequences of non-spaces (a field in a column) followed by sequences of spaces (a column separator) - two before the ID field and whatever comes after it, starting from the first space after the ID column.
If you want to learn shell scripting, you should probably also learn at least the basics of Awk (and a passing familiarity with sed). If you just want to get the job done, and perhaps aren't specifically interested in learning U*x tools (though you probably should be anyway!), perhaps instead learn a modern scripting language like Python or Ruby.
... Here's a Python docker library:
import docker
client = docker.from_env()
for image in client.images.list():
    print(image.id)
Can we cut based on field names? No.
How can I determine which of Sed/AWK/Cut to prefer? YMMV. For this particular input where fields are separated by two or more spaces, using awk you could set field separator to "  +" (two or more spaces), look for desired field name (IMAGE ID below) and print only that particular field:
$ awk -F"  +" '                 # set field separator
{
    if(f=="")                   # while we have not determined the desired field
        for(i=1;i<=NF;i++)      # ... keep looking
            if($i=="IMAGE ID")
                f=i
    if(f!="")                   # once found
        print $f                # start printing it
}' file
Output:
IMAGE ID
806f56c84444
0da05d84b1fe
As one-liner:
$ awk -F"  +" '{if(f=="")for(i=1;i<=NF;i++)if($i=="IMAGE ID")f=i;if(f!="")print $f}' file

Can I use grep to extract a single column of a CSV file?

I'm trying to solve a problem that I have to finish as soon as possible.
I have a csv file, fields separated by ;.
I'm asked to make a shell command using grep to list only the third column, using regex. I can't use cut. It is an exercise.
My file is like this:
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
2;Wayne;Watkins;22;Lanme Place;Cotoiwi;NC;86578
3;Danny;Vega;25;Fofci Center;Momahbih;MS;21027
4;Larry;Robinson;23;Bammek Boulevard;Gaizatoh;NE;27517
5;Myrtie;Black;20;Savon Square;Gokubpat;PA;92219
6;Nellie;Greene;23;Utebu Plaza;Rotvezri;VA;17526
7;Clyde;Reynolds;19;Lupow Ridge;Kedkuha;WI;29749
8;Calvin;Reyes;47;Paad Loop;Beejdij;KS;29247
9;Douglas;Graves;43;Gouk Square;Sekolim;NY;13226
10;Josephine;Estrada;48;Ocgig Pike;Beheho;WI;87305
11;Eugene;Matthews;26;Daew Drive;Riftemij;ME;93302
12;Stanley;Tucker;54;Cure View;Woocabu;OH;45475
13;Lina;Holloway;41;Sajric River;Furutwe;ME;62184
14;Hettie;Carlson;57;Zuheho Pike;Gokrobo;PA;89098
15;Maud;Phelps;57;Lafni Drive;Gokemu;MD;87066
16;Della;Roberson;53;Zafe Glen;Celoshuv;WV;56749
17;Cory;Roberson;56;Riltav Manor;Uwsupep;LA;07983
18;Stella;Hayes;30;Omki Square;Figjitu;GA;35813
19;Robert;Griffin;22;Kiroc Road;Wiregu;OH;39594
20;Clyde;Reynolds;19;Lupow Ridge;Kedkuha;WI;29749
21;Calvin;Reyes;47;Paad Loop;Beejdij;KS;29247
22;Douglas;Graves;43;Gouk Square;Sekolim;NY;13226
23;Josephine;Estrada;48;Ocgig Pike;Beheho;WI;87305
24;Eugene;Matthews;26;Daew Drive;Riftemij;ME;93302
I think I should use something like: cat < test.csv | grep 'regex'.
Thanks.
Right Tools For The Job: Using awk or cut
Assuming you want to match the third column against a specific field:
awk -F';' '$3 ~ /Foo/ { print $0 }' file.txt
...will print any line where the third field contains Foo. (Changing print $0 to print $3 would print only that third field).
If you just want to print the third column regardless, use cut: cut -d';' -f3 <file.txt
Wrong Tool For The Job: Using GNU grep
On a system where grep has the -o option, you can chain two instances together -- one to trim everything after the fourth column (and remove lines with less than four columns), another to take only the last remaining column (thus, the fourth):
str='foo;bar;baz;qux;meh;whatever'
grep -Eo '^[^;]*[;][^;]*[;][^;]*[;][^;]*' <<<"$str" \
| grep -Eo '[^;]+$'
To explain how that works:
^, outside of square brackets, matches only at the beginning of a line.
[^;]* matches any character except ; zero-or-more times.
[;] matches only the character ;.
...thus, each [^;]*[;] in the regex matches a single field, whether or not that field contains text. Putting four of those in the first stage means we're matching only fields, and grep -o tells grep to only emit content it was successfully able to match.
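The same two-stage idea can be pointed at the third column that the question asks about. This is a sketch against two rows of the sample data, assuming a grep that supports -o (GNU and BSD grep do) and a non-empty third field:

```shell
csv='1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
2;Wayne;Watkins;22;Lanme Place;Cotoiwi;NC;86578'

# Stage 1 keeps the first three fields of each line ("1;Evan;Bell").
# Stage 2 keeps only the text after the last ";" of that result.
third=$(printf '%s\n' "$csv" |
    grep -Eo '^([^;]*;){2}[^;]+' |
    grep -Eo '[^;]+$')

printf '%s\n' "$third"
```

The {2} interval stands for "two fields, each followed by a separator", so changing it to {N-1} selects column N.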
If you just need the 3rd field and it's always properly delimited with ';' why not use 'cut'?
cut -d';' -f3 <filename>
UPDATED:
OP wasn't clear, maybe only want to look at the 3rd line?
head -3 <filename> | tail -1
OR.. Maybe just getting of list of the things that appear in the 3rd field?
Not clear what the intended use of 'grep' would be??
cut -d';' -f3 <filename> | sort -u
As the other answers have said, using grep is a bad/unfortunate idea.
The only way I can think of using grep is to pull out a specific row where the 3rd column == some value. E.g.,
grep '^\([^;]*;\)\{2\}Bell;' test.txt
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
Or if the first column is the index (not counting it as a column):
grep '^\([^;]*;\)\{3\}39;' test.txt
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
Even using grep in this case leads to a pretty ugly solution.
Edit: Didn't see Charles Duffy's answer... that's pretty clever.

Regex - Pattern Matching in Shell

I am trying to match a pattern and extract the values that come after it. I used the regex pattern matching below, but it didn't help: no values were extracted, and I got a blank value when I echoed the result.
Someone let me know what mistake I made.
Sample input:
class="remove_link_style">Site Issue - Please check</a></td><td>
Working</td><td>
<ahref="/0051043899"class="remove_link_style">
Pattern used: text=$(echo "class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">" | grep -o --perl-regexp "(?class="remove_link_style")[a-zA-Z0-9_]+"")
I also wanted to extract the string that comes after class="remove_link_style" but before </a></td><td>
I think you would find a lot of references and advice not to parse XML with bash tools like grep/sed/awk . With this context, I would advise using any of the parsing tools like http://xmlsoft.org/xmllint.html or http://xmlstar.sourceforge.net/doc/xmlstarlet.txt . But if you'd like to quickly extract the contents, you can combine grep and cut as below.
echo 'class="remove_link_style">GB|Trekkinn-UK|Manualcrawlrequest|1</a></td><td>WorkInProgress</td><td><ahref="/0051043899"class="remove_link_style">' | grep -Eo 'style"[^<>]*>[^<>]+' | cut -f2 -d">"
This prints out:
GB|Trekkinn-UK|Manualcrawlrequest|1
WorkInProgress
EDIT : As per OP's ask, store the output into an array.
If you need the output to be stored in an array, you need to set the IFS since you have white spaces in your elements.
IFS=$'\n'
result=($(echo 'class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">' | grep -Eo 'style"[^<>]*>[^<>]+' | cut -f2 -d">"))
unset IFS
for i in "${result[@]}"; do echo "$i"; done
Site Issue - Please check
Working
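For the second part of the question (only the text between class="remove_link_style"> and </a>), a single sed substitution may be enough. This is a sketch that assumes the whole fragment sits on one line and that the label contains no "<"; the variable names are illustrative:

```shell
html='class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">'

# Capture everything between the class attribute and the closing </a> tag.
# Single quotes around the input avoid the nested double-quote problem
# in the original command.
label=$(printf '%s\n' "$html" |
    sed -n 's/.*class="remove_link_style">\([^<]*\)<\/a>.*/\1/p')

echo "$label"
```

The usual warning still applies: as soon as the markup spans several lines or nests tags, a real HTML/XML parser is the safer choice.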

grep pipe searching for one word, not line

For some reason I cannot get this to output just the version of this line. I suspect it has something to do with how grep interprets the dash.
This command:
admin@DEV:~/TEMP$ sendemail
Yields the following:
sendemail-1.56 by Brandon Zehm
More output below omitted
The first line is of interest. I'm trying to store the version to variable.
TESTVAR=$(sendemail | grep '\s1.56\s')
Does anyone see what I am doing wrong? Thanks
TESTVAR is just empty. Even without TESTVAR, the output is empty.
I just tried the following too, thinking this might work.
sendemail | grep '\<1.56\>'
I just tried it again while editing, and I think I have another issue. Perhaps I'm not handling the output correctly. It's outputting the entire line, but I can see that grep is finding 1.56 because it highlights it in the line.
$ TESTVAR=$(echo 'sendemail-1.56 by Brandon Zehm' | grep -Eo '1.56')
$ echo $TESTVAR
1.56
The point is grep -Eo '1.56'
from grep man page:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output
line.
Your regular expression doesn't match the form of the version. You have specified that the version is surrounded by spaces, yet in front of it you have a dash.
Replace the first \s with the capitalized form \S, or explicit set of characters and it should work.
I'm wondering: In your example you seem to know the version (since you grep for it), so you could just assign the version string to the variable. I assume that you want to obtain any (unknown) version string there. The regular expression for this in sed could be (using POSIX character classes):
sendemail |sed -n -r '1 s/sendemail-([[:digit:]]+\.[[:digit:]]+).*/\1/ p'
The -n suppresses the normal default output of every line; -r enables extended regular expressions; the leading 1 tells sed to only work on line 1 (I assume the version appears in the first line). I anchored the version number to the telltale string sendemail- so that potential other numbers elsewhere in that line are not matched. If the program name changes or the hyphen goes away in future versions, this wouldn't match any longer though.
Both the grep solution above and this one have the disadvantage to read the whole output which (as emails go these days) may be long. In addition, grep would find all other lines in the program's output which contain the pattern (if it's indeed emails, somebody might discuss this problem in them, with examples!). If it's indeed the first line, piping through head -1 first would be efficient and prudent.
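Putting the head suggestion together with the anchored pattern might look like the sketch below. fake_sendemail is a made-up function that prints the banner from the question so the example runs without the real program, and the pattern is written as a POSIX basic regular expression so it does not depend on sed -r:

```shell
# Hypothetical stand-in for the real command; it prints the sample output.
fake_sendemail() {
    printf '%s\n' 'sendemail-1.56 by Brandon Zehm' 'More output below omitted'
}

# Look at the first line only, then capture "digits.digits" after "sendemail-".
TESTVAR=$(fake_sendemail | head -n 1 |
    sed -n 's/^sendemail-\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')

echo "$TESTVAR"
```

If the first line does not start with sendemail- followed by a version, TESTVAR ends up empty, which is easy to test for.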
jayadevan@jayadevan-Vostro-2520:~$ echo $sendmail
sendemail-1.56 by Brandon Zehm
jayadevan@jayadevan-Vostro-2520:~$ echo $sendmail | cut -f2 -d "-" | cut -f1 -d" "
1.56

Shell scripting - how to properly parse output, learning examples.

So I want to automate a manual task using shell scripting, but I'm a little lost as to how to parse the output of a few commands. I would be able to do this in other languages without a problem, so I'll just explain what I'm going for in pseudocode and provide an example of the command output I'm trying to parse.
Example of output:
Chg 2167467 on 2012/02/13 by user1234#filename 'description of submission'
What I need to parse out is '2167467'. So what I want to do is split on spaces and take element 1 to use in another command. The output of my next command looks like this:
Change 2167463 by user1234#filename on 2012/02/13 18:10:15
description of submission
Affected files ...
... //filepath/dir1/dir2/dir3/filename#2298 edit
I need to parse out '//filepath/dir1/dir2/dir3/filename#2298' and use that in another command. Again, what I would do is remove the blank lines from the output, grab the 4th line, and split on space. From there I would grab the 1st element from the split and use it in my next command.
How can I do this in shell scripting? Examples or a point to some tutorials would be great.
It's not clear whether you want to use the result from the first command for processing the 2nd command. If that is true, then
targString=$( cmd1 | awk '{print $2}')
command2 | sed -n "/${targString}/{n;n;n;s#.*[/][/]#//#;p;}"
Your example data has 2 different Chg values in it (2167467, 2167463), so if you just want to process this output in 2 different ways, it's even simpler:
cmd1 | awk '{print $2}'
cmd2 | sed -n '/Change/{n;n;n;s#.*[/][/]#//#;p;}'
I hope this helps.
I'm not 100% clear on your question, but I would use awk.
http://www.cyberciti.biz/faq/bash-scripting-using-awk/
Your first variable would look something like this
temp="Chg 2167467 on 2012/02/13 by user1234#filename 'description of submission'"
To get the number you want do this:
temp=`echo $temp | cut -f2 -d" "`
Let the output of your second command be saved to a file something like this
command $temp > file.txt
To get what you want from the file you can run this:
temp=`tail -1 file.txt | cut -f2 -d" "`
rm file.txt
The last block of code gets the last nonwhite line of the file and delimits on the second set of white spaces
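As an alternative that needs neither a temporary file nor a fixed line count, here is a sketch of both steps with awk. The two sample outputs are stored in variables (the blank lines in the second one are assumed from the question's mention of removing them), and the file line is recognised by its leading "..." and a path that starts with //:

```shell
line="Chg 2167467 on 2012/02/13 by user1234#filename 'description of submission'"

describe_output='Change 2167463 by user1234#filename on 2012/02/13 18:10:15

description of submission

Affected files ...

... //filepath/dir1/dir2/dir3/filename#2298 edit'

# Step 1: the change number is the second whitespace-separated field.
change=$(printf '%s\n' "$line" | awk '{ print $2 }')

# Step 2: print the path from the first line that looks like "... //path action".
path=$(printf '%s\n' "$describe_output" |
    awk '$1 == "..." && $2 ~ /^\/\// { print $2; exit }')

printf '%s\n' "$change" "$path"
```

Matching on the shape of the line rather than counting lines means a longer description does not shift the result.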
