I am trying to get the language code from web pages with curl.
I wrote the command below and it works:
curl -Ls yahoo.com | grep "lang=" | head -1 | cut -d ' ' -f 3 | cut -d"\"" -f 2
but sometimes the markup is different, for example:
curl -Ls stick-it.app | grep "lang=" | head -1 | cut -d ' ' -f 3 | cut -d"\"" -f 2
where the page source looks like
<html dir="rtl" lang="he-IL">
I just need to get he-IL
If there is any other way to do this, I would appreciate it.
Using any sed in any shell on every Unix box:
$ curl -Ls yahoo.com | sed -n 's/^<html.* lang="\([^"]*\).*/\1/p'
en-US
If you have GNU grep, then using -P (Perl-compatible regex):
curl -Ls stick-it.app | grep -oP '\slang="\K[^"]+'
he-IL
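The \K resets the start of the reported match, so everything before it must be present but is not printed. A quick check against a literal tag (just a sketch):
$ echo '<html dir="rtl" lang="he-IL">' | grep -oP '\slang="\K[^"]+'
he-IL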
With awk's match function, one could try the following too:
your_curl_command | awk '
match($0,/^<html.*lang="[^"]*/){
val=substr($0,RSTART,RLENGTH)
sub(/.*lang="/,"",val)
print val
}
'
Explanation: a detailed explanation of the above.
your_curl_command | awk ' ##Starting the awk program from here.
match($0,/^<html.*lang="[^"]*/){ ##Using the match function to match from <html through lang=" and on to the next occurrence of ".
val=substr($0,RSTART,RLENGTH) ##Creating val, a substring of the matched text.
sub(/.*lang="/,"",val) ##Removing everything up to and including lang=" from val.
print val ##Printing val.
}
'
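Put together as a one-liner, it might look like this (assuming, as in the question, that the page carries a single <html ... lang="..."> line):
curl -Ls yahoo.com | awk 'match($0,/^<html.*lang="[^"]*/){val=substr($0,RSTART,RLENGTH); sub(/.*lang="/,"",val); print val}'
en-US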
Another variation, using GNU awk and a pattern with a capture group via match:
match(string, regexp [, array])
curl -Ls yahoo.com | awk 'match($0, /<html [^<>]*lang="([^"]*)"/, a) {print a[1]}'
Output
en-US
The pattern matches
<html Match literally
[^<>]* Match 0+ any char except < or >
lang=" Match literally
([^"]*) Capture group 1 (denoted by a[1] in the example code) matching 0+ times any char except "
" Closing double quote
I am trying to create a big regex from many options in a file, to be used in gawk. The goal is to find lines in lines.txt that match ANY of the options in regex.txt.
File of lines to be searched
echo -n "dog
cat
bobcat" > lines.txt
File of regular expressions which will be combined into a big regex
echo -n "dog
cat" > regex.txt
I know the structure of what I am trying to do, but when I use sed to insert positional matching characters into the regex I get a trailing |.
This is what I currently have
rgx=$(cat "regex.txt" | sed 's#^#\\\\<#' | tr '\n' '|')
gawk -v regex=$rgx 'BEGIN {IGNORECASE = 1} {print gsub(regex,"")}' lines.txt
Current output from gawk is
1
1
7
Desired output from gawk is
1
1
0
Please help
It makes no sense to also use sed when you're using awk. It sounds like you want something like:
gawk '
BEGIN { IGNORECASE = 1 }
NR == FNR {
regex = (NR>1 ? regex "|" : "") "\\<" $0 "\\>"
next
}
{ print gsub(regex,"") }
' regex.txt lines.txt
1
1
0
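The \< and \> word boundaries are what keep bobcat from matching cat; a quick check of just that piece (a sketch, GNU awk regex syntax):
$ echo 'bobcat' | gawk '{ print gsub(/\<cat\>/,"") }'
0
$ echo 'cat' | gawk '{ print gsub(/\<cat\>/,"") }'
1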
@Stef's comment gets me to the desired output.
My sed was inserting a newline at the end, which tr then replaced with a |, and this trailing | was causing the unexpected behavior.
So the working regex is rgx=$(cat "regex.txt" | sed 's#^#\\\\<#' | tr '\n' '|' | sed 's#|$##')
But, as Ed Morton's accepted answer shows, this can be done more elegantly using only gawk. I clearly need to learn more about awk!
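For completeness, a sketch of the corrected shell version, quoting $rgx and dropping the useless cat (the gawk-only answer above is still the cleaner route):
rgx=$(sed 's#^#\\\\<#' regex.txt | tr '\n' '|' | sed 's#|$##')
gawk -v regex="$rgx" 'BEGIN {IGNORECASE = 1} {print gsub(regex,"")}' lines.txt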
I want to cut my url https://jenkins-crumbtest2.origin-ctc-core-nonprod.com/ down to https://origin-ctc-core-nonprod.com. I have tried several ways to handle it:
$ echo https://jenkins-crumbtest2-test.origin-ctc-core-nonprod.com/ | cut -d"/" -f3 | cut -d"/" -f5
jenkins-crumbtest2.origin-ctc-core-nonprod.com
I have 3 inputs, and passing any of them should produce the same output.
Input:
1. https://jenkins-crumbtest2-test.origin-ctc-core-nonprod.com/ (or)
2. https://jenkins-crumbtest2.origin-ctc-core-nonprod.com/ (or)
3. https://jenkins-crumbtest2-test-lite.origin-ctc-core-nonprod.com/
Expected Output:
https://origin-ctc-core-nonprod.com
Can someone please help me?
Could you please try the following, written and tested with the shown samples only.
awk '{gsub(/:\/\/.*test\.|:\/\/.*crumbtest2\.|:\/\/.*test-lite\./,"://")} 1' Input_file
OR non-one liner form of solution above is as follows.
awk '
{
gsub(/:\/\/.*test\.|:\/\/.*crumbtest2\.|:\/\/.*test-lite\./,"://")
}
1
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting the awk program from here.
{
gsub(/:\/\/.*test\.|:\/\/.*crumbtest2\.|:\/\/.*test-lite\./,"://") ##Globally substituting everything from :// up to test. OR crumbtest2. OR test-lite. with :// in the line.
}
1 ##Printing the current line.
' Input_file ##Mentioning the Input_file name here.
This awk skips the records that don't have fixed string origin-ctc-core-nonprod.com in them:
awk 'match($0,/origin-ctc-core-nonprod\.com/){print "https://" substr($0,RSTART,RLENGTH)}'
You can use it with echo "$string" | awk ..., cat file | awk ..., or awk ... file.
Explained:
awk ' # using awk
match($0,/origin-ctc-core-nonprod\.com/) { # if fixed string is matched
print "https://" substr($0,RSTART,RLENGTH) # output https:// and fixed string
# exit # uncomment if you want only
}' # one line of output like in sample
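For example, with one of the sample URLs:
$ echo 'https://jenkins-crumbtest2-test.origin-ctc-core-nonprod.com/' | awk 'match($0,/origin-ctc-core-nonprod\.com/){print "https://" substr($0,RSTART,RLENGTH)}'
https://origin-ctc-core-nonprod.com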
Or if you don't need the https:// part, you could just use grep:
grep -om 1 "origin-ctc-core-nonprod\.com"
Then again:
$ var=$(grep -om 1 "origin-ctc-core-nonprod\.com" file) && echo https://$var
My text:
(
"en-US"
)
What I need:
en-US
Currently I'm able to get it by piping it through
... | tr -d '[:space:]' | sed s/'("'// | sed s/'("'// | sed s/'")'//
I wonder if there is a simpler way to extract the string between the quotes rather than chopping off the useless parts one by one.
... | grep -oP '(?<=").*(?=")'
Explanation:
-o: Only output matching string
-P: Use Perl style RegEx
(?<="): Lookbehind, so only match text that is preceded by a double quote
.*: Match any characters
(?="): Lookahead, so only match text that is followed by a double quote
With sed
echo '(
"en-US"
)' | sed -rn 's/.*"(.*)".*/\1/p'
With 2 commands:
echo '(
"en-US"
)' | tr -d "\n" | cut -d '"' -f2
Could you please try the following, where var is the bash variable having the shown sample value stored in it.
echo "$var" | awk 'match($0,/".*"/){print substr($0,RSTART+1,RLENGTH-2)}'
Explanation: the following is only for explanation purposes.
echo "$var" | ##Using echo to print the variable named var and a |(pipe) to send its output to the awk command as input.
awk ' ##Starting the awk program from here.
match($0,/".*"/){ ##Using awk's match function to match a regex from one " to the next occurrence of "; the match sets the two default variables RSTART and RLENGTH.
print substr($0,RSTART+1,RLENGTH-2) ##RSTART is the starting index of the matched regex and RLENGTH is its length; printing the substring from RSTART+1 for RLENGTH-2 characters returns only the value between the quotes.
}' ##Closing awk command here.
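For example, with the sample value stored in var:
var='(
"en-US"
)'
echo "$var" | awk 'match($0,/".*"/){print substr($0,RSTART+1,RLENGTH-2)}'
en-US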
Consider using
... | grep -o '"[^"]\{1,\}"' | sed -e 's/^"//' -e 's/"$//'
grep extracts all substrings between quotes (excluding empty ones), and sed then removes the quotes at both ends.
And this one?
... | grep '"' | cut -d '"' -f 2
It works if you have just 1 quoted value per line.
Running this command fails:
$(printf "awk '{%sprint}'" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')) file.txt
It gives the following error:
awk: '
awk: ^ invalid char ''' in expression
However, if I run the substituted command, I get this:
awk '{gsub("ACB",1);gsub("ASW",2);gsub("BEB",3);gsub("CDX",4);gsub("CEU",5);gsub("CHB",6);gsub("CHS",7);gsub("CLM",8);gsub("ESN",9);gsub("FIN",10);gsub("GBR",11);gsub("GIH",12);gsub("GWD",13);gsub("IBS",14);gsub("ITU",15);gsub("JPT",16);gsub("KHV",17);gsub("LWK",18);gsub("MSL",19);gsub("MXL",20);gsub("PEL",21);gsub("PJL",22);gsub("PUR",23);gsub("STU",24);gsub("TSI",25);gsub("YRI",26);print}'
which I can run like so:
awk '{gsub("ACB",1);gsub("ASW",2);gsub("BEB",3);gsub("CDX",4);gsub("CEU",5);gsub("CHB",6);gsub("CHS",7);gsub("CLM",8);gsub("ESN",9);gsub("FIN",10);gsub("GBR",11);gsub("GIH",12);gsub("GWD",13);gsub("IBS",14);gsub("ITU",15);gsub("JPT",16);gsub("KHV",17);gsub("LWK",18);gsub("MSL",19);gsub("MXL",20);gsub("PEL",21);gsub("PJL",22);gsub("PUR",23);gsub("STU",24);gsub("TSI",25);gsub("YRI",26);print}' file.txt
And it works perfectly. What am I doing wrong?
@ChrisLear gave me a working solution, but I still don't quite understand what the solution is doing. Here's the working code:
$(printf "awk {%sprint}" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')) file.txt
The single quotes around {%sprint} are removed. Why do those single quotes break the command substitution?
edit: changed backtick to $(...) notation. Also added solution I don't understand.
Try removing the quotes from the command being generated.
`printf "awk {%sprint}" $(tail -n +2 file.txt | cut -f2 | sort | uniq | awk 'BEGIN{a=1}{printf "gsub(\"%s\",%i);", $1,a++}')` file.txt
For an explanation, see the accepted answer at Why does command substitution change how quoted arguments work?
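The short version of that answer: quotes that come out of a command substitution (or any variable expansion) are treated as literal characters, not as shell syntax, so awk receives '{...}' with the single quotes as part of its program text. A minimal reproduction (the cmd variable is just for illustration):
$ cmd="awk '{print}'"
$ echo foo | $cmd
awk: '
awk: ^ invalid char ''' in expression
$ echo foo | awk '{print}'
foo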
It looks like you're trying to take a bunch of unique 2nd fields from a file starting at line 2 and map those to numbers based on their alphabetic ordering, then apply the change to the same file. If so then with GNU awk for sorted_in and inplace editing that'd be:
awk -i inplace '
NR==FNR {
if (NR>1) {
map[$2]
}
next
}
FNR==1 {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (str in map) {
map[str] = ++i
}
}
{
$2 = map[$2]
print
}
' file.txt
If that's not what you need then edit your question to show concise, testable sample input and expected output.
I am validating a few columns in a pipe-delimited file. My second column is defaulted to '*'.
E.g. data of file to be validated:
abc|* |123
def|** |456
ghi|* |789
The 2nd record has 2 stars due to erroneous data.
I tried it as:
Value_to_match="*"
unmatch_count=$(cat <filename> | cut -d'|' -f2 | awk '{$1=$1};1' | grep -vw "$Value_to_match" | sort -n | uniq | wc -l)
echo "$unmatch_count"
This gives me a count of 0 whereas I am expecting 1 (for **), as I have used -w with grep for an exact word match and -v for an inverted match.
How can I grep **?
The problem here is that grep treats ** as a regular expression. To prevent this, use -F to match fixed strings:
grep -F '**' file
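Since the goal is a count of the bad rows, grep can also do the counting itself with -c (with the shown sample data):
$ grep -cF '**' file
1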
However, you have an unnecessarily big set of piped operations, while awk alone can handle it quite well.
If you want to check lines containing ** in the second column, say:
$ awk -F"|" '$2 ~ /\*\*/' file
def|** |456
If you want to count how many of such lines you have, say:
$ awk -F"|" '$2 ~ /\*\*/ {sum++} END {print sum}' file
1
Note the usage of awk:
-F"|" to set the field separator to |.
$2 ~ /\*\*/ to say: hey, in every line check if the second field contains two asterisks (remember we sliced lines by |). We are escaping the * because it has a special meaning as a regular expression.
If you want to output those lines that have just one asterisk as second field, say:
$ awk -F"|" '$2 ~ /^*\s*$/' file
abc|* |123
ghi|* |789
Or check for those not matching this regex with !~:
$ awk -F"|" '$2 !~ /^*\s*$/' a
def|** |456
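And if what you need is the unmatch_count from the question, the same test can feed a counter (a sketch; [[:space:]] is used instead of \s for portability beyond GNU awk):
$ awk -F"|" '$2 !~ /^\*[[:space:]]*$/ {n++} END {print n+0}' file
1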