awk: copy from A to B and output..? - bash

My file is a bookmarks backup, backup-6.session.
Inside the file is one very long run of text, and I need to copy out all the URLs (there are many). Here is a sample of what it contains:
......"charset":"UTF-8","ID":3602197775,"docshellID":0,"originalURI":"https://www.youtube.com/watch?v=axxxxxxxxsxsx","docIdentifier":470,"structuredCloneState":"AAAAA.....
The result should go to an output file, text.txt:
https://www.youtube.com/watch?v=axxxxxxxxsxsx
https://www.youtube.com/watch?v=bxxxxxxxxsxsx
https://www.youtube.com/watch?v=cxxxxxxxxsxsx
https://www.youtube.com/watch?v=dxxxxxxxxsxsx
....
....
Each URL starts right after "originalURI":" and ends at the next ".
Which command is best for this: awk, sed, or something else?
Thank you.

With GNU awk for multi-char RS and RT:
$ awk -v RS='"originalURI":"[^"]+' 'sub(/.*"/,"",RT){print RT}' file
https://www.youtube.com/watch?v=axxxxxxxxsxsx
GNU awk allows RS to be a regular expression and sets RT to the text that matched it, so each RT is "originalURI":" followed by a URL; the sub() strips everything up to and including the last quote, leaving just the URL.

You could also use grep, for example:
grep -oh "https://www\.youtube\.com/watch?v=[A-Za-z0-9]*" backup-6.session > text.txt
That works if the axxxxxxxxsxsx part contains only the letters A-Z, a-z or digits 0-9, and is not immediately followed by any of those characters.
Notice the flags for grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
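If your grep supports PCRE (-P), you can also avoid assumptions about the URL's characters entirely; a sketch using \K to drop the marker from the match:
grep -oP '"originalURI":"\K[^"]+' backup-6.session > text.txt
This prints whatever sits between "originalURI":" and the next double quote, whether or not it is a YouTube URL.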

The awk solution would be as follows:
awk -F, '{ for (i=1;i<=NF;i++) { if ( $i ~ "originalURI") { split($i,add,":");print gensub("\"","","g",add[2])":"gensub("\"","","g",add[3])} } }' filename
We loop through each comma-separated field and pattern-match against "originalURI". We then split that field on ":" with the split function (leaving the URL's "https" and "//..." parts in add[2] and add[3]) and remove the quotation marks with the GNU awk function gensub before rejoining them with ":".
The sed solution would be as follows:
sed -rn 's/^.*originalURI":"(.*)","docIdentifier.*$/\1/p' filename
Run sed with extended regular expressions (-r) and suppress automatic output (-n). Substitute the whole line with the part captured by the parenthesized group (\1), printing the result.
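Note that session files are often one single very long line, in which case the substitution above yields only one URL. A sketch that first breaks the line apart on commas (assuming the URLs themselves contain no commas):
tr ',' '\n' < backup-6.session | sed -n 's/^"originalURI":"\(.*\)"$/\1/p' > text.txt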

Related

How can I parse CSV files with quoted fields containing commas, in awk?

I have a big CSV file, and I use awk with the field separator set to a comma. However, some fields are quoted and contain a comma, and I'm facing this issue:
Original file:
Downloads $ cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"
I am trying this way:
Downloads $ cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd
Expecting result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
I would use a tool that is able to properly parse CSV, such as xsv. With it, the command would look like
$ xsv select 2-4 testfile.csv
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"
or, if you really want every value quoted, with a second step:
$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Include (escaped) quotes in your field separator, and add them back in your output print fields:
awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}' testfile.csv
output:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
If gawk (GNU awk) is available, you can make use of FPAT, which matches the fields instead of splitting on field separators.
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, '{print $2, $3, $4}' testfile.csv
Result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
The string ([^,]+)|(\"[^\"]+\") is a regex pattern which matches either of:
([^,]+) ... matches a sequence of any characters other than a comma.
(\"[^\"]+\") ... matches a string enclosed by double quotes (which may include commas in between).
The parentheses around the patterns are there for visual clarity; the regex works without them, e.g. FPAT='[^,]+|\"[^\"]+\"', because the alternation | has lower precedence.
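One caveat from the gawk manual: [^,]+ requires at least one character, so completely empty fields are not matched. A sketch of a variant for data that may contain empty unquoted fields (FPAT behaviour with empty fields has varied between gawk versions, so test it against your data):
awk -v FPAT='([^,]*)|(\"[^\"]+\")' -v OFS=, '{print $2, $3, $4}' testfile.csv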

How to ignore case when using awk or sed [duplicate]

sed -i '/first/i This line to be added'
In this case, how do I ignore case while searching for the pattern first?
You can use the following:
sed 's/[Ff][Ii][Rr][Ss][Tt]/last/g' file
Otherwise, GNU sed provides the I flag:
sed 's/first/last/Ig' file
From man sed:
I
i
The I modifier to regular-expression matching is a GNU extension which
makes sed match regexp in a case-insensitive manner.
Test
$ cat file
first
FiRst
FIRST
fir3st
$ sed 's/[Ff][Ii][Rr][Ss][Tt]/last/g' file
last
last
last
fir3st
$ sed 's/first/last/Ig' file
last
last
last
fir3st
GNU sed
sed '/first/Ii This line to be added' file
You can try
sed 's/first/somethingelse/gI'
If you want to save some typing, try awk; with GNU awk you can set IGNORECASE:
awk -v IGNORECASE="1" '/first/{your logic}' file
For versions of awk that don't understand the IGNORECASE special variable, you can use something like this:
awk 'toupper($0) ~ /PATTERN/ { print "string to insert" } 1' file
Convert each line to uppercase before testing whether it matches the pattern and if it does, print the string. 1 is the shortest true condition, so awk does the default thing: { print }.
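For instance, a quick sanity check with two lines fed in via printf:
$ printf 'FIRST\nsecond\n' | awk 'toupper($0) ~ /FIRST/ { print "string to insert" } 1'
string to insert
FIRST
second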
To use a variable, you could go with this:
awk -v var="$foo" 'BEGIN { pattern = toupper(var) } toupper($0) ~ pattern { print "string to insert" } 1' file
This passes the shell variable $foo and transforms it to uppercase before the file is processed.
Slightly shorter with bash would be to use -v pattern="${foo^^}" and skip the BEGIN block.
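For example, a minimal sketch of that shortcut (assumes bash 4+ for the ${foo^^} expansion):
awk -v pattern="${foo^^}" 'toupper($0) ~ pattern { print "string to insert" } 1' file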
Use the following; \b matches a word boundary:
sed 's/\bfirst\b/This line to be added/Ig' file

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the requested fields while ignoring the escaped commas. Here those will be fields 2 and 3, like this:
science, social
tennis, ping_pong, chess
I do not know how to handle the escaped characters. How can I do this with awk or sed in the terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. But the second one is performed on only the second field (i.e. $2), so it will not affect the other fields. See: Changing Fields.
To extract multiple fields with properly escaped commas, you need to restore the \, in every field with a for loop, as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
You could replace the \, sequences with another character that won't appear in your text, split the text around the remaining commas, then turn the chosen character back into commas:
sed $'s/\\\,/\31/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\31' ','
Here the ASCII control character octal \31 is used (via bash's $'...' quoting), which your input is very unlikely to contain.
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[#]}"
echo "list_of_sports : ${list_of_sports[#]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than a solution using awk.
The principle of operation is the same as in the other answers: substitute the \, sequence with some other unique character and then use that character to iterate over the second and third field elements.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace escaped commas with newlines, then swap: newlines become commas and commas become newlines. Empty every embedded line that does not contain a comma, remove the remaining newlines, and delete the now-empty lines.
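With the sample file from the question, this should print:
science,social
tennis,ping_pong,chess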
Using Perl: change the \, to some control character, say \x01, and then replace it with , again.
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
  $col = $col FS $(col+1)
  for (i=col+1; i<NF; i++) {
    $i = $(i+1)
  }
  NF--
}
This might get used thusly (note the while loop: a joined field can itself still end in a backslash):
{
  for (col=1; col<=NF; col++) {
    while ($col ~ /\\$/) {
      joincol(col)
    }
  }
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSD awk and GNU awk. YMMV. May contain nuts.
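Put together as one runnable sketch (using the sample file from the question; field 2 is the list of subjects):
awk -F, '
function joincol(col, i) {
  $col = $col FS $(col+1)               # glue the split pieces back together, keeping the backslash
  for (i=col+1; i<NF; i++) $i = $(i+1)  # shift the remaining fields left
  NF--
}
{
  for (col=1; col<=NF; col++)
    while ($col ~ /\\$/)                # a joined field may itself still end in a backslash
      joincol(col)
  gsub(/\\,/, ",", $2)                  # optionally drop the escapes for display
  print $2
}' file
For the sample input this prints list_of_subjects, then science,social, then painting.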
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub to strip the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

Find the pattern (YYYY-MM-DD) and replace it with the same value wrapped in apostrophes

I have this kind of data:
1,1990-01-01,2,A,2015-02-09
1,NULL,2,A,2015-02-09
1,1990-01-01,2,A,NULL
I am looking for a solution that replaces each date in the file with the old value, but wrapped in apostrophes. The expected result for the example would be:
1,'1990-01-01',2,A,'2015-02-09'
1,NULL,2,A,'2015-02-09'
1,'1990-01-01',2,A,NULL
I have found a way to match my date pattern, but I still can't work out what to replace it with.
sed 's/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/????/' a.txt > b.txt
Catch the date in a group by surrounding the pattern with parentheses (). Then you can refer to the captured group with \1 (a second group would be \2, etc.).
sed "s/\([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\)/'\1'/g"
Note the g at the end, which ensures that all matches are replaced (if there are more than one in one line).
If you add the -r switch to sed, the awkward backslashes before the () can be omitted:
sed -r "s/([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])/'\1'/g"
This can be further simplified using quantifiers:
sed -r "s/([0-9]{4}-[0-9]{2}-[0-9]{2})/'\1'/g"
Or even:
sed -r "s/([0-9]{4}-([0-9]{2}){2})/'\1'/g"
As mentioned in the comments, in this particular case you may use & instead of \1, which stands for the whole matched expression, and omit the ():
sed -r "s/[0-9]{4}(-[0-9]{2}){2}/'&'/g"
You need to use a capture group, as well as replace all matching occurrences with the g flag.
sed 's/\([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\)/'"'"'\1'"'"'/g' a.txt > b.txt
The replacement text is a bit confusing because a single-quoted string in shell cannot contain a single quote, so you have to close the single-quoted string, then use a double-quoted single quote. Using $'...'-style quoting in bash simplifies it a bit, at the cost of needing to escape the backslashes.
sed $'s/\\([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\\)/\'\\1\'/g' a.txt > b.txt
Or, you can simply double-quote the script, since there's nothing currently in it that is subject to expansion:
sed "s/\([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\)/'\1'/g" a.txt > b.txt
There is also the special & replacement text, which expands to whatever the regular expressions matches, so you can avoid an explicit capture group:
sed "s/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/'&'/g" a.txt > b.txt
With GNU sed:
sed -E 's/([0-9]{2,4}-?){3}/'\''&'\''/g' file
Depending on your file content, the dates may also be described as a 1 or 2 followed by nine characters that are each a dash or a digit:
sed -E 's/[12][-0-9]{9}/'\''&'\''/g' file
Here is one in awk:
$ awk -v q="'" '
BEGIN { FS=OFS="," } # set delimiters
{
for(i=1;i<=NF;i++) # loop all fields
if($i~/[0-9]{4}-[0-9]{2}-[0-9]{2}/) # if field has a date looking string
$i=q $i q # quote it
}1' file
Output:
1,'1990-01-01',2,A,'2015-02-09'
1,NULL,2,A,'2015-02-09'
1,'1990-01-01',2,A,NULL
You could also try the following. (The regex inside match() could be written as [0-9]{4}-[0-9]{2}-[0-9]{2} too, but old versions of awk do not support interval expressions, so the longhand form is used here.)
awk -v s1="'" '
{
while(match($0,/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/)){
val=val substr($0,1,RSTART-1) s1 substr($0,RSTART,RLENGTH) s1
$0=substr($0,RSTART+RLENGTH)
}
print val $0
val=""
}' Input_file
Output will be as follows.
1,'1990-01-01',2,A,'2015-02-09'
1,NULL,2,A,'2015-02-09'
1,'1990-01-01',2,A,NULL
With Perl, it is simple
perl -pe ' s/(\d{4}-\d\d-\d\d)/\x27$1\x27/g '
With the inputs below; \x27 is the hex escape for a single quote:
$ cat liubo.txt
1,1990-01-01,2,A,2015-02-09
1,NULL,2,A,2015-02-09
1,1990-01-01,2,A,NULL
$ perl -pe ' s/(\d{4}-\d\d-\d\d)/\x27$1\x27/g ' liubo.txt
1,'1990-01-01',2,A,'2015-02-09'
1,NULL,2,A,'2015-02-09'
1,'1990-01-01',2,A,NULL
$
If you want to write the single quotes literally, escape them and the $ and wrap the command in double quotes:
$ perl -pe " s/(\d{4}-\d\d-\d\d)/\'\$1\'/g " liubo.txt
1,'1990-01-01',2,A,'2015-02-09'
1,NULL,2,A,'2015-02-09'
1,'1990-01-01',2,A,NULL
$

How to retrieve digits including the separator "."

I am using grep to get a string like this: ANS_LENGTH=266.50. Then I use sed to keep only the digits.
This is my full command: grep --text 'ANS_LENGTH=' log.txt | sed -e 's/[^[[:digit:]]]*//g'
The result is : 26650
How can this command be changed so that the result keeps the decimal separator: 266.50?
You don't need grep if you are going to use sed. Just use sed's // address to match the lines you need to print.
sed -n '/ANS_LENGTH/s/[^=]*=\(.*\)/\1/p' log.txt
-n suppresses printing of lines that do not match /ANS_LENGTH/.
Using the captured group, we print the value after the = sign.
The p flag at the end prints only the lines where the substitution succeeded.
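A quick check with a sample line:
$ echo 'ANS_LENGTH=266.50' | sed -n '/ANS_LENGTH/s/[^=]*=\(.*\)/\1/p'
266.50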
If your grep happens to support -P option then you can do:
grep -oP '(?<=ANS_LENGTH=).*' log.txt
(?<=...) is a look-behind construct: it requires ANS_LENGTH= to appear just before the match without including it in the matched text. This requires the -P option.
-o allows us to print only the value part.
You need to match a literal dot as well as the digits, and the bracket expression in your command is malformed (the extra pair of square brackets is a typo; the class should be written [^[:digit:]]).
Try sed -e 's/[^[:digit:].]*//g'
Inside a bracket expression a dot is literal, so it needs no escaping; this deletes every run of characters that are neither digits nor dots.
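Plugged into the pipeline from the question (assuming the matched line contains nothing but ANS_LENGTH=266.50):
$ grep --text 'ANS_LENGTH=' log.txt | sed -e 's/[^[:digit:].]*//g'
266.50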
Here are some awk examples:
cat file:
some data ANS_LENGTH=266.50 other=22
not mye data=43
GNU awk (needed because RS is a regex):
awk '/ANS_LENGTH/ {f=NR} f&&NR-1==f' RS="[ =]" file
266.50
awk '/ANS_LENGTH/ {getline;print}' RS="[ =]" file
266.50
Plain awk
awk -F"[ =]" '{for(i=1;i<=NF;i++) if ($i=="ANS_LENGTH") print $(i+1)}' file
266.50
awk '{for(i=1;i<=NF;i++) if ($i~"ANS_LENGTH") {split($i,a,"=");print a[2]}}' file
266.50
