I have a file named file.txt
$cat file.txt
1./abc/cde/go/ftg133333.jpg
2./abc/cde/go/ftg24555.jpg
3./abc/cde/go/ftg133333.gif
4./abt/cte/come/ftg24555.jpg
5./abc/cde/go/ftg133333.jpg
6./abc/cde/go/ftg24555.pdf
MY GOAL: To get only one line from lines who's first, second and third PATH are the same and have the same file EXTENSION.
Note each PATH is separated by forward slash "/". Eg in the first line of the list, the first PATH is abc, second PATH is cde and third PATH is go.
File EXTENSION is .jpg, .gif,.pdf... always at the end of the line.
HERE IS WHAT I TRIED
sort -u -t '/' -k1 -k2 -k3
My thoughts
Using / as a delimiter gives me 4 fields in each line. Sorting them with "-u" will remove all but 1 line with unique First, Second and 3rd field/PATH. But obviously, I didn't take into account the EXTENSION(jpg,pdf,gif) in this case.
MY QUESTION
I need a way to grep only 1 of the lines if the first, second and third field are same and have the same EXTENSION using "/" as delimiter to divide it into fields. I want to output it to a another file, say file2.txt.
In the file2.txt, how do I add a word say "KALI" before the extension in each line, so it will look something like /abc/cde/go/ftg13333KALI.jpg using line 1 as an example in file.txt above.
Desired Output
/abc/cde/go/ftg133333KALI.jpg
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
COMMENT
Line 1,2 & 5 have the same 1st,2nd and 3rd field, with same file extension
".jpg" so only line 1 should be in the output.
Line 3 is in the output even though it has same 1st,2nd and 3rd field with
1,2 and 5, because the extension is different ".gif".
Line 4 has different 1st, 2nd and 3rd field, hence it in output.
Line 6 is in the output even though it has same 1st,2nd and 3rd field with
1,2 and 5, because the extension is different ".pdf".
$ awk '{ # using awk
n=split($0,a,/\//) # split by / to get all path components
m=split(a[n],b,".") # split last by . to get the extension
}
m>1 && !seen[a[2],a[3],a[4],b[m]]++ { # if ext exists and is unique with 3 1st dirs
for(i=2;i<=n;i++) # loop component parts and print
printf "/%s%s",a[i],(i==n?ORS:"")
}' file
Output:
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
I split by / separately from .s in case there are .s in dir names.
Missed the KALI part:
$ awk '{
n=split($0,a,/\//)
m=split(a[n],b,".")
}
m>1&&!seen[a[2],a[3],a[4],b[m]]++ {
for(i=2;i<n;i++)
printf "/%s",a[i]
for(i=1;i<=m;i++)
printf "%s%s",(i==1?"/":(i==m?"KALI.":".")),b[i]
print ""
}' file
Output:
/abc/cde/go/ftg133333KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg24555KALI.pdf
Using awk:
$ awk -F/ '{ split($5, ext, "\\.")
if (!(($2,$3,$4,ext[2]) in files)) files[$2,$3,$4,ext[2]]=$0
}
END { for (f in files) {
sub("\\.", "KALI.", files[f])
print files[f]
}}' input.txt
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
/abc/cde/go/ftg133333KALI.jpg
another awk
$ awk -F'[./]' '!a[$2,$3,$4,$NF]++' file
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
assumes . doesn't exist in directory names (not necessarily true in general).
Let's say i have the following type of filename formats :
CO#ATH2000.dat , CO#MAR2000.dat
Each of these, have data like that following:
....
"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
Now i also have the following .sh file that can merge ALL those .dat files into one single output .dat file.
for filename in `ls CO#*`; do
cat $filename >> CO#combined.dat
done
Now here is the problem. I want inside CO#combined.dat, at each line, before the start of the values, to have a 'standard' value according to the filename-parameter. For example i want each file with ATH in its filename have 3, at the start of each line and with MAR in its filename have 22,.
So the CO#combined.dat should be something like this:
....
3,"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
3,"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
20,"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
20,"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
So in conclusion i want the script to do the above procedure!
Thanks in advance!
With awk you can take advantage of the built-in FILENAME variable along with the fact that you can supply multiple files to a given invocation. awk processes each file in turn, setting FILENAME to the name of the file whose records are currently being read.
With that you can set your prefix according to whatever pattern you wish to search for in the file name. Finally you can print the prefix and the original record.
Here's a demonstration on simplified versions of your sample input:
$ cat CO\#ATH2000.dat
1
2
3
$ cat CO\#MAR2000.dat
A
B
C
$ awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO*.dat
3,1
3,2
3,3
22,A
22,B
22,C
can be done simply
for f in CO#*; do
case ${f:3:3} in
ATH) k=3 ;;
*) k=22 ;;
esac;
sed "s/^/$k,/" $f >> all;
done
${f:3:3} extract the code ATH or MAR from the filename it's bash substring function; case converts the code to numerical counterpart; sed insert the numerical value and comma at the beginning of each line.
I try to parse html page using XPath with xidel.
The page have a table with multiple rows and columns
I need to get values from each row from columns 2 and 5 (IP and port) and store them in csv-like file.
Here is my script
#!/bin/bash
for (( i = 2; i <= 100; i++ ))
do
xidel http://www.vpngate.net/en/ -e '//*[#id="vg_hosts_table_id"]/tbody/tr["'$i'"]/td[2]/span[1]' >> "$i".txt #get value from first column
xidel http://www.vpngate.net/en/ -e '//*[#id="vg_hosts_table_id"]/tbody/tr["'$i'"]/td[5]' >> "$i".txt #get value from second column
sed -i ':a;N;$!ba;s/\n/^/g' "$i".txt #replace newline with custom delimiter
sed -i '/\s/d' "$i".txt #remove blanks
cat "$i".txt >> ip_port_list #create list
zip -m ips.zip "$i".txt #archive unneeded texts
done
The perfomance is not issue
When i manually increment each tr - looks perfect. But not with variable from loop.
I want to receive a pair of values from each row.
Now i got only partial data or even empty file
I need to get values from each row from columns 2 and 5 (IP and port) and store them in csv-like file.
xidel -s "https://www.vpngate.net/en/" -e '
(//table[#id="vg_hosts_table_id"])[3]//tr[not(td[#class="vg_table_header"])]/concat(
td[2]/span[#style="font-size: 10pt;"],
",",
extract(
td[5],
"TCP: (\d+)",
1
)
)
'
220.218.70.177,443
211.58.36.54,995
1.239.223.190,1351
[...]
153.207.18.229,1542
(//table[#id="vg_hosts_table_id"])[3]: Select the 3rd table of its
kind. The one you want.
//tr[not(td[#class="vg_table_header"])]: Select all rows, except the headers.
td[2]/span[#style="font-size: 10pt;"]: Select the 2nd column and the <span> that contains just the IP-address.
extract(td[5],"TCP: (\d+)",1): Select the 5th column and extract (regex) the numerical value after "TCP ".
Maybe this xidel line will come in handy:
xidel -q http://www.vpngate.net/en/ -e '//*[#id="vg_hosts_table_id"]/tbody/tr[*]/concat(td[2]/span[1],",",substring-after(substring-before(td[5],"UDP:"),"TCP: "))'
This will only do one fetch (so the admins of vpngate won't block you) and it'll also create a CSV output (ip,port)... Hopefully that is what you were looking for?