Reverse Geocoding in Bash using GPS Position from exiftool - bash

I am writing a bash script that renames JPG files based on their EXIF tags. My original files are named like this:
IMG_2110.JPG
IMG_2112.JPG
IMG_2113.JPG
IMG_2114.JPG
I need to rename them like this:
2015-06-07_11-21-38_iPhone6Plus_USA-CA-Los_Angeles_IMG_2110.JPG
2015-06-07_11-22-41_iPhone6Plus_USA-CA-Los_Angeles_IMG_2112.JPG
2015-06-13_19-05-10_iPhone6Plus_Morocco-Fez_IMG_2113.JPG
2015-06-13_19-12-55_iPhone6Plus_Morocco-Fez_IMG_2114.JPG
My bash script uses exiftool to parse the EXIF header and rename the files. For those files that do not contain an EXIF create date, I am using the file modification time.
#!/bin/bash
IFS=$'\n'
for i in *.*; do
MOD=$( stat -f %Sm -t %Y-%m-%d_%H-%M-%S "$i" )   # fallback: file modification time
model=$( exiftool -f -s3 -"Model" "${i}" )
datetime=$( exiftool -f -s3 -"DateTimeOriginal" "${i}" )
stamp=${datetime//:/-}"_"${model// /}
echo ${stamp// /_}$i
done
I am stuck on the location. I need to determine the country and city using the GPS information from the EXIF tag. exiftool provides a field called "GPS Position." Of all the fields, this seems the most useful to determine location.
GPS Position : 40 deg 44' 49.36" N, 73 deg 56' 28.18" W
Google provides a public API for geolocation, but it requires latitude/longitude coordinates in this format:
40.7470444°, -073.9411611°
The API returns quite a bit of information (click the link to see the results):
https://maps.googleapis.com/maps/api/geocode/json?latlng=40.7470444,-073.9411611
My questions are:
1. How do I format the GPS Position as a latitude/longitude value that will be acceptable input to a service such as Google's geolocation API?
2. How do I parse the JSON results to extract just the country and city, in a way that works consistently across many different kinds of locations? Curl, and then what? Ideally, I'd like to handle USA locations one way and non-USA locations another: USA locations would be formatted USA-STATE-City, whereas non-USA locations would be formatted COUNTRY-City.
I need to do this all in a bash script. I've looked at pygeocoder and gpsbabel but they do not seem to do the trick. There are a few free web tools available but they don't provide an API (http://www.earthpoint.us/Convert.aspx).

Better late than never, right?
So, I just came across the same issue and managed to do the conversion using ExifTool itself. Try this:
exiftool -n -p '$GPSLatitude,$GPSLongitude' image_name.jpg
The converted coordinates carry a few more decimal places than Google's example, but the API accepted them just fine.
Cheers.
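To cover the second part of the question from that same starting point, here is a minimal sketch of the rest of the pipeline. It assumes curl and jq are available, that the Geocoding API is reachable (Google now generally requires appending an &key=YOUR_KEY parameter), that the first result's address_components include country, administrative_area_level_1 and locality entries, and that mapping the country code "US" to "USA" is acceptable; IMG_2110.JPG is just a placeholder file name.

#!/bin/bash
# Sketch only: turn EXIF GPS into "USA-STATE-City" or "COUNTRY-City".
latlon=$( exiftool -n -p '$GPSLatitude,$GPSLongitude' IMG_2110.JPG )
json=$( curl -s "https://maps.googleapis.com/maps/api/geocode/json?latlng=${latlon}" )

# Pull the relevant address components out of the first (most specific) result.
country_code=$( echo "$json" | jq -r '.results[0].address_components[] | select(.types | index("country")) | .short_name' )
country_name=$( echo "$json" | jq -r '.results[0].address_components[] | select(.types | index("country")) | .long_name' )
state=$( echo "$json" | jq -r '.results[0].address_components[] | select(.types | index("administrative_area_level_1")) | .short_name' )
city=$( echo "$json" | jq -r '.results[0].address_components[] | select(.types | index("locality")) | .long_name' )

if [ "$country_code" = "US" ]; then
    echo "USA-${state}-${city// /_}"
else
    echo "${country_name// /_}-${city// /_}"
fi

Some locations have no locality component (you may need to fall back to administrative_area_level_2 or postal_town), so treat this as a starting point rather than a complete solution.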

For #1, the awk should not be that complicated:
awk '/GPS Position/{
lat=$4; lat+=strtonum($6)/60; lat+=strtonum($7)/3600; if($8!="N,")lat=-lat;
lon=$9; lon+=strtonum($11)/60; lon+=strtonum($12)/3600; if($13!="E")lon=-lon;
printf "%.7f %.7f\n",lat,lon
}'
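For example, piping exiftool's human-readable output straight into it (IMG_2110.JPG is just a placeholder file name; note that strtonum is a gawk extension, so use gawk if plain awk complains):

exiftool -GPSPosition IMG_2110.JPG | awk '/GPS Position/{
    lat=$4; lat+=strtonum($6)/60; lat+=strtonum($7)/3600; if($8!="N,")lat=-lat;
    lon=$9; lon+=strtonum($11)/60; lon+=strtonum($12)/3600; if($13!="E")lon=-lon;
    printf "%.7f %.7f\n",lat,lon
}'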

I ended up doing it in PHP, but thanks for the tip, Marco, I'll check it out!
function get_gps($gps_pos) {
    $parts = explode(" ", str_replace(array("deg ", ",", "'", "\""), "", $gps_pos));
    $lat_deg = $parts[0];
    $lat_min = $parts[1];
    $lat_sec = $parts[2];
    $lat_dir = $parts[3];
    $lon_deg = $parts[4];
    $lon_min = $parts[5];
    $lon_sec = $parts[6];
    $lon_dir = $parts[7];
    if ($lat_dir == "N") {
        $lat_sin = "+";
    } else {
        $lat_sin = "-";
    }
    if ($lon_dir == "E") {
        $lon_sin = "+";
    } else {
        $lon_sin = "-";
    }
    $latitude  = $lat_sin.($lat_deg + ($lat_min/60) + ($lat_sec/3600));
    $longitude = $lon_sin.($lon_deg + ($lon_min/60) + ($lon_sec/3600));
    return $latitude.",".$longitude;
}

From man exiftool (note the last line):
-c FMT (-coordFormat)
Set the print format for GPS coordinates. FMT uses the same syntax
as a "printf" format string. The specifiers correspond to degrees,
minutes and seconds in that order, but minutes and seconds are
optional. For example, the following table gives the output for
the same coordinate using various formats:
FMT Output
------------------- ------------------
"%d deg %d' %.2f"\" 54 deg 59' 22.80" (default for reading)
"%d %d %.8f" 54 59 22.80000000 (default for copying)
"%d deg %.4f min" 54 deg 59.3800 min
"%.6f degrees" 54.989667 degrees
And regarding "There are a few free web tools available but they don't provide an API": geoapify.com offers a free web tool and also an API. The API is free for up to 3,000 requests per day, and the web tool handles 500 at a time.
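And to get decimal degrees straight from exiftool with that -c option, something along these lines should work (IMG_2110.JPG is a placeholder name). Note that the hemisphere letters are still appended, unlike the signed values produced by -n, so S and W still have to be turned into negative signs:

exiftool -c "%.7f" -GPSPosition IMG_2110.JPG
# prints something like:  GPS Position : 40.7470444 N, 73.9411611 W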

Related

Google Ads API: How to Send Batch Requests?

I'm using Google Ads API v11 to upload conversions and adjust conversions.
I send hundreds of conversions each day and want to start sending batch requests instead.
I've followed Google's documentation and I upload/adjust conversions exactly the way it describes.
https://developers.google.com/google-ads/api/docs/conversions/upload-clicks
https://developers.google.com/google-ads/api/docs/conversions/upload-adjustments
I could not find any good explanation or example on how to send batch requests:
https://developers.google.com/google-ads/api/reference/rpc/v11/BatchJobService
Below is my code, an example of how I adjust hundreds of conversions.
An explanation of how to do so with batch requests would be very appreciated.
# Adjust the conversion value of an existing conversion, via Google Ads API
def adjust_offline_conversion(
        client,
        customer_id,
        conversion_action_id,
        gclid,
        conversion_date_time,
        adjustment_date_time,
        restatement_value,
        adjustment_type='RESTATEMENT'):
    # Check that gclid is valid string else exit the function
    if type(gclid) is not str:
        return None
    # Check if datetime or string, if string make as datetime
    if type(conversion_date_time) is str:
        conversion_date_time = datetime.strptime(conversion_date_time, '%Y-%m-%d %H:%M:%S')
    # Add 1 day forward to conversion time to avoid this error (as explained by Google: "The Offline Conversion cannot happen before the ad click. Add 1-2 days to your conversion time in your upload, or check that the time zone is properly set.")
    to_datetime_plus_one = conversion_date_time + timedelta(days=1)
    # If time is bigger than now, set as now (it will be enough to avoid the original google error, but to avoid a new error since google does not support future dates that are bigger than now)
    to_datetime_plus_one = to_datetime_plus_one if to_datetime_plus_one < datetime.utcnow() else datetime.utcnow()
    # We must convert datetime back to string + add time zone suffix (+00:00 or -00:00 this is utc) **in order to work with google ads api**
    adjusted_string_date = to_datetime_plus_one.strftime('%Y-%m-%d %H:%M:%S') + "+00:00"
    conversion_adjustment_type_enum = client.enums.ConversionAdjustmentTypeEnum
    # Determine the adjustment type.
    conversion_adjustment_type = conversion_adjustment_type_enum[adjustment_type].value
    # Associates conversion adjustments with the existing conversion action.
    # The GCLID should have been uploaded before with a conversion
    conversion_adjustment = client.get_type("ConversionAdjustment")
    conversion_action_service = client.get_service("ConversionActionService")
    conversion_adjustment.conversion_action = (
        conversion_action_service.conversion_action_path(
            customer_id, conversion_action_id
        )
    )
    conversion_adjustment.adjustment_type = conversion_adjustment_type
    conversion_adjustment.adjustment_date_time = adjustment_date_time.strftime('%Y-%m-%d %H:%M:%S') + "+00:00"
    # Set the Gclid Date
    conversion_adjustment.gclid_date_time_pair.gclid = gclid
    conversion_adjustment.gclid_date_time_pair.conversion_date_time = adjusted_string_date
    # Sets adjusted value for adjustment type RESTATEMENT.
    if conversion_adjustment_type == conversion_adjustment_type_enum.RESTATEMENT.value:
        conversion_adjustment.restatement_value.adjusted_value = float(restatement_value)
    conversion_adjustment_upload_service = client.get_service("ConversionAdjustmentUploadService")
    request = client.get_type("UploadConversionAdjustmentsRequest")
    request.customer_id = customer_id
    request.conversion_adjustments = [conversion_adjustment]
    request.partial_failure = True
    response = (
        conversion_adjustment_upload_service.upload_conversion_adjustments(
            request=request,
        )
    )
    conversion_adjustment_result = response.results[0]
    print(
        f"Uploaded conversion that occurred at "
        f'"{conversion_adjustment_result.adjustment_date_time}" '
        f"from Gclid "
        f'"{conversion_adjustment_result.gclid_date_time_pair.gclid}"'
        f' to "{conversion_adjustment_result.conversion_action}"'
    )

# Iterate every row (subscriber) and call the "adjust conversion" function for it
df.apply(lambda row: adjust_offline_conversion(client=client
                                               , customer_id=customer_id
                                               , conversion_action_id='xxxxxxx'
                                               , gclid=row['click_id']
                                               , conversion_date_time=row['subscription_time']
                                               , adjustment_date_time=datetime.utcnow()
                                               , restatement_value=row['revenue'])
         , axis=1)
I managed to solve it in the following way:
Conversion uploads and adjustments are not supported in batch processing, as they are not listed here.
However, it is possible to upload multiple conversions in one request, since the conversions[] field (a list) can be populated with several conversions, not only a single conversion as I mistakenly thought.
So if you're uploading conversions or adjusting conversions, you can simply batch them this way:
Instead of uploading one conversion:
request.conversions = [conversion]
Upload several:
request.conversions = [conversion_1, conversion_2, conversion_3...]
The same goes for the conversion adjustment upload:
request.conversion_adjustments = [conversion_adjustment_1, conversion_adjustment_2, conversion_adjustment_3...]

Print out a terminfo entry including capname descriptions?

What's the most straightforward way to print out a terminfo entry (e.g., for my current terminal: xterm-256color) that includes the short descriptions of each capname from the terminfo man page?
I know how to print out the terminfo entry for my terminal (with one capname per line) with:
infocmp -1
Generates:
# Reconstructed via infocmp from file: /usr/share/terminfo/78/xterm-256color
xterm-256color|xterm with 256 colors,
am,
bce,
ccc,
km,
mc5i
Etc.
And I can manually look up the descriptions of each capname in the terminfo man page (e.g., ccc represents "terminal can redefine existing colors"), but is there a way to display the descriptions for each capname without having to look each one up manually?
So, for example, I'd like to see something like this:
xterm-256color|xterm with 256 colors
am terminal has automatic margins
bce screen erased with background color
ccc terminal can redefine existing colors
km Has a meta key (i.e., sets 8th bit)
mc5i printer will not echo on screen
Etc.
The output from infocmp is consistently delimited and relatively easy to parse, but the tables listing the terminal capabilities on the terminfo man page, with varying column widths and capname descriptions that span multiple lines, are not. If they were, generating the output I describe would be more straightforward. Perhaps there's an alternative source for the content from the terminfo man page that's programmatically easier to manipulate?
I'm running GNU bash, version 4.4.23(1)-release (x86_64-apple-darwin18.0.0).
Probably not. Actually, the manual page and other files are constructed using scripts from a data file, but that is not installed.
Since it is generated, you could write a script to extract the information, though you'd find it challenging to do this as a bash script (perl yes, awk yes, sed...maybe). Here is a small chunk of the text (which is installed on your system):
.TS H
center expand;
c l l c
c l l c
lw25 lw6 lw2 lw20.
\fBVariable Cap- TCap Description\fR
\fBBooleans name Code\fR
auto_left_margin bw bw T{
cub1 wraps from column 0 to last column
T}
auto_right_margin am am T{
terminal has automatic margins
T}
back_color_erase bce ut T{
screen erased with background color
T}
can_change ccc cc
You can always list the long names using infocmp, and if the order were the same as for the (default) short names, you could combine those. But the listing for long-names is sorted alphabetically (in groups for boolean, numbers and strings, like the short names), while the short names are ordered by default to match the SVr4 terminfo data. You might see something like this:
xterm-256color|xterm with 256 colors
am auto_right_margin
bce back_color_erase
ccc backspaces_with_bs
km can_change
mc5i eat_newline_glitch
mir has_meta_key
msgr move_insert_mode
npc move_standout_mode
xenl no_pad_char
colors prtr_silent
cols columns
it init_tabs
lines lines
pairs max_colors
acsc max_pairs
bel acs_chars
blink back_tab
bold bell
Actually ncurses has an option allowing the names to be sorted, so that you could (almost) match the order of the right-column using the -sl option. You might see something like this:
xterm-256color|xterm with 256 colors
am auto_right_margin
bce back_color_erase
ccc backspaces_with_bs
xenl can_change
km eat_newline_glitch
mir has_meta_key
msgr move_insert_mode
npc move_standout_mode
mc5i no_pad_char
cols prtr_silent
it columns
lines init_tabs
colors lines
pairs max_colors
acsc max_pairs
cbt acs_chars
bel back_tab
cr bell
That's "almost", because the columns do not line up (xenl should pair with eat_newline_glitch, for instance): ncurses has an internal name for backspaces_with_bs which normally is not shown. With a change to the ncurses source to show that:
xterm-256color|xterm with 256 colors
am auto_right_margin
bce back_color_erase
OTbs backspaces_with_bs
ccc can_change
xenl eat_newline_glitch
Here's the perl script which I used to generate the examples:
#!/usr/bin/env perl
# $Id: infocmp2col,v 1.1 2018/12/20 22:35:57 tom Exp $
use strict;
use warnings;

sub infocmp($$) {
    my $term = shift;
    my $opts = shift;
    my @data;
    if ( open FP, "infocmp -1 $opts $term |" ) {
        @data = <FP>;
        close FP;
        for my $n ( 0 .. $#data ) {
            chomp $data[$n];
            $data[$n] =~ s/,\s*$//;
            $data[$n] =~ s/[#=].*//;
        }
    }
    return \@data;
}

sub doit($) {
    my $term       = shift;
    my @short_term = @{ &infocmp( $term, "-sl" ) };
    my @long_term  = @{ &infocmp( $term, "-L" ) };
    for my $n ( 0 .. $#short_term ) {
        if ( $short_term[$n] =~ /^\s/ ) {
            printf "%s%s\n", $short_term[$n], $long_term[$n];
        }
        else {
            printf "%s\n", $short_term[$n];
        }
    }
}

if ( $#ARGV >= 0 ) {
    while ( $#ARGV >= 0 ) {
        &doit( pop @ARGV );
    }
}
else {
    &doit( $ENV{TERM} );
}

1;
The minor fix that I mentioned is in ncurses 6.2 (see changes), so this "should work" for most users.
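If you'd rather not keep a perl script around, a rough bash approximation of the same pairing is to strip the capability values and paste the two listings side by side. This is only a sketch (it assumes bash for the process substitution and an ncurses infocmp), and the same caveat applies: the hidden OTbs entry shifts the long-name column by one row from ccc onward.

paste <(infocmp -1 -sl "$TERM" | sed -e 's/[,=#].*$//') \
      <(infocmp -1 -L  "$TERM" | sed -e 's/[,=#].*$//')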

Reading freely available UVR data using gfortran on mac OSX

I would like to use fortran to read ultraviolet radiation data that has been produced by the Japan Aerospace Exploration Agency. This data is at a daily and monthly temporal resolution from 2000-2010 at a ~5 km spatial resolution. This question is worth answering as the data could be useful for a number of environment/health projects and is freely available, with proper acknowledgement of source and sharing of preprint of any subsequent publications, from:
ftp://suzaku.eorc.jaxa.jp/pub/GLI/glical/Global_05km/monthly/uvb/
There is a readme file available, which provides instructions on how to read data using fortran as follows:
Instructions for _le files
Header
Read header (size= pixel size *2byte):
character head*14400
read(10,rec=1) head
read(head,'(2i6,2f8.2,f8.4,2e12.5,a1,a8,a1,a40)')
& npixel,nline,lon_min,lat_max,reso,slope,offset,',',
& para,',',outfile
Read data (e.g., fortran77)
parameter(nl=7200, ml=3601)
... open file by "unformatted", "recl=nl*2(byte)" (,"bytereclen")
integer*2 i2buf(nl,ml)
do m=1,ml
read(10,rec=1+m) (i2buf(n,m), n=1,nl)
do n=1,nl
par=i2buf(n,m)*slope+offset
write(6,*) 'PAR[Ein/m^2/day]=',par
enddo
enddo
slope values
par__le : daily PAR [Ein/m^2/day] = DN * 0.01
dpar_le : direct PAR = DN * 0.01
swr__le : daily mean shortwave radiation [W/m^2] = DN * 0.01
tip__le : transmittance of instantaneous PAR at noon = DN * 0.0001
uva__le : daily mean UVA [W/m^2] = DN * 0.001
uvb__le : daily mean UVB [W/m^2] = DN * 0.0001
rpar_le : PAR-range surface reflectance (TOP of canopy/solid surfaces) = DN * 0.0001 (monthly data only)
error values
-1 as signed short integer (int16)
65535 as unsigned short integer (uint16)
Progress so far
I have downloaded and installed gfortran successfully on mac OSX. I have downloaded a test file (MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le.gz) and decompressed it. I have created a program file:
PROGRAM readuvr
IMPLICIT NONE
!some code
END PROGRAM
I will then type the following into the command line to create an executable and run it to extract the data.
gfortran -o executable
./executable
As a complete beginner to fortran, my question is: how can I use the instructions provided to build a program that can read the data and output it into a text file?
Well, that file expands to 51,868,800 bytes. The comments imply the header is one 14,400-byte record (pixel size * 2 bytes), which leaves 51,854,400 bytes of actual data payload.
Per the readme there are 3601 lines of data (ml=3601), which works out to 14,400 bytes per line; at 2 bytes per 16-bit sample, that is 7200 samples per line, matching nl=7200. (Check: 14,400 + 3601 * 14,400 = 51,868,800.)
So basically, you need to read 14,400 bytes of header, then 3601 lines of data, each line consisting of 7200 values, each of those being 2 bytes wide...
Actually, if you are that unfamiliar with FORTRAN, you may like to extract the data with Perl, which is already installed and available on OS X anyway. I have started a VERY SIMPLISTIC Perl program that reads the data and prints the first 2 values on each line:
#!/usr/bin/perl
use strict;
use warnings;
# Read 14,400 bytes of header
my $buffer;
my $nBytes = 14400;
my $bytesRead = read (STDIN, $buffer, $nBytes) ;
my ($npixel,$nline,$lon_min,$lat_max,$reso,$slope,$offset,$junk)=split(' ',$buffer);
print "npixel:$npixel\n";
print "nline:$nline\n";
print "lon_min:$lon_min\n";
print "lat_max:$lat_max\n";
print "reso:$reso\n";
print "slope:$slope\n";
$offset =~ s/,.*//; # strip trailing comma and junk
print "offset:$offset\n";
# Read actual lines of data
my $line;
for(my $m=1;$m<=$nline;$m++){
    read(STDIN,$line,$npixel*2);
    my $x=$npixel*2;
    my @values=unpack("S$x",$line);
    printf "Line: %d",$m;
    for(my $j=0;$j<2;$j++){
        printf ",%f",$values[$j]*$slope+$offset;
    }
    printf "\n";   # newline
}
Save it as go.pl and then in the Terminal, type the following once to make it executable
chmod +x go.pl
and then run it like this
./go.pl < MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le
Sample output extract:
npixel:7200
nline:3601
lon_min:0.00
lat_max:90.00
reso:0.0500
slope:0.10000E-03
offset:0.00000E+00
...
...
Line: 3306,0.099800,0.099800
Line: 3307,0.099900,0.099900
Line: 3308,0.099400,0.074200
Line: 3309,0.098900,0.098900
Line: 3310,0.098400,0.098400
Line: 3311,0.074300,0.074200
Line: 3312,0.071300,0.071200
A Fortran (f2003 or so) solution. (The linked instructions are awful, by the way.)
implicit none
character*80 para,outfile
character(len=:),allocatable::header,infile
integer npixel,nline,blen,i
c note kind=2 is not standard. This needs to be a 2-byte integer.
integer(kind=2),allocatable :: data(:,:)
real lon_min,lat_max,reso,slope,off
c header is plain text, so first open formatted and
c directly read header data
infile='MOD02SSH_A20000224Av6_v601_7200_3601_uvb__le'
open(10,file=infile)
read(10,*)npixel,nline,lon_min,lat_max,reso,slope,off,
$ para,outfile
close(10)
write(*,*)npixel,nline,lon_min,lat_max,reso,slope,off,
$ trim(para),' ',trim(outfile)
blen=2*npixel
allocate(character(len=blen)::header)
allocate(data(npixel,nline))
if( sizeof(data(1,1)).ne.2 )then
write(*,*)'error kind=2 did not give a 2 byte integer'
stop
endif
c now close and reopen for binary read.
c direct access approach:
open(20,file=infile,access='direct',recl=blen/4)
c note the granularity of the recl= specifier is not standard.
c ifort uses 4 bytes. (note this will break if npixel is not even )
read(20,rec=1)header
write(*,*)trim(header)
do i=1,nline
read(20,rec=i+1)data(:,i)
enddo
c note streams if available is simpler: (we don't need to know rec len )
c open(20,file=infile,access='stream')
c read(20)header,data
end
This is not actually validated because I don't have known file content to compare against.

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later read into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with - I have experience with high level languages but not C++ or dealing with binary files. I have found a few things through searching stackoverflow but no luck yet in reading this file. Ideally this would be done through bash or python.
thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the file itself is a simple byte array. Here is an example of how to read it in Python:
import struct
import sys

inputfile = sys.argv[1]
# read the whole file as raw bytes
with open(inputfile, 'rb') as f:
    data = f.read()

l = len(data)
print("%d bytes" % l)

s = struct.Struct("f")   # one 4-byte float, native byte order
n = l // 4
for i in range(n):
    x = s.unpack_from(data, i * 4)[0]
    print("%d %f" % (i, x))
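Saved as, say, read_vout.py (the name is arbitrary), it can be run from the shell like this:

python3 read_vout.py graph-name.4B.vout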
I was having the same trouble. Luckily I work with a bunch of network engineers who helped me out! On Mac or Linux, the following command prints the 4B.vout data one line per node, with the integer values the same as given in the summary file. If your file is called e.g. filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$,="\n"; print unpack("L*",<>),"";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly: the 1st line is the value for the node labeled 0, the 2nd line is for the node labeled 1, etc. But I am copy-pasting here, so I'm not sure how it would need to change for floats. It works great for the integer values per node.
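If you prefer to stay in the shell for the float-valued files too, GNU od can decode the 4-byte floats directly. A sketch, assuming the graph-name.4B.vout naming from the question and GNU coreutils (od, nl):

od -An -v -t f4 graph-name.4B.vout | tr -s ' ' '\n' | sed '/^$/d' | nl -v 0

This prints one "index value" pair per line, numbered from 0, in the same spirit as the Python loop above.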

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it onto the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time thing, it should be sufficient.
import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            print current_dt
            print current_dd
            current_dt = ""
            current_dd = ""

print current_dt
print current_dd
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descriptor (first match of ">")
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland
