Linux shell script, search by an input string - bash

There is a large file in which I have to conduct several searches using bash shell scripting.
The file looks like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
First love and other stories, by Iván Turgénieff                         56878
Now I have to search it by etext no., author name, and title.
For example, if I search by an etext no. like 56900:
It should return
Aspects of plant life; with special reference to the British flora, 56900
I am new to shell scripting, and so far I can only read the file, with this:
#!/bin/sh
read -p 'string to search ' searchstring
grep --color "$searchstring" GUTINDEX.ALL | #condition
I don't know what kind of condition I should use to search by author name or etext no.

As others have already pointed out, grep alone is not how you would really approach this. A rather substantial improvement could be accomplished by using Awk instead of grep, but for a real production system, you would parse the fields out into a relational database and use SQL to search instead. With database indexing, searching will then be much quicker than sequentially scanning the entire index file for each search.
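A minimal sketch of that database approach, assuming sqlite3 is available and that the title/author/etext fields have already been parsed into a hypothetical tab-separated file books.tsv:

# one-time load: create the table, index it, and import the parsed fields
sqlite3 gutindex.db <<'EOF'
CREATE TABLE IF NOT EXISTS books (title TEXT, author TEXT, etext INTEGER);
CREATE INDEX IF NOT EXISTS idx_etext ON books (etext);
.mode tabs
.import books.tsv books
EOF

# each search is then an indexed lookup instead of a full scan
sqlite3 gutindex.db "SELECT title, etext FROM books WHERE etext = 56900;"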
But if you are confined to just grep, here is a quick and dirty attempt.
author () { grep -E "(by|par|di|mennessä) $1" GUTINDEX.ALL; }
index () { grep " $1\$" GUTINDEX.ALL; }
title () { grep "^$1" GUTINDEX.ALL; }
This declares three shell functions which search different parts of the file, by way of supplying an anchor expression (^ matches beginning of line, $ matches end of line) or a suitable context.
Having the search expression as a command-line argument instead of requiring interactive input is generally a huge usability improvement. Now, you can use your shell's history mechanism to recall and possibly edit earlier searches, and build new scripts on top of these simple building blocks.
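For example, after sourcing these functions (search strings taken from the sample data above):

index 56900
author 'Jack London'
title 'The Blue Star'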
(By the way, "mennessä" is not at all a correct Finnish localization here. I have reported a bug to Project Gutenberg.)

You could start with something like this, but as @tom-fenech points out, it's rather unreliable in the absence of structured input.
For instance, the author names are not consistently prefixed; they sometimes appear under the "Subtitle" tag, and rarely under the "Author" tag.
#!/bin/bash
CATALOG=/tmp/s    # path to the GUTINDEX.ALL data

function usage()
{
    echo "Usage:"
    echo "$0 [etext <key>] [author <id>]"
    exit 1
}

function process_etext()
{
    # print everything before the etext number on matching lines
    local searchKey=$1
    egrep "${searchKey}" "${CATALOG}" | awk -F"${searchKey}" '{print $1}'
}

function process_author()
{
    # grab one line of leading context, where the five-digit etext number may sit
    local searchKey=$1
    egrep -B1 "${searchKey}" "${CATALOG}" | egrep "[[:digit:]]{5}"
}

for key in "$@"
do
    key="$1"
    case $key in
        etext|author)
            process_${key} "$2"
            shift; shift
            ;;
        *)
            [ -z "${key}" ] || usage
            ;;
    esac
done
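Hypothetical usage, assuming the script is saved as search.sh and CATALOG points at the real GUTINDEX.ALL:

./search.sh etext 56900
./search.sh author 'Jack London'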

Related

Pythonic way to generate an IPv4 report

I am looking for the most Pythonic way to print a very generic report based on input data, sorted by the number of instances per IP address.
The incoming data looks like this
{'file3.txt': ['notice of IP abuse', '67.60.2.222', 'sender Bill Foobar'], 'file2.txt': ['DMCA notice 987654321', '101.143.22.22', 'Copyright Owner'], 'file1.txt': ['6.22.123.222', 'spam notification', 'Esmeralda Hopkins III'], 'file0.txt': ['DMCA notice 123456789', '67.60.2.222', 'Ian Flemming']}
I want to generate output that looks like this:
(2) 67.60.2.222
Bill Foobar | notice of IP abuse
Ian Flemming | DMCA notice 123456789
(1) 101.143.22.22
Copyright Owner | DMCA notice 987654321
(1) 6.22.123.222
Esmeralda Hopkins III | spam notification
Unfortunately I am stuck with Python 2 on Windows for this task, and I am unable to install any modules like pandas because I don't have pip.
I have tried all of the following:
print dict(sorted(matrix.items(), key=lambda item: item[1]))
print matrix
print dict(sorted(matrix.items(), key=lambda item: ip))
print dict(sorted(matrix.items))
print sorted(matrix)
sorted_x = sorted(matrix.items(), key=operator.itemgetter(0))
print sorted_x
the result is always this or something very similar:
{'file3.txt': ['notice of IP abuse', '67.60.2.222', 'sender Bill Foobar'], 'file2.txt': ['DMCA notice 987654321', '101.143.22.22', 'Copyright Owner'], 'file1.txt': ['6.22.123.222', 'spam notification', 'Esmeralda Hopkins III'], 'file0.txt': ['DMCA notice 123456789', '67.60.2.222', 'Ian Flemming']}
edit: I am able to sort the lists using this:
sorted_x = sorted(matrix.items(), key=operator.itemgetter(1))
print sorted_x
Now I am looking for a way to print in a report format, without using any specialized modules, since I don't have pip. I am thinking the naïve method is probably the best way, so my question is: what is the most Pythonic way to pretty-print it?
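A minimal Python 2 sketch using only the standard library, assuming (as in the sample data) that the IP address is the only dotted-quad element in each list and that the other two elements are always the notice followed by the sender:

import re
from collections import defaultdict

# sample input from the question
matrix = {'file3.txt': ['notice of IP abuse', '67.60.2.222', 'sender Bill Foobar'],
          'file2.txt': ['DMCA notice 987654321', '101.143.22.22', 'Copyright Owner'],
          'file1.txt': ['6.22.123.222', 'spam notification', 'Esmeralda Hopkins III'],
          'file0.txt': ['DMCA notice 123456789', '67.60.2.222', 'Ian Flemming']}

IP_RE = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}$')

# group (sender, notice) pairs under their IP address
by_ip = defaultdict(list)
for values in matrix.values():
    ip = next(v for v in values if IP_RE.match(v))
    notice, sender = [v for v in values if not IP_RE.match(v)]
    if sender.startswith('sender '):        # some entries carry this prefix
        sender = sender[len('sender '):]
    by_ip[ip].append((sender, notice))

# most frequently seen IPs first; ties broken by IP string
for ip, entries in sorted(by_ip.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print '(%d) %s' % (len(entries), ip)
    for sender, notice in sorted(entries):
        print '\t%s | %s' % (sender, notice)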

Need to filter output in a range between identifiers, i.e. grep between pattern "1:" and pattern "2:" and show only the text in between

I need to filter the output of a dictionary lookup. I've managed to get the output down to the following example, and I want to output only the text between the identifying patterns 1: and 2:.
I know of a solution with Python, but I would like to accomplish this in the shell. I think it is possible with sed, but I am not sure how.
n 1: feline mammal usually having thick soft fur and no ability
to roar: domestic cats; wildcats [syn: {cat}, {true cat}]
2: an informal term for a youth or man; "a nice guy"; "the guy's
only doing it for some doll" [syn: {guy}, {cat}, {hombre},
{bozo}]
3: a spiteful woman gossip; "what a cat she is!"
4: the leaves of the shrub Catha edulis which are chewed like
tobacco or used to make tea; has the effect of a euphoric
stimulant; "in Yemen kat is used daily by 85% of adults"
[syn: {kat}, {khat}, {qat}, {quat}, {cat}, {Arabian tea},
{African tea}]
5: a whip with nine knotted cords; "British sailors feared the
cat" [syn: {cat-o'-nine-tails}, {cat}]
6: a large tracked vehicle that is propelled by two endless
metal belts; frequently used for moving earth in construction
and farm work [syn: {Caterpillar}, {cat}]
v 1: beat with a cat-o'-nine-tails
2: eject the contents of the stomach through the mouth; "After
drinking too much, the students vomited"; "He purged
continuously"; "The patient regurgitated the food we gave him
last night" [syn: {vomit}, {vomit up}, {purge}, {cast},
{sick}, {cat}, {be sick}, {disgorge}, {regorge}, {retch},
{puke}, {barf}, {spew}, {spue}, {chuck}, {upchuck}, {honk},
{regurgitate}, {throw up}] [ant: {keep down}]
There is probably a better solution, since I needed an extra step after sed, using grep to remove unwanted lines. I still feel like I can accomplish this with sed alone, but here is my solution thus far:
dict -h dict.tu-chemnitz.de cat | sed -n '/1:/,/2:/p' | grep -v 2:
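For the record, a sed-only variant appears possible: within the 1:-to-2: range, print every line except the closing 2: line itself (the trailing semicolon before } keeps it portable to BSD sed):

dict -h dict.tu-chemnitz.de cat | sed -n '/1:/,/2:/{/2:/!p;}'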

How to get time and date or specific product name using NLTK?

doc = '''Andrew Yan-Tak Ng is a Chinese American computer scientist. He is the former chief scientist at Baidu, where he led the company's
Artificial Intelligence Group. He is an adjunct professor (formerly associate professor) at Stanford University. Ng is also the co-founder
and chairman at Coursera, an online education platform. Andrew was born in the UK on 27th Sep 2.30pm 1976. His parents were both from Hong Kong.'''
import nltk

# tokenize the document
tokenized_doc = nltk.word_tokenize(doc)
# POS-tag the tokens and run nltk's named entity chunker
tagged_sentences = nltk.pos_tag(tokenized_doc)
ne_chunked_sents = nltk.ne_chunk(tagged_sentences)
When you process it and extract the chunks, I see we only get:
[('Andrew', 'PERSON'), ('Chinese', 'GPE'), ('American', 'GPE'), ('Baidu', 'ORGANIZATION'), ("company's Artificial Intelligence Group", 'ORGANIZATION'), ('Stanford University', 'ORGANIZATION'), ('Coursera', 'ORGANIZATION'), ('Andrew', 'PERSON'), ('UK', 'ORGANIZATION'), ('Hong Kong', 'GPE')]
I need to get the time and date too. Please suggest.
Thank you.
You need a more sophisticated tagger, like Stanford's Named Entity Tagger. Once you have it installed and configured, you can run it:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
stanfordClassifier = '/path/to/classifier/classifiers/english.muc.7class.distsim.crf.ser.gz'
stanfordNerPath = '/path/to/jar/stanford-ner/stanford-ner.jar'
st = StanfordNERTagger(stanfordClassifier, stanfordNerPath, encoding='utf8')
doc = '''Andrew Yan-Tak Ng is a Chinese American computer scientist. He is the former chief scientist at Baidu, where he led the company's Artificial Intelligence Group. He is an adjunct professor (formerly associate professor) at Stanford University. Ng is also the co-founder and chairman at Coursera, an online education platform. Andrew was born in the UK on 27th Sep 2.30pm 1976. His parents were both from Hong Kong.'''
result = st.tag(word_tokenize(doc))
date_word_tags = [wt for wt in result if wt[1] == 'DATE' or wt[1] == 'ORGANIZATION']
print date_word_tags
Where the output would be:
[(u'Artificial', u'ORGANIZATION'), (u'Intelligence', u'ORGANIZATION'), (u'Group', u'ORGANIZATION'), (u'Stanford', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'Coursera', u'ORGANIZATION'), (u'27th', u'DATE'), (u'Sep', u'DATE'), (u'2.30pm', u'DATE'), (u'1976', u'DATE')]
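To merge the word-level tags into contiguous entity spans, here is a minimal sketch with itertools.groupby (assuming result is the (word, tag) list returned by st.tag above):

from itertools import groupby

# join consecutive tokens that share a tag; drop the 'O' (outside) tokens
spans = [(tag, ' '.join(word for word, _ in group))
         for tag, group in groupby(result, key=lambda wt: wt[1])
         if tag != 'O']
print spans  # e.g. includes ('DATE', u'27th Sep 2.30pm 1976')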
You will probably run into some issues when trying to install and set up everything, but I think it's worth the hassle.
Let me know if it helps.

Reverse Geocoding in Bash using GPS Position from exiftool

I am writing a bash script that renames JPG files based on their EXIF tags. My original files are named like this:
IMG_2110.JPG
IMG_2112.JPG
IMG_2113.JPG
IMG_2114.JPG
I need to rename them like this:
2015-06-07_11-21-38_iPhone6Plus_USA-CA-Los_Angeles_IMG_2110.JPG
2015-06-07_11-22-41_iPhone6Plus_USA-CA-Los_Angeles_IMG_2112.JPG
2015-06-13_19-05-10_iPhone6Plus_Morocco-Fez_IMG_2113.JPG
2015-06-13_19-12-55_iPhone6Plus_Morocco-Fez_IMG_2114.JPG
My bash script uses exiftool to parse the EXIF header and rename the files. For those files that do not contain an EXIF create date, I am using the file modification time.
#!/bin/bash
IFS=$'\n'
for i in *.*; do
    # fallback timestamp from file modification time, for files with no EXIF date
    MOD=$(stat -f %Sm -t %Y-%m-%d_%H-%M-%S "$i")
    model=$( exiftool -f -s3 -"Model" "${i}" )
    datetime=$( exiftool -f -s3 -"DateTimeOriginal" "${i}" )
    stamp=${datetime//:/-}"_"${model// /}
    echo "${stamp// /_}$i"
done
I am stuck on the location. I need to determine the country and city using the GPS information from the EXIF tag. exiftool provides a field called "GPS Position." Of all the fields, this seems the most useful to determine location.
GPS Position : 40 deg 44' 49.36" N, 73 deg 56' 28.18" W
Google provides a public API for geolocation, but it requires latitude/longitude coordinates in this format:
40.7470444°, -073.9411611°
The API returns quite a bit of information (click the link to see the results):
https://maps.googleapis.com/maps/api/geocode/json?latlng=40.7470444,-073.9411611
My questions are:
1. How do I format the GPS Position into a latitude/longitude value that a service such as Google's geolocation API will accept?
2. How do I parse the JSON results to extract just the country and city, in a way that works for many different kinds of locations? curl, and then what? Ideally, I'd like to handle USA locations one way and non-USA locations another: USA locations would be formatted USA-STATE-City, whereas non-USA locations would be formatted COUNTRY-City.
I need to do this all in a bash script. I've looked at pygeocoder and gpsbabel but they do not seem to do the trick. There are a few free web tools available but they don't provide an API (http://www.earthpoint.us/Convert.aspx).
Better late than never, right?
So, I just came across the same issue, and I've managed to do the conversion using ExifTool itself. Try this:
exiftool -n -p '$GPSLatitude,$GPSLongitude' image_name.jpg
The converted coordinates (signed decimal degrees, e.g. 40.7470444,-73.9411611 for the position above) are slightly longer than the format proposed by Google, but the API accepted them fine.
Cheers.
For #1, the awk should not be that complicated:
awk '/GPS Position/{
    # degrees + minutes/60 + seconds/3600; negate for S latitude / W longitude
    # (strtonum is a gawk extension; it ignores the trailing quote marks)
    lat=$4; lat+=strtonum($6)/60; lat+=strtonum($7)/3600; if($8!="N,")lat=-lat;
    lon=$9; lon+=strtonum($11)/60; lon+=strtonum($12)/3600; if($13!="E")lon=-lon;
    printf "%.7f %.7f\n",lat,lon
}'
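For #2, a minimal sketch using curl and jq (an assumption: jq is installed; in Google's reverse-geocoding JSON, the country, state, and city live in address_components entries tagged "country", "administrative_area_level_1", and "locality"):

latlng="40.7470444,-073.9411611"
curl -s "https://maps.googleapis.com/maps/api/geocode/json?latlng=$latlng" | jq -r '
  .results[0].address_components
  | {country: (map(select(.types[] == "country"))[0].short_name),
     state:   (map(select(.types[] == "administrative_area_level_1"))[0].short_name),
     city:    (map(select(.types[] == "locality"))[0].long_name)}
  | if .country == "US" then "USA-\(.state)-\(.city)" else "\(.country)-\(.city)" end
  | gsub(" "; "_")'

Note that Google now requires an API key (&key=...) on this endpoint, and the final gsub underscore-joins multi-word city names like Los Angeles.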
I ended up doing it in PHP, but thanks for the tip, Marco, I'll check it out!
function get_gps($gps_pos) {
    // split e.g. "40 deg 44' 49.36\" N, 73 deg 56' 28.18\" W" into 8 bare tokens
    $parts = explode(" ", str_replace(array("deg ", ",", "'", "\""), "", $gps_pos));
    $lat_deg = $parts[0];
    $lat_min = $parts[1];
    $lat_sec = $parts[2];
    $lat_dir = $parts[3];
    $lon_deg = $parts[4];
    $lon_min = $parts[5];
    $lon_sec = $parts[6];
    $lon_dir = $parts[7];
    if ($lat_dir == "N") {
        $lat_sign = "+";
    } else {
        $lat_sign = "-";
    }
    if ($lon_dir == "E") {
        $lon_sign = "+";
    } else {
        $lon_sign = "-";
    }
    $latitude = $lat_sign.($lat_deg + ($lat_min/60) + ($lat_sec/3600));
    $longitude = $lon_sign.($lon_deg + ($lon_min/60) + ($lon_sec/3600));
    return $latitude.",".$longitude;
}
From man exiftool (note the last line):
-c FMT (-coordFormat)
Set the print format for GPS coordinates. FMT uses the same syntax
as a "printf" format string. The specifiers correspond to degrees,
minutes and seconds in that order, but minutes and seconds are
optional. For example, the following table gives the output for
the same coordinate using various formats:
FMT Output
------------------- ------------------
"%d deg %d' %.2f"\" 54 deg 59' 22.80" (default for reading)
"%d %d %.8f" 54 59 22.80000000 (default for copying)
"%d deg %.4f min" 54 deg 59.3800 min
"%.6f degrees" 54.989667 degrees
And regarding "There are a few free web tools available but they don't provide an API": geoapify.com offers a free web tool and also an API. The API is free for up to 3,000 requests per day; the web tool handles 500 locations at a time.

Extracting plain text output from binary file

I am working with Graphchi's pagerank example: https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps#pagerank-easy
The example app writes a binary file with vertex information that I would like to read/convert to a plain text file (to later read into R or some other language).
The documentation states that:
"GraphChi will write the values of the edges in a binary file, which is easy to handle in other programs. Name of the file containing vertex values is GRAPH-NAME.4B.vout. Here "4B" refers to the vertex-value being a 4-byte type (float)."
The 'easy to handle' part is what I'm struggling with: I have experience with high-level languages, but not with C++ or with binary files. I have found a few things by searching Stack Overflow, but no luck yet in reading this file. Ideally this would be done through bash or Python.
Thanks very much for your help on this.
Update: hexdump graph-name.4B.vout | head -5 gives:
0000000 999a 3e19 7468 3e7f 7d2a 3e93 d8e0 3ec4
0000010 cec6 3fe4 d551 3f08 eff2 3e54 999a 3e19
0000020 999a 3e19 3690 3e8c 0080 3f38 9ea3 3ef5
0000030 b7d6 3f66 999a 3e19 10e3 3ee1 400c 400d
0000040 a3df 3e7c 999a 3e19 979c 3e91 5230 3f18
Here is example code showing how you can use GraphChi to write the output out as a string:
https://github.com/GraphChi/graphchi-cpp/wiki/Vertex-Aggregators
But the file is a simple byte array. Here is an example of how to read it in Python:
import struct
import sys
from array import array as binarray

inputfile = sys.argv[1]
data = open(inputfile, 'rb').read()   # read the whole file as raw bytes

a = binarray('c')
a.fromstring(data)

s = struct.Struct("f")   # one 4-byte float (native byte order) per vertex
l = len(a)
print "%d bytes" % l
n = l / 4
for i in xrange(0, n):
    x = s.unpack_from(a, i * 4)[0]
    print "%d %f" % (i, x)
I was having the same trouble. Luckily, I work with a bunch of network engineers who helped me out! On Mac/Linux, the following command works to print the 4B.vout data one line per node, with integer values the same as given in the summary file. If your file is called e.g. filename.4B.vout, then some command-line perl gets you:
cat filename.4B.vout | LANG= perl -0777 -e '$, = "\n"; print unpack("L*", <>), "";'
Edited to add: this is for the assignments of connected component ID and community ID, written implicitly: the 1st line is the ID of the node labeled 0, the 2nd line is the node labeled 1, etc. But I am copy-pasting here, so I'm not sure how it would need to change for floats. It works great for the integer values per node.
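Presumably (untested here), swapping the "L*" unpack template for "f*" would decode the same bytes as 4-byte native floats instead of unsigned 32-bit integers:

cat filename.4B.vout | LANG= perl -0777 -e '$, = "\n"; print unpack("f*", <>), "";'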
