Bash - extract substring in a string with special characters - bash

I'm trying to extract 43 (the downloader/request_count value) from the string below:
OUT="[scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 21394, 'downloader/request_count': 43, 'downloader/request_method_count/GET': 43, 'downloader/response_bytes': 1030981, 'downloader/response_count': 43, 'downloader/response_status_count/200': 43, 'item_scraped_count': 41"
As a first step, I did value=${OUT#*request_count\':}, which returns:
43, 'downloader/request_method_count/GET': 43, 'downloader/response_bytes': 1030981, 'downloader/response_count': 43, 'downloader/response_status_count/200': 43, 'item_scraped_count': 41
But when I try to delete the right-hand part, I get an error:
value2=${value%,*}
or
value2=$(cut -d, -f1 $value)
Any ideas?
Thanks in advance for your help!

% removes the shortest possible matching string; use %% to remove the longest possible one:
value2=${value%%,*}
# ~~
Also, you might want to remove the space after the colon, too:
value=${OUT#*request_count\': }
# ~
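Putting both steps together, a minimal sketch using a shortened version of the OUT string from the question:

```shell
OUT="[scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 21394, 'downloader/request_count': 43, 'downloader/request_method_count/GET': 43}"
value=${OUT#*request_count\': }   # strip everything up to and including "request_count': "
value2=${value%%,*}               # strip the longest suffix starting at a comma
echo "$value2"                    # 43
```

No external tools needed; both steps are pure parameter expansion, so this also avoids the cut invocation (which would have treated $value as a filename rather than as input).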

Related

Elixir POST file to Heroku file Attachment Scanner add-on

I am trying to scan uploaded documents for viruses when a user uploads them, using the Heroku Attachment Scanner add-on.
I am attempting to encode the file directly with Poison.encode, but it throws an error, so I am not sure this is the correct method. Any help appreciated; below is my attempted HTTPoison post request and the error from Poison.encode!.
def scan do
  url = System.get_env("ATTACHMENT_SCANNER_URL") <> "/requests"
  token = System.get_env("ATTACHMENT_SCANNER_API_TOKEN")
  headers = [
    "Authorization": "bearer " <> token,
    "Content-Type": "multipart/form-data"
  ]
  file_path = local_path_to_pdf_file
  file = file_path |> File.read!
  body = Poison.encode!(%{file: file})
  res = HTTPoison.post(url, body, headers, recv_timeout: 40_000)
end
Poison.encode(file) error:
iex(3)> Poison.encode(file)
** (FunctionClauseError) no function clause matching in Poison.Encoder.BitString.chunk_size/3
The following arguments were given to Poison.Encoder.BitString.chunk_size/3:
# 1
<<226, 227, 207, 211, 13, 10, 49, 48, 51, 32, 48, 32, 111, 98, 106, 13, 60, 60,
47, 76, 105, 110, 101, 97, 114, 105, 122, 101, 100, 32, 49, 47, 76, 32, 50,
53, 50, 53, 51, 52, 51, 47, 79, 32, 49, 48, 53, 47, 69, 32, ...>>
# 2
nil
# 3
1
P.S. I need to send the file directly and am unable to host the image publicly, so the Node.js examples in the docs will not work.
file = "/some/path/video.mp4"
HTTPoison.post("api.vid.me/video/upload", {:multipart, [{:file, file, {"form-data", [name: "filedata", filename: Path.basename(file)]}, []}]}, ["AccessToken": "XXXXX"])
Will this help you? (reference)
Following on from Dinesh's answer, here is the code snippet which I went for:
headers = [
  "Authorization": "bearer " <> token,
  "Content-Type": "multipart/form-data"
]
file_path = Ev2.Lib.MergerAPI.get_timecard_document_path
body = {:multipart, [{:file, file_path}]}
res = HTTPoison.post(url, body, headers)

Having both NER and RegexNER tags in StanfordCoreNLPServer output?

I am using the StanfordCoreNLPServer to extract some information from text (such as surfaces and street names).
The street is given by a specially trained NER model, and the surface by a simple regex via RegexNER.
Each of them works fine separately, but when they are used together, only the NER result is present in the output, under the ner tag. Why isn't there a regexner tag? Is there a way to also get the RegexNER result?
For information:
StanfordCoreNLP v3.6.0
the URL used:
'http://127.0.0.1:9000/'
'?properties={"annotators":"tokenize,ssplit,pos,ner,regexner", '
'"pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger",'
'"tokenize.language":"fr",'
'"ner.model":"ner-model.ser.gz", ' # custom NER model with STREET labels
'"regexner.mapping":"rules.tsv", ' # SURFACE label
'"outputFormat": "json"}'
As suggested here, the regexner annotator comes after ner, but still...
The current output (extract):
{u'index': 4, u'word': u'dans', u'lemma': u'dans', u'pos': u'P', u'characterOffsetEnd': 12, u'characterOffsetBegin': 8, u'originalText': u'dans', u'ner': u'O'}
{u'index': 5, u'word': u'la', u'lemma': u'la', u'pos': u'DET', u'characterOffsetEnd': 15, u'characterOffsetBegin': 13, u'originalText': u'la', u'ner': u'O'}
{u'index': 6, u'word': u'rue', u'lemma': u'rue', u'pos': u'NC', u'characterOffsetEnd': 19, u'characterOffsetBegin': 16, u'originalText': u'rue', u'ner': u'STREET'}
{u'index': 7, u'word': u'du', u'lemma': u'du', u'pos': u'P', u'characterOffsetEnd': 22, u'characterOffsetBegin': 20, u'originalText': u'du', u'ner': u'STREET'}
[...]
{u'index': 43, u'word': u'165', u'lemma': u'165', u'normalizedNER': u'165.0', u'pos': u'DET', u'characterOffsetEnd': 196, u'characterOffsetBegin': 193, u'originalText': u'165', u'ner': u'NUMBER'}
{u'index': 44, u'word': u'm', u'lemma': u'm', u'pos': u'NC', u'characterOffsetEnd': 198, u'characterOffsetBegin': 197, u'originalText': u'm', u'ner': u'O'}
{u'index': 45, u'word': u'2', u'lemma': u'2', u'normalizedNER': u'2.0', u'pos': u'ADJ', u'characterOffsetEnd': 199, u'characterOffsetBegin': 198, u'originalText': u'2', u'ner': u'NUMBER'}
Expected output: I would like the last three items to be labelled SURFACE, i.e. the RegexNER result.
Let me know if more details are needed.
Here's what the RegexNER documentation says about this:
RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE
Lalor LOCATION PERSON
Labor ORGANIZATION
I'm not sure what your mapping file exactly looks like, but if it just maps entities to labels, then the original NER will label your entities as NUMBER, and RegexNER won't be able to overwrite them. If you explicitly declare that some NUMBER entities should be overwritten as SURFACE in your mapping file, then it should work.
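For example, a mapping line along these lines should do it (a hypothetical sketch, since the actual rules.tsv is not shown; columns are tab-separated, the token patterns are space-separated regexes matching tokens like "165 m 2", and the third column lists the labels RegexNER is allowed to overwrite):

```
[0-9]+	m	2	SURFACE	NUMBER
```

With NUMBER listed as overwritable, the tokens previously tagged NUMBER by the numeric NER can be relabelled SURFACE.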
OK, things seem to work as I want if I put regexner first:
"annotators":"regexner,tokenize,ssplit,pos,ner",
It seems there is an ordering problem at some stage of the process?
Update for CoreNLP 3.9.2 server via Python:
When using the CoreNLP 3.9.2 server via Python, regexner can now also be initiated as part of ner, as per the docs. For example:
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,coref,openie",
    "outputFormat": "json",
    "ner.fine.regexner.mapping": "rules.txt",
}
output = nlp.annotate(text, properties=properties)
I could not get the regexner annotator to work by calling it directly. I think this is due to the reloading of dependencies and/or the method used to translate outputs to JSON; see e.g. this issue.

How to set multiple values for a key in a hash in prototype.js?

How to set multiple values for a key in a hash in prototype.js?
For Eg:
var hash={2005: [107, 31, 635, 203, 2]}
You can take exactly what you have written and convert it to a Hash like this:
var hash=$H({2005: [107, 31, 635, 203, 2]});
console.log(hash.get(2005));
//returns [107, 31, 635, 203, 2] as an array
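Since the value under the key is just an array, "multiple values for a key" boils down to ordinary array manipulation; a plain-JavaScript sketch (not Prototype-specific) of the same idea:

```javascript
// A key maps to an array of values, so adding another value is a push
var hash = {2005: [107, 31, 635, 203, 2]};
hash[2005].push(7);                 // add another value under the same key
console.log(hash[2005].join(","));  // 107,31,635,203,2,7
```

With Prototype's $H wrapper, the equivalent would be calling push on the array returned by hash.get(2005).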

Can you optimize this descending sorting array with Ruby?

Do you know a better, faster, smarter, more efficient, or just more elegant way of doing the following?
Given this array
a = [171, 209, 3808, "723", "288", "6", "5", 27, "22", 207, 473, "256", 67, 1536]
get this
a.map{|i|i.to_i}.sort{|a,b|b<=>a}
=> [3808, 1536, 723, 473, 288, 256, 209, 207, 171, 67, 27, 22, 6, 5]
You can use in-place mutations to avoid creating new arrays:
a.map!(&:to_i).sort!.reverse!
Hard to know if it's faster or more efficient without a benchmark, though.
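A minimal Benchmark sketch along those lines (the iteration count is arbitrary; a.dup keeps each in-place run independent, and the sanity check confirms both variants agree before timing them):

```ruby
require 'benchmark'

a = [171, 209, 3808, "723", "288", "6", "5", 27, "22", 207, 473, "256", 67, 1536]

# Both variants must produce the same descending array before we time them
raise "mismatch" unless a.dup.map!(&:to_i).sort!.reverse! == a.map(&:to_i).sort.reverse

Benchmark.bm(10) do |x|
  x.report("in-place")   { 50_000.times { a.dup.map!(&:to_i).sort!.reverse! } }
  x.report("new arrays") { 50_000.times { a.map(&:to_i).sort.reverse } }
end
```

On an array this small the difference is likely to be noise-level either way; the allocation savings from the bang methods only start to matter for much larger inputs.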
Here's one using Symbol#to_proc:
a.map(&:to_i).sort.reverse
This is faster than using in-place modifier (!) methods but uses more memory. As a bonus, it keeps the original array a intact if you want to do anything else with it.

Create binary data using Ruby?

I was playing with Ruby sockets and ended up trying to put an IP packet together: I took an IP packet and tried to make a new one just like it.
Now my problem is this: the packet is 45 00 00 54 00 00 40 00 40 01 06 e0 7f 00 00 01 7f 00 00 01, which is obviously hexadecimal. I converted it into a decimal number, then into binary data using the .pack method, and passed it to the send method, but Wireshark shows me something very different from what I created. I'm doing something wrong, I know that, but I can't figure out what:
@packet = 0x4500005400004000400106e07f0000017f000001 # I converted each 32 bits together, not like I wrote above
@data = ""
@data << @packet.to_s
@socket.send(@data.unpack(c*).to_s, @address)
And is there another way to solve the whole thing? Can I, for example, write the data I want to send directly to the socket buffer?
Thanks in advance.
Starting with a hex Bignum is a novel idea, though I can't immediately think of a good way to exploit it.
Anyway, the trouble starts with the .to_s on the Bignum, which creates a string containing the decimal representation of your number, taking you further from the bits rather than closer. Somehow your c* seems to have lost its quotes, also.
But putting them back, you then unpack the string, which gets you an array of integers: the ASCII values of the digits in the decimal representation of the numeric value of the original hex string. You then .to_s that (which IO would have done anyway, so no blame there at least), and the result is a string holding the printable representation of those ASCII numbers; you are now light-years from the original intention.
>> t = 0x4500005400004000400106e07f0000017f000001
=> 393920391770565046624940774228241397739864195073
>> t.to_s
=> "393920391770565046624940774228241397739864195073"
>> t.to_s.unpack('c*')
=> [51, 57, 51, 57, 50, 48, 51, 57, 49, 55, 55, 48, 53, 54, 53, 48, 52, 54, 54, 50, 52, 57, 52, 48, 55, 55, 52, 50, 50, 56, 50, 52, 49, 51, 57, 55, 55, 51, 57, 56, 54, 52, 49, 57, 53, 48, 55, 51]
>> t.to_s.unpack('c*').to_s
=> "515751575048515749555548535453485254545052575248555552505056505249515755555157565452495753485551"
It's kind of interesting in a way. All the information is still there, sort of.
Anyway, you need to make a binary string. Either just << numbers into it:
>> s = ''; s << 1 << 2
=> "\001\002"
Or use Array#pack:
>> [1,2].pack 'c*'
=> "\001\002"
First check your host byte order, because what you see in Wireshark is in network byte order (big-endian). In Wireshark you will also be seeing protocol headers (depending on whether it is a TCP or a UDP socket) followed by the data; you cannot directly send raw IP packets from an ordinary socket. So you will see this particular data in the particular packet's data section, i.e. the data section of the TCP/UDP packet.
