I have some old rrdtool databases for which the exact creation recipe has long since been lost. I need to create a new database with the same characteristics as the current ones. I've dumped a couple of the old databases and pored over the contents, but I'm not sure how to interpret the metadata. I think it appears in the following stanzas:
<cf> AVERAGE </cf>
<pdp_per_row> 360 </pdp_per_row> <!-- 1800 seconds -->
<xff> 5.0000000000e-01 </xff>
There are four such stanzas, which correspond to the way I recall the round-robin cascading was set up. Has anyone already done this, or can anyone give me pointers on how to clone a new, empty RRD database from an existing one? Or show me where I missed this in the documentation.
I use the rrdtool create command (documented in the rrdcreate man page). It can create a new RRD based on an existing one: the --template (-t) parameter names an existing RRD to use as a template (this needs rrdtool 1.5 or later, as far as I know).
rrdtool create new.rrd --template existing.rrd
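If I remember correctly, rrdtool 1.5+ also has a --source option that carries over the existing data instead of starting empty:

rrdtool create new.rrd --source existing.rrd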
rrdtool's rrdinfo is your friend!
It will tell you how the RRD file's data source(s) and archive(s) were created. Example:
$ rrdtool info random.rrd
filename = "random.rrd"
rrd_version = "0001"
step = 300
last_update = 955892996
ds[a].type = "GAUGE"
ds[a].minimal_heartbeat = 600
ds[a].min = NaN
ds[a].max = NaN
ds[a].last_ds = "UNKN"
ds[a].value = 2.1824421548e+04
ds[a].unknown_sec = 0
ds[b].type = "GAUGE"
ds[b].minimal_heartbeat = 600
ds[b].min = NaN
ds[b].max = NaN
ds[b].last_ds = "UNKN"
ds[b].value = 3.9620838224e+03
ds[b].unknown_sec = 0
rra[0].cf = "AVERAGE"
rra[0].pdp_per_row = 1
rra[0].cdp_prep[0].value = nan
rra[0].cdp_prep[0].unknown_datapoints = 0
rra[0].cdp_prep[1].value = nan
rra[0].cdp_prep[1].unknown_datapoints = 0
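From that output you can reconstruct the create command by hand. A rough sketch for the example above (the excerpt does not show the xff or the row count, so the 0.5 and 600 below are placeholders; read the real values from rra[0].xff and rra[0].rows and repeat for every rra[N] listed):

rrdtool create random-clone.rrd --step 300 \
    DS:a:GAUGE:600:U:U \
    DS:b:GAUGE:600:U:U \
    RRA:AVERAGE:0.5:1:600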
You can try the clone script described here. It's very basic, but it works for simple RRD files. I used it to figure out a schema that was generated by Munin: I needed to insert old data into Munin, so I reverse engineered the schema, set --start to a date prior to the start of my old data, and re-imported the data into the RRD.
$ python rrdinfo-parser.py -f test.rrd
rrdtool create test.rrd --start 920804400 --step 300 \
DS:speed:COUNTER:600:U:U \
RRA:AVERAGE:0.5:1:24 \
RRA:AVERAGE:0.5:6:10 \
I'm searching for a way to check if a file exists before using the OPEN DATASET command to open it. The OPEN DATASET command takes up to 30 seconds to trigger an exception, which is too slow for my liking.
This is the code:
TRY.
    OPEN DATASET lv_file FOR OUTPUT IN TEXT MODE
      ENCODING DEFAULT
      WITH SMART LINEFEED.
    CONCATENATE ` ` lv_resultdata INTO lv_resultdata.
    TRANSFER lv_resultdata TO lv_file.
    CLOSE DATASET lv_file.
  CATCH cx_sy_file_access_error.
    MESSAGE 'Placeholder-message. File cannot be reached'.
    EXIT.
ENDTRY.
Try this:
DATA: filepath TYPE epsf-epsdirnam VALUE '/tmp'.

CALL FUNCTION 'EPS_GET_DIRECTORY_LISTING'
  EXPORTING
    dir_name               = filepath
    file_mask              = 'somefile.txt'
  EXCEPTIONS
    invalid_eps_subdir     = 1
    sapgparam_failed       = 2
    build_directory_failed = 3
    no_authorization       = 4
    read_directory_failed  = 5
    too_many_read_errors   = 6
    empty_directory_list   = 7
    OTHERS                 = 8.
CHECK sy-subrc = 0.
" writing dataset
It can also be used for remote servers.
When I write my .gpkg I am losing the CRS. I have tried setting the CRS with .set_crs, and adding the CRS when writing the .gpkg (which produces a warning: "fiona._env - WARNING - dataset filename.gpkg does not support layer creation option EPSG").
My code:
for layername in fiona.listlayers(file):
    vector = geopandas.read_file(file, layer=layername)
    vector.set_crs(4326)
    vector.to_file(filename + ".gpkg", layer=layername, driver='GPKG')
or
for layername in fiona.listlayers(file):
    vector = geopandas.read_file(file, layer=layername)
    vector.to_file(filename + ".gpkg", layer=layername, driver='GPKG', epsg=4326)
neither works.
vector.set_crs(4326) does not work in place by default. You either need to assign it or specify inplace=True.
for layername in fiona.listlayers(file):
    vector = geopandas.read_file(file, layer=layername)
    # vector.set_crs(4326, inplace=True)  # one option
    vector = vector.set_crs(4326)  # other option
    vector.to_file(filename + ".gpkg", layer=layername, driver='GPKG')
Your second attempt does not work because to_file does not have the epsg keyword you are trying to use; it gets lost among the arguments passed on to Fiona and GDAL (which silently ignore it).
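A quick way to confirm the CRS stuck before writing, reusing the loop above:

# inside the loop, after the set_crs call
print(vector.crs)  # should print EPSG:4326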
[Disclaimer: I posted this question on Biostars 3 weeks ago, with no answers yet. I would really like to get some ideas/discussion to find a solution, so I am also posting it here.
Biostars post link: https://www.biostars.org/p/447413/]
For one of my PhD projects, I would like to access all variants found in the ClinVar db that are at the same genomic position as the variant in each row of the input GSvar file. The language constraint is Python.
So far I have used the entrezpy module entrezpy.esearch.esearcher. See more on entrezpy at: https://entrezpy.readthedocs.io/en/master/
Following this guide from the entrezpy docs, I retrieve UIDs using the genomic position of a variant: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html. In code:
import entrezpy.esearch.esearcher

# first get UIDs for clinvar records of the same position
# credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
chr = variants["chr"].split("chr")[1]
start, end = str(variants["start"]), str(variants["end"])
es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
genomic_pos = chr + "[chr]" + " AND " + start + ":" + end # + "[chrpos37]"
entrez_query = es.inquire(
{'db': 'clinvar',
'term': genomic_pos,
'retmax': 100000,
'retstart': 0,
'rettype': 'uilist'}) # 'usehistory': False
entrez_uids = entrez_query.get_result().uids
Then I have used Entrez from BioPython to get the available ClinVar records:
from Bio import Entrez
import xml.etree.ElementTree as ET

# process each VariationArchive of each UID
handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
clinvar_records = {}
tree = ET.parse(handle)
root = tree.getroot()
This approach works. However, it has two main drawbacks:
entrezpy fills up my log file by recording every interaction with Entrez, which makes the log too big to be read by the hospital collaborator, who is a variant curator.
The entrezpy call entrez_query.get_result().uids returns all UIDs retrieved so far from all requests (say, one request per variant in the GSvar file), so the retrieval is space-inefficient: the entrez_uids list grows quickly as I process all the variants in a GSvar file. The simple solution I have implemented is to check which UIDs are new in the current request and keep only those for Entrez.efetch(). However, I still need to keep all UIDs seen for previous variants in order to know which UIDs are new. I do this in code by:
# first snippet's first lines go here
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
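For reference, the same bookkeeping as a small self-contained sketch, with the list replaced by a set so the membership check stays cheap as the number of seen UIDs grows (names mirror the snippet above):

# seen_uids plays the role of self.all_entrez_uids_gsvar_file
seen_uids = set()

# inside the per-variant loop, after the esearch call shown earlier:
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in seen_uids]
seen_uids.update(current_entrez_uids)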
Does anyone have suggestions on how to address these two drawbacks?
I'm trying to create a new Flume agent with a spooldir source that puts the files into HDFS. This is my config file:
agent.sources = file
agent.channels = channel
agent.sinks = hdfsSink
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
# SINKS CONFIGURATION
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /HADOOP/PATH/%Y/%m/%d/%H/
agent.sinks.hdfsSink.hdfs.filePrefix = common
agent.sinks.hdfsSink.hdfs.fileSuffix = .json
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.rollSize = 5242880
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.maxOpenFiles = 2
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.callTimeout = 100000
agent.sinks.hdfsSink.hdfs.batchSize = 1000
agent.sinks.hdfsSink.channel = channel
# CHANNELS CONFIGURATION
agent.channels.channel.type = memory
agent.channels.channel.capacity = 10000
agent.channels.channel.transactionCapacity = 1000
I'm getting an error that says Expected timestamp in the Flume event headers, but it was null. The files I'm reading contain a JSON structure, which has a field named timestamp.
Is there a way to add this timestamp to the header?
As per my earlier comment, I am now sharing the entire set of steps I followed for spooling a header-enabled JSON file, putting it into the Hadoop HDFS cluster using Flume, creating an external table over the JSON file, and later running a DML query over it.
Created flume-spool.conf:
# Flume configuration starts
erum.sources =source-1
erum.channels =file-channel-1
erum.sinks =hdfs-sink-1
erum.sources.source-1.channels =file-channel-1
erum.sinks.hdfs-sink-1.channel =file-channel-1
# Define a file channel called file-channel-1 on erum
erum.channels.file-channel-1.type =file
erum.channels.file-channel-1.capacity =2000000
erum.channels.file-channel-1.transactionCapacity =100000
# Define a source for erum
erum.sources.source-1.type =spooldir
erum.sources.source-1.bind =localhost
erum.sources.source-1.port =44444
erum.sources.source-1.inputCharset =UTF-8
erum.sources.source-1.bufferMaxLineLength =100
# Spooldir in my case is /home/arif/practice/flume_sink
erum.sources.source-1.spoolDir =/home/arif/practice/flume_sink/
erum.sources.source-1.fileHeader =true
erum.sources.source-1.fileHeaderKey=file
erum.sources.source-1.fileSuffix =.COMPLETED
# Sink writes to /user/arif/flume_sink/products under HDFS
erum.sinks.hdfs-sink-1.pathManager =DEFAULT
erum.sinks.hdfs-sink-1.type =hdfs
erum.sinks.hdfs-sink-1.hdfs.filePrefix =common
erum.sinks.hdfs-sink-1.hdfs.fileSuffix =.json
erum.sinks.hdfs-sink-1.hdfs.writeFormat =Text
erum.sinks.hdfs-sink-1.hdfs.fileType =DataStream
erum.sinks.hdfs-sink-1.hdfs.path =hdfs://localhost:9000/user/arif/flume_sink/products/
erum.sinks.hdfs-sink-1.hdfs.batchSize =1000
erum.sinks.hdfs-sink-1.hdfs.rollSize =2684354560
erum.sinks.hdfs-sink-1.hdfs.rollInterval =5
erum.sinks.hdfs-sink-1.hdfs.rollCount =5000
Now we run flume-spool using the agent erum:
bin/flume-ng agent -n erum -c conf -f conf/flume-spool.conf -Dflume.root.logger=DEBUG,console
Copied the products.json file into the directory configured as erum.sources.source-1.spoolDir in the Flume configuration.
The contents of the products.json file are as follows:
{"productid":"5968dd23fc13ae04d9000001","product_name":"sildenafilcitrate","mfgdate":"20160719031109","supplier":"WisozkInc","quantity":261,"unit_cost":"$10.47"}
{"productid":"5968dd23fc13ae04d9000002","product_name":"MountainJuniperusashei","mfgdate":"20161003021009","supplier":"Keebler-Hilpert","quantity":292,"unit_cost":"$8.74"}
{"productid":"5968dd23fc13ae04d9000003","product_name":"DextromathorphanHBr","mfgdate":"20161101041113","supplier":"Schmitt-Weissnat","quantity":211,"unit_cost":"$20.53"}
{"productid":"5968dd23fc13ae04d9000004","product_name":"MeophanHBr","mfgdate":"20161101061113","supplier":"Schmitt-Weissnat","quantity":198,"unit_cost":"$18.73"}
Download hive-serdes-sources-1.0.6.jar from the URL below:
https://www.dropbox.com/s/lsjgk2zaqz8uli9/hive-serdes-sources-1.0.6.jar?dl=0
After spooling the JSON file to the HDFS cluster using flume-spool, start the Hive server, log in to the Hive shell, and do the following:
hive> add jar /home/arif/applications/hadoop/apache-hive-2.1.1-bin/lib/hive-serdes-sources-1.0.6.jar;
hive> create external table products (productid string, product_name string, mfgdate string, supplier string, quantity int, unit_cost string)
> row format serde 'com.cloudera.hive.serde.JSONSerDe' location '/user/arif/flume_sink/products/';
OK
Time taken: 0.211 seconds
hive> select * from products;
OK
5968dd23fc13ae04d9000001 sildenafilcitrate 20160719031109 WisozkInc 261 $10.47
5968dd23fc13ae04d9000002 MountainJuniperusashei 20161003021009 Keebler-Hilpert 292 $8.74
5968dd23fc13ae04d9000003 DextromathorphanHBr 20161101041113 Schmitt-Weissnat 211 $20.53
5968dd23fc13ae04d9000004 MeophanHBr 20161101061113 Schmitt-Weissnat 198 $18.73
Time taken: 0.291 seconds, Fetched: 4 row(s)
I completed all of these steps without a single error. I hope this helps, thanks.
As explained in this post:
http://shzhangji.com/blog/2017/08/05/how-to-extract-event-time-in-apache-flume/
the change needed is to add an interceptor and a serializer to the source:
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
agent.sources.file.interceptors = i1
agent.sources.file.interceptors.i1.type = regex_extractor
agent.sources.file.interceptors.i1.regex = <regex_for_timestamp>
agent.sources.file.interceptors.i1.serializers = s1
agent.sources.file.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.file.interceptors.i1.serializers.s1.name = timestamp
agent.sources.file.interceptors.i1.serializers.s1.pattern = <pattern_that_matches_your_regex>
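For example, if each JSON line carried a field like "timestamp":"2017-08-05 12:00:00" (an assumed format, purely for illustration; adjust both values to your actual data), the two placeholders could be filled in along these lines:

agent.sources.file.interceptors.i1.regex = "timestamp":"([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})"
agent.sources.file.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss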
Thanks for pointing out that, besides the link, I needed to include a proper snippet :)
Sorry for the noob question. I'm trying to write a simple bash script based on Newsbeuter. Basically, I'm trying to get the first 5 articles I haven't read yet; once I have them, I send them to my phone with Pushover, and then I need to set them as read in Newsbeuter.
#!/bin/bash --
urls=$( sqlite3 /home/pi/.newsbeuter/cache.db <<END
select url from rss_item where unread = 1 limit 5;
END
)
This is the first query. I send the message variable through the Pushover API.
Now I need to figure out how to update the table and set the articles as read.
Any ideas? (I'm totally new to bash syntax.)
I tried to build a query like
UPDATE rss_item set unread = 0 where url = '$url'
and loop over it, but it didn't work. Then I tried
`UPDATE rss_item set unread = 0 where url in ($urls)`
but I keep getting errors I can't even understand! I really need a syntax lecture!
Try this:
#!/bin/bash --

# Grab the URLs of the first five unread articles.
urls="$(
  sqlite3 /home/pi/.newsbeuter/cache.db \
    'select url from rss_item where unread = 1 limit 5'
)"

# Mark each one as read (assumes the URLs contain no whitespace or single quotes).
for url in $urls; do
  sqlite3 /home/pi/.newsbeuter/cache.db \
    "UPDATE rss_item set unread = 0 where url = '$url'"
done
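Alternatively, if the URLs might contain spaces (which would break the word-splitting loop above), the update can be done in a single statement; note that this marks whatever are currently the first five unread rows, which may differ from the five fetched earlier:

sqlite3 /home/pi/.newsbeuter/cache.db \
  "UPDATE rss_item set unread = 0 where url in
     (select url from rss_item where unread = 1 limit 5)"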