XPATH for all child nodes with different names - xpath

I have a parent element with various child elements that I need to number. The problem is that each child element has a different name, so every time I use count(*) the numbering restarts. I need the numbering to go 1.1, 1.2, 1.3...
The parent tag <application> would be 1, <ident> would be 1.1, <kitapplic> would be 1.2, and <tctoproof> would be 1.3.
I thought I could just do a count(child::application), but that didn't work. Your help is appreciated.
<application>
<ident>
<para>This Technical Order is applicable.</para>
</ident>
<kitapplic>
<kitapptbl>
<kitapptblrow>
<model>Model</model>
<serialno>Serial Number</serialno>
<kitno>Kit Required</kitno>
</kitapptblrow>
</kitapptbl>
</kitapplic>
<tctoproof>
<para>Time Compliance Technical Order (TCTO) verification, in accordance
with TO 00-5-15, was accomplished 28 August 2019 at Nellis Air Force
Base with GCS serial number 5147.</para>
</tctoproof>
</application>

With XPath, you can count the preceding siblings of an element and use concat to build the number. Example with kitapplic:
concat("1.",count(application/kitapplic/preceding-sibling::*)+1)
Output: 1.2
If you need the full list (1.1, 1.2, 1.3) for each child of the application element, you can loop over the children (example in Python):
data = """<application>
<ident>
<para>This Technical Order is applicable.</para>
</ident>
<kitapplic>
<kitapptbl>
<kitapptblrow>
<model>Model</model>
<serialno>Serial Number</serialno>
<kitno>Kit Required</kitno>
</kitapptblrow>
</kitapptbl>
</kitapplic>
<tctoproof>
<para>Time Compliance Technical Order (TCTO) verification, in accordance
with TO 00-5-15, was accomplished 28 August 2019 at Nellis Air Force
Base with GCS serial number 5147.</para>
</tctoproof>
</application>"""
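from lxml import etree
# Parse the data as XML (lxml.html would treat it as HTML)
tree = etree.fromstring(data)
for el in tree.xpath("//application/*"):
    # number of preceding siblings + 1 = position of this child
    print(el.xpath("concat(name(.),' 1.',count(./preceding-sibling::*)+1)"))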
Output :
ident 1.1
kitapplic 1.2
tctoproof 1.3
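Since count(preceding-sibling::*)+1 is just the 1-based position of a child among its siblings, the same numbers can also be produced without XPath functions. Here is a minimal sketch using only the standard library's xml.etree.ElementTree instead of lxml; the helper name number_children and the trimmed XML are assumptions made for illustration, not part of the original answer:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (child contents shortened).
data = """<application>
<ident><para>This Technical Order is applicable.</para></ident>
<kitapplic><kitapptbl/></kitapplic>
<tctoproof><para>TCTO verification text.</para></tctoproof>
</application>"""


def number_children(parent, prefix="1"):
    """Return (tag, number) pairs for the direct children of parent."""
    # enumerate(..., start=1) yields the same value as
    # count(preceding-sibling::*) + 1 for each child element.
    return [(child.tag, f"{prefix}.{pos}")
            for pos, child in enumerate(parent, start=1)]


root = ET.fromstring(data)
numbered = number_children(root)
for tag, number in numbered:
    print(tag, number)
```

Passing a different prefix (for example "2") would number the children of a second parent element the same way.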

Performance for adfuller and SARIMAX

This is somewhat of a continuation of a previous post, but I am trying to forecast weekly revenues. My program seems to hang on the adfuller test. It has run before, and the series appears stationary via the p-value, but not consistently. I have added SARIMAX as well, and the code just hangs. If I cancel out, I (periodically) get a message towards the bottom that says the problem is unconstrained.
Data:
Week   | Week_Start_Date | Amount | year
Week 1 | 2018-01-01      | 42920  | 2018
Week 2 | 2018-01-08      | 37772  | 2018
Week 3 | 2018-01-15      | 41076  | 2018
Week 4 | 2018-01-22      | 38431  | 2018
Week 5 | 2018-01-29      | 101676 | 2018
Code:
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX

x = organic_search.groupby('Week_Start_Date').Amount.sum()
# Augmented Dickey-Fuller test
ad_fuller_result = adfuller(x)
print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')
# SARIMA Model (disp=1 prints the optimizer's convergence messages)
plt.figure(2)
best_model = SARIMAX(x, order=(2, 1, 1), seasonal_order=(2, 1, 1, 52)).fit(disp=1)
print(best_model.summary())
best_model.plot_diagnostics(figsize=(15,12))
I am only working with 185 or so rows. I don't understand why the code is just hanging. Any optimization suggestions are welcome (for adfuller and SARIMAX).
Fixed by passing organic_search['Amount'] instead of organic_search.groupby('Week_Start_Date').Amount.sum().

data.frame with Date column output in RStudio console and preview, but not below the chunk

Using Rstudio 3.3.2's notebook :
---
title: "R Notebook"
output: html_notebook
---
When trying to display a data.frame with a Date column, the data.frame is displayed in the Viewer tab, but not below the chunk itself :
```{r}
df <- data.frame(date=c("31/08/2011", "31/07/2011", "30/06/2011"),values=c(0.8378,0.8457,0.8147))
#no Date format ->OK, output below the chunk
df
df$dateformatted<-as.Date(strptime(df$date,'%d/%m/%Y'))
#with Date format -> NOK, no output below the chunk,only in Viewer.
df
```
RStudio diagnostics :
26 Feb 2017 20:42:00 [rsession-x] ERROR r error 7 (Unexpected data type); OCCURRED AT: rstudio::core::Error rstudio::r::json::{anonymous}::jsonValueFromVectorElement(SEXP, int, rstudio::core::json::Value*) /home/ubuntu/rstudio/src/cpp/r/RJson.cpp:149; LOGGED FROM: void rstudio::session::modules::rmarkdown::notebook::enqueueChunkOutput(const string&, const string&, const string&, unsigned int, ChunkOutputType, const rstudio::core::FilePath&, const Value&) /home/ubuntu/rstudio/src/cpp/session/modules/rmarkdown/NotebookOutput.cpp:449
relates to this question.
Does anyone know what I did wrong? Thanks a lot in advance.
This is indeed a bug in the current release of RStudio: data.frames containing Date objects are not rendered properly in notebooks. You might try installing the latest daily build of RStudio and confirming the issue is resolved there:
http://dailies.rstudio.com
I appreciate Rigoberta's and Kevin's posts. I'm having the same problem (rstudio 1.0.136).
I'm wary of using the latest daily build, given the note at http://dailies.rstudio.com: "Daily builds are intended for testing purposes, and are not recommended for general use. For stable builds, please visit rstudio.com."
As I have never used "unstable" versions of RStudio, rolling back the RStudio version seems the better approach for now, but opinions are appreciated.
While waiting to decide whether to move back to RStudio 1.0.44 or move forward to an "unstable" version, I found out the issue doesn't happen with matrix objects, so, temporarily, I'm using print(as.matrix()):
```{r}
df <- data.frame(date = c("31/08/2011", "31/07/2011", "30/06/2011"), values = c(0.8378, 0.8457, 0.8147))
df$dateformatted <- as.Date(strptime(df$date, '%d/%m/%Y'))
print(as.matrix(df), quote = FALSE)
```
date values dateformatted
[1,] 31/08/2011 0.8378 2011-08-31
[2,] 31/07/2011 0.8457 2011-07-31
[3,] 30/06/2011 0.8147 2011-06-30
To simulate head()'s behaviour:
print(as.matrix(df), quote = FALSE, max = length(df) * 6)
You can use this function (it needs dplyr for %>%, ungroup and mutate_if, and lubridate for is.Date)
bf <- function(x) x %>% ungroup() %>% mutate_if(is.Date, as.character)
to make data frames containing dates display as expected:
```{r}
data.frame(date = as.Date(Sys.time()), num = 1:3) %>% bf
```
date num
2017-03-18 1
2017-03-18 2
2017-03-18 3
3 rows

WINBUGS : adding time and product fixed effects in a hierarchical data

I am working on hierarchical panel data using WinBUGS. Assume data on school performance: the dependent variable is logs, with independent variables logp and rank. All schools are divided into three categories (cat), and I need a beta coefficient for each category (thus HLM). I want to account for time-specific and school-specific effects in the model. One way would be to add dummy variables to the terms under mu[i], but that would get messy because the number of schools runs up to 60. I am sure there must be a better way to handle that.
My data looks like the following:
school time logs logp cat rank
1 1 4.2 8.9 1 1
1 2 4.2 8.1 1 2
1 3 3.5 9.2 1 1
2 1 4.1 7.5 1 2
2 2 4.5 6.5 1 2
3 1 5.1 6.6 2 4
3 2 6.2 6.8 3 7
#logs = log(score)
#logp = log(average hours of inputs)
#rank - rank of school
#cat = section red, section blue, section white in school (hierarchies)
My WinBUGS code is given below.
model {
  # N observations
  for (i in 1:n) {
    logs[i] ~ dnorm(mu[i], tau)
    mu[i] <- bcons + bprice*(logp[i]) + brank[cat[i]]*(rank[i])
  }
  # C categories
  for (c in 1:C) {
    brank[c] ~ dnorm(beta, taub)
  }
  # priors
  bcons ~ dnorm(0,1.0E-6)
  bprice ~ dnorm(0,1.0E-6)
  bad ~ dnorm(0,1.0E-6)
  beta ~ dnorm(0,1.0E-6)
  tau ~ dgamma(0.001,0.001)
  taub ~ dgamma(0.001,0.001)
}
As you can see in the data sample above, I have multiple observations for each school over time. How can I modify the code to account for time-specific and school-specific fixed effects? I have used Stata in the past, where the fe, be, and i.time options take care of fixed effects in panel data. But here I am lost.

Ruby Regex to match multiple portions of a string

Using Ruby: ruby 1.9.3dev (2011-09-23 revision 33323) [i686-linux]
I have the following string:
str = 'Message relates to activity <a href="/activities/35">TU4 Sep 5 Activity 1</a> <img src="/images/layout/placeholder.png" width="222" height="149"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1.'
I want to match the following:
35 (a number which is part of href attribute value)
TU4 Sep 5 Activity 1 (the text of the <a> tag)
First question from Manager on TU4 Sep 5 Activity 1. (the remaining text after last <br/><br/> tags)
For achieving the same I have written the following regex
result = str.match(/<a href="\/activities\/(?<activity_id>\d+)">(?<activity_title>.*)<\/a>.*<br\/><br\/>(?<message>.*)/)
This produces following result:
#<MatchData "<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
activity_id:"35"
activity_title:"TU4 Sep 5 Activity 1"
message:"First question from Manager on TU4 Sep 5 Activity 1.">
But I guess this is not efficient.
Is it possible to have only the required values (those listed above under what I want to match) returned in the match result, with the following value, the full matched text, excluded?
"<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
Thanks,
Jignesh
The appropriate way to do this is NOT to use regexen. Instead, use the Nokogiri library to easily parse your html:
require 'nokogiri'
doc = Nokogiri::HTML.parse(str)
activity_id = doc.css('[href^="/activities"]').attr('href').value[/\d+$/]
activity_title = doc.css('[href^="/activities"]')[0].inner_text
message = doc.search("//text()").last
This will do exactly what your regexp was attempting, with much lower chance of random failure.
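On the narrower question of keeping only the named values: the full matched text is always stored on the match object, but nothing forces you to read it, and you can pull out just the named groups. Below is a rough illustration of that idea in Python rather than Ruby. It assumes the input string contains an `<a href="/activities/35">` link, as the pattern and the match result in the question imply; the variable names are made up for the example:

```python
import re

# Input string from the question, assuming it contains the <a> link
# that the pattern expects.
s = ('Message relates to activity '
     '<a href="/activities/35">TU4 Sep 5 Activity 1</a> '
     '<img src="/images/layout/placeholder.png" width="222" height="149"/>'
     '<br/><br/>First question from Manager on TU4 Sep 5 Activity 1.')

# Same structure as the Ruby pattern; Python spells named groups (?P<name>...).
pattern = re.compile(
    r'<a href="/activities/(?P<activity_id>\d+)">(?P<activity_title>.*?)</a>'
    r'.*<br/><br/>(?P<message>.*)'
)

match = pattern.search(s)
# groupdict() returns only the named groups, not the whole matched text.
fields = match.groupdict() if match else {}
print(fields)
```

In Ruby the equivalent is to read result[:activity_id], result[:activity_title] and result[:message], or to pair result.names with result.captures.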

How can I know if a TIFF image is in the format CCITT T.6(Group 4)?

You can use this (C#) code example.
It returns a value indicating the compression type:
1: no compression
2: CCITT Group 3
3: Facsimile-compatible CCITT Group 3
4: CCITT Group 4 (T.6)
5: LZW
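using System;
using System.Drawing;
using System.Drawing.Imaging;

public static int GetCompressionType(Image image)
{
    // Tag 0x103 (259) holds the TIFF compression method.
    int compressionTagIndex = Array.IndexOf(image.PropertyIdList, 0x103);
    if (compressionTagIndex < 0) return 1; // tag absent: the TIFF default is 1 (no compression)
    PropertyItem compressionTag = image.PropertyItems[compressionTagIndex];
    return BitConverter.ToInt16(compressionTag.Value, 0);
}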
You can check these links
The TIFF File Format
TIFF Tag Compression
TIFF File Format Summary
Tag 259 (hex 0x0103) stores the compression method.
--- Compression
Tag = 259 (103)
Type = word
N = 1
Default = 1.
1 = No compression, but pack data into bytes as tightly as possible, with no
unused bits except at the end of a row. The bytes are stored as an array
of bytes, for BitsPerSample <= 8, word if BitsPerSample > 8 and <= 16, and
dword if BitsPerSample > 16 and <= 32. The byte ordering of data >8 bits
must be consistent with that specified in the TIFF file header (bytes 0
and 1). Rows are required to begin on byte boundaries.
2 = CCITT Group 3 1-Dimensional Modified Huffman run length encoding.
See ALGRTHMS.txt BitsPerSample must be 1, since this type of compression
is defined only for bilevel images (like FAX images...)
3 = Facsimile-compatible CCITT Group 3, exactly as specified in
"Standardization of Group 3 facsimile apparatus for document
transmission," Recommendation T.4, Volume VII, Fascicle VII.3,
Terminal Equipment and Protocols for Telematic Services, The
International Telegraph and Telephone Consultative Committee
(CCITT), Geneva, 1985, pages 16 through 31. Each strip must
begin on a byte boundary. (But recall that an image can be a
single strip.) Rows that are not the first row of a strip are
not required to begin on a byte boundary. The data is stored as
bytes, not words - byte-reversal is not allowed. See the
Group3Options field for Group 3 options such as 1D vs 2D coding.
4 = Facsimile-compatible CCITT Group 4, exactly as specified in
"Facsimile Coding Schemes and Coding Control Functions for Group
4 Facsimile Apparatus," Recommendation T.6, Volume VII, Fascicle
VII.3, Terminal Equipment and Protocols for Telematic Services,
The International Telegraph and Telephone Consultative Committee
(CCITT), Geneva, 1985, pages 40 through 48. Each strip must
begin on a byte boundary. Rows that are not the first row of a
strip are not required to begin on a byte boundary. The data is
stored as bytes, not words. See the Group4Options field for
Group 4 options.
5 = LZW Compression, for grayscale, mapped color, and full color images.
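If no imaging library is at hand, the tag described above can also be read straight from the file bytes. The following is a minimal sketch in Python, assuming a classic (non-BigTIFF) file and looking only at the first image directory; the function name tiff_compression and the in-memory sample are hypothetical, made up for this illustration:

```python
import struct


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag (259) from the first IFD of a TIFF."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif byte_order == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == 259:
            # A single SHORT is stored left-justified in the 4-byte value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1  # tag absent: the default is 1 (no compression)


# Hypothetical sample: 8-byte header plus one IFD holding only tag 259 = 4.
sample = (struct.pack("<2sHI", b"II", 42, 8)
          + struct.pack("<HHHIHHI", 1, 259, 3, 1, 4, 0, 0))
is_group4 = tiff_compression(sample) == 4
print(is_group4)
```

For a real file, read it in binary mode (open(path, "rb")) and pass the bytes to the function; a return value of 4 means CCITT T.6 (Group 4).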
You can run identify -verbose from the ImageMagick suite on the image. Look for "Compression: Group4" in the output.
UPDATE:
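So, I downloaded the libtiff library from the link I mentioned before, and from what I've seen, you can do the following (untested). It reads the tag through the public TIFFGetField call, because the tif_dir structure is internal to libtiff:
#include <stdint.h>
#include <tiffio.h>
int isTIFF_T6(const char* filename)
{
    uint16_t compression = 0;
    int result = 0;
    TIFF* tif = TIFFOpen(filename, "r");
    if (tif == NULL) return 0;  /* could not open or not a TIFF */
    if (TIFFGetField(tif, TIFFTAG_COMPRESSION, &compression))
        result = (compression == COMPRESSION_CCITTFAX4);
    TIFFClose(tif);
    return result;
}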
PREVIOUS:
This page has a lot of information about this format and links to some code in C:
Here's an excerpt:
The following paper covers T.4, T.6 and JBIG:
"Review of standards for electronic imaging for facsimile systems" in Journal of Electronic Imaging, Vol. 1, No. 1, pp. 5-21, January 1992.
Source code can be obtained as part of a TIFF toolkit - TIFF image compression techniques for binary images include CCITT T.4 and T.6:
ftp://ftp.sgi.com/graphics/tiff/tiff-v3.4beta035-tar.gz
Contact: sam#engr.sgi.com
Read more: http://www.faqs.org/faqs/compression-faq/part1/section-16.html
