I am trying to set up an Access database and will be importing the data from Excel. We do our analysis in R and the current Excel worksheet we use is formatted and arranged to work well for exporting to R and doing analysis there.
The format is as follows:
The first 12 columns of data describe date, location, and other information which then applies to the following 12 columns. The trouble is that for a single set of observations the information in the first 12 columns doesn't change from row to row, but the values in the second 12 columns do change from row to row.
year mm dd loc start end obs sess test object success
2013 5 15 park 1600 1700 MTM MTM1 1 ball y
2013 5 15 park 1600 1700 MTM MTM1 2 stick y
2013 5 15 park 1600 1700 MTM MTM1 3 rock n
2013 5 15 park 1600 1700 MTM MTM1 4 rock n
2013 5 15 park 1600 1700 MTM MTM1 5 stick y
2013 5 15 park 1600 1700 MTM MTM1 6 stick y
2013 6 24 yard 1500 1530 LFR LFR1 1 ball n
2013 6 24 yard 1500 1530 LFR LFR1 2 stick n
2013 6 24 yard 1500 1530 LFR LFR1 3 stick n
2013 6 24 yard 1500 1530 LFR LFR1 4 stick n
2013 6 24 yard 1500 1530 LFR LFR1 5 stick y
2013 6 24 yard 1500 1530 LFR LFR1 6 rock y
2013 6 24 yard 1500 1530 LFR LFR1 7 ball y
Above is an imaginary dataset which matches the format of the real one (the real one is too wide to fit here).
Notice that the entries for year, mm (month), dd (day), loc (location), start, end, obs (observer), and sess (session) all stay the same but test, object, and success change from row to row for a given set of observations.
In Access I would like to use a unique_ID (primary key) to relate tables so that the information for the first 8 columns need only be entered once and have it relate to each entry for the last 3 columns. In this example then, I have one Excel worksheet that will become two related Access tables (objects).
Before converting to Access though I would like to know that I will be able to export the data back to Excel (and/or directly to a text file) so that it will look just like this again. That is, I do NOT want to export multiple tables to separate Excel worksheets. I want all Access tables within my database to be exported to just one worksheet and in the format shown above. The reason for this is that we run analysis in R based on both the session and the instance levels (called different things in the real data, but that is the idea) so it is important for the location data and result data to be associated with every row in the output file (.xls or .csv).
Is this possible?
I am mostly looking for an outline of how this might be done. Specific code is not necessary, though your assessment of how complex the required code would be (given my complete and absolute ignorance of VBA) would be appreciated.
The answer appears to be yes. I just need to create a query that returns the data in its original form and then export that query as if it were a table.
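To make the join concrete, here is a minimal, hypothetical sketch in Java of what such a query does: it repeats each session's fields beside every one of its observations, reproducing the original flat layout. The table and field names are invented for illustration; in Access the equivalent would be a SELECT with an INNER JOIN on the shared key, built in the query designer with no VBA required.

import java.util.List;

public class FlattenExample {
    // Hypothetical normalized tables: one session row relates to many
    // observation rows through the shared session id (the primary key).
    record Session(int id, int year, int mm, int dd, String loc,
                   int start, int end, String obs, String sess) {}
    record Observation(int sessionId, int test, String object, String success) {}

    public static void main(String[] args) {
        List<Session> sessions = List.of(
            new Session(1, 2013, 5, 15, "park", 1600, 1700, "MTM", "MTM1"));
        List<Observation> observations = List.of(
            new Observation(1, 1, "ball", "y"),
            new Observation(1, 2, "stick", "y"));

        // The "query": match each observation to its session and emit one
        // flat row per observation, exactly as in the original worksheet.
        for (Observation o : observations) {
            for (Session s : sessions) {
                if (s.id() == o.sessionId()) {
                    System.out.printf("%d %d %d %s %d %d %s %s %d %s %s%n",
                        s.year(), s.mm(), s.dd(), s.loc(), s.start(), s.end(),
                        s.obs(), s.sess(), o.test(), o.object(), o.success());
                }
            }
        }
    }
}

Exporting that query (rather than either table) to .xls or .csv gives a file with the session fields repeated on every row, which is what R needs.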
Does Cognos Framework Manager have the built-in function "Last" as in Dynamic Cubes?
Or does someone know how to model the following case:
We have two dimensions: a time dimension with year, half-year, quarter, and month, and another dimension that categorises people depending on how long they have been attending a project (1-30 days, 31-60 days, 60-180 days, 180-365 days, 1-2 years, 2+ years). However, the choice of the time dimension level (year, half-year, etc.) influences the categorization of the other dimension.
An example:
A person attends a project starting 15.11.2018 and ending 30.06.2020. The Cognos user selects the year level of the time dimension, so 2018, 2019 & 2020 will be displayed.
For 2018 the person will be in the category 31-60 days, since 46 days have passed by 31.12.2018. For 2019 the person will be listed in the category 1-2 years, as 46 + 365 days will have passed by 31.12.2019. For 2020 the person will also be in that category, as 46 + 365 + 180 days will have gone by.
The categories will change if the user selects another time dimension level e.g. half-years:
2nd HY 2018: 31-60 (46 days passed)
1st HY 2019: 180-365 days (46 + 180 --> End of HY2019)
2nd HY 2019: 1-2 years (46 + 180 + 180)
1st HY 2020: 1-2 years (46 + 180 + 180 + 180)
Does someone know how to model dynamic dimension categories based on the selection of another dimension (here, the time dimension)?
The fact table contains monthly data, and for the person mentioned above there will be 20 separate records (one for each month between November 2018 and June 2020).
For any period, a person may or may not be working on a project.
Without knowing exactly what your data and metadata are, it would be somewhat difficult to prescribe an exact solution, but the approach would probably be similar to a degenerate dimension scenario.
You would want to model the project dimension as both a fact and a dimension. You would have relationships between it and time and whatever other dimensions you need.
Depending on the data and the metadata you might need to do some gymnastics to get there.
If the data were in a form similar to the following, it would not be too difficult. This example should give you an idea of one way to approach the problem. Commitment_Status is the measure.
Date_Key Person_Key Project_Key Commitment_Status
20200101 1 1 1
20200101 1 2 0
20200101 1 3 0
20200102 1 1 1
20200102 1 2 0
20200102 1 3 0
20200103 1 1 0
20200103 1 2 1
20200103 1 3 0
In the above, person 1 was working on project 1 for 2 days and then put onto project 2 for a day. By aggregating the commitment status, which is done by setting the aggregate rule property, you would be able to determine the number of days a person has been working on a project no matter what time period you have set in your query.
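To make the aggregation concrete, here is a small Java sketch of the same idea, using the hypothetical fact rows above: summing the commitment_status measure per person and project gives days-on-project over whatever period the query covers, which is what the aggregate rule does for you in Cognos.

import java.util.HashMap;
import java.util.Map;

public class CommitmentSum {
    // One fact row per person, project, and day; commitmentStatus is 1 when
    // the person worked on that project that day, otherwise 0.
    record Fact(int dateKey, int personKey, int projectKey, int commitmentStatus) {}

    public static void main(String[] args) {
        Fact[] facts = {
            new Fact(20200101, 1, 1, 1), new Fact(20200101, 1, 2, 0),
            new Fact(20200102, 1, 1, 1), new Fact(20200102, 1, 2, 0),
            new Fact(20200103, 1, 1, 0), new Fact(20200103, 1, 2, 1),
        };

        // Days per (person, project) = SUM(commitment_status) over the period.
        Map<String, Integer> days = new HashMap<>();
        for (Fact f : facts) {
            days.merge("person " + f.personKey() + ", project " + f.projectKey(),
                       f.commitmentStatus(), Integer::sum);
        }
        days.forEach((k, v) -> System.out.println(k + ": " + v + " day(s)"));
    }
}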
I have a line chart (x represents date, y represents the number of car rentals on that date) that needs to be connected at all times, since the values are all valid: there is always at least one car rental on each plotted date. The only time the line shouldn't be connected, but should instead leave a gap between two valid values/points, is when the two successive dates are too wide apart. I have to figure out the best algorithm for what 'too wide apart' means and, based on these dates (or something), set a parameter. I don't know all the possible combinations of dates, but I think they can be anything:
2010 2011 2013 2018 2019
or
1990 2001 2002 2012 2015
or
possibly anything else
Is there any standard way to deal with this kind of problem?
The problem is to characterize what it means to be too wide apart. One solution is to build a histogram (i.e. a probability density function) of the differences between consecutive x coordinates of the data points, and then to consider as too wide those differences that fall in, say, the top 33% (or whatever other proportion you wish).
For example, suppose the x coordinates are the years:
1990 1995 2001 2002 2003 2010 2011 2012 2013 2017 2019
Let's say we calculate date differences in years (we could choose any other unit of duration). We calculate the differences between consecutive values above and build the histogram below.
Diff.:  1 2 3 4 5 6 7
Counts: 5 1 0 1 1 1 1
Now, if we choose to disconnect the differences in the top 33%, the histogram shows that differences greater than or equal to 5 years would be disconnected.
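A minimal sketch of this approach in Java, assuming the x coordinates are years and a fixed top-33% cutoff:

import java.util.Arrays;

public class GapThreshold {
    public static void main(String[] args) {
        int[] years = {1990, 1995, 2001, 2002, 2003, 2010, 2011, 2012, 2013, 2017, 2019};

        // Differences between consecutive x values.
        int[] diffs = new int[years.length - 1];
        for (int i = 1; i < years.length; i++) diffs[i - 1] = years[i] - years[i - 1];

        // The threshold is the smallest difference that falls in the top 33%
        // of the sorted differences; for the data above this comes out to 5.
        int[] sorted = diffs.clone();
        Arrays.sort(sorted);
        int k = Math.max(1, (int) Math.floor(sorted.length * 0.33));
        int threshold = sorted[sorted.length - k];

        // Any pair of consecutive points at or beyond the threshold gets a gap.
        for (int i = 0; i < diffs.length; i++) {
            if (diffs[i] >= threshold)
                System.out.printf("gap between %d and %d (diff %d)%n",
                    years[i], years[i + 1], diffs[i]);
        }
    }
}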
I want to generate a report in OBIEE by grouping countries and showing their cumulative sums.
I tried creating bins: for China I created a bin containing Singapore, Taiwan, and China, and another bin for Japan containing some other countries. Using a pivot table I can show the sum of customers in a region by date for these two bins, but when I need a cumulative sum for every bin it gives weird values.
Number of employees by region and date, where China and Japan are bins:
China Japan
01-Nov-18 1 3
02-Nov-18 2 4
03-Nov-18 1 1
04-Nov-18 2 5
05-Nov-18 4 7
06-Nov-18 5 7
Whereas the result I want is the following (how can I achieve this?):
China Japan
01-Nov-18 1 3
02-Nov-18 3 7
03-Nov-18 4 8
04-Nov-18 6 13
05-Nov-18 10 20
06-Nov-18 15 27
The measure has to be a running sum in that case: RSUM(YourMeasure).
Alternatively, make a duplicate of the measure column, click on the duplicated column, and select the option 'Display as Running Sum'.
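To illustrate what the running sum computes, here is the same logic in plain Java using the numbers from the question; both RSUM and the 'Display as Running Sum' option produce exactly this:

public class RunningSum {
    public static void main(String[] args) {
        int[] china = {1, 2, 1, 2, 4, 5};
        int[] japan = {3, 4, 1, 5, 7, 7};

        // Each output row is the sum of the measure from the first row up
        // to and including the current one.
        int chinaSum = 0, japanSum = 0;
        for (int i = 0; i < china.length; i++) {
            chinaSum += china[i];
            japanSum += japan[i];
            System.out.println(chinaSum + " " + japanSum);
        }
        // Prints 1 3, 3 7, 4 8, 6 13, 10 20, 15 27 -- the desired table.
    }
}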
When comparing two objects of the same size, JaVers compares 1-to-1. However, if a new change is introduced, such as a new row added to one of the objects, the comparison reports changes that are NOT changes. Is it possible to have JaVers ignore the addition/deletion so that it compares only like objects?
Basically the indices get out of sync.
Row Name Age Phone(Cell/Work)
1 Jo 20 123
2 Sam 25 133
3 Rick 30 152
4 Rick 30 145
New List
Row Name Age Phone(Cell/Work)
1 Jo 20 123
2 Sam 25 133
3 Bill 30 170
4 Rick 30 152
5 Rick 30 145
Because Bill was added, the new comparison result will say that rows 4 and 5 have changed when they actually didn't.
Thanks.
I'm guessing that your 'rows' are objects representing rows in an Excel table and that you have mapped them as ValueObjects and put them into some list.
Since ValueObjects don't have their own identity, it's unclear, even for a human, what the actual change was. Take a look at your row 4:
Row Name Age Phone(Cell/Work)
before:
4 Rick 30 145
after:
4 Rick 30 152
Did you change Phone at row 4 from 145 to 152? Or did you insert new data at row 4? How can we know?
We can't. By default, JaVers chooses the simplest answer, so it reports a value change at index 4.
If you don't care about the indices, you can change the list comparison algorithm from Simple to Levenshtein distance. See https://javers.org/documentation/diff-configuration/#list-algorithms
The SIMPLE algorithm generates changes for shifted elements (when elements are inserted or removed in the middle of a list). By contrast, the Levenshtein algorithm calculates a short and clear change list even when elements are shifted; it doesn't care about index changes for shifted elements.
But I'm not sure whether Levenshtein is implemented for ValueObjects; if it is not implemented yet, that would be a feature request for javers-core.
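For reference, switching the algorithm is done when building the JaVers instance, following the configuration shown in the linked documentation:

import org.javers.core.Javers;
import org.javers.core.JaversBuilder;
import org.javers.core.diff.ListCompareAlgorithm;

public class JaversConfig {
    public static void main(String[] args) {
        // Use Levenshtein list comparison instead of the default SIMPLE
        // algorithm, so inserted rows don't shift every later index.
        Javers javers = JaversBuilder.javers()
                .withListCompareAlgorithm(ListCompareAlgorithm.LEVENSHTEIN_DISTANCE)
                .build();

        // javers.compare(oldObject, newObject) would now match shifted
        // list elements instead of reporting changes at every index.
    }
}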
I have a panel dataset and I want to drop those respondents that were aged 40 years and over in their first round of the survey.
I tried doing drop if age>40 and also drop if age>40 & t==1, where t is an identifier of the survey wave the person is in. However, after the second command I am still left with people over the age of 40.
Here is an example of how my data looks like:
pid age wave year of survey
1 20 1 2005
1 21 2 2006
1 22 3 2007
1 23 4 2008
2 37 1 2006
2 38 2 2007
2 39 3 2008
2 40 4 2009
3 40 1 2008
3 41 2 2009
3 42 3 2010
3 43 4 2011
My aim is not to lose the 3rd respondent, given that he/she was within my target age group when first surveyed, even though he/she was not in the following survey years (rather than being left with only his/her first wave of data and dropping the other three rows, which is what happens if I simply do drop if age > 40).
Is there another way to be left with only people up to the age of 40, while keeping those who were 40 in their first wave even if they turn 41, 42, etc. in subsequent waves? I basically want to constrain my panel to the up-to-40 age group while keeping those who were 40 in their first wave but are over 40 in subsequent waves (I only have 4 waves).
Stata gives you exactly what you're asking for. With drop if age > 40 you simply lose any observation for which age > 40. With drop if age > 40 & wave == 1 you add an additional condition: drop it if it simultaneously has wave == 1. I think that's clear.
I find your explanation somewhat contradictory. You don't want to lose any observation from respondent 3 because in her first wave she's not over 40, although she is in her following waves. But then you say you want to be left with only people up to the age of 40.
The following just drops all observations for any person who in her first wave is over 40. Let us know if this is not what you seek.
clear all
set more off
input ///
pid age wave survyear
1 20 1 2005
1 21 2 2006
1 22 3 2007
1 23 4 2008
2 37 1 2006
2 38 2 2007
2 39 3 2008
2 40 4 2009
3 40 1 2008
3 41 2 2009
3 42 3 2010
3 43 4 2011
4 42 1 2009
4 43 2 2010
4 44 3 2011
4 45 4 2012
end
list, sepby(pid)
*-----
bysort pid (age): drop if age[1] > 40
list, sepby(pid)
You probably want to read Speaking Stata: How to move step by: step, by Nick Cox. See also help subscripting.
Edit
With no knowledge of the dataset structure, sorting by wave should be a more general approach: that is, bysort pid (wave): drop if age[1] > 40 in the previous code. Imagine a case where a person has the same age in two consecutive waves; sorting by age would then not give consistent results. The wave variable is likely the one that uniquely identifies observations within each person. Read help sort and help isid carefully, including the manual entries.