Linq to Excel ignoring header rows and using subheaders - linq

I'm looking at Linq to Excel tutorials and they all seem pretty simple and straightforward, except that all of them assume the Excel table being used has all column headers neatly placed on row 1 and starting at column A.
I need to query data from Excel files where the tables not only start around row 6 (some may start at lower rows) but also have headers and subheaders (headers represent a specific place/company; subheaders represent column values for that place like id, stock remaining, sales made, etc.).
Is there any way to specify for the query which row holds the headers I want to use so it only takes information from beneath them?

Can you just skip the number of rows you don't care about?
rows.Skip(1).Select(r => // Rest of your stuff here...
Better yet, query the appropriate range from the start like the LinqToExcel wiki suggests:
//Selects data within the B3 to G10 cell range
var indianaCompanies = from c in excel.WorksheetRange<Company>("B3", "G10")
where c.State == "IN"
select c;
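If the headers and subheaders still need to be combined after the preamble rows are skipped, the idea can be sketched in a few lines. The Python sketch below is only an illustration and is not part of the LinqToExcel API: it assumes the sheet has already been read into a list of rows, that the header row index is known, and that the subheader row sits directly beneath it. The names combine_headers and rows_below_headers, and the sample sheet, are hypothetical.

```python
from typing import Any, Dict, List, Optional

Row = List[Optional[Any]]


def combine_headers(header: Row, subheader: Row) -> List[str]:
    """Build one column name per column from a header row and a subheader row.

    A header (e.g. a company name) usually sits only above its first
    subheader, so blank header cells inherit the last non-blank header.
    """
    names: List[str] = []
    current = ""
    for top, sub in zip(header, subheader):
        if top not in (None, ""):
            current = str(top)
        sub_text = "" if sub is None else str(sub)
        names.append(f"{current}.{sub_text}" if current else sub_text)
    return names


def rows_below_headers(sheet: List[Row], header_index: int) -> List[Dict[str, Any]]:
    """Ignore everything above header_index and return the data rows as dicts."""
    names = combine_headers(sheet[header_index], sheet[header_index + 1])
    data = sheet[header_index + 2:]
    return [dict(zip(names, row)) for row in data]


# Hypothetical sheet: two preamble rows, then headers at row index 2.
sheet = [
    ["Quarterly report", None, None, None],
    [None, None, None, None],
    ["Acme", None, "Globex", None],
    ["id", "stock", "id", "stock"],
    [1, 40, 7, 12],
    [2, 15, 8, 30],
]
records = rows_below_headers(sheet, header_index=2)
print(records[0])
```

Each record is keyed by "header.subheader", so a query can filter on one company's columns without caring where the table starts.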

Related

Google Sheet Query: Select misses data when there are different data type in a column?

I have a table like this:
a    b    c
1    2    abc
2    3    4.00
Note that C2 is text while C3 is a number.
When I do
=QUERY(A1:C,"select *")
The result is like
a    b    c
1    2
2    3    4.00
The "text" in C2 has been missed. You can see the live sheet here:
https://docs.google.com/spreadsheets/d/1UOiP1JILUwgyYUsmy5RzQrpGj7opvPEXE46B3xfvHoQ/edit?usp=sharing
How to deal with this issue?
QUERY is very useful, but it has one main limitation: it can only handle one kind of data per column. Values of any other type are left blank. There are usually ways to try to overcome this from inside QUERY, but I've found them unfruitful. What you can do instead is simply use:
={A:C}
You can work with FILTER on its own, but here is a step-by-step way to reproduce the main features of QUERY. If you need to add conditions, use LAMBDA, INDEX and FILTER.
For example, to check where A is not empty (INDEX(quer,,1) accesses the first column):
=LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C})
Where B is greater than one cell (D1) and less than another (D2):
=LAMBDA(quer,FILTER(quer,INDEX(quer,,2)>D1,INDEX(quer,,2)<D2))({A:C})
To sort and limit the number of items, use SORTN. For example, to sort by the 3rd column and keep the 5 highest values in that column:
=LAMBDA(quer,SORTN(FILTER(quer,INDEX(quer,,1)<>""),5,1,3,0))({A:C})
Or, to limit to 5 rows without sorting, use ARRAY_CONSTRAIN (it takes both a row count and a column count, here 5 rows and the 3 columns of A:C):
=ARRAY_CONSTRAIN(LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C}),5,3)
There are other options: you can use REGEXMATCH and other functions to emulate QUERY's features without losing data. Let me know!
shenkwen,
If you are comfortable with adding a Google Apps Script to your sheet to provide a custom function, I have a QUERY replacement function that supports all standard SQL SELECT syntax. It does not analyze the column data and force each column to its most common type, so this is not an issue.
The custom function code is a single file, available at:
https://github.com/demmings/gsSQL/tree/main/dist
After you save, you have a new function available in your sheet. In your example, the syntax would be
=gsSQL("select a,b,c from testTable", {{"testTable", "F150:H152", 60, true}})
If your data is on a separate tab called 'testTable'(or whatever you want), the second parameter is not required.
I have typed in your example data into my test sheet (see line 150)
https://docs.google.com/spreadsheets/d/1Zmyk7a7u0xvICrxen-c0CdpssrLTkHwYx6XL00Tb1ws/edit?usp=sharing

Power query - strategy for handling repeating rows

Given a report which has a table with repeated row headings, is there a good strategy for using Power Query/M to extract the data in a clean format?
For example the report available here, has an excel file (which at time of writing is pointing to August 2021):
https://www.opec.org/opec_web/static_files_project/media/downloads/publications/MOMR%20Appendix%20Tables%20(August%202021).xlsx
In this example:
we have the World demand table portion
and the Non-OPEC Liquids production portion.
Both of these have the rows Americas/Europe/Asia Pacific, which makes them hard to distinguish in Power Query.
What is the right approach to extract data from this type of table?
I would add a column ... custom column ... with formula
=if [2018] = null then [Column] else null
and then right click the new column and fill down
That would put World Demand and non-OPEC as a column that you could additionally filter on
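The same fill-down idea can be sketched outside Power Query. The Python sketch below is an illustration of the technique, not M code: it assumes each row is a dict, that a heading row is one whose data cell is empty, and it drops the heading rows after using them (in Power Query you would filter them out as a separate step). The function name tag_sections and the numbers are made up and are not taken from the OPEC report.

```python
from typing import Any, Dict, List, Optional


def tag_sections(rows: List[Dict[str, Any]], label_key: str, data_key: str,
                 section_key: str = "Section") -> List[Dict[str, Any]]:
    """Fill section headings down onto the data rows beneath them.

    A row whose data cell is None is treated as a section heading
    (e.g. "World demand"); its label is carried onto the following rows.
    """
    current: Optional[str] = None
    tagged: List[Dict[str, Any]] = []
    for row in rows:
        if row.get(data_key) is None:
            current = row[label_key]  # remember the heading
            continue                  # the heading row itself holds no data
        tagged.append({**row, section_key: current})
    return tagged


# Placeholder data shaped like the report: repeated region rows per section.
rows = [
    {"Column": "World demand", "2018": None},
    {"Column": "Americas", "2018": 1.0},
    {"Column": "Europe", "2018": 2.0},
    {"Column": "Non-OPEC liquids production", "2018": None},
    {"Column": "Americas", "2018": 3.0},
]
tagged = tag_sections(rows, label_key="Column", data_key="2018")
for row in tagged:
    print(row["Section"], row["Column"], row["2018"])
```

The two "Americas" rows are now distinguishable by their Section value.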

Matching (querying?) criteria with IMPORTRANGE

Forgive me if I am not using the correct terminology; I sort of crash-coursed myself in Google Sheets a few days ago.
Is there a way that I could use IMPORTRANGE to import a data range from spreadsheet 2 into spreadsheet 1, where the range selected from spreadsheet 2 can be matched against criteria in spreadsheet 1 that correspond to criteria in spreadsheet 2? I have a specific set of data in spreadsheet 1 that, while the same in content, is not in the same order as spreadsheet 2 (which I don't maintain myself) or spreadsheet 3 (which is maintained by someone other than myself or the person who maintains spreadsheet 2), but I am being given access to spreadsheet 2 and spreadsheet 3 data that I didn't previously have.
EXAMPLE:
https://docs.google.com/spreadsheets/d/1ByN9Ju8QiiHTfFgow7lDF4VN-zBRqP1gzpAK73ZRBNg/edit?usp=sharing
You work with IMPORTRANGE content the same way as you do with any range within your spreadsheet. Good practice is to use columns with unique content as ID's for searching, filtering, etc.
If you want to put the content of somebody's spreadsheet into yours, you can control it.
For example:
In order to get REGISTRATION number from sheet3
Think of the VLOOKUP construction:
=VLOOKUP(key, table with the key value in the leftmost column, number of the column to take the value from, false)
You use a VLOOKUP formula that takes the name in your table as the key (the first parameter of the formula), so you must rebuild your IMPORTRANGE to have the key in the leftmost column.
The 2nd parameter of VLOOKUP will look like this:
{importrange("Sheet3url","Sheet!Columnwithname"),importrange("Sheet3url","Sheet!Columnwithregistration")}
This is your temporary table made of 2 importranges.
You want the 2nd column of this construction, which is the column with the registration.
The whole VLOOKUP looks like this:
=VLOOKUP(key,{importrange("Sheet3url","Sheet!Columnwithname"),importrange("Sheet3url","Sheet!Columnwithregistration")},2,false)
It's much easier when the key is on the left. If you want to extract SEX and DOB you use:
=VLOOKUP(key,importrange("Sheet3url","Sheet!Columnsfromname to DOB"),2 and then 3,false)
Beware - using multiple importrange makes your sheet slow.
If you have hundreds of rows, you should wrap it in ARRAYFORMULA to work with all rows in one go.
You can also first import somebody's table into a side area of your own sheet with IMPORTRANGE and operate inside your sheet.
This is advisable when you work with big datasets and not that many files.
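For readers who think in code, the VLOOKUP-over-a-rebuilt-table idea above can be sketched as follows. This Python sketch is an illustration only and does not call Google Sheets: it assumes the name column and the registration column have already been imported as two lists of equal length. The names build_lookup and the sample people are hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def build_lookup(keys: Sequence[Hashable], values: Sequence[object]) -> Dict[Hashable, object]:
    """Pair a key column with a value column, like {names, registrations}.

    An exact-match VLOOKUP returns the first matching row, so when a key
    repeats, the earliest row wins here as well.
    """
    table: Dict[Hashable, object] = {}
    for key, value in zip(keys, values):
        table.setdefault(key, value)
    return table


# Hypothetical columns from the other spreadsheet, in its own order.
their_names = ["Bob", "Alice", "Carol"]
their_registrations = ["R-2", "R-1", "R-3"]

# Names in my spreadsheet, in a different order, with one name they lack.
my_names = ["Alice", "Bob", "Dave"]

lookup = build_lookup(their_names, their_registrations)
my_registrations: List[Optional[object]] = [lookup.get(name) for name in my_names]
print(my_registrations)
```

Building the table once and looking up every row against it mirrors the advice to import the range once rather than calling IMPORTRANGE per row.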

Excel 2017 Formula - Average data by month, while being filterable

I'm not a VBA coder and would prefer an Excel formula if possible; the easiest solution will be the best one.
Test workbook screenshot
As you can see, I have plenty of columns, which are filterable.
I am attempting to retrieve an average of Column L, but I want the data to be calculated for the correct month in G3:R3.
The resulting calculation needs to be recalculated when filtered, between customers, sites, status, job type etc.
I am referencing the resulting cells in another sheet, which gives an idea of trends I can glance at, so filtering by month in each sheet is not an option.
=AVERAGE(IF(MONTH(E9:E1833)=1,(J9:J1833)))
This one does not update with the filtered data.
=SUM(IF(MONTH(E9:E1833)=1,J9:J1833,0)) /SUM(IF(MONTH(E9:E1833)=1,1))
This one does not update with the filtered data.
I have tried 5 different SUBTOTAL formulas, some with OFFSET, none of these produce the same result I get when checking manually.
Each worksheet has over 1,500 rows; the largest has 29,148 rows. The data goes back as far as 2005.
Please can someone help me find a solution?
One possible solution is to create a helper column which returns 1 if the row is visible and returns 0 if the row is invisible (or blank). This allows a bit more freedom in your formulas.
For example, if you want to create a helper column in column X, type this into cell X9 and drag down:
= SUBTOTAL(103,A9)
Now you can create a custom average formula, for example:
= SUMPRODUCT((MONTH(E9:E1833)=1)*(X9:X1833)*(J9:J1833))/
SUMPRODUCT((MONTH(E9:E1833)=1)*(X9:X1833))
Not exactly pretty but it gets the job done. (Note this is an array formula, so you must press Ctrl+Shift+Enter on your keyboard instead of just Enter after typing this formula.)
With even more helper columns you could avoid SUMPRODUCT altogether and just accomplish this by doing a single AVERAGEIFS.
For example if you type into cell Y9 and drag down:
= MONTH(E9)
Then your formula could be:
= AVERAGEIFS(J9:J1833,X9:X1833,1,Y9:Y1833,1)
There isn't a clean way to do this without at least one helper column (if you want to avoid VBA).
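To make the arithmetic behind the SUMPRODUCT formula explicit, here is a small Python sketch of the same calculation. It is an illustration, not Excel: the visible list stands in for the helper column (True where SUBTOTAL(103, ...) would return 1), and the function name visible_month_average and the sample values are hypothetical.

```python
from datetime import date
from typing import List, Optional


def visible_month_average(dates: List[date], values: List[float],
                          visible: List[bool], month: int) -> Optional[float]:
    """Average the values whose row is visible and whose date is in `month`.

    Equivalent to sum(month_match * visible * value) divided by
    sum(month_match * visible); returns None when no row qualifies.
    """
    total = 0.0
    count = 0
    for d, v, shown in zip(dates, values, visible):
        if shown and d.month == month:
            total += v
            count += 1
    return total / count if count else None


# Hypothetical rows: the second January row is hidden by a filter.
dates = [date(2017, 1, 5), date(2017, 1, 20), date(2017, 2, 3), date(2017, 1, 28)]
values = [10.0, 20.0, 99.0, 60.0]
visible = [True, False, True, True]

print(visible_month_average(dates, values, visible, month=1))  # 35.0
```

Only the visible January rows (10.0 and 60.0) contribute, which is why the result changes as filters are applied.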

How could I quickly look-up items in a List of loaded entities

I have built an MVC 5 application, using EF 6 to query the database. One page shows a cross table of two dimensions: substances against properties of these substances. It is rendered as an HTML table. Many cells do not have a value. This is what it looks like:
sub 1 sub 2 sub 3
prop A 1.0
prop B 1.5 X
prop C 0.6 Y
The cell values are actually more complex, including tool tips, footnotes, etc.
I implemented the generation of the html table, by the following steps:
1. create a list of unique properties;
2. create a list of unique substances;
3. loop through the properties;
4. render a row for each;
5. loop through the substances;
6. see if there is a value for the combination of property and substance;
7. render the cell's value or an empty one.
Using the ANTS performance profiler, I found out that step 6 has a huge performance issue with increasing numbers of substances and properties, the hit count exploding to hundreds of millions, with a few hundred substances and a few tens of properties (the largest selection the user can make). The execution time is many minutes. It seems to scale N(substances)^2 * N(properties)^2.
The code looks like:
Value currentValue =
values.Where(val => val.substance.Id == currentSubstanceId
&& val.property.Id == currentPropertyId).SingleOrDefault();
where values is a List and Value is an entity, which I read from to render the cells. values had been pre-loaded from the database and no queries are shown by the SQL Server Profiler.
Since not all cells have a value, I thought it best to loop through the row and columns and see if there is a value. I cannot just loop through the list of values.
What could I try to improve this? I thought about:
Create some sort of C# object, using the substance.Id and property.Id as a compound key and fill it from the List object. Which would be fastest?
Create some Linq query which returns an object which already contains the empty cells, like (substance cross join properties) left join values. I could do this in SQL easily, but could this be done with Linq? Could the object which stores the result have the Value as a member field, so I can still use it to render the cells?
Stop pre-loading and just run a database query for the value of each combination, possibly benefiting from database indexes.
I am considering restricting the number of substances and properties the user may select, but I would rather not do that.
Additional info
As requested by C.Zonnenberg, some more info about the query.
The query to fill the list of values is basically as follows:
I create an IQueryable to which I add filters for requested substances and properties. I then include the substances, property and value details, found in related entities. I then execute query.ToList(). The actual SQL query, as seen by the SQL Profiler, looks complex, involving SubstanceId IN () and PropertyId IN (), but it takes far less than a second to execute.
It returns a list of proxies, like: {System.Data.Entity.DynamicProxies.SubstancePropertyValue_078F758A4FF9831024D2690C4B546F07240FAC82A1E9D95D3826A834DCD91D1E}
I think your best bet is your first option. But to do that efficiently I would also modify the source data (values) and turn it into a dictionary, so you have a structure that's optimized for indexed lookup:
// Key each loaded entity by (substance id, property id) for constant-time lookup.
var dict = values.ToDictionary(
    e => Tuple.Create(e.substance.Id, e.property.Id),
    e => e);
Then for each cell:
Value currentValue;
dict.TryGetValue(Tuple.Create(currentSubstanceId, currentPropertyId),
    out currentValue);
Further, you may benefit from parallelization by fetching the cell values in a Parallel.ForEach looping through all substances, for instance.
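To show why the dictionary helps, here is a language-neutral sketch of the same technique in Python rather than C#. It is an illustration under assumptions, not the asker's code: the Value class below is a simplified stand-in for the entity, and index_values and render_grid are hypothetical names. The index is built once, so each cell costs one hash lookup instead of a scan over the whole list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Value:
    """Simplified stand-in for the loaded entity."""
    substance_id: int
    property_id: int
    text: str


def index_values(values: List[Value]) -> Dict[Tuple[int, int], Value]:
    """Build the compound-key index once, in a single pass over the list.

    Unlike C#'s ToDictionary, a duplicate key here silently keeps the
    last value instead of raising an error.
    """
    return {(v.substance_id, v.property_id): v for v in values}


def render_grid(substance_ids: List[int], property_ids: List[int],
                values: List[Value]) -> List[List[str]]:
    """Return one row per property and one cell per substance ("" if empty)."""
    by_key = index_values(values)
    grid: List[List[str]] = []
    for p in property_ids:
        row = []
        for s in substance_ids:
            found = by_key.get((s, p))  # O(1) lookup per cell
            row.append(found.text if found else "")
        grid.append(row)
    return grid


# Hypothetical sparse data: 3 substances, 2 properties, 3 filled cells.
values = [Value(1, 10, "1.0"), Value(2, 20, "1.5"), Value(3, 20, "X")]
grid = render_grid([1, 2, 3], [10, 20], values)
print(grid)
```

With this layout the total work is roughly the number of values plus the number of cells, rather than the number of cells times the number of values.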

Resources