I have the following data frame (called finaldf) that looks like this. However, the dates
are all out of order. How can I sort this data so that all of it is kept but the dates are in order?
I'm assuming you are using Python. If it's a pandas DataFrame, then
you can sort the data by the 'date' column with the following command:
finaldf = finaldf.sort_values(by="date")
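One thing to watch for: if the date column holds strings, sort_values compares them as text, which only matches chronological order for ISO-style formats. A minimal sketch of converting first, assuming a column named date and made-up sample rows (the real finaldf is not shown in the question):

```python
import pandas as pd

# Hypothetical sample data standing in for finaldf.
finaldf = pd.DataFrame({
    "date": ["2021-03-01", "2020-12-15", "2021-01-20"],
    "value": [3, 1, 2],
})

# Parse the strings so the sort is chronological rather than textual.
finaldf["date"] = pd.to_datetime(finaldf["date"])

# sort_values returns a new frame with every row kept;
# reset_index(drop=True) renumbers the rows after the sort.
finaldf = finaldf.sort_values(by="date").reset_index(drop=True)
print(finaldf)
```

All rows and columns are retained; only their order changes.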
I'm working on a data warehouse project. I'm reading input data into a frame, and then I want to filter out the bad rows. However, I want to print some sample bad rows. What I have now is
df_good = df_input.filter(((df_input.info.isNull()) | (df_input.info == '')))
This filter works, but I cannot print out a sample of the dropped records. What I would like is something like:
df_keep, df_reject = df_input.filter_split(((df_input.info.isNull()) | (df_input.info == '')))
print("Sample rejected records:")
df_reject.show(5)
I found one method which involves running the filter, then joining the good data back to the original data with an outer join, then filtering to find original data not-in the good data set. But this iterates over the original data twice; I would like to pass through the list just once.
Any ideas? I am doing this in AWS Glue, so I may be able to use a Dynamic Frame function.
I'm trying to bind a table and a graph using the d3 and jqGrid libraries. For that I have to get the search text typed by the user in the search box (my table looks like this: http://www.guriddo.net/demo/guriddojs/).
I've found this function:
grid.getGridParam("postData").filters
but I don't know how to use it. I thought about using the "jqGridToolbarAfterSearch" event to get the data after each search, but it doesn't seem to work...
If someone has an idea, I'd be very grateful!
Thanks.
P.S.: if a similar method exists to set the data, I'm interested too.
I hope that I understand your problem correctly. I suppose that you first convert the CSV data of the demo to a more convenient data format: an array of items with some properties (name, economy, cylinders, displacement, power, weight, mph, year). Then you can use datatype: "local" and data as the input data. I suppose that the user applies the local filter and then you want to get the filtered data.
If you use the free jqGrid fork of jqGrid (it's the fork which I develop), then you can read the lastSelectedData parameter (var filteredData = $grid.jqGrid("getGridParam", "lastSelectedData");) to get the array of filtered items (see the demo). After that you can use d3 with the filtered items.
I have a relation in Pig that looks like this:
([account_id#100,
timestamp#1434,
id#900],
[account_id#100,
timestamp#1434,
id#901],
[account_id#100,
timestamp#1434,
id#902])
As you can see, I have three map objects within a tuple. All of the data above is within the $0'th field in the relation, so the data above is in a relation with a single bytearray column.
The data is loaded as follows:
data = load 's3://data/data' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
DESCRIBE data;
data: {bytearray}
How do I split this data structure into three rows so that the output is as follows?
data: {account_id:chararray, timestamp:chararray, id:int}
(100, 1434,900)
(100, 1434,901)
(100, 1434,902)
It is very difficult to guess your problem without sample input data. If this is an intermediate result, then write it out using STORE and share the output file so that we can use it as input to try things out. I was able to solve this using STRSPLIT, but I am not sure whether you mean that the input is a single column and a single row, or three different rows with the same column.
In either case, flattening the data with the FLATTEN operator and then applying STRSPLIT should help. If I get more information and input data for the problem, I can give a working example.
Data -> FLATTEN to get out of the bag -> STRSPLIT on "," in a FOREACH ... GENERATE
Right now, I am running a sum and sort on a DataFrame object:
games_tags.groupby(['GameID', 'GameName', 'Tag']).sum().sort(['Count'], ascending=False)
The issue I'm running into is that afterwards, I want to be able to still grab each row's GameID, GameName, and Tag via row['GameID'], etc. However, I noticed that after I use the sum() method, it creates a column named 'Count', but I can no longer access any of the original columns.
I was wondering if anyone knows a workaround, or some intricacy of the sum() method that I am missing. Any help is appreciated. Thanks!
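After the groupby, the grouping keys become the index of the result. You can reset the index on that result (here game_tags is taken to hold it) to turn the keys back into regular columns:
game_tags.reset_index(inplace=True)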
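A minimal end-to-end sketch of this, using made-up rows in place of the real games_tags. It uses sort_values because DataFrame.sort has been removed from current pandas releases; the name summed is only illustrative:

```python
import pandas as pd

# Hypothetical sample data standing in for games_tags.
games_tags = pd.DataFrame({
    "GameID": [1, 1, 2, 2],
    "GameName": ["Chess", "Chess", "Go", "Go"],
    "Tag": ["board", "board", "board", "classic"],
    "Count": [2, 3, 1, 4],
})

summed = (
    games_tags.groupby(["GameID", "GameName", "Tag"])
    .sum()
    .reset_index()  # group keys move from the index back into columns
    .sort_values("Count", ascending=False)
)

# Each row now exposes the original columns again.
for _, row in summed.iterrows():
    print(row["GameID"], row["GameName"], row["Tag"], row["Count"])
```

Passing as_index=False to groupby is another way to keep the keys as columns without the extra reset_index call.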
To improve my skills with Hector and Cassandra, I'm trying different methods of querying data out of Cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages in chronological order with the last posted message first.
In plain SQL it is possible to use 'order by'. I know it is possible if you use the OrderPreservingPartitioner, but this partitioner is deprecated and less efficient than the RandomPartitioner. I thought of creating an index on a secondary column with a timestamp as the value, but I can't figure out how to obtain the data. I'm sure that I have to use at least two queries.
My column Family looks like this:
create column family messages
with comparator = UTF8Type
and key_validation_class=LongType
and compression_options =
{sstable_compression:SnappyCompressor, chunk_length_kb:64}
and column_metadata = [
{column_name: message, validation_class: UTF8Type}
{column_name: index, validation_class: DateType, index_type: KEYS}
];
I'm not sure if I should use DateType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible, I'd like to know how it's done both with the CQL syntax and without.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
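The bucketing idea can be sketched without any Cassandra client. The example below is only an in-memory illustration under assumed names: the day-based row key format, the store dictionary, and the functions bucket_key, buckets_newest_first, post and latest are all hypothetical, and a plain dict stands in for the column family.

```python
from datetime import datetime, timedelta

def bucket_key(ts):
    """Row key of the day bucket a message belongs to (assumed format)."""
    return ts.strftime("%Y%m%d")

def buckets_newest_first(start, end):
    """Row keys covering start..end, most recent day first."""
    keys = []
    day = end.date()
    while day >= start.date():
        keys.append(day.strftime("%Y%m%d"))
        day -= timedelta(days=1)
    return keys

# In-memory stand-in for the column family: row key -> {timestamp: message}.
store = {}

def post(ts, message):
    """Store a message as a column keyed on its timestamp."""
    store.setdefault(bucket_key(ts), {})[ts] = message

def latest(start, end, limit):
    """Walk the buckets newest first and collect up to `limit` messages."""
    out = []
    for key in buckets_newest_first(start, end):
        row = store.get(key, {})
        # Cassandra keeps columns sorted; here we sort explicitly.
        for ts in sorted(row, reverse=True):
            if start <= ts <= end:
                out.append(row[ts])
                if len(out) == limit:
                    return out
    return out

post(datetime(2012, 1, 1, 9, 0), "first")
post(datetime(2012, 1, 1, 18, 0), "second")
post(datetime(2012, 1, 2, 8, 0), "third")
print(latest(datetime(2012, 1, 1), datetime(2012, 1, 3), 2))
```

The client decides which buckets to read, so the bucket size (day, hour, minute) trades row width against the number of rows touched per query.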
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/