I have a use case where I have millions of small files in S3 which need to be processed by Spark. I have two options to reduce the number of tasks:
1. Use Coalesce
2. Extend CombineFileInputFormat
But I'm not clear on the performance implications of both, or on when to use one over the other.
Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, and I'm not sure how to pass a configurable maxSplitSize.
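My current thinking is that the Hadoop Configuration object passed to newAPIHadoopRDD is the place to put the split size, since (as far as I can tell) CombineFileInputFormat falls back to the mapreduce.input.fileinputformat.split.maxsize property when no maximum split size is set programmatically, but I'm not sure that's the right property. A rough sketch of what I mean, where the bucket path, the 128 MB figure, and the use of Hadoop's bundled CombineTextInputFormat are only placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CombineSmallFiles {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("combine-small-files"));

        Configuration conf = new Configuration();
        // Hypothetical input location; replace with the real bucket/prefix.
        conf.set("mapreduce.input.fileinputformat.inputdir", "s3a://my-bucket/small-files/");
        // CombineFileInputFormat should pick this up when no max split size is
        // set programmatically; 128 MB per combined split here.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);

        // CombineTextInputFormat is Hadoop's concrete CombineFileInputFormat for
        // line-oriented text; a custom subclass would go here instead if needed.
        JavaPairRDD<LongWritable, Text> rdd = sc.newAPIHadoopRDD(
                conf, CombineTextInputFormat.class, LongWritable.class, Text.class);

        System.out.println("Partitions: " + rdd.getNumPartitions());
        sc.stop();
    }
}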
Another great option to consider for such scenarios is SparkContext.wholeTextFiles(), which makes one record for each file, with its name as the key and its content as the value -- see the documentation.
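For example, a minimal sketch (the bucket path and the minPartitions hint of 64 are just placeholders); keep in mind that each file is read into memory as a single record, so this suits genuinely small files:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WholeTextFilesExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("whole-text-files"));

        // One (path, content) record per file; minPartitions suggests how many
        // partitions Spark should create from the file listing.
        JavaPairRDD<String, String> files =
                sc.wholeTextFiles("s3a://my-bucket/small-files/", 64);

        // Runs on the executors; just to show the (key, value) shape of each record.
        files.foreach(kv ->
                System.out.println(kv._1() + " -> " + kv._2().length() + " bytes"));

        sc.stop();
    }
}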
I am trying to implement a gem5 version of HybCache as described in HYBCACHE: Hybrid Side-Channel-Resilient Caches for Trusted Execution Environments (which can be found at https://www.usenix.org/system/files/sec20spring_dessouky_prepub.pdf).
A brief summary of HybCache is that a subset of the cache is reserved for use by secure processes and is isolated. This is achieved by using a limited subset of cache ways when the process is in 'isolated' mode. Non-isolated processes use the cache normally, having access to the entire cache and using the replacement policy and associativity given in the configuration. The isolated subset of cache ways uses a random replacement policy and is fully associative. Here is a picture demonstrating the idea.
Ways 6 and 7 are grey and represent the isolated cache ways.
So, I need to manipulate the placement of data into these ways. My question is, since I have found no mention of cache ways in the gem5 code, does that mean that cache ways only exist logically? That is, do I have to manually calculate the location of each cache way? If cache ways are used in gem5, then where are they used? What is the file name?
Any help would be greatly appreciated.
This answer is only valid for the Classic cache model (src/mem/cache/).
In gem5 the number of cache ways is determined automatically from the cache size and the associativity. Check the files in src/mem/cache/tags/indexing_policies/ for the relevant code (specifically, the constructor of base.cc).
There are two ways you could tackle this implementation:
1 - Create a new class that inherits from BaseTags (e.g., HybCacheTags). This class will contain the decision of whether it should work in secure mode or not, and how to do so (i.e., when to call which indexing and replacement policy). Depending on whatever else is proposed in the paper, you may also need to derive from Cache to create a HybCache.
The new tags need one indexing policy per operation mode. One is the conventional (SetAssociative), and the other is derived from SetAssociative, where the parameter assoc makes the numSets become 1 (to make it fully associative). The derived one will also have to override at least one function, getPossibleEntries(), to only allow selecting the ways that you want. You can check skewed_assoc.cc for an example of a more complex location selection.
The new tags need one replacement policy per operation mode. You will likely just use the ones in the replacement_policies folder.
2 - You could create a HybCache based on the Cache class that has two tags, one conventional (i.e., BaseSetAssoc), and the other based on the FALRU class (rewritten to work as a, e.g., FARandom).
I believe the first option is easier and less hardcoded. FALRU has not been split into an indexing policy and replacement policy, so if you need to change one of these, you will have to reimplement it.
While implementing this you may encounter coherence faults. If that happens, it is most likely a problem in the indexing logic, and I wouldn't go looking for issues in the coherence model.
When creating Apache NiFi controller services, I'm interested in hearing about when it makes sense to create new ones and when to re-share existing ones.
Currently I have a CsvReader and CSVRecordSetWriter at the root process group and they are reused heavily in child process groups. I have tried to set them up to be as dynamic and flexible as possible to cover the widest range of use cases. I am currently setting the Schema Text property in each like this:
Reader Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.reader})}
Writer Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.writer})}
A very common pattern I have is to map files with different fields from different sources into a common format (common schema). So one thought is to use the ConvertRecord or UpdateRecord processors with the avro.schema.reader and avro.schema.writer attributes set to the input and output schemas. Then I would have the writer always set the avro.schema attribute, so any time I read records again further along in a flow it would default to using avro.schema. It feels dirty to leave the reader and writer schema attributes hanging around, though. Is there a better way from an architecture standpoint? Why have tons of controller services hanging around at different levels? Aside from some settings that may need to differ for certain use cases, am I missing anything?
I'm also curious to hear how others organize their schemas. I don't have a need to reuse them at disparate locations across different processor blocks or to reference different versions, so it seems like a waste to centralize them or maintain a schema registry server that will also require upgrades and maintenance, when I can just use AvroSchemaRegistry.
In the end, I decided it made more sense to split the controller into two controllers. One for conversions from Schema A to Schema B and another for using the same avro.schema property as normal/default readers and writers do when adding new ones. This allows for explicitly choosing the right pattern at processor block configuration time rather than relying on the implicit configuration of a single processor. Plus you get the added benefit of not stopping all flows (just a subset) when you only need to tweak settings on one of those two patterns.
Hadoop's Distributed Cache lets the developer add small files to the MR context which can be used to obtain additional information during Map or Reduce phases. However, I did not find a way to access this cache in a Partitioner. I need the contents of a small file (the output of an earlier MR job) in a custom Partitioner to determine how the keys are sent to the reducers.
Unfortunately, I cannot find any useful documentation on this, and my only idea is currently a somewhat "hackish" approach, which involves serializing the contents of the file to a Base64 string and putting it into the Configuration. Configurations can be used in a partitioner by letting it implement Configurable. While the file is small enough for this approach (around 50KB) I suppose the distributed cache is better suited for this.
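Roughly, the driver side of that idea would look like the sketch below; the property name, the HDFS path of the earlier job's output, and the use of java.util.Base64 are all just illustrative choices:

import java.io.ByteArrayOutputStream;
import java.util.Base64;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapreduce.Job;

public class DriverSnippet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Read the small HDFS file produced by the earlier MR job into memory...
        Path lookup = new Path("/jobs/previous/part-r-00000"); // hypothetical path
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (FSDataInputStream in = lookup.getFileSystem(conf).open(lookup)) {
            IOUtils.copyBytes(in, buffer, conf, false);
        }

        // ...and smuggle it into the job configuration as a Base64 string. The
        // partitioner (implementing Configurable) would decode
        // conf.get("partitioner.lookup.base64") in its setConf method.
        conf.set("partitioner.lookup.base64",
                Base64.getEncoder().encodeToString(buffer.toByteArray()));

        Job job = Job.getInstance(conf, "partitioned-job");
        // ... set mapper, reducer, the custom partitioner, and input/output paths ...
    }
}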
EDIT:
I found another approach which I consider slightly better. Since the file I need to access in the partitioner is on HDFS, I put its fully-qualified URI into the Configuration. In my Partitioner's setConf method I can then re-create the Path via new Path(new URI(conf.get("some.file.key"))) and read it with the help of the Configuration. Still hackish though...
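For reference, the partitioner side of this approach looks roughly like the sketch below. The configuration key, the key/value types, and the tab-separated "key partition" file format are placeholders for my actual setup:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a partitioner that loads a small HDFS file whose URI was placed in
// the Configuration under "some.file.key". The file is assumed to contain
// "key<TAB>partition" lines.
public class LookupFilePartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {

    private Configuration conf;
    private final Map<String, Integer> partitionByKey = new HashMap<>();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        try {
            Path path = new Path(new URI(conf.get("some.file.key")));
            FileSystem fs = path.getFileSystem(conf);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");
                    partitionByKey.put(parts[0], Integer.parseInt(parts[1]));
                }
            }
        } catch (IOException | URISyntaxException e) {
            throw new RuntimeException("Could not load partitioner lookup file", e);
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Fall back to hash partitioning for keys not present in the lookup file.
        Integer target = partitionByKey.get(key.toString());
        if (target == null) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        return target % numPartitions;
    }
}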
I'm writing an application that uses Hector to access a Cassandra database. I have some situations where I only need to query one column, and some where I need to query multiple columns at once. Writing one method that takes an array of column names and returns a list of columns using SliceQuery would be simplest in terms of code, but I'm wondering whether there's a significant drawback to using SliceQuery for one column compared to using ColumnQuery.
In short, are there enough (or any) performance benefits of using ColumnQuery over SliceQuery for one column to make it worth the extra code to deal with a one-column case separately?
Looking at Hector's code, the difference between using a ColumnQuery (ThriftColumnQuery.java) and a SliceQuery (ThriftSliceQuery.java) is the different Thrift command being sent - "get" or "get_slice" (respectively).
I didn't find exact documentation of how each of these operations is implemented by Cassandra's server, but I took a quick look at Cassandra's sources, and after examining CassandraServer.java I got the impression that the "get" operation is there more for the client's convenience than for better performance when querying a single column:
For a "get" request, a SliceByNamesReadCommand instance is created and executed.
For a "get_slice" request (assuming you're using Hector's setColumnNames method and not setRange), a SliceByNamesReadCommand instance is created for each of the wanted columns and then executed (the row is read only once though).
Bottom line, as far as I see it there's not much more than the (negligible) overhead of creating some collections meant for handling the multiple columns.
If you're still worried, however, I believe it shouldn't be too difficult to handle the two cases differently when wrapping the use of Hector in your DAOs.
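For example, a rough sketch of such a wrapper, assuming string keys, column names, and values (the keyspace wiring and the column family name are placeholders):

import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.SliceQuery;

public class UserDao {

    private static final StringSerializer SS = StringSerializer.get();
    private static final String COLUMN_FAMILY = "Users"; // placeholder

    private final Keyspace keyspace;

    public UserDao(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    // Single column: uses ColumnQuery, i.e. a Thrift "get".
    public HColumn<String, String> getColumn(String rowKey, String columnName) {
        ColumnQuery<String, String, String> query =
                HFactory.createColumnQuery(keyspace, SS, SS, SS);
        query.setColumnFamily(COLUMN_FAMILY);
        query.setKey(rowKey);
        query.setName(columnName);
        return query.execute().get();
    }

    // Multiple columns: uses SliceQuery with setColumnNames, i.e. a Thrift "get_slice".
    public List<HColumn<String, String>> getColumns(String rowKey, String... columnNames) {
        SliceQuery<String, String, String> query =
                HFactory.createSliceQuery(keyspace, SS, SS, SS);
        query.setColumnFamily(COLUMN_FAMILY);
        query.setKey(rowKey);
        query.setColumnNames(columnNames);
        ColumnSlice<String, String> slice = query.execute().get();
        return slice.getColumns();
    }
}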
Hope I managed to help.
I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case, or if using the soCaseInsensitive flag (and the loCaseInsensitive flag in calls to Locate) carries only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving soCaseInsensitive off and just converting the case myself.
IMHO, the best approach is to ensure data quality at Post time. Reasoning:
You (usually) know the nature of the data. So, e.g., you can use UpperCase (knowing that GUIDs are all in the ASCII range) instead of the much slower AnsiUpperCase, which a general component like TdxMemDataSet is forced to use.
You enter the data only once, whereas searching/sorting/filtering (which all rely on the internal uppercasing engine of TdxMemDataSet) are repeated actions. Also, there are other chained actions which will trigger this engine without you realizing it (e.g. a TcxGrid which is sorted by default, has GridMode := True (I assume that you use the DevExpress components), and has a class acting like a broker passing the sort message to the underlying dataset).
Usually data entry is done in steps, one or a few records per batch. The only notable exception is data acquisition applications. But in both cases above, users' expectations allow much greater response times for you to play with. (IOW, how much would an UpperCase call add to a record post that lasts 0.005 ms?) OTOH, users are very demanding about the speed of data retrieval operations (searching, sorting, filtering, etc.). Keep data retrieval as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when you write (if you write) other modules, since otherwise you need to remember to AnsiUpperCase the data in every module, in every language you write. A classic example here is when you use other external tools to access the data (e.g. DB managers executing an SQL SELECT over the data).
hth.
Maybe the DevExpress forums (or even a support email, if you have access to it) would be a better place to seek an authoritative answer to that performance question.
Anyway, it's better to guarantee that data is in the format you want (for the reasons plainth already explained) the moment you save it. So, in this specific case, make sure the GUID is written in upper (or lower, it's a matter of taste) case. If it is SQL Server or another database server that has a GUID datatype, make sure the SELECT does the work, and, if applicable and possible, even the sort.