Usage#
Each dataset is defined by creating a subclass of DataInterface. The
class definition will contain information on:
where the data is located: on disk or in a remote store
how to load or write data: with which library and function, how to post-process it, etc.
The interface can then be re-used in different scripts so that eventually, you can get your data with two simple lines:
di = MyDataInterface()
di.get_data()
Each instance of that subclass corresponds to a set of parameters that can be used to change aspects of the interface on the fly: choose only files for a specific year, change the method to open data, etc.
Module system#
Features of the interface are split into individual modules that can be swapped
or modified.
The default DataInterface has four modules. Each module has an
attribute where the module instance can be accessed, and an attribute where its
type can be changed:
| Instance attribute | Type attribute | Class | Function |
|---|---|---|---|
| parameters | Parameters | | manage parameters |
| source | Source | | manage data source |
| loader | Loader | | load data |
| writer | Writer | | write data |
To change a module, we only need to change the module type. It can be a simple attribute change, or a class definition with the appropriate name, like so:
class MyDataInterface(DataInterface):
    # simple attribute change
    Parameters = ParametersDict

    # or a more complex definition
    class Source(SimpleSource):
        def get_source(self):
            ...

# we can then access the modules
di = MyDataInterface()
di.source.get_source()
Tip
Every module can access its parent interface via Module.di.
Parameters#
The parameters of the interface are stored in the Parameters module. They are passed as an argument to the interface on initialization.
Parameters can be stored in a simple dictionary, or using objects from the configuration part of Neba like a Section or an Application. See the existing parameters modules.
There are two ways to access parameters.
You can use the methods of the parameters module, which works like a dictionary:
di.parameters["a"] = 0  # equivalent to di.parameters.set("a", 0)
di.parameters["a"]  # somewhat equivalent to di.parameters.get("a")
di.parameters.update(a=1, b=2)
If a key is not defined, di.parameters[key] will raise KeyError, and di.parameters.get(key) will return None by default.
Or you can access the parameters container directly at di.parameters.direct. This is useful if you use a Section as a container:
di.parameters.direct.my_subsection.my_trait = 0
But using di.parameters["my_subsection.my_trait"] = 0 will still work!
Some modules use a cache that needs to be reset when parameters are changed. Both ways of accessing parameters will void the cache appropriately.
Note
This is done in two ways. The parameters module methods will simply call
DataInterface.trigger_callbacks() after their operation. For direct
access, the dictionary container has __setitem__ patched, and Section
objects have an observer event
registered. Direct access will only void the cache if a new parameter is
added, or if the new value is different from the old one.
Important
This does not include in-place operations on mutable parameters:
di.parameters["my_list"].append(1)
di.parameters["my_dict"]["key"] = 1
# or
di.parameters.direct["my_list"].append(1)
di.parameters.direct["my_dict"]["key"] = 1
will not trigger a callback.
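The reason is that only assignments on the container itself can be intercepted. The following self-contained sketch illustrates this with a toy dictionary. It is not Neba's actual container: `NotifyingDict` and `calls` are hypothetical names, and the "changed" test is a simplified assumption. It also shows a possible workaround, which is to assign a new object instead of mutating in place.

```python
class NotifyingDict(dict):
    """Toy dict that runs a callback when a key is added or its value changes."""

    def __init__(self, callback, **kwargs):
        super().__init__(**kwargs)
        self._callback = callback

    def __setitem__(self, key, value):
        changed = key not in self or self[key] != value
        super().__setitem__(key, value)
        if changed:
            self._callback()


calls = []
params = NotifyingDict(lambda: calls.append("void cache"), my_list=[0])

params["a"] = 1              # new key: the callback runs
params["a"] = 1              # same value: the callback does not run
params["my_list"].append(1)  # in-place mutation: __setitem__ is never called
n_before = len(calls)

# Workaround: assign a new object so that the change is detected
params["my_list"] = [*params["my_list"], 2]
n_after = len(calls)
```

Here only the first assignment and the final reassignment run the callback.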
Tip
The parameters module is accessible from any other module at
Module.parameters.
It might be useful to quickly change parameters, possibly multiple times,
before returning to the initial set of parameters. To this end, the method
DataInterface.save_excursion() will return a context manager that will
save the initial parameters and restore them when exiting:
di.parameters["p"] = 0
with di.save_excursion():
    # we change parameters
    di.parameters["p"] = 2
    di.get_data()
# we are back to di.parameters["p"] == 0
This is used by DataInterface.get_data_sets() that returns data for
multiple sets of parameters, for instance to get specific dates:
data = di.get_data_sets(
    [
        {"Y": 2020, "m": 1, "d": 15},
        {"Y": 2021, "m": 2, "d": 24},
        {"Y": 2022, "m": 6, "d": 2},
    ]
)
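To illustrate the save-and-restore pattern, here is a minimal standalone sketch. It is a toy under stated assumptions, not Neba's implementation: `ToyInterface` is a hypothetical class, parameters are a plain dictionary, and "loading data" only formats a date string.

```python
import copy
from contextlib import contextmanager


class ToyInterface:
    """Toy stand-in for an interface, with a plain dict as parameters."""

    def __init__(self, **params):
        self.parameters = dict(params)

    @contextmanager
    def save_excursion(self):
        saved = copy.deepcopy(self.parameters)
        try:
            yield self
        finally:
            # restore the parameters even if the block raised an exception
            self.parameters = saved

    def get_data(self):
        # placeholder for loading: return the date described by the parameters
        p = self.parameters
        return f"{p['Y']:04d}-{p['m']:02d}-{p['d']:02d}"

    def get_data_sets(self, params_sets):
        data = []
        for params in params_sets:
            with self.save_excursion():
                self.parameters.update(params)
                data.append(self.get_data())
        return data


di = ToyInterface(Y=2000, m=1, d=1)
data = di.get_data_sets([{"Y": 2020, "m": 1, "d": 15}, {"Y": 2021, "m": 2, "d": 24}])
```

After the call, the toy interface holds its initial parameters again, and `data` contains one entry per set of parameters.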
Source#
The Source module manages the location of data that will be read or written
by other modules. It could be files on disk, or the address of a remote
data-store, or the store object itself. It provides
DataInterface.get_source(), though other modules will typically call it
automatically when they need it.
Sometimes, you may have data split in different locations. To solve this, you can use a module mix to combine multiple source modules into one by taking the union (or intersection) of their results. Say you have data files in two locations with different naming conventions:
/data1/<year>/data1_<year><month><day>.nc
and
/data2/data2_<year><dayofyear>.nc
We combine two FileFinderSource modules by taking their union:
class MyDataInterface(DataInterface):
    class Source1(FileFinderSource):
        def get_root_directory(self):
            return "data1"

        def get_filename_pattern(self):
            return "%(Y)/data1_%(Y)%(m)%(d).nc"

    class Source2(FileFinderSource):
        def get_root_directory(self):
            return "data2"

        def get_filename_pattern(self):
            return "data2_%(Y)%(j).nc"

    # SourceUnion is a type of module mix
    Source = SourceUnion.create([Source1, Source2])
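To make the two naming conventions concrete, the standalone sketch below builds both filenames for one date with the standard library only. It does not use FileFinderSource. It assumes that the year, month, day and day-of-year fields are zero-padded like the strftime codes `%Y`, `%m`, `%d` and `%j`; that correspondence is an assumption, not something stated above.

```python
from datetime import date

day = date(2015, 2, 24)

# first convention: /data1/<year>/data1_<year><month><day>.nc
file1 = day.strftime("/data1/%Y/data1_%Y%m%d.nc")

# second convention: /data2/data2_<year><dayofyear>.nc
file2 = day.strftime("/data2/data2_%Y%j.nc")
```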
If we need to run a method on one of the source modules, for instance to generate a filename, we can specify a function that automatically selects one module. That function receives the instance of the module mix and should return the class name of one base module. Let's say our first dataset contains years up to 2010, and the second one the years after that:
class MyDataInterface(DataInterface):
    ...

    @staticmethod
    def _select_source(mod: SourceBase, **kwargs):
        year = mod.parameters.get("Y", None)
        # if the user specifies a year in kwargs, it takes precedence
        year = kwargs.get("Y", year)
        if year is None:
            raise ValueError("Year not fixed")
        if year <= 2010:
            return "Source1"
        else:
            return "Source2"

    Source = SourceUnion.create([Source1, Source2], select=_select_source)
We can then run a method on a selected module with
di.source.apply_select("get_filename", year=2015). The year can be specified
by hand; otherwise, the year in the interface parameters will be used.
Tip
The module mix will also try to dispatch any attribute access to the selected
base module, so di.source.get_filename() will work.
See Module mixes for more details.
Loader#
The Loader module loads the data using different libraries or functions. It
provides DataInterface.get_data(). By default, it uses the location
given by the Source module, but it can always be specified manually with
di.get_data(source="my_file"). It also lets you post-process your data,
i.e. run a function every time it is loaded. For instance, if we need to change
the units of a variable, we just need to implement the
postprocess() method:
class MyDataInterface(DataInterface):
    class Loader(XarrayLoader):
        def postprocess(self, data: xr.Dataset) -> xr.Dataset:
            # go from Kelvin to Celsius
            data["sst"] -= 273.15
            return data
Now, every time we load data (using DataInterface.get_data()), the
function is applied. You can always disable it by passing
di.get_data(ignore_postprocess=True).
New loaders should implement the method
load_data_concrete() that loads data from a given source.
LoaderAbstract.get_data() will deal with getting the source and applying
post-processing.
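The division of work between get_data() and load_data_concrete() can be sketched as a template method. The classes below are a toy under stated assumptions, not Neba's LoaderAbstract: `ToyLoader`, `KelvinLoader` and the `default_source` argument are hypothetical, and the "files" are an in-memory dictionary.

```python
class ToyLoader:
    """Toy loader: get_data() orchestrates, subclasses implement the loading."""

    def __init__(self, default_source):
        # stands in for the location that the Source module would provide
        self.default_source = default_source

    def get_data(self, source=None, ignore_postprocess=False):
        if source is None:
            source = self.default_source
        data = self.load_data_concrete(source)
        if not ignore_postprocess:
            data = self.postprocess(data)
        return data

    def load_data_concrete(self, source):
        raise NotImplementedError

    def postprocess(self, data):
        return data


class KelvinLoader(ToyLoader):
    def load_data_concrete(self, source):
        # pretend that each "file" holds temperatures in Kelvin
        fake_files = {"a.txt": [273.15, 283.15], "b.txt": [300.0]}
        return list(fake_files[source])

    def postprocess(self, data):
        # go from Kelvin to Celsius
        return [round(t - 273.15, 2) for t in data]


loader = KelvinLoader(default_source="a.txt")
celsius = loader.get_data()
raw = loader.get_data(source="b.txt", ignore_postprocess=True)
```

A new loader only has to provide the loading step; source handling and post-processing stay in one place.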
Writer#
The Writer writes data to the location given by the Source module. It provides
DataInterface.write().
The writer will generate one or more calls, each consisting of a location and the data to write there. It will then execute the calls serially or in parallel (for instance when using Xarray and Dask). The writer will check that no two calls point to the same target, and will create directories if needed.
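The checks described above can be sketched with the standard library. This is a simplified, serial illustration and not the writer's actual code: `check_and_write` is a hypothetical helper, and each call is assumed to be a `(target, text)` pair.

```python
import tempfile
from collections import Counter
from pathlib import Path


def check_and_write(calls):
    """Write each (target, text) call after checking that targets are unique."""
    counts = Counter(target for target, _ in calls)
    duplicates = [str(t) for t, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"Multiple calls write to the same target: {duplicates}")
    for target, text in calls:
        # create missing parent directories before writing
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    calls = [(root / "2020" / "a.txt", "first"), (root / "2021" / "b.txt", "second")]
    check_and_write(calls)
    written = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.txt"))

    try:
        check_and_write([(root / "a.txt", "x"), (root / "a.txt", "y")])
    except ValueError:
        duplicate_detected = True
    else:
        duplicate_detected = False
```

Checking all targets before writing anything avoids leaving a partially written dataset when two calls collide.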
Some writers are able to split your dataset into multiple files. They should
inherit from SplitWriterMixin, and the source module should follow the
Splitable protocol. See XarraySplitWriter for an example.
Metadata#
Writers can generate metadata with a MetadataGenerator object. You can
modify the generator class via the
Writer.metadata_generator attribute
(an instance will be created when generating metadata with
writer.get_metadata).
Metadata items are generated by different MetadataMethod objects.
Each corresponds to a method of the generator and can return either a single
item that will be given the same name as the method, or return a mapping of
multiple items. Items can be renamed with rename().
For instance:
di.writer.metadata_generator.creation_time.rename("created_on")
Note
This only changes the name of the items that end up in the metadata, not the method that generates them.
If an error is raised when running a method, the exception is only logged and
the generation continues. When all methods have run,
postprocess() is called. This is a good place to
slightly modify the metadata.
Users can specify options via the metadata_kwargs argument of appropriate
writer methods. In particular, one can manually specify methods to run via
methods, or skip groups of methods with
add_params and add_git_info
(or any option in methods_to_skip). Check the
documentation of MetadataOptions for all available options.
Default methods are:
written_with_interface: name of the interface class
creation_hostname: hostname of current machine
creation_script: filename of top-level script or notebook
creation_params: a dictionary or a string representation of the interface parameters, depending on the params_str option
creation_time: date and time of creation
creation_commit: if found, the HEAD commit hash
creation_diff: if workdir is dirty, a list of modified files and full diff truncated at max_diff_lines
To add new methods, subclass the generator and decorate your method with
method(). If your method returns multiple items, you should
specify their names in the decorator. Methods can access the
metadata attribute which is progressively populated.
Note
Methods are run in the order of the methods option
if specified, or otherwise in the order they are defined in the generator
class.
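The generation loop described in this section can be sketched as follows. This is a toy, not Neba's MetadataGenerator: `ToyMetadataGenerator`, its `renames` argument and the `n_items` entry are hypothetical, and only methods returning a single item are covered.

```python
import logging
import socket
from datetime import datetime

log = logging.getLogger(__name__)


class ToyMetadataGenerator:
    """Toy generator that runs methods in order and collects their results."""

    # methods are run in this order
    methods = ["creation_hostname", "creation_time", "broken", "creation_params"]

    def __init__(self, params, renames=None):
        self.params = params
        self.renames = renames or {}
        self.metadata = {}  # progressively populated

    def creation_hostname(self):
        return socket.gethostname()

    def creation_time(self):
        return datetime.now().isoformat()

    def broken(self):
        raise RuntimeError("this method fails")

    def creation_params(self):
        return dict(self.params)

    def postprocess(self):
        # last chance to adjust the metadata
        self.metadata["n_items"] = len(self.metadata)

    def generate(self):
        for name in self.methods:
            try:
                value = getattr(self, name)()
            except Exception:
                # the error is logged and the generation continues
                log.exception("Metadata method %s failed", name)
                continue
            # the item name can differ from the method name
            self.metadata[self.renames.get(name, name)] = value
        self.postprocess()
        return self.metadata


gen = ToyMetadataGenerator({"Y": 2020}, renames={"creation_time": "created_on"})
meta = gen.generate()
```

The failing method leaves no item behind, and the renamed item appears under its new name while the method keeps its original name.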
Typing#
Modules may deal with different types of parameters, source and data. Module
classes specify their supported types as generics, so you can check their base
class to see what input/output they support. For instance,
XarrayLoader can receive str | os.PathLike and
returns xarray.Dataset.
However, since one of the uses of Neba is to ease the management of multi-file
datasets, all modules should expect to receive either one source file, or
a list of them. XarrayLoader may receive a str or a list thereof (that it
will concatenate into a single output).
The types of parameters, source, and data (in this order) are also left as
generics for the interface class. By specifying them, you get type-checks for
some top-level methods like DataInterface.get_data() or
DataInterface.get_source(), and the type-checker can verify compatibility
between modules.
class MyDataInterface(DataInterface[App, str, xr.Dataset]):
    Parameters = ParametersApp
    Source = FileFinderSource
    Loader = XarrayLoader

    # module instances must be type-hinted by hand :(
    parameters: ParametersApp[App]
    source: FileFinderSource
    loader: XarrayLoader
Note
Modules having union types can be tricky. You can think about it in terms of inputs and outputs:
source modules output source,
loader modules take in source and output data,
writer modules take in source and data.
For outputs, you should specify all types in your interface generic. For inputs, it’s okay not to list them all.
For example, if your source module returns str | bytes, you should list
them all. That way, if your loader module only takes in str as source,
your type-checker should complain (since the loader might receive
bytes). And if your writer takes in str | bytes | os.PathLike, you
don't need to list os.PathLike, since the source module will never
return that.
Module mixes#
Modules can be compounded together in some cases. The common API for this is
contained in ModuleMix. This generates a module with multiple ‘base
modules’. It will instantiate and initialize all modules and store them in
ModuleMix.base_modules.
Mix classes should be created with the class method create().
This is used for instance to obtain the union or
intersection of source files obtained by different
source modules. Or it could be used to write to multiple file formats at once
(with different base writers).
Mixes can run methods on their base modules:
apply_all() will run on all the base modules of the mix and return a list of outputs.
apply_select() will only run on a single module. It will be selected by a user-defined function that can be set in create() or with ModuleMix.set_select(). It chooses the appropriate base module based on the current state of the mix module, the interface parameters, and any keyword arguments it might receive. It should return the class name of one of the modules.
apply() will use the all or select version based on the value of the all argument.
In all methods, args and kwargs are passed to the method that is run, and the select keyword argument is passed to the selection function.
Tip
If an attribute access fails on a ModuleMix, it tries to select a base module and access that attribute on it. This allows quick dispatch to a base module.
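The dispatch logic of a mix can be sketched with a toy class. This is not Neba's ModuleMix: `ToyMix`, `Source1`, `Source2` and `select_by_year` are hypothetical, and treating the select argument as a dictionary of keywords forwarded to the selection function is an assumption of this sketch.

```python
class ToyMix:
    """Toy module mix holding several base modules."""

    def __init__(self, base_modules, select, parameters):
        self.base_modules = list(base_modules)
        self.select = select
        self.parameters = parameters

    def apply_all(self, method, *args, **kwargs):
        return [getattr(mod, method)(*args, **kwargs) for mod in self.base_modules]

    def _selected(self, select_kwargs=None):
        # the selection function returns the class name of one base module
        name = self.select(self, **(select_kwargs or {}))
        for mod in self.base_modules:
            if type(mod).__name__ == name:
                return mod
        raise KeyError(name)

    def apply_select(self, method, *args, select=None, **kwargs):
        return getattr(self._selected(select), method)(*args, **kwargs)

    def __getattr__(self, name):
        # only called when normal attribute lookup has failed
        if name.startswith("_") or name in ("base_modules", "select", "parameters"):
            raise AttributeError(name)
        return getattr(self._selected(), name)


class Source1:
    def get_filename(self):
        return "data1.nc"


class Source2:
    def get_filename(self):
        return "data2.nc"


def select_by_year(mix, **kwargs):
    # a year given by the caller takes precedence over the parameters
    year = kwargs.get("Y", mix.parameters.get("Y"))
    return "Source1" if year <= 2010 else "Source2"


mix = ToyMix([Source1(), Source2()], select_by_year, parameters={"Y": 2005})
all_names = mix.apply_all("get_filename")
selected = mix.apply_select("get_filename")
overridden = mix.apply_select("get_filename", select={"Y": 2015})
dispatched = mix.get_filename()  # falls back to the selected base module
```

Because `__getattr__` only runs after normal lookup fails, attributes defined on the mix itself are never shadowed by those of a base module.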
Cache module#
Note
This section is aimed at module writers. Users can safely ignore it.
It might help for some modules to have a cache to write information into. For
instance source modules for multiple files leverage this. A module simply needs
to be a subclass of CachedModule. This will automatically create a
dictionary in the cache attribute. It will also register a callback in the
interface, so that this module cache will be voided on parameters change. This
can be disabled however by setting the class attribute _add_void_callback to
False (in the new module subclass).
If a module has a cache, you can use the autocached() decorator to make
the value of one of its methods or properties automatically cached. Watch out
for the order of decorators for properties:
class SubModule(SourceAbstract, CachedModule):
    @property
    @autocached
    def something(self):
        ...
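The order matters because decorators are applied from the bottom up: the caching decorator has to wrap the plain function, and property must come last since a property object is not callable. The standalone sketch below shows this with a toy decorator. It is not Neba's autocached(): `autocached_sketch`, `ToyCachedModule` and `void_cache` are hypothetical names.

```python
import functools


def autocached_sketch(func):
    """Cache the result of a method in ``self.cache`` under the method name."""

    @functools.wraps(func)
    def wrapper(self):
        key = func.__name__
        if key not in self.cache:
            self.cache[key] = func(self)
        return self.cache[key]

    return wrapper


class ToyCachedModule:
    def __init__(self):
        self.cache = {}
        self.n_computations = 0

    def void_cache(self):
        self.cache.clear()

    @property           # applied last: exposes the cached function as a property
    @autocached_sketch  # applied first: wraps the plain function
    def something(self):
        self.n_computations += 1
        return 42


mod = ToyCachedModule()
first, second = mod.something, mod.something
count_before_void = mod.n_computations

mod.void_cache()  # for instance after a change of parameters
third = mod.something
count_after_void = mod.n_computations
```

The value is computed once per cache lifetime, and again only after the cache has been voided.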
Defining new modules#
Users will typically only need to use existing modules, possibly redefining some of their methods. In case more substantial changes are necessary, here are more details on the module system.
All modules inherit from abstract classes that define their API. Note that they
are not defined through the abc module, and thus will not
raise if methods lack an implementation. These classes are more guidelines than
strict protocols.
For developers
Nevertheless it is advised to keep a common signature for module subclasses, relying on keyword arguments if necessary. This helps ensure inter-operability between modules and easy substitution of module types.
To add more module types, the correspondence between the attribute containing
the module instance and the one containing the module type must be indicated in
the mapping _modules_attributes.
Note
Modules are instantiated and set up in the order of that mapping.
Interfaces are initialized with an optional argument giving the parameters, and
additional keyword arguments. All modules are instantiated with the same
arguments. Once they are all instantiated, they are set up using the
Module.setup() method. This ensures that all other modules exist
if there is a need for interplay.
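The two-phase initialization can be sketched with toy classes. This is an illustration under stated assumptions, not Neba's code: all `Toy*` names are hypothetical, and only the `_modules_attributes` mapping, the `di` attribute and the `setup()` method mirror names used in this page.

```python
class ToyModule:
    def __init__(self, params=None, **kwargs):
        self.params = params
        self.di = None  # parent interface, set by the interface

    def setup(self):
        pass


class ToySource(ToyModule):
    def get_source(self):
        return f"file_{self.params['Y']}.nc"


class ToyLoaderModule(ToyModule):
    def setup(self):
        # safe: every module has been instantiated before any setup() is called
        self.source_module = self.di.source


class ToyInterface:
    # instance attribute -> type attribute, in instantiation order
    _modules_attributes = {"loader": "Loader", "source": "Source"}
    Source = ToySource
    Loader = ToyLoaderModule

    def __init__(self, params=None, **kwargs):
        # first, instantiate all modules with the same arguments
        for instance_attr, type_attr in self._modules_attributes.items():
            module = getattr(self, type_attr)(params, **kwargs)
            module.di = self
            setattr(self, instance_attr, module)
        # then, set them up in the same order
        for instance_attr in self._modules_attributes:
            getattr(self, instance_attr).setup()


di = ToyInterface({"Y": 2020})
filename = di.loader.source_module.get_source()
```

In this sketch the loader is instantiated before the source, yet its setup() can still reach the source module, which is the point of separating the two phases.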
Interface store#
To help deal with numerous interface classes, we provide a
mapping to store and easily access your
interfaces using their ID or
SHORTNAME attributes, or a custom name.
from neba.data import DataInterface, DataInterfaceStore

class MyDataInterface(DataInterface):
    ID = "SomeLongID"
    SHORTNAME = "SST"

store = DataInterfaceStore(MyDataInterface)
di_cls = store["SomeLongID"]
# or
di_cls = store["SST"]
If multiple interfaces have the same shortname, they can only be accessed by their ID. Trying to access with an ambiguous shortname will raise a KeyError.
You can directly register an interface with a decorator:
store = DataInterfaceStore()

@store.register()
class MyDataInterface(DataInterface):
    ...
You can also store an interface as an import string. When accessed, the store will automatically import your interface (and replace the string with the imported class for subsequent accesses):
store.add("path.to.MyDataInterface")
di_cls = store["MyDataInterface"]
# an interface class
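The lookup rules for IDs and shortnames can be sketched with a toy mapping. This is not Neba's DataInterfaceStore: `ToyStore` and the three example classes are hypothetical, and import strings and custom names are left out.

```python
class ToyStore(dict):
    """Toy store indexed by ID, with a fallback on unambiguous shortnames."""

    def __init__(self, *classes):
        super().__init__()
        self.shortnames = {}  # shortname -> list of IDs
        for cls in classes:
            self.add(cls)

    def add(self, cls):
        super().__setitem__(cls.ID, cls)
        self.shortnames.setdefault(cls.SHORTNAME, []).append(cls.ID)
        return cls

    def __getitem__(self, key):
        if key in self:
            return super().__getitem__(key)
        ids = self.shortnames.get(key, [])
        if len(ids) == 1:
            return super().__getitem__(ids[0])
        if len(ids) > 1:
            raise KeyError(f"Shortname {key!r} is ambiguous: {ids}")
        raise KeyError(key)


class SST:
    ID = "SomeLongID"
    SHORTNAME = "SST"


class SSTv2:
    ID = "OtherLongID"
    SHORTNAME = "SST"


class Chl:
    ID = "ChlID"
    SHORTNAME = "CHL"


store = ToyStore(SST, Chl)
by_id = store["SomeLongID"]
by_shortname = store["SST"]

store.add(SSTv2)  # the shortname "SST" now refers to two interfaces
try:
    store["SST"]
except KeyError:
    ambiguous = True
else:
    ambiguous = False
```

Once two interfaces share a shortname, only their IDs remain usable as keys.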