Usage#

Each dataset is defined by creating a subclass of DataInterface. The class definition will contain information on:

  • where the data is located: on disk or in a remote store

  • how to load or write data: with which library and function, how to post-process it, etc.

The interface can then be re-used in different scripts so that eventually, you can get your data with two simple lines:

di = MyDataInterface()
di.get_data()

Each instance of that subclass corresponds to a set of parameters that can be used to change aspects of the interface on the fly: choose only files for a specific year, change the method to open data, etc.

Module system#

Features of the interface are split into individual modules that can be swapped or modified. The default DataInterface has four modules. Each module has an attribute where the module instance can be accessed, and an attribute where its type can be changed:

Instance attribute   Type attribute   Class                Function
------------------   --------------   ------------------   ------------------
parameters           Parameters       ParametersAbstract   manage parameters
source               Source           SourceAbstract       manage data source
loader               Loader           LoaderAbstract       load data
writer               Writer           WriterAbstract       write data

To change a module, we only need to change the module type. It can be a simple attribute change, or a class definition with the appropriate name, like so:

class MyDataInterface(DataInterface):

    # simple attribute change
    Parameters = ParametersDict

    # or a more complex definition
    class Source(SimpleSource):

        def get_source(self):
            ...

# we can then access the modules
di = MyDataInterface()
di.source.get_source()

Tip

Every module can access its parent interface via Module.di.

Parameters#

The parameters of the interface are stored in the Parameters module. They are given as an argument to the interface on initialization.

Parameters can be stored in a simple dictionary, or using objects from the configuration part of Neba like a Section or an Application. See the existing parameters modules.

There are two ways to access parameters.

  • you can use the methods of the parameters module which works like a dictionary:

    di.parameters["a"] = 0
    # equivalent to
    di.parameters.set("a", 0)
    
    di.parameters["a"]
    # somewhat equivalent to
    di.parameters.get("a")
    
    di.parameters.update(a=1, b=2)
    

    If a key is not defined, di.parameters[key] will raise KeyError, and di.parameters.get(key) will return None by default.

  • Or you can access the parameters container directly at di.parameters.direct. This is useful if you use a Section as a container:

    di.parameters.direct.my_subsection.my_trait = 0
    

    But using di.parameters["my_subsection.my_trait"] = 0 will still work!

Some modules use a cache that needs to be reset when parameters are changed. Both ways of accessing parameters will void the cache appropriately.

Note

This is done in two ways. The parameters module methods simply call DataInterface.trigger_callbacks() after their operation. For direct access, the dictionary container has __setitem__ patched, and Section objects have an observer event registered. Direct access will only void the cache if a new parameter is added, or if the new value is different from the old one.

Important

This does not include in-place operations on mutable parameters:

di.parameters["my_list"].append(1)
di.parameters["my_dict"]["key"] = 1
# or
di.parameters.direct["my_list"].append(1)
di.parameters.direct["my_dict"]["key"] = 1

will not trigger a callback.
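
To see why in-place operations go unnoticed, here is a minimal sketch of a container with a patched __setitem__. CallbackDict and its n_callbacks counter are hypothetical stand-ins written for this illustration, not the Neba implementation:

```python
class CallbackDict(dict):
    """Hypothetical stand-in for a parameters container with a patched __setitem__."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_callbacks = 0  # counts how many times the cache would be voided

    def __setitem__(self, key, value):
        # only react if the key is new or if the value actually changes
        changed = key not in self or self[key] != value
        super().__setitem__(key, value)
        if changed:
            self.n_callbacks += 1


params = CallbackDict()
params["my_list"] = [0]      # new key: the callback fires
params["my_list"].append(1)  # in-place: __setitem__ is never called
count_after_append = params.n_callbacks

# workaround: build a new object and assign it back
params["my_list"] = params["my_list"] + [2]
count_after_reassign = params.n_callbacks
```

The append() call mutates the list that the container already holds, so the container itself is never asked to set anything. Assigning a new object back is one way to make the change visible.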

Tip

The parameters module is accessible from any other module at Module.parameters.

It can be useful to change parameters temporarily, possibly multiple times, before returning to the initial set of parameters. To this end, the method DataInterface.save_excursion() returns a context manager that saves the initial parameters and restores them on exit:

di.parameters["p"] = 0

with di.save_excursion():
    # we change parameters
    di.parameters["p"] = 2
    di.get_data()

# we are back to di.parameters["p"] == 0
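
For readers curious about the mechanism, a save-and-restore context manager can be sketched as follows. TinyInterface is a hypothetical stand-in, and the use of a deep copy is an assumption of this sketch, not a statement about how Neba stores the saved parameters:

```python
import copy
from contextlib import contextmanager


class TinyInterface:
    """Hypothetical stand-in; not the Neba DataInterface."""

    def __init__(self, **params):
        self.parameters = dict(params)

    @contextmanager
    def save_excursion(self):
        # deep copy so that in-place changes to mutable values are undone too
        saved = copy.deepcopy(self.parameters)
        try:
            yield self
        finally:
            # restore the parameters even if the body raised
            self.parameters = saved


di = TinyInterface(p=0)
with di.save_excursion():
    di.parameters["p"] = 2
    inside = di.parameters["p"]
outside = di.parameters["p"]
```

The try/finally block is what guarantees the restoration when the body of the with statement fails.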

This is used by DataInterface.get_data_sets() that returns data for multiple sets of parameters, for instance to get specific dates:

data = di.get_data_sets(
    [
        {"Y": 2020, "m": 1, "d": 15},
        {"Y": 2021, "m": 2, "d": 24},
        {"Y": 2022, "m": 6, "d": 2},
    ]
)
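
The idea can be sketched with a plain loop that applies each set of parameters, collects the data, and restores the initial parameters at the end. TinyInterface and its date-string "data" are hypothetical; whether Neba returns a list, and whether parameters accumulate between sets, are assumptions of this sketch:

```python
class TinyInterface:
    """Hypothetical stand-in; not the Neba DataInterface."""

    def __init__(self, **params):
        self.parameters = dict(params)

    def get_data(self):
        # pretend the 'data' is a date string built from the parameters
        p = self.parameters
        return f"{p['Y']:04d}-{p['m']:02d}-{p['d']:02d}"

    def get_data_sets(self, params_sets):
        saved = dict(self.parameters)
        results = []
        try:
            for params in params_sets:
                self.parameters.update(params)
                results.append(self.get_data())
        finally:
            # go back to the initial parameters
            self.parameters = saved
        return results


di = TinyInterface(Y=2000, m=1, d=1)
data = di.get_data_sets(
    [
        {"Y": 2020, "m": 1, "d": 15},
        {"Y": 2021, "m": 2, "d": 24},
    ]
)
```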

Source#

The Source module manages the location of data that will be read or written by other modules. It could be files on disk, the address of a remote data store, or the store object itself. It makes DataInterface.get_source() available, though other modules will typically call it automatically when they need it.

Sometimes, you may have data split in different locations. To solve this, you can use a module mix to combine multiple source modules into one by taking the union (or intersection) of their results. Say you have data files in two locations with different naming conventions:

/data1/<year>/data1_<year><month><day>.nc
and
/data2/data2_<year><dayofyear>.nc

We combine two FileFinderSource by taking the union:

class MyDataInterface(DataInterface):

    class Source1(FileFinderSource):
        def get_root_directory(self):
            return "data1"

        def get_filename_pattern(self):
            return "%(Y)/data1_%(Y)%(m)%(d).nc"

    class Source2(FileFinderSource):
        def get_root_directory(self):
            return "data2"

        def get_filename_pattern(self):
            return "data2_%(Y)%(j).nc"

    # SourceUnion is a type of module mix
    Source = SourceUnion.create([Source1, Source2])

If we need to run a method on one of the source modules, for instance to generate a filename, we can specify a function to automatically select one module. That function receives the instance of the module mix and should return the class name of one base module. Let’s say our first dataset contains years up to 2010, and the second one the years after that:

class MyDataInterface(DataInterface):

    ...

    @staticmethod
    def _select_source(mod: SourceBase, **kwargs):
        year = mod.parameters.get("Y", None)
        # if the user specifies a year in kwargs, it takes precedence
        year = kwargs.get("Y", year)
        if year is None:
            raise ValueError("Year not fixed")
        if year <= 2010:
            return "Source1"
        else:
            return "Source2"

    Source = SourceUnion.create([Source1, Source2], select=_select_source)

We can then run a method on a selected module with di.source.apply_select("get_filename", year=2015). The year can be specified by hand, as here; otherwise the year in the interface parameters is used.

Tip

The module mix will also try to dispatch any attribute access to the selected base module, so di.source.get_filename() will work.

More details on Module mixes.

Loader#

The Loader module loads the data using different libraries or functions. It makes DataInterface.get_data() available. By default, it uses the location given by the Source module, but the source can always be specified manually with di.get_data(source="my_file"). It also lets you post-process your data, i.e. run a function every time data is loaded. For instance, if we need to change the units of a variable, we only need to implement the postprocess() method:

class MyDataInterface(DataInterface):

    class Loader(XarrayLoader):
        def postprocess(self, data: xr.Dataset) -> xr.Dataset:
            # go from Kelvin to Celsius
            data["sst"] -= 273.15
            return data

Now, every time we load data (using DataInterface.get_data()), the function is applied. You can always disable it by passing di.get_data(ignore_postprocess=True).

New loaders should implement the method load_data_concrete() that loads data from a given source. LoaderAbstract.get_data() will deal with getting the source and applying post-processing.
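
This division of labour can be sketched as follows. TinyLoader and KelvinLoader are hypothetical classes written for this illustration; only the method names get_data(), load_data_concrete() and postprocess() and the arguments source and ignore_postprocess come from the text above:

```python
class TinyLoader:
    """Hypothetical base class illustrating the split of responsibilities."""

    def __init__(self, default_source):
        self.default_source = default_source

    def get_data(self, source=None, ignore_postprocess=False):
        # 1. fall back on the default source if none is given
        if source is None:
            source = self.default_source
        # 2. delegate the actual reading to the concrete loader
        data = self.load_data_concrete(source)
        # 3. apply post-processing unless it is disabled
        if not ignore_postprocess:
            data = self.postprocess(data)
        return data

    def load_data_concrete(self, source):
        raise NotImplementedError

    def postprocess(self, data):
        return data


class KelvinLoader(TinyLoader):
    def load_data_concrete(self, source):
        # pretend we read temperatures in Kelvin from `source`
        return [273.15, 283.15]

    def postprocess(self, data):
        # go from Kelvin to Celsius
        return [round(t - 273.15, 2) for t in data]


loader = KelvinLoader("my_file")
celsius = loader.get_data()
kelvin = loader.get_data(ignore_postprocess=True)
```

A new loader only has to say how to read from a source; the bookkeeping stays in one place.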

Writer#

The Writer writes data to the location given by the Source module. It makes DataInterface.write() available.

The writer will generate one or more calls, each consisting of a location and the data to write there. It will then execute the calls serially or in parallel (for instance when using Xarray and Dask). The writer checks that no two calls point to the same target, and creates directories if needed.
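
As a rough illustration of these checks, here is a sketch in which a call is a simple (target, data) tuple written serially as text. The functions check_unique_targets and execute_calls are hypothetical names, not part of the Neba API:

```python
import tempfile
from pathlib import Path


def check_unique_targets(calls):
    """Raise if two calls would write to the same target."""
    targets = [str(target) for target, _ in calls]
    duplicates = {t for t in targets if targets.count(t) > 1}
    if duplicates:
        raise ValueError(f"Multiple calls to the same target: {sorted(duplicates)}")


def execute_calls(calls):
    """Run the write calls serially, creating parent directories as needed."""
    check_unique_targets(calls)
    for target, data in calls:
        target = Path(target)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(data)


with tempfile.TemporaryDirectory() as root:
    calls = [
        (Path(root) / "2020" / "a.txt", "data for 2020"),
        (Path(root) / "2021" / "b.txt", "data for 2021"),
    ]
    execute_calls(calls)
    written = sorted(p.name for p in Path(root).rglob("*.txt"))
```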

Some writers are able to split your dataset into multiple files. They should inherit SplitWriterMixin, and the source module should follow the Splitable protocol. See XarraySplitWriter for an example.

Metadata#

Writers can generate metadata with a MetadataGenerator object. You can modify the generator class via the Writer.metadata_generator attribute (an instance will be created when generating metadata with writer.get_metadata).

Metadata items are generated by different MetadataMethod objects. Each corresponds to a method of the generator and can return either a single item that will be given the same name as the method, or a mapping of multiple items. Items can be renamed with rename(). For instance:

di.writer.metadata_generator.creation_time.rename("created_on")

Note

This only changes the name of the items that end up in the metadata, not the method that generates them.

If an error is raised when running a method, the exception is only logged and the generation continues. When all methods have run, postprocess() is called. This is a good place to slightly modify the metadata.
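
The generation loop described here can be sketched as follows. TinyGenerator, its methods list and its generate() method are hypothetical and much simpler than MetadataGenerator; the sketch only shows the "log and continue" behaviour followed by postprocess():

```python
import logging

log = logging.getLogger(__name__)


class TinyGenerator:
    """Hypothetical stand-in for a metadata generator."""

    # names of the methods to run, in order
    methods = ["creation_time", "broken", "user"]

    def __init__(self):
        self.metadata = {}

    def creation_time(self):
        return "2024-01-01T00:00:00"

    def broken(self):
        raise RuntimeError("cannot compute this item")

    def user(self):
        # a method may return several items at once
        return {"user_name": "alice", "user_host": "laptop"}

    def postprocess(self):
        self.metadata["n_items"] = len(self.metadata)

    def generate(self):
        for name in self.methods:
            try:
                result = getattr(self, name)()
            except Exception:
                # the exception is only logged, generation continues
                log.exception("Metadata method %s failed", name)
                continue
            if isinstance(result, dict):
                self.metadata.update(result)
            else:
                self.metadata[name] = result
        self.postprocess()
        return self.metadata


meta = TinyGenerator().generate()
```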

Users can specify options via the metadata_kwargs argument of appropriate writer methods. In particular, one can manually specify methods to run via methods, or skip groups of methods with add_params and add_git_info (or any option in methods_to_skip). Check the documentation of MetadataOptions for all available options.

Default methods are:

To add new methods, subclass the generator and decorate your method with method(). If your method returns multiple items, you should specify their names in the decorator. Methods can access the metadata attribute which is progressively populated.

Note

Methods are run in the order of the methods option if specified, or otherwise in the order they are defined in the generator class.

Typing#

Modules may deal with different types of parameters, source and data. Module classes specify their supported types as generics, so you can check their base class to see what input/output they support. For instance, XarrayLoader can receive str | os.PathLike and returns xarray.Dataset.

However, since one of the uses of Neba is to ease the management of multi-file datasets, all modules should expect to receive either one source file or a list of them. For example, XarrayLoader may receive a str or a list thereof (which it will concatenate into a single output).

The types of parameters, source, and data (in this order) are also left as generics for the interface class. By specifying them you get type-checks for some top-level methods like DataInterface.get_data() or DataInterface.get_source(), and your type-checker can verify compatibility between modules.

class MyDataInterface(DataInterface[App, str, xr.Dataset]):
    Parameters = ParametersApp
    Source = FileFinderSource
    Loader = XarrayLoader

    # module instances must be type-hinted by hand :(
    parameters: ParametersApp[App]
    source: FileFinderSource
    loader: XarrayLoader

Note

Modules having union types can be tricky. You can think about it in terms of inputs and outputs:

  • source modules output source,

  • loader modules take in source and output data,

  • writer modules take in source and data.

For outputs, you should specify all types in your interface generic. For inputs, it’s okay not to list them all.

For example, if your source module returns str | bytes, you should list both. That way, if your loader module only takes in str as source, your type-checker should complain (since the loader might receive bytes). And if your writer takes in str | bytes | os.PathLike, you don’t need to list os.PathLike, since the source module will never return that.

Module mixes#

Modules can be compounded together in some cases. The common API for this is contained in ModuleMix. This generates a module with multiple ‘base modules’. It will instantiate and initialize all modules and store them in ModuleMix.base_modules. Mix classes should be created with the class method create().

This is used for instance to obtain the union or intersection of source files obtained by different source modules. It could also be used to write to multiple file formats at once (with different base writers).

Mixes can run methods on their base modules:

  • apply_all() will run on all the base modules of the mix and return a list of outputs.

  • apply_select() will only run on a single module. It is selected by a user-defined function that can be set in create() or with ModuleMix.set_select(). That function chooses the appropriate base module based on the current state of the mix module, the interface parameters, and any keyword arguments it receives. It should return the class name of one of the base modules.

  • apply() will use the all or select version based on the value of the all argument. In all methods, args and kwargs are passed to the method that is run, and the select keyword argument is passed to the selection function.

Tip

If an attribute access fails on a ModuleMix, it tries to select a base module and access that attribute on it. This makes it quick to dispatch to a base module.
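
To make the dispatch logic concrete, here is a minimal sketch of a mix. TinyMix, Source1, Source2 and by_year are hypothetical and written only for this illustration; in particular, passing the selection keywords as a dictionary through select is an assumption based on the description above, not a confirmed signature:

```python
class TinyMix:
    """Hypothetical mix of several base modules; not the Neba ModuleMix."""

    def __init__(self, base_modules, select=None):
        self.base_modules = list(base_modules)
        self._select = select

    def select(self, **kwargs):
        # the selection function returns the class name of one base module
        name = self._select(self, **kwargs)
        for mod in self.base_modules:
            if type(mod).__name__ == name:
                return mod
        raise KeyError(name)

    def apply_all(self, method, *args, **kwargs):
        return [getattr(mod, method)(*args, **kwargs) for mod in self.base_modules]

    def apply_select(self, method, *args, select=None, **kwargs):
        mod = self.select(**(select or {}))
        return getattr(mod, method)(*args, **kwargs)

    def __getattr__(self, name):
        # only called when the normal attribute lookup fails
        if name.startswith("_") or name == "base_modules":
            raise AttributeError(name)
        return getattr(self.select(), name)


class Source1:
    def get_filename(self):
        return "data1.nc"


class Source2:
    def get_filename(self):
        return "data2.nc"


def by_year(mix, Y=2000):
    return "Source1" if Y <= 2010 else "Source2"


mix = TinyMix([Source1(), Source2()], select=by_year)
all_names = mix.apply_all("get_filename")
selected = mix.apply_select("get_filename", select={"Y": 2015})
dispatched = mix.get_filename()  # falls back on __getattr__
```

The guard at the top of __getattr__ avoids infinite recursion if the mix is accessed before its own attributes are set.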

Cache module#

Note

This section is aimed at module writers. Users can safely ignore it.

It might help for some modules to have a cache to write information into. For instance source modules for multiple files leverage this. A module simply needs to be a subclass of CachedModule. This will automatically create a dictionary in the cache attribute. It will also register a callback in the interface, so that this module cache will be voided on parameters change. This can be disabled however by setting the class attribute _add_void_callback to False (in the new module subclass).

If a module has a cache, you can use the autocached() decorator to automatically cache the value of one of its methods or properties. Watch out for the order of decorators for properties:

class SubModule(SourceAbstract, CachedModule):

    @property
    @autocached
    def something(self):
        ...
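
The reason for this order is that the caching decorator must wrap the plain function, and property must wrap the result. A simplified sketch, where autocached and TinyCachedModule are hypothetical re-implementations and not the Neba versions:

```python
import functools


def autocached(func):
    """Cache the result of a zero-argument method in self.cache (sketch)."""
    key = func.__name__

    @functools.wraps(func)
    def wrapper(self):
        if key not in self.cache:
            self.cache[key] = func(self)
        return self.cache[key]

    return wrapper


class TinyCachedModule:
    def __init__(self):
        self.cache = {}
        self.n_calls = 0

    def void_cache(self):
        self.cache.clear()

    @property    # outermost: turns the cached function into a property
    @autocached  # innermost: wraps the plain function
    def something(self):
        self.n_calls += 1
        return 42


mod = TinyCachedModule()
first = mod.something
second = mod.something  # served from the cache
calls_before_void = mod.n_calls
mod.void_cache()        # what a parameters change would trigger
third = mod.something
calls_after_void = mod.n_calls
```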

Defining new modules#

Users will typically only need to use existing modules, possibly redefining some of their methods, but in the case more dramatic changes are necessary, here are more details on the module system.

All modules inherit from abstract classes that define their API. Note that they are not defined through the abc module, and thus will not raise if methods lack an implementation. These classes are more guidelines than strict protocols.

For developers

Nevertheless it is advised to keep a common signature for module subclasses, relying on keyword arguments if necessary. This helps ensure inter-operability between modules and easy substitution of module types.

To add more module types, the correspondence between the attribute containing the module instance and the one containing the module type must be indicated in the mapping _modules_attributes.

Note

Modules are instantiated and set up in the order of that mapping.

Interfaces are initialized with an optional argument giving the parameters, and additional keyword arguments. All modules are instantiated with the same arguments. Once they are all instantiated, they are set up using the Module.setup() method. This guarantees that all other modules exist if modules need to interact during setup.
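
The two-phase initialization can be sketched as follows. All classes here are hypothetical, and passing the interface as the first argument of each module is an assumption of this sketch. The loader is deliberately listed before the source to show that setup() can still rely on the source existing:

```python
class TinyModule:
    def __init__(self, di, params=None, **kwargs):
        self.di = di
        self.is_setup = False

    def setup(self):
        self.is_setup = True


class Source(TinyModule):
    pass


class Loader(TinyModule):
    def setup(self):
        # safe: every module has been instantiated before any setup() runs
        self.source = self.di.source
        super().setup()


class TinyInterface:
    # instance attribute -> type attribute, in instantiation order
    _modules_attributes = {"loader": "Loader", "source": "Source"}
    Source = Source
    Loader = Loader

    def __init__(self, params=None, **kwargs):
        # phase 1: instantiate every module with the same arguments
        for attr, type_attr in self._modules_attributes.items():
            cls = getattr(self, type_attr)
            setattr(self, attr, cls(self, params, **kwargs))
        # phase 2: set up the modules once they all exist
        for attr in self._modules_attributes:
            getattr(self, attr).setup()


di = TinyInterface()
```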

Interface store#

To help deal with numerous interface classes, we provide a mapping to store and easily access your interfaces using their ID or SHORTNAME attributes, or a custom name.

from neba.data import DataInterface, DataInterfaceStore

class MyDataInterface(DataInterface):
    ID = "SomeLongID"
    SHORTNAME = "SST"

store = DataInterfaceStore(MyDataInterface)

di_cls = store["SomeLongID"]
# or
di_cls = store["SST"]

If multiple interfaces have the same shortname, they can only be accessed by their ID. Trying to access with an ambiguous shortname will raise a KeyError.
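
The lookup rules can be illustrated with a small sketch. TinyStore and the two interface classes are hypothetical and only mimic the behaviour described above (ID first, then an unambiguous shortname):

```python
class TinyStore:
    """Hypothetical store indexed by ID and SHORTNAME."""

    def __init__(self, *interfaces):
        self._by_id = {}
        self._by_shortname = {}
        for cls in interfaces:
            self.add(cls)

    def add(self, cls):
        self._by_id[cls.ID] = cls
        self._by_shortname.setdefault(cls.SHORTNAME, []).append(cls)

    def __getitem__(self, key):
        if key in self._by_id:
            return self._by_id[key]
        candidates = self._by_shortname.get(key, [])
        if len(candidates) == 1:
            return candidates[0]
        if candidates:
            raise KeyError(f"Shortname {key!r} is ambiguous")
        raise KeyError(key)


class SstDaily:
    ID = "SST_daily"
    SHORTNAME = "SST"


class SstMonthly:
    ID = "SST_monthly"
    SHORTNAME = "SST"


store = TinyStore(SstDaily, SstMonthly)
by_id = store["SST_monthly"]
try:
    store["SST"]
    ambiguous = False
except KeyError:
    ambiguous = True
```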

You can directly register an interface with a decorator:

store = DataInterfaceStore()

@store.register()
class MyDataInterface(DataInterface):
    ...

You can also store an interface as an import string. When accessed, the store will automatically import your interface (and replace the string by the imported class for subsequent accesses):

store.add("path.to.MyDataInterface")
di_cls = store["MyDataInterface"]
# an interface class
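
Lazy resolution of an import string can be sketched with importlib from the standard library. TinyStore is a hypothetical class, and a standard-library class is used in place of an interface so that the example runs anywhere:

```python
import importlib


class TinyStore(dict):
    """Hypothetical store resolving import strings on first access."""

    def add(self, import_string):
        # register under the class name, i.e. the last dotted component
        name = import_string.rsplit(".", 1)[-1]
        super().__setitem__(name, import_string)

    def __getitem__(self, key):
        value = super().__getitem__(key)
        if isinstance(value, str):
            module_name, cls_name = value.rsplit(".", 1)
            value = getattr(importlib.import_module(module_name), cls_name)
            # replace the string so that the import only happens once
            super().__setitem__(key, value)
        return value


store = TinyStore()
store.add("collections.OrderedDict")
stored_before = dict.__getitem__(store, "OrderedDict")  # still a string
cls = store["OrderedDict"]                              # triggers the import
stored_after = dict.__getitem__(store, "OrderedDict")   # now the class
```

Deferring the import means that registering many interfaces does not load their modules until they are actually needed.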