Existing modules#
The interface class is expected to be populated by modules (see Module system). Here is a quick description of modules that are already defined in Neba.
Parameters#
ParametersDict | Parameters stored in a dictionary.
ParametersSection | Parameters are stored in a Section object.
ParametersApp | Parameters are retrieved from an application instance.
Dict#
ParametersDict stores the parameters in a dictionary. Technically, it
is a subclass of dict with a callback set up to void the modules' cache when a
parameter is changed.
The callback is called only when setting a value that is new or different from the old value. In-place changes to a mutable value (list or dict) will not register:
# Will void cache
di.parameters.direct["a"] = 0
di.parameters.direct["nested"] = {"b": 1}
# Will *not* void cache
di.parameters.direct["a"] = 0 # no change
di.parameters.direct["nested"]["b"] = 2 # mutable operation
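The behavior described above can be reproduced with a small stand-alone sketch. It is not Neba's implementation: the class name CallbackDict and its callback argument are hypothetical, and the sketch only assumes that the callback is triggered from item assignment on the outer dictionary.

```python
from typing import Any, Callable


class CallbackDict(dict):
    """Hypothetical dict that reports effective changes through a callback."""

    def __init__(self, *args: Any, callback: Callable[[], None], **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self._callback = callback

    def __setitem__(self, key: Any, value: Any) -> None:
        # Fire only if the key is new or the value differs from the stored one.
        changed = key not in self or self[key] != value
        super().__setitem__(key, value)
        if changed:
            self._callback()


calls = []
params = CallbackDict(callback=lambda: calls.append("void"))

params["a"] = 0               # new key: callback fires
params["nested"] = {"b": 1}   # new key: callback fires
params["a"] = 0               # same value: no callback
params["nested"]["b"] = 2     # handled by the inner plain dict: no callback
params["nested"] = {**params["nested"], "b": 3}  # reassignment: callback fires

print(len(calls))
```

The in-place change to the nested dictionary never reaches the outer `__setitem__`, which is why it goes unnoticed. Reassigning the whole value, as on the last assignment, is one way to make such a change register.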
Section#
ParametersSection stores the parameters in a Section object.
The section class is specified in the attribute
SECTION_CLS and defaults to an empty Section. When
initialized, the module creates a new SECTION_CLS and updates it with the
arguments passed to it.
It can be defined with:
class MyDataInterface(DataInterface):
    Parameters = ParametersSection.new(MySection)
A callback is set up on the section so that any modification triggers a cache void (except in-place changes to mutable values):
# Will void cache
di.parameters.direct["a"] = 0
di.parameters.direct.a = 1
di.parameters.direct.nested.b = 1
di.parameters.direct["my_list"] = [0]
# Will *not* void cache
di.parameters.direct["a"] = 1 # no change
di.parameters.direct["my_list"].append(1) # mutable operation
App#
ParametersApp stores its parameters in an Application
object. An application must be supplied as an argument. It will be copied,
so any modification of the interface parameters will not affect the original
application instance.
Tip
The specific application class does not need to be specified, but can be type-hinted with:
from neba.data.util import T_Source, T_Data

class MyDataInterface(DataInterface[MyApp, T_Source, T_Data]):
    ...
T_Source and T_Data can also be specified, see
Typing.
Source#
SimpleSource | Simple module where the data source is specified by a class attribute.
GlobSource | Find files using glob patterns.
FileFinderSource | Multi-file manager using filefinder.
Simple#
For simple cases, SimpleSource will just return its attribute
source_loc:
class MyDataInterface(DataInterface):
    Source = SimpleSource
    Source.source_loc = "/my_data/file.txt"
MultiFile#
For datasets consisting of multiple files, the package provides two modules that
follow the abstract class MultiFileSource. For both of them, the user
should implement get_root_directory(), which returns the
directory containing the files (as a path, or a list of sub-folders that will be
joined).
Glob#
The module GlobSource can find files on disk that follow a pattern
defined by get_glob_pattern(), using glob. Files on
disk matching the pattern are cached and available at
datafiles(). For instance:
class MyDataInterface(DataInterface):
    class Source(GlobSource):
        def get_root_directory(self):
            return ["/data", self.parameters["user"], "subfolder"]

        def get_glob_pattern(self):
            return "SST_*.nc"

files = MyDataInterface().get_source()
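For readers unfamiliar with glob, the lookup that such a module performs can be approximated with the standard library alone. This is a minimal sketch, not Neba's code: the helper find_files is hypothetical, and it assumes the root directory parts are simply joined before the pattern is applied.

```python
import glob
import os
import tempfile


def find_files(root_parts: list[str], pattern: str) -> list[str]:
    """Join the root directory parts, then glob for the pattern inside."""
    root = os.path.join(*root_parts)
    return sorted(glob.glob(os.path.join(root, pattern)))


# Build a throwaway directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "alice", "subfolder")
    os.makedirs(folder)
    for name in ("SST_2020.nc", "SST_2021.nc", "SSH_2020.nc"):
        open(os.path.join(folder, name), "w").close()

    matches = find_files([tmp, "alice", "subfolder"], "SST_*.nc")
    found = [os.path.basename(f) for f in matches]

print(found)
```

Only the two files starting with `SST_` are returned. The results are sorted here because `glob.glob` does not guarantee any ordering.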
FileFinder#
For a similar scenario of a dataset split across many files (for different
dates, variables or parameter values), a more precise solution is provided
by FileFinderSource. This module relies on the filefinder package to find files
according to a specific filename pattern. For instance:
class MyDataInterface(DataInterface):
    class Source(FileFinderSource):
        def get_root_directory(self):
            return ["/data", self.parameters["user"], "subfolder"]

        def get_filename_pattern(self):
            return "SST_%(depth:fmt=.1f)_%(Y)%(m)%(d).nc"
This module has several advantages over a simple glob pattern. Its filename pattern can define parameters with specific formatting. Thus it can “fix” some parameters and restrict its search. With the same example as above we can select only the files for a specific depth:
MyDataInterface(depth=10.0).get_source()
If we fix all parameters we can also generate a filename for a given set of parameters:
MyDataInterface(depth=10.0).source.get_filename(Y=2015, m=5, d=1)
# or equivalent:
MyDataInterface(depth=10.0, Y=2015, m=5, d=1).source.get_filename()
See the filefinder documentation for more details on its features.
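The idea of fixing some parameters to narrow a search, or all of them to generate a filename, can be illustrated without filefinder. The sketch below is a stand-alone approximation using only `re` and `format`. The names SEGMENTS, PARAMS, build_regex and make_filename are hypothetical and belong neither to filefinder nor to Neba, and the per-parameter regular expressions are assumptions made for this example.

```python
import re

# Hypothetical description of the pattern "SST_%(depth:fmt=.1f)_%(Y)%(m)%(d).nc":
# each segment is either literal text or a named parameter.
SEGMENTS = [
    ("text", "SST_"),
    ("param", "depth"),
    ("text", "_"),
    ("param", "Y"),
    ("param", "m"),
    ("param", "d"),
    ("text", ".nc"),
]
# For each parameter: (format spec, regex matching a formatted value).
PARAMS = {
    "depth": (".1f", r"\d+\.\d"),
    "Y": ("04d", r"\d{4}"),
    "m": ("02d", r"\d{2}"),
    "d": ("02d", r"\d{2}"),
}


def build_regex(**fixed):
    """Regex matching filenames; fixed parameters are replaced by their value."""
    parts = []
    for kind, value in SEGMENTS:
        if kind == "text":
            parts.append(re.escape(value))
        elif value in fixed:
            fmt = PARAMS[value][0]
            parts.append(re.escape(format(fixed[value], fmt)))
        else:
            parts.append(f"(?P<{value}>{PARAMS[value][1]})")
    return re.compile("".join(parts))


def make_filename(**values):
    """Generate a filename; every parameter must be given."""
    return "".join(
        value if kind == "text" else format(values[value], PARAMS[value][0])
        for kind, value in SEGMENTS
    )


names = ["SST_10.0_20150501.nc", "SST_20.0_20150501.nc", "SST_10.0_20150502.nc"]
depth_10 = [n for n in names if build_regex(depth=10.0).fullmatch(n)]
print(depth_10)
print(make_filename(depth=10.0, Y=2015, m=5, d=1))
```

Because the fixed depth is formatted with the same `.1f` specification as in the pattern, only filenames containing `10.0` are selected, which a plain glob such as `SST_*.nc` cannot express as precisely.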
Xarray#
A compilation of modules for using Xarray is
available in neba.data.xarray. This submodule is not imported in the top
level package to avoid importing Xarray unless needed.
XarrayLoader | Load from a single source with Xarray.
XarrayWriter | Write Xarray datasets.
XarraySplitWriter | Writer that automatically splits Xarray datasets across multiple files.
Loaders#
XarrayLoader will load either from a single file or store with
open_dataset() or from multiple files using
open_mfdataset().
Options for these functions can be changed in the attributes
open_dataset_kwargs and
open_mfdataset_kwargs:
class MyDataInterface(DataInterface):
    class Loader(XarrayLoader):
        open_mfdataset_kwargs = dict(...)
Writers#
XarrayWriter can write to either a single file/store, or to multiple
files if given a sequence of datasets. It guesses the function to use from
the file extension. It currently supports Zarr and NetCDF.
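Guessing the writing function from the extension can be pictured as a simple lookup. The sketch below is an assumption about how such a dispatch could look, not the module's actual code: the mapping WRITE_METHODS and the helper guess_write_method are hypothetical. The method names `to_netcdf` and `to_zarr` are those of `xarray.Dataset`, but the sketch does not import Xarray.

```python
from pathlib import Path

# Assumed mapping from file extension to the name of an xarray.Dataset method.
WRITE_METHODS = {".nc": "to_netcdf", ".zarr": "to_zarr"}


def guess_write_method(target: str) -> str:
    """Return the name of the Dataset method suited to the target extension."""
    suffix = Path(target).suffix.lower()
    try:
        return WRITE_METHODS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported extension {suffix!r} for {target!r}") from None


print(guess_write_method("/data/output/temp_2020.nc"))
print(guess_write_method("/data/output/temp_2020.zarr"))
```

With a dataset `ds` at hand, the result could be used as `getattr(ds, guess_write_method(path))(path)`.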
Note
The write() method will automatically add metadata to the dataset
attributes via add_metadata().
When writing data across multiple files or stores, if given a Dask
client argument, it will use
send_calls_together() to execute multiple writing
operations in parallel.
Important
Doing so is not entirely straightforward: it may raise permission errors on some
filesystems. Using the scratch filesystem on a cluster might solve this
issue. See the send_calls_together() documentation for
details on the implementation.
When writing to multiple files, the XarrayWriter module needs multiple
datasets and their respective target files. XarraySplitWriter aims
to simplify the writing process further by automatically splitting a dataset
across files. It must be paired with a source module that implements the
Splitable protocol, meaning that some parameters can be left
unspecified and the dataset will be split along them. The source must also be
able to return a filename given values for those unspecified parameters.
FileFinderSource can be used for that purpose. For instance, we can
split a dataset along its depth dimension and automatically group by month,
using data along the lines of:
>>> ds
<xarray.Dataset>
Dimensions:  (time: 365, depth: 50, lat: 4320, lon: 8640)
Coordinates:
  * time     (time) datetime64[ns] 2020-01-01 ... 2020-12-31
  * depth    (depth) int64 0 1 5 10 20 40 ... 500 750 1000
  * lat      (lat) float32 89.98 89.94 89.9 ... -89.9 -89.94 -89.98
  * lon      (lon) float32 -180.0 -179.9 -179.9 ... 179.9 180.0
Data variables:
    temp     (time, depth, lat, lon) float32 dask.array<chunksize=(1, 1, 4320, 8640), meta=np.ndarray>
and an interface defined as:
class MyDataInterface(DataInterface):
    Writer = XarraySplitWriter

    class Source(FileFinderSource):
        def get_root_directory(self):
            return "/data/directory/"

        def get_filename_pattern(self):
            """Yearly folders, date as YYYYMM and depth as integer."""
            return "%(Y)/temp_%(Y)%(m)_depth_%(depth:fmt=d).nc"
we can then simply call MyDataInterface().write(ds). Note that this will detect
that the smallest time parameter in the filename pattern is the month and split
the dataset accordingly using xarray.Dataset.resample().
The time splitting can be specified manually or avoided altogether. See the
XarraySplitWriter.write() documentation for details.
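To make the outcome of such a split concrete, the following sketch enumerates the target files that a month-and-depth split would produce. It uses only the standard library and a handful of stand-in coordinates rather than a real dataset. The helper target_name is hypothetical and merely mimics the filename pattern above. The actual module relies on Xarray and the source module instead.

```python
from datetime import date, timedelta
from itertools import groupby, product

# Stand-ins for the dataset coordinates (a few days and depths only).
times = [date(2020, 1, 30) + timedelta(days=i) for i in range(4)]  # Jan 30 to Feb 2
depths = [0, 10]


def target_name(year: int, month: int, depth: int) -> str:
    """Mimic the pattern %(Y)/temp_%(Y)%(m)_depth_%(depth:fmt=d).nc."""
    return f"{year:04d}/temp_{year:04d}{month:02d}_depth_{depth:d}.nc"


# Group consecutive time steps by (year, month), the smallest time unit
# present in the filename pattern.
monthly = {
    key: list(days)
    for key, days in groupby(times, key=lambda t: (t.year, t.month))
}

# One output file per (month, depth) combination.
targets = {
    target_name(year, month, depth): (days, depth)
    for ((year, month), days), depth in product(monthly.items(), depths)
}

for name in sorted(targets):
    print(name, len(targets[name][0]))
```

Each entry pairs a filename with the time steps and the depth it would contain, which is the kind of bookkeeping the split writer spares the user.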
Note
If the overall write() implementation is not
appropriate, it is possible to control the splitting process more finely by
using split_by_unfixed() and
split_by_time(). The “time” dimension is split
separately to account for the fact that a filename pattern will define
separate datetime elements (the year, the month, the day, …).