emd

Classes

class emdfile.Array(data: ndarray, name: str | None = 'array', units: str | None = '', dims: list | None = None, dim_names: list | None = None, dim_units: list | None = None, slicelabels=None)

A class which stores any N-dimensional array-like data, plus basic metadata: a name and units, as well as calibrations for each axis of the array, and names and units for those axis calibrations.

In the simplest usage, only a data array is passed:

>>> ar = Array(np.ones((20,20,256,256)))

will create an array instance whose data is the numpy array passed, and with automatically populated dimension calibrations in units of pixels.

Additional arguments may be passed to populate the object metadata:

>>> ar = Array(
>>>     np.ones((20,20,256,256)),
>>>     name = 'test_array',
>>>     units = 'intensity',
>>>     dims = [
>>>         [0,5],
>>>         [0,5],
>>>         [0,0.01],
>>>         [0,0.01]
>>>     ],
>>>     dim_units = [
>>>         'nm',
>>>         'nm',
>>>         'A^-1',
>>>         'A^-1'
>>>     ],
>>>     dim_names = [
>>>         'rx',
>>>         'ry',
>>>         'qx',
>>>         'qy'
>>>     ],
>>> )

will create an array with a name and units for its data, where its first two dimensions are in units of nanometers, have pixel sizes of 5 nm, and are described by the handles ‘rx’ and ‘ry’, and where its last two dimensions are in units of inverse Angstroms, have pixel sizes of 0.01 A^-1, and are described by the handles ‘qx’ and ‘qy’.

Arrays in which the length of each pixel is non-constant are also supported. For instance,

>>> x = np.logspace(0,1,100)
>>> y = np.sin(x)
>>> ar = Array(
>>>     y,
>>>     dims = [
>>>         x
>>>     ]
>>> )

generates an array representing the values of the sine function sampled 100 times along a logarithmic interval from 1 to 10. In this example, this data could then be plotted with, e.g.

>>> plt.scatter(ar.dims[0], ar.data)

If the slicelabels keyword is passed, the first N-1 dimensions of the array are treated normally, while the final dimension is used to represent distinct arrays which share a common shape and set of dim vectors. Thus

>>> ar = Array(
>>>     np.ones((50,50,4)),
>>>     name = 'test_array_stack',
>>>     units = 'intensity',
>>>     dims = [
>>>         [0,2],
>>>         [0,2]
>>>     ],
>>>     dim_units = [
>>>         'nm',
>>>         'nm'
>>>     ],
>>>     dim_names = [
>>>         'rx',
>>>         'ry'
>>>     ],
>>>     slicelabels = [
>>>         'a',
>>>         'b',
>>>         'c',
>>>         'd'
>>>     ]
>>> )

will generate a single Array instance containing 4 arrays which each have a shape (50,50) and a common set of dim vectors [‘rx’,‘ry’], and which can be indexed into with the names assigned in slicelabels using

>>> ar.get_slice('a')

which will return a 2D (non-stack-like) Array instance with shape (50,50) and the dims assigned above. The Array attribute .rank is equal to the number of dimensions for a non-stack-like Array, and is equal to N-1 for stack-like arrays.
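The bookkeeping for stack-like arrays can be sketched in plain Python. This is a toy model, not emdfile's implementation: fill_slicelabels and rank_of are hypothetical helper names, and numbering the automatic array{i} labels by their position along the final axis is an assumption.

```python
def fill_slicelabels(slicelabels, depth):
    """Return `depth` labels, padding missing ones with 'array{i}'.

    `slicelabels` may be True (auto-label everything) or a list of strings.
    Assumption: auto-generated labels are numbered by their position
    along the final axis.
    """
    given = [] if slicelabels is True else list(slicelabels)
    if len(given) > depth:
        raise ValueError("more labels than slices along the final axis")
    return given + [f"array{i}" for i in range(len(given), depth)]


def rank_of(shape, is_stack):
    """Rank is the number of dimensions, minus one for stack-like arrays."""
    return len(shape) - 1 if is_stack else len(shape)


# Two labels given for a final axis of length 4: the rest are filled in.
labels = fill_slicelabels(["a", "b"], depth=4)
# A label-to-index table is all that is needed to slice by name.
index_of = {label: i for i, label in enumerate(labels)}
```

Under this model, looking up a label amounts to selecting one index along the final axis of the data.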

__init__(data: ndarray, name: str | None = 'array', units: str | None = '', dims: list | None = None, dim_names: list | None = None, dim_units: list | None = None, slicelabels=None)
Accepts:

data (np.ndarray): the data

name (str): the name of the Array

units (str): units for the pixel values

dims (variable): calibration vectors for each of the axes of the data array. Valid values for each element of the list are None, a number, a 2-element list/array, or an M-element list/array where M is the length of the data array along that axis. If None is passed, the dim will be populated with integer values starting at 0 and its units will be set to pixels. If a number is passed, the dim is populated with a vector beginning at zero and increasing linearly by this step size. If a 2-element list/array is passed, the dim is populated with a linear vector with these two numbers as the first two elements. If a list/array of length M is passed, this is used as the dim vector (and must therefore match this dimension’s length). If dims receives a list of fewer than N arguments for an N-dimensional data array, the extra dimensions are populated as if None were passed, using integer pixel values. If the dims parameter is not passed, all dim vectors are populated this way.

dim_units (list): the units for the calibration dim vectors. If nothing is passed, dim vectors which have been populated automatically with integers corresponding to pixel numbers will be assigned units of ‘pixels’, and any other dim vectors will be assigned units of ‘unknown’. If a list shorter than the number of array dimensions is passed, its values are applied to the leading dimensions, and the remaining units are populated with ‘pixels’ or ‘unknown’ as above.

dim_names (list): labels for each axis of the data array. Values which are not passed, following the same logic as described above, will be autopopulated with the name “dim#” where # is the axis number.

slicelabels (None or True or list): if not None, must be True or a list of strings, indicating a “stack-like” array. In this case, the first N-1 dimensions of the array are treated normally, in the sense of populating dims, dim_names, and dim_units, while the final dimension is treated distinctly: it indexes into distinct arrays which share a set of dimension attributes, and can be sliced into using the string labels from the slicelabels list, with the syntax array[‘label’] or array.get_slice(‘label’). If slicelabels is True or is a list with length less than the final dimension length, unassigned slices are autopopulated with labels array{i}. The flag array.is_stack is set to True and the array.rank attribute is set to N-1.

Returns:

A new Array instance
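The rules for populating dim vectors and their default units can be summarized with a short sketch. This is an illustration of the documented behavior in plain Python lists, not the library's code; expand_dim and expand_dims are hypothetical names.

```python
from numbers import Number


def expand_dim(spec, length):
    """Expand one `dims` entry into a calibration vector of `length` values.

    Returns (vector, default_units), following the rules described above.
    """
    if spec is None:
        # No calibration given: integer pixel positions.
        return list(range(length)), "pixels"
    if isinstance(spec, Number):
        # A single number is the step size, starting from zero.
        return [i * spec for i in range(length)], "unknown"
    spec = list(spec)
    if len(spec) == 2:
        # Two numbers are the first two elements of a linear vector.
        start, step = spec[0], spec[1] - spec[0]
        return [start + i * step for i in range(length)], "unknown"
    if len(spec) == length:
        # A full-length vector is used as given.
        return spec, "unknown"
    raise ValueError(f"dim has length {len(spec)}, expected 2 or {length}")


def expand_dims(dims, shape):
    """Pad `dims` with None so that every axis gets a vector."""
    dims = list(dims or [])
    dims += [None] * (len(shape) - len(dims))
    return [expand_dim(spec, n) for spec, n in zip(dims, shape)]
```

For example, expand_dims([[0, 5]], (3, 2)) calibrates the first axis as 0, 5, 10 and leaves the second axis in pixels.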

get_dim(n)

Return the n’th dim vector

dim(n)

Return the n’th dim vector

set_dim(n: int, dim: list | ndarray, units: str | None = None, name: str | None = None)

Sets the n’th dim vector, using dim as described in the Array documentation. If units and/or name are passed, sets these values for the n’th dim vector.

Accepts:

n (int): specifies which dim vector

dim (list or array): length must be either 2, or equal to the length of the n’th axis of the data array

units (Optional, str): units for the n’th dim vector

name (Optional, str): name for the n’th dim vector

get_dim_units(n)

Return the n’th dim vector units

set_dim_units(n: int, units: str)

Sets the n’th dim vector units to units.

Accepts:

n (int): specifies which dim vector

units (str): new units

get_dim_name(n)

Get the n’th dim vector name

set_dim_name(n: int, name: str)

Sets the n’th dim vector name to name.

Accepts:

n (int): specifies which dim vector

name (str): new name

to_h5(group)

Takes an h5py Group instance and creates a subgroup containing this Array, tags indicating its EMD type and Python class, and the array’s data and metadata.

Accepts:

group (h5py Group)

Returns:

(h5py Group) the new array’s Group

class emdfile.Custom(name='custom')
__init__(name='custom')
to_h5(group)

Constructs an h5 group, adds metadata, and adds all attributes which point to EMD nodes.

Accepts:

group (h5py Group)

Returns:

(h5py Group) the new node’s Group

class emdfile.Metadata(name: str | None = 'metadata', data: dict | None = None)

Stores metadata in the form of a flat (non-nested) dictionary. Keys are arbitrary strings. Values may be strings, numbers, arrays, or lists of the above types.

Usage:

>>> meta = Metadata()
>>> meta['param'] = value
>>> val = meta['param']

If the parameter has not been set, the getter methods return None.
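The getter behavior can be pictured with a small stand-in class. ToyMetadata is a hypothetical name used only for this sketch, and the real class may be implemented differently.

```python
class ToyMetadata:
    """Flat key/value store whose getter returns None for unset keys."""

    def __init__(self, name="metadata", data=None):
        self.name = name
        self._params = dict(data or {})

    def __setitem__(self, key, value):
        self._params[key] = value

    def __getitem__(self, key):
        # dict.get returns None instead of raising KeyError.
        return self._params.get(key)


meta = ToyMetadata()
meta["pixel_size"] = 0.5
```

Here meta['pixel_size'] returns the stored value, while a key that was never set, such as meta['missing'], returns None rather than raising.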

__init__(name: str | None = 'metadata', data: dict | None = None)
Parameters:

name (Optional, string) –

copy(name=None)
to_h5(group)

Accepts an h5py Group which is open in write or append mode. Writes a new group with this object’s name and saves its metadata in it.

Accepts:

group (h5py Group)

classmethod from_h5(group)

Accepts an h5py Group which is open in read mode, confirms that it represents an EMD MetadataDict group, then loads and returns it as a Metadata instance.

Accepts:

group (HDF5 group)

Returns:

(Metadata)

class emdfile.Node(name: str | None = 'node')

Nodes contain attributes and methods paralleling the EMD 1.0 file specification in Python runtime objects.

EMD 1.0 is a singly-rooted file format. That is to say: An EMD data object can and must exist in one and only one EMD tree. An EMD file can contain any number of EMD trees, each containing data and metadata which is, within the limits of the EMD group specifications, of some arbitrary complexity. An EMD 1.0 file thus represents, stores, and enables access to some arbitrary data in long term storage on a file system in the form of an HDF5 file. The Node class provides machinery for building trees of data and metadata which mirror the EMD tree format but which exist in a live Python instance, rather than on the file system. This facilitates ease of transfer between Python and the file system.

Nodes are intended to be used as a base class on which other, more complex classes can be built. Nodes themselves contain the machinery for managing a tree hierarchy of other Nodes and Metadata instances, and for reading and writing those trees. They do not contain any particular data. Classes storing data and analysis methods which inherit from Node will inherit its tree management and EMD i/o functionality.

Below, the 4 elements of the node class are each described in turn: roots, trees, metadata, and i/o.

ROOTS

EMD data objects can and must exist in one and only one EMD tree, each of which must have a single, named root node. To parallel this in our runtime objects, each Node has a root property, which can be found by calling self.root.

By default new nodes have their root set to None. If a node with .root == None is saved to file, it is placed inside a new root with the same name as the object itself, and this is then saved to the file as a new (minimal) EMD tree.

A new root node can be instantiated by calling

>>> rootnode = Root(name=some_name)

Objects added to an existing rooted tree (including a new root node) automatically have their root assigned to the root of that tree. Adding objects to trees is discussed below.

TREES

The tree associated with a node can be manipulated with the .tree method. If we have some rooted node node1 and some unrooted node node2, the unrooted node can be added to the existing tree as a child of the rooted node with

>>> node1.tree(node2)

If we have a rooted node node1 and another rooted node node2, we can’t simply add node2 with the code above, as this would create a conflict between the two roots. In this case, we can move node2 from its current tree to the new tree using

>>> node1.tree(graft=node2)

The .tree method has various additional functionalities, including printing the tree, retrieving objects from the tree, and cutting branches from the tree. These are summarized below:

>>> .tree()             # show tree from current node
>>> .tree(show=True)    # show from root
>>> .tree(show=False)   # show from current node
>>> .tree(add=node)     # add a child node
>>> .tree(get='path')   # return a '/' delimited child node
>>> .tree(get='/path')  # as above, starting at root
>>> .tree(cut=True)     # remove/return a branch, keep root metadata
>>> .tree(cut=False)    # remove/return a branch, discard root md
>>> .tree(cut='copy')   # remove/return a branch, copy root metadata
>>> .tree(graft=node)   # remove/graft a branch, keep root metadata
>>> .tree(graft=(node,True))    # as above
>>> .tree(graft=(node,False))   # as above, discard root metadata
>>> .tree(graft=(node,'copy'))  # as above, copy root metadata

The show, add, and get methods can be accessed directly with

>>> .tree(arg)

for an arg of the appropriate type (bool, Node, and string), i.e. in most cases, the keyword can be dropped. So

>>> .tree()
>>> .tree(node)
>>> .tree(True)
>>> .tree('some/node')

will, respectively, print the tree from the current node to screen, add the node node to the tree, print the tree from the root node to screen, and return the node at the emdpath ‘some/node’.

If a node needs to be added to a tree and it may or may not already have its own root, calling

>>> .tree(add=node, force=True)

or

>>> .tree(node, force=True)

will add the node to the tree, using a simple add if node has no root, and grafting it if it does have a root.
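The distinction between adding, grafting, and force-adding can be modeled with a toy class. This is a sketch of the rules described in this section only: ToyNode and its method names are hypothetical, and the handling of root metadata during a graft is left out.

```python
class ToyNode:
    """Toy stand-in for a tree node; not the emdfile.Node class."""

    def __init__(self, name, is_root=False):
        self.name = name
        self.children = {}
        self.parent = None
        self.root = self if is_root else None

    def _set_root(self, root):
        # Propagate the root to this node and everything below it.
        self.root = root
        for child in self.children.values():
            child._set_root(root)

    def add(self, node):
        """Simple add: only valid for an unrooted node."""
        if node.root is not None:
            raise ValueError("node is already rooted; graft it instead")
        self.children[node.name] = node
        node.parent = self
        node._set_root(self.root)

    def graft(self, node):
        """Move a rooted branch from its current tree onto this one."""
        if node.parent is not None:
            del node.parent.children[node.name]
        self.children[node.name] = node
        node.parent = self
        node._set_root(self.root)

    def force_add(self, node):
        """Add if unrooted, graft if rooted."""
        if node.root is None:
            self.add(node)
        else:
            self.graft(node)


tree_a, tree_b = ToyNode("a", is_root=True), ToyNode("b", is_root=True)
leaf = ToyNode("leaf")
tree_a.add(leaf)        # leaf is unrooted, so a simple add works
tree_b.force_add(leaf)  # leaf is now rooted in tree_a, so this grafts it
```

After the last line, leaf belongs to tree_b only, which keeps each node in exactly one tree.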

METADATA

Nodes can contain any number of Metadata instances, each of which wraps a Python dictionary of some arbitrary complexity (to within the limits of the Metadata group EMD specification, which limits permissible values somewhat).

The code:

>>> md1 = Metadata(name='md1')
>>> md2 = Metadata(name='md2')
>>> # ... some code populating md1 and md2 ...
>>> node.metadata = md1
>>> node.metadata = md2

will create two Metadata objects, populate them with data, then add them to the node. Note that Node.metadata is not a plain Python attribute; it is a specially defined property, such that the last line of code does not overwrite the line before it. Rather, assigning to the .metadata property adds the new metadata object to a running dictionary of arbitrarily many metadata objects. Both of these metadata instances can therefore still be retrieved, using:

>>> x = node.metadata['md1']
>>> y = node.metadata['md2']

Note, however, that if the second metadata instance has an identical name to the first instance, then it will overwrite the old instance.
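The accumulate-on-assignment behavior is an ordinary Python property with a custom setter. The following is a minimal sketch of that pattern, with hypothetical ToyNode and ToyMetadata classes standing in for the real ones.

```python
class ToyMetadata:
    def __init__(self, name):
        self.name = name


class ToyNode:
    def __init__(self):
        self._metadata = {}

    @property
    def metadata(self):
        return self._metadata

    @metadata.setter
    def metadata(self, md):
        # Assignment adds to the running dictionary, keyed by name.
        # A second instance with the same name replaces the first.
        self._metadata[md.name] = md


node = ToyNode()
md1, md2 = ToyMetadata("md1"), ToyMetadata("md2")
node.metadata = md1
node.metadata = md2
replacement = ToyMetadata("md1")
node.metadata = replacement  # same name as md1, so md1 is replaced
```

Keying the dictionary by name is what makes two differently named instances coexist while a repeated name overwrites.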

I/O

# TODO

__init__(name: str | None = 'node')
show_tree(root=False)

Display the object tree. If root is False, displays the branch of the tree downstream from this node. If root is True, displays the full tree from the root node.

add_to_tree(node)

Add an unrooted node as a child of the current, rooted node. To move an already rooted node/branch, use .graft(). To create a rooted node, use Root().

force_add_to_tree(node)

Add node node as a child of the current node, whether or not node is rooted. If it’s unrooted, performs a simple add. If it is rooted, performs a graft, excluding the root metadata from node.

get_from_tree(name)

Finds and returns an object from an EMD tree using the string key name, with ‘/’ delimiters between ‘parent/child’ nodes. Search from the root node by adding a leading ‘/’; otherwise, searches from the current node.
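Path lookup of this kind can be sketched as a walk over child dictionaries. The class and function below are hypothetical stand-ins, and the sketch assumes that a path with a leading ‘/’ names children of the root rather than the root itself.

```python
class ToyNode:
    """Toy stand-in with just enough structure to resolve paths."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = {}
        self.root = self if parent is None else parent.root
        if parent is not None:
            parent.children[name] = self


def get_from_tree(node, path):
    """Resolve a '/'-delimited path from `node`, or from its root if the
    path starts with '/'."""
    current = node.root if path.startswith("/") else node
    for key in path.strip("/").split("/"):
        if key:  # skip empty segments, e.g. for the path '/'
            current = current.children[key]
    return current


root = ToyNode("root")
scan = ToyNode("scan", parent=root)
image = ToyNode("image", parent=scan)
```

With this tree, 'scan/image' resolved from root and '/scan/image' resolved from any node both reach the same object.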

graft(node, merge_metadata=True)

Moves the branch beginning at node onto this tree at this node.

For the reverse (i.e. grafting from this tree onto another tree) either use that tree’s .graft method, or use this tree’s ._graft.

Accepts:

node (Node): the first node of the branch to move

merge_metadata (True, False, or ‘copy’): if True adds the old root’s metadata to the new root; if False adds no metadata to the new root; if ‘copy’ adds copies of all metadata from the old root to the new root.

Returns:

(Node) this tree’s root node

cut_from_tree(root_metadata=True)

Removes a branch from an object tree at this node.

A new root node is created under this object with this object’s name. Metadata from the current root is transferred/not transferred to the new root according to the value of root_metadata.

Accepts:
root_metadata (True, False, or ‘copy’): if True adds the old root’s metadata to the new root; if False adds no metadata to the new root; if ‘copy’ adds copies of all metadata from the old root to the new root.

Returns:

(Node) the new root node

tree(arg=None, **kwargs)

Usages -

>>> .tree()             # show tree from current node
>>> .tree(show=True)    # show from root
>>> .tree(show=False)   # show from current node
>>> .tree(add=node)     # add a child node
>>> .tree(get='path')   # return a '/' delimited child node
>>> .tree(get='/path')  # as above, starting at root
>>> .tree(cut=True)     # remove/return a branch, keep root metadata
>>> .tree(cut=False)    # remove/return a branch, discard root md
>>> .tree(cut='copy')   # remove/return a branch, copy root metadata
>>> .tree(graft=node)   # remove/graft a branch, keeping root metadata
>>> .tree(graft=(node,True))    # as above
>>> .tree(graft=(node,False))   # as above, discard root metadata
>>> .tree(graft=(node,'copy'))  # as above, copy root metadata

The show, add, and get methods can be accessed directly with

>>> .tree(arg)

for an arg of the appropriate type (bool, Node, and string).

static newnode(method)

Decorator which may be added to node methods which produce and return a new node. If such a method is decorated with

>>> @newnode

then the new node is added to the parent node’s tree, and a Metadata instance is added to the new node’s metadata which stores information about how the node was created, namely: method’s name, the parent’s class and name, and all the arguments passed to method.
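A decorator with this shape might look roughly like the sketch below. It is a toy version under stated assumptions: ToyNode, the children list, and the ‘_fn_call’ metadata key are all hypothetical and are not taken from the library.

```python
import functools


def newnode(method):
    """Toy version of the decorator: attach the new node to the parent and
    record how it was made."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        new = method(self, *args, **kwargs)
        self.children.append(new)  # add the result to the parent's tree
        new.metadata["_fn_call"] = {  # hypothetical key name
            "method": method.__name__,
            "parent_class": type(self).__name__,
            "parent_name": self.name,
            "args": args,
            "kwargs": kwargs,
        }
        return new

    return wrapper


class ToyNode:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.metadata = {}

    @newnode
    def scaled(self, factor, name="scaled"):
        # A real method would compute something; this one only makes a node.
        return ToyNode(name)


parent = ToyNode("raw")
child = parent.scaled(2, name="doubled")
```

The decorated method keeps its own signature, and the provenance record travels with the node it returns.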

classmethod from_h5(group)

Takes an h5py Group which is open in read mode. Confirms that a Node of this name exists in this group, and loads and returns it with its metadata.

Accepts:

group (h5py Group)

Returns:

(Node)

to_h5(group)

Takes an h5py Group instance and creates a subgroup containing this node, tags indicating the group’s EMD type and Python class, and any metadata in this node.

Accepts:

group (h5py Group)

Returns:

(h5py Group) the new node’s Group

class emdfile.PointList(data: ndarray, name: str | None = 'pointlist')

A wrapper around structured numpy arrays, with read/write functionality in/out of EMD formatted HDF5 files.

__init__(data: ndarray, name: str | None = 'pointlist')

Instantiate a PointList.

Parameters:
  • data (structured numpy ndarray) – the data; the dtype of this array will specify the fields of the PointList.

  • name (str) – name for the PointList

Returns:

a PointList instance

add(data)

Appends a numpy structured array. Its dtypes must agree with the existing data.

remove(mask)

Removes points wherever mask==True

sort(field, order='ascending')

Sorts the point list according to field, which must be a field in self.dtype. order should be ‘descending’ or ‘ascending’.

copy(name=None)

Returns a copy of the PointList. If name=None, sets to {name}_copy

add_fields(new_fields, name='')

Creates a copy of the PointList, but with additional fields given by new_fields.

Parameters:
  • new_fields – a list of 2-tuples, (‘name’, dtype)

  • name – a name for the new pointlist

add_data_by_field(data, fields=None)

Add a list of data arrays to the PointList, in the fields given by fields. If fields is not specified, assumes the data arrays are in the same order as self.fields

Parameters:

data (list) – arrays of data to add to each field

to_h5(group)

Takes an h5py Group instance and creates a subgroup containing this PointList, tags indicating its EMD type and Python class, and the pointlist’s data and metadata.

Accepts:

group (h5py Group)

Returns:

(h5py Group) the new pointlist’s group

class emdfile.PointListArray(dtype, shape, name: str | None = 'pointlistarray')

A 2D array of PointLists which share common coordinates.

__init__(dtype, shape, name: str | None = 'pointlistarray')

Creates an empty PointListArray.

Parameters:
  • dtype – the dtype of the numpy structured arrays which will comprise the data of each PointList

  • shape (2-tuple of ints) – the shape of the array of PointLists

  • name (str) – a name for the PointListArray

Returns:

a PointListArray instance

get_pointlist(i, j, name=None)

Returns the pointlist at i,j

copy(name='')

Returns a copy of itself.

add_fields(new_fields, name='')

Creates a copy of the PointListArray, but with additional fields given by new_fields.

Parameters:
  • new_fields – a list of 2-tuples, (‘name’, dtype)

  • name – a name for the new PointListArray

to_h5(group)

Takes an h5py Group instance and creates a subgroup containing this PointListArray, tags indicating its EMD type and Python class, and the pointlistarray’s data and metadata.

Accepts:

group (h5py Group)

Returns:

(h5py Group) the new pointlistarray’s group

class emdfile.Root(name='root')

A Node instance with its .root property set to itself.

__init__(name='root')

Functions

emdfile._get_EMD_version(filepath, rootgroup=None)

Returns the version (major,minor,release) of an EMD file.

emdfile._is_EMD_file(filepath)

Returns True iff filepath points to a valid EMD 1.0 file.

emdfile._version_is_geq(current, minimum)

Returns True iff current version (major,minor,release) is greater than or equal to minimum.
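A comparison of this kind maps directly onto Python's tuple ordering. The function below is a sketch of the idea with a hypothetical name, not the library's private helper.

```python
def version_is_geq(current, minimum):
    """True iff `current` >= `minimum`, comparing (major, minor, release)."""
    # Tuples compare element by element, left to right, so the major
    # number dominates, then minor, then release.
    return tuple(current) >= tuple(minimum)
```

For instance, (1, 0, 0) satisfies a minimum of (0, 2, 9) because the major number is compared first.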

emdfile.dirname(p)

Returns the directory component of a pathname

emdfile.join(a, *p)

Join two or more pathname components, inserting ‘/’ as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator.
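The descriptions of dirname and join match the standard library's path functions. Assuming they follow the same POSIX-style semantics (an assumption, since this page does not say where they come from), their behavior can be explored with posixpath directly:

```python
import posixpath

# Components are joined with '/' as needed.
nested = posixpath.join("root", "scan", "image")

# An absolute component discards everything before it.
restarted = posixpath.join("root", "/other", "image")

# An empty last part leaves a trailing separator.
trailing = posixpath.join("root", "scan", "")

# dirname returns everything before the final '/'.
parent = posixpath.dirname("root/scan/image")
```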

emdfile.print_h5_tree(filepath, show_metadata=False)

Prints the contents of an h5 file from a filepath.

emdfile.read(filepath, emdpath: str | None = None, tree: bool | str | None = True, **legacy_options)

File reader for EMD 1.0+ files.

Parameters:
  • filepath (str or Path) – the file path

  • emdpath (str) – path to the node in an EMD object tree to read from. May be a root node or some downstream node. Use ‘/’ delimiters between node names. If emdpath is None, checks to see how many root nodes are present. If there is one, loads this tree. If there are several, returns a list of the root names.

  • tree (True or False or 'branch') – indicates what data should be loaded, relative to the node specified by emdpath. If set to False, only data/metadata in the specified node is loaded, plus any root metadata. If set to True, loads that node plus the subtree of data objects it contains (and their metadata, and the root metadata). If set to ‘branch’, loads the branch under this node as above, but does not load the node itself. If emdpath points to a root node, setting tree to ‘branch’ or True are equivalent - both return the whole data tree.

Returns:

(Root) returns a Root instance containing (1) any root metadata from the EMD tree loaded from, and (2) a tree of one or more pieces of data/metadata

emdfile.save(filepath, data, mode='w', emdpath=None, tree=True)

Saves data to a .h5 file at filepath.

Calling

>>> save(path, data)

if data is a Root instance saves this root and its entire tree to a new file. If data is any other type of rooted node (i.e. a node inside of some runtime data tree), this code writes a new file with a single tree using this node’s root (even if this node is far downstream of the root node), placing this node and the tree branch underneath it inside that root. In both cases, the root metadata is stored in the new H5 root node. If data is an unrooted node (i.e. a freestanding node not connected to a tree), this code creates a new root node with no metadata and this node’s name, and places this node inside that root in a new file.

If data is a numpy array or Python dictionary, wraps data in either an emd.Array or emd.Metadata instance, assigns the name ‘np.array’ or ‘dictionary’, places the object in a root of this name and saves. If data is a list of objects which are all numpy arrays, Python dictionaries, or emd.Node instances, places all these objects into a single root, assigns the root’s name according to the first object in the list, and saves.

To write a single node from a tree, set tree to False. To write the tree underneath a node but exclude the node itself set tree to None.

To add to an existing EMD file, use the mode argument to set append or appendover mode. If the emdpath variable is not set and data has a runtime root that does not exist in the EMD root groups already present, adds the new root and writes as described above. If emdpath is not set and the runtime root group matches a root group that’s already present, this function performs a diff operation between the root metadata and data nodes from data and those already in the H5 file. Append mode adds any data/metadata groups with no equivalent (i.e. same name and tree location) in the H5 tree, while skipping any data/metadata already found in the tree. Appendover adds any data/metadata with no equivalent already in the H5 tree, and overwrites any data/metadata groups that are already represented in the HDF5 with the new data. Note that this function does not attempt to take a diff between the contents of the groups and the runtime data groups - it only considers the names and their locations in the tree. If append or appendover mode are used and filepath is set to a location that does not already contain a file on the filesystem, behavior is identical to write mode. When appendover mode overwrites data, it is erasing the old links and creating new links to new data; however, the HDF5 file does not release the space on the filesystem. To free up storage, set mode to ‘appendover’, and this function will add a final step to re-write then delete the old file.

The emdpath argument is used to append to a specific location in an extant EMD file downstream of some extant root. If passed, it must point to a valid location in the EMD file. This function will then perform a diff and write as described in the prior paragraph, except beginning from the H5 node specified in emdpath. Note that in this case the root metadata is still compared to and added or overwritten in the H5 root node, even if the remaining data is being added to some downstream branch.
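The difference between append and appendover can be illustrated by modeling both the file and the runtime tree as dictionaries keyed by tree location. This is a sketch of the documented semantics only; merge is a hypothetical function and no HDF5 file is involved.

```python
def merge(existing, incoming, mode):
    """Merge `incoming` into `existing`, both dicts of {emdpath: data}.

    Only names and tree locations (the keys) are compared; the contents
    of matching groups are never inspected.
    """
    merged = dict(existing)
    for path, data in incoming.items():
        if path not in merged or mode == "appendover":
            merged[path] = data  # new group, or overwrite in appendover mode
        # In append mode, a group that already exists is skipped.
    return merged


on_disk = {"root/scan": "old scan", "root/mask": "old mask"}
runtime = {"root/scan": "new scan", "root/probe": "new probe"}
```

Both modes add ‘root/probe’ and keep ‘root/mask’; they differ only in whether ‘root/scan’ keeps its old contents.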

Parameters:
  • filepath – path where the file will be saved

  • data – an EMD data class instance

  • mode (str) –

    supported modes and their keys are:
    • write (‘w’,’write’)

    • overwrite (‘o’,’overwrite’)

    • append (‘a’,’+’,’append’)

    • appendover (‘ao’,’oa’,’o+’,’+o’,’appendover’)

    Write mode writes a new file, and raises an exception if a file of this name already exists. Overwrite mode deletes any file of this name that already exists and writes a new file. Append and appendover mode write a new file if no file of this name exists, or if a file of this name does exist, add new data to the file. The specific behavior of append and appendover depends on the data, emdpath, and tree arguments, as discussed in more detail above. Broadly, both modes attempt to determine the difference between the data passed and that present in the extant HDF5 file tree, add any data not already in the H5, and then either skip or overwrite conflicting nodes in append or appendover mode, respectively.

  • tree – indicates how the object tree nested inside data should be treated. If True (default), the entire tree is saved. If False, only this object is saved, without its tree. If None, saves the entire tree underneath data, but not the node at data itself.

  • emdpath (str or None) – optional parameter used in conjunction with append or appendover mode; if passed in write or overwrite mode, this argument is ignored. Indicates where in an existing EMD file tree to place the data. Must be a ‘/’ delimited string pointing to an existing EMD file tree node.
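Because each mode accepts several keys, a lookup table is a natural way to normalize them. The sketch below uses the keys listed for the mode parameter; normalize_mode and the table names are hypothetical, and how the library itself resolves the keys is not shown on this page.

```python
MODE_KEYS = {
    "write": ("w", "write"),
    "overwrite": ("o", "overwrite"),
    "append": ("a", "+", "append"),
    "appendover": ("ao", "oa", "o+", "+o", "appendover"),
}

# Invert the table so that each accepted key maps to its mode.
KEY_TO_MODE = {key: mode for mode, keys in MODE_KEYS.items() for key in keys}


def normalize_mode(mode):
    """Return the canonical mode name for any accepted key."""
    try:
        return KEY_TO_MODE[mode]
    except KeyError:
        raise ValueError(f"unrecognized mode {mode!r}") from None
```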

emdfile.set_author(author)

Accepts a string, which will be written to the “authoring_user” field in any EMD file headers written during this Python session

emdfile.tqdmnd(*args, **kwargs)

An N-dimensional extension of tqdm providing an iterator and progress bar over the product of multiple iterators.

Example Usage:

>>> for x,y in tqdmnd(5,6):
>>>     <expression>

is equivalent to

>>> for x in range(5):
>>>     for y in range(6):
>>>         <expression>

with a tqdmnd-style progress bar printed to standard output.

Accepts:
*args: Any number of integers or iterators. Each integer N is converted to a range(N) iterator. Then a loop is constructed from the Cartesian product of all iterables.

**kwargs: keyword arguments passed through directly to tqdm. Full details are available at https://tqdm.github.io. A few useful ones:

disable (bool): if True, hide the progress bar

keep (bool): if True, delete the progress bar after completion

unit (str): unit name for the display of iteration speed

unit_scale (bool): whether to scale the displayed units and add SI prefixes

desc (str): message displayed in front of the progress bar

Returns:

At each iteration, a tuple of indices is returned, corresponding to the values of each input iterator (in the same order as the inputs).
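The iteration order, without the progress bar, is that of itertools.product. The sketch below shows only that part; nd_indices is a hypothetical name and tqdm is not used.

```python
from itertools import product


def nd_indices(*args):
    """Iterate index tuples over the Cartesian product of the inputs.

    Integers are converted to range(N); other iterables are used as given.
    """
    iterables = [range(a) if isinstance(a, int) else a for a in args]
    return product(*iterables)


# The flat loop visits the same tuples, in the same order, as nested loops.
flat = list(nd_indices(2, 3))
nested = [(x, y) for x in range(2) for y in range(3)]
```

The last input varies fastest, which matches the innermost loop of the nested form.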