Collections

Collections within a single file can always be loaded with opencosmo.open(). Collections can be treated like read-only dictionaries. Dataset names can be retrieved with keys(), the datasets can be accessed with values() or Collection[key], and iteration can be done with items().
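The dictionary-like behavior described above can be sketched in plain Python. This is a minimal illustration and not opencosmo's implementation: ReadOnlyCollection is a hypothetical stand-in class, and the strings stand in for Dataset objects.

```python
from collections.abc import Mapping


class ReadOnlyCollection(Mapping):
    """Hypothetical stand-in: a read-only mapping of names to datasets."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)  # private copy; no public setter

    def __getitem__(self, key):
        return self._datasets[key]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)


# Strings stand in for real Dataset objects.
collection = ReadOnlyCollection(
    {"halo_properties": "<Dataset A>", "dm_particles": "<Dataset B>"}
)

names = list(collection.keys())
first = collection["halo_properties"]
pairs = list(collection.items())

# Mapping defines no __setitem__, so item assignment fails.
try:
    collection["new"] = "<Dataset C>"
    assignment_failed = False
except TypeError:
    assignment_failed = True
```

Reading works exactly as with a dict, while any attempt to assign a new entry raises an error.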

class opencosmo.Lightcone(datasets, z_range=None, hidden=None, ordered_by=None)

A lightcone contains two or more datasets that are part of a lightcone. Typically each dataset will cover a specific redshift range. The Lightcone object hides these details, providing an API that is identical to the standard Dataset API. Additionally, the lightcone contains some convenience functions for standard operations.

Parameters:

datasets (dict[str, Dataset])
z_range (tuple[float, float] | None)
hidden (set[str] | None)
ordered_by (tuple[str, bool] | None)

property header: OpenCosmoHeader

The header associated with this dataset.

OpenCosmo headers generally contain information about the original data this dataset was produced from, as well as any analysis that was done along the way.

Returns:: header
Return type:: opencosmo.header.OpenCosmoHeader

property columns: list[str]

The names of the columns in this dataset.

Returns:: columns
Return type:: list[str]

property cosmology: Cosmology

The cosmology of the simulation this dataset is drawn from as an astropy.cosmology.Cosmology object.

Returns:: cosmology
Return type:: astropy.cosmology.Cosmology

property dtype: str

The data type of this dataset.

Returns:: dtype
Return type:: str

property region: Region

The region this dataset is contained in. If no spatial queries have been performed, this will be the entire simulation box for snapshots or the full sky for lightcones.

Returns:: region
Return type:: opencosmo.spatial.Region

property simulation: HaccSimulationParameters

The parameters of the simulation this dataset is drawn from.

Returns:: parameters
Return type:: opencosmo.parameters.hacc.HaccSimulationParameters

property z_range

The redshift range of this lightcone.

Returns:: z_range
Return type:: tuple[float, float]

get_data(output='astropy')

Get the data in this dataset as an astropy table/column or as numpy array(s). Note that a dataset does not load data from disk into memory until this function is called. As a result, you should not call this function until you have performed all the transformations you plan to apply to the data.

You can get the data in two formats, “astropy” (the default) and “numpy”. “astropy” format will return the data as an astropy table with associated units. “numpy” will return the data as a dictionary of numpy arrays. The numpy values will be in the associated unit convention, but no actual units will be attached.

If the dataset only contains a single column, it will be returned as an astropy.table.Column or a single numpy array.

This method does not cache data. Calling “get_data” always reads data from disk, even if you have already called “get_data” in the past. You can use Dataset.data to return data and keep it in memory.

Parameters:: output (str, default="astropy") – The format to output the data in
Returns:: data – The data in this dataset.
Return type:: Table | Column | dict[str, ndarray] | ndarray

property data

Return the data in the dataset in astropy format. The value of this attribute is equivalent to the return value of Dataset.get_data("astropy"). However, data retrieved via this attribute is cached, so further accesses of Dataset.data should be nearly instantaneous.

There is one caveat. If you modify the table, those modifications will persist if you later request the data again with this attribute. Calls to Lightcone.get_data will be unaffected, and datasets generated from this dataset will not contain the modifications. If you plan to modify the data in this table, you should use Lightcone.with_new_columns.

Returns:: data – The data in the dataset.
Return type:: astropy.table.Table or astropy.table.Column
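The difference between the uncached get_data method and the cached data attribute can be sketched in plain Python. This is only an illustration of the caching pattern, not opencosmo's implementation: FakeDataset and its disk_reads counter are hypothetical, and lists of dictionaries stand in for astropy tables.

```python
from functools import cached_property


class FakeDataset:
    """Hypothetical stand-in that counts how often 'disk' is read."""

    def __init__(self, rows):
        self._rows = rows
        self.disk_reads = 0

    def get_data(self):
        # Uncached: every call performs a fresh read and returns a new object.
        self.disk_reads += 1
        return [dict(row) for row in self._rows]

    @cached_property
    def data(self):
        # Cached: the first access reads once; later accesses reuse the result.
        return self.get_data()


ds = FakeDataset([{"mass": 1.0}, {"mass": 2.0}])

ds.get_data()
ds.get_data()
reads_after_get_data = ds.disk_reads  # two reads so far

table = ds.data
table_again = ds.data
reads_after_data = ds.disk_reads  # only one additional read

# Modifying the cached table persists on later accesses of .data ...
table[0]["mass"] = -1.0
persisted = ds.data[0]["mass"]
# ... but a fresh get_data() call is unaffected.
fresh = ds.get_data()[0]["mass"]
```

The sketch also reproduces the caveat: in-place changes to the cached object are visible on later accesses of the attribute, but not in the result of a new get_data call.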

with_redshift_range(z_low, z_high)

Restrict this lightcone to a specific redshift range. Lightcone datasets will always contain a column titled “redshift”. This function always operates on this column.

This function also updates the value in Lightcone.z_range, so you should always use it rather than filtering on the column directly.

Parameters:

z_low (float)
z_high (float)
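Why the dedicated method is preferable to filtering on the column can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's code: FakeLightcone and filter_redshift_directly are hypothetical, and the choice of an inclusive lower bound and exclusive upper bound is an assumption that the text above does not specify.

```python
class FakeLightcone:
    """Hypothetical stand-in: rows are dicts that each carry a 'redshift' value."""

    def __init__(self, rows, z_range):
        self.rows = rows
        self.z_range = z_range

    def with_redshift_range(self, z_low, z_high):
        # Assumption: inclusive lower bound, exclusive upper bound.
        kept = [r for r in self.rows if z_low <= r["redshift"] < z_high]
        # The recorded range is updated together with the rows.
        return FakeLightcone(kept, (z_low, z_high))

    def filter_redshift_directly(self, z_low, z_high):
        # The same rows survive, but the recorded range is left untouched.
        kept = [r for r in self.rows if z_low <= r["redshift"] < z_high]
        return FakeLightcone(kept, self.z_range)


lightcone = FakeLightcone(
    [{"redshift": z} for z in (0.1, 0.4, 0.7, 1.2)], (0.0, 1.5)
)

restricted = lightcone.with_redshift_range(0.3, 1.0)
filtered = lightcone.filter_redshift_directly(0.3, 1.0)
```

Both results hold the same rows, but only the first reports a z_range that matches its contents.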

bound(region, select_by=None)

Restrict the dataset to some subregion. The subregion will always be evaluated in the same units as the current dataset. For example, if the dataset is in the default “comoving” unit convention, positions are always in units of comoving Mpc. However Region objects themselves do not carry units. See Regions for details of how to construct regions.

Parameters:

region (opencosmo.spatial.Region) – The region to query.
select_by (str | None)

Returns:

dataset – The portion of the dataset inside the selected region

Return type:

opencosmo.Dataset

Raises:

ValueError – If the query region does not overlap with the region this dataset resides in
AttributeError – If the dataset does not contain a spatial index

cone_search(center, radius)

Perform a search for objects within some angular distance of some given point on the sky. This is a convenience function around bound and is exactly equivalent to

region = oc.make_cone(center, radius)
ds = ds.bound(region)

Parameters:

center (tuple | SkyCoord) – The center of the region to search. If a tuple with no units is provided, it is assumed to be RA and Dec in degrees.
radius (float | astropy.units.Quantity) – The angular radius of the region to query. If no units are provided, assumed to be degrees.

Returns:

new_lightcone – The rows in this lightcone that fall within the given region.

Return type:

opencosmo.Lightcone
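The geometric test behind a cone search can be sketched with the standard library alone. This illustrates the idea of "within some angular distance" and is not how opencosmo performs the query (the library works through Region objects and its spatial index); the function names and sample coordinates are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a)))


def in_cone(point, center, radius_deg):
    """True if point lies within radius_deg of center (both as (RA, Dec) in degrees)."""
    return angular_separation_deg(*point, *center) <= radius_deg


center = (45.0, -30.0)
points = [(45.0, -30.5), (45.0, -32.5), (46.0, -30.0)]

# Keep only the points within 2 degrees of the center.
inside = [p for p in points if in_cone(p, center, 2.0)]
```

Note that a 1 degree offset in RA at a declination of -30 degrees corresponds to less than 1 degree on the sky, which is why the separation must be computed on the sphere rather than from coordinate differences.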

evaluate(func, format='astropy', vectorize=False, insert=True, **evaluate_kwargs)

Iterate over the rows in this dataset, apply func to each, and collect the results as new columns in the dataset. You may also choose to simply return the values instead of inserting them as columns.

This function is the equivalent of with_new_columns for cases where the new column is not a simple algebraic combination of existing columns. Unlike with_new_columns, this method will evaluate the results immediately and the resulting columns will not change under unit transformations.

The function should take arguments with the same names as the columns in this dataset that are needed for the computation, and should return a dictionary of output values. The dataset will automatically select the needed columns to avoid reading unnecessary data from disk. The new columns will have the same names as the keys of the output dictionary. See Evaluating on Datasets for more details.

If vectorize is set to True, the full columns will be passed to the function. Otherwise, rows will be passed to the function one at a time.

This function behaves identically to Dataset.evaluate.

Parameters:

func (Callable) – The function to evaluate on the rows in the dataset.
format (str, default = "astropy") – The format of the data that is provided to your function. If “astropy”, will be a dictionary of astropy quantities. If “numpy”, will be a dictionary of numpy arrays.
vectorize (bool, default = False) – Whether to provide the values as full columns (True) or one row at a time (False)
insert (bool, default = True) – If true, the data will be inserted as a column in this dataset. Otherwise the data will be returned.

Returns:

dataset – The new lightcone dataset with the evaluated column(s)

Return type:

Lightcone
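The calling convention described for evaluate (argument names select columns, the function returns a dictionary, and vectorize switches between whole columns and single rows) can be sketched in plain Python. This is a hypothetical re-implementation for illustration only: the standalone evaluate function below, the column names, and the use of lists instead of astropy quantities are all assumptions.

```python
import inspect


def evaluate(columns, func, vectorize=False):
    """Hypothetical sketch: call func with only the columns its signature names."""
    needed = list(inspect.signature(func).parameters)
    selected = {name: columns[name] for name in needed}

    if vectorize:
        # func sees whole columns and returns {new_column_name: full column}.
        return func(**selected)

    # func sees one row at a time; collect the per-row results into columns.
    n_rows = len(next(iter(selected.values())))
    results = {}
    for i in range(n_rows):
        row_out = func(**{name: values[i] for name, values in selected.items()})
        for key, value in row_out.items():
            results.setdefault(key, []).append(value)
    return results


columns = {
    "mass": [1.0, 2.0, 3.0],
    "radius": [2.0, 4.0, 6.0],
    "unused": [0, 0, 0],  # never passed to the functions below
}


def density_per_row(mass, radius):
    # Receives scalars when vectorize=False.
    return {"density": mass / radius**3}


def density_vectorized(mass, radius):
    # Receives full columns when vectorize=True.
    return {"density": [m / r**3 for m, r in zip(mass, radius)]}


row_result = evaluate(columns, density_per_row)
vec_result = evaluate(columns, density_vectorized, vectorize=True)
```

Both calls produce the same "density" column; the vectorized form simply moves the loop into the user's function, where array operations can replace it.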

filter(*masks, **kwargs)

Filter the dataset based on some criteria. See Querying Based on Column Values for more information.

Parameters:: *masks (Mask) – The masks to apply to dataset, constructed with opencosmo.col()
Returns:: dataset – The new dataset with the masks applied.
Return type:: Dataset
Raises:: ValueError – If a given mask refers to columns that are not in the dataset, or if the filter would return zero rows.

rows()

Iterate over the rows in the dataset. Rows are returned as dictionaries. For performance, it is recommended to first select the columns you need to work with.

Yields:: row (dict) – A dictionary of values for each row in the dataset with units.
Return type:: Generator[dict[str, float | Quantity], None, None]

select(columns)

Create a new dataset from a subset of columns in this dataset.

Parameters:: columns (str or list[str]) – The column or columns to select.
Returns:: dataset – The new dataset with only the selected columns.
Return type:: Dataset
Raises:: ValueError – If any of the given columns are not in the dataset.

drop(columns)

Produce a new dataset by dropping columns from this dataset.

Parameters:: columns (str or list[str]) – The column or columns to drop.
Returns:: dataset – The new dataset without the dropped columns
Return type:: Dataset
Raises:: ValueError – If any of the given columns are not in the dataset.

take(n, at='random')

Create a new dataset from some number of rows from this dataset.

Can take the first n rows, the last n rows, or n random rows depending on the value of ‘at’.

Parameters:

n (int) – The number of rows to take.
at (str) – Where to take the rows from. One of “start”, “end”, or “random”. The default is “random”.

Returns:

dataset – The new dataset with only the selected rows.

Return type:

Dataset

Raises:

ValueError – If n is negative or greater than the number of rows in the dataset, or if ‘at’ is invalid.
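The three modes and the error conditions listed above can be sketched on a plain list. This is an illustration rather than the library's implementation: the standalone take function is hypothetical, the seed parameter exists only to make the sketch reproducible, and keeping random rows in their original order is an assumption.

```python
import random


def take(rows, n, at="random", seed=None):
    """Hypothetical sketch of take() semantics on a plain list of rows."""
    if n < 0 or n > len(rows):
        raise ValueError(f"Cannot take {n} rows from a dataset with {len(rows)} rows")
    if at == "start":
        return rows[:n]
    if at == "end":
        # Slice from an explicit index so that n == 0 returns no rows.
        return rows[len(rows) - n:]
    if at == "random":
        rng = random.Random(seed)
        # Assumption: sampled rows keep their original relative order.
        indices = sorted(rng.sample(range(len(rows)), n))
        return [rows[i] for i in indices]
    raise ValueError(f"Unknown value for 'at': {at!r}")


rows = list(range(10))

first_two = take(rows, 2, at="start")
last_two = take(rows, 2, at="end")
random_three = take(rows, 3, at="random", seed=0)
```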

with_new_columns(**columns)

Create a new dataset with additional columns. These new columns can be derived from columns already in the dataset, a numpy array, or an Astropy quantity array. When a column is derived from other columns, it will behave appropriately under unit transformations. See Adding Custom Columns and Dataset.with_new_columns for examples.

Parameters:: columns (**)
Returns:: dataset – This dataset with the columns added
Return type:: opencosmo.Dataset

sort_by(column, invert=False)

Sort this dataset by the values in a given column. By default sorting is in ascending order (least to greatest). Pass invert = True to sort in descending order (greatest to least).

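This can be used, for example, to select the 100 most massive halos in a given dataset:

import opencosmo as oc

dataset = oc.open("haloproperties.hdf5")
# Descending order puts the most massive halos first.
dataset = dataset.sort_by("fof_halo_mass", invert=True).take(100, at="start")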

Parameters:

column (str) – The column to order the dataset by.
invert (bool, default = False) – If False (the default), ordering will be from least to greatest. Otherwise greatest to least.

Returns:

result – A new Dataset ordered by the given column.

Return type:

Dataset

with_units(convention)

Create a new dataset from this one with a different unit convention.

Parameters:: convention (str) – The unit convention to use. One of “physical”, “comoving”, “scalefree”, or “unitless”.
Returns:: dataset – The new dataset with the requested unit convention.
Return type:: Dataset

collect()

Given a dataset that was originally opened with opencosmo.open, return a dataset that is in-memory as though it was read with opencosmo.read.

This is useful if you have a very large dataset on disk, and you want to filter it down and then close the file.

For example:

import opencosmo as oc
with oc.open("path/to/file.hdf5") as file:
    ds = file.filter(oc.col("sod_halo_mass") > 0)
    ds = ds.select(["sod_halo_mass", "sod_halo_radius"])
    ds = ds.collect()

The selected data will now be in memory, and the file will be closed.

If working in an MPI context, all ranks will receive the same data.

Return type:: Lightcone

class opencosmo.SimulationCollection(datasets)

A collection of datasets of the same type from different simulations. In general this exposes the exact same API as the individual datasets, but maps the results across all of them.

Parameters:: datasets (Mapping[str, Dataset | Collection])
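The idea of exposing the dataset API while mapping each call across all simulations can be sketched as follows. This is a plain-Python illustration, not the library's implementation: FakeDataset, FakeSimulationCollection, and the simulation names are hypothetical.

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset holding a list of rows."""

    def __init__(self, rows):
        self.rows = rows

    def take(self, n, at="start"):
        if at == "start":
            return FakeDataset(self.rows[:n])
        return FakeDataset(self.rows[len(self.rows) - n:])


class FakeSimulationCollection(dict):
    """Hypothetical sketch: forward a method call to every member dataset."""

    def _map(self, method, *args, **kwargs):
        # Apply the same operation to each member and wrap the results
        # in a new collection with the same keys.
        return FakeSimulationCollection(
            {name: getattr(ds, method)(*args, **kwargs) for name, ds in self.items()}
        )

    def take(self, n, at="start"):
        return self._map("take", n, at=at)


collection = FakeSimulationCollection(
    {
        "simulation_a": FakeDataset([1, 2, 3, 4]),
        "simulation_b": FakeDataset([10, 20, 30]),
    }
)

subset = collection.take(2)
result = {name: ds.rows for name, ds in subset.items()}
```

The original collection is left untouched; each operation returns a new collection keyed by the same simulation names, which mirrors how the properties below return dictionaries keyed by simulation.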

make_schema()

Return type:: DataSchema

property dtype: dict[str, str]

property header: dict[str, OpenCosmoHeader]

property cosmology: dict[str, Cosmology]

Get the cosmologies of the simulations in the collection

Returns:: cosmologies
Return type:: dict[str, astropy.cosmology.Cosmology]

property redshift: dict[str, float | tuple[float, float]]

Get the redshift slices or ranges for the simulations in the collection

Returns:: redshifts
Return type:: dict[str, float | tuple[float,float]]

property simulation: dict[str, HaccSimulationParameters]

Get the simulation parameters for the simulations in the collection

Returns:: simulation_parameters
Return type:: dict[str, opencosmo.parameters.HaccSimulationParameters]

bound(region, select_by=None)

Restrict the datasets to some region. Note that the SimulationCollection does not do any checking to ensure its members have identical boxes. As a result this method can in principle fail for some of the simulations in the collection and not others. This should never happen when working with official OpenCosmo data products.

See Regions for details of how to construct regions.

Parameters:

region (opencosmo.spatial.Region) – The region to query
select_by (str | None)

Returns:

dataset – The portion of the dataset inside the selected region

Return type:

opencosmo.Dataset

filter(*masks, **kwargs)

Filter the datasets in the collection. This method behaves exactly like opencosmo.Dataset.filter() or opencosmo.StructureCollection.filter(), but it applies the filter to all the datasets or collections within this collection. The result is a new collection.

Parameters:

*masks (ColumnMask) – The filters to apply, constructed with opencosmo.col()

Returns:

A new collection with the same datasets, but only the particles that pass the filter.

Return type:

SimulationCollection

select(*args, **kwargs)

Select a set of columns in the datasets in this collection. This method calls the underlying method in opencosmo.Dataset, or opencosmo.Collection depending on the context. As such its behavior and arguments can vary depending on what this collection contains.

Parameters:

args – The arguments to pass to the select method. This is usually a list of column names to select.
kwargs – The keyword arguments to pass to the select method. This is usually a dictionary of column names to select.

Return type:

Self

drop(*args, **kwargs)

Drop a set of columns from the datasets in the collection. This method calls the underlying method in opencosmo.Dataset, or opencosmo.Collection depending on the context. As such its behavior and arguments can vary depending on what this collection contains.

Parameters:

args – The arguments to pass to the drop method. This is usually a list of column names to drop.
kwargs – The keyword arguments to pass to the drop method. This is usually a dictionary of column names to drop.

Return type:

Self

take(n, at='random')

Take a subset of rows from all datasets or collections in this collection. This method will delegate to the underlying method in opencosmo.Dataset, or opencosmo.StructureCollection depending on the context. As such, behavior may vary depending on what this collection contains. See their documentation for more info.

Parameters:

n (int) – The number of rows to take
at (str, default = "random") – The method to use to take rows. Must be one of “start”, “end”, “random”.

Return type:

Self

with_new_columns(*args, datasets=None, **kwargs)

Update the datasets within this collection with a set of new columns. This method simply calls opencosmo.Dataset.with_new_columns() or opencosmo.StructureCollection.with_new_columns(), as appropriate.

You can also optionally pass the “datasets” keyword argument to specify that the operation should only be performed on a subset of the datasets.

If passing in numpy arrays or astropy quantities, they should be provided as a dictionary where the keys are the same as the keys in this dataset.

Parameters:

datasets (str | list[str], optional) – The datasets to add the columns to.
columns (**) – The new columns

evaluate(func, datasets=None, format='astropy', vectorize=False, insert=False, **evaluate_kwargs)

Evaluate the function func on each of the datasets or collections held by this SimulationCollection. This function simply delegates to either StructureCollection.evaluate or Dataset.evaluate as appropriate. Refer to Evaluating Complex Expressions on Datasets and Collections for more details.

If “datasets” is provided, the evaluation will only be performed on the provided datasets.

Parameters:

func (Callable) – The function to evaluate
datasets (str | list[str], optional) – The datasets to evaluate on. If not provided, will be evaluated on all datasets
format (str, default = "astropy") – The format of the data that is provided to your function. If “astropy”, will be a dictionary of astropy quantities. If “numpy”, will be a dictionary of numpy arrays.
vectorize (bool, default = False) – Whether to vectorize the computation. See StructureCollection.evaluate and/or Dataset.evaluate for more details.
insert (bool, default = True) – Whether or not to insert the results as columns in the datasets. If false, the results will be returned directly. If true, this method will return a new Simulation Collection.

sort_by(column, invert=False)

Re-order the individual datasets in the collection based on a column. See Dataset.sort_by for usage details.

Parameters:

column (str) – The column in the halo_properties or galaxy_properties dataset to order the collection by.
invert (bool, default = False) – If False (the default) ordering will be done from least to greatest. Otherwise greatest to least.

Returns:

result – A new SimulationCollection with the datasets ordered by the given column.

Return type:

SimulationCollection

with_units(convention)

Transform all datasets or collections to use the given unit convention. This method behaves exactly like opencosmo.Dataset.with_units().

Parameters:: convention (str) – The unit convention to use. One of “unitless”, “scalefree”, “comoving”, or “physical”.
Return type:: Self

class opencosmo.StructureCollection(source, header, datasets, links, hide_source=False, **kwargs)

A collection of datasets that contain both high-level properties and lower level information (such as particles) for structures in the simulation. Currently these structures include halos and galaxies.

Every structure collection has a halo_properties or galaxy_properties dataset that contains the high-level measured attributes of the structures. Certain operations (e.g. sort_by) operate on this dataset.

Parameters:

source (oc.Dataset)
header (oc.header.OpenCosmoHeader)
datasets (Mapping[str, oc.Dataset | StructureCollection])
links (dict[str, LinkedDatasetHandler])
hide_source (bool)

property header

property dtype

property cosmology: Cosmology: The cosmology of the structure collection

property properties: list[str]: The high-level properties that are available as part of the halo_properties or galaxy_properties dataset.

property redshift: float | tuple[float, float]

For snapshots, return the redshift or redshift range this dataset was drawn from.

Returns:: redshift
Return type:: float | tuple[float, float]

property simulation: HaccSimulationParameters

Get the parameters of the simulation this dataset is drawn from.

Returns:: parameters
Return type:: opencosmo.parameters.HaccSimulationParameters

keys()

Return the names of the datasets in this collection.

Return type:: list[str]

values()

Return the datasets in this collection.

Return type:: list[Dataset | StructureCollection]

items()

Return the names and datasets as key-value pairs.

Return type:: Generator[tuple[str, Dataset | StructureCollection], None, None]

property region

bound(region, select_by=None)

Restrict this collection to only contain structures in the specified region. Querying will be done based on the halo or galaxy centers, meaning some particles may fall outside the given region.

See Regions for details of how to construct regions.

Parameters:

region (opencosmo.spatial.Region)
select_by (str | None)

Returns:

dataset – The portion of the dataset inside the selected region

Return type:

opencosmo.Dataset

Raises:

ValueError – If the query region does not overlap with the region this dataset resides in
AttributeError – If the dataset does not contain a spatial index

evaluate(func, dataset=None, format='astropy', vectorize=False, insert=True, **evaluate_kwargs)

Iterate over the structures in this collection and apply func to each, collecting the results into a new column. These values will be computed immediately rather than lazily. If your new column can be created from a simple algebraic combination of existing columns, use with_new_columns.

You can substantially improve the performance of this method by specifying which data is actually needed to do the computation. This method will automatically select the requested data, avoiding reading unneeded data from disk.

The function passed to this method must take arguments that match the names of datasets that are stored in this collection. You can specify specific columns that are needed with keyword arguments to this function. For example:

import opencosmo as oc
import numpy as np
collection = oc.open("haloproperties.hdf5", "haloparticles.hdf5")

def offset(halo_properties, dm_particles):
    # Offset of the mean particle position from the halo center,
    # in units of the halo radius.
    dx = np.mean(dm_particles.data["x"]) - halo_properties["fof_halo_center_x"]
    dy = np.mean(dm_particles.data["y"]) - halo_properties["fof_halo_center_y"]
    dz = np.mean(dm_particles.data["z"]) - halo_properties["fof_halo_center_z"]
    dr = np.sqrt(dx**2 + dy**2 + dz**2)
    return dr / halo_properties["sod_halo_radius"]

# The new column takes the name of the function ("offset").
collection = collection.evaluate(
    offset,
    halo_properties=[
        "fof_halo_center_x",
        "fof_halo_center_y",
        "fof_halo_center_z",
        "sod_halo_radius",
    ],
    dm_particles=["x", "y", "z"]
)

The collection will now contain a column named “offset” with the results of the computation applied to each halo in the collection. Columns produced in this way will not respond to changes in unit convention.

It is not required to pass a list of column names for a given dataset. If a list is not provided, all columns will be passed to the computation function.

For more details and advanced usage, see Evaluating on Structure Collections.

Parameters:

func (Callable) – The function to evaluate on the rows in the dataset.
dataset (Optional[str], default = None) – The dataset inside this collection to evaluate the function on. If none, assumes the function requires data from multiple datasets.
vectorize (bool, default = False) – Whether to provide the values as full columns (True) or one row at a time (False) if evaluating on a single dataset. Has no effect if evaluating over structures, since structures require input from multiple datasets which will not in general be the same length.
insert (bool, default = True) – If true, the data will be inserted as a column in the specified dataset, or the main “properties” dataset if no dataset is specified. The new column will have the same name as the function. Otherwise the data will be returned directly.
format (str, default = "astropy") – Whether to provide data to your function as “astropy” quantities or “numpy” arrays/scalars.
**evaluate_kwargs (any) – Any additional arguments that are required for your function to run. These will be passed directly to the function as keyword arguments. If a kwarg is an array of values with the same length as the dataset, it will be treated as an additional column.

filter(*masks, on_galaxies=False)

Apply a filter to the halo or galaxy properties. Filters are constructed with opencosmo.col() and behave exactly as they would in opencosmo.Dataset.filter().

If the collection contains both halos and galaxies, the filter can be applied to the galaxy properties dataset by setting on_galaxies=True. However, this filters for halos that host at least one galaxy matching the filter. As a result, galaxies that do not match the filter will remain if another galaxy in their host halo does match.

See Querying In Collections for some examples.

Parameters:

*masks (Mask) – The filters to apply to the properties dataset, constructed with opencosmo.col().
on_galaxies (bool, optional) – If True, the filter is applied to the galaxy properties dataset.

Returns:

A new collection filtered by the given masks.

Return type:

StructureCollection

Raises:

ValueError – If on_galaxies is True but the collection does not contain a galaxy properties dataset.
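The on_galaxies behavior described above (halos are kept if any hosted galaxy matches, and every galaxy in a kept halo survives) can be sketched in plain Python. This is an illustration only, not the library's implementation: the filter_on_galaxies function and the column names halo_id and stellar_mass are hypothetical.

```python
halos = [{"halo_id": 1}, {"halo_id": 2}, {"halo_id": 3}]
galaxies = [
    {"halo_id": 1, "stellar_mass": 5e10},
    {"halo_id": 1, "stellar_mass": 1e9},
    {"halo_id": 2, "stellar_mass": 2e9},
    {"halo_id": 3, "stellar_mass": 8e10},
]


def filter_on_galaxies(halos, galaxies, predicate):
    """Hypothetical sketch: keep halos hosting at least one matching galaxy."""
    host_ids = {g["halo_id"] for g in galaxies if predicate(g)}
    kept_halos = [h for h in halos if h["halo_id"] in host_ids]
    # All galaxies of a kept halo survive, including non-matching ones.
    kept_galaxies = [g for g in galaxies if g["halo_id"] in host_ids]
    return kept_halos, kept_galaxies


kept_halos, kept_galaxies = filter_on_galaxies(
    halos, galaxies, lambda g: g["stellar_mass"] > 1e10
)
```

Here halo 2 is removed because none of its galaxies match, while the low-mass galaxy in halo 1 is retained because its neighbor in the same halo does match.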

select(columns, dataset)

Update a dataset in the collection to only include the columns specified.

Parameters:

columns (str | Iterable[str]) – The columns to select from the dataset.
dataset (str) – The dataset to select from.

Returns:

A new collection with only the selected columns for the specified dataset.

Return type:

StructureCollection

Raises:

ValueError – If the specified dataset is not found in the collection.

drop(columns, dataset=None)

Update the linked collection by dropping the specified columns in the given dataset. If no dataset is specified, the properties dataset is used. For example, if this collection contains galaxies, calling this function without a “dataset” argument will drop columns from the galaxy_properties dataset.

Parameters:

columns (str | Iterable[str]) – The columns to drop from the dataset.
dataset (str, optional) – The dataset to drop columns from. If None, the properties dataset is used.

Returns:

A new collection without the dropped columns in the specified dataset.

Return type:

StructureCollection

Raises:

ValueError – If the specified dataset is not found in the collection.

sort_by(column, invert=False)

Re-order the collection based on one of the structure collection’s properties. Each StructureCollection contains a halo_properties or galaxy_properties dataset that contains the high-level measured properties of the structures in this collection. This method always operates on that dataset.

Parameters:

column (str) – The column in the halo_properties or galaxy_properties dataset to order the collection by.
invert (bool, default = False) – If False (the default), ordering will be from least to greatest. Otherwise greatest to least.

Returns:

result – A new StructureCollection ordered by the given column.

Return type:

StructureCollection

with_units(convention)

Apply the given unit convention to the collection. See opencosmo.Dataset.with_units()

Parameters:: convention (str) – The unit convention to apply. One of “unitless”, “scalefree”, “comoving”, or “physical”.
Returns:: A new collection with the unit convention applied.
Return type:: StructureCollection

take(n, at='random')

Take some number of structures from the collection. See opencosmo.Dataset.take().

Parameters:

n (int) – The number of structures to take from the collection.
at (str, optional) – The method to use to take the structures. One of “start”, “end”, or “random”. Default is “random”.

Returns:

A new collection with the structures taken from the original.

Return type:

StructureCollection

take_range(start, end)

Parameters:

start (int)
end (int)

with_new_columns(dataset, **new_columns)

Add new column(s) to one of the datasets in this collection. This behaves exactly like oc.Dataset.with_new_columns(), except that you must specify which dataset the columns should be added to.

pe = oc.col("phi") * oc.col("mass")
collection = collection.with_new_columns("dm_particles", pe=pe)

Structure collections can hold other structure collections. For example, a collection of halos may hold a structure collection that contains the galaxies of those halos. To update datasets within these collections, use dot syntax to specify a path:

pe = oc.col("phi") * oc.col("mass")
collection = collection.with_new_columns("galaxies.star_particles", pe=pe)

You can also pass numpy arrays or astropy quantities:

import numpy as np
import astropy.units as u

random_value = np.random.randint(0, 90, size=len(collection))
random_quantity = random_value * u.deg

collection = collection.with_new_columns("halo_properties",
    random_quantity=random_quantity)

See Adding Custom Columns for more examples.

Parameters:

dataset (str) – The name of the dataset to add columns to
columns (**) – The new columns
new_columns (DerivedColumn)

Returns:

new_collection – This collection with the additional columns added

Return type:

opencosmo.StructureCollection

Raises:

ValueError – If the dataset is not found in this collection
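How a dotted path such as "galaxies.star_particles" can address a dataset inside a nested collection is sketched below. This is a plain-Python illustration and not the library's implementation: resolve_dataset is a hypothetical helper, nested dictionaries stand in for structure collections, and strings stand in for datasets.

```python
def resolve_dataset(collection, path):
    """Hypothetical helper: follow a dotted path through nested collections."""
    node = collection
    for name in path.split("."):
        if not isinstance(node, dict) or name not in node:
            raise ValueError(f"Dataset '{path}' not found in this collection")
        node = node[name]
    return node


# Nested dicts stand in for a halo collection holding a galaxy collection.
halos = {
    "halo_properties": "<halo_properties dataset>",
    "dm_particles": "<dm_particles dataset>",
    "galaxies": {
        "galaxy_properties": "<galaxy_properties dataset>",
        "star_particles": "<star_particles dataset>",
    },
}

top_level = resolve_dataset(halos, "dm_particles")
nested = resolve_dataset(halos, "galaxies.star_particles")

# A path that does not exist raises ValueError, as documented above.
try:
    resolve_dataset(halos, "galaxies.gas_particles")
    missing_raises = False
except ValueError:
    missing_raises = True
```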

with_index(index)

Parameters:: index (DataIndex)

objects(data_types=None, ignore_empty=True)

Iterate over the objects in this collection as pairs of (properties, datasets). For example, a halo collection could yield the halo properties and datasets for each of the associated particles.

If you don’t need all the datasets, you can specify a list of data types. For example:

for row, particles in collection.objects(
    data_types=["gas_particles", "star_particles"]
):
    ...  # do work

At each iteration, “row” will be a dictionary of halo properties with associated units, and “particles” will be a dictionary of datasets with the same keys as the data types.

Parameters:: data_types (Iterable[str] | None)
Return type:: Iterable[dict[str, Any]]
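The shape of what each iteration yields can be sketched with plain Python data. This is an illustration under stated assumptions, not the library's implementation: the standalone objects generator is hypothetical, the halo_id link column is invented for the example (opencosmo resolves links internally), and the ignore_empty option is left out because its exact behavior is not described here.

```python
def objects(properties, datasets, data_types=None):
    """Hypothetical sketch: yield (row, linked_datasets) for each structure."""
    names = list(datasets) if data_types is None else list(data_types)
    for row in properties:
        # Collect, per requested data type, the rows linked to this structure.
        linked = {
            name: [p for p in datasets[name] if p["halo_id"] == row["halo_id"]]
            for name in names
        }
        yield row, linked


halo_properties = [{"halo_id": 1, "mass": 1e13}, {"halo_id": 2, "mass": 5e12}]
particle_datasets = {
    "gas_particles": [{"halo_id": 1, "x": 0.1}, {"halo_id": 2, "x": 0.9}],
    "star_particles": [{"halo_id": 1, "x": 0.2}],
    "dm_particles": [{"halo_id": 2, "x": 0.8}],
}

counts = {}
for row, particles in objects(
    halo_properties,
    particle_datasets,
    data_types=["gas_particles", "star_particles"],
):
    counts[row["halo_id"]] = {name: len(rows) for name, rows in particles.items()}
```

Only the requested data types appear in the per-structure dictionary, so the dm_particles entries are never touched in this loop.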

with_datasets(datasets)

Create a new collection out of a subset of the datasets in this collection. It is also possible to do this when you iterate over the collection with StructureCollection.objects, however doing it up front may be more desirable if you don’t plan to use the dropped datasets at any point.

Parameters:: datasets (list[str])

halos(*args, **kwargs): Alias for “objects” in the case that this StructureCollection contains halos.

galaxies(*args, **kwargs): Alias for “objects” in the case that this StructureCollection contains galaxies

make_schema()

Return type:: StructCollectionSchema