Dataset

opencosmo.col(column_name)

Create a reference to a column with a given name. These references can be combined to produce new columns or express queries that operate on the values in a given dataset. For example:

import opencosmo as oc
ds = oc.open("haloproperties.hdf5")
query = oc.col("fof_halo_mass") > 1e14
px = oc.col("fof_halo_mass") * oc.col("fof_halo_com_vx")
ds = ds.with_new_columns(fof_halo_com_px = px).filter(query)

For more advanced usage, see Working with Columns.

Parameters:

column_name (str)

Return type:

Column

class opencosmo.Dataset(handler, header, state, tree=None)
Parameters:
  • handler (DatasetHandler)

  • header (OpenCosmoHeader)

  • state (DatasetState)

  • tree (Optional[Tree])

property header: OpenCosmoHeader

The header associated with this dataset.

OpenCosmo headers generally contain information about the original data this dataset was produced from, as well as any analysis that was done along the way.

Returns:

header

Return type:

opencosmo.header.OpenCosmoHeader

property columns: list[str]

The names of the columns in this dataset.

Returns:

columns

Return type:

list[str]

property cosmology: Cosmology

The cosmology of the simulation this dataset is drawn from as an astropy.cosmology.Cosmology object.

Returns:

cosmology

Return type:

astropy.cosmology.Cosmology

property dtype: str

The data type of this dataset.

Returns:

dtype

Return type:

str

property redshift: float | tuple[float, float]

The redshift slice or range this dataset was drawn from.

Returns:

redshift

Return type:

float | tuple[float, float]

property region: Region

The region this dataset is contained in. If no spatial queries have been performed, this will be the entire simulation box for snapshots or the full sky for lightcones.

Returns:

region

Return type:

opencosmo.spatial.Region

property simulation: HaccSimulationParameters

The parameters of the simulation this dataset is drawn from.

Returns:

parameters

Return type:

opencosmo.parameters.hacc.HaccSimulationParameters

property data: QTable | Quantity

Return the data in the dataset in astropy format. The value of this attribute is equivalent to the return value of Dataset.get_data("astropy"). However, data retrieved via this attribute is cached, so further accesses of Dataset.data should be nearly instantaneous.

There is one caveat. If you modify the table, those modifications will persist if you later request the data again with this attribute. Calls to Dataset.get_data will be unaffected, and datasets generated from this dataset will not contain the modifications. If you plan to modify the data in this table, you should use Dataset.with_new_columns instead.

Returns:

data – The data in the dataset.

Return type:

astropy.table.Table or astropy.table.Column

get_data(output='astropy', unpack=True, attach_index=False)

Get the data in this dataset as an astropy table/column or as numpy array(s). Note that a dataset does not load data from disk into memory until this function is called. As a result, you should not call this function until you have performed all the transformations you plan to apply to the data.

You can get the data in two formats, “astropy” (the default) and “numpy”. “astropy” format will return the data as an astropy table with associated units. “numpy” will return the data as a dictionary of numpy arrays. The numpy values will be in the associated unit convention, but no actual units will be attached.

If the dataset only contains a single column, it will be returned as an astropy quantity (if it has units) or numpy array.

This method does not cache data. Calling “get_data” always reads data from disk, even if you have already called “get_data” in the past. You can use Dataset.data to return data and keep it in memory.

Parameters:

output (str, default="astropy") – The format to output the data in

Returns:

data – The data in this dataset.

Return type:

Table | Quantity | dict[str, ndarray] | ndarray
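
The load-late pattern described above can be sketched as follows: narrow the dataset first, then read from disk once. The helper name load_numpy is hypothetical. It uses only select and get_data as documented on this page, and it assumes that a single-column dataset comes back as a bare array rather than a dictionary, as stated above.

```python
def load_numpy(ds, columns):
    """Select `columns` from a Dataset, then read them as numpy data.

    Hypothetical helper (not part of opencosmo): always returns a dict,
    wrapping the bare array that a single-column dataset is documented
    to return.
    """
    columns = [columns] if isinstance(columns, str) else list(columns)
    # Selecting before get_data means only these columns are requested.
    data = ds.select(columns).get_data("numpy")
    if isinstance(data, dict):
        return data
    return {columns[0]: data}
```

Because get_data does not cache, calling a helper like this twice reads the data twice. Use Dataset.data if you want the result kept in memory.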

bound(region, select_by=None)

Restrict the dataset to some subregion. The subregion will always be evaluated in the same units as the current dataset. For example, if the dataset is in the default “comoving” unit convention, positions are always in units of comoving Mpc. However Region objects themselves do not carry units. See Regions for details of how to construct regions.

Parameters:
  • region (opencosmo.spatial.Region) – The region to query.

  • select_by (str | None)

Returns:

dataset – The portion of the dataset inside the selected region

Return type:

opencosmo.Dataset

Raises:
  • ValueError – If the query region does not overlap with the region this dataset resides in

  • AttributeError – If the dataset does not contain a spatial index
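
As a minimal sketch of handling the non-overlap case, the hypothetical helper below wraps bound and returns None when the region misses the dataset. The region argument is assumed to be an opencosmo.spatial.Region already expressed in the dataset's current unit convention. Region construction is covered in Regions and is not shown here.

```python
def restrict_to_region(ds, region):
    """Return the part of `ds` inside `region`, or None if they do not overlap.

    Hypothetical helper around Dataset.bound. An AttributeError from a
    dataset without a spatial index is deliberately left to propagate.
    """
    try:
        return ds.bound(region)
    except ValueError:
        # Documented above: raised when the query region does not
        # overlap the region this dataset resides in.
        return None
```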

evaluate(func, vectorize=False, insert=True, format='astropy', **evaluate_kwargs)

Iterate over the rows in this dataset, apply func to each, and collect the result as new columns in the dataset.

This function is the equivalent of with_new_columns for cases where the new column is not a simple algebraic combination of existing columns. Unlike with_new_columns, this method will evaluate the results immediately and the resulting columns will not change under unit transformations. You may also choose to simply return the result instead of adding it as a column.

The function should take arguments with the same names as the columns in this dataset that are needed for the computation, and should return a dictionary of output values. The dataset will automatically select the needed columns to avoid unnecessarily reading data from disk. You may also include all columns in the dataset by providing a function with a single input argument with the same name as the data type of this dataset (see Dataset.dtype). In this case, the data will be provided as a dictionary of astropy quantity arrays or numpy arrays.

The new columns will have the same names as the keys of the output dictionary. See Evaluating on Datasets for more details.

If vectorize is set to True, the full columns will be passed to the function. Otherwise, rows will be passed to the function one at a time.

If the function returns None, this method will also return None as output. For example, the function could simply produce plots and save them to files.

Parameters:
  • func (Callable) – The function to evaluate on the rows in the dataset.

  • vectorize (bool, default = False) – Whether to provide the values as full columns (True) or one row at a time (False)

  • insert (bool, default = True) – If true, the data will be inserted as a column in this dataset. The new column will have the same name as the function. Otherwise the data will be returned directly.

  • format (str, default = "astropy") – Whether to provide data to your function as “astropy” quantities or “numpy” arrays/scalars.

  • **evaluate_kwargs (any) – Any additional arguments that are required for your function to run. These will be passed directly to the function as keyword arguments. If a kwarg is an array of values with the same length as the dataset, it will be treated as an additional column.

Returns:

result – The new dataset with the evaluated column(s) or the results as numpy arrays or astropy quantities

Return type:

Dataset | dict[str, np.ndarray | astropy.units.Quantity]
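
The calling convention above can be sketched with a per-row function whose argument name matches the column it needs. The names log_mass, add_log_mass and the output column log_fof_halo_mass are hypothetical. The column fof_halo_mass is borrowed from the example at the top of this page. All arguments to evaluate are passed explicitly so the sketch does not rely on default values.

```python
import math


def log_mass(fof_halo_mass):
    """Per-row function: the argument name selects the column to read."""
    # The keys of the returned dictionary become the new column names.
    return {"log_fof_halo_mass": math.log10(fof_halo_mass)}


def add_log_mass(ds):
    """Hypothetical wrapper: evaluate log_mass row by row and insert the result."""
    # format="numpy" hands the function plain scalars, so math.log10 applies.
    return ds.evaluate(log_mass, vectorize=False, insert=True, format="numpy")
```

With vectorize=True the same function would receive a full column, in which case an array-aware function such as numpy.log10 would be the natural replacement for math.log10.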

filter(*masks)

Filter the dataset based on some criteria. See Querying Based on Column Values for more information.

Parameters:

*masks (Mask) – The masks to apply to the dataset, constructed with opencosmo.col()

Returns:

dataset – The new dataset with the masks applied.

Return type:

Dataset

Raises:

ValueError – If a given mask refers to columns that are not in the dataset, or if the filter would return zero rows.
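
Because filter raises ValueError when a mask would leave zero rows, it can help to apply cuts one at a time and report which one failed. The sketch below assumes each value in cuts is a mask built with opencosmo.col(). The helper name apply_cuts and the labels are hypothetical.

```python
def apply_cuts(ds, cuts):
    """Apply named masks to `ds` one at a time.

    Hypothetical helper: `cuts` maps a label to a mask. If a cut fails
    (unknown column, or no rows left), the label is added to the error.
    """
    for label, mask in cuts.items():
        try:
            ds = ds.filter(mask)
        except ValueError as err:
            raise ValueError(f"cut {label!r} failed: {err}") from err
    return ds
```

A call might look like apply_cuts(ds, {"massive": oc.col("fof_halo_mass") > 1e14}), using the mask from the example at the top of this page.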

rows(output='astropy', attach_index=False)

Iterate over the rows in the dataset. Rows are returned as a dictionary. For performance, it is recommended to first select the columns you need to work with.

Parameters:

output (str, default = "astropy") – Whether to return values as “astropy” quantities or “numpy” scalars

Yields:

row (dict) – A dictionary of values for each row in the dataset with units.

Return type:

Generator[Mapping[str, float | Quantity | ndarray], None, None]
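
A short sketch of the recommended pattern, selecting first and then iterating: the hypothetical helper below computes a weighted mean one row at a time. It assumes both columns hold scalar values and requests “numpy” output so that no units are attached.

```python
def weighted_mean(ds, value_column, weight_column):
    """Weighted mean of one column, accumulated row by row.

    Hypothetical helper: only the two needed columns are selected before
    iterating, as recommended above.
    """
    narrowed = ds.select([value_column, weight_column])
    total = 0.0
    weight_sum = 0.0
    for row in narrowed.rows(output="numpy"):
        weight = float(row[weight_column])
        total += float(row[value_column]) * weight
        weight_sum += weight
    if weight_sum == 0.0:
        raise ValueError("weights sum to zero")
    return total / weight_sum
```

For large datasets a column-wise computation on the output of get_data will usually be faster. Row iteration is mainly useful when each row needs separate handling.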

select(columns)

Create a new dataset from a subset of columns in this dataset

Parameters:

columns (str or list[str]) – The column or columns to select.

Returns:

dataset – The new dataset with only the selected columns.

Return type:

Dataset

Raises:

ValueError – If any of the given columns are not in the dataset.
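
Since select raises ValueError for unknown names, one option is to check the request against Dataset.columns first. The helper below is a hypothetical sketch of that check. It returns the narrowed dataset together with the names that were not found.

```python
def select_available(ds, wanted):
    """Select the subset of `wanted` that exists in `ds`.

    Hypothetical helper: returns (dataset, missing_names) instead of
    letting select raise on an unknown column.
    """
    available = set(ds.columns)
    present = [name for name in wanted if name in available]
    missing = [name for name in wanted if name not in available]
    if not present:
        raise ValueError("none of the requested columns are in the dataset")
    return ds.select(present), missing
```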

drop(columns)

Create a new dataset without the provided columns.

Parameters:

columns (str or list[str]) – The columns to drop

Returns:

dataset – The new dataset without the dropped columns.

Return type:

Dataset

Raises:

ValueError – If any of the provided columns are not in the dataset.

sort_by(column, invert=False)

Sort this dataset by the values in a given column. By default sorting is in ascending order (least to greatest). Pass invert = True to sort in descending order (greatest to least).

This can be used to, for example, select the largest halos in a given dataset:

import opencosmo as oc
dataset = oc.open("haloproperties.hdf5")
# Descending order puts the most massive halos first
dataset = (
    dataset.sort_by("fof_halo_mass", invert=True)
    .take(100, at="start")
)

Parameters:
  • column (str) – The column to sort this dataset by.

  • invert (bool, default = False) – If False (the default), ordering will be from least to greatest. Otherwise greatest to least.

Returns:

result – A new Dataset ordered by the given column.

Return type:

Dataset

take(n, at='random')

Create a new dataset from some number of rows from this dataset.

Can take the first n rows, the last n rows, or n random rows depending on the value of ‘at’.

Parameters:
  • n (int) – The number of rows to take.

  • at (str) – Where to take the rows from. One of “start”, “end”, or “random”. The default is “random”.

Returns:

dataset – The new dataset with only the selected rows.

Return type:

Dataset

Raises:

ValueError – If n is negative or greater than the number of rows in the dataset, or if ‘at’ is invalid.
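
Combining take with sort_by gives both ends of a distribution. The sketch below is a hypothetical helper that returns the n rows with the smallest and the n rows with the largest values of a column, using only the “start” and “end” options documented above.

```python
def extremes(ds, column, n):
    """Return (smallest_n, largest_n) datasets for `column`.

    Hypothetical helper: sorts once in ascending order, then takes
    from each end of the sorted dataset.
    """
    ordered = ds.sort_by(column)  # ascending by default
    smallest = ordered.take(n, at="start")
    largest = ordered.take(n, at="end")
    return smallest, largest
```

As documented, take raises ValueError if n exceeds the number of rows, so callers may want to handle that case for small datasets.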

take_range(start, end)

Create a new dataset from a row range in this dataset.

Parameters:
  • start (int) – The first row to get.

  • end (int) – The last row to get.

Returns:

dataset – The new dataset with only the rows from start to end.

Return type:

Dataset

Raises:

ValueError – If start or end is negative or greater than the length of the dataset, or if end is less than start.

with_new_columns(**new_columns)

Create a new dataset with additional columns. These new columns can be derived from columns already in the dataset, a numpy array, or an astropy quantity array. When a column is derived from other columns, it will behave appropriately under unit transformations. Columns provided directly as astropy quantities will not change under unit transformations. See Adding Custom Columns for examples.

Parameters:
  • **new_columns (DerivedColumn | ndarray | Quantity) – The columns to add, passed as keyword arguments. Each keyword is used as the name of the new column.

Returns:

dataset – This dataset with the columns added

Return type:

opencosmo.Dataset

with_units(convention)

Create a new dataset from this one with a different unit convention.

Parameters:

convention (str) – The unit convention to use. One of “physical”, “comoving”, “scalefree”, or “unitless”.

Returns:

dataset – The new dataset with the requested unit convention.

Return type:

Dataset
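
To compare the same dataset under several conventions, one approach is to build one view per convention. The helper below is a hypothetical sketch. It checks each name against the four conventions listed above before calling with_units. No data is read at this point, since a dataset does not load from disk until get_data or data is used.

```python
CONVENTIONS = ("physical", "comoving", "scalefree", "unitless")


def in_conventions(ds, conventions=CONVENTIONS):
    """Return {convention: dataset} for each requested unit convention.

    Hypothetical helper around Dataset.with_units.
    """
    views = {}
    for name in conventions:
        if name not in CONVENTIONS:
            raise ValueError(f"unknown unit convention: {name!r}")
        views[name] = ds.with_units(name)
    return views
```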

collect()

Given a dataset that was originally opened with opencosmo.open, return a dataset that is in-memory as though it was read with opencosmo.read.

This is useful if you have a very large dataset on disk, and you want to filter it down and then close the file.

For example:

import opencosmo as oc
with oc.open("path/to/file.hdf5") as file:
    ds = file.filter(oc.col("sod_halo_mass") > 0)
    ds = ds.select(["sod_halo_mass", "sod_halo_radius"])
    ds = ds.collect()

The selected data will now be in memory, and the file will be closed.

If working in an MPI context, all ranks will receive the same data.

Return type:

Dataset
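
The filter, select, collect sequence can be wrapped in a small function so the reduction always happens before the data is pulled into memory. The helper name collect_subset is hypothetical. The masks are assumed to be built with opencosmo.col().

```python
def collect_subset(ds, columns, *masks):
    """Reduce `ds` with optional masks and a column selection, then collect.

    Hypothetical helper: all narrowing is done before collect(), so only
    the reduced data ends up in memory.
    """
    if masks:
        ds = ds.filter(*masks)
    return ds.select(columns).collect()
```

Inside a with oc.open(...) block, the returned dataset remains usable after the block exits, since collect is documented to return an in-memory dataset.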