Main Transformations API ========================= :code:`opencosmo` provides a simple but powerful API for transforming and querying datasets and collections. Both the main :py:class:`opencosmo.Dataset` type and the various collection types will have these transformations available., although the details of how they behave will differ slightly. Individual collection types may also have additional convinience methods based on their purpose, see :doc:`collections` for more info. The main transformations are: - :code:`with_units`: Change the unit convention of the dataset or collection. - :code:`filter`: Filter a dataset based on the value of one more more columns. - :code:`select`: Select a subset of columns from a dataset. - :code:`take`: Select a subset of rows from a dataset. - :code:`sort_by`: Sort a dataset by one of its columns - :code:`bound`: Limit a dataset or collection to a given spatial region. - :code:`with_new_columns`: Combine columns in a dataset into a new column with automatic unit handling. - :code:`evaluate`: Evaluate a computation over all the rows in a dataset or collection. Each of these transformations is returns a new dataset or collection with the transformations applied. Because transformations are applied lazily, chaining them together is efficient: .. code-block:: python import opencosmo as oc # Load a dataset ds = oc.open() # Apply a series of transformations ds = ds.with_units("scalefree") ds = ds.filter(oc.col("fof_halo_mass") > 1e13) ds = ds.select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) ds = ds.take(100, at="random") ds = ds.with_units("physical") data = ds.get_data() In this example, we are we are applying a cut in halo mass using scalefree coordinates, meaning this filter will include all galaxies over 1e13 Msun/h. We then select a subset of the columns and transform them into physical units, removing the factors of h in the final values. See below for more information about unit conventions. When writing queries like this, it can feel a bit redundant to write :code:`ds = ds.transform(...)` over and over. In practice it is often more readable to simply apply transformations on top of each other: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .with_units("scalefree") .filter(oc.col("fof_halo_mass") > 1e13) .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) .take(100, at="random") .with_units("physical") data = ds.get_data() Note that if you're working in a Jupyter notebook, you'll need to use the line continuation character to split the query across multiple lines: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds \ .with_units("scalefree") \ .filter(oc.col("fof_halo_mass") > 1e13) \ .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) \ .take(100, at="random") \ .with_units("physical") data = ds.get_data() You are also free to create multiple derivative datasets from the same original dataset: .. code-block:: python ds = oc.open("haloproperties.hdf5") low_mass_ds = ds .filter(oc.col("fof_halo_mass") > 1e13, os.col("fof_halo_mass") < 1e14) .with_units("phsyical") .select(["fof_halo_mass", "fof_halo_cdelta"]) high_mass_ds = ds .filter(oc.col("fof_halo_mass") > 1e14) .with_units("physical") .select(["fof_halo_mass", "fof_halo_cdelta"]) data1 = ds1.data data2 = ds2.data However you may also be interested in including all data that passes *either* filter in a single dataset. You can combine filters with boolean logic using the & and | operators: .. code-block:: python ds = oc.open("haloproperties.hdf5") high_mass_cut = oc.col("fof_halo_mass") > 1e14 low_mass_cut = oc.col("fof_halo_mass") < 1e12 low_concentration_cut = oc.col("sod_halo_cdelta") < 5 my_filter = (high_mass_cut | low_mass_cut) & low_concentration_cut filtered_ds = ds.filter(my_filter) Because transformations are evaluated lazily, you can have many derivative datasets without incurring a large memory overhead. Unit Conventions ---------------- The :code:`with_units` transformation is used to change the unit convention of the dataset. :code:`opencosmo` supports the following unit conventions: - :code:`unitless`: The dataset is read without applying any units - :code:`scalefree`: The dataset is in "scale-free" units, meaning all lengths are in comoving Mpc/h and all masses are in Msun/h. This is the unit convention that the raw values are stored in. - :code:`comoving`: Factors of `h` are absorbed into the values, but positions and velocities still use comoving coordinates. - :code:`physical`: Factors of `h` are absorbed into the values, and positions and velocities are converted to physical coordinates. When you initially load a dataset, it always uses the "comoving" unit convention. You can change this at any time on any dataset or collection by simply calling :code:`with_units` with the desired unit convention. For more information, see :ref:`Working with Units` Adding Columns -------------- You can add new columns to a given that are derived from pre-existing columns using the :meth:`oc.col` to construct new columns and passing them to :code:`with_new_columns`. The new columns will inherit the cosmological dependence of the columns they are created from, and can be used throughout the transformations API as usual. .. code-block:: python ds = oc.open("haloproperties.hdf5") fof_halo_vtotal = (oc.col("fof_halo_com_vx")**2 + oc.col("fof_halo_com_vy")**2 + ("fof_halo_com_vz")**2)**(0.5) fof_halo_com_p = oc.col("fof_halo_mass") * fof_halo_vtotal ds = ds.with_new_columns(fof_halo_com_p = fof_halo_com_p) The dataset will now contain a "fof_halo_com_p" column that can be used for filtering and selections as usual. Because the column definition was created outside the dataset itself, it can be used across multiple datasets as needed. You can also simply pass values as a numpy array or astropy quantity: .. code-block:: python import astropy.units as u import numpy as np random_angle = np.random.uniform(10, 50, len(ds))*u.arcmin ds = ds.with_new_columns(angle = random_angle) Columns can be added to collections as well, but there are some subtelties. See :doc:`collections` for more information. Filtering --------- Filters operate on columns of a given dataset and return a new dataset that only contains the rows that satisfy the filter. Filters are constructed using the :meth:`opencosmo.col` function, so they can be constructed independently of any single dataset. Available filters include: - Equality: :code:`col("column_name") == value` - Inequality: :code:`col("column_name") != value` - Greater than: :code:`col("column_name") > value` - Greater than or equal to: :code:`col("column_name") >= value` - Less than: :code:`col("column_name") < value` - Less than or equal to: :code:`col("column_name") <= value` - Membership: :code:`col("column_name").isin([value1, value2, ...])` Filters do not need to include units, however a filter with *incorrect* units will raise an error: .. code-block:: python import astropy.units as u from astropy.cosmology import units as u import opencosmo as oc ds = oc.open("haloproperties.hdf5") # This will work fine min_mass = oc.col("fof_halo_mass") > 1e13 ds = ds.filter(min_mass) # This will work fine min_mass_unitful = oc.col("fof_halo_mass") > 1e13 * u.Msun ds = ds.filter(min_mass_unitful) # This will fail, because the masses are not in Msun / h min_mass_unitful = oc.col("fof_halo_mass") > 1e13 * u.Msun / cu.littleh ds = ds.filter(min_mass_unitful) The behavior of filters on collections depends on the collection type. See the :doc:`collections` page for more information. Selecting Columns ----------------- For small datasets, it is usually not an issue to request all the columns in a given dataset. However for large datasets, loading everything into memory is slow and consumes singificant quantities of memory. We can use the :meth:`opencosmo.Dataset.select` transformation to select only the subset of columns that are useful for our analysis. Select transformations can be applied sequentially, in which case the second select will only work if it contains columns that were selected in the first select. For example: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) .select(["fof_halo_mass", "fof_halo_center_x"]) # This is fine .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) .select(["fof_halo_mass", "sod_halo_cdelta"]) # This will raise an error, because sod_halo_cdelta was not in the first select :code:`select` also accepts wildcards, allowing you to select large subsets of related columns without typing them all out explicitly: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds.select(["fof*", "*com*"]) There also is an equivalent :code:`drop` function, which drops columns instead of selecting them: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds.drop(["fof*", "block"]) Filters and selects generally behave as you might expect. If you select *after* filtering, the resulting dataset will only have the columns that were selected for the rows that passed the filter. If you select *before* filtering, the filter can only use columns that were included in the select. For example, this works: .. code-block:: python import opencosmo as oc ds = oc.open("haloproperties.hdf5") ds = ds .select(["fof_halo_mass", "fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) .filter(oc.col("fof_halo_mass") > 1e13) as does this: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .filter(oc.col("fof_halo_mass") > 1e13) .select(["fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) # This is also fine but this will raise an error: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .select(["fof_halo_center_x", "fof_halo_center_y", "fof_halo_center_z"]) .filter(oc.col("fof_halo_mass") > 1e13) # fof_halo_mass is not in the dataset when "filter" is called. Taking Rows ----------- The :meth:`opencosmo.Dataset.take` transformation is used to select a subset of rows from a dataset. The :code:`at` argument can be used to specify how the rows are selected. The available options are: - :code:`at="random"`: Select a random subset of n rows from the dataset (default). - :code:`at="start"`: Select the first n rows from the dataset. - :code:`at="end"`: Select the last n rows from the dataset. As with the `select` transformations, `take` transformations can be chained together. However you cannot take more rows than are present in the dataset: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .take(100, at="random") .take(500, at="random") # This will raise an error You can also take a range of rows with :meth:`opencosmo.Dataset.take_range`. As with all other transformations, this creates a new dataset so the following is valid: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds .take_range(500, 1000) .take(100, at="start") This will take the rows 500-1000 from the original dataset, and then take the first 100 rows from that new dataset. The original dataset is unchanged. Sorting ------- You can re-order a dataset based on the value of some column with :meth:`opencosmo.Dataset.sort_by`. By default, this sorts in ascending order (from lowest to highest), however you can sort in descending order by passing :code:`invert = True`. For example, to get the 100 most massive halos in a given simulation, ordered from most to least massive: .. code-block:: python ds = oc.open("haloproperties.hdf5") ds = ds.sort_by("fof_halo_mass", invert=True).take(100, at="start") Or, to get the 100 *least* massive halos, ordered from least to most massive: .. code-block:: python ds = ds.sort_by("fof_halo_mass").take(100, at="start") You can also use :py:meth:`take ` in clever ways to get other results. For example, to get the 100 *most* massive halos but ordered from *least to most massive*: .. code-block:: python ds = ds.sort_by("fof_halo_mass").take(100, at="end") Spatial Querying ----------------- OpenCosmo data contains a spatial index which makes it efficient to perform spatial queries on the data. These queries can be performed by defining a region, and then passing it into :meth:`opencosmo.Dataset.bound`: .. code-block:: python ds = oc.open("haloproperties.hdf5") region = oc.make_box((20,20,20), (40,40,40)) bound_ds = ds.bound(region) For lightcone data, spatial queries are performed using two dimensional regions on the sky. For example: .. code-block:: python import astropy.units as u from astropy.coordinates import SkyCoord ds = oc.open("lc_haloproperties.hdf5") center = SkyCoord(45*u.deg, -30*u.deg) radius = 30*u.arcmin region = opencosmo.make_cone(center, radius) bound_ds = ds.bound(region) See :doc:`spatial_ref` for more information about constructing regions. As with other transformations, spatial queries can be chained together to build complex query pipelines. If a given region contains no data, the spatial query will return a dataset with length zero. There are some complications that arise when working with spatial queries in an MPI context. See :doc:`mpi` for more details. Iterating Over Rows -------------------- If you want to work row-by-row, you can always iterate over the dataset with :meth:`opencosmo.Dataset.rows` .. code-block:: python ds = oc.open("haloproperties.hdf5") for row in ds.rows(): # Do something with the row print(row["fof_halo_mass"], row["fof_halo_center_x"]) At each iteration, the row will be a dictionary of values for the specified rows with units applied. If you only need a subset of the columns, consider using :meth:`opencosmo.Dataset.select` to select only those columns before iteration. Evaluating Complex Expressions ------------------------------ Generally, basic data manipulation is not sufficient for science. We need to fit models and perform complex operations. The :py:meth:`evaluate ` method can handle the low-level data management, leaving you to focus on building your model. See :doc:`evaluating` for more information.