The Visit Sequence Archive#
Introduction#
An assortment of tools used in Rubin Observatory depends on access to tables describing sequences of visits, including both opsim output and the results of queries to the consdb. For example:
The generation of simulations for both the pre-night briefing and other predictions of future scheduler behavior depends on pre-loading the scheduler with visits already completed, for example as queried from the consdb.
The pre-night briefing report includes figures generated from opsim simulations of the night on which it is reporting.
obsloctap provides predictions of visits to be scheduled for observing.
Progress reports include maf metrics computed from this visit data.
The visit sequence archive is a service for storing and retrieving sequences of visits and ancillary files associated with them (e.g. tables of rewards), tracking metadata describing the sequences of visits, and searching for available sequences of visits based on this metadata.
Prototype#
An initial prototype (described in The Prototype OpSim Archive for schedview Dashboards), used for pre-night simulations and reports, saved both data and metadata in an S3 bucket.
This prototype saved the visits themselves as sqlite3 database files as produced by rubin_scheduler, and metadata in yaml files for each sequence.
A python function provided by the prototype configured simulations, executed them (by calling rubin_scheduler.scheduler.sim_runner), wrote metadata to a yaml file, and added the results (including the opsim database output, metadata file, and ancillary files) to the S3 bucket. The prototype also wrapped this function in a command line call, so that the simulation could be run from within a shell script submitted as a batch job.
Other functions supported searching metadata in the yaml files for pre-night simulations for a given night and retrieving the corresponding visit tables and other files.
The prototype had two significant problems:
Saving metadata for each simulation in its own yaml file meant that searching for simulations according to metadata required retrieving the metadata yaml files for all simulations in the archive from the S3 bucket, which is not scalable. This was partially addressed through creation of an index file in the same S3 bucket which combined metadata from all simulations up to some date, but this was just a stop-gap measure.
Bundling the insertion of data into the archive with driving the execution of the simulation made the archive itself inflexible: it was awkward to modify how the simulations were driven separately from how the data were archived.
The visit sequence archive addresses these concerns by:
Keeping the metadata in a separate postgresql database.
Separating the archiving API from the simulation driver: commands that add data to the archive are independent of how the visit data are generated.
Top-level components#
The visit sequence archive has three major components:
A data store that contains the table of visits (as an hdf5 file) and ancillary data files. This is implemented using lsst.resources, and the data store used is set by a base URI relative to which files are stored. In production, an S3 bucket is used, while a directory in the local file system is used for testing and demonstrations.
A postgresql database with tables of sequences of visits with metadata on provenance, comments, and statistics (but not the table of visits themselves).
A python API and set of shell commands for adding, updating, and querying the archive.
The python API and shell commands#
The sim_archive submodule of rubin_sim holds the python API and shell commands that constitute the higher-level interface to the visit sequence archive.
It consists of five sub-sub-modules:
rubin_sim.sim_archive.vseqarchive contains a collection of functions that support interaction with the archive as a whole, combining interaction with the metadata database and data store where it makes sense to do so. For example, rubin_sim.sim_archive.vseqarchive.add_file combines the addition of a file to the data store with making a record of that file in the metadata database. This submodule also defines the vsarchive click.group, and most functions within it are decorated with click decorators that place them in this group. As a result, these python functions each have corresponding shell commands that take the same arguments. (See the click documentation for more details.)
rubin_sim.sim_archive.vseqmetadata defines the VisitSequenceArchiveMetadata class, an API that manages queries to the database and provides methods that wrap queries for standard operations. In typical use, python code will create an instance of VisitSequenceArchiveMetadata with database connection parameters. Then, it can query or modify the database either by calling methods of this class directly, or by passing the instance as an argument to functions provided by rubin_sim.sim_archive.vseqarchive. In the corresponding shell commands created using click, the click.group definition automatically creates an instance using command line arguments and passes it to the subcommand.
rubin_sim.sim_archive.sim_archive provides a handful of functions that replicate functions provided by the prototype implementation. For example, the obsloctap service uses the prototype fetch_obsloctap_visits function to retrieve visits from the best pre-night simulation for a night. This submodule therefore implements fetch_obsloctap_visits here for backwards compatibility.
rubin_sim.sim_archive.prenightindex provides tools that return inventories of pre-night simulations in the archive. Few users have credentials that allow access to the metadata database, and connections are only possible on a very limited subnet. A handful of use cases require broader access, particularly by services running at the observatory. These use cases require access to the metadata for only a very limited number of predictable queries, in particular getting inventories of pre-night simulations run for specific nights, and statistics on these simulations. The services that need this already need and have access to the data store. So, to provide access to the required inventories, the prenightindex submodule provides tools for querying the metadata database and placing the results in a predictable key in the data store, and functions that retrieve the needed data by first attempting to query the metadata database, but fall back on reading the pre-generated results from the data store if necessary.
rubin_sim.sim_archive.prototype contains the functions that implemented the prototype data archive. These are retained (for now) to provide access to data recorded by the prototype.
The data store#
The visit sequence archive uses the lsst.resources package to save and retrieve data.
Each visit sequence is identified by a UUID, and the archive stores data at a URI built from a base URI for the data store, the telescope, the date of creation, the visit sequence UUID, and a file name:
${ARCHIVE_URI}/${TELESCOPE}/${CREATION_DATE}/${VISITSEQ_UUID}/${FILENAME}
Where the elements are:
- ARCHIVE_URI
is the base of the archive. The default is set to s3://rubin:rubin-scheduler-prenight/opsim/vseq/ by the rubin_sim.sim_archive.vseqarchive.ARCHIVE_URL module-level variable. For testing, it is typically set to a temporary local directory (file:///some/tmp/dir) generated by python’s tempfile standard library module.
- TELESCOPE
designates the relevant telescope, either simonyi or auxtel.
- CREATION_DATE
is the creation date (in the UTC-12 time zone used by SITCOMTN-032 for dayobs) of the visit sequence in ISO-8601 (YYYY-MM-DD) format. In the case of completed visits, this is the date on which the query was made. For simulations, it is the date on which the simulation was run. When this date is not available, the sim_archive tools default to the date on which the visit sequence was added to the archive.
- FILENAME
is the name of the file in which the data is stored on local disk.
So, a typical URI will look like this:
s3://rubin:rubin-scheduler-prenight/opsim/vseq/simonyi/2025-10-16/47ed5c53-ec5a-45a3-bdfe-6b93a3f67bf9/visits.h5
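As an illustration of this layout, the sketch below assembles such a URI from its elements using only the python standard library. The helper names `day_obs_of` and `build_vseq_uri` are hypothetical and are not part of rubin_sim; the archive itself uses lsst.resources to handle URIs.

```python
from datetime import datetime, timedelta, timezone
from uuid import UUID


def day_obs_of(time: datetime) -> str:
    """Return the ISO-8601 date of a timezone-aware time, in UTC-12."""
    shifted = time.astimezone(timezone(timedelta(hours=-12)))
    return shifted.date().isoformat()


def build_vseq_uri(
    archive_uri: str,
    telescope: str,
    creation_time: datetime,
    visitseq_uuid: UUID,
    filename: str,
) -> str:
    """Join the elements in the order used by the archive layout."""
    base = archive_uri.rstrip("/")
    creation_date = day_obs_of(creation_time)
    return f"{base}/{telescope}/{creation_date}/{visitseq_uuid}/{filename}"


# 03:00 UTC on 2025-10-17 is still 2025-10-16 in the UTC-12 time zone.
uri = build_vseq_uri(
    "s3://rubin:rubin-scheduler-prenight/opsim/vseq/",
    "simonyi",
    datetime(2025, 10, 17, 3, 0, tzinfo=timezone.utc),
    UUID("47ed5c53-ec5a-45a3-bdfe-6b93a3f67bf9"),
    "visits.h5",
)
print(uri)
```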
A URL with a FILENAME of visits.h5, if present, holds the data for the visits themselves in HDF5 format, in the observations key, corresponding to the observations table in the sqlite3 database produced by rubin_scheduler simulations.
If the visits originated with the database produced by a rubin_scheduler simulation, other tables in this database will be saved as tables in corresponding keys in visits.h5.
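To see which keys one might expect in visits.h5 for a given simulation, it can help to list the tables in the sqlite3 database the simulation produced. The sketch below does this with the standard library against a small in-memory stand-in. The `info` table and the column names are made up for illustration, and the one-to-one mapping from table name to HDF5 key is an assumption based on the description above.

```python
import sqlite3
from typing import List


def list_tables(connection: sqlite3.Connection) -> List[str]:
    """Return the names of all tables in an sqlite3 database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in for the database written by a simulation (contents are illustrative).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE observations (observationId INTEGER, fieldRA REAL)")
connection.execute("CREATE TABLE info (parameter TEXT, value TEXT)")

# Under the assumption above, each table name is a candidate key in visits.h5.
expected_keys = list_tables(connection)
connection.close()
print(expected_keys)
```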
The archive infrastructure does not limit the keys and file names of other data to be added, but other keys and filenames used can include:
- rewards.h5
An HDF5 file containing reward data recorded by rubin_scheduler simulations when called with record_rewards=True.
- opsim.db
The sqlite3 file generated by rubin_scheduler simulations, as written by rubin_scheduler. In general, this should be redundant with the visits.h5 file.
The postgresql metadata database#
Tables of sequences of visits#
The central tables in the metadata database are those that save metadata on the visit sequences themselves, with one row per visit sequence. There are three such tables:
- simulations
The simulations table stores metadata on sequences of simulated visits, for example as simulated by rubin_scheduler. Visit sequences in this table should include only simulated visits. Sequences that are created using a combination of completed and simulated visits, for example a sequence that includes completed visits pre-loaded into the scheduler and then simulated thereafter, should be saved in the mixed table instead.
- completed
The completed table stores metadata on sequences of actually completed visits, for example results of queries to consdb.
- mixed
The mixed table stores metadata on sequences that combine visits from other sequences of visits. For example, metadata on a set of visits that includes completed visits up to some date and simulated visits thereafter would be recorded in the mixed table.
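The rule for choosing among these three tables can be summarized in a few lines. The function below is a hypothetical helper written for this illustration, not part of the archive API; its two flags describe which kinds of visits the sequence draws on, not how many visits it contains.

```python
def table_for_sequence(includes_completed: bool, includes_simulated: bool) -> str:
    """Return the name of the metadata table suited to a visit sequence."""
    if includes_completed and includes_simulated:
        # Completed visits followed by simulated ones, for example.
        return "mixed"
    if includes_simulated:
        return "simulations"
    if includes_completed:
        return "completed"
    raise ValueError("a visit sequence must draw on completed or simulated visits")


print(table_for_sequence(includes_completed=True, includes_simulated=True))
```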
These tables have the following columns in common:
| Column | Type | Default | Description |
|---|---|---|---|
| visitseq_uuid | UUID | | Primary key – RFC 9562 Universally Unique Identifier. |
| visitseq_sha256 | BYTEA | None | SHA‑256 hash of bytes of the |
| visitseq_label | TEXT | None | Human‑readable label for plots and tables |
| visitseq_url | TEXT | None | URL to the full visit table (NULL if not available) |
| telescope | TEXT | None | Telescope used (e.g. “simonyi”, “auxtel”) |
| first_day_obs | DATE | None | Date (in the UTC-12 hour timezone) of the first night included in the sequence. |
| last_day_obs | DATE | None | Date (in the UTC-12 hour timezone) of the last night included in the sequence. |
| creation_time | TIMESTAMP WITH TIME ZONE | | When the simulation was run or (if not set) when the sequence was added to the archive. |
The values in first_day_obs and last_day_obs might not correspond to the dates of the first and last visits in the sequence, if the sequence covers dates on which there were no visits.
For example, if an entry in the completed table were created by querying consdb for visits between 2025-10-01 and 2025-10-31, but there were no visits in consdb on 2025-10-01, the value of first_day_obs would still be 2025-10-01.
In such a case, a user can interpret the record as a positive assertion that there were no visits on 2025-10-01 fitting the query criteria.
The visit tables for each type include extra columns.
simulations has the following additional columns:
| Column | Type | Description |
|---|---|---|
| scheduler_version | TEXT | Version of |
| config_url | TEXT | URL of the configuration script, typically a URL for a specific commit of a specific file in github. |
| conda_env_sha256 | BYTEA | SHA‑256 hash of the output of |
| parent_visitseq_uuid | UUID | UUID of the visitseq loaded into the scheduler before running |
| sim_runner_kwargs | JSONB | Arguments passed to the simulation runner as a JSON dictionary |
| parent_last_day_obs | DATE | Date (in the UTC-12hrs time zone) of the last visit loaded into the scheduler before running |
The completed table has just one column (in addition to those all visit sequence tables have in common):
| Column | Type | Description |
|---|---|---|
| query | TEXT | Query used to select visits from |
The mixed table has additional columns describing how the parent visit sequences were combined:
| Column | Type | Description |
|---|---|---|
| last_early_day_obs | DATE | The last day_obs drawn from the early parent visit sequence |
| first_late_day_obs | DATE | The first day_obs drawn from the late parent visit sequence |
| early_parent_uuid | UUID | UUID of the early parent visit sequence |
| late_parent_uuid | UUID | UUID of the late parent visit sequence |
These three tables are implemented in postgresql as children of a single parent table, visitseq.
Therefore, queries of the visitseq table will include rows from all three of these tables, but only columns they all have in common.
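The relationship between the parent and child tables resembles class inheritance, which the following python analogy tries to capture. It is only an analogy, not how the archive is implemented: the classes are hypothetical, they carry only a subset of the columns listed above, and `common_columns` imitates what a query of the parent table returns.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional
from uuid import UUID, uuid4


@dataclass
class VisitSeq:
    """Columns shared by all visit sequence tables (subset)."""

    visitseq_uuid: UUID
    visitseq_label: Optional[str] = None
    telescope: Optional[str] = None


@dataclass
class Simulation(VisitSeq):
    """Adds columns specific to the simulations table (subset)."""

    scheduler_version: Optional[str] = None
    config_url: Optional[str] = None


@dataclass
class Completed(VisitSeq):
    """Adds the column specific to the completed table."""

    query: Optional[str] = None


def common_columns(row: VisitSeq) -> Dict[str, Any]:
    """Keep only the columns defined on the parent, as a parent query would."""
    return {f.name: getattr(row, f.name) for f in fields(VisitSeq)}


rows = [
    Simulation(uuid4(), "example simulation", "simonyi", scheduler_version="x.y.z"),
    Completed(uuid4(), "example completed visits", "simonyi", query="example query"),
]
parent_view = [common_columns(row) for row in rows]
print([sorted(row) for row in parent_view])
```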
Files#
The files table associates URIs of files with file types and visit sequences.
| Column | Type | Description |
|---|---|---|
| visitseq_uuid | UUID | Identifier of the visit sequence that the file belongs to |
| file_type | TEXT | The type of file (e.g., |
| file_sha256 | BYTEA | SHA‑256 hash of the file contents |
| file_url | TEXT | URL where the file can be retrieved; may be |
Note that the visits file_type is special: it is stored in the corresponding visit sequence table itself (in visitseq_url) rather than in this files table.
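A value suitable for a column like file_sha256 can be computed with python's hashlib. The sketch below is a generic way to hash file contents, not necessarily the code used by the archive; the function name `file_sha256` and the chunk size are choices made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> bytes:
    """Return the SHA-256 digest of a file as raw bytes (32 bytes)."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


# Demonstrate on a small temporary file with placeholder contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "rewards.h5"
    example_path.write_bytes(b"example content")
    example_hash = file_sha256(example_path)

print(len(example_hash), example_hash.hex())
```

Returning raw bytes rather than a hexadecimal string matches the BYTEA column type.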
conda environments#
The simulations table records the hash of the specifications for the conda environment (as reported by conda list --json) in which the simulation was run.
By itself, this record allows a user to identify which simulations were made with the same environment, but not what that environment was.
The conda_env table records the actual content of the conda list --json output, in a format that can be used with postgresql’s json tools.
| Column | Type | Description |
|---|---|---|
| conda_env_hash | BYTEA | Primary key – SHA‑256 hash of the output of |
| conda_env | JSONB | Full JSON representation of the conda environment ( |
The conda_packages view supports querying this table as if each package were stored in its own row of a table.
For example, to get the astropy versions for all simulations for which the conda environment is recorded:
SET SEARCH_PATH TO vsmd;
SELECT creation_time, visitseq_uuid, package_version AS astropy_version FROM simulations NATURAL JOIN conda_packages WHERE package_name='astropy';
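The same idea can be sketched in python: hash the text reported by conda list --json, then expand the JSON into one row per package, as the conda_packages view does. This is a sketch under assumptions: the function name is hypothetical, hashing the UTF-8 bytes of the output is a guess at how the hash is computed, and the example packages and versions are placeholders rather than a real environment.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def conda_env_record(conda_list_json: str) -> Tuple[bytes, List[Dict[str, Any]]]:
    """Return the environment hash and one row per package."""
    # Assumption: the hash is taken over the UTF-8 bytes of the output.
    env_hash = hashlib.sha256(conda_list_json.encode("utf-8")).digest()
    packages = json.loads(conda_list_json)
    rows = [
        {
            "conda_env_hash": env_hash,
            "package_name": package["name"],
            "package_version": package["version"],
        }
        for package in packages
    ]
    return env_hash, rows


# Placeholder for the output of `conda list --json`.
example_output = json.dumps(
    [
        {"name": "astropy", "version": "6.1.0", "channel": "conda-forge"},
        {"name": "numpy", "version": "1.26.4", "channel": "conda-forge"},
    ]
)
env_hash, package_rows = conda_env_record(example_output)
astropy_versions = [
    row["package_version"] for row in package_rows if row["package_name"] == "astropy"
]
print(astropy_versions)
```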
Nightly statistics#
The nightly_stats table records basic statistics by night for any quantity that has a value for each visit.
Examples can be columns in the visits table referenced by visitseq_url, but may also be derived quantities such as those produced by maf stackers.
| Column | Type | Description |
|---|---|---|
| visitseq_uuid | UUID | Identifier of the visit sequence |
| day_obs | DATE | The date (in the UTC-12hrs timezone, following SITCOMTN-032) of the night |
| value_name | TEXT | Name of the metric or column being summarized |
| accumulated | BOOLEAN | |
| count | INTEGER | Number of values in the distribution |
| mean | DOUBLE PRECISION | Arithmetic mean of the values |
| std | DOUBLE PRECISION | Standard deviation of the values |
| min | DOUBLE PRECISION | Minimum value |
| p05 | DOUBLE PRECISION | 5% quantile |
| q1 | DOUBLE PRECISION | First quartile (25% quantile) |
| median | DOUBLE PRECISION | Median (50% quantile) |
| q3 | DOUBLE PRECISION | Third quartile (75% quantile) |
| p95 | DOUBLE PRECISION | 95% quantile |
| max | DOUBLE PRECISION | Maximum value |
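The statistical columns of one such row could be computed along the following lines, here with python's statistics module. The function name `summarize` is hypothetical. The sample standard deviation and the "inclusive" quantile method are assumptions made for this sketch; the archive tools may use different conventions.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Compute the statistical columns for one set of per-visit values.

    Requires at least two values, for the standard deviation and quantiles.
    """
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "p05": cuts[0],
        "q1": cuts[4],
        "median": cuts[9],
        "q3": cuts[14],
        "p95": cuts[18],
        "max": max(values),
    }


# Placeholder values standing in for one quantity over one night of visits.
example_values = list(range(101))
stats = summarize(example_values)
print(stats["count"], stats["p05"], stats["median"], stats["p95"])
```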
maf results#
Additional tables exist for possible future support of saving maf summary metrics in the visit sequence metadata database.
There are currently no tools to support their use.
These tables are:
- maf_metrics
records parameters used to run metrics. Columns are maf_metric_name, rubin_sim_version, maf_constraint, metric_class_name, metric_args, slicer_class_name, slicer_args.
- maf_summary_metrics
records the values of summary metrics themselves for a given visit sequence. Columns are visitseq_uuid, maf_metric_name, day_obs, accumulated, summary_value. The combination of the day_obs and accumulated columns supports recording values from either visits only on a specific night (day_obs) if accumulated is false, or all visits up to and including a specific night (day_obs) if accumulated is true.
- maf_metric_sets
defines sets of metrics, following the use of such sets in rubin_sim.maf.run_comparison.
- maf_summary
is a view that makes it easy to get everything for the summary metrics for one metric set applied to runs with specified tags.
- maf_healpix_stats
supports recording of statistics of metric values when the metrics return healpix arrays.
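The meaning of the day_obs and accumulated pair described for maf_summary_metrics can be made concrete with a small selection function. This is an illustrative sketch, not archive code: `select_values` and the example visits are hypothetical, and each visit is reduced to a (day_obs, value) pair.

```python
from datetime import date
from typing import List, Tuple

Visit = Tuple[date, float]


def select_values(visits: List[Visit], day_obs: date, accumulated: bool) -> List[float]:
    """Select the values that a (day_obs, accumulated) pair refers to."""
    if accumulated:
        # All visits up to and including the given night.
        return [value for night, value in visits if night <= day_obs]
    # Only visits on the given night.
    return [value for night, value in visits if night == day_obs]


example_visits = [
    (date(2025, 10, 14), 1.0),
    (date(2025, 10, 15), 2.0),
    (date(2025, 10, 15), 3.0),
    (date(2025, 10, 16), 4.0),
]
night_only = select_values(example_visits, date(2025, 10, 15), accumulated=False)
up_to_night = select_values(example_visits, date(2025, 10, 15), accumulated=True)
print(night_only, up_to_night)
```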
Comments#
The comments table associates comments with visit sequences:
| Column | Type | Description |
|---|---|---|
| visitseq_uuid | UUID | Identifier of the visit sequence to which the comment belongs |
| comment_time | TIMESTAMP WITH TIME ZONE | When the comment was added (defaults to NOW()) |
| author | TEXT | User or system that added the comment |
| comment | TEXT | The comment text (not nullable) |