The `santaR`

package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. The graphical user interface implements all of `santaR`

’s functions.

The GUI is to be prefered to understand the methodology, select the best parameters on a subset of the data before running the command line, or to visually explore results.

This vignette will: * Detail the step-by-step use of the graphical user interface using an example dataset

This vignette employ the `.csv`

and `.RData`

files generated from `acuteInflammation`

in the vignette how to prepare input data for santaR.

The graphical user interface is started as follow:

The graphical interface is divided in 4 main tabs,

Import,DF search,AnalysisandExport.

The first input format is a `.csv`

file containing as *rows* the observations (samples) and as *columns* the variables as well as metadata.

Columns corresponding to metadata are selected: the metadata describe the individual ID and collection time corresponding to each observation, with optionally class information for identification of inter-class differential trajectories.

Additionally, data previously imported as well as fitting results (in `.RData`

format) can be loaded for further analysis or plotting (see the **Export** section for more details).

Note:The single parameter to be set by the user is the number of

degrees of freedomto fit the spline model. The \(df\) controls how closely the curve models the input data-points.dfOnce the \(df\) is chosen for a dataset (a given number of time-points and missing values), it can be kept constant whichever the question to investigate (the metadata and group comparison).

Some indications based on simulated data and diverse datasets can guide the selection of \(df\):

- \(df\) controls the
“complexity”of the model employed. A substantial difference can be found when going from2to10, but very little change will take place when going from10to50(the model only gets more complex, but the general shape won’t change).- More time points do not automatically require a higher \(df\). More inflexions (more complex shape) could require a higher \(df\) if the number of points is sufficient (and the sampling frequency high).
- A lower \(df\) value is often more suited and generalisable (less over-fitted).
- If the \(df\) is for example
10, all individuals trajectories with less of 10 time-points cannot be fitted and will be rejected.- On simulated data, the results (\(p\)-values) are resilient to most values of \(df\), however the plots can look dramatically different.
Trying multiple values of \(df\) on a subset of variables (using the GUI) and then selecting the fit that approximate the time evolution the best without over-fitting:

- \(df=5\) is a good starting point in most cases (even more so if there is less than 10 time-points)
- If the number of time-points is large and the curves seem very under-fitted, \(df\) can be increased to
6,7or more. Values higher than10should rarely be required and will provide with a diminishing return.df=number of time-pointswill result in a curve passing through all points (over-fitted).- If the number of points is lower or the trajectories seem over-fitted, \(df\) can be decreased to
4or3.3will be similar to a second degree polynomial, while2will be a linear model)- If the plots
“look right”and don’t seem to“invent”information between measured data-points, the \(df\) is close to optimal.It does not seem to be possible to automatically select the degree of freedom. A choice based on visualisation of the splines while being careful with over-fitting, keeping in mind the

“expected”evolution of the underlying process seems the most reasonable approach.

Even if automated approaches cannot reliably select a number of degree of freedom to employ, **DF search** implements some of these approaches and multiple tools to help guide optimal \(df\) selection.

**Auto-Fit** uses principal component analysis (PCA) to extract latent trajectories and generate eigen-trajectories that are subsequently assessed for optimal \(df\) using various goodness-of-fit metrics.

**Parameter evolution** plots the evolution of these metrics across the range of possible \(df\) for each latent trajectory.

To select the most suitable \(df\) parameter, **Plot fit** generates a visualisation of the fit on each latent projection at automatically and manually selected \(df\) values.

Finally **Missing value** highlight the number of trajectories that would have to be excluded as they contain less time-points than the \(df\) selected.

With the data imported and a pertinent \(df\) value selected, **Analysis** regroups the fitting, visualisation and identification of variables significantly altered between groups.

`Fit`

handles parameter selection as well as downstream computation. Calculation of inter-group differential evolutions can be performed with either initial class information or an advance option generated new grouping (e.g., including / combining / excluding input groups). The user can control the number of permutations and bootstrap rounds for significance and group mean curve confidence band calculation. The sub-sampling or the area between group mean curves can be altered to favour calculation speed at the expense of numerical precision. `Parallelisation`

enables the selection of the number of CPU cores to employ for computation. `View Input`

presents the dataset as fitted.

`Plot`

enables the interactive visualisation of the raw data points, individual trajectories, group mean curves and confidence bands for all variables, which subsequently can be saved as an image figure to disk.

If inter-group differential evolution has been characterised, `P-value`

summarise in tables all significance testing - providing multiple options for false discovery correction (e.g., Benjamini-Hochberg, Benjamini-Yekutieli and Bonferroni) as well as confidence intervals on the \(p\)-values.