How to prepare input data for santaR
Arnaud Wolfer
2019-10-03
Source: vignettes/prepare-input-data.Rmd
The santaR package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. It is robust to missing values and noisy measurements without requiring synchronisation in time.
This vignette will:
- Detail the input format expected by the package
- Present the provided example dataset ‘acuteInflammation’
- Save ‘acuteInflammation’ as a .csv file and a .RData file, to be used as input for the graphical interface tutorial.
Data format
In short, for a given variable, each measurement (observation) is one element of a vector.
If more than one variable has been measured at a given time, multiple measurement columns can be provided in a data.frame (data) with observations as rows and variables as columns.
For each data point (row), the following metadata vectors are required (or can be stored in a data.frame metadata):
- time, the time at which the observation has been taken.
- ind, identifying which subject (individual) is associated with the observation.
Optionally:
- group, an identifier indicating to which study group the observation belongs.
All observations of a given individual must be assigned to the same group. If 2 groups exist, significantly altered time trajectories can be identified. If no group or more than 2 groups are provided, the trajectories can be plotted but significance cannot be calculated.
data and metadata information can be stored as vectors, in a single data.frame, or in two separate data.frames. If a data point is not available (no value for any variable), the row should be discarded. If only some of the variable measurements are missing for a given time-point, the missing values can be replaced by NaN. Do not impute data, as the package is explicitly designed to be robust to missing values.
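The constraints above can be checked before any analysis. The following is a minimal sketch in base R, assuming the metadata columns are named ind and group as described; check_input is a hypothetical helper written for this illustration and is not part of santaR.

```r
# Hypothetical helper (not part of santaR): basic sanity checks on the input
check_input <- function(data, meta) {
  # data and meta must describe the same observations, row by row
  stopifnot(nrow(data) == nrow(meta))

  # Rows where every variable is missing carry no information: discard them
  all_missing <- rowSums(!is.na(data)) == 0
  data <- data[!all_missing, , drop = FALSE]
  meta <- meta[!all_missing, , drop = FALSE]

  # Each individual must be assigned to a single group
  if ("group" %in% names(meta)) {
    groups_per_ind <- tapply(meta$group, meta$ind,
                             function(g) length(unique(g)))
    bad <- names(groups_per_ind)[which(groups_per_ind > 1)]
    if (length(bad) > 0) {
      stop("Individuals assigned to more than one group: ",
           paste(bad, collapse = ", "))
    }
  }
  list(data = data, meta = meta)
}

# Illustration on made-up values: the second row has no measurement at all
demo_meta <- data.frame(ind   = c("ind_1", "ind_1", "ind_2"),
                        time  = c(0, 5, 0),
                        group = c("group_A", "group_A", "group_B"))
demo_data <- data.frame(variable1 = c(1, NA, 4),
                        variable2 = c(110.2, NA, 79.1))
checked <- check_input(demo_data, demo_meta)
nrow(checked$data)
```

Partially missing rows are kept untouched, in line with the advice not to impute.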
Here is an example of 5 observations of 2 variables, taken on 3 individuals separated into 2 groups and covering 3 time-points:
# Metadata
ind | time | group |
---|---|---|
ind_1 | 0 | group_A |
ind_1 | 5 | group_A |
ind_2 | 0 | group_B |
ind_2 | 10 | group_B |
ind_3 | 5 | group_A |
# Data
variable1 | variable2 |
---|---|
1 | 110.2 |
3.5 | NA |
4 | 79.1 |
9.5 | 132 |
5 | 528.3 |
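As an illustration, the two example tables above could be built in R as follows. This is only a sketch using base R; the object names meta and data are chosen here to mirror the text.

```r
# Metadata: one row per observation
meta <- data.frame(
  ind   = c("ind_1", "ind_1", "ind_2", "ind_2", "ind_3"),
  time  = c(0, 5, 0, 10, 5),
  group = c("group_A", "group_A", "group_B", "group_B", "group_A"),
  stringsAsFactors = FALSE
)

# Data: same row order as the metadata, NA marks the missing measurement
data <- data.frame(
  variable1 = c(1, 3.5, 4, 9.5, 5),
  variable2 = c(110.2, NA, 79.1, 132, 528.3)
)

# Both tables must describe the same observations
stopifnot(nrow(meta) == nrow(data))
```

The row order matters: row i of data is the measurement described by row i of meta.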
Introducing the dataset ‘acuteInflammation’
The santaR package is designed for the analysis of short, noisy time-series as produced by most ‘-omics’ platforms, an example of which is provided. This dataset, referred to as acuteInflammation, contains the concentrations of 22 mediators of inflammation over an episode of acute inflammation. The mediators have been measured at 7 time-points on 8 subjects, and concentration values have been unit-variance scaled for each variable.
acuteInflammation is stored as two data.frames: meta for the metadata of the 56 observations, and data for the measurements of the 22 variables:
library(santaR)
## Metadata
# number of rows
nrow(acuteInflammation$meta)
# number of columns
ncol(acuteInflammation$meta)
# a subset
acuteInflammation$meta[12:20,]
##
## This is santaR version 1.2.3
[1] 56
[1] 3
row | time | ind | group |
---|---|---|---|
12 | 4 | ind_4 | Group2 |
13 | 4 | ind_5 | Group1 |
14 | 4 | ind_6 | Group2 |
15 | 4 | ind_7 | Group1 |
16 | 4 | ind_8 | Group2 |
17 | 8 | ind_1 | Group1 |
18 | 8 | ind_2 | Group2 |
19 | 8 | ind_3 | Group1 |
20 | 8 | ind_4 | Group2 |
## Data
# number of rows
nrow(acuteInflammation$data)
# number of columns
ncol(acuteInflammation$data)
# a subset
acuteInflammation$data[12:20,1:4]
[1] 56
[1] 22
row | var_1 | var_2 | var_3 | var_4 |
---|---|---|---|---|
12 | 2.498 | 1.307 | 0.08296 | 1.183 |
13 | -0.3399 | -0.6434 | 0.03206 | -0.8927 |
14 | 2.668 | 2.464 | 1.365 | 1.743 |
15 | -0.3002 | 0.05366 | 0.4509 | 0.01572 |
16 | 3.777 | 2.543 | 1.858 | 2.213 |
17 | -0.3275 | 0.1564 | 0.585 | 0.03299 |
18 | 0.708 | 0.4893 | -0.08219 | 0.9345 |
19 | -0.4101 | -0.03727 | -0.2914 | -0.7239 |
20 | -0.1577 | -0.6434 | -0.7398 | -0.2126 |
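To relate the 56 rows to the 7 time-points and 8 subjects, the study design can be summarised from the metadata alone. The sketch below uses base R and assumes the column names time, ind and group shown above; summarise_design is a hypothetical helper, not a santaR function, and the small demo_meta table is a stand-in so that the example runs without the package.

```r
# Hypothetical helper (not part of santaR): summarise the study design
summarise_design <- function(meta) {
  list(
    n_obs        = nrow(meta),                    # number of observations
    n_ind        = length(unique(meta$ind)),      # number of subjects
    n_time       = length(unique(meta$time)),     # number of time-points
    # count each subject once when tallying group sizes
    ind_by_group = table(unique(meta[, c("ind", "group")])$group)
  )
}

# With santaR loaded, the same call applies to the real metadata:
#   summarise_design(acuteInflammation$meta)

# Stand-in metadata with the same column names, for illustration only
demo_meta <- data.frame(
  time  = rep(c(0, 4, 8), each = 2),
  ind   = rep(c("ind_1", "ind_2"), times = 3),
  group = rep(c("Group1", "Group2"), times = 3)
)
design <- summarise_design(demo_meta)

# TRUE when every subject has been observed at every time-point
design$n_obs == design$n_ind * design$n_time
```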
Preparing the csv input for the graphical user interface
While the command line functions accept data.frames and vectors as input, the graphical user interface will read a .csv file.
By concatenating acuteInflammation’s data and metadata tables and saving them in a .csv file, we can prepare the input dataset for the graphical user interface tutorial:
library(santaR)
# Concatenate
outputTable <- cbind(acuteInflammation$meta, acuteInflammation$data)
# Save to disk
outputPath <- file.path('path_to_my_output_folder', 'acuteInflammation_GUI_demo.csv')
write.csv(outputTable, file=outputPath, row.names=FALSE)
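Before using such a file, it can be worth reading it back to confirm that the layout survived the export. The sketch below is one way to do this in base R; it writes to a temporary file and uses a small stand-in table with the same column layout (metadata columns first, then variables) so that it runs without santaR. The names csvPath, reloaded, metaPart and dataPart are chosen for this illustration only.

```r
# Stand-in for cbind(acuteInflammation$meta, acuteInflammation$data)
outputTable <- data.frame(
  time  = c(0, 0, 4, 4),
  ind   = c("ind_1", "ind_2", "ind_1", "ind_2"),
  group = c("Group1", "Group2", "Group1", "Group2"),
  var_1 = c(0.12, -0.40, NA, 1.30),
  var_2 = c(1.05, 0.33, 0.87, -0.21)
)

# Write to a temporary file, then read the file back as a check
csvPath <- tempfile(fileext = ".csv")
write.csv(outputTable, file = csvPath, row.names = FALSE)
reloaded <- read.csv(csvPath, stringsAsFactors = FALSE)

# Split the table again into its metadata and variable columns
metaCols <- c("time", "ind", "group")
metaPart <- reloaded[, metaCols]
dataPart <- reloaded[, setdiff(names(reloaded), metaCols), drop = FALSE]

# Dimensions are preserved and the missing value is read back as NA
stopifnot(identical(dim(reloaded), dim(outputTable)))
stopifnot(is.na(dataPart$var_1[3]))
```

Setting row.names=FALSE in write.csv matters here: otherwise an extra column of row names is written and would be read back as a first, unnamed column.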
It is also possible to provide the data directly as 2 data.frames stored in a .RData file, with the data in a data.frame named inData and the metadata in a data.frame named inMeta: