FDZ Data Cleaning
Annegret Rucker, Benjamin Becker
FDZ_Data_Cleaning.Rmd
This vignette illustrates how the Research Data Centre (FDZ) at the Institute for Educational Quality Improvement (IQB) cleans and tidies data for secondary usage purposes. For secondary usage, data is expected to be of high quality, traceable, comprehensible, and anonymized. Furthermore, meaningful metadata should enable secondary users to understand the data. For this purpose, operations such as recoding of identifier variables, emptying variables, changing labels, creating grouping variables, and many more are performed and therefore discussed in this vignette. Overall, the following aspects should be adressed when cleaning data:
- Removing all non-ASCII characters
- Appropriate missing tags
- Non-informative identifier variables
- Transforming strings to labeled numeric variables
- Anonymization of variables
- Adding variable and value labels
The setup for this illustration is the following:
## if necessary, install eatGADS and eatFDZ
#install.packages(eatGADS)
#remotes::install_github("beckerbenj/eatFDZ")
# load eatGADS
library(eatGADS)
# path to example data set
sav_path <- system.file("extdata", "example_data.sav", package = "eatFDZ")
eatGADS
is the R package used by the FDZ at IQB for data
cleaning purposes. It is available on CRAN and
extensively documented via vignettes (e.g., the Comprehensive
Data Cleaning Guide).
For illustrative purposes, a small, artificial SPSS example data set
of secondary school students in Berlin is used. The data set is
intentionally flawed and messy. The example data set is part of the
eatFDZ
package and installed alongside.
Loading the data
dat <- import_spss(sav_path)
str(dat$dat) # for a small overview in the variables of the data set
## 'data.frame': 10 obs. of 14 variables:
## $ ID : chr "BY1342" "RP7763" "ST1213" "NW5763" ...
## $ ID_name : chr "Finn" "Lukas" "Malia" "Lennard" ...
## $ info : chr "Father Markus helps with homework" "family always goes to Cortina ice cream parlor after school" "-97" "big sister is student representative" ...
## $ sex : num 1 1 2 1 2 1 1 2 2 1
## $ subjfav : chr "Math" "Physical Education" "Chemistry" "Art" ...
## $ age : num 14 15 14 14 13 15 14 17 14 13
## $ siblings : num -9 0 2 1 35 0 1 0 3 1
## $ home : num 1 1 1 1 2 2 1 1 2 -98
## $ birth : chr "01/11/2009" "05/23/2007" "08/31/2008" "12/07/2008" ...
## $ books : num 6 4 -99 1 NA 2 3 2 4 5
## $ school : num 4 -99 -98 3 -99 2 3 -97 4 4
## $ grade_math: num 1 3 4 2 6 2 3 5 4 1
## $ grade_germ: num 2 3 2 6 5 4 2 1 1 2
## $ grade_eng : num 3 4 1 3 2 3 5 2 1 6
The data set is loaded via the import_spss()
command.
The data set contains 14 variables with 10 observations, including
demographic variables such as age and sex of the students, their family
situation and selected grades.
Note that the data set is imported as a GADSdat
object.
For detailed explanations on the corresponding object structure and
technical details, we refer readers to the eatGADS
documentation.
Removing all non-ASCII characters
Non-ASCII characters frequently lead to issues depending on the
operating system and statistical software used. If your data contains
non-ASCII characters, it might cause compatibility issues or errors when
processing the data. fixEncoding()
removes non-ASCII
characters from a character vector or a GADSdat
object.
dat$dat$ID_name
## [1] "Finn" "Lukas" "Malia" "Lennard" "Änne" "Gerrit" "Emil"
## [8] "Mila" "Clara" "Malek"
dat <- fixEncoding(dat, input = "other")
dat$dat$ID_name
## [1] "Finn" "Lukas" "Malia" "Lennard" "Aenne" "Gerrit" "Emil"
## [8] "Mila" "Clara" "Malek"
The non-ASCII character in the ID name has been changed to
ae
.
Define missings semi-automatically
A crucial meta data property for SPSS data are missing
tags. Sometimes not all missing values are correctly tagged as
missing. Via eatGADS
, missing tags can be corrected
semi-automatically. This can either be done based on wordings of
existing value labels (checkMissings()
) or based on a value
range of existing value labels
(checkMissingsByValues()
).
checkMissings()
tests whether a set of specified regular
expressions occurs in the value labels and compares the corresponding
missings tags. For instance the variable books
contains a
value label omitted
, which has not been tagged as
missing.
extractMeta(dat, vars = c("books")) # usually, we wouldn't know yet that this variable contains these value labels. This is for demonstration purposes to see the before and after effect of the function
## varName varLabel format display_width labeled value
## 12 books Number of books at home F10.0 NA yes -99
## 13 books Number of books at home F10.0 NA yes 1
## 14 books Number of books at home F10.0 NA yes 2
## 15 books Number of books at home F10.0 NA yes 3
## 16 books Number of books at home F10.0 NA yes 4
## 17 books Number of books at home F10.0 NA yes 5
## 18 books Number of books at home F10.0 NA yes 6
## valLabel missings
## 12 omitted valid
## 13 0-10 books valid
## 14 11-25 books valid
## 15 26-100 books valid
## 16 101-200 books valid
## 17 valid
## 18 more than 500 books valid
missinglabels <- paste("missing", "unknown",
"omitted",
sep = "|")
dat1 <- checkMissings(dat, missingLabel = missinglabels)
## The following variables have value labels including the term 'missing|unknown|omitted' which are not coded as missing:
## books, school
## 'miss' is inserted into column missings for 3 rows.
extractMeta(dat1, vars = c("books"))
## varName varLabel format display_width labeled value
## 12 books Number of books at home F10.0 NA yes -99
## 13 books Number of books at home F10.0 NA yes 1
## 14 books Number of books at home F10.0 NA yes 2
## 15 books Number of books at home F10.0 NA yes 3
## 16 books Number of books at home F10.0 NA yes 4
## 17 books Number of books at home F10.0 NA yes 5
## 18 books Number of books at home F10.0 NA yes 6
## valLabel missings
## 12 omitted miss
## 13 0-10 books valid
## 14 11-25 books valid
## 15 26-100 books valid
## 16 101-200 books valid
## 17 valid
## 18 more than 500 books valid
checkMissings()
has now automatically tagged the value
-99
with the label omitted
as missing
(miss
).
checkMissingsByValues()
checks whether value labels
exist for a specified range of numeric values and whether missing tags
are provided accordingly:
extractMeta(dat1, vars = c("school")) # usually, we wouldn't know yet that this variable contains these value labels. This is for demonstration purposes to see the before and after effect of the function
## varName varLabel format display_width labeled value valLabel
## 19 school XXX F8.2 NA yes -99 omitted
## 20 school XXX F8.2 NA yes -98 unclear answer
## 21 school XXX F8.2 NA yes -97 unknown
## 22 school XXX F8.2 NA yes 1 Hauptschule
## 23 school XXX F8.2 NA yes 2 Realschule
## 24 school XXX F8.2 NA yes 3 Integrierte Gesamtschule
## 25 school XXX F8.2 NA yes 4 Gymnasium
## missings
## 19 miss
## 20 valid
## 21 miss
## 22 valid
## 23 valid
## 24 valid
## 25 valid
dat2 <- checkMissingsByValues(dat1, missingValues = -50:-99)
## The following variables have values in the 'missingValues' range which are not coded as missing:
## school
## 'miss' is inserted into column missings for 1 rows.
extractMeta(dat2, vars = c("school"))
## varName varLabel format display_width labeled value valLabel
## 19 school XXX F8.2 NA yes -99 omitted
## 20 school XXX F8.2 NA yes -98 unclear answer
## 21 school XXX F8.2 NA yes -97 unknown
## 22 school XXX F8.2 NA yes 1 Hauptschule
## 23 school XXX F8.2 NA yes 2 Realschule
## 24 school XXX F8.2 NA yes 3 Integrierte Gesamtschule
## 25 school XXX F8.2 NA yes 4 Gymnasium
## missings
## 19 miss
## 20 miss
## 21 miss
## 22 valid
## 23 valid
## 24 valid
## 25 valid
The previous function had already tagged two of the missings
correctly, but the value label -98
with the value label
unclear answer
was not used. Now, it is contained in the
specified range and therefore tagged as a missing.
Define missing values by hand
At times, only individual missings need to be tagged.
changeMissings()
tags missing values as missings by
specifying them under value
and marks them as
miss
under missings.
extractMeta(dat2, vars = c("info"))
## varName varLabel format display_width labeled value valLabel missings
## 3 info General notes A8 47 no NA <NA> <NA>
dat3 <- changeMissings(dat2, varName = "info", value = c("-97", "-98", "-99"), missings = c("miss", "miss", "miss"))
extractMeta(dat3, vars = c("info"))
## varName varLabel format display_width labeled value valLabel missings
## 3 info General notes A8 47 yes -99 <NA> miss
## 4 info General notes A8 47 yes -98 <NA> miss
## 5 info General notes A8 47 yes -97 <NA> miss
The values -97
, -98
and -99
are now tagged as miss
.
Recode IDs
ID variables are recoded so no hidden identifiers such as state, school or even name codes allow conclusions to be drawn about individual persons. To ensure that the IDs remain the same across multiple subsets and describe the same person, they are recoded using a template. First, a path for the template is created.
f <- tempfile()
ID_recode_template <- file.path("f") # use your own file path instead of the tempfile
Then there are two options: Creating a new template (option 1 - for the first subset) or applying an existing template (option 2 - for the following subsets).
Option 1: IDs recoded with creation of a template as .csv
autoRecode()
creates a recoding template for the
variable ID
, saved as a .csv file, and in the course of
this a new variable with the variable suffix _FDZ
and an
explanatory label suffix.
dat3 <- autoRecode(dat3, var = "ID", var_suffix = "_FDZ", csv_path = ID_recode_template, label_suffix = " (recoded by FDZ)")
Option 2: IDs recoded with application of a template (e.g. created by a previously prepared, different data set).
Already existing templates can be used for following subsets.
read.csv()
imports the template, which can be applied with
autoRecode()
template <- read.csv(ID_recode_template)
dat3 <- autoRecode(dat3, var = "ID", suffix = "_FDZ", csv_path = ID_recode_template, template = template,
label_suffix = "(rekodiert FDZ)") # Apply template and overwrite
removeVars()
deletes variables from the data set. In
this case, the old ID variable is deleted.
dat3 <- removeVars(dat3, vars = c("ID"))
Convert string variables to numeric variables
For (semi)open variables there are two procedures. If it is possible (e.g., for a small data set) to go through all responses to check whether there are no strong privacy violations or insults included, the variables are converted into numeric ones and then emptied in the Scientific Use Files. Thereby, the information remain accessible on a meta-level, but can no longer be assigned to a person. If there are too many responses to check or privacy cannot be granted, the variable is emptied entirely in the Scientific Use Files. To convert the string variables into numeric ones, all string variables are identified first.
all_types <- do.call(rbind, lapply(namesGADS(dat3), function(nam) data.frame(varName = nam, type = class(dat3$dat[[nam]]))))
all_types[all_types$type == "character", ]
## varName type
## 1 ID character
## 2 ID_name character
## 3 info character
## 5 subjfav character
## 9 birth character
In this case the variables ID_name
, info
,
subjfav
and birth
were recognized as string
variables. In the next step they are converted.
Option 1: Convert all string variables automatically (overwrite variables)
multiChar2fac()
converts the string variables identified
in advance into numeric variables using a for loop.
charNames <- all_types[all_types$type == "character", "varName"]
for(charName in charNames) {
dat4 <- multiChar2fac(dat3, vars = charName, var_suffix = "", label_suffix = "")
}
Option 2: Convert string variable individually
multiChar2fac()
converts the string variables named in
vars
individually into numeric variables. If the option
var_suffix
is used, a new variable is created.
dat4 <- multiChar2fac(dat4, vars = c("ID_name", "info", "subjfav", "birth"), var_suffix = "_FDZ", label_suffix = "")
If the var_suffix
option is left blank, the existing
variable is overwritten.
dat4 <- multiChar2fac(dat4, vars = c("ID_name", "info", "subjfav", "birth"), var_suffix = "", label_suffix = "")
Empty variables already in IZA
Variables that may not even be offered via a remote computing system
due to strong data protection concerns are emptied for all data sets.
emptyTheseVariables()
empties the listed variables and adds
an explanation to the variable label on why the information was
removed.
extractMeta(dat4, vars = c("info", "ID_name"))
## varName varLabel format display_width labeled value valLabel missings
## 2 ID_name XXX A8 NA no NA <NA> <NA>
## 3 info General notes A8 47 yes -99 <NA> miss
## 4 info General notes A8 47 yes -98 <NA> miss
## 5 info General notes A8 47 yes -97 <NA> miss
dat4_empty <- c("info",
"ID_name")
dat5 <- emptyTheseVariables(dat4,
vars = dat4_empty,
label_suffix = " (Zur Anonymisierung geleert (FDZ))")
extractMeta(dat5, vars = c("info", "ID_name"))
## varName varLabel format
## 2 ID_name XXX (Zur Anonymisierung geleert (FDZ)) A8
## 3 info General notes (Zur Anonymisierung geleert (FDZ)) A8
## 4 info General notes (Zur Anonymisierung geleert (FDZ)) A8
## 5 info General notes (Zur Anonymisierung geleert (FDZ)) A8
## display_width labeled value valLabel missings
## 2 NA no NA <NA> <NA>
## 3 47 yes -99 <NA> miss
## 4 47 yes -98 <NA> miss
## 5 47 yes -97 <NA> miss
dat4
corresponds to the data set, vars
lists the variables to be emptied and label_suffix
automatically adds the explanation to the variable label indicating that
the data has been emptied for anonymization reasons.
Change variable and value labels
Variable and value labels help us with the traceability and
comprehensibility of a data set. ChangeVarLabels()
selects
the variable to be changed with varName
and assigns the
corresponding label with varLabel
.
ChangeValLabels()
works in a similar way, the corresponding
value is selected in value
and the label for this value is
created with valLabel
.
- Changing variable labels
extractMeta(dat5, vars = c("school"))
## varName varLabel format display_width labeled value valLabel
## 30 school XXX F8.2 NA yes -99 omitted
## 31 school XXX F8.2 NA yes -98 unclear answer
## 32 school XXX F8.2 NA yes -97 unknown
## 33 school XXX F8.2 NA yes 1 Hauptschule
## 34 school XXX F8.2 NA yes 2 Realschule
## 35 school XXX F8.2 NA yes 3 Integrierte Gesamtschule
## 36 school XXX F8.2 NA yes 4 Gymnasium
## missings
## 30 miss
## 31 miss
## 32 miss
## 33 valid
## 34 valid
## 35 valid
## 36 valid
dat6 <- changeVarLabels(dat5, varName = "school", varLabel = "School type")
extractMeta(dat6, vars = c("school"))
## varName varLabel format display_width labeled value
## 30 school School type F8.2 NA yes -99
## 31 school School type F8.2 NA yes -98
## 32 school School type F8.2 NA yes -97
## 33 school School type F8.2 NA yes 1
## 34 school School type F8.2 NA yes 2
## 35 school School type F8.2 NA yes 3
## 36 school School type F8.2 NA yes 4
## valLabel missings
## 30 omitted miss
## 31 unclear answer miss
## 32 unknown miss
## 33 Hauptschule valid
## 34 Realschule valid
## 35 Integrierte Gesamtschule valid
## 36 Gymnasium valid
- Changing value labels
extractMeta(dat6, vars = c("books"))
## varName varLabel format display_width labeled value
## 23 books Number of books at home F10.0 NA yes -99
## 24 books Number of books at home F10.0 NA yes 1
## 25 books Number of books at home F10.0 NA yes 2
## 26 books Number of books at home F10.0 NA yes 3
## 27 books Number of books at home F10.0 NA yes 4
## 28 books Number of books at home F10.0 NA yes 5
## 29 books Number of books at home F10.0 NA yes 6
## valLabel missings
## 23 omitted miss
## 24 0-10 books valid
## 25 11-25 books valid
## 26 26-100 books valid
## 27 101-200 books valid
## 28 valid
## 29 more than 500 books valid
dat7 <- changeValLabels(dat6, varName = "books", value = 5, valLabel = "201-500 books")
extractMeta(dat7, vars = c("books"))
## varName varLabel format display_width labeled value
## 23 books Number of books at home F10.0 NA yes -99
## 24 books Number of books at home F10.0 NA yes 1
## 25 books Number of books at home F10.0 NA yes 2
## 26 books Number of books at home F10.0 NA yes 3
## 27 books Number of books at home F10.0 NA yes 4
## 28 books Number of books at home F10.0 NA yes 5
## 29 books Number of books at home F10.0 NA yes 6
## valLabel missings
## 23 omitted miss
## 24 0-10 books valid
## 25 11-25 books valid
## 26 26-100 books valid
## 27 101-200 books valid
## 28 201-500 books valid
## 29 more than 500 books valid
Anonymization: Form grouping variable
Small case numbers pose a risk of anonymity. For variables with a
number of cases <6, which are not ratings, feelings or opinions and
that contain sensitive information about individuals, it is advisable to
combine small numbers of cases into meaningful groups. To do so,
cloneVariable()
creates a new _FDZ variable by cloning the
variable to be grouped. For varName
the original variable
name is entered, for new_varName
the variable name
including the appendix “_FDZ”. The variable label is automatically
adjusted with the option label_suffix
.
recodeGADS()
replaces the old values
(oldValues
) by new ones (newValues
) and
changeValLabels()
changes the value labels of the new
groupings.
extractMeta(dat7, vars = c("age"))
## varName varLabel format display_width labeled value valLabel missings
## 9 age Age F10.0 NA no NA <NA> <NA>
dat8 <- cloneVariable(dat7, varName = "age", new_varName = "age_FDZ", label_suffix = " (Zur Anonymisierung aggregiert (FDZ))")
dat8 <- recodeGADS(dat8, varName = "age_FDZ", oldValues = c(13, 14, 15, 17),
newValues = c(13, 13, 15, 15))
dat8 <- changeValLabels(dat8, varName = "age_FDZ", value = c(13, 15), valLabel = c("13-14", "15-17"))
extractMeta(dat8, vars = c("age_FDZ"))
## varName varLabel format display_width
## 40 age_FDZ Age (Zur Anonymisierung aggregiert (FDZ)) F10.0 NA
## 41 age_FDZ Age (Zur Anonymisierung aggregiert (FDZ)) F10.0 NA
## labeled value valLabel missings
## 40 yes 13 13-14 valid
## 41 yes 15 15-17 valid
Groupings for multiple variables can also be performed using a loop.
For this the variables are given a temporary name consisting of the
variable name and the suffix “_FDZ”. varname
represents
each individual variable, which was summarized before in
gr_var
. The individual steps (cloneVariable()
,
recodeGADS()
,changeValLabels()
) are therefore
performed for each of the variables in turn.
extractMeta(dat8, vars = c("grade_math"))
## varName varLabel format display_width labeled value valLabel
## 37 grade_math Grade mathematics F8.2 NA no NA <NA>
## missings
## 37 <NA>
dat9 <- dat8
gr_var <- c("grade_math", "grade_germ", "grade_eng")
for (varname in gr_var) {
tempnewname <- paste0(varname, "_FDZ")
dat9 <- cloneVariable(dat9, varName = varname,
new_varName = tempnewname, label_suffix = " (Zur Anonymisierung aggregiert (FDZ))")
dat9 <- recodeGADS(dat9, varName = tempnewname,
oldValues = c(1:6),
newValues = c(rep(1,4),rep(2,2)))
dat9 <- changeValLabels(dat9, varName = tempnewname, value = c(1, 2),
valLabel = c("passed", "not passed"))
}
extractMeta(dat9, vars = c("grade_math_FDZ"))
## varName varLabel
## 42 grade_math_FDZ Grade mathematics (Zur Anonymisierung aggregiert (FDZ))
## 43 grade_math_FDZ Grade mathematics (Zur Anonymisierung aggregiert (FDZ))
## format display_width labeled value valLabel missings
## 42 F8.2 NA yes 1 passed valid
## 43 F8.2 NA yes 2 not passed valid
Now two groups have been created. The first group includes the students with grades from 1-4, which means they have passed. In the other group, there are the students with grades 5 and 6, which means they have not passed. These groups were created for all three subjects Math, German and English at once.
Creating the version variable
To indicate the state of the data set, each data set is provided with
a version variable. It contains the version of the data record and the
date of the last change. createVariable()
creates this
variable whereas relocateVariable()
places it at the
beginning of the record so it has a consistent, visible place.
version_name <- "Version_v1_2023_11"
dat10 <- createVariable(dat9, varName = version_name)
dat10 <- relocateVariable(dat10, var = version_name, after = NULL)
Change variable order
Newly created variables are initially placed at the end of a data
set. orderLike()
rearranges the variables by specifying the
variables in their new order.
dat11 <- orderLike(dat10, newOrder = c("Version_v1_2023_11", "ID", "ID_name", "info", "home", "birth", "age", "age_FDZ", "sex", "siblings", "books", "school", "subjfav", "grade_math", "grade_math_FDZ", "grade_germ", "grade_germ_FDZ", "grade_eng", "grade_eng_FDZ"))
Often it is only a matter of individual variables that are supposed
to be sorted in, such as newly created grouping variables behind their
emptied original variable. relocateVariable()
sets the
variables after the selected one.
dat11 <- relocateVariable(dat11, var = "age_FDZ", after = "age")