| Title: | Diagnose, Visualize, and Aggregate Event Report Level Data |
|---|---|
| Description: | Diagnose, visualize, and aggregate event report level data to the event level. Users provide an event report level dataset, specify their aggregation rules, and the package produces a dataset aggregated at the event level. Also includes the Modes and Agents of Election-Related Violence in Côte d'Ivoire and Kenya (MAVERICK) dataset, an event report level dataset that records all documented instances of electoral violence from the first multiparty election to 2022 in Côte d'Ivoire (1995-2022) and Kenya (1992-2022). For more details see van Baalen and Höglund (2026) <doi:10.1093/isq/sqag014>. Users of the enclosed MAVERICK dataset should also cite van Baalen and Höglund (2026) <doi:10.1093/jopres/xjaf012>. |
| Authors: | Sebastian van Baalen [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-3098-5587>), Kristine Höglund [aut] (ORCID: <https://orcid.org/0000-0001-7167-609X>) |
| Maintainer: | Sebastian van Baalen <[email protected]> |
| License: | CC BY 4.0 |
| Version: | 0.1.2 |
| Built: | 2026-06-09 07:35:58 UTC |
| Source: | https://github.com/sebastianvanbaalen/eventreport |
This convenience function aggregates the MAVERICK event report data to the event level using the most-conservative aggregation model.
aggregate_maverick_con(data)

| Argument | Description |
|---|---|
| data | The MAVERICK event report level dataset. Already pre-loaded. |

Returns a dataframe of the most-conservative aggregation of the MAVERICK dataset.
maverick_conservative <- aggregate_maverick_con()
This convenience function aggregates the MAVERICK event report data to the event level using the most-informative aggregation model.
aggregate_maverick_inf(data)

| Argument | Description |
|---|---|
| data | The MAVERICK event report level dataset. Already pre-loaded. |

Returns a dataframe of the most-informative aggregation of the MAVERICK dataset.
maverick_informative <- aggregate_maverick_inf()
This convenience function aggregates the MAVERICK event report data to the event level using the most-representative aggregation model.
aggregate_maverick_rep(data)

| Argument | Description |
|---|---|
| data | The MAVERICK event report level dataset. Already pre-loaded. |

Returns a dataframe of the most-representative aggregation of the MAVERICK dataset.
maverick_representative <- aggregate_maverick_rep()
This function combines strings from a character variable.
aggregate_strings(str_var)

| Argument | Description |
|---|---|
| str_var | A character vector. |

Returns a single character string with unique strings concatenated by semicolons.
aggregate_strings(c("apple", "banana", "apple", "Unknown", "orange", " "))
This function aggregates event report data based on a specified grouping variable and various aggregation criteria.
aggregateData(
  data,
  group_var = "event_id",
  find_mode = NULL,
  find_mode_na_ignore = NULL,
  find_mode_bin = NULL,
  find_mode_date = NULL,
  find_mode_numeric = NULL,
  find_least_precise = NULL,
  find_most_precise = NULL,
  combine_strings = NULL,
  find_max = NULL,
  find_min = NULL,
  summarize_vars = NULL,
  aggregation_name = NULL,
  tie_break = "default_tie_break",
  second_tie_break = "default_tie_break"
)

| Argument | Description |
|---|---|
| data | A data frame containing the data to be aggregated. |
| group_var | A string specifying the variable to group by. Default is "event_id". |
| find_mode | A vector of variable names for which to find the mode. |
| find_mode_na_ignore | A vector of variable names for which to find the mode, ignoring NAs. |
| find_mode_bin | A vector of variable names for which to find the binary mode. |
| find_mode_date | A vector of variable names for which to find the mode for dates. |
| find_mode_numeric | A vector of variable names for which to find the mode for numeric values. |
| find_least_precise | A list of lists, each containing a variable name and its corresponding precision variable, to find the least precise value. |
| find_most_precise | A list of lists, each containing a variable name and its corresponding precision variable, to find the most precise value. |
| combine_strings | A vector of variable names for which to combine strings. |
| find_max | A vector of variable names for which to find the maximum value. |
| find_min | A vector of variable names for which to find the minimum value. |
| summarize_vars | A vector of variable names for which to sum all values. |
| aggregation_name | A string specifying the name of the aggregation. |
| tie_break | A string specifying the tie break column name. Default is "default_tie_break". |
| second_tie_break | A string specifying the second tie break column name. Default is "default_tie_break". |

A data frame with the aggregated results.
small_maverick_event_report %>%
  aggregateData(group_var = "event_id", find_mode = "city") %>%
  utils::head(10)
This convenience function runs all six diagnostic functions in the package (mean divergence, normalized divergence, mean standard deviation, mean range, share of events with disagreement, and modal confidence) and returns a combined tibble with one row per variable.
aggregation_diagnostics(data, group_var, variables)

| Argument | Description |
|---|---|
| data | A data frame containing event report level data. |
| group_var | A character string naming the column that uniquely identifies events (e.g., "event_id"). |
| variables | A character vector of column names to include in the diagnostics. |
The function handles mixed-type input: each diagnostic is only run on the subset of variables for which it is valid. Variables that do not apply to a particular diagnostic will have 'NA' in that column.
A tibble with one row per variable and columns:
The name of each variable.
Mean divergence score.
Normalized divergence score.
Mean within-event standard deviation (numeric variables only).
Mean within-event range (numeric variables only).
Share of events with any disagreement.
Average modal confidence per variable.
small_maverick_event_report %>%
  aggregation_diagnostics(
    group_var = "event_id",
    variables = c("city", "deaths_best", "actor1")
  )
This function determines the mode of a variable 'x', filtered to entries with the maximum value of a specified precision vector 'precision_var'. It optionally resolves ties using one or two additional vectors for tie-breaking.
calc_max_precision(x, precision_var, tie_break = NULL, second_tie_break = NULL)

| Argument | Description |
|---|---|
| x | A vector of values for which to find the mode. |
| precision_var | A vector of precision values corresponding to 'x', used to filter to maximum values. |
| tie_break | Optional; a vector used as the first tie-break criterion. |
| second_tie_break | Optional; a vector used as the second tie-break criterion. |

Returns the mode of 'x' for entries with maximum 'precision_var' value. If no valid entries exist, returns an empty string.
x <- c("apple", "apple", "banana", "banana")
precision_var <- c(1, 2, 1, 2)
tie_break <- c(1, 2, 1, 2)
second_tie_break <- c(1, 1, 2, 1)
calc_max_precision(x, precision_var, tie_break, second_tie_break)
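The filtering step described for this function can be sketched in base R. This is only an illustration, assuming that "maximum precision" means the largest value in precision_var; it is not the package's implementation and it stops before any tie-breaking is applied.

```r
x <- c("apple", "apple", "banana", "banana")
precision_var <- c(1, 2, 1, 2)

# Keep only the entries reported with the highest precision.
candidates <- x[precision_var == max(precision_var)]
candidates

# Count how often each remaining value occurs; the mode is taken from these.
counts <- table(candidates)
counts
```

In this input both remaining values occur once, so the mode is tied. That is the situation the tie_break and second_tie_break arguments are meant to resolve.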
This function determines the mode of a variable 'x', filtered to entries with the minimum value of a specified precision vector 'precision_var'. It optionally resolves ties using one or two additional vectors for tie-breaking.
calc_min_precision(x, precision_var, tie_break = NULL, second_tie_break = NULL)

| Argument | Description |
|---|---|
| x | A vector of values for which to find the mode. |
| precision_var | A vector of precision values corresponding to 'x', used to filter to minimum values. |
| tie_break | Optional; a vector used as the first tie-break criterion. |
| second_tie_break | Optional; a vector used as the second tie-break criterion. |

Returns the mode of 'x' for entries with minimum 'precision_var' value. If no valid entries exist, returns an empty string.
x <- c("apple", "apple", "banana", "banana")
precision_var <- c(1, 2, 1, 2)
tie_break <- c(1, 2, 1, 2)
second_tie_break <- c(1, 1, 2, 1)
calc_min_precision(x, precision_var, tie_break, second_tie_break)
This function calculates the mode of a given vector and optionally resolves ties using one or two levels of tie-breaks.
calc_mode(x, tie_break = NULL, second_tie_break = NULL)

| Argument | Description |
|---|---|
| x | A character vector for which to find the mode. |
| tie_break | An optional numeric vector used as the first tie-break criterion. |
| second_tie_break | An optional numeric vector used as the second tie-break criterion when the first is insufficient. |

Returns the mode of 'x'. If there are multiple modes and no tie-breaks are specified or they do not resolve the ties, returns "Indeterminate".
data <- c("apple", "apple", "banana", "banana")
tie_break <- c(1, 2, 1, 2)
second_tie_break <- c(1, 1, 2, 1)
calc_mode(data) # Expect: "Indeterminate"
calc_mode(data, tie_break) # Expect: "Indeterminate"
calc_mode(data, tie_break, second_tie_break) # Expect: "banana"
Calculate mode of a binary numeric vector
calc_mode_binary(x)

| Argument | Description |
|---|---|
| x | A numeric vector consisting only of binary values (0 and 1). |

Returns a numeric vector representing the mode value. Returns 1 if there is a tie. Returns 'NA' if the vector is empty.
calc_mode_binary(c(0, 1, 1, 0, 1))
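The tie and empty-input rules can be illustrated with a small base R sketch. The helper name mode_binary_sketch is hypothetical and this is not the package's source; it only mirrors the documented behavior and assumes the input contains no missing values.

```r
# Hypothetical helper: mirrors the documented rules, not the package source.
mode_binary_sketch <- function(x) {
  if (length(x) == 0) return(NA_real_)  # empty input gives NA
  ones  <- sum(x == 1)
  zeros <- sum(x == 0)
  if (ones >= zeros) 1 else 0           # a tie resolves to 1
}

mode_binary_sketch(c(0, 1, 1, 0, 1))  # 1: three ones against two zeros
mode_binary_sketch(c(0, 1))           # 1: tie
mode_binary_sketch(numeric(0))        # NA: empty input
```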
Calculate mode of date vector
calc_mode_date(x)

| Argument | Description |
|---|---|
| x | A character vector where each element is a date in "YYYY-MM-DD" format. |

Returns a date vector representing the modal date, or the mean of the modal dates if there is a tie.
calc_mode_date(c("2021-01-01", "2021-01-02", "2021-01-01"))
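As a rough illustration of the rule "modal date, or the mean of the tied modal dates", the following base R sketch may help. The helper name mode_date_sketch is hypothetical, and how the package itself rounds a mean that falls between two days is not stated here, so the tie example uses dates two days apart.

```r
# Hypothetical helper: modal date, or the mean of the tied modal dates.
mode_date_sketch <- function(x) {
  dates  <- as.Date(x, format = "%Y-%m-%d")
  counts <- table(dates)
  # Dates that share the highest frequency.
  modes  <- as.Date(names(counts)[counts == max(counts)])
  mean(modes)  # mean() has a Date method, so the result is a Date
}

mode_date_sketch(c("2021-01-01", "2021-01-02", "2021-01-01"))  # 2021-01-01
mode_date_sketch(c("2021-01-01", "2021-01-03"))                # 2021-01-02
```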
This function calculates the mode of a given vector, ignoring 'NA' and empty strings, and optionally resolves ties using one or two levels of tie-breaks. If all values are 'NA' or empty, the function returns 'NA'.
calc_mode_na_ignore(x, tie_break = NULL, second_tie_break = NULL)

| Argument | Description |
|---|---|
| x | A character vector for which to find the mode. |
| tie_break | An optional numeric vector used as the first tie-break criterion. |
| second_tie_break | An optional numeric vector used as the second tie-break criterion when the first is insufficient. |

Returns the mode of 'x' ignoring 'NA' and empty strings. If the filtered vector is empty or all elements are 'NA' or empty, returns 'NA'.
data <- c("apple", "", "banana", NA)
tie_break <- c(1, NA, 1, NA)
second_tie_break <- c(1, NA, 2, NA)
calc_mode_na_ignore(data) # Expect: "apple"
calc_mode_na_ignore(data, tie_break) # Expect: "banana"
calc_mode_na_ignore(data, tie_break, second_tie_break) # Expect: "banana"
This function calculates the mode of a given numeric vector, and returns the smallest mode value if multiple modes exist.
calc_mode_numeric(x)

| Argument | Description |
|---|---|
| x | A numeric vector. |

Returns a numeric vector representing the mode value. Returns the smallest mode value if multiple modes exist, and NA if the vector is empty or contains non-numeric elements.
calc_mode_numeric(c(1, 2, 2, 3, 4, 4))
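The "smallest mode wins" rule can be sketched in base R as follows. The helper name mode_numeric_sketch is hypothetical, and the code is an illustration of the documented behavior rather than the package's implementation.

```r
# Hypothetical helper: numeric mode, smallest value on ties.
mode_numeric_sketch <- function(x) {
  if (length(x) == 0 || !is.numeric(x)) return(NA_real_)
  values <- sort(unique(x))                # candidate values, ascending
  counts <- tabulate(match(x, values))     # frequency of each candidate
  values[which.max(counts)]                # which.max picks the first, i.e. smallest, tied value
}

mode_numeric_sketch(c(1, 2, 2, 3, 4, 4))  # 2: both 2 and 4 occur twice
```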
This function computes the mean number of unique values minus one for each specified variable within each group specified by the group_var. It is designed to provide insights into the variability of each variable while adjusting for the minimum possible unique count.
dscore(data, group_var, variables)

| Argument | Description |
|---|---|
| data | A dataframe containing the data to be analyzed. |
| group_var | A character string specifying the column name used for grouping the data. |
| variables | A character vector of column names in 'data' for which the mean number of unique values minus one is calculated. |

A tibble with each specified variable showing the mean of (unique values - 1) for each group. The data is grouped by the 'group_var' and returns the results in a wide format, where each variable is prefixed with "dscore_" to indicate the calculation.
df <- data.frame(
  group = c("A", "A", "B", "B", "B"),
  age = c(25, 25, 30, 35, 30),
  gender = c("Male", "Male", "Female", "Female", "Female"),
  income = c(50000, 50000, 60000, 65000, 60000)
)
result <- dscore(df, "group", c("age", "gender", "income"))
print(result)
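The per-group quantity (number of unique values minus one) can be reproduced with base R for a quick check. This is a sketch of the calculation as described, not the package's code; the helper unique_minus_one is hypothetical and the result is a plain data frame rather than a tibble.

```r
df <- data.frame(
  group  = c("A", "A", "B", "B", "B"),
  age    = c(25, 25, 30, 35, 30),
  gender = c("Male", "Male", "Female", "Female", "Female")
)

# Hypothetical helper: distinct values in a group, minus the minimum of one.
unique_minus_one <- function(v) length(unique(v)) - 1

scores <- aggregate(
  df[c("age", "gender")],
  by  = list(group = df$group),
  FUN = unique_minus_one
)
# Mirror the documented "dscore_" prefix on the output columns.
names(scores)[-1] <- paste0("dscore_", names(scores)[-1])
scores
```

Group A reports a single age and a single gender, so both scores are 0. Group B reports two distinct ages, giving a score of 1 for age.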
This function calculates the level of disagreement across event reports for each event and variable. For a given event and variable, it computes 1 minus the proportion of reports that agree with the modal value. A score of 0 indicates full agreement, while higher scores indicate greater disagreement.
event_level_disagreement(data, group_var, variables)

| Argument | Description |
|---|---|
| data | A data frame containing event report level data. |
| group_var | A character string naming the column that uniquely identifies events (e.g., "event_id"). |
| variables | A character vector of column names to check for disagreement. |
The result is a wide-format tibble with one row per event and one column per variable.
A wide-format tibble where each row is an event and each column is a disagreement score for a variable.
df <- data.frame(
  event_id = c(1, 1, 2, 2, 3),
  actor1 = c("Actor A", "Actor B", "Actor B", "Actor B", "Actor C"),
  deaths_best = c(10, 10, 5, 15, 10)
)
event_level_disagreement(
  df,
  group_var = "event_id",
  variables = c("actor1", "deaths_best")
)
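The score itself (1 minus the share of reports that match the modal value) is simple enough to verify by hand in base R. The sketch below is an illustration under the assumption that there are no missing values; the helper disagreement is hypothetical and the output is a matrix, not the tibble the package returns.

```r
df <- data.frame(
  event_id    = c(1, 1, 2, 2, 3),
  actor1      = c("Actor A", "Actor B", "Actor B", "Actor B", "Actor C"),
  deaths_best = c(10, 10, 5, 15, 10)
)

# Hypothetical helper: 1 minus the share of reports agreeing with the mode.
disagreement <- function(x) 1 - max(table(x)) / length(x)

# One row per event, one column per variable.
scores <- sapply(c("actor1", "deaths_best"), function(v) {
  vapply(split(df[[v]], df$event_id), disagreement, numeric(1))
})
scores
```

Event 1 has two reports naming different actors, so actor1 scores 0.5 there, while its two death counts agree and score 0.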
The Modes and Agents of Election-Related Violence in Côte d'Ivoire and Kenya (MAVERICK) is an event report level dataset of electoral violence incidents.
maverick_event_report
A data frame with 3287 rows and 108 columns.
A unique event report identifier.
A unique event identifier assigned by the coders. Needed to aggregate event reports into events.
A character class variable that contains the name of the country in which the event took place.
A character class variable that contains the name of the election to which the event was most closely associated.
A numeric class variable that denotes the number of inclusion criteria that the event report fulfilled.
An integer class variable that denotes whether the reported event was inferred to be election-related because the event report or another event report explicitly identified the event as election-related.
An integer class variable that denotes whether the reported event was inferred to be election-related because at least one of the actors involved had explicit ties to a political party or was referred to by their party affiliation.
An integer class variable that denotes whether the reported event was inferred to be election-related because at least one of the targets was election-related, such as voters at a polling station, political candidates, election observers, security forces deployed to overlook the election, electoral material, or electoral infrastructure.
An integer class variable that denotes whether the reported event was inferred to be election-related because the reported purpose of the event was to influence an electoral process or outcome.
An integer class variable that denotes whether the reported event was inferred to be election-related because the event was part of an episode of electoral violence or occurred as a reaction to an earlier electoral violence event.
An integer class variable that denotes whether the reported event was inferred to be election-related because it occurred at most 6 months prior to or after an election.
A character class variable that contains the earliest possible event date expressed in YYYY-MM-DD format.
A character class variable that contains the latest possible event date expressed in YYYY-MM-DD format.
A character class variable that contains the name of the city or village in which the event took place.
A character class variable that contains a text description of the most precise event location described in the report.
A numeric class variable that contains the latitude for the location indicated in location.
A numeric class variable that contains the longitude for the location indicated in location.
A numeric class variable that denotes how precisely the geo-coordinates are coded, ranging from the country level (1) to the exact street or building (6).
A character class variable that contains the name of the actor involved in the event.
A unique actor identifier assigned by the coders.
A character class variable that records the type of actor.
A character class variable that records the subtype of actor.
A character class variable that records the party affiliation of actor.
A character class variable that records all forms of violence used by the actor.
A numeric class variable that denotes how precisely the actor information is coded.
An integer class variable that denotes whether the actor was the initiator of the violence.
An integer class variable that denotes whether the actor was a perpetrator of the violence.
An integer class variable that denotes whether the actor was an intervener in the violence.
An integer class variable that denotes whether the actor was a passive bystander to the violence.
An integer class variable that denotes whether the actor was also a victim of the violence.
A character class variable that contains the name of the actor involved in the event.
A unique actor identifier assigned by the coders.
A character class variable that records the type of actor.
A character class variable that records the subtype of actor.
A character class variable that records the party affiliation of actor.
A character class variable that records all forms of violence used by the actor.
A numeric class variable that denotes how precisely the actor information is coded.
An integer class variable that denotes whether the actor was the initiator of the violence.
An integer class variable that denotes whether the actor was a perpetrator of the violence.
An integer class variable that denotes whether the actor was an intervener in the violence.
An integer class variable that denotes whether the actor was a passive bystander to the violence.
An integer class variable that denotes whether the actor was also a victim of the violence.
A character class variable that contains the name of the actor involved in the event.
A unique actor identifier assigned by the coders.
A character class variable that records the type of actor.
A character class variable that records the subtype of actor.
A character class variable that records the party affiliation of actor.
A character class variable that records all forms of violence used by the actor.
A numeric class variable that denotes how precisely the actor information is coded.
An integer class variable that denotes whether the actor was the initiator of the violence.
An integer class variable that denotes whether the actor was a perpetrator of the violence.
An integer class variable that denotes whether the actor was an intervener in the violence.
An integer class variable that denotes whether the actor was a passive bystander to the violence.
An integer class variable that denotes whether the actor was also a victim of the violence.
A character class variable that contains the name of the actor involved in the event.
A unique actor identifier assigned by the coders.
A character class variable that records the type of actor.
A character class variable that records the subtype of actor.
A character class variable that records the party affiliation of actor.
A character class variable that records all forms of violence used by the actor.
A numeric class variable that denotes how precisely the actor information is coded.
An integer class variable that denotes whether the actor was the initiator of the violence.
An integer class variable that denotes whether the actor was a perpetrator of the violence.
An integer class variable that denotes whether the actor was an intervener in the violence.
An integer class variable that denotes whether the actor was a passive bystander to the violence.
An integer class variable that denotes whether the actor was also a victim of the violence.
A character class variable that contains the name of the actor involved in the event.
A unique actor identifier assigned by the coders.
A character class variable that records the type of actor.
A character class variable that records the subtype of actor.
A character class variable that records the party affiliation of actor.
A character class variable that records all forms of violence used by the actor.
A numeric class variable that denotes how precisely the actor information is coded.
An integer class variable that denotes whether the actor was the initiator of the violence.
An integer class variable that denotes whether the actor was a perpetrator of the violence.
An integer class variable that denotes whether the actor was an intervener in the violence.
An integer class variable that denotes whether the actor was a passive bystander to the violence.
An integer class variable that denotes whether the actor was also a victim of the violence.
A character class variable that contains the name of the actor involved in the event.
A unique actor identifier assigned by the coders.
A character class variable that records the type of actor.
A character class variable that records the subtype of actor.
A character class variable that records the party affiliation of actor.
A character class variable that records all forms of violence used by the actor.
A numeric class variable that denotes how precisely the actor information is coded.
An integer class variable that denotes whether the actor was the initiator of the violence.
An integer class variable that denotes whether the actor was a perpetrator of the violence.
An integer class variable that denotes whether the actor was an intervener in the violence.
An integer class variable that denotes whether the actor was a passive bystander to the violence.
An integer class variable that denotes whether the actor was also a victim of the violence.
A character class variable that records the context in which the violence took place.
A character class variable that records the primary target of the violence.
An integer class variable that records the best estimated number of deaths.
An integer class variable that records the lowest estimated number of deaths.
An integer class variable that records the highest estimated number of deaths.
An integer class variable that records the best estimated number of injured people.
An integer class variable that records the lowest estimated number of injured people.
An integer class variable that records the highest estimated number of injured people.
An integer class variable that denotes whether the event resulted in displacement.
An integer class variable that denotes whether the event resulted in material destruction.
A character class variable that records the source.
An integer class variable that records the number of sources the event is based on. Only relevant once the dataset is aggregated to the event level.
A character class variable that records the author of the source.
A character class variable that records the type of source.
An integer class variable that denotes how reputable the source is considered.
An integer class variable that denotes whether the report was sampled from Factiva or another secondary source.
A character class variable that records the unit of analysis.
A character class variable that records the chosen aggregation model. Only relevant once the data is aggregated to the event level.
The data set is based on newspaper articles identified through the Factiva news repository, as well as a range of human rights reports, election monitoring reports, and special commission reports.
This function calculates the mean divergence score for one or more variables grouped by an event identifier. The divergence score captures how often values for a given variable differ across event reports describing the same event.
mean_dscore(data, group_var, variables, normalize = FALSE, plot = FALSE)

| Argument | Description |
|---|---|
| data | A data frame containing event report level data. |
| group_var | A character string naming the column that uniquely identifies events (e.g., "event_id"). |
| variables | A character vector of column names to compute divergence scores for. |
| normalize | Logical, indicating whether to normalize the scores by the total number of unique values for each variable. |
| plot | Logical, indicating whether to return a ggplot object visualizing the scores. |
For each variable and event, the function computes the number of unique values reported, subtracts one, and averages these values across all events. This reflects how much inconsistency exists across sources. Optionally, the scores can be normalized by the total number of unique values observed for each variable across the dataset. The result is a long-format dataframe showing which variables are most sensitive to aggregation. A plotting option is also available.
Either a tibble or a ggplot object, depending on the value of plot.
If plot = FALSE, returns a tibble with two columns:
The name of each variable.
The mean divergence score or normalized score.
If plot = TRUE, returns a lollipop-style plot showing divergence scores by variable.
df <- data.frame(
  event_id = c(1, 1, 2, 2, 3),
  country = c("US", "US", "UK", "UK", "CA"),
  actor1 = c("Actor A", "Actor B", "Actor B", "Actor C", "Actor D"),
  deaths_best = c(10, 20, 5, 15, 10)
)
mean_dscore(df, "event_id", c("country", "actor1", "deaths_best"), normalize = TRUE, plot = FALSE)
mean_dscore(df, "event_id", c("country", "actor1", "deaths_best"), normalize = TRUE, plot = TRUE)
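The arithmetic behind the mean divergence score can be traced in base R. The sketch below follows the description (unique values per event, minus one, averaged over events) and assumes that normalization means dividing by the number of unique values observed for the variable in the whole dataset; that reading is an assumption, and the code is not the package's implementation.

```r
df <- data.frame(
  event_id    = c(1, 1, 2, 2, 3),
  country     = c("US", "US", "UK", "UK", "CA"),
  actor1      = c("Actor A", "Actor B", "Actor B", "Actor C", "Actor D"),
  deaths_best = c(10, 20, 5, 15, 10)
)
vars <- c("country", "actor1", "deaths_best")

# Divergence per event: number of distinct reported values minus one.
per_event <- sapply(vars, function(v) {
  vapply(split(df[[v]], df$event_id),
         function(x) length(unique(x)) - 1, numeric(1))
})

# Average over events to get one score per variable.
mean_scores <- colMeans(per_event)

# Assumed normalization: divide by the variable's dataset-wide unique count.
n_unique   <- sapply(vars, function(v) length(unique(df[[v]])))
normalized <- mean_scores / n_unique

mean_scores
normalized
```

In this toy data, country never varies within an event and scores 0, while actor1 and deaths_best each differ in two of the three events and score 2/3 before normalization.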
This function calculates the mean range for one or more numeric variables grouped by an event identifier. It is useful for diagnosing aggregation sensitivity by assessing how much spread exists in numeric values reported across event reports concerning the same event.
mean_range(data, group_var, variables)

| Argument | Description |
|---|---|
| data | A data frame containing event report level data. |
| group_var | A character string naming the column that uniquely identifies events (e.g., "event_id"). |
| variables | A character vector of column names to compute ranges for. All specified variables must be numeric. |
For each variable and event, the function computes the range (i.e., the difference between the maximum and minimum) of values reported across event reports. These values are then averaged across all events to produce a single score per variable. The result is a long-format dataframe that shows which numeric variables exhibit the widest event report level disagreement.
A tibble with two columns:
The name of each variable.
The mean range across events for that variable.
df <- data.frame(
  event_id = c(1, 1, 2, 2, 3),
  deaths_best = c(10, 20, 5, 15, 10)
)
mean_range(
  df,
  group_var = "event_id",
  variables = c("deaths_best")
)
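For a quick sanity check, the same quantity can be computed by hand in base R. This sketch follows the description (maximum minus minimum per event, averaged across events) and is not the package's implementation.

```r
df <- data.frame(
  event_id    = c(1, 1, 2, 2, 3),
  deaths_best = c(10, 20, 5, 15, 10)
)

# Range of reported values within each event.
ranges <- tapply(df$deaths_best, df$event_id, function(x) max(x) - min(x))
ranges

# Average the per-event ranges: (10 + 10 + 0) / 3.
mean_range_value <- mean(ranges)
mean_range_value
```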
This function calculates the mean standard deviation for one or more numeric variables grouped by an event identifier. It is useful for diagnosing aggregation sensitivity by assessing how much variation exists in numeric values reported across event reports concerning the same event.
mean_sd(data, group_var, variables)

| Argument | Description |
|---|---|
| data | A data frame containing event report level data. |
| group_var | A character string naming the column that uniquely identifies events (e.g., "event_id"). |
| variables | A character vector of column names to compute standard deviations for. All specified variables must be numeric. |

For each variable and event, the function computes the standard deviation of values reported across event reports. These values are then averaged across all events to produce a single score per variable. The result is a long-format dataframe that shows which numeric variables exhibit the most event report level disagreement.
A tibble with two columns:
The name of each variable.
The mean standard deviation across events for that variable.
df <- data.frame(
  event_id = c(1, 1, 2, 2, 3),
  country = c("US", "US", "UK", "UK", "CA"),
  actor1 = c("Actor A", "Actor B", "Actor B", "Actor C", "Actor D"),
  deaths_best = c(10, 20, 5, 15, 10)
)
mean_sd(
  df,
  group_var = "event_id",
  variables = c("deaths_best")
)
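The calculation can be traced in base R as follows. One caveat: the standard deviation of a single report is undefined, and how the package treats single-report events is not stated here. The sketch simply drops them with na.rm = TRUE, which is an assumption rather than the package's documented behavior.

```r
df <- data.frame(
  event_id    = c(1, 1, 2, 2, 3),
  deaths_best = c(10, 20, 5, 15, 10)
)

# Standard deviation of reported values within each event.
# Event 3 has a single report, so sd() returns NA for it.
sds <- tapply(df$deaths_best, df$event_id, sd)
sds

# Assumption: events with an undefined standard deviation are skipped.
mean_sd_value <- mean(sds, na.rm = TRUE)
mean_sd_value
```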
This function calculates the modal confidence score for one or more variables grouped by an event identifier. The modal confidence score captures how dominant the most common value is within each event — that is, the proportion of event reports that agree with the modal (most frequent) value for each variable.
modal_confidence(data, group_var, variables)

| Argument | Description |
|---|---|
| data | A data frame containing event report level data. |
| group_var | A character string naming the column that uniquely identifies events (e.g., "event_id"). |
| variables | A character vector of column names to assess modal confidence for. |
For each variable and event, the function computes the share of event reports that match the modal value. These proportions are then averaged across all events to produce a single score per variable. The result is a long-format dataframe that shows which variables tend to exhibit the greatest agreement in reporting.
A tibble with two columns:
The name of each variable.
The average share of reports per event that match the modal value.
df <- data.frame(
  event_id = c(1, 1, 2, 2, 3),
  actor1 = c("A", "A", "B", "C", "D"),
  deaths_best = c(10, 10, 5, 15, 10)
)
modal_confidence(
  df,
  group_var = "event_id",
  variables = c("actor1", "deaths_best")
)
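The score can be checked by hand with a few lines of base R. This is a sketch of the calculation as described (share of reports matching the modal value, averaged across events), assuming no missing values; the helper modal_share is hypothetical and the code is not the package's implementation.

```r
df <- data.frame(
  event_id    = c(1, 1, 2, 2, 3),
  actor1      = c("A", "A", "B", "C", "D"),
  deaths_best = c(10, 10, 5, 15, 10)
)

# Hypothetical helper: share of reports in one event that match the mode.
modal_share <- function(x) max(table(x)) / length(x)

# Average the per-event shares for each variable.
confidence <- sapply(c("actor1", "deaths_best"), function(v) {
  mean(vapply(split(df[[v]], df$event_id), modal_share, numeric(1)))
})
confidence
```

For both variables the per-event shares are 1, 0.5, and 1, so each averages to 5/6.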
This dataset contains 100 event reports from the MAVERICK event report dataset, arranged by 'event_id'. It is used for examples and vignettes in the 'eventreport' package.
small_maverick_event_report
A subset of the MAVERICK data frame with 100 rows and 10 columns:
A unique event report identifier.
A unique event identifier assigned by the coders. Needed to aggregate event reports into events.
A character class variable that contains the name of the country in which the event took place.
A character class variable that contains the earliest possible event date expressed in YYYY-MM-DD format.
A character class variable that contains the name of the city or village in which the event took place.
A character class variable that contains a text description of the most precise event location described in the report.
A character class variable that contains the name of the actor involved in the event.
An integer class variable that records the best estimated number of deaths.
An integer class variable that records the best estimated number of injured people.
A character class variable that records the source.
...
MAVERICK dataset