What is the small number (cell size) issue in public health data analysis and dissemination?
Public health data, when queried or displayed in web-based data tables, can often have cells with a small number of individuals or events, especially when the query is focused on small geographic areas (ZIP codes), rare events, population subgroups, provider groups, payers, or other small samples. There are two primary concerns with queries whose results contain small cell sizes or a small underlying population. First, rates based on small numbers may not be reliable. Small numbers of numerator events and of denominator events can both contribute to poor reliability. Second, small numbers are of great concern when reporting sensitive information that might lead to violation of individuals’ right to anonymity and privacy with respect to attributes that are typically stigmatized. The likelihood of disclosure of personal health information is higher when there are relatively few people with knowable demographic characteristics such as sex, age, and race in a small community.
Are small numbers a bigger concern for web-based data dissemination?
While small cell size is a concern for most public health statistical publications, it is more acutely so in web-based data dissemination systems for several reasons. First, web-based data dissemination systems are particularly desirable for immediate answers to questions about the public’s health, and generally, the users of the systems are interested in data for small geographical areas and other small groups of individuals. Second, the information reaches a much broader audience than a paper publication, and often this includes individuals without statistical or epidemiologic training. Third, web-based systems generally provide less documentation on how to interpret the results than do paper publications which usually provide extensive bibliographies, appendices, footnotes, caveats, and so forth, and web-based systems are often likely to provide less basic information on certain conditions.
How can the disclosure risk be reduced when disseminating data on the Internet?
There are a variety of tools available for web-based data dissemination which can reduce disclosure risk. These tools vary from user-directed education and agreements to tools that alter the access to data on the system.
Data protection agreements: Public health agencies have been using data release agreements for years; these agreements may be quite restrictive. Use of a data protection agreement (DPA) establishes institutional control and is compliant with the HIPAA Privacy Rule provisions for a limited data set. In closed web-based systems, data protection agreements are generally required prior to password assignment and log-in.
Limited data sets: Most web-based data systems reduce the number and type of data elements released. This is one of the methods suggested in the August 14, 2002, Federal Register notice updating 45 CFR Parts 160 and 164 of the HIPAA Privacy Rule for the release of health data by covered entities.
Online query system limits: Web systems use data modification and alteration methods and rely on limited data sets to ensure protection of native files.
User authentication and access validation: It is possible to implement password protection for CD-ROM and public use files and for access to web query systems. Other, less restrictive alternatives include simply requiring registration of the user for each use.
Education and training of public use file users: Some web-based systems are complex enough that appropriate training of users is recommended. This will limit the number of users of the system, unless the training mechanism is also a web-based system.
Pre-constructed tables and Pivot Tables: Some query systems are constructed to produce only those tables that have been pre-designed by the data agency. Others allow the user to implement the pivot functionality.
In addition to the institutional controls, what are some of the data modification techniques to reduce the disclosure risk?
Several statistical data modification techniques can be used to protect public health data. They include: (a) anonymizing/de-identifying data files, (b) cross-tabulations and micro-aggregation, (c) restricting geographic detail, (d) limiting the number of data elements in a micro file, (e) recoding into intervals and rounding, and (f) cell suppression methods.
Why are rates based on small numbers considered questionable in terms of their reliability?
Rates based on small numbers are characterized by a lack of reliability. Reliability generally denotes the ability of an instrument (e.g., a question or an index) or a research technique to yield the same results when applied repeatedly. In this context, reliability refers to the stability of results (numbers, rates, etc.) based on small numbers, that is, how well they represent the true underlying results. Reliability of public health data containing a small number of events is a concern because rates based on such numbers may not be representative of the true (underlying) rates of those events, owing primarily to random variation. Sampling error is the primary source of such variation. Rates based on the entire population for a certain time period (e.g., a year) are also subject to random variation because they, too, are based on a sample (of historical time).
While reliability alone does not ensure precision of any parameter estimate, it is a necessary condition. For instance, an instrument may be reliable (i.e., it may generate the same results on repeated use), yet the estimates may not be precise if the instrument is not valid (i.e., it does not measure what it aims to measure). The following sections outline various approaches to address the reliability problem in micro data in public health.
How can reliability of rates based on small numbers be improved?
Commonly used approaches for improving the reliability of rates based on small numbers include aggregation and smoothing of rates. Aggregation is one of the simplest and most intuitive approaches: attribute categories are combined to yield a large enough cell count. For instance, if age is reported in five-year groups and the number of cases for the event of interest is smaller than the minimum required to be reliable, then the age categories are combined in order to have an adequate number of events to yield more reliable and stable rates. The nature of aggregation, including the requirement for a minimum number of cases, depends upon a host of factors, including the type of test to be used for reliability, the desired level of significance (allowed margin of error due to chance alone), and the type of statistical measure computed, be it a confidence interval or a coefficient of variation. Data smoothing is the other widely used technique for addressing data problems. Various techniques are used for data smoothing, including maximum likelihood, weighted averages, moments methods, and geographic smoothing. Geographic smoothing may rely on Bayesian or empirical Bayesian approaches. Recently, HCUPnet has utilized geographic smoothing in its risk adjustment methodology for reporting hospital indicators, partially to address the small cell size problem. Other systems using geographic smoothing include the Utah Department of Health’s IBIS Query System, the Washington State Department of Health’s EpiQMS: Epidemiologic Query and Mapping System, and Washington’s VistaPHw system of GIS and spatial epidemiology for community health assessment.
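The aggregation approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method prescribed by any agency: the function name aggregate_adjacent, the minimum of 5 events, and all counts are hypothetical, and a real system would choose its threshold from the reliability criteria discussed in the text.

```python
def aggregate_adjacent(groups, min_events):
    """Merge adjacent categories until each merged group has >= min_events events.

    groups: list of (label, events, population) tuples in their natural order.
    Returns a new list of (label, events, population) tuples.
    """
    merged = []
    label, events, pop = None, 0, 0
    for g_label, g_events, g_pop in groups:
        # Extend the group currently being built with the next category.
        label = g_label if label is None else f"{label}+{g_label}"
        events += g_events
        pop += g_pop
        if events >= min_events:
            merged.append((label, events, pop))
            label, events, pop = None, 0, 0
    # Trailing categories that never reached the threshold are folded
    # into the last completed group (or kept alone if there is none).
    if label is not None:
        if merged:
            last_label, last_events, last_pop = merged.pop()
            merged.append((f"{last_label}+{label}", last_events + events, last_pop + pop))
        else:
            merged.append((label, events, pop))
    return merged


# Hypothetical five-year age groups: (label, events, population)
age_groups = [
    ("0-4", 2, 1200),
    ("5-9", 1, 1100),
    ("10-14", 4, 1000),
    ("15-19", 12, 900),
    ("20-24", 3, 950),
]
result = aggregate_adjacent(age_groups, min_events=5)
for label, events, pop in result:
    print(label, events, pop, round(1000 * events / pop, 2))
```

The trade-off is visible in the output: the merged groups have enough events for more stable rates, but detail about the individual age groups is lost.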
What, if any, are the standard rules for suppression of data reported to the public?
The definition of “small” varies across political boundaries, databases, and states. The application often influences how the term “small” is defined, generally to protect confidentiality (as opposed to statistical reliability). Data suppression thresholds for small numbers can be expressed in terms of numerator events, denominator events, or a combination thereof.
The “Numerator Rule” is designed to prevent the release of information by suppressing a cell with fewer than “X” (e.g., 5) events, along with complementary cells.
The “Denominator Rule” is designed to prevent the display of information when the population under consideration is less than a certain size. Over time, a number of state agencies have used 30 cases/events as a minimum population. When the denominator is less than 30, the cell is suppressed.
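As a rough sketch, the numerator and denominator rules can be expressed as a single check applied to each cell. The function name primary_suppress is hypothetical, and the defaults of 5 and 30 are simply the example thresholds mentioned above; the sketch covers only the suppression of the cell itself and does not attempt the suppression of complementary cells.

```python
def primary_suppress(numerator, denominator, min_numerator=5, min_denominator=30):
    """Return the rate per 1,000, or None when the cell should be suppressed.

    Denominator rule: suppress when the population is below min_denominator.
    Numerator rule: suppress when the event count is below min_numerator.
    """
    if denominator < min_denominator:
        return None
    if numerator < min_numerator:
        return None
    return 1000 * numerator / denominator


# Hypothetical cells: (events, population)
cells = [(3, 500), (12, 25), (12, 400)]
for events, population in cells:
    rate = primary_suppress(events, population)
    print(events, population, "suppressed" if rate is None else rate)
```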
The “Numerator and Event Denominator Rule,” also referred to as the Garland Land Rule: only the margins of a table are displayed if any table cell subtracted from the number of total events in the same data file for the same characteristics yields a small number (e.g., less than 10). For example, a cell with one Black female aged 25-34 AIDS death would be published if there were 15 Black female aged 25-34 total deaths, because 15 - 1 = 14 is not less than 10. The assumption is that it may be possible to identify the diagnosis of a person if there are fewer than 10 people with the same demographic characteristics and who had the same event (death, in this case, or perhaps birth or hospitalization). In addition, if less than two row or column totals are less than five, then all the row or column totals are suppressed. This additional rule prohibits one from determining the identity of an individual when the margin totals are small.
“Numerator/Denominator-Based Suppression”: Cells are suppressed based on a combination of the denominator (the population from which the health events arise) and the numerator (the health events). This type of strategy is identified in the Massachusetts Department of Public Health Confidentiality Policy and Procedures document.
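The subtraction check in this rule can be illustrated with a short sketch. The names cell_passes and margins_only are hypothetical, the remainder threshold of 10 is the example value given above, and the additional rule about small row or column totals is deliberately left out because it would need the agency's exact wording.

```python
def cell_passes(cell_events, total_events, min_remainder=10):
    """True when the events left after removing the cell are not a small number."""
    return total_events - cell_events >= min_remainder


def margins_only(cells, totals, min_remainder=10):
    """True when any cell fails the check, so only the table margins would be shown.

    cells:  events of interest per demographic group (e.g., AIDS deaths)
    totals: all events in the same file for the same group (e.g., all deaths)
    """
    return any(
        not cell_passes(count, totals[group], min_remainder)
        for group, count in cells.items()
    )


# The example from the text: 1 AIDS death among 15 total deaths in the group.
aids_deaths = {"Black female 25-34": 1}
all_deaths = {"Black female 25-34": 15}
print(cell_passes(1, 15))                      # 15 - 1 = 14, which is not below 10
print(margins_only(aids_deaths, all_deaths))
```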
Aggregate data with denominator and numerator values greater than those indicated in the table may be considered sufficiently de-identified so as not to constitute confidential information, and may be disclosed.
| Denominator | Numerator | Standard |
| --- | --- | --- |
| 10-29 | 1-4 | Suppress numerator and any other cells6 that would allow for the calculation of any other cells with values of 1-4 |
| 10-29 | 5-29 | Suppress any cells that would allow for the calculation of any other cells6 with values of 1-4 |
| 0-8 | 0-9 | Suppress numerator |
| =N | =D | Suppress numerator unless privacy risk is minimal |
In many public health agencies, suppression standards are based on the specific database, mandates from funding organizations, the history of the data release, and the preferences of specific data stewards. This can cause problems when databases are merged to answer specific questions. For example, if a cancer registry has a “denominator” rule of 1000 and a hospital discharge system has a “numerator” suppression rule of <5 in a cell, both databases could “charge inappropriate release of information” when a merged file is created for web-based data dissemination. The solution is, as stated above, prior approval by a Privacy Officer who can mediate between the two alternative rules.
What is a test of significance? Is it desirable in public health?
Statistical significance testing refers to theory and methods for determining the probability that a result from analysis of data based on a sample (as opposed to complete population count) could be due to chance variation in possible samples, as opposed to capturing the true underlying results. It also refers to statistical tests to determine whether the observed difference between sample statistics (e.g., rates, proportions, means etc.) could occur by chance in the populations from which the samples were selected.
The choice of the level of significance, denoted by α, may vary based on the purpose of the reporting and whose interest is at stake (e.g., do we care more about patient safety, at the risk of closing hospitals when in doubt, or do we want to give the benefit of the doubt to the hospital, at the cost of patients’ complications or deaths?). The level of significance, also known as the probability of a Type I error, is the probability of incorrectly rejecting the null hypothesis, i.e., inferring that results (statistics such as rates, ratios, etc.) for two groups differ when the two groups (e.g., Hospital A and Hospital B) actually do not differ. It is the norm to use a 95% confidence interval (α = 0.05), meaning that the margin of error is set so that 5 out of 100 samples may lead to conclusions based on chance alone. Use of 90%, 85%, or 99% confidence intervals may also be considered. Various tests of significance have certain assumptions associated with them. For instance, many tests require a certain level of measurement (nominal, ordinal, interval, or ratio) or a certain distribution of the variables (e.g., normal or Poisson). One should carefully examine whether the data meet these assumptions before tests of significance are applied.
Some tests also assume that a minimum number of observations is present in the data. For instance, a z test may not be appropriate when the number of observations in the data (or group) is less than 30 (or 20), because the outcome may then not follow a normal distribution. In such situations, tests appropriate for the particular distribution (e.g., Poisson or Student’s t-distribution) should be applied.
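To make the idea concrete, here is a minimal sketch of a two-sided z test for the difference between two proportions, using only the Python standard library. The function name and the hospital counts are hypothetical, and the test is shown only for the large-count situation in which, as noted above, the normal approximation is reasonable.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test for the difference between two proportions.

    x1, x2: event counts; n1, n2: numbers of observations in each group.
    Returns (z, p_value). Assumes counts large enough for the normal
    approximation to hold.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # proportion under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical complication counts: Hospital A 30/1000, Hospital B 50/1000.
alpha = 0.05
z, p_value = two_proportion_z_test(30, 1000, 50, 1000)
print(round(z, 3), round(p_value, 4), p_value < alpha)
```

With these made-up counts the p-value falls below α = 0.05, so the null hypothesis of equal proportions would be rejected at that level; with a stricter α = 0.01 it would not be.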
The use of tests of significance in public health reporting has come under some criticism, but it continues. Some critics suggest alternatives to significance tests, including the use of effect size and statistical power.
What are confidence intervals?
A confidence interval is a range of values (bounded by a lower limit and an upper limit) around a statistic (such as a mean, rate, ratio, or frequency based on a sample), the width of which indicates the degree of certainty that the statistic correctly estimates the true value of the parameter (such as the true underlying mean, rate, ratio, or frequency) for a given margin of error. It can also be used to assess whether values differ due to chance alone or are truly different.43 The most commonly used margin of error or level of significance is 5% or 0.05, which yields a 95% confidence interval. This means that, if the sampling were repeated many times, about 95% of the intervals constructed in this way would contain the true underlying population parameter (rates, etc.). Simply speaking, one is 95% confident that the true parameter (rate, ratio, percentage, frequency, etc.) is contained in this interval estimate, called the confidence interval.
Estimates based on small numbers tend to produce wider confidence intervals because of greater chance error, indicating lower reliability and precision of the estimates.44 The wider the confidence interval and the smaller the sample, the less precise the estimate. Narrow confidence intervals suggest that the estimate is fairly precise, especially with large sample sizes, and that chance plays a smaller role in the outcome of interest.
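The effect of count size on interval width can be illustrated with a small sketch. It assumes the event count follows a Poisson distribution and uses the simple normal approximation (count ± 1.96 × √count); the function name rate_ci_normal and the counts are hypothetical. Both examples have the same rate of 40 per 100,000, but very different numbers of events.

```python
import math


def rate_ci_normal(events, population, z=1.96, per=100_000):
    """Rate and approximate 95% confidence limits per `per` population.

    Treats the event count as Poisson and applies a normal approximation,
    so the standard error of the count is sqrt(events).
    """
    rate = events / population * per
    se = math.sqrt(events) / population * per
    return rate, rate - z * se, rate + z * se


small = rate_ci_normal(4, 10_000)        # 4 events
large = rate_ci_normal(400, 1_000_000)   # 400 events, same underlying rate
for label, (rate, lower, upper) in (("small", small), ("large", large)):
    print(label, round(rate, 1), round(lower, 1), round(upper, 1))
```

Under the Poisson assumption the relative standard error is 1/√events, so it is 50% for 4 events and 5% for 400 events. The interval for the small count is ten times as wide, and its lower limit sits close to zero, which is a sign that the normal approximation itself is strained at such counts.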
How are confidence intervals used in public health data reporting?
Confidence intervals (or confidence limits) are widely used in public health to report the precision of the (point) estimates of the parameters such as rates, ratios and frequencies. When should one use confidence intervals? Confidence intervals could be used whenever there is a need to understand the uncertainty in a point estimate. That need often arises due to small cell sizes. Use of confidence intervals around health statistics can also help reduce the misinterpretation of random variation when cells are small. The State of Washington Department of Health has excellent documentation of the methods by which they produce confidence intervals for their web-based data systems on their website.
The confidence interval can also be used to test the significance of the difference between two statistics: they are significantly different if the two confidence intervals do not overlap, or if the parameter (or stable rate, e.g., the national infant mortality rate) does not fall within the confidence interval of the statistic (e.g., the confidence interval around the infant mortality rate in a small rural county). While confidence intervals offer a good indicator of statistical power, they should generally not be used to draw comparisons across cells, because the certainty of the statistical significance cannot necessarily be interpreted from them.45 Most of the events of interest in public health follow a Poisson distribution, but when the cell size is reasonably large (say above 20 or 30), a normal approximation of the Poisson distribution can be used.
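For small counts, where the normal approximation is not advisable, one commonly used alternative is Byar's approximation to the exact Poisson limits. It is shown here as an illustration and is not necessarily the method used by any system named in this document. The function names, the county figures, and the reference rate are hypothetical.

```python
import math


def byar_poisson_limits(events, z=1.96):
    """Approximate confidence limits for a Poisson count (Byar's approximation)."""
    if events == 0:
        lower = 0.0
    else:
        base = 1 - 1 / (9 * events) - z / (3 * math.sqrt(events))
        lower = max(0.0, events * base ** 3)
    n = events + 1
    upper = n * (1 - 1 / (9 * n) + z / (3 * math.sqrt(n))) ** 3
    return lower, upper


def differs_from_reference(events, population, reference_rate, per=1000):
    """True when a stable reference rate lies outside the interval for the local rate."""
    lower, upper = byar_poisson_limits(events)
    lower_rate = lower / population * per
    upper_rate = upper / population * per
    return not (lower_rate <= reference_rate <= upper_rate)


# Hypothetical rural county: 10 infant deaths among 800 live births,
# compared with a hypothetical stable reference rate of 7.0 per 1,000.
lower, upper = byar_poisson_limits(10)
print(round(lower, 2), round(upper, 2))
print(differs_from_reference(10, 800, reference_rate=7.0))
```

In this made-up example the county's observed rate (12.5 per 1,000) is well above the reference rate, yet the reference rate still falls inside the wide interval, so the difference would not be called statistically significant.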
What does the term “small area analysis” signify? How is small area analysis useful?
The term ‘small area’ is used to imply areas that are large enough to have a sufficient number of events of interest to yield stable rates, yet small enough to unmask variations in the rates and still convey a sense of community.
Public health policy has increasingly emphasized local, or community, health assessment and planning. These efforts are often hampered by a dearth of relevant and meaningful information about the current health status and needs of local populations. Understanding community health status at the small area level can help policy makers improve community public health planning. Several functions of small area analysis render analyses at this level useful at various levels.
Small area analysis has emerged as a useful tool in health services research over the last two or three decades; however, the history of its use is more extensive. It is a useful tool to describe how rates of health care use and events vary over meaningfully defined geographic areas. The tool has been used to investigate variation in the rates of hospitalization for a large array of diseases, including chronic obstructive lung disease, pneumonia, and hypertension, and for surgical procedures such as hysterectomy, cholecystectomy, and tonsillectomy.
When analysing and presenting hospital discharge data, is it standard to age-adjust these data when making comparisons between years and/or places? If so, what population standard is usually used? Is it the 2000 standard U.S. population?
Whenever the purpose is comparison across groups (e.g., years, places, providers), it is not appropriate to make comparisons unless relevant adjustment for patient characteristics (age, gender, severity of illness, etc.) is applied. This is a standard practice in epidemiology, and it is equally applicable to hospital discharge data. The standard population recommended by CDC/NCHS is the U.S. 2000 standard population, and that is what is used.
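Direct age adjustment can be sketched as a weighted average of age-specific rates, with weights taken from the standard population. The sketch below is illustrative only: the function name, the three age bands, the counts, and in particular the weights are hypothetical and are not the actual U.S. 2000 standard population weights, which should be taken from the CDC/NCHS documentation.

```python
def direct_age_adjusted_rate(events, populations, std_weights, per=100_000):
    """Directly age-adjusted rate: sum of (standard weight x age-specific rate).

    events, populations: per-age-group counts for the area being studied.
    std_weights: standard population sizes or proportions for the same groups.
    """
    total_weight = sum(std_weights)
    adjusted = sum(
        (weight / total_weight) * (count / population)
        for count, population, weight in zip(events, populations, std_weights)
    )
    return per * adjusted


# Hypothetical discharges and populations for three age bands.
events = [5, 20, 60]
populations = [10_000, 8_000, 4_000]
# Hypothetical standard weights, NOT the real U.S. 2000 standard.
std_weights = [0.4, 0.4, 0.2]

crude = 100_000 * sum(events) / sum(populations)
adjusted = direct_age_adjusted_rate(events, populations, std_weights)
print(round(crude, 1), round(adjusted, 1))
```

Applying the same standard weights to every year or place being compared removes differences that are due only to differing age structures, which is why the crude and adjusted rates in the example are not equal.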