These were later renamed to common cause and special cause variation. Common cause variation is present in any process, is caused by phenomena that are always present within the system, makes the process predictable within limits, and is also called random variation or noise. Special cause variation is present in some processes, is caused by phenomena that are not normally present in the system, makes the process unpredictable, and is also called non-random variation or signal.
Figure 2, an I chart, is an example of special cause variation: one data point falls outside the control limits. The presence of special cause variation makes the process unpredictable. It is important to note that neither common nor special cause variation is in itself good or bad.
A stable process may function at an unsatisfactory level, and an unstable process may be moving in the right direction. But the end goal of improvement is always a stable process functioning at a satisfactory level. The standard deviation is the estimated standard deviation of the common cause variation in the process of interest, which depends on the theoretical distribution of data. Since the calculations of control limits depend on the type of data, many types of control charts have been developed for specific purposes.
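As a concrete illustration of these ideas, here is a minimal sketch of an individuals (I) chart, written in Python rather than R and not taken from the qicharts source: it estimates the common cause standard deviation from the average moving range (the standard Shewhart approach for individual measurements) and flags points beyond the 3-sigma control limits as special cause signals. The data are invented.

```python
def i_chart(values):
    """Compute I chart centre line, 3-sigma control limits, and signals."""
    mean = sum(values) / len(values)
    # estimate sigma from the average moving range (d2 = 1.128 for n = 2)
    mr_bar = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    sigma = mr_bar / 1.128
    lcl, ucl = mean - 3 * sigma, mean + 3 * sigma
    # points outside the control limits signal special cause variation
    signals = [v for v in values if v < lcl or v > ucl]
    return mean, lcl, ucl, signals

# invented weekly measurements with one aberrant value
mean, lcl, ucl, signals = i_chart([10, 12, 11, 13, 12, 11, 25, 12, 11, 13])
```

With this data the single extreme value (25) lies above the upper control limit and is flagged as a special cause signal; the remaining points fall within the limits and represent common cause variation.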
The qicharts package employs a handful of the classic Shewhart charts for measure and count data plus a couple of rare events charts. Together these charts cover the majority of control chart needs of healthcare quality improvement and control.
The formulas for calculation of control limits can be found in Montgomery and Provost.

C chart for count of defects

To demonstrate the use of C, U, and P charts for count data, we will create a data frame mimicking the weekly number of hospital acquired pressure ulcers at a hospital whose patients have an average length of stay of four days. There is a subtle but important distinction between counting defects, e.g. the number of pressure ulcers, and counting defectives, e.g. the number of patients with one or more pressure ulcers. Defects are expected to reflect the Poisson distribution, while defectives reflect the binomial distribution.
The correct control chart for the number of pressure ulcers is the C chart, which is based on the Poisson distribution. Figure 3, a C chart displaying the number of defects, shows that the average weekly number of hospital acquired pressure ulcers is 66 and that anything between 41 and 90 would be within the expected range.
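The arithmetic behind those limits is easy to check: for Poisson-distributed counts the standard deviation equals the square root of the mean, so the 3-sigma limits are the centre line plus or minus three times its square root. A sketch of the calculation (not qicharts code):

```python
import math

cl = 66                                # centre line: mean weekly count from the text
sigma = math.sqrt(cl)                  # Poisson: variance equals the mean
lcl, ucl = cl - 3 * sigma, cl + 3 * sigma
# lcl is about 41.6 and ucl about 90.4, matching the reported 41-to-90 range
```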
If someone is weighed 10 times in succession on the same scale, you may observe slight differences in the number returned to you. If the scale is accurate and the only error is random, the average error over many trials will be 0, and the average observed weight will equal the person's true weight.
You can strive to reduce the amount of random error by using more accurate instruments, training your technicians to use them correctly, and so on, but you cannot expect to eliminate random error entirely. Two other conditions are assumed to apply to random error: it is unrelated to the true score, and the error on one measurement is unrelated to the error on any other. The first condition means that the value of the error component of any measurement is not related to the value of the true score for that measurement. The second condition means that the error component of each score is independent and unrelated to the error component for any other score.
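A quick simulation (our own illustration; the true weight and error size are invented, not from the text) shows why purely random error washes out on average:

```python
import random

random.seed(1)                      # fixed seed so the result is reproducible
true_weight = 150.0                 # hypothetical true weight in pounds
# each reading is the true weight plus zero-mean random error (sd = 2 pounds)
readings = [true_weight + random.gauss(0, 2) for _ in range(10_000)]
average = sum(readings) / len(readings)
# the average reading converges on the true weight as trials accumulate
```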
For instance, in a series of measurements, the size of the error component should not increase over time such that later measurements have larger errors, or errors in a consistent direction, relative to earlier measurements. In contrast, systematic error has an observable pattern, is not due to chance, and often has a cause or causes that can be identified and remedied.
For instance, a scale might be incorrectly calibrated to show a result that is 5 pounds over the true weight, so the average of multiple measurements of a person would be 5 pounds higher than that person's true weight. Systematic error can also be due to human factors. If a pattern is detected with systematic error, for instance, measurements drifting higher over time (so the error components are random at the beginning of the experiment but later on are consistently high), this is useful information because we can intervene and recalibrate the scale.
A great deal of effort has been expended to identify sources of systematic error and devise methods to identify and eliminate them.

Reliability and Validity

There are many ways to assign numbers or categories to data, and not all are equally useful.
Two standards we commonly use to evaluate methods of measurement (for instance, a survey or a test) are reliability and validity. Ideally, we would like every method we use to be both reliable and valid.
In reality, these qualities are not absolutes but are matters of degree and often specific to circumstance. For instance, a survey that is highly reliable when used with one demographic group might be unreliable when used with a different group.
For this reason, rather than discussing reliability and validity as absolutes, it is often more useful to evaluate how valid and reliable a method of measurement is for a particular purpose and whether particular levels of reliability and validity are acceptable in a specific context.

Reliability

Reliability refers to how consistent or repeatable measurements are.
For instance, if we give the same person the same test on two occasions, will the scores be similar on both occasions? If we train three people to use a rating scale designed to measure the quality of social interaction among individuals, then show each of them the same film of a group of people interacting and ask them to evaluate the social interaction exhibited, will their ratings be similar?
If we have a technician weigh the same part 10 times using the same instrument, will the measurements be similar each time? In each case, if the answer is yes, we can say the test, scale, or rater is reliable. Much of the theory of reliability was developed in the field of educational psychology, and for this reason, measures of reliability are often described in terms of evaluating the reliability of tests.
However, considerations of reliability are not limited to educational testing; the same concepts apply to many other types of measurements, including polling, surveys, and behavioral ratings.
The discussion in this chapter will remain at a basic level. There are three primary approaches to measuring reliability, each useful in particular contexts and each having particular advantages and disadvantages: multiple-occasions reliability, multiple-forms reliability, and internal consistency reliability. Multiple-occasions reliability, sometimes called test-retest reliability, refers to how similarly a test or scale performs over repeated administration.
For this reason, it is sometimes referred to as an index of temporal stability, meaning stability over time. For instance, you might have the same person do two psychological assessments of a patient based on a videotaped interview, with the assessments performed two weeks apart, and compare the results. For this type of reliability to make sense, you must assume that the quantity being measured has not changed, hence the use of the same videotaped interview rather than separate live interviews with a patient whose psychological state might have changed over the two-week period.
A common technique for assessing multiple-occasions reliability is to compute the correlation coefficient between the scores from each occasion of testing; this is called the coefficient of stability. Multiple-forms reliability (also called parallel-forms reliability) refers to how similarly different versions of a test or questionnaire perform in measuring the same entity.
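For example, the coefficient of stability is just the Pearson correlation between the two sets of scores. A minimal sketch in Python (the scores below are invented for illustration):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

occasion1 = [80, 75, 90, 60, 85]   # hypothetical first administration
occasion2 = [82, 74, 92, 63, 83]   # same people tested two weeks later
stability = pearson(occasion1, occasion2)   # close to 1: scores are highly stable
```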
A common type of multiple-forms reliability is split-half reliability, in which a pool of items believed to be homogeneous is created, then half the items are allocated to form A and half to form B. If the two or more forms of the test are administered to the same people on the same occasion, the correlation between the scores received on each form is an estimate of multiple-forms reliability. This correlation is sometimes called the coefficient of equivalence. Multiple-forms reliability is particularly important for standardized tests that exist in multiple versions.
For instance, different forms of the SAT (Scholastic Aptitude Test), used to measure academic ability among students applying to American colleges and universities, are calibrated so the scores achieved are equivalent no matter which form a particular student takes. Internal consistency reliability refers to how well the items that make up an instrument (for instance, a test or survey) reflect the same construct. To put it another way, internal consistency reliability measures how much the items on an instrument are measuring the same thing.
Unlike multiple-forms and multiple-occasions reliability, internal consistency reliability can be assessed by administering a single instrument on a single occasion. However, all these techniques depend primarily on the inter-item correlation, that is, the correlation of each item on a scale or a test with each other item. If such correlations are high, that is interpreted as evidence that the items are measuring the same thing, and the various statistics used to measure internal consistency reliability will all be high.
If the inter-item correlations are low or inconsistent, the internal consistency reliability statistics will be lower, and this is interpreted as evidence that the items are not measuring the same thing. Two simple measures of internal consistency, the average inter-item correlation and the average item-total correlation, are most useful for tests made up of multiple items covering the same topic, of similar difficulty, and that will be scored as a composite. To calculate the average inter-item correlation, you find the correlation between each pair of items and take the average of all these correlations.
To calculate the average item-total correlation, you create a total score by adding up scores on each individual item on the scale and then compute the correlation of each item with the total. The average item-total correlation is the average of those individual item-total correlations. Split-half reliability, described previously, is another method of determining internal consistency.
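Both averages can be sketched in a few lines of Python; the four items and five respondents below are invented for illustration:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

# rows are items; columns are five respondents' scores on each item
items = [[4, 3, 5, 2, 4],
         [5, 3, 4, 2, 5],
         [4, 2, 5, 3, 4],
         [3, 3, 4, 2, 5]]

# average inter-item correlation: mean correlation over every pair of items
pair_rs = [pearson(a, b) for a, b in combinations(items, 2)]
avg_inter_item = sum(pair_rs) / len(pair_rs)

# average item-total correlation: correlate each item with the summed total score
totals = [sum(scores) for scores in zip(*items)]
avg_item_total = sum(pearson(item, totals) for item in items) / len(items)
```

High values on both averages would be read as evidence that the items measure the same construct; low or inconsistent values would suggest they do not.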
This method has the disadvantage that, if the items are not truly homogeneous, different splits will create forms of disparate difficulty, and the reliability coefficient will be different for each pair of forms.

Validity

Validity refers to how well a test or rating scale measures what it is supposed to measure. Some researchers describe validation as the process of gathering evidence to support the types of inferences intended to be drawn from the measurements in question.
Researchers disagree about how many types of validity there are, and scholarly consensus has varied over the years as different types of validity are subsumed under a single heading one year and then separated and treated as distinct the next. To keep things simple, this book will adhere to a commonly accepted categorization that recognizes four types of validity; face validity, which is closely related to content validity, will also be discussed. Content validity refers to how well the process of measurement reflects the important content of the domain of interest and is of particular concern when the purpose of the measurement is to draw inferences about a larger domain of interest.
For instance, potential employees seeking jobs as computer programmers might be asked to complete an examination that requires them to write or interpret programs in the languages they would use on the job if hired. Due to time restrictions, only limited content and programming competencies may be included on such an examination, relative to what might actually be required for a professional programming job.
If this is the case, we may say the examination has content validity. A closely related concept to content validity is known as face validity. A measure with good face validity appears, to a member of the general public or a typical person who may be evaluated by the measure, to be a fair assessment of the qualities under study. For instance, if a high school geometry test is judged by parents of the students taking the test to be a fair test of geometry, the test has good face validity.
In addition, if students are told they are taking a geometry test but it appears to them to be something else entirely, they might not be motivated to cooperate and put forth their best efforts, so their answers might not be a true reflection of their abilities. Concurrent validity refers to how well inferences drawn from a measurement can be used to predict some other behavior or performance that is measured at approximately the same time.
For instance, if an achievement test score is highly related to contemporaneous school performance or to scores on similar tests, it has high concurrent validity. Predictive validity is similar but concerns the ability to draw inferences about some event in the future.
To continue with the previous example, if the score on an achievement test is highly related to school performance the following year or to success on a job undertaken in the future, it has high predictive validity.
Triangulation

Because every system of measurement has its flaws, researchers often use several approaches to measure the same thing. For instance, the measurements used by a university to evaluate applicants can include scores on standardized exams such as the SAT, high school grades, a personal statement or essay, and recommendations from teachers. This process of combining information from multiple sources to arrive at a true or at least more accurate value is called triangulation, a loose analogy to the process in geometry of determining the location of a point in terms of its relationship to two other known points.
The key idea behind triangulation is that, although a single measurement of a concept might contain too much error of either known or unknown types to be either reliable or valid by itself, by combining information from several types of measurements, at least some of whose characteristics are already known, we can arrive at an acceptable measurement of the unknown quantity.
We expect that each measurement contains error, but we hope it does not include the same type of error, so that through multiple types of measurement, we can get a reasonable estimate of the quantity or quality of interest.
Establishing a method for triangulation is not a simple matter. One such method is the multitrait-multimethod matrix (MTMM) developed by Campbell and Fiske. Their particular concern was to separate the part of a measurement due to the quality of interest from that part due to the method of measurement used.
Although their specific methodology is used less today and full discussion of the MTMM technique is beyond the scope of a beginning text, the concept remains useful as an example of one way to think about measurement error and validity.
1. Basic Concepts of Measurement - Statistics in a Nutshell, 2nd Edition [Book]
The MTMM is a matrix of correlations among measures of several concepts (the traits), each measured in several ways (the methods). Ideally, the same several methods will be used for each trait. Within this matrix, we expect different measures of the same trait to be highly related; for instance, scores of intelligence measured by several methods, such as a pencil-and-paper test, practical problem solving, and a structured interview, should all be highly correlated.
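A toy version of this matrix logic can be sketched as follows; the traits, methods, and scores are all invented, and the two correlations computed correspond to the two expectations of the MTMM (same trait measured by different methods should correlate highly; different traits measured by the same method should not):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

# hypothetical scores for six people: two traits, each measured two ways
measures = {
    ("intelligence", "paper"):     [1.1, 2.0, 2.9, 4.2, 4.8, 6.1],
    ("intelligence", "interview"): [0.9, 2.2, 3.1, 3.8, 5.2, 5.9],
    ("sociability", "paper"):      [5.9, 1.2, 4.1, 1.8, 5.2, 2.9],
    ("sociability", "interview"):  [6.2, 0.8, 3.9, 2.1, 4.8, 3.1],
}

# same trait, different methods: expected to be highly correlated
same_trait = pearson(measures[("intelligence", "paper")],
                     measures[("intelligence", "interview")])

# same method, different traits: expected to be only weakly correlated
same_method = pearson(measures[("intelligence", "paper")],
                      measures[("sociability", "paper")])
```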
By the same logic, scores reflecting different constructs that are measured in the same way should not be highly related; for instance, scores on intelligence, deportment, and sociability as measured by pencil-and-paper questionnaires should not be highly correlated.

Measurement Bias

Consideration of measurement bias is important in almost every field, but it is a particular concern in the human sciences. Many specific types of bias have been identified and defined.
Most research design textbooks treat measurement bias in great detail and can be consulted for further discussion of this topic. The most important point is that the researcher must always be alert to the possibility of bias because failure to consider and deal with issues related to bias can invalidate the results of an otherwise exemplary study. Bias can enter studies in two primary ways: during the selection and retention of the subjects studied, or in the way information is collected about the subjects. In either case, the defining feature of bias is that it is a source of systematic rather than random error.
The result of bias is that the data analyzed in a study is incorrect in a systematic fashion, which can lead to false conclusions despite the application of correct statistical procedures and techniques. The next two sections discuss some of the more common types of bias, organized into two major categories: bias in sample selection and retention, and information bias.

Bias in Sample Selection and Retention

Most studies take place on samples of subjects, whether patients with leukemia or widgets produced by a factory, because it would be prohibitively expensive if not entirely impossible to study the entire population of interest.
The sample needs to be a good representation of the study population (the population to which the results are meant to apply) for the researcher to be comfortable using the results from the sample to describe the population.
If the sample is biased, meaning it is not representative of the study population, conclusions drawn from the study sample might not apply to the study population.
Selection bias exists if some potential subjects are more likely than others to be selected for the study sample. This term is usually reserved for bias that occurs due to the process of sampling. For instance, telephone surveys conducted using numbers from published directories by design remove from the pool of potential respondents people with unpublished numbers or those who have changed phone numbers since the directory was published. Random-digit-dialing (RDD) techniques overcome these problems but still fail to include people living in households without telephones or who have only a cell (mobile) phone.
This is a problem for a research study because if the people excluded differ systematically on a characteristic of interest (and this is a very common occurrence), the results of the survey will be biased. For instance, people living in households with no telephone service tend to be poorer than those who have a telephone, and people who have only a cell phone (i.e., no landline) tend to be younger than those who have a landline.
If poverty or youth are related to the subject being studied, excluding these individuals from the sample will introduce bias into the study. Volunteer bias refers to the fact that people who volunteer to be in studies are usually not representative of the population as a whole. For this reason, results from entirely volunteer samples, such as the phone-in polls featured on some television programs, are not useful for scientific purposes unless, of course, the population of interest is people who volunteer to participate in such polls.
Multiple layers of nonrandom selection might be at work in this example. For instance, to respond, the person needs to be watching the television program in question. This means she is probably at home; hence, responses to polls conducted during the normal workday might draw an audience largely of retired people, housewives, and the unemployed. To respond, a person also needs to have ready access to a telephone and to have whatever personality traits would influence him to pick up the telephone and call a number he sees on the television screen.
The problems with telephone polls have already been discussed, and the probability that personality traits are related to other qualities being studied is too high to ignore. Nonresponse bias refers to the other side of volunteer bias. Just as people who volunteer to take part in a study are likely to differ systematically from those who do not, so people who decline to participate in a study when invited to do so very likely differ from those who consent to participate. You probably know people who refuse to participate in any type of telephone survey.
Do they seem to be a random selection from the general population? One survey of health found not only different response rates for Canadians versus Americans but also nonresponse bias for nearly all major health status and health care access measures. Informative censoring can create bias in any longitudinal study (a study in which subjects are followed over a period of time). Suppose we are comparing two medical treatments for a chronic disease by conducting a clinical trial in which subjects are randomly assigned to one of several treatment groups and followed for five years to see how their disease progresses.
Thanks to our use of a randomized design, we begin with a perfectly balanced pool of subjects. However, over time, subjects for whom the assigned treatment is not proving effective will be more likely to drop out of the study, possibly to seek treatment elsewhere, leading to bias. If the final sample of subjects we analyze consists only of those who remain in the trial until its conclusion, and if those who drop out of the study are not a random selection of those who began it, the sample we analyze will no longer be the nicely randomized sample we began with.
Instead, if dropping out was related to treatment ineffectiveness, the final subject pool will be biased in favor of those who responded effectively to their assigned treatment.

Information Bias

Even if the perfect sample is selected and retained, bias can enter a study through the methods used to collect and record data. This type of bias is often called information bias because it affects the validity of the information upon which the study is based, which can in turn invalidate the results of the study.
When data is collected using in-person or telephone interviews, a social relationship exists between the interviewer and the subject for the course of the interview. This relationship can adversely affect the quality of the data collected. When bias is introduced into the data collected because of the attitudes or behavior of the interviewer, this is known as interviewer bias.
This type of bias might be created unintentionally when the interviewer knows the purpose of the study or the status of the individuals being interviewed. For instance, interviewers might ask more probing questions to encourage the subject to recall chemical exposures if they know the subject is suffering from a rare type of cancer related to chemical exposure.
Interviewer bias might also be created if the interviewer displays personal attitudes or opinions that signal to the subject that she disapproves of the behaviors being studied, such as promiscuity or drug use, making the subject less likely to report those behaviors.
Recall bias refers to the fact that people with a life experience such as suffering from a serious disease or injury are more likely to remember events that they believe are related to that experience. For instance, women who suffered a miscarriage are likely to have spent a great deal of time probing their memories for exposures or incidents that they believe could have caused the miscarriage.
Women who had a normal birth may have had similar exposures but have not given them as much thought and thus will not recall them when asked on a survey. Detection bias refers to the fact that certain characteristics may be more likely to be detected or reported in some people than in others. For instance, athletes in some sports are subject to regular testing for performance-enhancing drugs, and test results are publicly reported. World-class swimmers are regularly tested for anabolic steroids, for instance, and positive tests are officially recorded and often released to the news media as well.
Athletes competing at a lower level or in other sports may be using the same drugs but because they are not tested as regularly, or because the test results are not publicly reported, there is no record of their drug use. It would be incorrect to assume, for instance, that because reported anabolic steroid use is higher in swimming than in baseball, the actual rate of steroid use is higher in swimming than in baseball.
The observed difference in steroid use could be due to more aggressive testing on the part of swimming officials and more public disclosure of the test results. Social desirability bias is caused by people's desire to present themselves in a favorable light to others. This often motivates them to give responses that they believe will please the person asking the question.
Note that this type of bias can operate even if the questioner is not actually present, for instance when subjects complete a pencil-and-paper survey. Social desirability bias is a particular problem in surveys that ask about behaviors or attitudes that are subject to societal disapproval, such as criminal behavior, or that are considered embarrassing, such as incontinence.
Problem

What potential types of bias should you be aware of in each of the following scenarios, and what is the likely effect on the results? A program intended to improve scholastic achievement in high school students reports success because the 40 students who completed the year-long program (out of a larger group who began it) all showed significant improvement in their grades and scores on standardized tests of achievement.
A manager is concerned about the health of his employees, so he institutes a series of lunchtime lectures on topics such as healthy eating, the importance of exercise, and the deleterious health effects of smoking and drinking. He conducts an anonymous survey using a paper-and-pencil questionnaire of employees before and after the lecture series and finds that the series has been effective in increasing healthy behaviors and decreasing unhealthy behaviors.
Solution

Selection bias and nonresponse bias, both of which affect the quality of the sample analyzed. The reported average annual salary is probably an overestimate of the true value because subscribers to the alumni magazine were probably among the more successful graduates, and people who felt embarrassed about their low salary were less likely to respond.