Proceedings of the 2010 Industrial Engineering Research Conference
A. Johnson and J. Miller eds.
Consideration of Reliability and Validity Concerns within
Assessments of Small and Medium Size Manufacturers
Clayton T. Walden
Center for Advanced Vehicular Systems Extension
Mississippi State University, Canton, Mississippi 39046, USA
Allen G. Greenwood
Department of Industrial and Systems Engineering
Mississippi State University, Mississippi State, Mississippi 39762, USA
Abstract
Practitioners have relied for years on enterprise-wide manufacturing assessments (e.g., the Shingo Prize and the
Malcolm Baldrige National Quality Award) to drive enterprise improvement. While these and other assessment
approaches have been enormously helpful to practitioners, their efficacy is highly dependent on the experience,
expertise, and skill of the assessor. It is therefore important, from a research perspective, to define concerns
about reliability and validity and to incorporate them into the assessment process. This paper defines reliability
and validity within the context of the Taxonomy Based Assessment Methodology (TBAM), an emerging assessment
methodology that focuses on the needs of small and medium size
manufacturing enterprises (SMEs). A methodology is developed which yields a preliminary set of reliability and
validity measures, based on an examination process that utilizes a Case Study-Review Panel approach. Results from
this approach are presented and initial inferences are made regarding the reliability and validity measures based on
TBAM case studies.
Keywords
Assessment, validity, reliability, manufacturing
1. Introduction
The National Research Council has found that one of the major barriers impacting small and medium size
manufacturing enterprises (SME) is the “lack of access to high quality, unbiased advice and assistance.”
Interestingly, the last several years have seen increased use of assessment methodologies within
manufacturing enterprises. Many firms have successfully used assessment tools such as the Malcolm Baldrige
National Quality Award (MBNQA) and the Shingo Prize to drive enterprise-wide improvement. However, few
publications treat assessment methodologies as a subject of research interest. Thus, concerns regarding
validity and reliability within the domain of manufacturing assessments should be addressed. This paper develops
these concerns within the context of the Taxonomy Based Assessment Methodology (TBAM): an emerging
assessment methodology that focuses on the needs of SMEs [1].
This research argues that, since skilled assessors play a critical role in the assessment process, research into
assessment methodologies should be concerned with issues of reliability and validity. In research involving human
subjects, the problems of reliability and validity are of primary importance to research design. Generally, reliability
deals with the degree of consistency in measurements produced by multiple observers [2]. In other words, reliability
is not so much concerned with whether or not the "right thing" is being measured; its primary interest lies in the
repeatability of the measurements. Validity, on the other hand, is concerned with the extent to which measurements
reflect the phenomena of interest. Thus, validity is concerned with whether or not we are measuring the right thing
[2]. These concerns are inherent in any research design; the best researchers can do is to acknowledge them
and attempt to mitigate their impact.
The practice of assessments generally falls into one of two categories: evaluation-driven and prescription-driven
approaches. In order to develop a more holistic assessment, the TBAM methodology explicitly introduces
diagnosis to logically link the evaluation and prescription aspects. Therefore, TBAM contains a three-stage
approach defined as follows.
Evaluation - The identification of where a firm and its practices fit within an externally defined standard
or taxonomy.
Diagnosis - The determination of root cause(s) through mapping of cause-and-effect relationships so that
key barriers to increased performance are identified.
Prescription - The identification of specific recommendations which, if implemented, target improved
enterprise performance.
The objective of TBAM is to enable a qualified assessor to rapidly conduct a thorough assessment of an SME that
targets the explicit development of recommendations. The entire assessment cycle is intended to be completed in a
timely manner, generally targeted at one week. Supporting this three-stage Evaluation-Diagnosis-Prescription
rubric are two taxonomies: the Manufacturing Enterprise Taxonomy (MET) and the Production System Taxonomy (PST).
The MET consists of 55 elements across 10 major attributes, which were developed from a literature summary of
manufacturing performance publications. The PST, in turn, provides a classification structure for defining 91
recognized best practices [3].
The Evaluation Phase consists of an on-site visit and relies upon the MET based survey instrument, which is used to
quickly characterize the SME’s condition, including a prioritization of Undesirable Effects (UDEs). The Diagnosis
Phase’s objective is to identify the root cause(s) which limit the SME’s performance. This stage relies upon
information obtained during the evaluation stage to logically construct cause-effect relationships presented in the
form of a Current Reality Tree. The Prescription Phase uses the PST as a guide for developing recommendations
which target the elimination of root cause(s) for the purpose of improving the SME’s performance. The overall
TBAM assessment framework is described in Table 1.
Table 1: Framework for TBAM
Evaluation - Objective(s): characterization of the firm and its competitive environment and identification of
Undesirable Effects (UDEs). Tool: Manufacturing Enterprise Taxonomy (MET) survey.
Diagnosis - Objective(s): capture cause-and-effect relationships that explain UDEs so that root cause(s) limiting
increased performance are illustrated. Tool: Goldratt's Current Reality Tree (CRT).
Prescription - Objective(s): determine a set of recommendations which target root causes. Tool: Production System
Taxonomy (PST).
2. Reliability and Validity
2.1 Defined within Manufacturing Assessments
In order to address concerns of Reliability (R) and Validity (V), these issues must be defined within the domain of
manufacturing assessments. The following definitions are offered.
Validity refers to the efficacy of the assessment methodology in terms of developing recommendations
which result in improving the performance of the SME.
Reliability is concerned with the level of repeatability in terms of the type of prescriptions resulting from
the TBAM approach using qualified assessors.
In this context, these concerns present daunting challenges to measure in a timely and resource-feasible
manner. Perhaps the best measure of validity would be to compare the performance of the enterprise before and
after the implementation of recommendations. However, this longitudinal approach requires sufficient time, perhaps
several months, to implement recommendations and observe their impact. This approach also requires a sufficiently
large number of participating SMEs so that impacts can be evaluated against the many other factors that occur
over time. Some of these extraneous sources of variability include changes in the overall economy,
unexpected turnover of key employees, sudden shifts in business volume, and changes in the market. Finally, this
type of study may require a large number of qualified assessors and a relatively long time period just to complete the
number of needed assessments.
Similarly, determining the best measures of reliability faces many practical difficulties. Typically, reliability is
measured by raters or assessors making independent judgments of the same phenomenon, which is termed the
inter-rater reliability problem [2]. To achieve this within a manufacturing assessment, one might envision
teams of multiple assessors descending upon an SME and conducting parallel assessments using the same methodology.
This research argues that such an approach would cause significant disruption to the SME.
The ultimate purpose of the assessment is to provide guidelines and structure that enable qualified assessors to
develop effective recommendations. The literature and experience indicate that multiple approaches (i.e., a variety
of potential recommendations) are effective in terms of improving manufacturing performance [4]. Once a basic
level of reliability is reached, achieving higher levels of validity is more important than increasing reliability.
2.2 Research Design
This research uses a Case Study-Review Panel technique for estimating R and V. This required the implementation
of a TBAM assessment at participating SMEs and the subsequent documentation of a case study describing the
assessment field experience. These case studies were presented to a Review Panel (RP). The RP was comprised of
members with extensive experience in leading and driving performance improvements within SMEs.
Review Panel feedback on each case study was collected in two rounds. The first round involved panel
review of the case study documentation resulting from the Evaluation and Diagnosis stages. The second round
consisted of the presentation of results from the Prescription stage. RP members were asked to provide feedback,
first individually and then collectively. Specifically, each RP member was asked to multi-vote the degree of
relationship between all elements of the PST and the root causes derived by the Assessment Team during the
Diagnosis stage. Next, each panel member selected the same number of elements from the PST as the field
Assessment Team, first individually and then as a group. These results were compared to the selections made by
the field Assessment Team. The last step included the RP's scoring of the actual recommendations from the field
assessment. This process is illustrated in Figure 1.
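The pair-wise matching at the heart of this design can be sketched in a few lines of Python. The appraiser names and PST codes below are illustrative placeholders, not data from an actual case:

```python
from itertools import combinations

# Illustrative (not actual case) PST selections by three appraisers
selections = {
    "PRM-1": {"2.B-3", "4.B-1", "1.B-3"},
    "PRM-2": {"2.B-3", "1.B-3", "3.C-3"},
    "Field": {"2.B-3", "4.B-1", "3.C-3"},
}

# One "match" per PST element shared by a pair of appraisers
pairwise = {(a, b): len(selections[a] & selections[b])
            for a, b in combinations(selections, 2)}
total_matches = sum(pairwise.values())  # the quantity later called R1
```

With these toy selections, each pair shares two elements, giving six total matches.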
Figure 1: Overview of Research Design. The Assessment Team's case studies (Case Studies 1-3) and their
recommendations are reviewed by a Panel Review Board, which includes a senior management representative.
Reliability is measured through the rating and selection of PST elements; validity is measured through the
client's receptiveness to the recommendations and through agreement with the Panel Review.
2.3 Reliability and Validity Measures
The following discussion presents the manner in which R and V are measured, using a specific Case Study-Review
Panel interaction; the SME is referred to as Case Study Beta. Reliability (R1) measures the level of agreement, or
consistency, between appraisers. In this case, appraisers refers to both the RP members and the Field Assessment
Team. Specifically, R1 is the total number of matches obtained when each RP member's PST selections are compared
pair-wise. The PST selections made by the Assessment Team in the field are also compared to the PST selections of
all RP members. The overall pair-wise matches among all appraisers, panel members and the field, yield the measure
R1. The number of matches for the two-appraiser case (i.e., the panel's consensus selections versus the field
Assessment Team's selections) provides a measure of validity (V1). The client's rating of each specific
recommendation provides a second measure of validity (V2), and the rating of the panel review members regarding
each specific recommendation provides a third (V3). These measures are summarized in Table 2.
Table 2: Measures of Reliability and Validity
Reliability: R1 - pair-wise PST selection matches among all appraisers
Validity: V1 - matches between the panel's consensus PST selections and the field team's selections
V2 - client rating of each recommendation
V3 - Review Panel rating of each recommendation
3. Analysis of Reliability and Validity Data
3.1 General Approach
The basic problem is one of multiple appraisers evaluating an object of interest (i.e., the TBAM case study) and
making a selection from a larger set of possible prescriptions. The random variable (X) is the number of selection
matches based on pair-wise comparisons of appraisers. The parameters of the problem are: the number of appraisers
(A), the number of selections (S) allowed for each appraiser, and the set of possible prescriptions (N). Generally the
inter-rater reliability problem is to determine the level of consistency between “m” raters evaluating “n” objects [2].
However, the problem of interest to this research is the special case where the set of “n” objects is restricted to the
case, n=1, and the rating is a selection of prescriptions, rather than a rating from an anchored scale. This situation of
interest was not found to be addressed in the literature and is depicted in Figure 2.
Figure 2: Illustration of the Inter-Rater Reliability Problem: Appraiser Consistency. In the illustrated example,
A=3 appraisers each make S=2 selections from a prescription set of size N=6 (with S<N). The three appraiser pairs
(A1-A2, A1-A3, A2-A3) allow a maximum of S=2 matches per pair, so the total number of pair-wise matches (X) ranges
from 0 to 6.
It can be shown that the total number of possible pair-wise matches, given the number of appraisers (A) and the
number of selections allowed (S), is:

$$\text{Total Number of Matches} = S\binom{A}{2} \qquad (1)$$
The challenge is to determine if the total number of pair-wise matches (X) is consistent with chance causes or not.
If the chance hypothesis can be rejected, then the appraisers are said to hold to at least a minimum threshold of
reliability. This random variable can be shown to be approximately binomially distributed [3].
$$P(X = x) = \binom{n}{x}\,\hat{p}^{\,x}\,(1-\hat{p})^{\,n-x}, \quad \text{where } n = S\binom{A}{2} \text{ and } \hat{p} = \frac{S}{N} \qquad (2)$$
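As a sketch, equations (1) and (2) can be combined into a small significance calculation. The per-trial match probability p-hat = S/N is our reading of the reconstructed equation (2) and should be checked against Walden (2007) [3]:

```python
from math import comb

def match_significance(x, A, S, N):
    """Approximate upper-tail p-value for observing x or more pair-wise
    selection matches among A appraisers, each choosing S prescriptions
    from a set of N (binomial model per equation (2))."""
    n = S * comb(A, 2)  # total possible pair-wise matches, equation (1)
    p_hat = S / N       # assumed per-trial match probability
    return sum(comb(n, k) * p_hat**k * (1 - p_hat)**(n - k)
               for k in range(x, n + 1))

# Figure 2's toy setting: A=3 appraisers, S=2 selections, N=6 prescriptions
print(match_significance(5, A=3, S=2, N=6))  # ≈ 0.0178
```

Observing 5 of the 6 possible matches in the toy setting would already be unlikely under pure chance.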
3.2 Analysis of Case Study-Panel Review
For brevity, most of this discussion is focused on determining and evaluating the measures R1 and V1. Using Figure
3, the value of R1 was determined to be 91. This means that there were 91 pair-wise “matches” from the PST (i.e.,
taxonomy of best practices) counted across all 6 Appraisers (i.e., 5 PR Members and the field Assessment Team).
Number of Pair-wise Matches Based on PST Selection, Including Panel Review and Field (Case Beta):

          PRM-1  PRM-2  PRM-3  PRM-4  PRM-5
PRM-2       4
PRM-3       6      6
PRM-4       8      3      5
PRM-5       5      5      9      6
Field       6      7      8      5      8
Matches    29     21     22     11      8      Total: 91

Figure 3: Case Beta - Unique Pair-wise Matches Based on PST Selection
These matches are generated from each of the A=6 appraisers (i.e., 5 members of the RP and 1 Field Assessment
Team) making S=14 selections from the set of N=91 PST elements. The number of selections each RP member could make
was fixed by the earlier choice of the Field Assessment Team. It should be noted that these R1=91 matches were
obtained from 210 possible matches, as per equation (1):

$$\text{Number of Possible Pair-wise Matches} = S\binom{A}{2} = 14\binom{6}{2} = 14 \times 15 = 210$$
Walden (2007) shows that the number of pair-wise matches is approximately a binomial random
variable [3]. As a result, the R1 value of 91 indicates a sufficiently high number of matches: higher than expected
under the purely-chance hypothesis (approximate p-value < 0.01). Therefore, we can say that this test case
resulted in an R1 measure of reliability beyond the minimum threshold defined by merely chance causes.
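The Case Beta totals above can be verified directly from the lower-triangular match counts in Figure 3:

```python
from math import comb

# Lower-triangular pair-wise match counts transcribed from Figure 3
# (Case Beta); appraisers are PRM-1..PRM-5 and the field Assessment Team
matches = [
    [4],              # PRM-2 vs PRM-1
    [6, 6],           # PRM-3 vs PRM-1, PRM-2
    [8, 3, 5],        # PRM-4 vs PRM-1..PRM-3
    [5, 5, 9, 6],     # PRM-5 vs PRM-1..PRM-4
    [6, 7, 8, 5, 8],  # Field vs PRM-1..PRM-5
]
R1 = sum(sum(row) for row in matches)

A, S = 6, 14
n_possible = S * comb(A, 2)  # 14 * 15 = 210 possible matches, equation (1)
print(R1, n_possible)        # → 91 210
```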
In a similar fashion, the value of V1 is determined by counting the number of matches between the consensus PST
selections of the RP members and the corresponding PST selections made by the Field Assessment Team. In this
case, the number of appraisers was 2 (i.e., A=2). This serves as a measure of validity because it is hypothesized
that the selections of the review panel, acting collectively, provide an objective and, at least to some degree,
unbiased perspective on the field selections. For this case, which is illustrated in Figure 4, the number of
pair-wise matches observed between the RP and the field Assessment Team was eight (i.e., V1=8). These 8 matches
were out of a total possible match set of 14 (determined by using equation (1) with S=14 and A=2). An approximate
significance test for V1 was constructed in a manner similar to that just described for R1. The V1=8 outcome is
significantly higher than the number of pair-wise matches that would be expected under chance causes (approximate
p-value < 0.01). Therefore, we can say that this test case resulted in a measure of validity, as defined by the RP,
beyond what would be expected if only chance causes were present.
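As a cross-check (not part of the paper's method), the chance hypothesis for V1 can also be tested exactly: if both appraisers drew their S selections uniformly at random, the overlap would follow a hypergeometric distribution.

```python
from math import comb

def hypergeom_tail(x, N, S):
    """P(overlap >= x) when two appraisers independently pick S items
    uniformly at random from N -- an exact version of the chance
    hypothesis test for V1."""
    return sum(comb(S, k) * comb(N - S, S - k)
               for k in range(x, S + 1)) / comb(N, S)

# Case Beta: N=91 PST elements, S=14 selections each, observed V1 = 8
p = hypergeom_tail(8, N=91, S=14)
print(p < 0.01)  # → True
```

Under random selection the expected overlap is only S*S/N ≈ 2.15 matches, so observing 8 is consistent with the paper's approximate p-value of less than 0.01.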
The measure V2 reflects the client's rating of each of the field Assessment Team's specific recommendations. As
shown in Figure 4, Case Beta resulted in the field Assessment Team issuing three recommendations. V2 values were
scored by the client for each of the three recommendations, in terms of both effectiveness and implementability.
The anchored score could vary from 1 to 5 (a 1 indicates the client strongly disagreed and a 5 indicates
the client strongly agreed with the recommendation). As illustrated in Figure 4, the client's V2 ratings ranged
from 3.5 to 4.0, depending upon the recommendation. This generally reflected the client's agreement that the
recommendations were both implementable and effective. The only exception, from the client's perspective, was
Recommendation 1, which received a lower rating of 3 for implementability but maintained a 4 for effectiveness.
Additional discussion indicated that this reflected the client's discomfort with how to implement this
recommendation within their environment.
Next, using a similar anchored scoring approach, the RP members' V3 ratings were slightly higher than the
client's ratings, reflecting that the panel agreed with the recommendations somewhat more strongly than the
client did. This appeared to be the case for both implementability and effectiveness. The reason for this
apparent difference deserves additional study and may yield additional insight into the validity concern.
Case Beta: Panel Consensus and Field Selections (A=2, S=14, N=91; number of matches X=8):

PST Element                            Panel  Field  Match
2.A-1 Supply Chain Partnering            X      X     Yes
2.B-3 JIT Inventory Control              X      X     Yes
2.B-5 Logistics Management               X            No
1.B-1 Reduced WIP                               X     No
1.B-2 JIT Production                     X      X     Yes
1.B-3 Process Mapping                    X      X     Yes
1.B-4 Design for Manufacturability       X            No
1.B-6 Value Engineering                  X            No
1.C-4 LT Reduction                       X      X     Yes
1.D-4 CAD and Engineering                       X     No
1.D-5 New Process Development            X            No
3.A-1 Quality Improvement Teams                 X     No
3.C-3 Cellular Manufacturing             X      X     Yes
3.D-4 MRP/ERP                                   X     No
4.A-1 Total Quality Management           X            No
4.B-1 Lean Production                    X      X     Yes
4.B-4 Time Based Management              X      X     Yes
4.B-6 Balanced Scorecard                 X            No
4.B-7 Link Manufacturing to Strategy            X     No
4.E-4 Culture Change                            X     No

Each rater scored each recommendation on two anchored scales (1 = Strongly Disagree, 5 = Strongly Agree):
Effectiveness: "The recommendation, if implemented, would have a substantially positive impact on the
manufacturing enterprise."
Implementability: "The recommendation is practical and implementable without spending excessive time and
resources."

Recommendation #1: Develop the ability to compare requirements with the capacity of key workstations. This will
enable the constraint to be identified and appropriate operational measures to be tracked, and should guide
improvement actions for increasing system capacity.

Recommendation #2: Develop an overall business plan for establishing the value of rapid lead-time capability.
This includes exploring partnerships with suppliers of key raw materials, reorganizing production operations to
facilitate flow, finding ways of streamlining pre-production operations, and rationalizing appropriate capital
investments. Of particular promise are ways to reduce design complexity (e.g., parametric CAD).

Recommendation #3: Develop a value stream map, both "as is" and "to be," for lead-time-sensitive products. This
should include the key activities and the calculation of percent "value add" time for comparison against
world-class performance. The "to be" case establishes the vision for process excellence. The mapping and
transition effort should include a broad cross section of team members.

The figure compares the client's ratings (n=1) with the Review Panel's average ratings (n=5) for each
recommendation: client overall scores ranged from 3.5 to 4.0, while panel averages were higher, ranging from
approximately 4.2 to 4.8.

Figure 4: Case Beta - PST Selection Matches (V1) and Review of Recommendations (V2, V3)
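Reading the selections off Figure 4, V1 can be recomputed directly. The assignment of unmatched elements to the panel versus the field below follows from each party making S=14 selections, and should be treated as our transcription of the figure:

```python
# Panel consensus and field PST selections transcribed from Figure 4
# (Case Beta); each party made S=14 selections from N=91 elements
panel = {"2.B-3", "2.B-5", "1.B-6", "1.B-4", "1.D-5", "3.C-3", "1.C-4",
         "4.B-6", "4.B-4", "4.B-1", "1.B-3", "1.B-2", "4.A-1", "2.A-1"}
field = {"2.B-3", "1.D-4", "4.E-4", "3.D-4", "3.C-3", "1.C-4", "4.B-7",
         "4.B-4", "4.B-1", "1.B-3", "1.B-2", "1.B-1", "3.A-1", "2.A-1"}

V1 = len(panel & field)  # matched selections between panel and field
print(V1)                # → 8
```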
4. Conclusion
This paper has defined and developed initial measurements for reliability and validity within the context of the
TBAM approach for conducting assessments. A practical approach for measuring reliability and validity is
presented, using a Case Study-Review Panel technique based on a TBAM implementation. In this case, an approximate
test of significance indicates that R1 and V1 were both higher than expected if only chance causes were present.
This initial set of R and V measures is preliminary and in many ways represents only a minimum threshold. In
addition, this research may encourage the use of TBAM as a "qualified" research instrument for conducting
practice-performance studies across SMEs. Finally, additional study is needed to determine whether the general
R and V approach implemented in this research is applicable to other assessment methodologies (e.g., MBNQA and
the Shingo Prize).
References
1. Walden, C.T., and Greenwood, A.G., 2009, "Assessing Small and Medium Manufacturing Enterprises: A
Taxonomy Based Approach," Proceedings of the 2009 American Society for Engineering Management Conference,
October 2009, Springfield, MO.
2. Heiman, G.W., 1998, Understanding Research Methods and Statistics: An Integrated Introduction for
Psychology, Houghton Mifflin Company, 62.
3. Walden, C.T., 2007, Taxonomy Based Assessment Methodology for Small and Medium Size
Manufacturers, Ph.D. Dissertation, Mississippi State University, 116-208.
4. Kathuria, R., 2000, "Competitive Priorities and Managerial Performance: A Taxonomy of Small
Manufacturers," Journal of Operations Management, Vol. 18, 638.
5. Walden, C.T., Greenwood, A.G., Babin, P.D., and Moore, J.P., 2009, "Pilot Application of a Taxonomy Based
Assessment Methodology for Small and Medium Size Manufacturing Enterprises," Proceedings of the
2009 Industrial Engineering Research Conference, May 2009, Miami, FL.